CIS275 - Chapter 5: Data Storage Flashcards
the time required to access the first byte in a read or write operation.
Access time
the speed at which data is read or written, following initial access.
Transfer rate
memory that is lost when disconnected from power.
Volatile memory
memory that is retained without power.
Non-volatile memory
the primary memory used when computer programs execute.
Main memory,
also called random-access memory (RAM),
Main memory is fast, expensive, and has limited capacity.
_____ is less expensive and higher capacity than main memory.
Flash memory,
also called solid-state drive (SSD)
Writes to flash memory are much slower than reads, and both are much slower than main memory writes and reads.
Memory used to store large amounts of data.
Magnetic disk,
also called hard-disk drive (HDD)
Magnetic disk is slower, less expensive, and higher capacity than flash memory.
Magnetic disk groups data in _____.
sectors
traditionally 512 bytes per sector but 4 kilobytes with newer disk formats.
Flash memory groups data in _____
pages
usually between 2 kilobytes and 16 kilobytes per page.
Databases and file systems use a uniform size, called a _____, when transferring data between main memory and storage media.
block
Block size is independent of storage media.
Database systems typically support block sizes ranging from 2 kilobytes to 64 kilobytes. Smaller block sizes are usually better for transactional applications, which access a few rows per query. Larger block sizes are usually better for analytic applications, which access many rows per query.
To minimize block transfers, relational databases usually store an entire row within one block, which is called _____.
row-oriented storage
Row-oriented storage performs best when row size is small relative to block size, for two reasons:
Improved query performance. When row size is small relative to block size, each block contains many rows. Queries that read and write multiple rows transfer fewer blocks, resulting in better performance.
Less wasted storage. Row-oriented storage wastes a few bytes per block, since rows do not usually fit evenly into the available space. The wasted space is less than the row size. If row size is small relative to block size, this wasted space is insignificant.
- The row size is small relative to the block size, so many rows are transferred per block.
- Only 16 bytes go unused, so wasted space is insignificant.
- The row size is large relative to block size, so fewer rows are transferred per block.
- One kilobyte goes unused, so wasted space is significant.
- The large column allows only two rows to be transferred per block and wastes significant space.
- Large columns are replaced by a link and are stored in a separate area.
- More rows fit per block, and less space is wasted. Queries that do not access the large column are faster.
In column-oriented storage, also called columnar storage, each block stores values for a single column only.
column-oriented storage,
also called columnar storage
Column-oriented storage benefits analytic applications in several ways:
Faster data access. More column values are transferred per block, reducing time to access storage media.
Better data compression. Databases often apply data compression algorithms when storing data. Data compression is usually more effective when all values have the same data type. As a result, more values are stored per block, which reduces storage and access time.
- With column-oriented storage, a 4 kilobyte block contains roughly 500 8-byte column values.
- Computing the average of all incomes reads fewer blocks than row-oriented storage.
- Different columns occupy separate blocks.
- Selecting multiple columns reads multiple blocks and is slower than row-oriented storage.
a scheme for organizing rows in blocks on storage media.
table structure
Databases commonly support four alternative table structures:
Heap table
Sorted table
Hash table
Table cluster
In a _____, no order is imposed on rows.
heap table
- In a heap table, the database maintains a pointer to free space, which indicates the location of the next insert.
- When a row is inserted, the pointer moves to the next available space.
- When a row is deleted, the pointer moves to the deleted space. The deleted space is linked to free space at the end of the table.
- As more rows are deleted, free space is linked together in a list.
- Inserts go to the first available space in the list.
In a _____, the database designer identifies a _____ that determines physical row order.
sorted table
sort column
- In a sorted table, rows are sorted on a sort column.
- Inserting a new row requires moving all subsequent rows, which is inefficient.
- To avoid inefficient inserts, the sort column order is maintained with links, producing a linked list.
- The sort order is maintained during an insert by changing two links rather than moving rows.
In a _____, rows are assigned to buckets.
hash table
A _____ is a block or group of blocks containing rows.
bucket
Initially, each bucket has one block. As a table grows, some buckets eventually fill up with rows, and the database allocates additional blocks. New blocks are linked to the initial block, and the bucket becomes a chain of linked blocks.
The bucket containing each row is determined by a hash function and a hash key.
The _____ is a column or group of columns, usually the primary key.
hash key
The _____ computes the bucket containing the row from the hash key.
hash function
Hash functions are designed to scramble row locations and evenly distribute rows across blocks. The modulo function is a simple hash function with four steps:
Convert the hash key by interpreting the key’s bits as an integer value.
Divide the integer by the number of buckets.
Interpret the division remainder as the bucket number.
Convert the bucket number to the physical address of the block containing the row.
- FlightNumber is the hash key for the Flight table.
- The hash function is modulo 10, resulting in 10 buckets. Rows are assigned to buckets based on the last digit of the hash key.
- If a bucket fills with rows, additional blocks are allocated and linked to the first block.
A _____ automatically allocates more blocks to the table, creates additional buckets, and distributes rows across all buckets.
dynamic hash function
Expected:
free space A
free space B → free space A
In a heap table, the database maintains a pointer to the first free space in the table. In the above table, free space A is the only free space, so the pointer points to free space A.
Row 659 is deleted, so the free space pointer moves to free space B. Free space B is linked to free space A, the next free space.
Expected:
free space B,
free space A,
free space A,
new block
When row 1961 is deleted, the free space pointer moves to free space B. Free space B is linked to free space A, the next free space.
Because the free space pointer now points to free space B, the first insert is stored in free space B. Then, the free space pointer is reset to point to free space A, so the second insert is stored in free space A. There is still room for another row in free space A, so the third insert is also stored in free space A. Since the block is now full, the free space pointer is set to point to the beginning of the new block, where the fourth insert is stored.
Expected:
23 → 24, 24 → 27
No sort columns
A pointer links a row to the row with the next higher key. Since 23 < 24 < 27, pointers 23 → 24 and 24 → 27 are added to table A.
No columns in table B are sorted, so table B has no sort columns.
Expected:
14 → 20, 20 → 21
Name, State, (Name, State)
A pointer links a row to the row with the next higher key. Since 14 < 20 < 21, pointers 14 → 20 and 20 → 21 are added to table A.
In table B, the Name and State columns are sorted, so either column is a valid sort column.
The sort column can be a group of columns. The Name and State columns are sorted, so the Name and State columns together constitute a valid sort column, regardless of which column contains the primary key values.
Expected:
22 → 24, 24 → 25
No sort columns
A pointer links a row to the row with the next higher key. Since 22 < 24 < 25, pointers 22 → 24 and 24 → 25 are added to table A.
No columns in table B are sorted, so table B has no sort columns.
Consider a hash table with 5 buckets and a modulo function. Which bucket does each hash key map to?
Expected:
0, 3, 1, 4, 4
5 buckets exist, so the hash function is modulo 5.
755 modulo 5 is 0, so bucket 0.
1098 modulo 5 is 3, so bucket 3.
446 modulo 5 is 1, so bucket 1.
1204 modulo 5 is 4, so bucket 4.
1499 modulo 5 is 4, so bucket 4.
Consider a hash table with 5 buckets and a modulo function. Which bucket does each hash key map to?
Expected:
0, 2, 1, 4, 2
5 buckets exist, so the hash function is modulo 5.
1425 modulo 5 is 0, so bucket 0.
302 modulo 5 is 2, so bucket 2.
1321 modulo 5 is 1, so bucket 1.
914 modulo 5 is 4, so bucket 4.
177 modulo 5 is 2, so bucket 2.
Consider a hash table with 5 buckets and a modulo function. Which bucket does each hash key map to?
Expected: 4, 2, 0, 3, 1
5 buckets exist, so the hash function is modulo 5.
184 modulo 5 is 4, so bucket 4.
582 modulo 5 is 2, so bucket 2.
355 modulo 5 is 0, so bucket 0.
433 modulo 5 is 3, so bucket 3.
256 modulo 5 is 1, so bucket 1.
Expected: 231 blocks, 640 rows
(5,600 bytes/block) / (70 bytes/row) = 80 rows/block
So 18,465 rows would need:
18,465 rows / (80 rows/block) = 230 blocks with remainder 65, so 231 blocks.
Each bucket has 1 block, and 1 block has 80 rows. So 8 buckets can have up to 80 × 8 = 640 rows.
Expected: 421 blocks, 630 rows
(6,300 bytes/block) / (70 bytes/row) = 90 rows/block
So 37,825 rows would need:
37,825 rows / (90 rows/block) = 420 blocks with remainder 25, so 421 blocks.
Each bucket has 1 block, and 1 block has 90 rows. So 7 buckets can have up to 90 × 7 = 630 rows.
_____, interleave rows of two or more tables in the same storage area.
Table clusters, also called multi-tables
Table clusters are optimal when joining interleaved tables on the cluster key, since physical row location is the same as output order. Table clusters perform poorly for many other queries:
Join on columns other than cluster key. In a join on a column that is not the cluster key, physical row location is not the same as output order, so the join is slow.
Read multiple rows of a single table. Table clusters spread each table across more blocks than other structures. Queries that read multiple rows may access more blocks.
Update clustering key. Rows may move to different blocks when the clustering key changes.
Table clusters are not optimal for many queries and therefore are not commonly used.
- Airline and Flight tables both contain an AirlineName column.
- AirlineName is the cluster key for the table cluster.
- Rows of Airline and Flight are interleaved on storage media.
Table clusters have a _____, a column that is available in all interleaved tables.
cluster key
The cluster key determines the order in which rows are interleaved. Rows with the same cluster key value are stored together. Usually the cluster key is the primary key of one table and the corresponding foreign key of another, as in the animation below.
a file containing column values, along with pointers to rows containing the column value.
single-level index
- An index is on the DepartureAirport column of the Flight table.
- Each column value appears in the index in sorted order.
- Each column value has a pointer to the row containing the value.
- If the column values are not unique, the index may have multiple entries for some values.
- Alternatively, an index on a non-unique column may have multiple pointers for some values.
In a _____, each index entry is a composite of values from all indexed columns.
multi-column index
In all other respects, multi-column indexes behave exactly like indexes on a single column.
- FlightNumber is the Flight table’s primary key. FlightNumber is sorted but is not indexed.
- The SELECT statement has no WHERE clause, so the hit ratio is high. The database performs a table scan and reads all table blocks.
- The WHERE clause does not contain an indexed column, so hit ratio is high. The database must perform a table scan.
- Only the first two blocks are read because FlightNumber 803 is found in the second table block.
- Since the hit ratio is low and the WHERE clause contains an indexed column, the database performs an index scan.
a database operation that reads table blocks directly, without accessing an index.
table scan
a database operation that reads index blocks sequentially, in order to locate the needed table blocks.
index scan
the percentage of table rows selected by a query.
Hit ratio, also called filter factor or selectivity
When a SELECT query is executed, the database examines the WHERE clause and estimates hit ratio. If hit ratio is high, the database performs a table scan. If hit ratio is low, the query needs only a few table blocks, so a table scan would be inefficient. Instead, the database:
Looks for an indexed column in the WHERE clause.
Scans the index.
Finds values that match the WHERE clause.
Reads the corresponding table blocks.
If the WHERE clause does not contain an indexed column, the database must perform a table scan.
In a _____, the database repeatedly splits the index in two until it finds the entry containing the search value:
binary search
- The database first compares the search value to an entry in the middle of the index.
- If the search value is less than the entry value, the search value is in the first half of the index. If not, the search value is in the second half.
- The database now compares the search value to the entry in the middle of the selected half, to narrow the search to one quarter of the index.
- The database continues in this manner until it finds the index block containing the search value.
- The DepartureAirport index occupies nine blocks. The SELECT query looks for the row containing DepartureAirport = ‘DUB’.
- A binary search first examines the middle, block 4, of the sorted index.
- Since ‘DUB’ precedes ‘ORD’, the binary search resets last to block 3 and middle to block 1.
- Since ‘DUB’ follows ‘BWI’, search resets first and middle to block 2.
- The binary search reads block 2 and finds ‘DUB’. The database follows DUB’s pointer to the table block containing the row.