CIS275 - Chapter 5: Data Storage Flashcards

Question

The _____ is a column or group of columns, usually the primary key.

Answer 1

hash function ## Footnote Hash functions are designed to scramble row locations and evenly distribute rows across blocks. The modulo function is a simple hash function with four steps: Convert the hash key by interpreting the key's bits as an integer value. Divide the integer by the number of buckets. Interpret the division remainder as the bucket number. Convert the bucket number to the physical address of the block containing the row. 1. FlightNumber is the hash key for the Flight table. 2. The hash function is modulo 10, resulting in 10 buckets. Rows are assigned to buckets based on the last digit of the hash key. 3. If a bucket fills with rows, additional blocks are allocated and linked to the first block.

Answer 2

dynamic hash function

Answer 3

Expected: free space A free space B → free space A In a heap table, the database maintains a pointer to the first free space in the table. In the above table, free space A is the only free space, so the pointer points to free space A. Row 659 is deleted, so the free space pointer moves to free space B. Free space B is linked to free space A, the next free space.

Answer 4

Expected: free space B, free space A, free space A, new block When row 1961 is deleted, the free space pointer moves to free space B. Free space B is linked to free space A, the next free space. Because the free space pointer now points to free space B, the first insert is stored in free space B. Then, the free space pointer is reset to point to free space A, so the second insert is stored in free space A. There is still room for another row in free space A, so the third insert is also stored in free space A. Since the block is now full, the free space pointer is set to point to the beginning of the new block, where the fourth insert is stored.

Answer 5

Expected: 23 → 24, 24 → 27 No sort columns A pointer links a row to the row with the next higher key. Since 23 \< 24 \< 27, pointers 23 → 24 and 24 → 27 are added to table A. No columns in table B are sorted, so table B has no sort columns.

Answer 6

Expected: 14 → 20, 20 → 21 Name, State, (Name, State) A pointer links a row to the row with the next higher key. Since 14 \< 20 \< 21, pointers 14 → 20 and 20 → 21 are added to table A. In table B, the Name and State columns are sorted, so either column is a valid sort column. The sort column can be a group of columns. The Name and State columns are sorted, so the Name and State columns together constitute a valid sort column, regardless of which column contains the primary key values.

Answer 7

Expected: 22 → 24, 24 → 25 No sort columns A pointer links a row to the row with the next higher key. Since 22 \< 24 \< 25, pointers 22 → 24 and 24 → 25 are added to table A. No columns in table B are sorted, so table B has no sort columns.

Answer 8

Expected: 0, 3, 1, 4, 4 5 buckets exist, so the hash function is modulo 5. 755 modulo 5 is 0, so bucket 0. 1098 modulo 5 is 3, so bucket 3. 446 modulo 5 is 1, so bucket 1. 1204 modulo 5 is 4, so bucket 4. 1499 modulo 5 is 4, so bucket 4.

Answer 9

Expected: 0, 2, 1, 4, 2 5 buckets exist, so the hash function is modulo 5. 1425 modulo 5 is 0, so bucket 0. 302 modulo 5 is 2, so bucket 2. 1321 modulo 5 is 1, so bucket 1. 914 modulo 5 is 4, so bucket 4. 177 modulo 5 is 2, so bucket 2.

Answer 10

Expected: 4, 2, 0, 3, 1 5 buckets exist, so the hash function is modulo 5. 184 modulo 5 is 4, so bucket 4. 582 modulo 5 is 2, so bucket 2. 355 modulo 5 is 0, so bucket 0. 433 modulo 5 is 3, so bucket 3. 256 modulo 5 is 1, so bucket 1.

Answer 11

Expected: 231 blocks, 640 rows (5,600 bytes/block) / (70 bytes/row) = 80 rows/block So 18,465 rows would need: 18,465 rows / (80 rows/block) = 230 blocks with remainder 65, so 231 blocks. Each bucket has 1 block, and 1 block has 80 rows. So 8 buckets can have up to 80 × 8 = 640 rows.

Answer 12

Expected: 421 blocks, 630 rows (6,300 bytes/block) / (70 bytes/row) = 90 rows/block So 37,825 rows would need: 37,825 rows / (90 rows/block) = 420 blocks with remainder 25, so 421 blocks. Each bucket has 1 block, and 1 block has 90 rows. So 7 buckets can have up to 90 × 7 = 630 rows.

Answer 13

Table clusters, also called multi-tables Table clusters are optimal when joining interleaved tables on the cluster key, since physical row location is the same as output order. Table clusters perform poorly for many other queries: Join on columns other than cluster key. In a join on a column that is not the cluster key, physical row location is not the same as output order, so the join is slow. Read multiple rows of a single table. Table clusters spread each table across more blocks than other structures. Queries that read multiple rows may access more blocks. Update clustering key. Rows may move to different blocks when the clustering key changes. Table clusters are not optimal for many queries and therefore are not commonly used. 1. Airline and Flight tables both contain an AirlineName column. 2. AirlineName is the cluster key for the table cluster. 3. Rows of Airline and Flight are interleaved on storage media.

Answer 14

cluster key The cluster key determines the order in which rows are interleaved. Rows with the same cluster key value are stored together. Usually the cluster key is the primary key of one table and the corresponding foreign key of another, as in the animation below.

Answer 15

single-level index ## Footnote 1. An index is on the DepartureAirport column of the Flight table. 2. Each column value appears in the index in sorted order. 3. Each column value has a pointer to the row containing the value. 4. If the column values are not unique, the index may have multiple entries for some values. 5. Alternatively, an index on a non-unique column may have multiple pointers for some values.

Answer 16

multi-column index In all other respects, multi-column indexes behave exactly like indexes on a single column. 1. FlightNumber is the Flight table's primary key. FlightNumber is sorted but is not indexed. 2. The SELECT statement has no WHERE clause, so the hit ratio is high. The database performs a table scan and reads all table blocks. 3. The WHERE clause does not contain an indexed column, so hit ratio is high. The database must perform a table scan. 4. Only the first two blocks are read because FlightNumber 803 is found in the second table block. 5. Since the hit ratio is low and the WHERE clause contains an indexed column, the database performs an index scan.

Answer 17

table scan

Answer 18

index scan

Answer 19

**Hit ratio**, also called **filter factor** or **selectivity** When a SELECT query is executed, the database examines the WHERE clause and estimates hit ratio. If hit ratio is high, the database performs a table scan. If hit ratio is low, the query needs only a few table blocks, so a table scan would be inefficient. Instead, the database: Looks for an indexed column in the WHERE clause. Scans the index. Finds values that match the WHERE clause. Reads the corresponding table blocks. If the WHERE clause does not contain an indexed column, the database must perform a table scan.

Answer 20

binary search ## Footnote 1. The database first compares the search value to an entry in the middle of the index. 2. If the search value is less than the entry value, the search value is in the first half of the index. If not, the search value is in the second half. 3. The database now compares the search value to the entry in the middle of the selected half, to narrow the search to one quarter of the index. 4. The database continues in this manner until it finds the index block containing the search value. 1. The DepartureAirport index occupies nine blocks. The SELECT query looks for the row containing DepartureAirport = 'DUB'. 2. A binary search first examines the middle, block 4, of the sorted index. 3. Since 'DUB' precedes 'ORD', the binary search resets last to block 3 and middle to block 1. 4. Since 'DUB' follows 'BWI', search resets first and middle to block 2. 5. The binary search reads block 2 and finds 'DUB'. The database follows DUB's pointer to the table block containing the row.

Answer 21

primary index

Answer 22

clustering index

Answer 23

secondary index

Answer 24

dense index

Answer 25

sparse index The table is not sorted on DepartureAirport column. The index on DepartureAirport must be dense. The table is sorted on DepartureAirport column. The index on the sort column may be sparse. 'LAX' does not appear in the index, but the block containing 'LAX' can be determined from sort order.

Answer 26

1. A new row is inserted into the table with a dense sorted index. A new index entry 'PIT' must be inserted in sort order. 2. Since the index block is full, the block splits. Half of the existing entries are moved to the new block. 3. The block split creates space for the new index entry.

Answer 27

Expected: (k) Index entry (k) contains the value SEA with a pointer to the row that refers to Delta Airlines flight 1222.

Answer 28

Expected: between the HKG and LAX entries Since the single-level index is sorted, an entry for JFK goes between the HKG and LAX entries. The database must create space for the new entry.

Answer 29

Expected: between the LAX and ORD entries Since the single-level index is sorted, an entry for NRT goes between the LAX and ORD entries. The database must create space for the new entry.

Answer 30

Expected: 251, 9000 The database reads at most 250 index blocks, plus one table block containing the row for flight 1695. With no index, the database performs a table scan and reads at most all 9000 table blocks.

Answer 31

Expected: 201, 5000 The database reads at most 200 index blocks, plus one table block containing the row for flight 1062. With no index, the database performs a table scan and reads at most all 5000 table blocks.

Answer 32

Expected: 225 seconds, 4500000 blocks (900,000,000 rows) × (500 bytes/row) × (1 gigabyte/1,000,000,000 bytes) / (2 gigabytes/sec) = 225 seconds Each index block contains 8 kb × (1000 bytes/kb) / 40 bytes/entry = 200 entries. The index has (900,000,000 entries) / (200 entries/block) = 4,500,000 blocks. A binary search requires log(2)4,500,000 blocks.

Answer 33

Expected: 640 seconds, 20000000 blocks (800,000,000 rows) × (400 bytes/row) × (1 gigabyte/1,000,000,000 bytes) / (0.5 gigabytes/sec) = 640 seconds Each index block contains 4 kb × (1000 bytes/kb) / 100 bytes/entry = 40 entries. The index has (800,000,000 entries) / (40 entries/block) = 20,000,000 blocks. A binary search requires log(2) 20,000,000 blocks.

Answer 34

Expected: 2000 seconds, 5000000 blocks (400,000,000 rows) × (500 bytes/row) × (1 gigabyte/1,000,000,000 bytes) / (0.1 gigabytes/sec) = 2000 seconds Each index block contains 8 kb × (1000 bytes/kb) / 100 bytes/entry = 80 entries. The index has (400,000,000 entries) / (80 entries/block) = 5,000,000 blocks. A binary search requires log(2) 5,000,000 blocks.

Answer 35

Expected: 300 seconds, 375000 blocks (600,000,000 rows) × (500 bytes/row) × (1 gigabyte/1,000,000,000 bytes) / (1 gigabyte/sec) = 300 seconds Each index block contains 16 kb × (1000 bytes/kb) / 10 bytes/entry = 1600 entries. The index has (600,000,000 entries) / (1600 entries/block) = 375,000 blocks. A binary search requires log(2) 375,000 blocks.

Answer 36

multi-level index The bottom level of the hierarchy is a sorted single-level index. The bottom level is sparse for primary and clustering indexes, or dense for secondary indexes. 1. Flight table is sorted on non-unique column DepartureAirport. 2. The bottom level is a sparse index on DepartureAirport. If the table is not sorted on the index column, the bottom level must be dense. 3. The next higher level is a sparse index to the lower level.

Answer 37

fan-out The number of levels in a multi-level index can be computed from fan-out, number of rows, and rows per block: For a dense index, number of levels = logfan-out (number of rows) For a sparse index, number of levels = logfan-out (number of rows / rows per block)

Answer 38

1. A query searches for rows containing 'FRA'. 2. The database reads the index's top level. 'FRA' is between 'ATL' and 'MIN'. 3. The database follows the 'ATL' pointer and reads the bottom-level block. 'FRA' is between 'FLL' and 'GRB'. 4. The database follows the 'FLL' pointer, reads the table block, and locates the row containing 'FRA'.

Answer 39

balanced ## Footnote Imbalanced indexes are undesirable, since processing time is unpredictable. If a query follows a long branch, the query is relatively slow. For this reason, inserts are managed to maintain balanced indexes: 1. In a dense index, inserts always generate new bottom-level index entries. In a sparse index, inserts generate new bottom-level index entries when table blocks split. 2. If the new index entry goes in a full index block, the block splits. Half of the rows move to the new block, creating space for the entry. 3. The new block in the bottom level generates a new index entry the next level up. If block in the next level up is full, the block splits and the process repeats. 4. If blocks are full at all index levels, the split propagates to the top level. In this case, the top-level block splits and a new level is created. 1. The Flight table is sorted on DepartureAirport. A row is inserted with DepartureAirport 'FSK'. 2. The table block is full and splits to create space for the new row. 3. The new table block generates a new entry in the index's bottom level. 4. The bottom-level index block is full and splits. 5. The new bottom-level block generates a new entry for the index's top level. 6. The top-level index is full and splits. A new top level is created containing entries for two blocks in the middle level.

Answer 40

imbalanced Imbalanced indexes are undesirable, since processing time is unpredictable. If a query follows a long branch, the query is relatively slow. For this reason, inserts are managed to maintain balanced indexes: 1. In a dense index, inserts always generate new bottom-level index entries. In a sparse index, inserts generate new bottom-level index entries when table blocks split. 2. If the new index entry goes in a full index block, the block splits. Half of the rows move to the new block, creating space for the entry. 3. The new block in the bottom level generates a new index entry the next level up. If block in the next level up is full, the block splits and the process repeats. 4. If blocks are full at all index levels, the split propagates to the top level. In this case, the top-level block splits and a new level is created.

Answer 41

B+tree The B+tree structure has two benefits: 1. The bottom level of a B+tree is a single-level index and can be scanned or searched. 2. In a B-tree, inserts, updates, and deletes may cause a table pointer to change levels, which is hard to implement. B+trees do not have this problem, since table pointers are always in the bottom level. Because of the advantages above, multi-level indexes are usually implemented as B+trees.

Answer 42

B-tree In a B-tree, inserts, updates, and deletes may cause a table pointer to change levels, which is hard to implement. B+trees do not have this problem, since table pointers are always in the bottom level. 1. The bottom level of a B-tree is like the bottom level of a B+tree. 2. Higher levels of a B-tree alternate index block pointers and table block pointers. 3. Entries that appear in higher levels are not repeated in the bottom level.

Answer 43

hash index A hash index is similar to a hash table, described in another section. However, a hash index stores index entries in each bucket, while a hash table stores table rows in each bucket. Sometimes the term hash index is used to mean a hash key, but the two terms are different. A hash index is an index that is structured using a hash function. A hash key is a column that determines the physical location of rows in a hash table. 1. The table is clustered on DepartureAirportCode. 2. The hash index is on FlightNumber. The hash function is modulo 10. 3. Index entries for flight numbers ending in 0 go in bucket 0. 4. Index entries for flight numbers ending in 1 go in bucket 1. 5. Since the hash function is modulo 10, the hash index has 10 buckets.

Answer 44

bucket Initially, each bucket has one block. As an index grows, some buckets eventually fill up, and additional blocks are allocated and linked to the initial block.

Answer 45

hash function To locate a row containing a column value, the database: 1. Applies the hash function to the column value to compute a bucket number. 2. Reads the index blocks for the bucket number. 3. Finds the index entry for the column value and reads the table block pointer. 4. Reads the table block containing the row.

Answer 46

bitmap index ## Footnote 1. Each index row corresponds to a unique table row. If the table's primary key is an integer, the index row number might be the primary key value. Alternatively, the index row number might be an internal table row number, maintained by the database. 2. Each index column corresponds to a distinct table value. Ex: If the index is on AirportCode, each index column corresponds to a different three-letter airport code. The mapping of index column numbers to table values is computed with a function or stored in an internal 'lookup' table. 1. The database maintains row numbers internally. 2. A bitmap index is a grid of bits. Bitmap index rows correspond to internal table row numbers. 3. Bitmap index columns correspond to table values. 4. Table row 0 contains ATL, so the corresponding index bit is 1. 5. Table row 1 contains DUB, so the corresponding index bit is 1. 6. Each 1 in the bitmap index indicates the corresponding value appears in the table row. 7. Each 0 in the bitmap index indicates the corresponding value does not appear in the table row.

Answer 47

physical index

Answer 48

logical index Logical indexes are always secondary indexes and require a separate primary index on the same table. To locate a row containing a column value, the database: 1. Looks up the column value in the logical index to find the primary key value. 2. Looks up the primary key value in the primary index to find the table block pointer. 3. Reads the table block containing the row. 1. The table is sorted on FlightNumber, the primary key. This example assumes FlightNumber is unique across all airlines. 2. The table has primary index on FlightNumber. The primary index is physical and sparse. 3. The table has secondary index on DepartureAirportCode. The secondary index is logical and dense. 4. To find the row containing EUG, the database finds the primary key 200 in the secondary index, then finds the block pointer in the primary index leading to the block containing 200.

Answer 49

function index Index entries contain the result of the function applied to column values, rather than the column values. Ex: Column values are stored as decimal numbers between 0 and 1, but users specify percentages as integers between 0 and 100 in queries. The database designer specifies a function index that multiplies column values by 100 and converts the result to an integer. The index contains integers between 0 and 100, so the database can use the function index to process queries. 1. The table contains times in AM/PM format. 2. Queries specify time in 24-hour format. 3. The function index converts AM/PM format to 24-hour format. 4. The function index format is consistent with the WHERE clause, so the database can use the index.

Answer 50

Expected: Flight numbers 735, 905, 255, 885, and 995 The hash function is modulo 10, so index entries for flight numbers ending in 5 go in bucket 5. 735, 905, 255, 885, and 995 are the flight numbers that end in 5.

Answer 51

6:00 and 7:35

Answer 52

tablespace The CREATE TABLESPACE statement names a tablespace and assigns the tablespace to a file. The CREATE TABLE statement assigns a table to a tablespace. Indexes are stored in the same tablespace as the indexed table.

Answer 53

1. CodeTablespace contains small, read-only code tables and is stored on a flash drive for fast reads. 2. The State table is assigned to CodeTablespace. 3. Additional tables are assigned to the same tablespace to reduce overhead.

Answer 54

partition One table has many partitions that do not overlap and, together, contain all table data.

Answer 55

horizontal partition MySQL and most relational databases partition tables horizontally, not vertically.

Answer 56

vertical partition MySQL and most relational databases partition tables horizontally, not vertically.

Answer 57

1. The Employee table is partitioned on hire date. 2. Each partition corresponds to one year. 3. Each partition is stored in a separate tablespace.

Answer 58

partition expression The partition expression may be simple, such as the value of a single partition column, or a complex expression based on several partition columns.

Answer 59

range partition The VALUES LESS THAN keywords specify the upper bound of each range. The MAXVALUE keyword represents the highest column value, and VALUES LESS THAN MAXVALUE specifies the highest range. Each partition is explicitly named by the database administrator. 1. Order table has columns for order number, date, product code, and amount of order. 2. Range partition assigns partitions based on OrderDate year. 3. Rows are assigned to one of the three partitions based on year.

Answer 60

list partition Like a range partition, each partition is explicitly named. 1. Order table has columns for order number, date, product code, and amount of order. 2. List partition assigns partitions based on specific ProductCodes. 3. Each partition contains rows with specific product codes.

Answer 61

hash partition The database administrator specifies the number of partitions, N, and partitions are automatically named p0 through p(N-1). The partition number for each row is computed as: (partition expression value) modulo N.

Answer 62

key partition ## Footnote 1. Order table has columns for order number, date, product code, and amount of order. 2. Range partition assigns partitions based on OrderDate year. 3. Rows are assigned to one of the three partitions based on year. 4. List partition assigns partitions based on specific ProductCodes. 5. Each partition contains rows with specific product codes. 6. Hash partition expression is HASH(OrderNumber) with ten partitions. 7. Each row is assigned to the partition OrderNumber modulo 10, so assigned partitions are based on the order number's last digit.

Answer 63

Logical design ## Footnote 1. Logical design specifies tables and columns. 2. Logical design also specifies primary and foreign keys.

Answer 64

Physical design ## Footnote Physical design affects query performance but never affects query results. 1. Physical design specifies table structure. Flight table is sorted on DepartureAirportCode. 2. Physical design also specifies indexes and index types.

Answer 65

storage engine or storage manager Storage engines support different index and table structures, so physical design is dependent on a specific storage engine.

Answer 66

CREATE INDEX Most indexes specify just one column, but a composite index specifies multiple columns.

Answer 67

DROP INDEX

Answer 68

SHOW INDEX SHOW INDEX generates a result table with one row for each column of each index. A multi-column index has multiple rows in the result table.

Answer 69

1. The Address table has primary key AddressID and foreign key CityID. 2. The CREATE INDEX statement creates a secondary index on PostalCode. 3. The SHOW INDEX statement displays all indexes on the Address table. Some columns of the result table have been omitted for clarity. 4. MySQL created indexes on the primary and foreign keys automatically when Address was created. AddressID has 603 distinct index values, and CityID has 99 distinct index values. 5. PostalCodeIndex was created with the CREATE INDEX statement. PostalCodeIndex is an index on a non-unique column with 597 distinct values.

Answer 70

EXPLAIN statement EXPLAIN syntax is simple and uniform in most databases: EXPLAIN statement; The statement can be any SELECT, INSERT, UPDATE, or DELETE statement. Although EXPLAIN syntax is standard, the result table varies significantly by database. In MySQL with InnoDB, the result table has one row for each table in the statement. If the statement contains multiple queries, such as a main query and a subquery, the result table has one row for each table in each query.

Answer 71

1. Address table has 603 rows. City table has 57 rows. 2. The SELECT statement joins Address and City tables. 3. EXPLAIN generates a results table with one row for each table in the SELECT statement. 4. The Address table is accessed via table scan. Of 603 rows read, 33% are selected. 5. The City table is accessed via primary index. One City row is read and selected for each Address row.

Answer 72

1. Create initial physical design. Create a primary index on primary keys and a secondary index on foreign keys. In MySQL with InnoDB, these indexes are created automatically for all tables. In other databases, this step is necessary for tables larger than roughly 100 kilobytes, but can be omitted for smaller tables. 2. Identify slow queries. The MySQL slow query log is a file that records all long-running queries submitted to the database. Identify slow queries by inspecting the log. Most other relational databases have similar query logs. 3. EXPLAIN slow queries. Run EXPLAIN on each slow query to assess the effectiveness of indexes. A high value for rows and a low value for filtered indicates either a table scan or an ineffective index. 4. Create and drop indexes based on the EXPLAIN result table. Consider creating an index when the rows value is high and the filtered value is low. Consider dropping indexes that are never used. 5. Partition large tables. If some queries are still slow after indexes are created, consider partitions. Partition when slow queries access a small subset of rows of a large table. The partition column should appear in the WHERE clause of slow queries. Often, a range partition is best.

Answer 73

1. The SELECT statement appears in the MySQL slow query log. 2. The database administrator runs EXPLAIN on the SELECT query. 3. The EXPLAIN result shows no index was used, many rows were read, and few rows were selected. 4. To speed up the SELECT query, the database administrator creates an index on the CityName column. 5. Running EXPLAIN again shows the query uses CityNameIndex, reads only one row, and executes faster.

CIS275 - Chapter 5: Data Storage Flashcards

(125 cards)