Indexing Flashcards

1
Q

What are the two methods for reducing access costs?

A

<ol>
<li>Ordered indexing</li>
<li>Hash indexing</li>
</ol>

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

Describe the basic overview of indices

A

Speeds up access to desired data by creating an index file consisting of records containing a search key and a pointer

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

What is the search key?

A

Attribute or set of attributes used to look up records in a file

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

What are the factors in index design?

A
<ul>
<li>Access types (specific vs range)</li>
<li>Access time</li>
<li>Insertion time</li>
<li>Deletion time</li>
<li>Space taken by the index</li>
</ul>
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

What is the primary index?

A

The index whose search key specifies the order of records in the data file

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

What is a primary index also knowna as

A

Clustering index

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

Describe dense primary indices

A

Index record appears for every search key value in the file

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

Describe sparse primary indices

A

A sparse index has index records for only some of the key values

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

How are records found with sparse primary indices?

A

Choose the index record with the greatest value <= search key and scan sequentially from that record

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

What is a benefit of sparse indices?

A

Takes less space and requires less maintenance, can insert without much bother

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

What is a benefit of dense indices?

A

Gives faster lookups, can go directly to the record

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

If the primary index doesn’t fit in memory, how can sparse indices be used?

A

Treat primary index on a disk as a sequential file and construct a sparse index on it

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

How is insertion done for dense indices?

A

If the search key value does not appear in the index, insert it

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

How is insertion done for sparse indices?

A

No changes need to be made unless a new block is created, and if new block created then first search key value appearing in the block is inserted into the index

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

How does deletion happen for dense indices?

A

If only one record at that search key then delete, otherwise update the pointer to point towards the values that need to continue to be stored only

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

How does deletion happen for sparse indices if the deleted value was the only record in the file with it’s search key value?

A

If an entry for the next search key value already exists in the index, delete the index entry for the deleted record’s search key
Otherwise
Update the index entry to point to the first record with the next search key

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
17
Q

How does deletion happen for sparse indices if the deleted value was not the only record in the file with it’s search key value?

A

Update the index entry to point to the next record with the same search key value

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
18
Q

Where do pointers in secondary index point to?

A

Buckets

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
19
Q

What are the buckets in terms of secondary index referencing?

A

Grouping of pointers to records of the same search key

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
20
Q

Why do secondary indices need to be dense?

A

File is not stored in the search key (secondary index) order

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
21
Q

What is the main disadvantage of index sequential file organisation?

A

Performance degrades over time, periodic reorganisation may be necessary to maintain performance

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
22
Q

What is the advantage of a B+ tree?

A

Automatically reorganised with small, local changes

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
23
Q

What are the disadvantages of B+ trees?

A

Insertion and deletions are costlier, space overhead

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
24
Q

Describe the length of paths from root to leaf in a B+ tree

A

All paths from root to leaf are of the same length

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
25
Q

What is the range of children that an internal node can have in a B+ tree?

A

Between the ceiling of n/2 and n children

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
26
Q

How many values can a leaf node have between?

A

Ceiling of (n-1)/2 and n-1 values

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
27
Q

If a root is not a leaf how many children must it have?

A

At least 2

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
28
Q

What do the pointers point to in a non leaf node?

A

Children

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
29
Q

What do the pointers point to in a leaf node?

A

Records or pointer to next leaf node

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
30
Q

What do non leaf nodes form in a B+ tree?

A

A multi level sparse index

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
31
Q

What order of magnitude is the number of levels in a B+ tree?

A

Logarithmic

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
32
Q

What is the maximum path length in lookup in a B+ tree?

A

Floor of logn/2(K)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
33
Q

What are the steps for insertion in B+ tree if search key is not present?

A

<ol><li>Add the record to main file and create a bucket if necessary</li><li>If there is room in relevant leaf node, add new key value,pointer pair</li><li>Otherwise, split the node and add the new key value pointer pair</li></ol>

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
34
Q

What are the two strategies to bypass the write inefficiency of B+ trees?

A

<ol>
<li>Sort entries first and insert in sorted order</li>
<li>Sort entries first and create tree layer by layer, starting with leaf level</li>
</ol>

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
35
Q

Describe the steps in Bottom up B+ tree construction

A

<ol>
<li>Sort the values</li>
<li>Create the linked list representing the leaf nodes</li>
<li>Create the next level, left to right</li>
</ol>

36
Q

What are the two approaches to reducing the cost of writes?

A

<ul>
<li>Log structured merge tree</li>
<li>Buffer tree</li>
</ul>

37
Q

Describe at a high level how log structured merge trees work

A

Use bottom up construction

38
Q

Give a high level description of buffer trees

A

Insert multiple ordered indices at once

39
Q

Give a detailed description about how log structured merge trees work

A

<ul>
<li>Compromises multiple LSM trees of increasing size with at least L0 fitting in memory</li>
</ul>

40
Q

How does LSM insert work?

A

Values are inserted into L0 until L0 is full. Records are then moved onto L1 and into the disk, with bottom up construction done

41
Q

In what use case is LSM useful?

A

If data permenantly matures over time, no additional changes to come back and make

42
Q

How does LSM delete work?

A

Handled by special “delete” entries, which are dummy values to say it has been edleted. When trees are merged, entries are omitted if they contain the delete value.

43
Q

What are the benefits of using LSM trees?

A

<ul>
<li>Inserts are done using only sequential I/O operations, minimal block access</li>
<li>Leaves are full, no space wastage</li>
<li>Reduced number of I/O operations per record insert</li>
</ul>

44
Q

What are the disavantages of using LSM trees?

A

<ul>
<li>Queries have to search multiple trees</li>
<li>Entire copy of each level made multiple times</li>
</ul>

45
Q

What is the key idea with a buffer tree?

A

Each internal node of the B+ tree has a buffer to store inserts, when buffer is full, records are sorted on search key and moved to appropriate child

46
Q

What are the benefits of using buffer trees?

A

<ul>
<li>Less overhead on queries</li>
<li>Can be used with any tree based structure. flexible</li>
</ul>

47
Q

What are the disadvantages of using buffer trees?

A

<ul>
<li>More random I/O than LSM trees</li>
</ul>

48
Q

What is the time complexity of access in ordered indices?

A

O(n)

49
Q

What is the time complexity of access in B+ trees?

A

O(logn)

50
Q

What is a bucket?

A

Unit of storage containing one or more entries

51
Q

What is a hash function?

A

Function from the set of all search key values to the set of all bucket addresses

52
Q

What are hash functions used for?

A

Used to locate entries for access, insertion and deletion

53
Q

What is stored in the hash index?

A

Entries with pointers to records

54
Q

What would the worst hash function do?

A

Map all the search keys to the same bucket

55
Q

Give some features of ideal hash functions

A

<ul>
<li>Uniform</li>
<li>Pseudo-random</li>
</ul>

56
Q

What is a uniform hash function?

A

Where each bucket is assigned the same number of search key values from the set of all possible values

57
Q

What is a pseudorandom hash function?

A

Each bucket will have the same number of records irrespective of the actual distribution of search key values

58
Q

What do typical hash functions perform computation on?

A

The internal binary representation of the search key

59
Q

What is a static hash function?

A

Maps search key values to a fixed set of bucket addresses

60
Q

What happens if we choose a set size that is too large for static hashing?

A

Space is wasted, too many buckets

61
Q

What happens if we choose a set size that is too small for static hashing?

A

Too many values map to a given bucket, performance suffers as linear search and overflow buckets are created

62
Q

What should be the chosen number of buckets for static hashing?

A

(number of records that we think we will have)/(number of records that fit in a bucket) * (1+d) where d is a “fudge factor”

63
Q

What can cause bucket overflow?

A

<ul>
<li>Insufficient buckets</li>
<li>Skew in actual distribution of records</li>
</ul>

64
Q

How is overflow handled in static hashing?

A

Use of overflow buckets and overflow chaining

65
Q

What kind of data structure is used for overflow chaining?

A

Linked list

66
Q

What is the main problem with static hashing?

A

Hash function can’t easily be changed once database contains data, can’t account for changes in the number of records

67
Q

What is the alternative solution to static hashing problems that is not dynamic hashing?

A

Periodically restructuring by changing the hash function but this is very expensive and disrupts normal operation

68
Q

Describe the extendable hashing scheme at a high level

A

Allows the database to change size by splitting and combining buckets. Reorganisation performed one bucket at a time, so overhead is incremental.

69
Q

Give details about how extendable hashing works

A

Hash function generates values over a large range represented by b bit integers
Buckets created on demand
Not all b bits of the hash used

70
Q

What are the first i bits of the hash used as?

A

An index into an additional table of bucket addresses

71
Q

Describe the process of lookup in extendable hashing

A

Calculate the hash value, take the i most significant bits and use decimal value of these bit as a numerical index into the bucket address table

72
Q

What are the steps for insertion in extendable hashing?

A

Lookup to find bucket, if there is enough room insert it, or if there is not, split the bucket and redistribute entries

73
Q

What are the steps for lookup in extendable hashing?

A

Compute the hash function, use the first i bits as a decimal index into the bucket address table, follow pointer to correct bucket and do sequential search

74
Q

What values do you have to look at to split a bucket?

A

ij of the desired bucket and the number of bits used for i

75
Q

What are the two cases for ij when you are looking to split a bucket?

A

<ul>
<li>ij < i</li>
<li>ij = i</li>
</ul>

76
Q

What can happen in the case where ij < i?

A

Can split the bucket without increasing the size of the bucket address table

77
Q

Describe the steps that happen when you are splitting a bucket without increasing BAT size

A

Allocae new bucket and change ij and iz to ij+1. Rehash all records in bucket j, alllocate them to correct bucket

78
Q

What can happen in the case where ij = i?

A

Can’t split the bucket without expanding the size of BAT

79
Q

Describe splitting the bucket when you also need to increase the BAT size

A

<ol>
<li>Increment i by one (doubling size of BAT)</li>
<li>Each BAT cell is replaced by two cells with pointers to the same bucket</li>
<li>Split each cell into two cells which both point to the same bucket that they would have been in originally</li>
<li>Restart the insertion</li>
</ol>

80
Q

What are the benefits of using hash indexing?

A

<ul>
<li>Does not degrade with growth of file</li>
<li>Minimal space overhead</li>
<li>Rehasing is incremental and one bucket at a time, very local changes</li>
</ul>

81
Q

What are the disadvantages of using hash indexing?

A

<ul>
<li>Extra level of indirection to find desired record (BAT)</li>
<li>BAT may itself become very large</li>
</ul>

82
Q

What is a potential solution to solve the BAT getting hard to store when itself becomes very large?

A

B+ structure to locate desired record in BAT

83
Q

What are the relevant factors to consider when deciding between ordered or hash indexing?

A

<ul>
<li>Can you deal with having to reorganise hash?</li>
<li>How often will you be inserting or deleting?</li>
<li>Is average or worst case time more important?</li>
<li>What type of queries are most common?</li>
</ul>

84
Q

Describe the performance of ordered indices and hash indices when you’re performing a select query looking for the exact value of a predicate

A

Ordered index

<ul>
<li>Average time and worst case proportional to log number of values of that attribute</li>
</ul>

Hash
<ul>
<li>Average time constant</li>
<li>Worse case proportional to number of values of that attribute</li>
</ul>
85
Q

What method would you use for values that have an exact equality to some given constant?

A

Hash

86
Q

What method would you use for a query where you are looking for values within a range?

A

Ordered index

87
Q

Why is a hash bad for looking for values in a rage?

A

No easy way to find the next bucket in order