Indexing Flashcards

1
Q

What are the two methods for reducing access costs?

A

<ol>
<li>Ordered indexing</li>
<li>Hash indexing</li>
</ol>

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

Describe the basic overview of indices

A

Speeds up access to desired data by creating an index file consisting of records containing a search key and a pointer

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

What is the search key?

A

Attribute or set of attributes used to look up records in a file

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

What are the factors in index design?

A
<ul>
<li>Access types (specific vs range)</li>
<li>Access time</li>
<li>Insertion time</li>
<li>Deletion time</li>
<li>Space taken by the index</li>
</ul>
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

What is the primary index?

A

The index whose search key specifies the order of records in the data file

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

What is a primary index also knowna as

A

Clustering index

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

Describe dense primary indices

A

Index record appears for every search key value in the file

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

Describe sparse primary indices

A

A sparse index has index records for only some of the key values

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

How are records found with sparse primary indices?

A

Choose the index record with the greatest value <= search key and scan sequentially from that record

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

What is a benefit of sparse indices?

A

Takes less space and requires less maintenance, can insert without much bother

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

What is a benefit of dense indices?

A

Gives faster lookups, can go directly to the record

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

If the primary index doesn’t fit in memory, how can sparse indices be used?

A

Treat primary index on a disk as a sequential file and construct a sparse index on it

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

How is insertion done for dense indices?

A

If the search key value does not appear in the index, insert it

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

How is insertion done for sparse indices?

A

No changes need to be made unless a new block is created, and if new block created then first search key value appearing in the block is inserted into the index

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

How does deletion happen for dense indices?

A

If only one record at that search key then delete, otherwise update the pointer to point towards the values that need to continue to be stored only

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

How does deletion happen for sparse indices if the deleted value was the only record in the file with it’s search key value?

A

If an entry for the next search key value already exists in the index, delete the index entry for the deleted record’s search key
Otherwise
Update the index entry to point to the first record with the next search key

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
17
Q

How does deletion happen for sparse indices if the deleted value was not the only record in the file with it’s search key value?

A

Update the index entry to point to the next record with the same search key value

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
18
Q

Where do pointers in secondary index point to?

A

Buckets

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
19
Q

What are the buckets in terms of secondary index referencing?

A

Grouping of pointers to records of the same search key

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
20
Q

Why do secondary indices need to be dense?

A

File is not stored in the search key (secondary index) order

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
21
Q

What is the main disadvantage of index sequential file organisation?

A

Performance degrades over time, periodic reorganisation may be necessary to maintain performance

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
22
Q

What is the advantage of a B+ tree?

A

Automatically reorganised with small, local changes

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
23
Q

What are the disadvantages of B+ trees?

A

Insertion and deletions are costlier, space overhead

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
24
Q

Describe the length of paths from root to leaf in a B+ tree

A

All paths from root to leaf are of the same length

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
25
What is the range of children that an internal node can have in a B+ tree?
Between the ceiling of n/2 and n children
26
How many values can a leaf node have between?
Ceiling of (n-1)/2 and n-1 values
27
If a root is not a leaf how many children must it have?
At least 2
28
What do the pointers point to in a non leaf node?
Children
29
What do the pointers point to in a leaf node?
Records or pointer to next leaf node
30
What do non leaf nodes form in a B+ tree?
A multi level sparse index
31
What order of magnitude is the number of levels in a B+ tree?
Logarithmic
32
What is the maximum path length in lookup in a B+ tree?
Floor of logn/2(K)
33
What are the steps for insertion in B+ tree if search key is not present?
  1. Add the record to main file and create a bucket if necessary
  2. If there is room in relevant leaf node, add new key value,pointer pair
  3. Otherwise, split the node and add the new key value pointer pair
34
What are the two strategies to bypass the write inefficiency of B+ trees?
  1. Sort entries first and insert in sorted order
  2. Sort entries first and create tree layer by layer, starting with leaf level
35
Describe the steps in Bottom up B+ tree construction
  1. Sort the values
  2. Create the linked list representing the leaf nodes
  3. Create the next level, left to right
36
What are the two approaches to reducing the cost of writes?
  • Log structured merge tree
  • Buffer tree
37
Describe at a high level how log structured merge trees work
Use bottom up construction
38
Give a high level description of buffer trees
Insert multiple ordered indices at once
39
Give a detailed description about how log structured merge trees work
  • Compromises multiple LSM trees of increasing size with at least L0 fitting in memory
40
How does LSM insert work?
Values are inserted into L0 until L0 is full. Records are then moved onto L1 and into the disk, with bottom up construction done
41
In what use case is LSM useful?
If data permenantly matures over time, no additional changes to come back and make
42
How does LSM delete work?
Handled by special "delete" entries, which are dummy values to say it has been edleted. When trees are merged, entries are omitted if they contain the delete value.
43
What are the benefits of using LSM trees?
  • Inserts are done using only sequential I/O operations, minimal block access
  • Leaves are full, no space wastage
  • Reduced number of I/O operations per record insert
44
What are the disavantages of using LSM trees?
  • Queries have to search multiple trees
  • Entire copy of each level made multiple times
45
What is the key idea with a buffer tree?
Each internal node of the B+ tree has a buffer to store inserts, when buffer is full, records are sorted on search key and moved to appropriate child
46
What are the benefits of using buffer trees?
  • Less overhead on queries
  • Can be used with any tree based structure. flexible
47
What are the disadvantages of using buffer trees?
  • More random I/O than LSM trees
48
What is the time complexity of access in ordered indices?
O(n)
49
What is the time complexity of access in B+ trees?
O(logn)
50
What is a bucket?
Unit of storage containing one or more entries
51
What is a hash function?
Function from the set of all search key values to the set of all bucket addresses
52
What are hash functions used for?
Used to locate entries for access, insertion and deletion
53
What is stored in the hash index?
Entries with pointers to records
54
What would the worst hash function do?
Map all the search keys to the same bucket
55
Give some features of ideal hash functions
  • Uniform
  • Pseudo-random
56
What is a uniform hash function?
Where each bucket is assigned the same number of search key values from the set of all possible values
57
What is a pseudorandom hash function?
Each bucket will have the same number of records irrespective of the actual distribution of search key values
58
What do typical hash functions perform computation on?
The internal binary representation of the search key
59
What is a static hash function?
Maps search key values to a fixed set of bucket addresses
60
What happens if we choose a set size that is too large for static hashing?
Space is wasted, too many buckets
61
What happens if we choose a set size that is too small for static hashing?
Too many values map to a given bucket, performance suffers as linear search and overflow buckets are created
62
What should be the chosen number of buckets for static hashing?
(number of records that we think we will have)/(number of records that fit in a bucket) * (1+d) where d is a "fudge factor"
63
What can cause bucket overflow?
  • Insufficient buckets
  • Skew in actual distribution of records
64
How is overflow handled in static hashing?
Use of overflow buckets and overflow chaining
65
What kind of data structure is used for overflow chaining?
Linked list
66
What is the main problem with static hashing?
Hash function can't easily be changed once database contains data, can't account for changes in the number of records
67
What is the alternative solution to static hashing problems that is not dynamic hashing?
Periodically restructuring by changing the hash function but this is very expensive and disrupts normal operation
68
Describe the extendable hashing scheme at a high level
Allows the database to change size by splitting and combining buckets. Reorganisation performed one bucket at a time, so overhead is incremental.
69
Give details about how extendable hashing works
Hash function generates values over a large range represented by b bit integers Buckets created on demand Not all b bits of the hash used
70
What are the first i bits of the hash used as?
An index into an additional table of bucket addresses
71
Describe the process of lookup in extendable hashing
Calculate the hash value, take the i most significant bits and use decimal value of these bit as a numerical index into the bucket address table
72
What are the steps for insertion in extendable hashing?
Lookup to find bucket, if there is enough room insert it, or if there is not, split the bucket and redistribute entries
73
What are the steps for lookup in extendable hashing?
Compute the hash function, use the first i bits as a decimal index into the bucket address table, follow pointer to correct bucket and do sequential search
74
What values do you have to look at to split a bucket?
ij of the desired bucket and the number of bits used for i
75
What are the two cases for ij when you are looking to split a bucket?
  • ij < i
  • ij = i
76
What can happen in the case where ij < i?
Can split the bucket without increasing the size of the bucket address table
77
Describe the steps that happen when you are splitting a bucket without increasing BAT size
Allocae new bucket and change ij and iz to ij+1. Rehash all records in bucket j, alllocate them to correct bucket
78
What can happen in the case where ij = i?
Can't split the bucket without expanding the size of BAT
79
Describe splitting the bucket when you also need to increase the BAT size
  1. Increment i by one (doubling size of BAT)
  2. Each BAT cell is replaced by two cells with pointers to the same bucket
  3. Split each cell into two cells which both point to the same bucket that they would have been in originally
  4. Restart the insertion
80
What are the benefits of using hash indexing?
  • Does not degrade with growth of file
  • Minimal space overhead
  • Rehasing is incremental and one bucket at a time, very local changes
81
What are the disadvantages of using hash indexing?
  • Extra level of indirection to find desired record (BAT)
  • BAT may itself become very large
82
What is a potential solution to solve the BAT getting hard to store when itself becomes very large?
B+ structure to locate desired record in BAT
83
What are the relevant factors to consider when deciding between ordered or hash indexing?
  • Can you deal with having to reorganise hash?
  • How often will you be inserting or deleting?
  • Is average or worst case time more important?
  • What type of queries are most common?
84
Describe the performance of ordered indices and hash indices when you're performing a select query looking for the exact value of a predicate
Ordered index
  • Average time and worst case proportional to log number of values of that attribute
``` Hash
  • Average time constant
  • Worse case proportional to number of values of that attribute
```
85
What method would you use for values that have an exact equality to some given constant?
Hash
86
What method would you use for a query where you are looking for values within a range?
Ordered index
87
Why is a hash bad for looking for values in a rage?
No easy way to find the next bucket in order