3. B+ Trees Flashcards
What is an index?
An index is a data structure that helps speed up reads on a specific key
3 B+ trees properties
- The number d is the order of a B+ tree. Each node (with the exception of the root node) must have d ≤ x ≤ 2d entries assuming no deletes happen (it’s possible for leaf nodes to end up with < d entries if you delete data). The entries within each node must be sorted.
- In between each entry of an inner node, there is a pointer to a child node. Since there are at most 2d entries in a node, inner nodes may have at most 2d + 1 child pointers. This is also called the tree’s fanout.
- The keys in the children to the left of an entry must be less than the entry while the keys in the children to the right must be greater than or equal to the entry.
Which nodes of B+ trees contain records (or pointers to them)?
Only the leaf nodes contain records.
Describe B+ tree insertion
(1) Find the leaf node L in which you will insert your value. You can do this by traversing down the tree. Add the key and the record to the leaf node in order.
(2) If L overflows (L has more than 2d entries)…
(a) Split into L1 and L2. Keep d entries in L1 (this means d + 1 entries will go in L2).
(b) If L was a leaf node, COPY L2’s first entry into the parent. If L was not a leaf node, MOVE L2’s first entry into the parent.
(c) Adjust pointers.
(3) If the parent overflows, then recurse on it by doing step 2 on the parent.
Describe B+ tree deletion
To delete a value, just find the appropriate leaf and delete the unwanted value from that leaf. That’s all there is to it. (Yes, technically we could end up violating some of the invariants of a B+ tree. That’s okay because in practice we get way more insertions than deletions so something will quickly replace whatever we delete.)
3 alternatives for storing the actual records in B+ trees
In the Alternative 1 scheme, the leaf pages are the data pages. Rather than containing pointers to records, the leaf pages contain the records themselves.
In the Alternative 2 scheme, the leaf pages hold pointers to the corresponding records.
In the Alternative 3 scheme, the leaf pages hold lists of pointers to the corresponding records. This is more compact than Alternative 2 when there are multiple records with the same leaf node entry.
Unclustered index - explain
In an unclustered index, the data pages are complete chaos. Thus, odds are that you’re going to have to read a separate page for each of the records you want.
Clustered index - explain
In a clustered index, the data pages are sorted on the same index on which you’ve built your B+ tree. This does not mean that the data pages are sorted exactly, just that keys are roughly in the same order as data.
Reason about the performance of clustered vs. unclustered indices
The difference in I/O cost comes from caching, where two records with close keys will likely be on the same page, so the second one can be read from the cached page.
- UNCLUSTERED = ∼ 1 I/O per record.
- CLUSTERED = ∼ 1 I/O per page of records.
Describe B+ trees bulk-loading
(1) Sort the data on the key the index will be built on.
(2) Fill leaf pages until some fill factor f.
(3) Add a pointer from parent to leaf page. If the parent overflows, we will follow a procedure similar to insertion. We will split the parent into two nodes:
(a) Keep d entries in L1 (this means d + 1 entries will go in L2).
(b) Since a parent node overflowed, we will MOVE L2’s first entry into the parent.
(4) Adjust pointers