Search Trees Flashcards
Ordered Data Sets
hashing stores items in random order
data should be ordered by key to support insert, remove, lookup
operations: lower bound, upper bound, range scan
Standard Binary Search (arrays)
logn steps if array is sorted
0.5 branch misses per comparison on avg for random searches
access pattern is not cache friendly
on large arrays is slower than in B-tree
Branch-Free Binary Search
conditional branch can be transformed to data dependency
no branch misses for small arrays
Interpolation Search
assumes keys are uniformly distributed(distance between Elems is same/close to it) 10 13 15 26 28 30
a single, very large or very small outlier will destroy interpolation
can be combined with binary search
Linear Search
O(n) steps
can be improved using SIMD instructions
good locality
array doesn’t have to be sorted for point search
best solution for small arrays
Blocking Layout
store hot keys together in blocks
recursive structure
improves locality
allows use of SIMD instructions (k-ary search)
Growing Arrays
better locality than linked lists
one downsize is growth because of limited size
standard approach: exponential growth plus copy
Growing Arrays without Copying
64 pointers pointing to array of values
good locality
Comparison-based Search Trees (binary search trees)
compare one elem at time
each comparison rules out half the elems (if tree is balanced)
log2n comparisons and height
Binary Search Trees
each node stores one key/value pair and two child pointers
complex rebalancing logic
bad cache locality
pointer based: elem doesnt have to be moved physically during rebalancing
Adv:
if you insert an element, you just allocate a new entry and point into tree
if tree rebalanced: no need to copy, you just change pointer structure of the tree
Skip List
elems in sorted linked list + additional shortcuts on top
probabilistic data structure (no worst case guarantees)
all items are stored in a sorted order by the key
B+-tree
variant of B tree where inner nodes only store pointers, no values
leaf nodes are often chained to simplify range scan
Sorted Arry
worst case for n elems:
lookup log2n
insert n
Sorted Blocks of Fixed Size
split array into small sorted arrays of b elems, b is fixed
Inner Nodes
indicate path to leaf node that has actual data
tree height: logb(n)
comparisons: log2n
B-tree logic decomposition
two parts:
1. structure-modification operations (split, merge, new root) determine shape of B tree
2. per-node operations (lower bound, insert, delete, iterate) work on nodes locally
(2) can be changed without affecting high level logic (1)
Physical Node Layouts
most common layout: array of sorted key/value pairs
binary search within each node
insert/delete moves half of elems on avg(to the right)
alternatives: store keys and values separately, unsorted leaf nodes but sorted immer nodes, encode binary search tree, trie or hash table within node
what is the best node size?
in main memory, node size (b) can be freely chosen
node size should be at least cache line size
optimal page size for lookup is bigger that for inserts
B trees with variable-length data
bad cache locality, because need ti dereference the pointer which generally will not be in cache
lots of memory allocation
memmory fragmentation from memory allocator
Slotted page layout
enables variable-lenght data in sorted order
slots decouple logical from physical ordering
slots store offset and length
Compaction
deletion in slotted page leaves holes
keep holes after deletion, but keep track of how many bytes are wasted
when running out of space, one can trigger compaction if its beneficial
best implemented by copying all keys to new empty page
Separator Choice
when splitting node: choose shortest key around median as separator
Suffix Truncation of Separator
keys in inner nodes are only used as signposts, do not have to be stored in leaf nodes
if separator is ABCDE and next key is AXYZ, separator can be truncated to AX
Prefix Truncation, its difficulty
extract common prefix between smallest and largest key on node
on lookup, compare prefix first
speeds up comparison and saves space
issue: insertion of key may cause prefix to shrink -> all keys need to be enlarged, may not fit into node anymore, causing a split
solution: store fence keys on each page
Fence keys for prefix truncation
use common prefix of fence keys
those are immutable after page was created by split and merge
can be stored on page or derived on demand
key head in slot optimization
with slotted page layout, binary search requires loading key for every comparison
extract or copy first 4 bytes of key into slot
Hints
extract a fixed number of equi-distant heads and store them consecutively
Textbook merge (deletion and space management in b trees)
to avoid underfull nodes after deletion, one must merge or rebalance
if node is underfull (<50% fill factor):
find a neighboring node
if neighboring node has low enough fill factor:
merge two nodes into one, remove separator in parent, if parent is underfull merge parent the same way
else rebalance elems between two nodes
B*tree
guarantees 66% fill factor.
insert and delete rebalance between 3 neighboring nodes
Xmerge
insead of balancing during insert and delete, increase fill factor in background
if fill factor is low, merge X neighboring nodes into X-1 nodes
B-tree: Cascading Insert
Scenario:
* leaf node is full, have to split and insert separator into parent
* parent is also full, have to split and insert separator into its parent
* repeat…
→ Insert can cascade upwards until a new root has to be created
Alternative: eager split
* when walking the tree during insert, split full inner nodes eagerly
* thus, when hitting the root node, the parent always has space
* simpler, can be faster, but average tree fill factor a bit lower
why is a B-Tree usually faster than a binary tree
Instead of storing one key and having two children, B-tree nodes have n keys and n+1 children, where n can be large. This shortens the tree (in terms of height) and requires much less disk access than a binary search tree would.
- in which scenario can binary search be slower than linear search?
if data is unsorted
what data structure is a probabilistic alternative to a binary tree?
Skip lists
- what is the main difference between a B+
-Tree and a classical B-Tree?
insert and delete rebalance between 3 neighboring nodes
- how many comparisons does binary search require on an array with N
entries?
logN
- how can you speed up lookup in a B-Tree?
link leaf nodes together, because range queries are faster