Approximation Flashcards
Bloom Filter Interface
probabilistic with one-sided errors:
* if the key is in the set, contains always returns true
(no false negatives)
* if the key is not in the set, contains may return true or false
(false positives are possible)
~10 bits per element for a ~1% false positive rate
Multiple Hash Functions
* number of elements in set: n
* filter size (in bits): m
* number of hash functions: k
* insert sets all k bits to 1, remove not possible
* contains returns true if all k bits are 1, false otherwise
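Putting the two cards above together, a minimal Bloom filter sketch in Python (the class name and the use of blake2b with a per-function personalization are illustrative choices, not part of the cards):

```python
import hashlib

class BloomFilter:
    """Sketch of a Bloom filter with m bits and k hash functions."""

    def __init__(self, m: int, k: int):
        self.m = m                          # filter size in bits
        self.k = k                          # number of hash functions
        self.bits = bytearray((m + 7) // 8)

    def _positions(self, key: str):
        # Derive k bit positions by personalizing the hash with the
        # function index i (one of many ways to simulate k hash functions).
        for i in range(self.k):
            h = hashlib.blake2b(key.encode(), person=i.to_bytes(16, "little"))
            yield int.from_bytes(h.digest()[:8], "little") % self.m

    def insert(self, key: str):
        # Set all k bits to 1; removal is not possible.
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def contains(self, key: str) -> bool:
        # True only if all k bits are 1 (may be a false positive).
        return all(self.bits[p // 8] >> (p % 8) & 1
                   for p in self._positions(key))
```

An inserted key is always found again (no false negatives); a never-inserted key is usually, but not certainly, rejected.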
Blocked Bloom Filter
- low false positive rates require larger k
- this increases number of memory accesses
- idea: hash-partition filter and perform only one access
- cache line blocks of 64 bytes (512 bits)
- register blocks of 8 bytes (64 bits)
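The single-access idea can be sketched with register-sized (64-bit) blocks; the class name, the hash choice, and using 6 hash bits per position are illustrative assumptions:

```python
import hashlib

class RegisterBlockedBloom:
    """Sketch: hash-partition the filter into 64-bit (register-sized) blocks.
    All k bits of a key land in one block -> a single memory access."""

    def __init__(self, num_blocks: int, k: int):
        self.blocks = [0] * num_blocks   # each entry models one 64-bit register
        self.k = k

    def _block_and_mask(self, key: str):
        h = int.from_bytes(hashlib.blake2b(key.encode()).digest()[:16], "little")
        block = h % len(self.blocks)     # hash-partition: pick one block
        h //= len(self.blocks)
        mask = 0
        for _ in range(self.k):          # k bit positions inside the block
            mask |= 1 << (h & 63)        # 6 hash bits select a bit in 0..63
            h >>= 6
        return block, mask

    def insert(self, key: str):
        block, mask = self._block_and_mask(key)
        self.blocks[block] |= mask

    def contains(self, key: str) -> bool:
        block, mask = self._block_and_mask(key)
        return self.blocks[block] & mask == mask
```

The trade-off: one cache/register access per lookup, at the cost of a somewhat higher false positive rate than an unblocked filter of the same size.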
Count-Distinct Estimation: Basic Idea
- How many distinct values are there in a sequence (or stream) of keys?
- hash each key, count the leading zeros of the hash (0011011 -> 2), and track the maximum over all keys
- a maximum of z suggests roughly 2^z distinct values
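A minimal sketch of the leading-zero trick (the function names, the 32-bit hash width, and the 2^(z+1) scaling are illustrative assumptions):

```python
import hashlib

def leading_zeros(key: str, bits: int = 32) -> int:
    """Leading zeros of a fixed-width (here 32-bit) hash of the key."""
    h = int.from_bytes(hashlib.blake2b(key.encode()).digest()[:4], "big")
    return bits - h.bit_length()

def naive_estimate(keys) -> int:
    # Roughly one key in 2^(z+1) hashes to z or more leading zeros,
    # so the largest observed z suggests about 2^(z+1) distinct keys.
    z = max(leading_zeros(k) for k in keys)
    return 2 ** (z + 1)
```

This single-maximum estimate is only a starting point; the next card explains why it is too noisy on its own.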
Count-Distinct Estimation: Problem and Solution
two problems:
* using only one maximum leads to estimates with high variance
* a single outlier that happens to have many leading zeros will lead to severe
overestimation
solution:
* use multiple sketches (maxima), partitioned by hash
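The partitioned-maxima solution can be sketched as follows; the bucket count, the hash choice, and the simple mean-based combination are illustrative (HyperLogLog instead uses a harmonic mean with bias correction):

```python
import hashlib
import statistics

def estimate_distinct(keys, b: int = 6) -> float:
    """Stochastic averaging: the first b hash bits pick one of 2^b buckets,
    each bucket keeps its own maximum; an outlier only skews one bucket."""
    m = 2 ** b
    maxima = [0] * m
    for key in keys:
        h = int.from_bytes(hashlib.blake2b(key.encode()).digest()[:8], "big")
        bucket = h >> (64 - b)                    # first b bits: bucket index
        rest = h & ((1 << (64 - b)) - 1)          # remaining bits are ranked
        rank = (64 - b) - rest.bit_length() + 1   # position of first 1-bit
        maxima[bucket] = max(maxima[bucket], rank)
    # Simplified combination: scale the mean of the per-bucket maxima
    # (real LogLog/HyperLogLog add a bias-correction constant).
    return m * 2 ** statistics.mean(maxima)
```

Averaging 2^b independent maxima shrinks the variance, and a lucky hash with many leading zeros now distorts only a single bucket.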
Succinct data structure: space consumption
information-theoretic space bound: L
compact: O(L), e.g. 2L
succinct: L + o(L), e.g. L + √L
implicit: L + O(1), e.g. L + 3
Level-Ordered Unary Degree Sequence (LOUDS)
- stores nodes in breadth-first order
- encodes each node’s degree (# of children) using unary code
- no pointers, the tree is one bit string
- LOUDS needs ~2 bits per node (2n + 1 bits for n nodes, including a virtual super-root)
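A tiny sketch of the encoding described above (the function name and the dict-based tree representation are illustrative):

```python
from collections import deque

def louds_encode(tree: dict, root) -> str:
    """LOUDS bit string: visit nodes in breadth-first order and write each
    node's degree d in unary as d ones followed by a zero."""
    bits = ["10"]                       # virtual super-root with one child
    queue = deque([root])
    while queue:
        node = queue.popleft()
        children = tree.get(node, [])
        bits.append("1" * len(children) + "0")
        queue.extend(children)
    return "".join(bits)

# Example: a has children b, c; b has child d.
# BFS order a, b, c, d gives "10" + "110" + "10" + "0" + "0".
tree = {"a": ["b", "c"], "b": ["d"]}
louds_encode(tree, "a")  # -> "101101000"
```

Note there are no pointers: the whole tree is one bit string of 2n + 1 bits (9 bits for the 4-node example), and navigation is done with rank/select on that string.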
Fast Succinct Trie (FST)
- span of 8 bits (maximum fanout is 256)
- nodes are encoded in level order (like LOUDS)
- two encodings: LOUDS-Dense and LOUDS-Sparse
- space-efficient and reasonable performance
FST: LOUDS-Dense
four arrays:
* bit sequence D-Labels: 256 bits per node, one per possible branching label (branch present?)
* bit sequence D-HasChild: 256 bits per node, does the branch lead to a child sub-trie or a value?
* bit sequence D-IsPrefixKey: 1 bit per node, is the prefix leading to this node itself a valid key?
* sequence for the values (D-Values)
FST: LOUDS-Sparse
four arrays:
* byte sequence S-Labels: records all the branching labels of each trie node
* bit sequence S-HasChild: 1 bit per label, does the label lead to a child sub-trie or a value?
* bit sequence S-LOUDS: 1 bit per label, set for the first label of each node (marks node boundaries)
* sequence for the values (S-Values)
Learned Indexes
- idea: use a statistical model (or ML) to predict the location of a key in sorted data
- models the cumulative distribution function (CDF) of the keys
Bloom Filter: Review Questions
- which three parameters determine the false positive rate of a bloom filter?
n: number of elements in set
m: filter size in bits
k: number of hash functions
- how can we improve the cache locality of a bloom filter?
blocked bloom filter
Learned indexes: issues
- bad cache locality: every lookup has to evaluate the model
- "last mile" problem: the model predicts the approximate location well, but
shrinking the remaining search range further takes disproportionate effort
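The CDF-modeling idea, including the "last mile" fix-up, can be sketched with a single linear model (real learned indexes use a hierarchy of models; the class name and least-squares fit here are illustrative assumptions):

```python
import bisect

class LearnedIndexSketch:
    """Sketch of a learned index: a linear model approximates the CDF of the
    keys, predicting a position in the sorted array; a bounded local search
    handles the remaining ("last mile") error."""

    def __init__(self, keys):
        self.keys = sorted(keys)
        n = len(self.keys)
        # Least-squares fit of position ~ slope * key + intercept,
        # i.e. a linear approximation of the key CDF (scaled by n).
        xs, ys = self.keys, range(n)
        mx, my = sum(xs) / n, sum(ys) / n
        var = sum((x - mx) ** 2 for x in xs)
        self.slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / var
        self.intercept = my - self.slope * mx
        # Maximum prediction error bounds the local search window.
        self.err = max(abs(self._predict(x) - i) for i, x in enumerate(xs))

    def _predict(self, key) -> int:
        return round(self.slope * key + self.intercept)

    def lookup(self, key) -> int:
        """Return the position of key, searching only [pred-err, pred+err]."""
        p = self._predict(key)
        lo = max(0, p - self.err)
        hi = min(len(self.keys), p + self.err + 1)
        i = bisect.bisect_left(self.keys, key, lo, hi)
        if i < len(self.keys) and self.keys[i] == key:
            return i
        raise KeyError(key)
```

The stored error bound makes the lookup exact despite the approximate model; keeping that bound small is exactly the "last mile" problem from the card above.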