Approximation Flashcards
Bloom Filter Interface
probabilistic with one-sided errors:
* if the key is in the set, contains always returns true
(no false negatives)
* if the key is not in the set, contains may return true or false
(false positives are possible)
~10 bits per element for a ~1% false positive rate
Multiple Hash Functions
* number of elements in set: n
* filter size (in bits): m
* number of hash functions: k
* insert sets all k bits to 1, remove not possible
* contains returns true if all k bits are 1, false otherwise
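Putting the two cards above together, a minimal Bloom filter sketch in Python (the class name and the use of blake2b with a per-function personalization are illustrative choices, not part of the cards):

```python
import hashlib

class BloomFilter:
    """Sketch of a Bloom filter with m bits and k hash functions."""

    def __init__(self, m: int, k: int):
        self.m = m                          # filter size in bits
        self.k = k                          # number of hash functions
        self.bits = bytearray((m + 7) // 8)

    def _positions(self, key: str):
        # Derive k bit positions by personalizing the hash with the
        # function index i (one of many ways to simulate k hash functions).
        for i in range(self.k):
            h = hashlib.blake2b(key.encode(), person=i.to_bytes(16, "little"))
            yield int.from_bytes(h.digest()[:8], "little") % self.m

    def insert(self, key: str):
        # Set all k bits to 1; removal is not possible.
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def contains(self, key: str) -> bool:
        # True only if all k bits are 1 (may be a false positive).
        return all(self.bits[p // 8] >> (p % 8) & 1
                   for p in self._positions(key))
```

An inserted key is always found again (no false negatives); a never-inserted key is usually, but not certainly, rejected.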
Blocked Bloom Filter
- low false positive rates require larger k
- this increases number of memory accesses
- idea: hash-partition filter and perform only one access
- cache line blocks of 64 bytes (512 bits)
- register blocks of 8 bytes (64 bits)
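The single-access idea can be sketched with register-sized (64-bit) blocks; the class name, the hash choice, and using 6 hash bits per position are illustrative assumptions:

```python
import hashlib

class RegisterBlockedBloom:
    """Sketch: hash-partition the filter into 64-bit (register-sized) blocks.
    All k bits of a key land in one block -> a single memory access."""

    def __init__(self, num_blocks: int, k: int):
        self.blocks = [0] * num_blocks   # each entry models one 64-bit register
        self.k = k

    def _block_and_mask(self, key: str):
        h = int.from_bytes(hashlib.blake2b(key.encode()).digest()[:16], "little")
        block = h % len(self.blocks)     # hash-partition: pick one block
        h //= len(self.blocks)
        mask = 0
        for _ in range(self.k):          # k bit positions inside the block
            mask |= 1 << (h & 63)        # 6 hash bits select a bit in 0..63
            h >>= 6
        return block, mask

    def insert(self, key: str):
        block, mask = self._block_and_mask(key)
        self.blocks[block] |= mask

    def contains(self, key: str) -> bool:
        block, mask = self._block_and_mask(key)
        return self.blocks[block] & mask == mask
```

The trade-off: one cache/register access per lookup, at the cost of a somewhat higher false positive rate than an unblocked filter of the same size.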
Count-Distinct Estimation: Basic Idea
- How many distinct values are there in a sequence (or stream) of keys?
- hash each key, count the leading zeros of the hash (0011011 -> 2), and track the maximum over all keys
- a maximum of z suggests roughly 2^z distinct values
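A minimal sketch of the leading-zero trick (the function names, the 32-bit hash width, and the 2^(z+1) scaling are illustrative assumptions):

```python
import hashlib

def leading_zeros(key: str, bits: int = 32) -> int:
    """Leading zeros of a fixed-width (here 32-bit) hash of the key."""
    h = int.from_bytes(hashlib.blake2b(key.encode()).digest()[:4], "big")
    return bits - h.bit_length()

def naive_estimate(keys) -> int:
    # Roughly one key in 2^(z+1) hashes to z or more leading zeros,
    # so the largest observed z suggests about 2^(z+1) distinct keys.
    z = max(leading_zeros(k) for k in keys)
    return 2 ** (z + 1)
```

This single-maximum estimate is only a starting point; the next card explains why it is too noisy on its own.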
Count-Distinct Estimation: Problem and Solution
two problems:
* using only one maximum leads to estimates with high variance
* a single outlier that happens to have many leading zeros will lead to severe
overestimation
solution:
* use multiple sketches (maxima), partitioned by hash
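The partitioned-maxima solution can be sketched as follows; the bucket count, the hash choice, and the simple mean-based combination are illustrative (HyperLogLog instead uses a harmonic mean with bias correction):

```python
import hashlib
import statistics

def estimate_distinct(keys, b: int = 6) -> float:
    """Stochastic averaging: the first b hash bits pick one of 2^b buckets,
    each bucket keeps its own maximum; an outlier only skews one bucket."""
    m = 2 ** b
    maxima = [0] * m
    for key in keys:
        h = int.from_bytes(hashlib.blake2b(key.encode()).digest()[:8], "big")
        bucket = h >> (64 - b)                    # first b bits: bucket index
        rest = h & ((1 << (64 - b)) - 1)          # remaining bits are ranked
        rank = (64 - b) - rest.bit_length() + 1   # position of first 1-bit
        maxima[bucket] = max(maxima[bucket], rank)
    # Simplified combination: scale the mean of the per-bucket maxima
    # (real LogLog/HyperLogLog add a bias-correction constant).
    return m * 2 ** statistics.mean(maxima)
```

Averaging 2^b independent maxima shrinks the variance, and a lucky hash with many leading zeros now distorts only a single bucket.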
Succinct data structure: space consumption
information-theoretic space bound: L
compact: O(L), e.g. 2L
succinct: L + o(L), e.g. L + √L
implicit: L + O(1), e.g. L + 3
Level-Ordered Unary Degree Sequence (LOUDS)
- stores nodes in breadth-first order
- encodes each node’s degree (# of children) using unary code
- no pointers, the tree is one bit string
- LOUDS needs ~2 bits per node (2n + 1 bits for n nodes, including a virtual super-root)
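A tiny sketch of the encoding described above (the function name and the dict-based tree representation are illustrative):

```python
from collections import deque

def louds_encode(tree: dict, root) -> str:
    """LOUDS bit string: visit nodes in breadth-first order and write each
    node's degree d in unary as d ones followed by a zero."""
    bits = ["10"]                       # virtual super-root with one child
    queue = deque([root])
    while queue:
        node = queue.popleft()
        children = tree.get(node, [])
        bits.append("1" * len(children) + "0")
        queue.extend(children)
    return "".join(bits)

# Example: a has children b, c; b has child d.
# BFS order a, b, c, d gives "10" + "110" + "10" + "0" + "0".
tree = {"a": ["b", "c"], "b": ["d"]}
louds_encode(tree, "a")  # -> "101101000"
```

Note there are no pointers: the whole tree is one bit string of 2n + 1 bits (9 bits for the 4-node example), and navigation is done with rank/select on that string.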
Fast Succinct Trie (FST)
- span of 8 bits (maximum fanout is 256)
- nodes are encoded in level order (like LOUDS)
- two encodings: LOUDS-Dense and LOUDS-Sparse
- space-efficient and reasonable performance
FST: LOUDS-Dense
four arrays:
* bit sequence D-Labels: 256 bits per node, one per possible branching label (branch present?)
* bit sequence D-HasChild: 256 bits per node, does the branch lead to a child sub-trie or a value?
* bit sequence D-IsPrefixKey: 1 bit per node, is the prefix leading to this node itself a valid key?
* sequence for the values (D-Values)
FST: LOUDS-Sparse
four arrays:
* byte sequence S-Labels: records all the branching labels of each trie node
* bit sequence S-HasChild: 1 bit per label, does the label lead to a child sub-trie or a value?
* bit sequence S-LOUDS: 1 bit per label, set for the first label of each node (marks node boundaries)
* sequence for the values (S-Values)
Learned Indexes
- idea: use a statistical model (or ML) to predict the location of a key in sorted data
- models the cumulative distribution function (CDF) of the keys
Bloom Filter: Review Questions
- which three parameters determine the false positive rate of a bloom filter?
n: number of elements in set
m: filter size in bits
k: number of hash functions
- how can we improve the cache locality of a bloom filter?
blocked bloom filter
Learned indexes: issues
- bad cache locality: every lookup has to evaluate the model
- "last mile" problem: the model predicts the approximate location well, but
shrinking the remaining search range further takes disproportionate effort
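The CDF-modeling idea, including the "last mile" fix-up, can be sketched with a single linear model (real learned indexes use a hierarchy of models; the class name and least-squares fit here are illustrative assumptions):

```python
import bisect

class LearnedIndexSketch:
    """Sketch of a learned index: a linear model approximates the CDF of the
    keys, predicting a position in the sorted array; a bounded local search
    handles the remaining ("last mile") error."""

    def __init__(self, keys):
        self.keys = sorted(keys)
        n = len(self.keys)
        # Least-squares fit of position ~ slope * key + intercept,
        # i.e. a linear approximation of the key CDF (scaled by n).
        xs, ys = self.keys, range(n)
        mx, my = sum(xs) / n, sum(ys) / n
        var = sum((x - mx) ** 2 for x in xs)
        self.slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / var
        self.intercept = my - self.slope * mx
        # Maximum prediction error bounds the local search window.
        self.err = max(abs(self._predict(x) - i) for i, x in enumerate(xs))

    def _predict(self, key) -> int:
        return round(self.slope * key + self.intercept)

    def lookup(self, key) -> int:
        """Return the position of key, searching only [pred-err, pred+err]."""
        p = self._predict(key)
        lo = max(0, p - self.err)
        hi = min(len(self.keys), p + self.err + 1)
        i = bisect.bisect_left(self.keys, key, lo, hi)
        if i < len(self.keys) and self.keys[i] == key:
            return i
        raise KeyError(key)
```

The stored error bound makes the lookup exact despite the approximate model; keeping that bound small is exactly the "last mile" problem from the card above.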