Week 8-9: Data Streams and Streaming Algorithms Flashcards
Data Stream
An ordered and potentially infinite sequence of elements.
Data streams have the following properties:
1. Potentially infinite (transient, single pass over the data, only summaries can be stored, real-time processing)
2. Non-static (incremental updates, concept drift, forgetting old data)
3. Temporal order may be important
Data streams can be modelled as server logs, collections of small text files (e.g., online posts), or, more abstractly, any ordered list of potentially infinite elements.
Element/Data Point
An element can include, but isn’t limited to:
1. Event
2. Figure
3. Record
4. Graph (in context of social network data streams)
Data Stream Example 1:
SELECT
System.Timestamp AS OutputTime, dspl AS SensorName,
Avg(temp) AS AvgTemperature
INTO
output
FROM
InputStream TIMESTAMP BY time
GROUP BY TumblingWindow(second,30),dspl
HAVING Avg(temp)>100
Monitor Avg(temp) over 30-second tumbling windows and output SensorName and Avg(temp) whenever Avg(temp) > 100 degrees.
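The same tumbling-window logic can be sketched in Python (a batch simulation of the streaming query, not Azure Stream Analytics; function and field names are my own):

```python
from collections import defaultdict

def tumbling_window_alerts(events, window=30, threshold=100):
    """Group (time, sensor, temp) events into fixed tumbling windows and
    return per-(window, sensor) average temperatures above the threshold."""
    acc = defaultdict(lambda: [0.0, 0])          # (window_id, sensor) -> [sum, count]
    for time, sensor, temp in events:
        key = (time // window, sensor)           # tumbling windows don't overlap
        acc[key][0] += temp
        acc[key][1] += 1
    return {key: s / c for key, (s, c) in acc.items() if s / c > threshold}
```

Each event falls into exactly one window (time // window), mirroring TumblingWindow(second, 30); the dict comprehension plays the role of the HAVING clause.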
Data Stream Management System (DSMS)
It’s software that handles continuous data streams.
Processor
Streams enter the DSMS via the Processor, often with different velocities (arrival rates) and data types. Input streams need not have the same number of elements.
Archival Storage
It archives streams. Archival Storage isn’t meant for querying, because data retrieval from it is very slow.
Limited Working Storage
It serves as storage space in the disk or in the main memory. It’s used for answering queries. It can’t store all data.
Standing Queries
They reside in the Processor, execute permanently, and produce output at appropriate times.
Example: “output an alert when the variable TEMPERATURE exceeds 25”
Ad-hoc Queries
These are asked once about the current state of a stream or streams. The output of the DSMS, which comes from the Processor, is the answers to the query.
Example: “list the total number of transactions in the last hour”
Queries: Sampling Data from a Stream
Build a random sample, such as a random subset of the stream.
Queries: Sliding Windows
Find the number of items of type x in the last k elements of the stream.
Queries: Filtering a Data Stream
Select elements with property x from the stream.
Queries: Counting Distinct Elements
Number of distinct elements in the last k elements of the stream.
Queries: Estimating Frequency Moments
Estimate the average/standard deviation of the last k elements.
Queries: Finding Frequent Elements
Return the elements that appear most frequently in the stream. Can specify the k-most elements to return.
Data Stream Query Applications
- Mining Query Streams: Google search wants to know what queries are more frequent today than yesterday.
- Mining Click Streams: websites want to know which of its pages are getting an unusual number of clicks in the past hour.
- Mining social network news feeds: e.g., looking for trending topics on Twitter, Facebook.
Naïve Solution for Sampling
If we want to store a proportion r of the data stream, then sample proportion r of each user’s queries.
Does the Naïve Solution for Sampling Work for Estimating the Proportion of Duplicate Queries in the Stream by an Average User?
No.
Assume a user issues x queries once and d queries twice, and that 1/10 of the queries are sampled. The fraction of duplicate queries computed from the sample is d/(10x+19d).
The correct answer is d/(x+d).
A better solution is to select 1/10 of the users, store all their queries, and count the duplicates. Use a hash function h: user -> {1,2,…,10}. If h(user) = 1, we accept the user’s queries. Otherwise, we discard them.
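A minimal sketch of the user-hashing idea in Python (sha256 is my stand-in for the hash function h; names are assumptions):

```python
import hashlib

def user_bucket(user_id, buckets=10):
    """Deterministically map a user to one of `buckets` values: the h in the notes."""
    digest = hashlib.sha256(str(user_id).encode()).hexdigest()
    return int(digest, 16) % buckets + 1

def keep_query(user_id):
    """Accept a query iff its user hashes to bucket 1, i.e. keep ~1/10 of users."""
    return user_bucket(user_id) == 1
```

Because the decision depends only on the user, either all of a user’s queries are kept or none are, so duplicate counts within sampled users are exact.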
Reservoir Sampling
The goal of reservoir sampling is to get a uniform sample of the stream.
Method:
1. Add the first s tuples from the stream into a reservoir R.
2. For j>s, with probability s/j replace a random entry of R with the j-th tuple of the stream.
3. At j=n, return R.
The result is that R contains each tuple seen so far with probability s/n.
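The three steps above can be sketched directly in Python (a minimal single-pass implementation; variable names follow the notes):

```python
import random

def reservoir_sample(stream, s, rng=random):
    """Single-pass uniform random sample of size s (reservoir sampling)."""
    R = []
    for j, item in enumerate(stream, start=1):
        if j <= s:
            R.append(item)                  # step 1: fill the reservoir
        elif rng.random() < s / j:          # step 2: keep the j-th tuple w.p. s/j...
            R[rng.randrange(s)] = item      # ...replacing a uniformly random entry
    return R
```

A short induction shows that after n elements, each is in R with probability s/n, matching the claim above.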
DGIM Method
This approach is used to approximate how many 1’s are in the last k bits, where k <= N. The Naïve Method is brute force and isn’t practical for large values of N.
The DGIM Method doesn’t assume uniformity, has a bounded approximation error of O(1/r) (where r is the maximum number of buckets of each size), and stores only O(log^2 N) bits (logs base 2).
Note that the larger r is, the less the error, but the more bits we need to store.
DGIM Method: Timestamp
Each element has timestamp t mod N, where t = 1,2,3,… and N is the window length.
Each timestamp needs O(log N) bits of space.
Each bucket needs that much memory, and there are O(log N) buckets, which means O(log(N) * log(N)) = O(log^2 N) bits in total.
DGIM Method: Bucket
A bucket is a segment of the window. It contains the timestamp of its most recent element and the number of 1s in it (size).
The right end of a bucket is always a 1. Every 1 is in a bucket. No position is in more than one bucket. There are one or two buckets of any given size, up to some maximum size. All sizes must be a power of 2. Bucket sizes can’t increase as we move from left (older) to right (newer).
DGIM Method: Pseudocode
For each new bit:
    if the timestamp of the oldest bucket is N or more time units before the current time:
        drop the oldest bucket
    if bit == 1:
        create a new bucket of size 1 with the current timestamp
        while there are three buckets of some size k:
            combine the two oldest of them into one bucket of size 2*k
    if bit == 0:
        make no further changes
To estimate the number of 1s in the window, add the sizes of all buckets except the oldest, plus half the size of the oldest bucket, which may be partially cut off for extending beyond N.
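The bucket bookkeeping above can be sketched as a small Python class (my own representation: buckets are (timestamp, size) pairs kept newest-first; this is an illustrative sketch, not an optimised implementation):

```python
class DGIM:
    """DGIM sketch with at most two buckets of any size."""

    def __init__(self, window):
        self.N = window
        self.t = 0
        self.buckets = []                    # (timestamp, size); newest at index 0

    def add(self, bit):
        self.t += 1
        # Drop the oldest bucket once its timestamp falls outside the window.
        if self.buckets and self.buckets[-1][0] <= self.t - self.N:
            self.buckets.pop()
        if bit == 1:
            self.buckets.insert(0, (self.t, 1))
            i = 0
            # Cascade: three buckets of one size -> merge the two oldest of them.
            while i + 2 < len(self.buckets) and \
                    self.buckets[i][1] == self.buckets[i + 2][1]:
                ts = self.buckets[i + 1][0]  # keep the more recent timestamp
                self.buckets[i + 1] = (ts, 2 * self.buckets[i + 1][1])
                del self.buckets[i + 2]
                i += 1

    def count(self):
        """Estimate the 1s in the last N bits: every bucket fully,
        except only half of the oldest one."""
        if not self.buckets:
            return 0
        return sum(b[1] for b in self.buckets[:-1]) + self.buckets[-1][1] / 2
```

The sizes seen from newest to oldest are non-decreasing, so checking positions i and i+2 for equality suffices to detect a triple.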
Hash Function Filter
Given a set of keys S that we want to filter by (e.g., a white-list of email addresses):
1. Create a bit array B of n bits, initially all 0’s.
2. Choose a hash function h with range [0,n)
3. Hash each member of s \in S to one of n buckets, and set that bit to 1, i.e. B[h(s)]=1
4. Hash each element a of the stream with h and output a if it hashes to a bit of B that is 1 (i.e., Output a if B[h(a)] == 1)
If an element is in S, it will hash to a bucket that has its bit set to 1, so it’ll always get through (no false negatives).
If an element isn’t in S, the chance that it gets through is 1 - e^{- |S|/n}, with n the range of the hash function.
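A minimal sketch of this single-hash filter in Python (sha256 stands in for the hash function h; names are my own):

```python
import hashlib

def make_hash(n):
    """A hash function with range [0, n); sha256-based stand-in for h."""
    return lambda x: int(hashlib.sha256(str(x).encode()).hexdigest(), 16) % n

def filter_stream(S, stream, n=1000):
    """Pass stream elements a whose bit B[h(a)] was set by some key in S."""
    h = make_hash(n)
    B = [0] * n
    for s in S:
        B[h(s)] = 1                          # steps 1-3: build the bit array
    return [a for a in stream if B[h(a)] == 1]   # step 4: filter the stream
```

With |S| = 50 and n = 1000, roughly 1 - e^{-50/1000} ≈ 5% of non-members slip through, matching the formula above.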
Bloom Filter
It’s a probabilistic set-membership test. A Bloom filter is faster than searching through S and smaller than an explicit representation. In theory, false negatives are impossible, but false positives can occur.
Parameters:
1. |S|, the number of keys in S. Increasing |S| leads to a higher false positive rate.
2. n, size of the bloom filter (array B), increasing n means more space, and a lower false positive rate.
3. k, the number of hash functions. As k increases, there’s more computation. Note there’s usually an optimum value of k.
Given a set of keys S that we want to filter
1. Create a bit array B of n bits, initially all 0
2. Choose k hash functions h_1,…,h_k with range [0,n)
3. Hash each member s \in S with the k hash functions h_i(s), i \in [1,k], which map s uniformly into [0,n) (take the result modulo n if a hash function outputs large numbers). Set B[h_1(s)],…,B[h_k(s)] to 1.
4. When a stream element with key y arrives, use k hash functions. If B[h_1(y)],…,B[h_k(y)] are all 1, output y. Else, discard y.
Note that a stream key might be different than any key s \in S, but still might get through if the combined bloom filter has a matching configuration of 1 bits.
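The four steps above can be sketched as a small Python class (the k hash functions are simulated by salting sha256; class and method names are my own):

```python
import hashlib

class BloomFilter:
    """Bloom filter sketch: k salted sha256 hashes stand in for h_1..h_k."""

    def __init__(self, n, k):
        self.n, self.k = n, k
        self.B = [0] * n                      # step 1: bit array, all 0

    def _positions(self, key):
        # step 2: k hash functions with range [0, n), derived by salting
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.n

    def add(self, key):
        for p in self._positions(key):        # step 3: set k bits per key
            self.B[p] = 1

    def might_contain(self, key):
        # step 4: output the key only if all k bits are 1
        return all(self.B[p] for p in self._positions(key))
```

Deriving many hash functions from one by salting is a common practical shortcut; it is not required by the algorithm itself.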
Bloom Filter: Collision
When each member s \in S is passed through the hash functions, there’s a chance that one or more of the bits flipped to 1 land in the same positions as 1 bits set by other members.
Bloom Filter: False Positive Rate
Assuming n » k, the approximation is p_{false} \approx (1 - e^{-k|S|/n})^k
Actual False Positive Rate: (1 - (1 - 1/n)^{k|S|})^k
Bloom Filter: Optimal k (Number of Hash Functions)
k_{opt} = (n/|S|) \ln 2 \approx 0.693 n/|S|. At this k, the false positive rate is minimised, approximately 0.6185^{n/|S|}.
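A short numerical check of the optimal-k formula k_{opt} = (n/|S|) ln 2 against the false positive approximation (function names are my own):

```python
import math

def optimal_k(n, set_size):
    """Number of hash functions minimising the false positive rate."""
    return max(1, round((n / set_size) * math.log(2)))

def fp_rate(n, set_size, k):
    """Approximate false positive rate (1 - e^{-k|S|/n})^k."""
    return (1 - math.exp(-k * set_size / n)) ** k

# With 8 bits per key (n = 8000, |S| = 1000), k_opt = round(8 * ln 2) = 6,
# and the minimal false positive rate is about 0.6185^8, roughly 2%.
k = optimal_k(8000, 1000)
```

Evaluating fp_rate at k values below and above k_opt confirms the rate is worse on both sides, i.e. k_opt sits at the minimum.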
Flajolet-Martin (FM) Sketch
This method approximates the number of distinct elements in the last k elements of the stream.
Applications:
1. Security monitoring - if more than X attempts, report.
2. Propagation rate of viruses.
3. Distributed computing: multiple parties can combine their sketches to find the number of total distinct elements, and the number of common elements (inclusion/exclusion). Examples include document overlap for plagiarism.
Bloom Filter: Applications
They’re used to reduce disk lookups for non-existent rows or columns. Utilised by Google BigTable, Apache HBase, Apache Cassandra, and PostgreSQL.
FM Sketch: Pseudocode
- Select a hash function h that maps each of N elements to at least log_2 N bits. N is the maximum number of distinct elements. There are no buckets.
- Define r(h(a)) as the number of 0’s from the right (the tail length). e.g. a -> h(a) = 110 -> r(h(a)) = 1
- For each element x in stream S, compute r(h(x)). Let R = \underset{x \in S}{\max} r(h(x)). Return 2^R as the estimated number of distinct elements in S.
To reduce the error, run the FM sketch m times with independent hash functions, which changes the estimate to (2^{R_1} + … + 2^{R_m})/m (standard deviation = \sigma/\sqrt{m}).
It’s suggested to use the correction factor \varphi = 0.77351, so the final estimate is (2^R)/\varphi.
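The FM sketch, with the averaging and the \varphi correction, can be sketched in Python (salted sha256 stands in for the independent hash functions; names are my own):

```python
import hashlib

def r(x):
    """Tail length: number of trailing zero bits in x (0 for x == 0)."""
    return (x & -x).bit_length() - 1 if x else 0

def fm_estimate(stream, salt="", phi=0.77351):
    """One FM estimate: 2^R / phi, R the max tail length over the stream."""
    R = 0
    for x in stream:
        h = int(hashlib.sha256(f"{salt}:{x}".encode()).hexdigest(), 16)
        R = max(R, r(h))
    return 2 ** R / phi

def fm_mean(stream, m=16):
    """Average m estimates with different salts to reduce the error."""
    items = list(stream)
    return sum(fm_estimate(items, salt=str(i)) for i in range(m)) / m
```

Note that a single estimate is always a power of 2 divided by \varphi; averaging several runs is what lets the estimate land between powers of 2.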
FM Sketch: Error Bound
For c > 3, Pr(1/c <= (2^R)/F <= c) > 1 - 3/c, where F is the true number of distinct elements. However, as c increases (widening the allowed range for 2^R), the bound weakens.
Alon-Matias-Szegedy (AMS) Method
This approach estimates the moments of the last k elements.
The k-th frequency moment of a stream comprising N different types of elements a_1,…,a_N, each appearing m_1,…,m_N times, is defined as f_k = \sum_{i=1}^N m_i^k.
f_0 is the number of distinct elements. f_1 is the total frequency (length of the stream). f_2 shows how uneven the distribution is, as it is the sum of the squares of the frequencies of the distinct elements.
AMS Method: Pseudocode for f_2
- Pick a uniformly random time t (t < n, with n the stream length) to start.
- X.el = i, with i being the element at time t.
- X.val = c, with c being the frequency of X.el at time t and after.
- S_i = n * (2 * X_i.val - 1), with S_i being the i-th f_2 estimate and X_i.val the count c of the i-th variable.
- S = avg(S_1,…,S_k)
We can keep track of multiple X’s, up to k total. k <= n
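The f_2 pseudocode above can be sketched in Python for a fixed-length stream (a batch simulation for clarity; a true streaming version would increment each X.val as elements arrive):

```python
import random

def ams_f2(stream, k=500, seed=0):
    """AMS estimate of f_2: average n*(2c - 1) over k variables
    started at uniformly random positions."""
    items = list(stream)
    n = len(items)
    rng = random.Random(seed)
    total = 0
    for _ in range(k):
        t = rng.randrange(n)                 # X.el = items[t]
        c = items[t:].count(items[t])        # X.val: occurrences at t and after
        total += n * (2 * c - 1)             # S_i
    return total / k
```

The estimator is unbiased: averaging n*(2c - 1) over all n start positions gives exactly f_2, so the sample mean over k random positions converges to it.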
AMS Method: k-th Moment Estimate
For k >= 3, the estimate is n * (c^k - (c-1)^k)
AMS Method Dealing with Non-Fixed n
Maintain as many X’s as the storage allows and replace them as the stream grows, utilising reservoir sampling. Otherwise, X’s fixed near the beginning would favour early positions, while X’s started near the end wouldn’t see many occurrences of their elements.