Streaming Flashcards
Describe the streaming model and its metrics.
What we have:
• (Sequential) machine with limited amount of working memory
• Input provided as a continuous (one-way) stream:
Σ = x1, x2, . . . , xn, accessed/processed sequentially
Metrics:
• Metric 1: Size s of the working memory (aim: s ≪ n)
• Metric 2: Number p of sequential passes over Σ (aim: p = 1; in most settings the input can be read only once)
• Metric 3: Processing time per item T (aim: T = O(1))
Give the definition of the majority problem, describe the Boyer-Moore algorithm and provide its related theorem.
Problem:
Given a stream Σ = x1, x2, . . . , xn, return the element x (if any) that occurs > n/2 times in Σ.
Boyer-Moore algorithm:
Initialization: cand ← null, count ← 0.
For each xi in Σ do:
    if count == 0 then:
        cand ← xi
        count ← 1
    elif cand == xi:
        count ← count + 1
    else:
        count ← count − 1
return cand
Theorem: Given a stream Σ which contains a majority element m, the Boyer-Moore algorithm returns m using:
• working memory of size O(1);
• 1 pass;
• O(1) time per element.
- **Study the high-level proof on the slides!**
- Define the sets S_cand & S_opp
- Use contradiction
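A minimal Python sketch of the pass above (the function name is illustrative):

```python
def boyer_moore(stream):
    """One pass, O(1) working memory, O(1) time per item."""
    cand, count = None, 0
    for x in stream:
        if count == 0:
            cand, count = x, 1
        elif cand == x:
            count += 1
        else:
            count -= 1
    return cand  # valid only if Σ actually contains a majority element

# boyer_moore([1, 2, 1, 3, 1]) -> 1 (1 occurs 3 > 5/2 times)
```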
Give the definition of m-sample (referring to the uniform sampling).
What is a limitation of uniform sampling?
Given a set X of n elements and an integer 1 ≤ m < n, an m-sample of X is a random subset S ⊂ X of size m such that, for each x ∈ X, we have Pr(x ∈ S) = m/n (uniform sampling).
Bad scenario for using uniform sampling: unbounded streams, where n is not known, so the inclusion probability m/n cannot even be computed.
Give the general definition of sampling problem.
Given a (possibly unbounded) stream Σ = x1, x2, . . . and an integer m < |Σ|, maintain, for every t ≥ m, an m-sample S of the prefix Σt = x1, x2, . . . , xt.
Describe the Reservoir sampling algorithm and provide its related theorem.
Algorithm used to sample an unbounded stream.
Initialization: S ← ∅
for each xt in Σ do:
    if t ≤ m then:
        add xt to S
    else, with probability m/t, do:
        evict a uniformly random element from S
        add xt to S
Theorem:
Let Σ = x1, x2, …. For any time t ≥ m, the set S maintained by the reservoir sampling algorithm is an m-sample of Σt = x1, x2, . . . , xt.
- **Study the proof on the slides!**
- Proof by induction on t
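A minimal Python sketch of reservoir sampling, using the standard random module:

```python
import random

def reservoir_sample(stream, m):
    """Maintains an m-sample of the prefix x1..xt for every t >= m."""
    S = []
    for t, x in enumerate(stream, start=1):
        if t <= m:
            S.append(x)                 # first m items enter deterministically
        elif random.random() < m / t:   # keep xt with probability m/t
            S[random.randrange(m)] = x  # evict a uniformly random element
    return S
```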
Give the definition of the frequent items problem and the ϵ-approximated version.
Problem:
Given a stream Σ = x1, x2, . . . , xn of n items and a frequency threshold φ ∈ (0, 1), determine all distinct items that occur at least φ·n times in Σ (we call them frequent items).
ϵ-Approximate Frequent Items (ϵ-AFI) problem:
Given the stream Σ = x1, x2, . . . , xn of n items, a frequency threshold φ ∈ (0, 1), and an accuracy parameter ϵ ∈ (0, φ), return a set of distinct items that:
1) includes all items occurring at least φ · n times in Σ.
2) contains no items occurring less than (φ − ϵ) · n times in Σ.
We can use the Reservoir sampling for solving the frequent items problem: how big m should be in order to find every frequent item of the stream in the m-sample?
Describe a scenario where this type of sampling performs poorly for this problem.
m ≥ 1/φ, where φ is the frequency threshold parameter.
- **Study the proof on the slides!**
- α is a frequent item appearing k times in Σ ⇒ k ≥ φn
- Compute E[B], where B = number of occurrences of α in the m-sample
Bad-case scenario:
- n items: n/2 copies of a single item a, while the other n/2 are all distinct
- Compute E[C], where C = number of non-frequent items in the m-sample
(Both expectations are worked out below.)
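Worked computation for the two hints (it follows directly from uniform sampling, where each stream position lands in the sample with probability m/n):
- If α occurs k ≥ φn times, then E[B] = k · (m/n) ≥ φn · (m/n) = φm; requiring E[B] ≥ 1 gives m ≥ 1/φ.
- In the bad scenario, each of the n/2 distinct non-frequent items is sampled with probability m/n, so E[C] = (n/2) · (m/n) = m/2: in expectation half the sample is wasted on items occurring once, no matter how large m is.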
What is the best known algorithm for solving the ϵ-AFI (with n known)? Describe it.
Sticky sampling.
MAIN INGREDIENTS:
• Confidence parameter δ ∈ (0, 1).
• Hash Table S, whose entries are pairs (x, fe(x)) where
- x (the key) is an item
- fe(x) (the value) is a lower bound on the number of occurrences of x seen so far.
• Sampling rate chosen to ensure (with probability ≥ 1 − δ) the frequent items are in the sample.
Initialization:
• S ← empty Hash Table;
• r = ln(1/(δφ))/ϵ; // sampling rate = r/n.
For each xt in Σ do:
    if (xt, fe(xt)) ∈ S then fe(xt) ← fe(xt) + 1;
    else add (xt, 1) to S with probability r/n; /* (start tracking xt) */
At the end: return all items x in S with fe(x) ≥ (φ − ϵ)n
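A Python sketch of sticky sampling with n known in advance (names are illustrative; the hash table is a plain dict):

```python
import math
import random

def sticky_sampling(stream, n, phi, eps, delta):
    """epsilon-AFI with known n; S maps item -> fe(item), a lower bound on f."""
    r = math.log(1 / (delta * phi)) / eps    # sampling rate = r/n
    S = {}
    for x in stream:
        if x in S:
            S[x] += 1                        # tracked items are always counted
        elif random.random() < r / n:
            S[x] = 1                         # start tracking x
    return {x for x, fe in S.items() if fe >= (phi - eps) * n}
```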
Prove that the sticky sampling solves the ϵ-AFI problem.
Study the proof on the slides!
• Returns an item only if it occurs at least (φ − ϵ)n times (i.e., is frequent or nearly so):
- OK (by construction)
• Probability that a frequent item x is not returned:
- P[first ϵn occurrences of x not sampled] ≤ e^(−ϵr)
- P[x not returned] ≤ e^(−ϵr) = δφ
- P[Sticky sampling fails] ≤ δ (union bound over the ≤ 1/φ frequent items)
What is the best known algorithm for solving the ϵ-AFI (with n NOT known)? Provide the pseudocode.
Initialization:
• S ← empty Hash Table;
• r = ln(1/(δφ))/ϵ;
For each xt ∈ Σ do:
    Let Bi be the batch containing xt;
    if xt is the first element of Bi then
        // Recalibration of S because a new batch has started…
        For each (x, fe(x)) ∈ S do:
            Let τx = number of tails before the first head in tosses of an unbiased coin;
            if fe(x) − τx > 0 then fe(x) ← fe(x) − τx
            else delete (x, fe(x)) from S
    if (xt, fe(xt)) ∈ S then fe(xt) ← fe(xt) + 1;
    else add (xt, 1) to S with probability 1/2^i
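A Python sketch of the batch version. Caveat: the card does not define the batch lengths; a common choice, assumed here, is to double the batch length each time (|Bi| = r · 2^i, with i starting at 0), so items in Bi are sampled with probability 1/2^i:

```python
import math
import random

def sticky_sampling_unbounded(stream, phi, eps, delta):
    """epsilon-AFI without knowing n. ASSUMPTION: |Bi| = r * 2**i, i = 0, 1, ..."""
    r = math.log(1 / (delta * phi)) / eps
    S, i, next_boundary = {}, 0, r       # next_boundary = last index of Bi
    for t, x in enumerate(stream, start=1):
        if t > next_boundary:            # xt opens a new batch: recalibrate S
            i += 1
            next_boundary += r * 2**i
            for key in list(S):
                tau = 0                  # tails before the first head
                while random.random() < 0.5:
                    tau += 1
                if S[key] - tau > 0:
                    S[key] -= tau
                else:
                    del S[key]
        if x in S:
            S[x] += 1
        elif random.random() < 1 / 2**i:
            S[x] = 1
    return S   # threshold fe(x) >= (phi - eps) * t at query time, as before
```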
What is a sketch?
Space-efficient data structure that can be used to provide (typically probabilistic) estimates of (statistical) characteristics of a data stream.
Give the definition of Frequency Moments.
For every k ≥ 0, the k-th frequency moment Fk of Σ is:
Fk = Σ_{u∈U} (fu)^k
where fu = |{j : xj = u, 1 ≤ j ≤ n}|, i.e., the number of occurrences of u in Σ (absolute frequency)
(We assume 0^0 = 0)
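For reference, the exact moments are trivial to compute with unbounded memory; a tiny Python check (the sketches below approximate F0 and the fu's in small space):

```python
from collections import Counter

def frequency_moment(stream, k):
    """Exact Fk = sum over distinct u of fu^k (0^0 = 0 holds automatically,
    since items with fu = 0 never appear in the Counter)."""
    return sum(f ** k for f in Counter(stream).values())

# frequency_moment("abraca", 0) == 4   (number of distinct items)
# frequency_moment("abraca", 1) == 6   (stream length n)
# frequency_moment("abraca", 2) == 12  (second moment: 3^2 + 1 + 1 + 1)
```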
Describe the Probabilistic Counting algorithm ([Flajolet, Martin 1983]) for estimating F0 providing the pseudocode and the related theorem.
Main ingredients:
• Array C of ⌈log(|U| + 1)⌉ bits, all initialized to zero.
• Hash Function h : U → [0, .., |U| − 1]. We assume:
  - For every u ∈ U, h(u) has uniform distribution in [0, .., |U| − 1]
  - All h(u)’s are pairwise independent.
• Notation: for i ∈ [0, .., |U| − 1], define
  - t(i) = number of trailing zeros in the binary representation of i, e.g., 12 = 1100 ⇒ t(12) = 2
Algorithm:
For each xj ∈ Σ do C[t(h(xj))] ← 1
After processing xn, estimate F0 as F˜0 = 2^R, where R is the largest index of C with C[R] = 1.
Theorem:
For a stream Σ of n elements, the probabilistic counting algorithm returns a value F˜0 such that, for any c > 2:
Pr(F˜0 < F0/c) ≤ 1/c and Pr(F˜0 > cF0) ≤ 1/c,
hence
Pr(F0/c ≤ F˜0 ≤ cF0) ≥ 1 − 2/c.
The algorithm requires a working memory of O(log |U|) bits, 1 pass, and O(log |U|) time per element.
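A Python sketch; the built-in hash is only a stand-in for the uniform, pairwise-independent h the theorem assumes:

```python
def probabilistic_counting(stream, bits=32):
    """Flajolet-Martin estimate F~0 = 2^R of the number of distinct items."""
    C = [0] * (bits + 1)
    for x in stream:
        h = hash(x) & ((1 << bits) - 1)     # stand-in hash into [0, 2^bits - 1]
        t = bits if h == 0 else (h & -h).bit_length() - 1  # trailing zeros t(h)
        C[t] = 1
    R = max((i for i, bit in enumerate(C) if bit), default=0)
    return 2 ** R
```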
Describe the Count-min sketch algorithm providing the pseudocode and the related theorem.
Main ingredients:
• d × w matrix C of counters (O(log n) bits each)
• d hash functions h1, h2, . . . , hd, with hj : U → {1, 2, . . . , w} for every j.
Algorithm:
Initialization: C[j, k] = 0, for every 1 ≤ j ≤ d and 1 ≤ k ≤ w.
For each xt in Σ do:
    For 1 ≤ j ≤ d do C[j, hj(xt)] ← C[j, hj(xt)] + 1;
At the end of the stream: for any u ∈ U, its frequency fu can be estimated as:
f˜u = min_{1 ≤ j ≤ d} C[j, hj(u)].
Observe that f˜u ≥ fu, for every u, hence f˜u is a biased estimator!
Theorem:
Consider a d × w count-min sketch for a stream Σ of length n, where d = log2(1/δ) and w = 2/ϵ, for some δ, ϵ ∈ (0, 1). The sketch ensures that for any given u ∈ U occurring in Σ
f˜u − fu ≤ ϵ · n,
with probability ≥ 1 − δ.
- **Study the proof on the slides!**
- E[C[j, hj(A)]] = fA + E[noise] ≤ fA + n/w
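A Python sketch of the update/query logic; seeded built-in hashing stands in for the pairwise-independent hj’s required by the theorem:

```python
import math
import random

class CountMinSketch:
    """d x w counters; f~u = min over rows j of C[j][hj(u)] >= fu."""
    def __init__(self, eps, delta):
        self.d = max(1, math.ceil(math.log2(1 / delta)))   # rows
        self.w = max(1, math.ceil(2 / eps))                # columns
        self.C = [[0] * self.w for _ in range(self.d)]
        self.seeds = [random.random() for _ in range(self.d)]

    def _h(self, j, u):                     # stand-in for hj : U -> {0..w-1}
        return hash((self.seeds[j], u)) % self.w

    def add(self, u):                       # process one stream item
        for j in range(self.d):
            self.C[j][self._h(j, u)] += 1

    def estimate(self, u):                  # f~u (never underestimates fu)
        return min(self.C[j][self._h(j, u)] for j in range(self.d))
```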
Describe the Count sketch algorithm providing the pseudocode and the related theorem.
Main ingredients:
• d × w matrix C of counters (O(log n) bits each)
• d hash functions h1, h2, . . . , hd, with hj : U → {1, 2, . . . , w} for every j
• d hash functions g1, g2, . . . , gd, with gj : U → {−1, +1} for every j
Algorithm:
Initialization: C[j, k] = 0, for every 1 ≤ j ≤ d and 1 ≤ k ≤ w.
For each xt in Σ do:
    For 1 ≤ j ≤ d do C[j, hj(xt)] ← C[j, hj(xt)] + gj(xt);
At the end of the stream: for any u ∈ U and 1 ≤ j ≤ d, let f˜u,j = gj(u) · C[j, hj(u)].
The frequency of u can be estimated as:
f˜u = median_{1 ≤ j ≤ d} f˜u,j
Theorem:
Consider a d × w count sketch for a stream Σ of length n, where d = log2(1/δ) and w = O(1/ϵ^2), for some δ, ϵ ∈ (0, 1). The sketch ensures that for any given u ∈ U occurring in Σ:
• E[f˜u,j] = fu, for any j ∈ [1, d], i.e., f˜u,j is an unbiased estimator of fu;
• With probability ≥ 1 − δ,
|f˜u − fu| ≤ ϵ · √F2,
where F2 = Σ_{u∈U} (fu)^2 (the true second moment)
- **Study the proof on the slides!**
- Xw = 1 if h(w) = h(A) (collision indicator, for w ≠ A)
- E[ C[j, hj(A)] · gj(A) ] = ….
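A Python sketch along the same lines as the count-min code above, with the extra gj signs and the median-of-rows estimate (hash functions again simulated with seeds, an assumption):

```python
import math
import random
import statistics

class CountSketch:
    """d x w counters; each per-row estimate gj(u)*C[j][hj(u)] is unbiased,
    and the median over rows concentrates within eps*sqrt(F2)."""
    def __init__(self, eps, delta):
        self.d = max(1, math.ceil(math.log2(1 / delta)))   # rows
        self.w = max(1, math.ceil(1 / eps ** 2))           # columns, O(1/eps^2)
        self.C = [[0] * self.w for _ in range(self.d)]
        self.seeds = [(random.random(), random.random()) for _ in range(self.d)]

    def _h(self, j, u):                     # stand-in for hj : U -> {0..w-1}
        return hash((self.seeds[j][0], u)) % self.w

    def _g(self, j, u):                     # stand-in for gj : U -> {-1, +1}
        return 1 if hash((self.seeds[j][1], u)) & 1 else -1

    def add(self, u):                       # process one stream item
        for j in range(self.d):
            self.C[j][self._h(j, u)] += self._g(j, u)

    def estimate(self, u):                  # f~u = median_j gj(u)*C[j][hj(u)]
        return statistics.median(self._g(j, u) * self.C[j][self._h(j, u)]
                                 for j in range(self.d))
```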