Streaming Flashcards

1
Q

Describe the streaming model and its metrics.

A

What we have:
• (Sequential) machine with a limited amount of working memory
• Input provided as a continuous (one-way) stream:
Σ = x1, x2, . . . , xn, accessed/processed sequentially

Metrics:
• Metric 1: Size s of the working memory (aim: s ≪ n)
• Metric 2: Number p of sequential passes over Σ (aim: p = 1; in most cases the input can be read only once)
• Metric 3: Processing time per item T (aim: T = O(1))
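To make the metrics concrete, a minimal Python sketch (not from the slides): a running mean uses O(1) working memory, a single pass, and O(1) time per item.

```python
def stream_mean(stream):
    """One-pass mean: s = O(1) working memory, p = 1 pass, T = O(1) per item."""
    total, count = 0.0, 0
    for x in stream:        # each item is seen exactly once, in order
        total += x
        count += 1
    return total / count if count else None

print(stream_mean(iter([3, 1, 4, 1, 5])))  # 2.8
```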

2
Q

Give the definition of the majority problem, describe the Boyer-Moore algorithm and provide its related theorem.

A

Problem:
Given a stream Σ = x1, x2, . . . , xn, return the element x (if any) that occurs > n/2 times in Σ.

Boyer-Moore algorithm:
Initialization: cand ← null, count ← 0.
For each xi in Σ do:
    if count == 0 then:
        cand ← xi
        count ← 1
    elif cand == xi:
        count ← count + 1
    else:
        count ← count − 1
return cand
Theorem:
Given a stream Σ which contains a majority element m, the Boyer-Moore algorithm returns m using:
• working memory of size O(1);
• 1 pass;
• O(1) time per element.
  • **Study the high-level proof on the slides!**
  • Define the sets Scand & Sopp
  • Use contradiction
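A direct Python transcription of the pseudocode above (a sketch; note that cand is returned only after the whole stream has been consumed):

```python
def boyer_moore(stream):
    """One pass, O(1) memory; returns the majority element if one exists
    (if no element occurs > n/2 times, the output can be arbitrary)."""
    cand, count = None, 0
    for x in stream:
        if count == 0:
            cand, count = x, 1
        elif cand == x:
            count += 1
        else:
            count -= 1
    return cand

print(boyer_moore(["a", "b", "a", "c", "a"]))  # "a" (3 > 5/2 occurrences)
```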
3
Q

Give the definition of m-sample (referring to the uniform sampling).
What is a problem of the uniform sampling?

A

Given a set X of n elements and an integer 1 ≤ m < n, an m-sample of X is a random subset S ⊂ X of size m such that, for each x ∈ X, we have Pr(x ∈ S) = m/n (uniform sampling).

Uniform sampling is problematic for unbounded streams: n is not known in advance, so the probability m/n cannot be computed.

4
Q

Give the general definition of sampling problem.

A

Given a (possibly unbounded) stream Σ = x1, x2, . . . and an integer m < |Σ|, maintain, for every t ≥ m, an m-sample S of the prefix Σt = x1, x2, . . . , xt.

5
Q

Describe the Reservoir sampling algorithm and provide its related theorem.

A

Algorithm used to sample an unbounded stream.

Initialization: S ← ∅

for each xt in Σ do:
    if t ≤ m then:
        add xt to S
    else, with probability m/t, do:
        evict a random element x from S
        add xt to S

Theorem:
Let Σ = x1, x2, …. For any time t ≥ m, the set S maintained by the reservoir sampling algorithm is an m-sample of Σt = x1, x2, . . . , xt.

  • **Study the proof on the slides!**
  • Proof by induction on t
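A minimal Python sketch of the algorithm above (not the slides' code):

```python
import random

def reservoir_sample(stream, m):
    """After t items, every item of the prefix is in S with probability m/t."""
    S = []
    for t, x in enumerate(stream, start=1):
        if t <= m:
            S.append(x)
        elif random.random() < m / t:
            S[random.randrange(m)] = x  # evict a uniformly random element
    return S

print(reservoir_sample(range(1000), 5))  # e.g. [481, 37, 912, 5, 660]
```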
6
Q

Give the definition of the frequent items problem and the ϵ-approximated version.

A

Problem:
Given a stream Σ = x1, x2, . . . , xn of n items and a frequency threshold φ ∈ (0, 1), determine all distinct items that occur at least φ·n times in Σ (we call them frequent items).

ϵ-Approximate Frequent Items (ϵ-AFI) problem:
Given the stream Σ = x1, x2, . . . , xn of n items, a frequency threshold φ ∈ (0, 1), and an accuracy parameter ϵ ∈ (0, φ), return a set of distinct items that:
1) includes all items occurring at least φ · n times in Σ.
2) contains no items occurring less than (φ − ϵ) · n times in Σ.

7
Q

We can use Reservoir sampling to solve the frequent items problem: how large should m be so that, in expectation, every frequent item of the stream appears in the m-sample?

Describe a bad-case scenario where this type of sampling performs poorly for this problem.

A

m ≥ 1/φ, where φ is the frequency threshold parameter.

  • **Study the proof on the slides!**
  • α is a frequent item appearing k times in Σ ⇒ k ≥ φn
  • Compute E[B], where B = number of occurrences of α in the m-sample: E[B] = k·m/n ≥ φm, which is ≥ 1 when m ≥ 1/φ

Bad-case scenario:

  • n items: n/2 occurrences of a single item a, the other n/2 all distinct
  • Compute E[C], where C = number of non-frequent items in the m-sample: each of the n/2 distinct items is sampled with probability m/n, so E[C] = m/2, i.e., half the sample is wasted on non-frequent items
8
Q

What is the best known algorithm for solving the ϵ-AFI (with n known)? Describe it.

A

Sticky sampling.

MAIN INGREDIENTS:
• Confidence parameter δ ∈ (0, 1).
• Hash Table S, whose entries are pairs (x, fe(x)) where
- x (the key) is an item
- fe(x) (the value) is a lower bound on the number of occurrences of x seen so far.
• Sampling rate chosen to ensure (with probability ≥ 1 − δ) the frequent items are in the sample.

Initialization:
• S ← empty Hash Table;
• r = ln(1/(δφ))/ϵ; // sampling rate = r/n.
For each xt in Σ do:
    if (xt, fe(xt)) ∈ S then fe(xt) ← fe(xt) + 1;
    else add (xt, 1) to S with probability r/n; /* start tracking xt */
At the end: return all items x in S with fe(x) ≥ (φ − ϵ) · n
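A Python sketch of sticky sampling with n known (the hash table is a plain dict here):

```python
import math
import random

def sticky_sampling(stream, n, phi, eps, delta):
    """Tracked counts are exact from the moment sampling starts,
    so fe(x) is a lower bound on the true frequency of x."""
    r = math.log(1.0 / (delta * phi)) / eps
    S = {}                             # hash table: item -> fe(item)
    for x in stream:
        if x in S:
            S[x] += 1
        elif random.random() < r / n:  # start tracking x
            S[x] = 1
    return [x for x, fe in S.items() if fe >= (phi - eps) * n]

stream = ["a"] * 600 + ["b"] * 300 + ["c"] * 100
print(sticky_sampling(stream, n=1000, phi=0.5, eps=0.1, delta=0.1))  # ['a'] w.h.p.
```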

9
Q

Prove that the sticky sampling solves the ϵ-AFI problem.

A

Study the proof on the slides!
• Returned items occur at least (φ − ϵ)n times:
- OK by construction
• Probability that a frequent item x is NOT returned:
- P[first ϵn occurrences of x all not sampled] ≤ (1 − r/n)^(ϵn) ≤ e^(−ϵr) = δφ
- hence P[x not returned] ≤ δφ
- P[sticky sampling fails] ≤ δ (union bound over the ≤ 1/φ frequent items)

10
Q

What is the best known algorithm for solving the ϵ-AFI (with n NOT known)? Provide the pseudocode.

A

Initialization:
• S ← empty Hash Table;
• r = ln(1/(δφ))/ϵ;
For each xt ∈ Σ do:
    Let Bi be the batch containing xt;
    if xt is the first element of Bi then:
        // Recalibration of S because a new batch has started…
        For each (x, fe(x)) ∈ S do:
            Let τx = number of tails before the first head in tosses of an unbiased coin;
            if fe(x) − τx > 0 then fe(x) ← fe(x) − τx;
            else delete (x, fe(x)) from S;
    if (xt, fe(xt)) ∈ S then fe(xt) ← fe(xt) + 1;
    else add (xt, 1) to S with probability 1/2^i
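A hedged Python sketch of the above. The recalibration loop follows the pseudocode; the batch schedule (first 2r items at probability 1, then batches doubling in length while the sampling probability halves) is an ASSUMPTION borrowed from the standard formulation of sticky sampling — check the exact batch boundaries on the slides.

```python
import math
import random

def tails_before_head():
    """τ: number of tails before the first head of an unbiased coin."""
    tails = 0
    while random.random() < 0.5:
        tails += 1
    return tails

def sticky_sampling_unbounded(stream, phi, eps, delta):
    r = math.log(1.0 / (delta * phi)) / eps
    S = {}                    # hash table: item -> fe(item)
    i = 0                     # batch index; sampling probability is 1/2**i
    batch_end = 2 * r         # ASSUMED length of batch B0
    for t, x in enumerate(stream, start=1):
        while t > batch_end:  # first element of a new batch: recalibrate S
            i += 1
            batch_end += r * 2 ** i  # ASSUMED doubling batch lengths
            for key in list(S):
                tau = tails_before_head()
                if S[key] - tau > 0:
                    S[key] -= tau
                else:
                    del S[key]
        if x in S:
            S[x] += 1
        elif random.random() < 1.0 / 2 ** i:
            S[x] = 1
    return S                  # query: items with fe(x) >= (phi - eps) * t
```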

11
Q

What is a sketch?

A

A space-efficient data structure that provides (typically probabilistic) estimates of statistical characteristics of a data stream.

12
Q

Give the definition of Frequency Moments.

A

For every k ≥ 0, the k-th frequency moment Fk of Σ is:
Fk = Σ u∈U (fu)^k

where fu = |{j : xj = u, 1 ≤ j ≤ n}|, i.e., the number of occurrences of u in Σ (absolute frequency).

(We assume 0^0 = 0)
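A tiny exact (non-streaming) computation of Fk, just to fix the definition (a sketch, not the streaming setting):

```python
from collections import Counter

def frequency_moment(stream, k):
    """F_k = sum over distinct u of f_u**k (f_u = absolute frequency of u)."""
    return sum(f ** k for f in Counter(stream).values())

sigma = ["a", "a", "b"]
print(frequency_moment(sigma, 0))  # F0 = 2 (number of distinct items)
print(frequency_moment(sigma, 1))  # F1 = 3 (stream length n)
print(frequency_moment(sigma, 2))  # F2 = 5 (= 2**2 + 1**2)
```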

13
Q

Describe the Probabilistic Counting algorithm (Flajolet & Martin, 1983) for estimating F0, providing the pseudocode and the related theorem.

A

Main ingredients:
• Array C of ⌈log(|U| + 1)⌉ bits, all initialized to zero.
• Hash Function h : U → [0, .., |U| − 1]. We assume:
    - For every u ∈ U, h(u) has uniform distribution in [0, .., |U| − 1]
    - All h(u)’s are pairwise independent.
• Notation: for i ∈ [0, .., |U| − 1], define
    - t(i) = number of trailing zeroes in the binary representation of i; e.g., 12 = 1100 ⇒ t(12) = 2

Algorithm:
For each xj ∈ Σ do C[t(h(xj))] ← 1
After processing xn, estimate F0 as F˜0 = 2^R, where R is the largest index of C with C[R] = 1.

Theorem:
For a stream Σ of n elements, the probabilistic counting algorithm returns a value F˜0 such that, for any c > 2:
Pr(F˜0 < F0/c) ≤ 1/c and Pr(F˜0 > cF0) ≤ 1/c,
hence
Pr(F0/c ≤ F˜0 ≤ cF0) ≥ 1 − 2/c.
The algorithm requires a working memory of O(log |U|) bits, 1 pass, and O(log |U|) time per element.
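A Python sketch of probabilistic counting. The (a·x + b) mod p hash is a standard pairwise-independent choice, used here as an assumption — not necessarily the family on the slides.

```python
import random

def trailing_zeros(i):
    """t(i): number of trailing zeroes in the binary representation of i."""
    if i == 0:
        return 0  # convention for 0; the slides may use a different one
    t = 0
    while i % 2 == 0:
        i //= 2
        t += 1
    return t

def probabilistic_counting(stream, universe_size):
    p = 2 ** 61 - 1  # a large prime for the hash family
    a, b = random.randrange(1, p), random.randrange(p)
    h = lambda x: ((a * hash(x) + b) % p) % universe_size
    C = [0] * (universe_size.bit_length() + 1)  # ~ log|U| bits
    for x in stream:
        C[trailing_zeros(h(x))] = 1             # mark position t(h(x))
    R = max((idx for idx, bit in enumerate(C) if bit), default=0)
    return 2 ** R                               # estimate of F0

stream = [random.randrange(5000) for _ in range(50_000)]
print(probabilistic_counting(stream, universe_size=2 ** 20))  # ~5000 within a factor c
```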

14
Q

Describe the Count-min sketch algorithm providing the pseudocode and the related theorem.

A

Main ingredients:
• A d × w matrix C of counters (O(log n) bits each)
• d hash functions h1, h2, . . . , hd, with hj : U → {1, 2, . . . , w} for every j.

Algorithm:
Initialization: C[j, k] = 0, for every 1 ≤ j ≤ d and 1 ≤ k ≤ w.

For each xt in Σ do:
    For 1 ≤ j ≤ d do C[j, hj(xt)] ← C[j, hj(xt)] + 1;

At the end of the stream: for any u ∈ U, its frequency fu can be estimated as:
f˜u = min over 1 ≤ j ≤ d of C[j, hj(u)].

Observe that f˜u ≥ fu, for every u, hence f˜u is a biased estimator!

Theorem:
Consider a d × w count-min sketch for a stream Σ of length n, where d = log2(1/δ) and w = 2/ϵ, for some δ, ϵ ∈ (0, 1). The sketch ensures that for any given u ∈ U occurring in Σ
f˜u − fu ≤ ϵ · n,
with probability ≥ 1 − δ.

  • **Study the proof on the slides!**
  • E[ C[j, hj(A)] ] = E[ f(A) + noise ] = f(A) + E[noise] ≤ f(A) + n/w
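A Python sketch of the count-min sketch (same assumed hash family as in the earlier sketches):

```python
import math
import random

class CountMinSketch:
    """d x w counters; estimate = min over rows, so f~u >= fu always."""

    P = 2 ** 61 - 1  # prime for the (a*x + b) mod p hash family (an assumption)

    def __init__(self, eps, delta):
        self.d = max(1, math.ceil(math.log2(1.0 / delta)))
        self.w = max(1, math.ceil(2.0 / eps))
        self.C = [[0] * self.w for _ in range(self.d)]
        self.ab = [(random.randrange(1, self.P), random.randrange(self.P))
                   for _ in range(self.d)]

    def _h(self, j, x):
        a, b = self.ab[j]
        return ((a * hash(x) + b) % self.P) % self.w

    def add(self, x):
        for j in range(self.d):
            self.C[j][self._h(j, x)] += 1

    def estimate(self, x):
        return min(self.C[j][self._h(j, x)] for j in range(self.d))

cms = CountMinSketch(eps=0.01, delta=0.05)
for x in ["a"] * 500 + list(range(400)):
    cms.add(x)
print(cms.estimate("a"))  # >= 500, and <= 500 + eps*n with prob >= 1 - delta
```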
15
Q

Describe the Count sketch algorithm providing the pseudocode and the related theorem.

A

Main ingredients:
• A d × w matrix C of counters (O(log n) bits each)
• d hash functions h1, h2, . . . , hd, with hj : U → {1, 2, . . . , w} for every j
• d hash functions g1, g2, . . . , gd, with gj : U → {−1, +1} for every j.

Algorithm:
Initialization: C[j, k] = 0, for every 1 ≤ j ≤ d and 1 ≤ k ≤ w.

For each xt in Σ do:
    For 1 ≤ j ≤ d do C[j, hj(xt)] ← C[j, hj(xt)] + gj(xt);

At the end of the stream: for any u ∈ U and 1 ≤ j ≤ d, let f˜u,j = gj(u) · C[j, hj(u)].

The frequency of u can be estimated as:
f˜u = median of the f˜u,j’s

Theorem:
Consider a d × w count sketch for a stream Σ of length n, where d = log2(1/δ) and w = O(1/ϵ^2), for some δ, ϵ ∈ (0, 1). The sketch ensures that for any given u ∈ U occurring in Σ:
• E[f˜u,j] = fu, for any j ∈ [1, d], i.e., f˜u,j is an unbiased estimator of fu;
• With probability ≥ 1 − δ,
|f˜u − fu| ≤ ϵ · sqrt(F2),
where F2 = Σ u∈U (fu)^2 (true second moment)

  • **Study the proof on the slides!**
  • Define the indicator Xw = 1 iff hj(w) = hj(A)
  • E[ C[j, hj(A)] · gj(A) ] = ….
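A Python sketch of the count sketch (the ±1 hash gj is simulated from the same assumed (a·x + b) mod p family):

```python
import math
import random
import statistics

class CountSketch:
    """d x w counters; per-row estimates gj(u) * C[j, hj(u)] are unbiased,
    and the median across rows concentrates the estimate."""

    P = 2 ** 61 - 1  # prime for the hash family (an assumption)

    def __init__(self, eps, delta):
        self.d = max(1, math.ceil(math.log2(1.0 / delta)))
        self.w = max(1, math.ceil(1.0 / eps ** 2))
        self.C = [[0] * self.w for _ in range(self.d)]
        self.h_ab = [(random.randrange(1, self.P), random.randrange(self.P))
                     for _ in range(self.d)]
        self.g_ab = [(random.randrange(1, self.P), random.randrange(self.P))
                     for _ in range(self.d)]

    def _h(self, j, x):
        a, b = self.h_ab[j]
        return ((a * hash(x) + b) % self.P) % self.w

    def _g(self, j, x):
        a, b = self.g_ab[j]
        return 1 if ((a * hash(x) + b) % self.P) % 2 == 0 else -1

    def add(self, x):
        for j in range(self.d):
            self.C[j][self._h(j, x)] += self._g(j, x)

    def estimate(self, x):
        return statistics.median(self._g(j, x) * self.C[j][self._h(j, x)]
                                 for j in range(self.d))

cs = CountSketch(eps=0.1, delta=0.05)
for x in ["a"] * 500 + list(range(400)):
    cs.add(x)
print(cs.estimate("a"))  # ~ 500: |error| <= eps * sqrt(F2) w.h.p.
```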
16
Q

Algorithm for estimating F2 and related theorem.

A

Algorithm:
Given a d × w count sketch for Σ, define
F˜2,j = Σ k=1 to w (C[j, k])^2.

We can derive the following estimate for the true second moment F2:
F˜2 = median of the F˜2,j’s

Theorem:
Consider a d × w count sketch for a stream Σ of length n, where
d = log2(1/δ) and w = O(1/ϵ^2), for some δ, ϵ ∈ (0, 1). The sketch ensures that:
• E[F˜2,j] = F2, for any j ∈ [1, d], i.e., F˜2,j is an unbiased estimator of F2;
• With probability ≥ 1 − δ,
|F˜2 − F2| ≤ ϵ · F2
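A minimal continuation of the count-sketch code above (a sketch): F˜2 from the counter matrix C.

```python
import statistics

def estimate_f2(C):
    """F~2 = median over rows j of sum_k C[j][k]**2, where C is the counter
    matrix of a count sketch (e.g. CountSketch.C from the previous card)."""
    return statistics.median(sum(c * c for c in row) for row in C)

print(estimate_f2([[2, -1, 0], [1, 1, -2]]))  # median of {5, 6} = 5.5 on toy rows
```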

17
Q

What is filtering used for?

A

For many applications, processing a data stream Σ = x1, x2, . . . essentially amounts to identifying the xi’s that meet a certain criterion.

18
Q

Give the definition of the approximate membership problem.

A

Given a stream Σ = x1, x2, . . . of elements from some universe U, and a set S of m elements from U, store S in a compact data structure that, for any given xi, allows checking whether xi ∈ S with:
• no error when xi ∈ S (no false negatives)
• small error probability when xi ∉ S (small false positive rate)

19
Q

Describe the Bloom filter algorithm for solving the approximate membership problem and the related theorem.

A

Main ingredients:
• Array A of n bits, all initially 0.
• k hash functions: h1, h2, . . . , hk , with hj : U → {0, 1, . . . , n − 1} for every 1 ≤ j ≤ k

Algorithm:
For each e ∈ S do:
    For 1 ≤ j ≤ k do A[hj(e)] ← 1;

Membership test:
For any xi in Σ:
    claim xi ∈ S ⇔ A[h1(xi)] = A[h2(xi)] = · · · = A[hk(xi)] = 1

Theorem:
Suppose that n is sufficiently large.
For any given xi that does not belong to S, the probability that xi is erroneously claimed to be in S is
Pr(A[hj(xi)] = 1 for each 1 ≤ j ≤ k) ≃ (1 − e^(−km/n))^k
This probability is referred to as false positive rate.
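A Python sketch of a Bloom filter (same assumed hash family as in the earlier sketches):

```python
import random

class BloomFilter:
    """n-bit array, k hash functions: no false negatives; false positive
    rate ~ (1 - e^(-km/n))**k after inserting m elements."""

    P = 2 ** 61 - 1  # prime for the (a*x + b) mod p hash family (an assumption)

    def __init__(self, n, k):
        self.n, self.k = n, k
        self.A = [0] * n
        self.ab = [(random.randrange(1, self.P), random.randrange(self.P))
                   for _ in range(k)]

    def _h(self, j, x):
        a, b = self.ab[j]
        return ((a * hash(x) + b) % self.P) % self.n

    def add(self, e):
        for j in range(self.k):
            self.A[self._h(j, e)] = 1

    def might_contain(self, x):
        return all(self.A[self._h(j, x)] == 1 for j in range(self.k))

bf = BloomFilter(n=10_000, k=7)
for e in ["apple", "banana", "cherry"]:
    bf.add(e)
print(bf.might_contain("apple"))   # True: never a false negative
print(bf.might_contain("durian"))  # almost surely False at this load
```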