Streaming Flashcards
Streaming Model
(Sequential) machine with a limited amount of working memory
- Input provided as a continuous (one-way) stream.
2 tasks:
Update: what to do when new elements arrive
Query: how to answer a specific request
The Model
- Input stream Σ = x1, x2, . . . , xn, . . . received sequentially.
- Metric 1: Size s of the working memory (aim: s ≪ |Σ|)
- Metric 2: Number p of sequential passes over Σ (aim: p = 1)
- Metric 3: Processing time per item T (aim: T = O(1))
Boyer-Moore algorithm
The Boyer-Moore algorithm maintains 2 integers cand and count:
* cand contains a candidate majority element
(the true majority element, if one exists);
* count is a counter.
Initialization: cand ← null and count ← 0.
For each xt in Σ do
    if count = 0 then {cand ← xt; count ← 1;}
    else {
        if cand = xt then count ← count + 1
        else count ← count − 1
    }
At the end: return cand
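The pseudocode above translates directly into a one-pass, O(1)-memory routine; a minimal Python sketch (function name is mine):

```python
def boyer_moore_majority(stream):
    """One-pass majority candidate with O(1) working memory.

    If a majority element exists, it is returned; otherwise the
    returned candidate is arbitrary, so a second verification pass
    is needed when a majority is not guaranteed.
    """
    cand, count = None, 0
    for x in stream:
        if count == 0:
            cand, count = x, 1      # adopt x as the new candidate
        elif cand == x:
            count += 1              # x supports the candidate
        else:
            count -= 1              # x cancels one occurrence of cand
    return cand
```

For instance, on the stream `[1, 2, 1, 3, 1, 1]` the majority element 1 survives all cancellations and is returned.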
Sampling
Definition
Given a set X of n elements and an integer 1 ≤ m < n, an
m-sample of X is a random subset S ⊂ X of size m, such that for
each x ∈ X, we have Pr(x ∈ S) = m/n (uniform sampling).
Sampling problem
Given a (possibly unbounded) stream Σ = x1, x2, . . . and an
integer m < |Σ|, maintain, for every t ≥ m, an m-sample S of the
prefix Σt = x1, x2, . . . , xt.
For a fixed t, the solution is easy:
* Before the stream starts, compute an m-sample I of the set
of integers {1, . . . , t}.
* For each entry xi ∈ Σt with i ∈ I, add xi to S.
Reservoir Sampling: algorithm
Initialization: S ← ∅.
For each xt in Σ do
    if t ≤ m then add xt to S;
    else with probability m/t do the following: {
        evict a random element x from S;
        add xt to S
    }
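The steps above can be sketched in Python (function name and the optional seed parameter are mine, for reproducibility):

```python
import random

def reservoir_sample(stream, m, seed=None):
    """Maintain an m-sample of the prefix of the stream seen so far.

    After processing the length-t prefix, every element of that prefix
    is in S with probability m/t (uniform sampling), using O(m) memory
    and a single pass.
    """
    rng = random.Random(seed)
    S = []
    for t, x in enumerate(stream, start=1):
        if t <= m:
            S.append(x)                  # first m elements always enter S
        elif rng.random() < m / t:
            S[rng.randrange(m)] = x      # evict a random element, add xt
    return S
```

Note that the acceptance test and the random eviction together give each prefix element the same m/t inclusion probability, without knowing the stream length in advance.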
Frequent Items with Reservoir sampling
Frequent items can be approximated through reservoir sampling:
1 Extract an m-sample S from Σ with reservoir sampling. Note that
the same element may occur multiple times in the sample!
2 Return the subset S′ ⊆ S of distinct items in the sample.
How well does S′ approximate the set of frequent items?
* If m ≥ 1/ϕ, for any given frequent item a, we expect to find at least
one copy of a in S (hence in S′).
* However, some frequent item may be missed, and S′ may contain
items with very low frequency.
Therefore: this approach does not properly solve the ε-AFI problem.
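The two steps can be sketched as follows (an illustrative sketch, not a guaranteed solution, as the text explains; function name and seed are mine):

```python
import random

def sample_distinct(stream, m, seed=None):
    """Reservoir-sample m elements, then return the distinct ones.

    Approximates the set of frequent items when m >= 1/phi, but only
    in expectation: a frequent item can still be missed, and items
    with very low frequency can appear in the output.
    """
    rng = random.Random(seed)
    S = []
    for t, x in enumerate(stream, start=1):
        if t <= m:
            S.append(x)
        elif rng.random() < m / t:
            S[rng.randrange(m)] = x
    return set(S)    # the set S' of distinct sampled items
```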
Sticky Sampling
Sticky Sampling provides a probabilistic solution to the ε-AFI problem.
MAIN INGREDIENTS:
* Confidence parameter δ ∈ (0, 1).
* Hash table S, whose entries are pairs (x, fe(x)) where
  * x (the key) is an item
  * fe(x) (the value) is a lower bound on the number of
occurrences of x seen so far.
* Sampling rate chosen to ensure that, with probability ≥ 1 − δ, the
frequent items are in the sample.
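The hash-table ingredient can be illustrated with a simplified sketch that uses a FIXED sampling rate r; the full Sticky Sampling algorithm adapts the rate as the stream grows, choosing it from ϕ, ε and δ, which this sketch omits (function name and parameters are mine):

```python
import random

def sticky_sample_fixed_rate(stream, r, seed=None):
    """Simplified Sticky Sampling sketch with a fixed sampling rate r.

    S maps an item x to fe(x), a lower bound on its true frequency:
    occurrences of x that arrive before x is sampled into S are lost,
    but once x is tracked, it is counted exactly from then on.
    """
    rng = random.Random(seed)
    S = {}
    for x in stream:
        if x in S:
            S[x] += 1                    # already tracked: exact counting
        elif rng.random() < 1.0 / r:
            S[x] = 1                     # start tracking with probability 1/r
    return S
```

With r = 1 every item is tracked from its first occurrence, so the counts are exact; larger r shrinks the table at the cost of looser lower bounds.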
Sketch
Sketch: space-efficient data structure that can be
used to provide (typically probabilistic) estimates of
(statistical) characteristics of a data stream.
Consider a stream Σ = x1, x2, . . . , xn, whose elements belong to a
universe U.
For each u ∈ U occurring in Σ define its (absolute) frequency
fu = |{j : xj = u, 1 ≤ j ≤ n}| ,
i.e., the number of occurrences of u in Σ.
Definition: Frequency Moments
For every k ≥ 0, the k-th frequency moment Fk of Σ is
Fk = ∑_{u ∈ U} fu^k
(We assume 0^0 = 0.)
- F0 = number of distinct items in Σ. It is useful, for instance,
in the analysis of web logs.
- F1 = |Σ|: trivial to maintain with a counter.
- 1 − F2/|Σ|^2 = Gini index of Σ. It provides info on data skew:
the closer it is to 0, the higher the skew. It is used for
decision tree induction.
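As an offline reference point (sketches estimate these quantities in small space, which this naive version does not attempt), the moments follow directly from the definition; function name is mine:

```python
from collections import Counter

def frequency_moment(stream, k):
    """Exact k-th frequency moment F_k = sum over distinct u of fu^k.

    Stores the full frequency table, so this is only a reference
    computation, not a streaming algorithm.
    """
    freqs = Counter(stream)              # fu for every distinct item u
    return sum(f ** k for f in freqs.values())
```

For example, on the stream "abracadabra" (frequencies a:5, b:2, r:2, c:1, d:1) we get F0 = 5 distinct items, F1 = 11 = |Σ|, and F2 = 25 + 4 + 4 + 1 + 1 = 35.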
Estimating individual frequencies and F2