Week 8-9: Data Streams and Streaming Algorithms Flashcards

1
Q

Data Stream

A

An ordered and potentially infinite sequence of elements.

Data streams have the following properties:
1. Potentially infinite (transient, single pass over the data, only summaries can be stored, real-time processing)
2. Non-static (incremental updates, concept drift, forgetting old data)
3. Temporal order may be important

Examples of data streams include server logs, a collection of small text files (e.g., online posts), and any ordered list of potentially infinite elements.

2
Q

Element/Data Point

A

An element can include, but isn’t limited to:
1. Event
2. Figure
3. Record
4. Graph (in context of social network data streams)

3
Q

Data Stream Example 1:

SELECT
System.Timestamp AS OutputTime, dspl AS SensorName,
Avg(temp) AS AvgTemperature

INTO
output

FROM
InputStream TIMESTAMP BY time

GROUP BY TumblingWindow(second,30),dspl

HAVING Avg(temp)>100

A

Every 30 seconds (a tumbling window), compute the average temperature per sensor and output OutputTime, SensorName, and AvgTemperature whenever Avg(temp) > 100 degrees.

4
Q

Data Stream Management System (DSMS)

A

It’s software that handles continuous data streams.

5
Q

Processor

A

Streams enter the DSMS via the Processor, and often have different velocities (arrival rates) and data types. Input streams need not have the same number of elements.

6
Q

Archival Storage

A

It archives streams. Archival Storage isn’t used for answering queries, because data retrieval from it is very inefficient.

7
Q

Limited Working Storage

A

It serves as storage space on disk or in main memory. It’s used for answering queries. It can’t store all the data.

8
Q

Standing Queries

A

They reside in the Processor, execute permanently, and produce output at appropriate times.
Example: “output an alert when the variable TEMPERATURE exceeds 25”

9
Q

Ad-hoc Queries

A

These are asked once about the current state of a stream or streams. The output of the DSMS, which comes from the Processor, is the answers to the query.

Example: “list the total number of transactions in the last hour”

10
Q

Queries: Sampling Data from a Stream

A

Build a random sample, such as a random subset of the stream.

11
Q

Queries: Sliding Windows

A

Find the number of items of type x in the last k elements of the stream.

12
Q

Queries: Filtering a Data Stream

A

Select elements with property x from the stream.

13
Q

Queries: Counting Distinct Elements

A

Number of distinct elements in the last k elements of the stream.

14
Q

Queries: Estimating Frequency Moments

A

Estimate the average/standard deviation of the last k elements.

15
Q

Queries: Finding Frequent Elements

A

Return the elements that appear most frequently in the stream. A query can specify the k most frequent elements to return.

16
Q

Data Stream Query Applications

A
  1. Mining Query Streams: Google search wants to know which queries are more frequent today than yesterday.
  2. Mining Click Streams: websites want to know which of their pages have received an unusual number of clicks in the past hour.
  3. Mining social network news feeds: e.g., looking for trending topics on Twitter, Facebook.
17
Q

Naïve Solution for Sampling

A

If we want to store a proportion r of the data stream, then sample proportion r of each user’s queries.

18
Q

Does the Naïve Solution for Sampling Work for Estimating the Proportion of Duplicate Queries in the Stream by an Average User?

A

No.

Suppose each user issues x queries once and d queries twice (x + 2d queries in total). If 1/10 of the queries are sampled, the expected fraction of duplicate queries measured from the sample is d/(10x + 19d): a duplicate lands in the sample twice with probability 1/100, but only once with probability 18/100, in which case it looks like a singleton.

The correct answer is d/(x + d).

A better solution is to select 1/10 of the users, store all their queries, and count the duplicates. You can use a hash function h: user -> {1, 2, …, 10}. If h(user) = 1, we accept their queries. Otherwise, we discard them.
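The user-level sampling above can be sketched in a few lines. This is a minimal illustration, not the course's reference implementation; the choice of MD5 and the `keep_user`/`sample_queries` names are assumptions made here so that the same user always hashes to the same bucket.

```python
import hashlib

def keep_user(user_id: str, buckets: int = 10) -> bool:
    """Hash the user into one of `buckets` buckets; keep only bucket 0 (~1/10 of users)."""
    h = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
    return h % buckets == 0

def sample_queries(stream):
    """stream yields (user_id, query) pairs; keep ALL queries of the sampled users."""
    return [(user, query) for user, query in stream if keep_user(user)]
```

Because the decision depends only on the user ID, every query (and every duplicate) of a sampled user is retained, so duplicate statistics computed on the sample are unbiased.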

19
Q

Reservoir Sampling

A

The goal of reservoir sampling is to get a uniform sample of the stream.

Method:
1. Add the first s tuples from the stream into a reservoir R.
2. For j>s, with probability s/j replace a random entry of R with the j-th tuple of the stream.
3. At j=n, return R.

The result is that R contains each tuple seen so far with probability s/n.
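The three-step method above can be sketched as follows; a minimal Python version, with the function name chosen here for illustration:

```python
import random

def reservoir_sample(stream, s, rng=random):
    """Return a uniform random sample of s items from an iterable stream."""
    reservoir = []
    for j, item in enumerate(stream, start=1):
        if j <= s:
            reservoir.append(item)                  # step 1: fill R with the first s tuples
        elif rng.random() < s / j:                  # step 2: with probability s/j ...
            reservoir[rng.randrange(s)] = item      # ... replace a random entry of R
    return reservoir
```

At any point n in the stream, each tuple seen so far is in the reservoir with probability s/n, which is what makes the sample uniform.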

20
Q

DGIM Method

A

This approach approximates how many 1’s are in the last k bits, where k <= N (the window length). The naïve method is brute force (store and scan the whole window) and isn’t practical for large values of N.

The DGIM method doesn’t assume uniformity, has a bounded approximation error of O(1/r) (r is the maximum number of buckets of each size), and stores only O(log^2 N) (base 2) bits.

Note that the larger r is, the smaller the error, but the more bits we need to store.

21
Q

DGIM Method: Timestamp

A

Each element has timestamp t mod N, where t is 1,2,3,… and N is the window length.

Each timestamp needs O(log N) bits of space.

Memory is required to store each bucket. We have O(log N) buckets, each needing O(log N) bits, which means O(log(N) * log(N)) = O(log^2 N) bits.

22
Q

DGIM Method: Bucket

A

A bucket is a segment of the window. It contains the timestamp of its most recent element and the number of 1s in it (size).

The right end of a bucket is always a 1. Every 1 is in a bucket. No position is in more than one bucket. There are one or two buckets of any given size, up to some maximum size. All sizes must be a power of 2. Bucket sizes can’t increase as we move from left (older) to right (more recent).

23
Q

DGIM Method: Pseudocode

A

For each new bit:

if the end-time of the oldest bucket is prior to N time units before the current time:
drop the oldest bucket

if bit == 1:
create a new bucket of size 1 containing just this bit
while there are three buckets of some size k:
combine the two oldest of them into a bucket of size 2*k

if bit == 0:
no further changes

To estimate the count, add the sizes of all the buckets, counting only half the size of the oldest bucket, which may be cut off for extending beyond N.
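The pseudocode above can be turned into a compact class. This is a simplified sketch: it stores absolute timestamps instead of t mod N, and the class and method names are choices made here, not part of the original DGIM description.

```python
class DGIM:
    """Approximate the count of 1s in the last N bits using O(log^2 N) buckets."""

    def __init__(self, N):
        self.N = N
        self.t = 0           # current time (absolute here; t mod N in the full method)
        self.buckets = []    # (end_timestamp, size) pairs, newest first

    def add(self, bit):
        self.t += 1
        # drop the oldest bucket once its end-time falls out of the window
        if self.buckets and self.buckets[-1][0] <= self.t - self.N:
            self.buckets.pop()
        if bit == 1:
            self.buckets.insert(0, (self.t, 1))   # new bucket holding just this 1
            i = 0
            while i + 2 < len(self.buckets):
                if (self.buckets[i][1] == self.buckets[i + 1][1]
                        == self.buckets[i + 2][1]):
                    # three buckets of size k: merge the two oldest into size 2k
                    end = self.buckets[i + 1][0]      # newer end-time of the pair
                    size = self.buckets[i + 1][1] * 2
                    self.buckets[i + 1:i + 3] = [(end, size)]
                else:
                    i += 1

    def count(self):
        """Sum bucket sizes, counting only half the oldest (possibly cut-off) bucket."""
        if not self.buckets:
            return 0
        total = sum(size for _, size in self.buckets)
        return total - self.buckets[-1][1] // 2
```

On an all-ones stream with N = 20, the estimate stays within the expected 50% error band of the true count of 20.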

24
Q

Hash Function Filter

A

Given a set of keys S that we want to filter (e.g., a whitelist of email addresses):
1. Create a bit array B of n bits, initially all 0’s.
2. Choose a hash function h with range [0,n).
3. Hash each member s \in S to one of the n buckets, and set that bit to 1, i.e. B[h(s)]=1.
4. Hash each element a of the stream with h and output a if it hashes to a bit of B that is 1 (i.e., output a if B[h(a)] == 1).

If an element is in S, it hashes to a bucket whose bit is set to 1, so it’ll always get through (no false negatives).

If an element isn’t in S, the chance that it gets through is approximately 1 - e^{-|S|/n}, the fraction of bits set to 1 (n is the range of the hash function).

25
Q

Bloom Filter

A

It’s a probabilistic set membership test. A Bloom filter is faster than searching through S and smaller than the explicit representation. In theory, there are no false negatives, but false positives are possible.

Parameters:
1. |S|, the number of keys in S. Increasing |S| leads to a higher false positive rate.
2. n, size of the Bloom filter (array B). Increasing n means more space, and a lower false positive rate.
3. k, number of hash functions. As k increases, there’s more computation. Note there’s usually an optimal value of k.

Given a set of keys S that we want to filter:
1. Create a bit array B of n bits, initially all 0.
2. Choose k hash functions h_1,…,h_k with range [0,n).
3. Hash each member s \in S. Use k hash functions, h_i(s), i \in [1,k], which map s to random numbers uniformly in [0,n) (take the result mod n if the hash function outputs large numbers). Set the elements B[h_1(s)],…,B[h_k(s)] to 1.
4. When a stream element with key y arrives, apply the k hash functions. If B[h_1(y)],…,B[h_k(y)] are all 1, output y. Else, discard y.

Note that a stream key might be different from every key s \in S, but still get through if the combined Bloom filter has a matching configuration of 1 bits.
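The four steps above can be sketched as a small class. A minimal Python version; deriving the k hash functions by salting SHA-256 with the index i is an assumption of this sketch, not part of the definition.

```python
import hashlib

class BloomFilter:
    """Probabilistic set membership: no false negatives, some false positives."""

    def __init__(self, n_bits: int, k: int):
        self.n = n_bits
        self.k = k
        self.bits = bytearray(n_bits)   # one byte per bit, for clarity over compactness

    def _positions(self, key: str):
        # simulate k hash functions h_1..h_k by salting one digest with i
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(h, 16) % self.n

    def add(self, key: str):
        for p in self._positions(key):
            self.bits[p] = 1            # step 3: set B[h_i(s)] = 1

    def might_contain(self, key: str) -> bool:
        # step 4: pass the element only if all k bits are set
        return all(self.bits[p] for p in self._positions(key))
```

Every added key always tests positive; a key outside S tests positive only if all k of its bits happen to collide with set bits.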

26
Q

Bloom Filter: Collision

A

When each member s \in S is passed through the hash key, there’s a chance that one or more of the bits flipped to 1 are in the same position as the other 1 bits of other members.

27
Q

Bloom Filter: False Positive Rate

A

Assuming n » k, the approximation is p_{false} \approx (1 - e^{-k|S|/n})^k

Exact false positive rate: (1 - (1 - 1/n)^{k|S|})^k
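Both formulas are easy to check numerically. A quick sketch; the sample values of n, k, and |S| below are illustrative, not from the course:

```python
import math

def fp_approx(n: int, k: int, s: int) -> float:
    """Approximate false positive rate, valid when n >> k."""
    return (1 - math.exp(-k * s / n)) ** k

def fp_exact(n: int, k: int, s: int) -> float:
    """Exact false positive rate for n bits, k hash functions, |S| = s keys."""
    return (1 - (1 - 1 / n) ** (k * s)) ** k

# e.g., a billion keys in an 8-billion-bit array with 5 hash functions
print(fp_approx(8_000_000_000, 5, 1_000_000_000))
print(fp_exact(8_000_000_000, 5, 1_000_000_000))
```

For large n the two values agree to many decimal places, which is why the exponential approximation is the one usually quoted.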

28
Q

Bloom Filter: Optimal k (Number of Hash Functions)

A

k_{opt} = (n/|S|) \ln 2 \approx 0.693 \, n/|S|

29
Q

Flajolet-Martin (FM) Sketch

A

This method approximates the number of distinct elements in the last k elements of the stream.

Applications:
1. Security monitoring - if more than X attempts, report.
2. Propagation rate of viruses.
3. Distributed computing: multiple parties can combine their sketches to find the number of total distinct elements, and the number of common elements (inclusion/exclusion). Examples include document overlap for plagiarism.

30
Q

Bloom Filter: Applications

A

They’re used to reduce disk lookups for non-existent rows or columns. Utilised by Google BigTable, Apache HBase, Apache Cassandra, and PostgreSQL.

31
Q

FM Sketch: Pseudocode

A
  1. Select a hash function h that maps each of N elements to at least log_2 N bits. N is the maximum number of distinct elements. There are no buckets.
  2. Define r(h(a)) as the number of trailing 0’s in h(a), e.g. a -> h(a) = 110 -> r(h(a)) = 1.
  3. For each element x in stream S, compute r(h(x)). Let R = \underset{x \in S}{\max} r(h(x)). Return 2^R as the estimated number of distinct elements in S.

To reduce the error, run the FM sketch m times with independent hash functions, which changes the estimate to (2^{R_1} + … + 2^{R_m})/m (standard deviation = \sigma/\sqrt{m}).

It’s suggested to apply the correction factor \varphi = 0.77351, so the final estimate is (2^R)/\varphi.
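The single-hash version of the steps above fits in a few lines. A minimal sketch; using MD5 as the hash function is an assumption made here for determinism:

```python
import hashlib

def trailing_zeros(x: int) -> int:
    """r(h(a)): number of trailing 0 bits in x (0 maps to 0 for simplicity)."""
    if x == 0:
        return 0
    return (x & -x).bit_length() - 1

def fm_estimate(stream) -> float:
    """Estimate the number of distinct elements as 2^R / 0.77351,
    where R is the max trailing-zero count over all hashed elements."""
    R = 0
    for x in stream:
        h = int(hashlib.md5(str(x).encode()).hexdigest(), 16)
        R = max(R, trailing_zeros(h))
    return 2 ** R / 0.77351
```

Because the estimate depends only on the set of hash values, repeating elements never changes it, which is exactly the property that makes FM a distinct-count sketch.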

32
Q

FM Sketch: Error Bound

A

For c > 3, Pr(1/c <= (2^R)/F <= c) > 1 - 3/c, where F is the true number of distinct elements. As c increases, the probability guarantee improves, but the range allowed for 2^R widens, so the bound becomes less precise.

33
Q

Alon-Matias-Szegedy (AMS) Method

A

This approach estimates the frequency moments of the last k elements.

The k-th frequency moment of a stream comprised of N different types of elements a_1,…,a_N, each appearing m_1,…,m_N times, is defined as f_k = \sum_{i=1}^N m_i^k.

f_0 is the number of distinct elements. f_1 is the total frequency (length of the stream). f_2 shows how uneven the distribution is, as it’s the sum of the squares of the frequencies of the distinct elements.

34
Q

AMS Method: Pseudocode for f_2

A
  1. Pick a random time t (t < n, with n the stream length) to start.
  2. X.el = i, with i being the element at time t.
  3. X.val = c, with c being the number of occurrences of X.el at time t and after.
  4. S_i = n * (2 * X_i.val - 1), with S_i being the f_2 estimate from variable X_i.
  5. S = avg(S_1,…,S_k)

We can keep track of multiple X’s, up to k total, with k <= n.

35
Q

AMS Method: k-th Moment Estimate

A

For k >= 3, the estimate from one variable with count c is n * (c^k - (c-1)^k).

36
Q

AMS Method Dealing with Non-Fixed n

A

Maintain as many X’s as the storage allows and replace them as the stream grows, utilising reservoir sampling. Otherwise, fixed choices of X would favour early positions, while X’s started near the end of the stream would not have seen many elements.