3. Data Streams Flashcards

1
Q

What are the 2 forms of queries for a data stream system?

A
  1. Ad-hoc queries: Normal queries asking one time about streams
  2. Standing queries: Queries that are asked about the stream at all times
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

What is the stream model?

A

A model of processing data where input elements enter at a rapid rate at one or more input ports

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

What types of problems can be solved using data streams?

A
  1. Queries over sliding windows
  2. Filtering data streams
  3. Estimating moments
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

What is the sliding window?

A

An approach for answering data stream queries which ask about a window of the N most recent elements received

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

What is the issue that can occur with sliding window in terms of data size?

A
  1. N may be too large to fit in memory or even on disk
  2. N could fit in disk but there are so many streams that windows for all cannot be stored
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

What is the issue with attempting to count the number of 1’s in a sliding window and how can we resolve it?

A

The issue is that we cannot afford to store N bits. We resolve this by using methods that provide us an approximate answer

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

What is the uniformity assumption for counting 1’s?

A

Assumes that 1’s and 0’s are distributed uniformly across the window so we can use earlier results to estimate

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

What is the equation when using the uniformity assumption for counting 1’s?

A

N * (S/(S+Z)) where S is the number of 1s from the beginning of the stream and Z is the number of zeros from the beginning of the stream

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

What is the DGIM method?

A

A method for counting 1’s that does not assume uniformity. It is never off by more than 50%

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

What is a DGIM bucket?

A

A record consisting of the timestamp of its end and the number of 1s between its beginning and its end. The DGIM method breaks the window into buckets where each bucket holds a number of 1s which is a power of 2

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

How can we represent a stream using DGIM buckets?

A

Start at the most recent elements. Create 1 or 2 buckets with the same power-of-2 number of 1s. The buckets do not overlap and buckets get increasingly bigger as we move to the left to less recent elements. Buckets disappear when their end-time is more than N time units in the past

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

How do we update DGIM buckets when a new bit arrives to the window?

A
  1. If it is 0 do nothing
  2. If it is 1, add it to the smallest bucket. If the bucket becomes too big or there are too many buckets of a certain size then combine buckets upwards until they are resorted
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

How can we estimate the number of 1’s in the most recent N bits using DGIM?

A
  1. Sum the sizes of all buckets but the last that does not fit in the window
  2. Add half the size of the last bucket
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

How can we estimate the number of 1’s in the most recent k bits where k < N?

A
  1. Find the earliest bucket B that overlaps with k
  2. Sum the sizes of all buckets earlier than B
  3. Add half the size of B
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

What is the goal of filtering data streams?

A

Determining which tuples of the stream are in a given list of keys S

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

Why can a hash table not be used to filter a data stream?

A

We do not have enough memory to store the entire list of keys we are looking for as we could be processing millions of filters on the same stream

17
Q

What is a bloom filter?

A

A technique for filtering data streams made up of an array of bits with a number of hash functions

18
Q

How is a bloom filter used to detect if elements have been seen before?

A
  1. All bits are initially set to 0
  2. When an input arrives, we set to 1 the bits h(x) for each hash function h
  3. If all of the bit positions are already 1 we say we have seen y before
19
Q

What is the issue with a bloom filter?

A

It can give false positives as hash collisions can occur

20
Q

What is the probability of a false positive in a bloom filter?

A

Depends on density of 1’s and the number of hash functions
=(fraction of 1’s)^# of hash functions

21
Q

What is the goal of estimating moments of a data stream?

A

Estimating summary statistics like average and standard deviation of the last k elements

22
Q

What is the formula for the kth moment?

A

Sum of all (mi)^k for all i in A where A is a set of N values and mi is the number of times i occurs in the stream

23
Q

What are the special case moments?

A
  1. 0th moment which is the number of distinct elements
  2. 1st moment which is the length of the stream
  3. 2nd moment (surprise moment) which is a measure of how uneven the distribution is
24
Q

What is AMS?

A

A method for estimating moments of a stream

25
Q

What is the formula for the second moment for a particular variable X of the stream?

A

f(X) = n(2c-1) where n is the length of the stream and c is the count of element i stored in X.val

26
Q

How do we estimate the second moment using AMS when keeping track of multiple variables?

A

Take the average of f(X) = n(2c-1) for all X that we are tracking

27
Q

What is the issue with the AMS method for estimating the second moment?

A

We assumed there was a length of the stream n but real streams go on forever