3. Data Streams Flashcards
What are the 2 forms of queries for a data stream system?
- Ad-hoc queries: Normal queries asking one time about streams
- Standing queries: Queries that are asked about the stream at all times
What is the stream model?
A model of processing data where input elements enter at a rapid rate at one or more input ports
What types of problems can be solved using data streams?
- Queries over sliding windows
- Filtering data streams
- Estimating moments
What is the sliding window?
An approach for answering data stream queries which ask about a window of the N most recent elements received
What is the issue that can occur with sliding window in terms of data size?
- N may be too large to fit in memory or even on disk
- N could fit in disk but there are so many streams that windows for all cannot be stored
What is the issue with attempting to count the number of 1’s in a sliding window and how can we resolve it?
The issue is that we cannot afford to store N bits. We resolve this by using methods that provide us an approximate answer
What is the uniformity assumption for counting 1’s?
Assumes that 1’s and 0’s are distributed uniformly across the window so we can use earlier results to estimate
What is the equation when using the uniformity assumption for counting 1’s?
N * (S/(S+Z)) where S is the number of 1s from the beginning of the stream and Z is the number of zeros from the beginning of the stream
What is the DGIM method?
A method for counting 1’s that does not assume uniformity. It is never off by more than 50%
What is a DGIM bucket?
A record consisting of the timestamp of its end and the number of 1s between its beginning and its end. The DGIM method breaks the window into buckets where each bucket holds a number of 1s which is a power of 2
How can we represent a stream using DGIM buckets?
Start at the most recent elements. Create 1 or 2 buckets with the same power-of-2 number of 1s. The buckets do not overlap and buckets get increasingly bigger as we move to the left to less recent elements. Buckets disappear when their end-time is more than N time units in the past
How do we update DGIM buckets when a new bit arrives to the window?
- If it is 0 do nothing
- If it is 1, add it to the smallest bucket. If the bucket becomes too big or there are too many buckets of a certain size then combine buckets upwards until they are resorted
How can we estimate the number of 1’s in the most recent N bits using DGIM?
- Sum the sizes of all buckets but the last that does not fit in the window
- Add half the size of the last bucket
How can we estimate the number of 1’s in the most recent k bits where k < N?
- Find the earliest bucket B that overlaps with k
- Sum the sizes of all buckets earlier than B
- Add half the size of B
What is the goal of filtering data streams?
Determining which tuples of the stream are in a given list of keys S