Week 8-9: Data Streams and Streaming Algorithms Flashcards
Data Stream
An ordered and potentially infinite sequence of elements.
Data streams have the following properties:
1. Potentially infinite (transient, single pass over the data, only summaries can be stored, real-time processing)
2. Non-static (incremental updates, concept drift, forgetting old data)
3. Temporal order may be important
Data streams can be modelled as server logs, collections of small text files (e.g., online posts), or, more abstractly, any ordered list of potentially infinite elements.
Element/Data Point
An element can include, but isn’t limited to:
1. Event
2. Figure
3. Record
4. Graph (in context of social network data streams)
Data Stream Example 1:
SELECT
System.Timestamp AS OutputTime, dspl AS SensorName,
Avg(temp) AS AvgTemperature
INTO
output
FROM
InputStream TIMESTAMP BY time
GROUP BY TumblingWindow(second,30),dspl
HAVING Avg(temp)>100
Monitor Avg(temp) over 30-second tumbling windows and output SensorName and Avg(temp) whenever Avg(temp) > 100 degrees.
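The same tumbling-window logic can be sketched in Python (a batch simulation of the streaming query, not Azure Stream Analytics; function and field names are my own):

```python
from collections import defaultdict

def tumbling_window_alerts(events, window=30, threshold=100):
    """Group (time, sensor, temp) events into fixed tumbling windows and
    return per-(window, sensor) average temperatures above the threshold."""
    acc = defaultdict(lambda: [0.0, 0])          # (window_id, sensor) -> [sum, count]
    for time, sensor, temp in events:
        key = (time // window, sensor)           # tumbling windows don't overlap
        acc[key][0] += temp
        acc[key][1] += 1
    return {key: s / c for key, (s, c) in acc.items() if s / c > threshold}
```

Each event falls into exactly one window (time // window), mirroring TumblingWindow(second, 30); the dict comprehension plays the role of the HAVING clause.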
Data Stream Management System (DSMS)
It’s software that handles continuous data streams.
Processor
Streams enter the DSMS via the Processor, often with different velocities (arrival rates) and data types. Input streams need not have the same number of elements.
Archival Storage
It archives streams. Archival Storage isn’t meant for querying, because data retrieval from it is very slow.
Limited Working Storage
It serves as storage space in the disk or in the main memory. It’s used for answering queries. It can’t store all data.
Standing Queries
They reside in the Processor, execute permanently, and produce output at appropriate times.
Example: “output an alert when the variable TEMPERATURE exceeds 25”
Ad-hoc Queries
These are asked once about the current state of a stream or streams. The output of the DSMS, which comes from the Processor, is the answers to the query.
Example: “list the total number of transactions in the last hour”
Queries: Sampling Data from a Stream
Build a random sample, such as a random subset of the stream.
Queries: Sliding Windows
Find the number of items of type x in the last k elements of the stream.
Queries: Filtering a Data Stream
Select elements with property x from the stream.
Queries: Counting Distinct Elements
Number of distinct elements in the last k elements of the stream.
Queries: Estimating Frequency Moments
Estimate the average/standard deviation of the last k elements.
Queries: Finding Frequent Elements
Return the elements that appear most frequently in the stream. Can specify the k-most elements to return.
Data Stream Query Applications
- Mining Query Streams: Google search wants to know what queries are more frequent today than yesterday.
- Mining Click Streams: websites want to know which of its pages are getting an unusual number of clicks in the past hour.
- Mining social network news feeds: e.g., looking for trending topics on Twitter, Facebook.
Naïve Solution for Sampling
If we want to store a proportion r of the data stream, then sample proportion r of each user’s queries.
Does the Naïve Solution for Sampling Work for Estimating the Proportion of Duplicate Queries in the Stream by an Average User?
No.
Assume a user issues x queries once and d queries twice, and that 1/10 of the queries are sampled. The fraction of duplicate queries computed from the sample is d/(10x+19d).
The correct answer is d/(x+d).
A better solution is to select 1/10 of the users, store all their queries, and count the duplicates. Use a hash function h: user -> {1,2,…,10}. If h(user) = 1, we accept the user’s queries. Otherwise, we discard them.
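A minimal sketch of the user-hashing idea in Python (sha256 is my stand-in for the hash function h; names are assumptions):

```python
import hashlib

def user_bucket(user_id, buckets=10):
    """Deterministically map a user to one of `buckets` values: the h in the notes."""
    digest = hashlib.sha256(str(user_id).encode()).hexdigest()
    return int(digest, 16) % buckets + 1

def keep_query(user_id):
    """Accept a query iff its user hashes to bucket 1, i.e. keep ~1/10 of users."""
    return user_bucket(user_id) == 1
```

Because the decision depends only on the user, either all of a user’s queries are kept or none are, so duplicate counts within sampled users are exact.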
Reservoir Sampling
The goal of reservoir sampling is to get a uniform sample of the stream.
Method:
1. Add the first s tuples from the stream into a reservoir R.
2. For j>s, with probability s/j replace a random entry of R with the j-th tuple of the stream.
3. At j=n, return R.
The result is that R contains each tuple seen so far with probability s/n.
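The three steps above can be sketched directly in Python (a minimal single-pass implementation; variable names follow the notes):

```python
import random

def reservoir_sample(stream, s, rng=random):
    """Single-pass uniform random sample of size s (reservoir sampling)."""
    R = []
    for j, item in enumerate(stream, start=1):
        if j <= s:
            R.append(item)                  # step 1: fill the reservoir
        elif rng.random() < s / j:          # step 2: keep the j-th tuple w.p. s/j...
            R[rng.randrange(s)] = item      # ...replacing a uniformly random entry
    return R
```

A short induction shows that after n elements, each is in R with probability s/n, matching the claim above.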
DGIM Method
This approach is used to approximate how many 1’s are in the last k bits, where k <= N. The Naïve Method is brute force and isn’t practical for large values of N.
The DGIM Method doesn’t assume uniformity, has a bounded approximation error of O(1/r) (where r is the maximum number of buckets of each size), and stores only O(log^2 N) bits (logs base 2).
Note that the larger r is, the less the error, but the more bits we need to store.
DGIM Method: Timestamp
Each element has timestamp t mod N, where t = 1,2,3,… and N is the window length.
Each timestamp needs O(log N) bits of space.
Each bucket needs that much memory, and there are O(log N) buckets, which means O(log(N) * log(N)) = O(log^2 N) bits in total.
DGIM Method: Bucket
A bucket is a segment of the window. It contains the timestamp of its most recent element and the number of 1s in it (size).
The right end of a bucket is always a 1. Every 1 is in a bucket. No position is in more than one bucket. There are one or two buckets of any given size, up to some maximum size. All sizes must be a power of 2. Bucket sizes can’t increase as we move from left (older) to right (newer).
DGIM Method: Pseudocode
For each new bit:
    if the timestamp of the oldest bucket is N or more time units before the current time:
        drop the oldest bucket
    if bit == 1:
        create a new bucket of size 1 with the current timestamp
        while there are three buckets of some size k:
            combine the two oldest of them into one bucket of size 2*k
    if bit == 0:
        make no further changes
To estimate the number of 1s in the window, add the sizes of all buckets except the oldest, plus half the size of the oldest bucket, which may be partially cut off for extending beyond N.
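The bucket bookkeeping above can be sketched as a small Python class (my own representation: buckets are (timestamp, size) pairs kept newest-first; this is an illustrative sketch, not an optimised implementation):

```python
class DGIM:
    """DGIM sketch with at most two buckets of any size."""

    def __init__(self, window):
        self.N = window
        self.t = 0
        self.buckets = []                    # (timestamp, size); newest at index 0

    def add(self, bit):
        self.t += 1
        # Drop the oldest bucket once its timestamp falls outside the window.
        if self.buckets and self.buckets[-1][0] <= self.t - self.N:
            self.buckets.pop()
        if bit == 1:
            self.buckets.insert(0, (self.t, 1))
            i = 0
            # Cascade: three buckets of one size -> merge the two oldest of them.
            while i + 2 < len(self.buckets) and \
                    self.buckets[i][1] == self.buckets[i + 2][1]:
                ts = self.buckets[i + 1][0]  # keep the more recent timestamp
                self.buckets[i + 1] = (ts, 2 * self.buckets[i + 1][1])
                del self.buckets[i + 2]
                i += 1

    def count(self):
        """Estimate the 1s in the last N bits: every bucket fully,
        except only half of the oldest one."""
        if not self.buckets:
            return 0
        return sum(b[1] for b in self.buckets[:-1]) + self.buckets[-1][1] / 2
```

The sizes seen from newest to oldest are non-decreasing, so checking positions i and i+2 for equality suffices to detect a triple.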
Hash Function Filter
Given a set of keys S that we want to filter by (e.g., a white-list of email addresses):
1. Create a bit array B of n bits, initially all 0’s.
2. Choose a hash function h with range [0,n)
3. Hash each member of s \in S to one of n buckets, and set that bit to 1, i.e. B[h(s)]=1
4. Hash each element a of the stream with h and output a if it hashes to a bit of B that is 1 (i.e., Output a if B[h(a)] == 1)
If an element is in S, it will hash to a bucket that has its bit set to 1, so it’ll always get through (no false negatives).
If an element isn’t in S, the chance that it gets through is 1 - e^{- |S|/n}, with n the range of the hash function.
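A minimal sketch of this single-hash filter in Python (sha256 stands in for the hash function h; names are my own):

```python
import hashlib

def make_hash(n):
    """A hash function with range [0, n); sha256-based stand-in for h."""
    return lambda x: int(hashlib.sha256(str(x).encode()).hexdigest(), 16) % n

def filter_stream(S, stream, n=1000):
    """Pass stream elements a whose bit B[h(a)] was set by some key in S."""
    h = make_hash(n)
    B = [0] * n
    for s in S:
        B[h(s)] = 1                          # steps 1-3: build the bit array
    return [a for a in stream if B[h(a)] == 1]   # step 4: filter the stream
```

With |S| = 50 and n = 1000, roughly 1 - e^{-50/1000} ≈ 5% of non-members slip through, matching the formula above.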
Bloom Filter
It’s a probabilistic set-membership test. A Bloom filter is faster than searching through S and smaller than an explicit representation. In theory, false negatives are impossible, but false positives can occur.
Parameters:
1. |S|, the number of keys in S. Increasing |S| leads to a higher false positive rate.
2. n, size of the bloom filter (array B), increasing n means more space, and a lower false positive rate.
3. k, the number of hash functions. As k increases, there’s more computation. Note there’s usually an optimum value of k.
Given a set of keys S that we want to filter
1. Create a bit array B of n bits, initially all 0
2. Choose k hash functions h_1,…,h_k with range [0,n)
3. Hash each member s \in S with the k hash functions h_i(s), i \in [1,k], which map s uniformly into [0,n) (take the result modulo n if a hash function outputs large numbers). Set B[h_1(s)],…,B[h_k(s)] to 1.
4. When a stream element with key y arrives, use k hash functions. If B[h_1(y)],…,B[h_k(y)] are all 1, output y. Else, discard y.
Note that a stream key might be different than any key s \in S, but still might get through if the combined bloom filter has a matching configuration of 1 bits.
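The four steps above can be sketched as a small Python class (the k hash functions are simulated by salting sha256; class and method names are my own):

```python
import hashlib

class BloomFilter:
    """Bloom filter sketch: k salted sha256 hashes stand in for h_1..h_k."""

    def __init__(self, n, k):
        self.n, self.k = n, k
        self.B = [0] * n                      # step 1: bit array, all 0

    def _positions(self, key):
        # step 2: k hash functions with range [0, n), derived by salting
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.n

    def add(self, key):
        for p in self._positions(key):        # step 3: set k bits per key
            self.B[p] = 1

    def might_contain(self, key):
        # step 4: output the key only if all k bits are 1
        return all(self.B[p] for p in self._positions(key))
```

Deriving many hash functions from one by salting is a common practical shortcut; it is not required by the algorithm itself.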
Bloom Filter: Collision
When each member s \in S is passed through the hash functions, there’s a chance that one or more of the bits flipped to 1 land in the same positions as 1 bits set by other members.
Bloom Filter: False Positive Rate
Assuming n » k, the approximation is p_{false} \approx (1 - e^{-k|S|/n})^k
Actual False Positive Rate: (1 - (1 - 1/n)^{k|S|})^k
Bloom Filter: Optimal k (Number of Hash Functions)
k_{opt} = (n/|S|) \ln 2 \approx 0.693 n/|S|. At this k, the false positive rate is minimised, approximately 0.6185^{n/|S|}.
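A short numerical check of the optimal-k formula k_{opt} = (n/|S|) ln 2 against the false positive approximation (function names are my own):

```python
import math

def optimal_k(n, set_size):
    """Number of hash functions minimising the false positive rate."""
    return max(1, round((n / set_size) * math.log(2)))

def fp_rate(n, set_size, k):
    """Approximate false positive rate (1 - e^{-k|S|/n})^k."""
    return (1 - math.exp(-k * set_size / n)) ** k

# With 8 bits per key (n = 8000, |S| = 1000), k_opt = round(8 * ln 2) = 6,
# and the minimal false positive rate is about 0.6185^8, roughly 2%.
k = optimal_k(8000, 1000)
```

Evaluating fp_rate at k values below and above k_opt confirms the rate is worse on both sides, i.e. k_opt sits at the minimum.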
Flajolet-Martin (FM) Sketch
This method approximates the number of distinct elements in the last k elements of the stream.
Applications:
1. Security monitoring - if more than X attempts, report.
2. Propagation rate of viruses.
3. Distributed computing: multiple parties can combine their sketches to find the number of total distinct elements, and the number of common elements (inclusion/exclusion). Examples include document overlap for plagiarism.
Bloom Filter: Applications
They’re used to reduce disk lookups for non-existent rows or columns. Utilised by Google BigTable, Apache HBase, Apache Cassandra, and PostgreSQL.
FM Sketch: Pseudocode
- Select a hash function h that maps each of N elements to at least log_2 N bits. N is the maximum number of distinct elements. There are no buckets.
- Define r(h(a)) as the number of 0’s from the right (the tail length). e.g. a -> h(a) = 110 -> r(h(a)) = 1
- For each element x in stream S, compute r(h(x)). Let R = \underset{x \in S}{\max} r(h(x)). Return 2^R as the estimated number of distinct elements in S.
To reduce the error, run the FM sketch m times with independent hash functions, which changes the estimate to (2^{R_1} + … + 2^{R_m})/m (standard deviation = \sigma/\sqrt{m}).
It’s suggested to use the correction factor \varphi = 0.77351, so the final estimate is (2^R)/\varphi.
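The FM sketch, with the averaging and the \varphi correction, can be sketched in Python (salted sha256 stands in for the independent hash functions; names are my own):

```python
import hashlib

def r(x):
    """Tail length: number of trailing zero bits in x (0 for x == 0)."""
    return (x & -x).bit_length() - 1 if x else 0

def fm_estimate(stream, salt="", phi=0.77351):
    """One FM estimate: 2^R / phi, R the max tail length over the stream."""
    R = 0
    for x in stream:
        h = int(hashlib.sha256(f"{salt}:{x}".encode()).hexdigest(), 16)
        R = max(R, r(h))
    return 2 ** R / phi

def fm_mean(stream, m=16):
    """Average m estimates with different salts to reduce the error."""
    items = list(stream)
    return sum(fm_estimate(items, salt=str(i)) for i in range(m)) / m
```

Note that a single estimate is always a power of 2 divided by \varphi; averaging several runs is what lets the estimate land between powers of 2.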
FM Sketch: Error Bound
For c > 3, Pr(1/c <= (2^R)/F <= c) > 1 - 3/c, where F is the true number of distinct elements. However, as c increases (widening the allowed range for 2^R), the bound weakens.
Alon-Matias-Szegedy (AMS) Method
This approach estimates the moments of the last k elements.
The k-th frequency moment of a stream comprising N different types of elements a_1,…,a_N, each appearing m_1,…,m_N times, is defined as f_k = \sum_{i=1}^N m_i^k.
f_0 is the number of distinct elements. f_1 is the total frequency (length of the stream). f_2 shows how uneven the distribution is, as it is the sum of the squares of the frequencies of the distinct elements.
AMS Method: Pseudocode for f_2
- Pick a uniformly random time t (t < n, with n the stream length) to start.
- X.el = i, with i being the element at time t.
- X.val = c, with c being the frequency of X.el at time t and after.
- S_i = n * (2 * X_i.val - 1), with S_i being the i-th f_2 estimate and X_i.val the count c of the i-th variable.
- S = avg(S_1,…,S_k)
We can keep track of multiple X’s, up to k total. k <= n
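The f_2 pseudocode above can be sketched in Python for a fixed-length stream (a batch simulation for clarity; a true streaming version would increment each X.val as elements arrive):

```python
import random

def ams_f2(stream, k=500, seed=0):
    """AMS estimate of f_2: average n*(2c - 1) over k variables
    started at uniformly random positions."""
    items = list(stream)
    n = len(items)
    rng = random.Random(seed)
    total = 0
    for _ in range(k):
        t = rng.randrange(n)                 # X.el = items[t]
        c = items[t:].count(items[t])        # X.val: occurrences at t and after
        total += n * (2 * c - 1)             # S_i
    return total / k
```

The estimator is unbiased: averaging n*(2c - 1) over all n start positions gives exactly f_2, so the sample mean over k random positions converges to it.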
AMS Method: k-th Moment Estimate
For k >= 3, the estimate is n * (c^k - (c-1)^k)
AMS Method Dealing with Non-Fixed n
Maintain as many X’s as the storage allows and replace them as the stream grows, utilising reservoir sampling. Otherwise, X’s fixed near the beginning would favour early positions, while X’s started near the end wouldn’t see many occurrences of their elements.