4.Mining data streams Flashcards
What is a data stream (example)? How can it be mined? What is the situation with memory?
A data stream can be for example a constant stream of data coming from a thermometer in the ocean. This can be mined with the help of a query that for example gives output if temperature crosses 25 degrees. The data can also be stored in a smaller temporary working memory (for example if you need to average data over a period) or a more permanent archive (Expensive to retrieve data from here).
How can we keep track of all the users belonging to 1/10th of the stream of users without actually having to keep a list of users in main memory to refer to?
Create 10 buckets and hash usernames to these buckets. We know a user is in the chosen 1/10th because they will always hash to that bucket. Thus we can for example just use all the queries for the same 1/10th of the users without a list. Note that the users aren’t actually saved in the buckets but rather the hash is used as a random number generator.
What does a bloom filter consist of? What are its purpose and how does it work?
It consists of: 1) An array of n bits, initially all 0’s.
2) A collection of hash functions h1, h2, . . . , hk. Each hash function maps “key” values to n buckets, corresponding to the n bits of the bit-array.
3) A set S of m key values
The purpose of the Bloom filter is to allow through all stream elements whose keys are in S, while rejecting most of the stream elements whose keys are not in S.
What is the probability of a bloom filter to produce a false positive?
(1-e^-(km/n))^k
k = number of hash functions
m = the number of members in S.
n = the number of bits in the array
What is the Flajolet-Martin Algorithm?
How does it work?
The Flajolet–Martin algorithm is an algorithm for approximating the number of distinct elements in a stream with a single pass and space-consumption logarithmic in the maximal number of possible distinct elements in the stream.
It works by estimating the number of unique entries through the number of zeros ending a string (the tail). We let R be the max tail length seen so far. So a good estimate for the no of distinct elements is 2^R. The prob that a stream element ends in r 0’s is 2^-r and the prob of not finding a stream element with as many as r 0’s is e^-(m*2^r)