C4 Flashcards
conventional data mining
- all data stored on slow hard disks
- slow access, slow processing
- offline analysis of data
- instant action on results possible
However, trained models can be deployed to work in real time
the stream model
- data enter the system at rapid rate, at one or more input ports
- the system cannot store the entire stream accessibly
- part of the data stream is stored in archival storae and another part is duplicated and stored in a limited working storage, to answer queries to gain insight
applications: eg. detecting breaking news, detecting traffic jams
reservoir sampling
Suppose your RAM can store s records; each record has the same size
How could you sample an infinite stream of records, in such a way that at any moment your RAM keeps a random, uniformly distributed, sample of s records? (uniformly distributed: every record from the stream has the same chance of being kept in RAM, i.e., s/(#records seen so far)
- initialization: the first s records are stored in RAM (p = 1)
- inductive step:
- when the (n+1)
th element arrives, decide with probability s/(n+1) to keep the record in RAM (otherwise, ignore it)
- if you choose to keep it, throw one of the previously stored record out, selected with equal probability, and use the freed space for the new record.
fast filtering of bad records
Sometimes we know which records are “legal” and want to filter out all (or almost all) “illegal” records from a
stream => bloom filter
Bloom filter
checking in a data base if a credit card number is valid costs too much time => use a hash function to hash every known legal card number, number x is legal, h(x) = 1 (just 1 bit)
using one hash function does have a high false positive rate of 11.75%, but using 2 hash functions already drops the false positive rate to 5%