C4 Flashcards

Question 1

Q

conventional data mining

Answer

A

all data stored on slow hard disks
slow access, slow processing
offline analysis of data
instant action on results possible
However, trained models can be deployed to work in real time

Question 2

Q

the stream model

Answer

A

data enter the system at rapid rate, at one or more input ports
the system cannot store the entire stream accessibly
part of the data stream is stored in archival storae and another part is duplicated and stored in a limited working storage, to answer queries to gain insight

applications: eg. detecting breaking news, detecting traffic jams

Question 3

Q

reservoir sampling

Answer

A

Suppose your RAM can store s records; each record has the same size

How could you sample an infinite stream of records, in such a way that at any moment your RAM keeps a random, uniformly distributed, sample of s records? (uniformly distributed: every record from the stream has the same chance of being kept in RAM, i.e., s/(#records seen so far)

initialization: the first s records are stored in RAM (p = 1)
inductive step:
- when the (n+1)
th element arrives, decide with probability s/(n+1) to keep the record in RAM (otherwise, ignore it)
- if you choose to keep it, throw one of the previously stored record out, selected with equal probability, and use the freed space for the new record.

Question 4

Q

fast filtering of bad records

Answer

A

Sometimes we know which records are “legal” and want to filter out all (or almost all) “illegal” records from a
stream => bloom filter

Question 5

Q

Bloom filter

Answer

A

checking in a data base if a credit card number is valid costs too much time => use a hash function to hash every known legal card number, number x is legal, h(x) = 1 (just 1 bit)

using one hash function does have a high false positive rate of 11.75%, but using 2 hash functions already drops the false positive rate to 5%