C4 Flashcards

1
Q

conventional data mining

A
  • all data stored on slow hard disks
  • slow access, slow processing
  • offline analysis of data
  • instant action on results possible
    However, trained models can be deployed to work in real time
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

the stream model

A
  • data enter the system at rapid rate, at one or more input ports
  • the system cannot store the entire stream accessibly
  • part of the data stream is stored in archival storae and another part is duplicated and stored in a limited working storage, to answer queries to gain insight

applications: eg. detecting breaking news, detecting traffic jams

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

reservoir sampling

A

Suppose your RAM can store s records; each record has the same size

How could you sample an infinite stream of records, in such a way that at any moment your RAM keeps a random, uniformly distributed, sample of s records? (uniformly distributed: every record from the stream has the same chance of being kept in RAM, i.e., s/(#records seen so far)

  1. initialization: the first s records are stored in RAM (p = 1)
  2. inductive step:
    - when the (n+1)
    th element arrives, decide with probability s/(n+1) to keep the record in RAM (otherwise, ignore it)
    - if you choose to keep it, throw one of the previously stored record out, selected with equal probability, and use the freed space for the new record.
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

fast filtering of bad records

A

Sometimes we know which records are “legal” and want to filter out all (or almost all) “illegal” records from a
stream => bloom filter

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

Bloom filter

A

checking in a data base if a credit card number is valid costs too much time => use a hash function to hash every known legal card number, number x is legal, h(x) = 1 (just 1 bit)

using one hash function does have a high false positive rate of 11.75%, but using 2 hash functions already drops the false positive rate to 5%

How well did you know this?
1
Not at all
2
3
4
5
Perfectly