College 8: DataStreams Flashcards
1
Q
Characteristics of datastreams
A
- Incomplete Data: Not all data is available at once.
- High-Speed: Data arrives very quickly.
- Limited Storage: The system can only store a small part of the stream.
- Huge Volume: Continuous, possibly infinite data.
- One-Pass Algorithms: Can only scan data once.
- Non-Stationary Distribution: Requires fast, real-time processing.
2
Q
Handling data streams
A
- Online learning
- Sampling
- Windowing functions
3
Q
Online learning
A
Continuously update model with new data.
- Small updates: make small updates as new data arrives.
- Initial training: use initial data to train the model.
- Concept drift: model adapts to new patterns over time
4
Q
Sampling datastreams
A
- Fixed proportion sampling: store a fixed ratio of the data
- Random sampling: maintain a random sample of fixed size
- Reservoir sampling: store first s elements, for each new element, replace an old one with a certain probability
5
Q
Windowing models
A
- Sliding window: keep only most recent data items
- Tumbling window: process data in non-overlapping batches
- Hopping window: process overlapping data chunks
- Exponentially decaying window: give more weight to recent data
6
Q
Querying datastreams
A
- Filtering: select elements with specific properties
- Counting distinct elements: count unique elements in the stream
- Estimating moments: calculate statistics like average and std
- Finding frequent elements: identify popular items
7
Q
Filtering datastreams
A
- Bloom filters: efficiently test membership of elements in a set using a bit array. All set to 0 and using k independent hash functions
- Counting distinct elements: use approximate methods like the Flajolet-Martin algorithm
8
Q
A