Streaming Flashcards
Data streaming
processing continuously generated data in near-real time so organizations can react instantly to changes
2 reasons data streaming is important in modern big data systems
- Businesses need faster access to insights
- Processing workload is spread more evenly over time
5 common use cases
- Operational monitoring (temperatures, fan speeds in data center)
- User activity on websites for fast personalized content
- Automatically reacting to social media activity
- IoT devices generating constant streams of data
- Real-time bidding among advertisers to decide which ad to show
Approximations
Only a portion of the dataset can be kept in memory, so results sometimes have to be approximated when there is too much data
3 types of Data Streaming Models
- Time Series Model (each item is a new value that updates the state)
- Cash Register Model (each item increments the state)
- Turnstile Model (items can increase or decrease the state)
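A minimal Python sketch of the three models, using a hypothetical per-key counter as the state:

```python
from collections import defaultdict

state = defaultdict(int)

# Time Series Model: each arriving item is the new value for its key
def time_series_update(key, value):
    state[key] = value

# Cash Register Model: items can only increment the state
def cash_register_update(key, amount):
    assert amount >= 0
    state[key] += amount

# Turnstile Model: items can increment or decrement the state
def turnstile_update(key, delta):
    state[key] += delta  # delta may be negative
```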
4 main layers of streaming architecture (CADM)
Collection tier
Analysis tier
Data Access tier
Message queuing tier (sits between the collection and analysis tiers)
2 other optional layers of streaming architecture
In-memory storage (supports analysis)
Long-term storage (keeps data for future batch processing)
What is collection tier made of?
Multiple edge servers that receive data from external sources (normally TCP/IP-based over HTTP, often in JSON format)
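A rough sketch of a collection-tier edge server accepting JSON over HTTP (the path and payload shape are invented, not from any specific system):

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class IngestHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read and parse the JSON event sent by an external source
        length = int(self.headers.get("Content-Length", 0))
        event = json.loads(self.rfile.read(length))
        # In a real system the event would be forwarded to the message queue
        print("received event:", event)
        self.send_response(202)  # accepted for asynchronous processing
        self.end_headers()

HTTPServer(("0.0.0.0", 8080), IngestHandler).serve_forever()
```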
Producer-broker-consumer concept
Producer is the collection tier
Broker is the message queuing tier (often many brokers spread across nodes)
Consumer is the analysis tier
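A toy in-process version of the concept, with Python's queue.Queue standing in for the broker:

```python
import queue
import threading

broker = queue.Queue()  # stands in for the message queuing tier

def producer():  # collection tier pushes events to the broker
    for i in range(5):
        broker.put({"event_id": i})

def consumer():  # analysis tier pulls events from the broker
    for _ in range(5):
        print("processed", broker.get())

p = threading.Thread(target=producer)
c = threading.Thread(target=consumer)
p.start(); c.start()
p.join(); c.join()
```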
3 main types of message delivery semantics
- Exactly once (message delivered and processed exactly once)
- At most once (message may be lost, but is never processed more than once)
- At least once (no message is lost, but duplicates can occur)
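A sketch of how at-least-once delivery is often made effectively exactly-once: deduplicate on a message ID (the ID field here is an assumption, not prescribed):

```python
processed_ids = set()

def handle(message):
    # Under at-least-once delivery the same message may arrive twice;
    # tracking IDs makes processing idempotent (effectively exactly once).
    if message["id"] in processed_ids:
        return  # redelivered duplicate, skip it
    processed_ids.add(message["id"])
    print("processing", message["id"])

handle({"id": 1})
handle({"id": 1})  # duplicate delivery: ignored
```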
Analysis tier
Heart of the streaming architecture; adopts the continuous query model and uses algorithms designed for the specific streaming problem
Continuous query model
query that is issued once and then continuously executed against the data, often maintaining state (results are regularly pushed to the client)
ex: a security system where the query filters sensor data for human movement
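A minimal sketch of the security-system example as a continuous query (the reading fields are invented):

```python
def continuous_query(sensor_stream, push_to_client):
    motion_events = 0  # state maintained for the lifetime of the query
    for reading in sensor_stream:  # executes continuously as data arrives
        if reading["type"] == "motion" and reading["source"] == "human":
            motion_events += 1
            push_to_client({"alert": "human movement", "count": motion_events})

readings = [{"type": "motion", "source": "human"},
            {"type": "motion", "source": "cat"},
            {"type": "motion", "source": "human"}]
continuous_query(iter(readings), print)  # pushes two alerts to the "client"
```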
Windowing
Carry out analyses on a per-window basis instead of a simple per-item basis
Sliding windows
define the interval of analysis based on time: a window length and how often the window advances
Fixed windows
analyze the last 5 minutes of data every 5 minutes (period = window length)
Overlapping windows
analyze the last 5 minutes of data every 2 minutes (period < window length)
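A sketch of the window bucketing arithmetic, using the 5-minute/2-minute numbers from these cards; with PERIOD equal to WINDOW it degenerates to fixed windows:

```python
WINDOW = 300   # window length: 5 minutes of data
PERIOD = 120   # fire every 2 minutes; PERIOD < WINDOW means windows overlap

def windows_for(ts):
    """Start times of every window that an event with timestamp `ts` falls into."""
    starts = []
    start = (ts // PERIOD) * PERIOD   # most recent window start at or before ts
    while start > ts - WINDOW:        # event still inside [start, start + WINDOW)
        starts.append(start)
        start -= PERIOD
    return starts

print(windows_for(250))  # [240, 120, 0] -> event belongs to three overlapping windows
```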
Sampling windows
analyze the last 2 minutes of data every 5 minutes (period > window length)
taking samples, not all of the data!
data-driven windows
process data only while a session is active, plus a set time after it ends
window lengths are not known ahead of time
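A toy session-windowing sketch, assuming a 60-second gap closes the session:

```python
SESSION_GAP = 60  # close the session 60s after the last event (assumed threshold)

def sessionize(timestamps):
    """Group sorted event timestamps into sessions of unknown length."""
    sessions, current = [], []
    for ts in timestamps:
        if current and ts - current[-1] > SESSION_GAP:
            sessions.append(current)   # gap too large: previous session ended
            current = []
        current.append(ts)
    if current:
        sessions.append(current)
    return sessions

print(sessionize([0, 10, 30, 200, 220]))  # -> [[0, 10, 30], [200, 220]]
```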
Event time vs Stream time
Event time - when event actually occurs (gold standard of windowing)
stream time - when event enters the streaming system
the time difference between the two is called skew
Watermark
Captures progress of event time completeness as processing time progresses
Perfect watermark
guarantees that no data older than the watermark will ever arrive (no late data)
Heuristic watermark
estimates progress based on information available about the input stream (cheaper and faster, but late data can slip through)
Allowed Lateness
policy for accepting late data, since late data is common
ex: accept data up to 5 minutes late, discard anything later
note - the higher the tolerance, the longer data must be buffered
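A sketch tying skew, a heuristic watermark, and allowed lateness together (the max-event-time heuristic is one simple choice, not the only one):

```python
ALLOWED_LATENESS = 300  # seconds of lateness tolerated behind the watermark

watermark = 0  # heuristic estimate of event-time progress

def on_event(event_time, stream_time):
    global watermark
    skew = stream_time - event_time          # event time vs. stream time
    watermark = max(watermark, event_time)   # heuristic: max event time seen
    if event_time < watermark - ALLOWED_LATENESS:
        return "discarded"                   # too late, outside the policy
    return f"accepted (skew={skew}s)"

print(on_event(100, 130))    # accepted
print(on_event(1000, 1030))  # accepted, watermark advances to 1000
print(on_event(100, 1060))   # discarded: more than 5 minutes behind watermark
```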
Accumulation strategy
defines how the results of successive window firings are combined (e.g., discard previous results or accumulate them)
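A sketch contrasting two common strategies, discarding vs. accumulating, over the successive firings of a single window (the pane values are made up):

```python
panes = [3, 4, 2]  # per-pane counts produced by successive firings of one window

# Discarding: each firing emits only what arrived since the last firing
discarding = panes                 # -> [3, 4, 2]

# Accumulating: each firing emits the running total so far
accumulating, total = [], 0
for p in panes:
    total += p
    accumulating.append(total)     # -> [3, 7, 9]

print(discarding, accumulating)
```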
Single-pass algorithms
once examined, items discarded
ex: count number of elements
Approximated algorithms
sampling and random projections, used when exact one-pass algorithms are infeasible
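Reservoir sampling is one well-known single-pass sampling technique: it keeps a uniform random sample of fixed size k without storing the stream:

```python
import random

def reservoir_sample(stream, k):
    """Uniform random sample of k items from a stream of unknown length."""
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)       # fill the reservoir first
        else:
            j = random.randint(0, i)  # replace with decreasing probability k/(i+1)
            if j < k:
                sample[j] = item
    return sample

print(reservoir_sample(range(1_000_000), 5))
```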
3 options for data storage after its been analyzed
discard it, feed it back into the streaming pipeline, or store it for future use
Data access tier
exposes the analyzed data to consumers
4 common protocols for data collection and data access tiers
webhooks, HTTP, Server-Sent Events (SSE), WebSockets
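A rough sketch of the data access tier using Server-Sent Events (endpoint and payload invented): the server keeps the connection open and pushes results to the client:

```python
import json
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

class SSEHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/event-stream")
        self.end_headers()
        for i in range(3):  # push an analysis result every second
            payload = json.dumps({"result": i})
            self.wfile.write(f"data: {payload}\n\n".encode())  # SSE frame format
            self.wfile.flush()
            time.sleep(1)

HTTPServer(("0.0.0.0", 8081), SSEHandler).serve_forever()
```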