Streaming Flashcards
Data streaming
Processing continuously generated data in near-real time so organizations can react to changes as they happen
2 reasons data streaming is important in modern big data systems
- Businesses need faster access to insights
- Processing workload is spread more evenly over time
5 common use cases
- Operational monitoring (temperatures, fan speeds in data center)
- User activity on websites for fast personalized content
- Automatically react to social media
- IoT generates constant streams of data
- Real-time bidding amongst advertising agents to decide which ad to show
Approximations
Only a portion of the dataset can be kept in memory, so when there is too much data, results sometimes have to be approximated
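The card does not name a specific approximation method. As one illustration, the sketch below uses reservoir sampling (a swapped-in technique, not taken from the card) to keep a fixed-size uniform random sample of a stream that is too large to hold in memory. The names `reservoir_sample`, `k`, and `rng` are hypothetical.

```python
import random


def reservoir_sample(stream, k, rng):
    """Keep a uniform random sample of k items using O(k) memory."""
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample


# Approximate the mean of a large stream from a 100-item sample.
sample = reservoir_sample(range(100_000), k=100, rng=random.Random(42))
approx_mean = sum(sample) / len(sample)
```

The estimate is computed from 100 stored items only; the full stream is never kept in memory.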
3 types of Data Streaming Models
- Time Series Model (state updates)
- Cash Register Model (increment state)
- Turnstile (state increase or decrease)
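The three models differ only in how an incoming item changes the stored state. A minimal sketch, assuming each stream item is a `(key, value)` pair and using a hypothetical `apply_update` helper:

```python
from collections import defaultdict


def apply_update(state, model, key, value):
    """Apply one stream item (key, value) to state under the given model."""
    if model == "time_series":
        # Each item overwrites the state for its key.
        state[key] = value
    elif model == "cash_register":
        # Only non-negative increments are allowed.
        if value < 0:
            raise ValueError("cash register model allows increments only")
        state[key] += value
    elif model == "turnstile":
        # Increments and decrements are both allowed.
        state[key] += value
    else:
        raise ValueError(f"unknown model: {model}")
    return state


stream = [("a", 5), ("b", 2), ("a", 3)]

time_series = defaultdict(int)
cash_register = defaultdict(int)
for key, value in stream:
    apply_update(time_series, "time_series", key, value)
    apply_update(cash_register, "cash_register", key, value)

# The turnstile stream additionally contains a decrement.
turnstile = defaultdict(int)
for key, value in stream + [("a", -4)]:
    apply_update(turnstile, "turnstile", key, value)
```

For the same input, the time series state for "a" ends at 3 (last value wins), the cash register state ends at 8 (sum of increments), and the turnstile state ends at 4 after the decrement.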
4 main layers of streaming architecture (CADM)
Collection tier
Analysis tier
Data Access tier
Message queuing tier (sits between the collection and analysis tiers)
2 other optional layers of streaming architecture
In-memory storage (supports analysis)
Long-term storage (keeps data for future batch processing)
What is collection tier made of?
Multiple edge servers that receive data from external sources (normally over TCP/IP using the HTTP protocol, often in JSON format)
Producer-broker-consumer concept
Producer is collection tier
Broker is the message queue (many brokers spread across nodes)
Consumer is Analysis tier
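The roles above can be sketched in a single process. This is a toy illustration, not a real deployment: Python's standard `queue.Queue` stands in for the broker, and the event fields (`sensor`, `temp_c`) are made up for the example.

```python
import json
import queue
import threading

broker = queue.Queue()  # stands in for the message queue (broker)
SENTINEL = None         # marks the end of the stream in this toy example
results = []


def producer(events):
    """Collection tier: receives raw events and publishes them as JSON."""
    for event in events:
        broker.put(json.dumps(event))
    broker.put(SENTINEL)


def consumer():
    """Analysis tier: pulls messages from the broker and processes them."""
    while True:
        message = broker.get()
        if message is SENTINEL:
            break
        event = json.loads(message)
        results.append(event["temp_c"])


events = [
    {"sensor": "rack-1", "temp_c": 21.5},
    {"sensor": "rack-2", "temp_c": 24.0},
]
worker = threading.Thread(target=consumer)
worker.start()
producer(events)
worker.join()
```

The producer never calls the consumer directly; the queue decouples the two, which is the point of placing a broker between the tiers.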
3 main types of message delivery semantics
- Exactly once (message delivered and processed once)
- At most once (a message may be lost, but is never processed more than once)
- At least once (duplicates may occur, but no message is lost)
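The three semantics can be compared with a small deterministic simulation. The sketch below makes several assumptions that are not in the cards: a sender that retries until it sees an acknowledgement, failures injected through the hypothetical parameters `drop_first_attempt` and `lose_first_ack`, and exactly-once approximated by at-least-once delivery plus deduplication on a message id.

```python
def deliver(messages, semantics, drop_first_attempt=frozenset(),
            lose_first_ack=frozenset()):
    """Simulate delivery of (msg_id, payload) pairs; return what was processed."""
    processed = []
    seen_ids = set()

    def consumer_receive(msg_id, payload):
        if semantics == "exactly_once":
            # Deduplicate on message id so retries are not processed twice.
            if msg_id in seen_ids:
                return
            seen_ids.add(msg_id)
        processed.append(payload)

    for msg_id, payload in messages:
        attempt = 0
        while True:
            attempt += 1
            delivered = not (attempt == 1 and msg_id in drop_first_attempt)
            if delivered:
                consumer_receive(msg_id, payload)
            acked = delivered and not (attempt == 1 and msg_id in lose_first_ack)
            # At most once never retries; the others retry until acknowledged.
            if semantics == "at_most_once" or acked:
                break
    return processed


messages = [(1, "a"), (2, "b"), (3, "c")]
faults = {"drop_first_attempt": {2}, "lose_first_ack": {3}}

at_most = deliver(messages, "at_most_once", **faults)    # ['a', 'c']
at_least = deliver(messages, "at_least_once", **faults)  # ['a', 'b', 'c', 'c']
exactly = deliver(messages, "exactly_once", **faults)    # ['a', 'b', 'c']
```

Message 2 is dropped on its first attempt, so it is lost without retries. The acknowledgement for message 3 is lost, so the retry produces a duplicate unless the consumer deduplicates.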
Analysis tier
Heart of the streaming architecture; it adopts a continuous query model and uses algorithms designed for the specific streaming problem
Continuous query model
Query that is issued once and then continuously executed against the data, often maintaining state (results are regularly pushed to the client)
Example: a security system whose query filters sensor data for human movement
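A minimal sketch of the security-system example, assuming each reading is a dict with a hypothetical `motion` score between 0 and 1 and that results are pushed to the client through a callback. The query is registered once, runs on every reading as it arrives, and keeps a running count as its state.

```python
def continuous_query(readings, threshold, push):
    """Issued once; evaluated against every reading as it arrives."""
    detections = 0  # state maintained across the whole stream
    for reading in readings:
        if reading["motion"] >= threshold:
            detections += 1
            # Push the result to the client instead of waiting to be polled.
            push({"sensor": reading["sensor"], "detections_so_far": detections})
    return detections


sensor_stream = iter([
    {"sensor": "door", "motion": 0.10},
    {"sensor": "hall", "motion": 0.92},
    {"sensor": "door", "motion": 0.85},
    {"sensor": "hall", "motion": 0.30},
])

alerts = []
total = continuous_query(sensor_stream, threshold=0.8, push=alerts.append)
```

Here `sensor_stream` is a finite iterator so the example terminates; in a real system the source would be unbounded and the loop would run indefinitely.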
Windowing
Carry out analyses on a per-window basis instead of a simple per-item basis
Sliding windows
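Window of a fixed time length that advances by a shorter slide interval, so consecutive windows overlap (e.g., analyze the last 5 minutes of data every minute)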
Fixed windows
Analyze the last 5 minutes of data every 5 minutes (windows do not overlap)
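Both window types can be expressed with one helper: a fixed window is the case where the slide equals the window length, and a sliding window uses a slide shorter than the length. The sketch assumes events are `(minute, value)` pairs with non-negative timestamps and that windows start at minute 0; `windowed`, `size`, and `slide` are hypothetical names.

```python
def windowed(events, size, slide):
    """Group (timestamp, value) events into windows [start, start + size)."""
    last_ts = max(ts for ts, _ in events)
    result = []
    start = 0
    while start <= last_ts:
        values = [v for ts, v in events if start <= ts < start + size]
        result.append((start, values))
        start += slide
    return result


events = [(0, 1), (2, 2), (5, 3), (7, 4), (11, 5)]  # (minute, value)

# Fixed: 5-minute windows evaluated every 5 minutes, no overlap.
fixed = windowed(events, size=5, slide=5)
# [(0, [1, 2]), (5, [3, 4]), (10, [5])]

# Sliding: 5-minute windows evaluated every 2 minutes, so they overlap.
sliding = windowed(events, size=5, slide=2)
# [(0, [1, 2]), (2, [2, 3]), (4, [3, 4]), (6, [4]), (8, [5]), (10, [5])]
```

In the fixed case every event lands in exactly one window; in the sliding case the same event can appear in several consecutive windows.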