Streaming Flashcards
Data streaming
processing continuously generated data in near-real time so organizations can react instantly to changes
2 reasons data streaming is important in modern big data systems
- Businesses need faster access to insights
- Processing workload is spread more evenly over time
5 common use cases
- Operational monitoring (temperatures, fan speeds in data center)
- User activity on websites for fast personalized content
- Automatically reacting to social media activity
- IoT devices generating constant streams of data
- Real-time bidding among advertisers to decide which ad to show
Approximations
Only a portion of the dataset can be kept in memory, so results sometimes have to be approximated when there is too much data
3 types of Data Streaming Models
- Time Series Model (each item is a new value that updates the state)
- Cash Register Model (each item increments the state)
- Turnstile Model (items can increase or decrease the state)
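A minimal Python sketch of the three models, using a hypothetical per-key counter as the state:

```python
from collections import defaultdict

state = defaultdict(int)

# Time Series Model: each arriving item is the new value for its key
def time_series_update(key, value):
    state[key] = value

# Cash Register Model: items can only increment the state
def cash_register_update(key, amount):
    assert amount >= 0
    state[key] += amount

# Turnstile Model: items can increment or decrement the state
def turnstile_update(key, delta):
    state[key] += delta  # delta may be negative
```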
4 main layers of streaming architecture (CADM)
Collection tier
Analysis tier
Data Access tier
Message queuing tier (sits between the collection and analysis tiers)
2 other optional layers of streaming architecture
In-memory storage (supports analysis)
Long-term storage (keeps data for future batch processing)
What is collection tier made of?
Multiple edge servers that receive data from external sources (normally TCP/IP-based over HTTP, often in JSON format)
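A rough sketch of a collection-tier edge server accepting JSON over HTTP (the path and payload shape are invented, not from any specific system):

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class IngestHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read and parse the JSON event sent by an external source
        length = int(self.headers.get("Content-Length", 0))
        event = json.loads(self.rfile.read(length))
        # In a real system the event would be forwarded to the message queue
        print("received event:", event)
        self.send_response(202)  # accepted for asynchronous processing
        self.end_headers()

HTTPServer(("0.0.0.0", 8080), IngestHandler).serve_forever()
```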
Producer-broker-consumer concept
Producer is the collection tier
Broker is the message queuing tier (often many brokers spread across nodes)
Consumer is the analysis tier
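A toy in-process version of the concept, with Python's queue.Queue standing in for the broker:

```python
import queue
import threading

broker = queue.Queue()  # stands in for the message queuing tier

def producer():  # collection tier pushes events to the broker
    for i in range(5):
        broker.put({"event_id": i})

def consumer():  # analysis tier pulls events from the broker
    for _ in range(5):
        print("processed", broker.get())

p = threading.Thread(target=producer)
c = threading.Thread(target=consumer)
p.start(); c.start()
p.join(); c.join()
```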
3 main types of message delivery semantics
- Exactly once (message delivered and processed exactly once)
- At most once (message may be lost, but is never processed more than once)
- At least once (no message is lost, but duplicates can occur)
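A sketch of how at-least-once delivery is often made effectively exactly-once: deduplicate on a message ID (the ID field here is an assumption, not prescribed):

```python
processed_ids = set()

def handle(message):
    # Under at-least-once delivery the same message may arrive twice;
    # tracking IDs makes processing idempotent (effectively exactly once).
    if message["id"] in processed_ids:
        return  # redelivered duplicate, skip it
    processed_ids.add(message["id"])
    print("processing", message["id"])

handle({"id": 1})
handle({"id": 1})  # duplicate delivery: ignored
```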
Analysis tier
Heart of the streaming architecture; adopts the continuous query model and uses algorithms designed for the specific streaming problem
Continuous query model
query that is issued once and then continuously executed against the data, often maintaining state (results are regularly pushed to the client)
ex: a security system where the query filters sensor data for human movement
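A minimal sketch of the security-system example as a continuous query (the reading fields are invented):

```python
def continuous_query(sensor_stream, push_to_client):
    motion_events = 0  # state maintained for the lifetime of the query
    for reading in sensor_stream:  # executes continuously as data arrives
        if reading["type"] == "motion" and reading["source"] == "human":
            motion_events += 1
            push_to_client({"alert": "human movement", "count": motion_events})

readings = [{"type": "motion", "source": "human"},
            {"type": "motion", "source": "cat"},
            {"type": "motion", "source": "human"}]
continuous_query(iter(readings), print)  # pushes two alerts to the "client"
```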
Windowing
Carry out analyses on a per-window basis instead of a simple per-item basis
Sliding windows
define the interval of analysis based on time: a window length and how often the window advances
Fixed windows
analyze the last 5 minutes of data every 5 minutes (period = window length)
Overlapping windows
analyze the last 5 minutes of data every 2 minutes (period < window length)
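A sketch of the window bucketing arithmetic, using the 5-minute/2-minute numbers from these cards; with PERIOD equal to WINDOW it degenerates to fixed windows:

```python
WINDOW = 300   # window length: 5 minutes of data
PERIOD = 120   # fire every 2 minutes; PERIOD < WINDOW means windows overlap

def windows_for(ts):
    """Start times of every window that an event with timestamp `ts` falls into."""
    starts = []
    start = (ts // PERIOD) * PERIOD   # most recent window start at or before ts
    while start > ts - WINDOW:        # event still inside [start, start + WINDOW)
        starts.append(start)
        start -= PERIOD
    return starts

print(windows_for(250))  # [240, 120, 0] -> event belongs to three overlapping windows
```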
Sampling windows
analyze the last 2 minutes of data every 5 minutes (period > window length)
taking samples, not all of the data!
data-driven windows
process data only while a session is active, plus a set time after it ends
window lengths are not known ahead of time
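A toy session-windowing sketch, assuming a 60-second gap closes the session:

```python
SESSION_GAP = 60  # close the session 60s after the last event (assumed threshold)

def sessionize(timestamps):
    """Group sorted event timestamps into sessions of unknown length."""
    sessions, current = [], []
    for ts in timestamps:
        if current and ts - current[-1] > SESSION_GAP:
            sessions.append(current)   # gap too large: previous session ended
            current = []
        current.append(ts)
    if current:
        sessions.append(current)
    return sessions

print(sessionize([0, 10, 30, 200, 220]))  # -> [[0, 10, 30], [200, 220]]
```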
Event time vs Stream time
Event time - when event actually occurs (gold standard of windowing)
stream time - when event enters the streaming system
the time difference between the two is called skew
Watermark
Captures progress of event time completeness as processing time progresses
Perfect watermark
guarantees that no data older than the watermark will ever arrive (no late data)
Heuristic watermark
estimates progress based on information available about the input stream (cheaper and faster, but late data can slip through)
Allowed Lateness
policy for accepting late data, since late data is common
ex: accept data up to 5 minutes late, discard anything later
note - the higher the tolerance, the longer data must be buffered
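A sketch tying skew, a heuristic watermark, and allowed lateness together (the max-event-time heuristic is one simple choice, not the only one):

```python
ALLOWED_LATENESS = 300  # seconds of lateness tolerated behind the watermark

watermark = 0  # heuristic estimate of event-time progress

def on_event(event_time, stream_time):
    global watermark
    skew = stream_time - event_time          # event time vs. stream time
    watermark = max(watermark, event_time)   # heuristic: max event time seen
    if event_time < watermark - ALLOWED_LATENESS:
        return "discarded"                   # too late, outside the policy
    return f"accepted (skew={skew}s)"

print(on_event(100, 130))    # accepted
print(on_event(1000, 1030))  # accepted, watermark advances to 1000
print(on_event(100, 1060))   # discarded: more than 5 minutes behind watermark
```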
Accumulation strategy
defines how the results of successive window firings are combined (e.g., discard previous results or accumulate them)
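A sketch contrasting two common strategies, discarding vs. accumulating, over the successive firings of a single window (the pane values are made up):

```python
panes = [3, 4, 2]  # per-pane counts produced by successive firings of one window

# Discarding: each firing emits only what arrived since the last firing
discarding = panes                 # -> [3, 4, 2]

# Accumulating: each firing emits the running total so far
accumulating, total = [], 0
for p in panes:
    total += p
    accumulating.append(total)     # -> [3, 7, 9]

print(discarding, accumulating)
```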
Single-pass algorithms
once examined, items discarded
ex: count number of elements
Approximated algorithms
sampling and random projections, used when exact one-pass algorithms are infeasible
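Reservoir sampling is one well-known single-pass sampling technique: it keeps a uniform random sample of fixed size k without storing the stream:

```python
import random

def reservoir_sample(stream, k):
    """Uniform random sample of k items from a stream of unknown length."""
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)       # fill the reservoir first
        else:
            j = random.randint(0, i)  # replace with decreasing probability k/(i+1)
            if j < k:
                sample[j] = item
    return sample

print(reservoir_sample(range(1_000_000), 5))
```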
3 options for data storage after its been analyzed
discard it, feed it back into the streaming pipeline, or store it for future use
Data access tier
exposes the analyzed data to consumers
4 common protocols for data collection and data access tiers
webhooks, HTTP, Server-Sent Events (SSE), WebSockets
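A rough sketch of the data access tier using Server-Sent Events (endpoint and payload invented): the server keeps the connection open and pushes results to the client:

```python
import json
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

class SSEHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/event-stream")
        self.end_headers()
        for i in range(3):  # push an analysis result every second
            payload = json.dumps({"result": i})
            self.wfile.write(f"data: {payload}\n\n".encode())  # SSE frame format
            self.wfile.flush()
            time.sleep(1)

HTTPServer(("0.0.0.0", 8081), SSEHandler).serve_forever()
```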