Streaming Flashcards

1
Q

Data streaming

A

processing continuously generated data in near-real time so organizations can react instantly to changes

2
Q

2 reasons data streaming is important in modern big data systems

A
  1. Businesses need faster access to insights
  2. Processing workload is spread more evenly over time
3
Q

5 common use cases for data streaming

A
  1. Operational monitoring (e.g., temperatures and fan speeds in a data center)
  2. Tracking user activity on websites to serve fast, personalized content
  3. Reacting automatically to social media activity
  4. Processing the constant streams of data generated by IoT devices
  5. Real-time bidding among advertising agents to decide which ad to show
4
Q

Approximations

A

Only a portion of the dataset can be kept in memory, so when there is too much data the results sometimes have to be approximated

5
Q

3 types of Data Streaming Models

A
  1. Time Series Model (state is overwritten with each update)
  2. Cash Register Model (state can only be incremented)
  3. Turnstile Model (state can increase or decrease)
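
The distinction is easiest to see as three different rules for updating a state vector. A minimal sketch (not from the deck; the per-key state and update functions are made up for illustration):

```python
from collections import defaultdict

state = defaultdict(int)   # hypothetical per-key state, e.g. a count per sensor

def time_series_update(key, value):
    # Time Series Model: each arriving item overwrites the state for its key
    state[key] = value

def cash_register_update(key, delta):
    # Cash Register Model: state can only be incremented (delta >= 0)
    assert delta >= 0
    state[key] += delta

def turnstile_update(key, delta):
    # Turnstile Model: state may increase or decrease
    state[key] += delta
```
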
6
Q

4 main layers of streaming architecture (CADM)

A

Collection tier
Message queuing tier (moves data between the other tiers)
Analysis tier
Data access tier

7
Q

2 other optional layers of streaming architecture

A

In-memory storage (supports the analysis tier)
Long-term storage (keeps data for future batch processing)

8
Q

What is the collection tier made of?

A

Multiple edge servers that receive data from external sources (normally TCP/IP-based over the HTTP protocol, often in JSON format)

9
Q

Producer-broker-consumer concept

A

Producer is the collection tier
Broker is the message queue (many brokers spread across nodes)
Consumer is the analysis tier
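
A minimal single-process sketch of the concept, using a Python queue as a stand-in for the broker (a real deployment would use a dedicated message-queuing system such as Kafka or RabbitMQ):

```python
import queue
import threading

broker = queue.Queue()                 # stand-in for the message-queuing tier

def producer():                        # collection tier: publishes events
    for i in range(5):
        broker.put({"event_id": i, "payload": f"reading-{i}"})
    broker.put(None)                   # sentinel: no more events

def consumer():                        # analysis tier: pulls and processes events
    while (msg := broker.get()) is not None:
        print("processing", msg)

threading.Thread(target=producer).start()
consumer()
```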

10
Q

3 main types of message delivery semantics

A
  1. Exactly once (each message is delivered and processed exactly once)
  2. At most once (a message may be lost, but is never processed more than once)
  3. At least once (duplicates may occur, but no messages are lost)
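
At-least-once delivery is often paired with an idempotent consumer to get effectively-exactly-once results. A sketch of that idea, assuming messages carry a unique "id" field (the message shape is made up):

```python
seen_ids = set()    # IDs of processed messages (in practice kept in durable storage)

def handle(message):
    # The broker may redeliver a message under at-least-once delivery,
    # so the consumer deduplicates by ID to get exactly-once *effects*.
    if message["id"] in seen_ids:
        return                               # duplicate: already processed
    seen_ids.add(message["id"])
    print("processing", message["payload"])

handle({"id": 1, "payload": "a"})
handle({"id": 1, "payload": "a"})            # redelivered duplicate is ignored
```
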
11
Q

Analysis tier

A

The heart of the streaming architecture; it adopts the continuous query model and uses algorithms designed specifically for the streaming problem

12
Q

Continuous query model

A

A query that is issued once and then continuously executed against arriving data, often maintaining state (results are regularly pushed to the client)

Ex: security system where the query filters sensor data for human movement
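
A sketch of that security-system example with a made-up sensor stream; the query is defined once and then evaluated against every reading as it arrives:

```python
def sensor_stream():
    # stand-in for an unbounded stream of security-sensor readings
    yield {"sensor": "door-1", "motion": False}
    yield {"sensor": "hall-2", "motion": True}
    yield {"sensor": "door-1", "motion": True}

def continuous_query(stream):
    # defined once, then evaluated against every item as it arrives;
    # matching results are pushed out as they occur
    for reading in stream:
        if reading["motion"]:
            yield reading

for alert in continuous_query(sensor_stream()):
    print("ALERT:", alert)
```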

13
Q

Windowing

A

Carry out analyses on a per-window basis instead of a simple per-item basis

14
Q

Sliding windows

A

define the interval of analysis based on time

15
Q

Fixed windows

A

analyze last 5 minutes of data every 5 minutes

16
Q

Overlapping windows

A

analyze last 5 minutes of data every 2 minutes

17
Q

Sampling windows

A

analyze last 2 minutes of data every 5 minutes

taking samples, not all of it!
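
The time-based window types above differ only in how the window length relates to how often a new window starts. A rough batch simulation of that idea (a real streaming engine would compute this incrementally; the event data here is made up):

```python
def windows(events, length, period):
    """Group (timestamp, value) events into windows `length` seconds long,
    starting a new window every `period` seconds.
      period == length -> fixed windows
      period <  length -> overlapping windows
      period >  length -> sampling windows (gaps between windows)
    """
    t0 = min(t for t, _ in events)
    t_end = max(t for t, _ in events)
    out, start = [], t0
    while start <= t_end:
        out.append([v for t, v in events if start <= t < start + length])
        start += period
    return out

events = [(0, "a"), (1, "b"), (3, "c"), (6, "d"), (9, "e")]
print(windows(events, length=5, period=5))   # fixed:    [['a','b','c'], ['d','e']]
print(windows(events, length=5, period=2))   # overlapping windows
print(windows(events, length=2, period=5))   # sampling: [['a','b'], ['d']]
```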

18
Q

data-driven windows

A

process only while a session is active, and for x time after it ends

window lengths are not known ahead of time

19
Q

Event time vs Stream time

A

Event time - when the event actually occurs (the gold standard for windowing)
Stream time - when the event enters the streaming system

The difference between the two is called skew
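
A tiny illustration with made-up timestamps:

```python
from datetime import datetime

event_time  = datetime(2024, 1, 1, 12, 0, 0)   # when the event actually occurred
stream_time = datetime(2024, 1, 1, 12, 0, 7)   # when it entered the streaming system

skew = stream_time - event_time
print(skew)                                    # 0:00:07 -> the event arrived 7 s late
```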

20
Q

Watermark

A

captures the progress of event-time completeness as processing time advances

21
Q

Perfect watermark

A

guarantees no late data ever arrives

22
Q

Heuristic watermark

A

estimates progress based on information available about the input stream (faster, but late data can still arrive)

23
Q

Allowed Lateness

A

policy for accepting late data, since lateness is common

ex: accept data up to 5 minutes late; everything later is discarded

note - the higher the tolerance, the longer data must be buffered
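
A sketch of such a policy, assuming numeric timestamps in seconds and the 5-minute tolerance from the example:

```python
ALLOWED_LATENESS = 300     # seconds; the 5-minute policy from the example

def accept(event_time, watermark):
    # keep an event only if it is no more than ALLOWED_LATENESS behind the
    # watermark; anything older is discarded and no longer buffered
    return event_time >= watermark - ALLOWED_LATENESS

print(accept(event_time=1000, watermark=1200))   # True:  200 s late -> kept
print(accept(event_time=1000, watermark=1400))   # False: 400 s late -> discarded
```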

24
Q

Accumulation strategy

A

defines how intermediate results must be aggregated

25
Q

Single-pass algorithms

A

once examined, items are discarded

ex: count number of elements
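
A sketch of a single-pass computation that keeps only constant-size state (here a count and a running sum):

```python
def running_stats(stream):
    # single pass: each item is examined once and then discarded;
    # only constant-size state (a count and a sum) is kept
    count, total = 0, 0.0
    for x in stream:
        count += 1
        total += x
    return count, (total / count if count else 0.0)

print(running_stats(iter([3, 1, 4, 1, 5])))      # (5, 2.8)
```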

26
Q

Approximated algorithms

A

sampling and random projections, used when exact single-pass algorithms don’t work
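
Reservoir sampling is a classic example of the sampling approach: it keeps a uniform random sample of k items from a stream of unknown length in one pass. A minimal sketch:

```python
import random

def reservoir_sample(stream, k):
    # keep a uniform random sample of k items from a stream of unknown
    # length, using O(k) memory and a single pass (Algorithm R)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            j = random.randint(0, i)
            if j < k:
                sample[j] = item
    return sample

print(reservoir_sample(range(1_000_000), k=5))
```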

27
Q

3 options for data storage after its been analyzed

A

discard it, push it back into the streaming pipeline, or save it for future use

28
Q

Data access tier

A

exposes the analyzed data to consumers

29
Q

4 common protocols for data collection and data access tiers

A

webhooks, HTTP, Server-sent events (SSE), websockets
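
As one illustration, a webhook is simply an HTTP endpoint that the data source POSTs events to. A minimal receiver sketch using only the Python standard library (the port and the JSON body shape are assumptions):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
import json

class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # the data source POSTs each event to this endpoint as JSON
        length = int(self.headers.get("Content-Length", 0))
        event = json.loads(self.rfile.read(length) or b"{}")
        print("received event:", event)
        self.send_response(204)              # acknowledge with an empty response
        self.end_headers()

HTTPServer(("localhost", 8000), WebhookHandler).serve_forever()
```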