Streaming Flashcards

1
Q

What is data streaming in big data?

A

Data streaming refers to the continuous processing of data as it arrives, in real time. It involves simpler computations over small, continuously incoming data sets, as opposed to batch analysis of large stored data sets.

2
Q

Why is streaming data processing important?

A

Lower latency, enabling timely data analysis
More consistent resource consumption
Real-time decision making
Workload balancing across the system

3
Q

What are some use cases of data streaming?

A

Operational monitoring
Real-time user interaction in web analytics
Online advertising with real-time bidding
Social media analysis
IoT applications such as health monitoring

4
Q

What is the cash register model in data streams?

A

The cash register model represents data as increments to a variable, such as counting packets sent to IP addresses.
Updates can only INCREMENT the state.

5
Q

What is the turnstile model in data streaming?

A

The turnstile model allows updates that can both increment and decrement a variable, for example tracking people entering and exiting a gym.
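
A minimal Python sketch of the difference between the cash register and turnstile models. The function name `apply_updates` and the `(key, delta)` update format are assumptions made for illustration, not part of any streaming library.

```python
from collections import defaultdict


def apply_updates(updates, allow_decrements):
    """Apply (key, delta) updates to a per-key counter state.

    allow_decrements=False models the cash register (increments only);
    allow_decrements=True models the turnstile (increments and decrements).
    """
    state = defaultdict(int)
    for key, delta in updates:
        if delta < 0 and not allow_decrements:
            raise ValueError("cash register model only allows increments")
        state[key] += delta
    return dict(state)


# Cash register: packets sent to IP addresses, counts only grow.
packets = apply_updates(
    [("10.0.0.1", 1), ("10.0.0.2", 1), ("10.0.0.1", 1)],
    allow_decrements=False,
)

# Turnstile: people entering (+1) and leaving (-1) a gym.
gym = apply_updates(
    [("gym", 1), ("gym", 1), ("gym", -1)],
    allow_decrements=True,
)
```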

6
Q

How are data streaming systems classified based on latency tolerance?

A

Hard (no tolerance for delay)
Firm (some tolerance for delay, perhaps a few seconds)
Soft (results are still valuable when late, e.g. weather monitoring)

7
Q

What are the key characteristics of streaming data?

A

Generated continuously
One pass processing
Often requires approximation (memory and time constraints)
Unbounded (the stream is potentially infinite, so the computation never finishes)
Low latency

8
Q

What components are involved in data streaming architecture?

A

Data Collection
Messaging Queue
Analysis
Data Consumer
(in memory storage)
(long term storage)

9
Q

What is the role of message queuing in data streaming?

A

It facilitates the safe and organized exchange of data between the collection and analysis tiers, decoupling the system components

It is also worth mentioning that it acts as a Unified Log: in addition to managing data flows between components in the architecture, it provides an append-only record of events that consumers (the analysis tier, e.g. Apache Flink) can read.
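
The unified-log idea can be sketched as an append-only list with one read offset per consumer. This is a toy model under stated assumptions: the class `UnifiedLog` and its `append` and `poll` methods are hypothetical names, not the API of Kafka or any other system.

```python
class UnifiedLog:
    """Toy append-only event log with independent consumer offsets."""

    def __init__(self):
        self._events = []   # events are only ever appended, never modified
        self._offsets = {}  # consumer name -> index of next event to read

    def append(self, event):
        """Producer side: add an event and return its position in the log."""
        self._events.append(event)
        return len(self._events) - 1

    def poll(self, consumer, max_events=10):
        """Consumer side: return the next batch and advance this consumer's offset."""
        start = self._offsets.get(consumer, 0)
        batch = self._events[start:start + max_events]
        self._offsets[consumer] = start + len(batch)
        return batch


log = UnifiedLog()
for event in ["a", "b", "c"]:
    log.append(event)

first = log.poll("analysis", max_events=2)  # analysis tier reads two events
archive = log.poll("storage")               # storage reads from its own offset
rest = log.poll("analysis")                 # analysis continues where it stopped
```

Because each consumer tracks its own offset, the collection tier and the consumers stay decoupled: a slow consumer does not block a fast one.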

10
Q

What are common message queuing systems?

A

Apache Kafka (stores data in a log organized into topics) and Apache Flume (moves data to long-term storage such as HDFS)

11
Q

What are the message delivery semantics in streaming?

A

Exactly once (each event is read exactly one time: no data loss and no duplicates)
At most once (each event is read once or not at all, so data may be lost)
At least once (each event is guaranteed to be read, possibly more than once)
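
A small simulation of why at-least-once delivery produces duplicates, and how a consumer can still get exactly-once results by ignoring event IDs it has already seen. The function names and the `(event_id, payload)` format are assumptions for illustration only.

```python
def deliver_at_least_once(events, lost_acks):
    """Yield (event_id, payload) pairs, redelivering events whose ack was lost."""
    for event_id, payload in events:
        yield event_id, payload
        if event_id in lost_acks:
            # The sender never saw an acknowledgement, so it sends again.
            yield event_id, payload


def process_idempotently(deliveries):
    """Sum payloads, skipping any event ID that was already processed."""
    seen = set()
    total = 0
    for event_id, payload in deliveries:
        if event_id in seen:
            continue
        seen.add(event_id)
        total += payload
    return total


events = [(1, 10), (2, 20), (3, 30)]
deliveries = list(deliver_at_least_once(events, lost_acks={2}))

naive_total = sum(payload for _, payload in deliveries)  # counts event 2 twice
deduplicated_total = process_idempotently(deliveries)
```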

12
Q

What is a continuous query in data streaming?

A

A continuous query is issued once and then runs continuously over incoming data, updating its results in real time. It often requires state management for intermediate results.
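
As a hedged illustration, a running average is a simple continuous query: it is started once, emits an updated result for every new value, and keeps only a small intermediate state (a count and a sum). The generator name `running_average` is hypothetical.

```python
def running_average(stream):
    """Emit the average of all values seen so far, after each new value."""
    count = 0
    total = 0.0
    for value in stream:
        # The state is just two numbers, no matter how long the stream runs.
        count += 1
        total += value
        yield total / count


averages = list(running_average([2, 4, 6]))
```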

13
Q

What challenges are associated with continuous queries?

A

Must handle state updates efficiently with memory and time constraints

14
Q

What is windowing in data streaming?

A

Windowing divides a data stream into finite chunks for processing based on time or number of items
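
A minimal sketch of the two variants, using non-overlapping windows. It assumes events are `(timestamp, value)` pairs with integer timestamps; both function names are made up for this example.

```python
from collections import defaultdict


def time_windows(events, size):
    """Group (timestamp, value) events into windows [k*size, (k+1)*size)."""
    windows = defaultdict(list)
    for timestamp, value in events:
        window_start = (timestamp // size) * size
        windows[window_start].append(value)
    return dict(windows)


def count_windows(values, n):
    """Split a list of values into consecutive chunks of at most n items."""
    return [values[i:i + n] for i in range(0, len(values), n)]


by_time = time_windows([(1, "a"), (4, "b"), (5, "c"), (11, "d")], size=5)
by_count = count_windows([1, 2, 3, 4, 5], n=2)
```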

15
Q

What are sliding windows?

A

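Sliding windows are time-based windows that overlap: a new window starts at a regular interval that is shorter than the window length, so one event can belong to several windows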
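
A sketch of assigning events to overlapping windows. It assumes integer timestamps and windows that start at multiples of `slide`, beginning at 0; the function name and parameters are illustrative only.

```python
def sliding_windows(events, size, slide):
    """Map each window start to the values of events in [start, start + size)."""
    windows = {}
    for timestamp, value in events:
        # Latest window start at or before the event, then walk backwards
        # through every earlier window that still covers the event.
        start = (timestamp // slide) * slide
        while start >= 0 and start > timestamp - size:
            windows.setdefault(start, []).append(value)
            start -= slide
    return windows


# Windows of length 10 starting every 5 time units: each event lands in two.
assigned = sliding_windows([(7, "a"), (12, "b")], size=10, slide=5)
```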

16
Q

What are data-driven windows?

A

These windows are defined by data content and are typically used for session analysis, updating only when specific events occur
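
One common data-driven window is the session window, where a gap in activity closes the current session. The sketch below assumes that definition; the function name and the `gap` parameter are hypothetical.

```python
def session_windows(timestamps, gap):
    """Group event timestamps into sessions separated by more than `gap`."""
    sessions = []
    for timestamp in sorted(timestamps):
        if sessions and timestamp - sessions[-1][-1] <= gap:
            sessions[-1].append(timestamp)  # still within the current session
        else:
            sessions.append([timestamp])    # inactivity gap: start a new session
    return sessions


sessions = session_windows([1, 2, 3, 10, 11, 30], gap=5)
```

The window boundaries come from the data itself: no fixed length or interval is configured, only the maximum inactivity gap.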

17
Q

What is event time versus stream time?

A

Event time is when the data event occurs
Stream time is when data is processed

18
Q

What is a watermark in event time windowing?

A

A watermark indicates the progress of event time, used to manage how long a system waits for late data before closing a window
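
A simplified sketch of a watermark, assuming it is computed as the largest event time seen so far minus a fixed allowed delay. Real systems use more elaborate heuristics; the function name, parameters, and return format here are illustrative assumptions.

```python
def run_with_watermark(events, window_size, max_delay):
    """Process (event_time, value) pairs in arrival order.

    Returns (closed_windows, dropped_events, still_open_windows).
    """
    open_windows = {}
    closed = {}
    dropped = []
    watermark = float("-inf")
    for event_time, value in events:
        start = (event_time // window_size) * window_size
        if start + window_size <= watermark:
            dropped.append((event_time, value))  # too late: window already closed
        else:
            open_windows.setdefault(start, []).append(value)
        # The watermark only moves forward.
        watermark = max(watermark, event_time - max_delay)
        for window_start in sorted(open_windows):
            if window_start + window_size <= watermark:
                closed[window_start] = open_windows.pop(window_start)
    return closed, dropped, open_windows


arrivals = [(1, "a"), (12, "b"), (8, "c"), (16, "d"), (3, "e"), (27, "f")]
closed, dropped, still_open = run_with_watermark(arrivals, window_size=10, max_delay=5)
```

In this run the event at time 8 arrives out of order but is still accepted, because the watermark has not yet passed the end of its window. The event at time 3 arrives after that window has closed and is dropped.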

19
Q

What are triggers in event time windowing?

A

Triggers define when intermediate results for a window are processed, based on criteria like watermark progress or element counts

20
Q

What strategies are used for result accumulation in windowing?

A

Strategies include discarding (independent results)
accumulating (results build on previous ones)
accumulating with retraction (includes both totals and deltas)
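
The three strategies can be compared on one window whose trigger fires several times. The sketch assumes each firing contributes a partial sum (a "pane"), and that a retraction is expressed as the previously emitted total; the function and mode names are made up for illustration.

```python
def emit_panes(pane_sums, mode):
    """Return what is emitted at each trigger firing for one window."""
    outputs = []
    running = 0
    previous = None  # total emitted at the previous firing, if any
    for pane in pane_sums:
        running += pane
        if mode == "discarding":
            outputs.append(pane)                 # only the new data
        elif mode == "accumulating":
            outputs.append(running)              # total so far
        elif mode == "accumulating_retracting":
            outputs.append((running, previous))  # new total, total to retract
        else:
            raise ValueError(f"unknown mode: {mode}")
        previous = running
    return outputs


panes = [3, 9, 2]
discarding = emit_panes(panes, "discarding")
accumulating = emit_panes(panes, "accumulating")
retracting = emit_panes(panes, "accumulating_retracting")
```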

21
Q

What is reservoir sampling in data streaming?

A

It is a method to select a random SAMPLE from an incoming data stream, maintaining a fixed-size reservoir of items with probabilistic replacement as new data arrives
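
A sketch of the classic single-pass version (often called Algorithm R), in which the item at position i (counting from 0) replaces a random reservoir slot with probability k / (i + 1). The function name and the optional `rng` parameter are assumptions for this example.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return a uniform random sample of up to k items from a stream, in one pass."""
    if rng is None:
        rng = random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)   # fill the reservoir first
        else:
            j = rng.randint(0, i)    # uniform over 0..i, both ends inclusive
            if j < k:
                reservoir[j] = item  # replace a slot with probability k / (i + 1)
    return reservoir


sample = reservoir_sample(range(1000), k=10, rng=random.Random(0))
```

Memory use stays at k items regardless of stream length, and the stream length does not need to be known in advance.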

22
Q

What is HyperLogLog used for?

A

An algorithm for approximate counting of DISTINCT items in a data stream, trading some accuracy for space efficiency
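
A simplified HyperLogLog sketch, for intuition rather than production use. It assumes a 64-bit hash taken from SHA-1 and 2^p registers, and includes only the small-range correction; the class and method names are hypothetical.

```python
import hashlib
import math


class HyperLogLog:
    """Simplified HyperLogLog with 2**p registers (use p >= 7)."""

    def __init__(self, p=10):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m
        # Standard bias-correction constant, valid for m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        digest = hashlib.sha1(str(item).encode()).digest()
        h = int.from_bytes(digest[:8], "big")   # 64-bit hash value
        index = h >> (64 - self.p)              # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)   # remaining 64 - p bits
        # Position of the leftmost 1-bit in the remaining bits (1-based).
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def count(self):
        harmonic = sum(2.0 ** -r for r in self.registers)
        estimate = self.alpha * self.m * self.m / harmonic
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            estimate = self.m * math.log(self.m / zeros)
        return estimate


hll = HyperLogLog(p=10)
for i in range(10000):
    hll.add(f"user-{i}")
    hll.add(f"user-{i}")  # duplicates do not change the registers
estimate = hll.count()
```

With 1024 registers the state is about a thousand small integers, and the typical relative error is around 1.04 / sqrt(1024), roughly 3 percent.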

23
Q

What is the Count-Min Sketch?

A

A space-efficient algorithm used to approximate FREQUENCY counts of items in data streams, maintaining counters incremented by hash functions and taking the minimum for estimates
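
A minimal Count-Min Sketch, assuming one SHA-1-derived hash per row (salted with the row number). The class name, default sizes, and method names are illustrative choices, not a reference implementation.

```python
import hashlib


class CountMinSketch:
    """depth rows of width counters; estimates never undercount."""

    def __init__(self, width=1000, depth=5):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        """Yield the (row, column) cell for this item in every row."""
        for row in range(self.depth):
            digest = hashlib.sha1(f"{row}:{item}".encode()).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row, col in self._cells(item):
            self.table[row][col] += count

    def estimate(self, item):
        # Collisions can only inflate a counter, so the minimum is the best guess.
        return min(self.table[row][col] for row, col in self._cells(item))


sketch = CountMinSketch()
for word in ["a", "b", "a", "a", "b", "a", "a"]:
    sketch.add(word)
```

Here `sketch.estimate("a")` is at least 5 and `sketch.estimate("b")` is at least 2. Overestimates are possible when items collide in every row, but underestimates are not.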