Stream Data Processing Flashcards
Why can't event stream data be stored directly as big data?
It would result in one small file per event -> too many files
Event hub
A buffer that collects incoming events and stores them in batches.
The different kinds of stream processing
1 - stream data integration
2 - stream analytics
Stream data integration
Focuses on ingesting and processing data sources, targeting ETL-style use cases
Stream analytics
Targets analytics use cases. Calculates aggregates and detects patterns.
Native streaming
Events are processed as they arrive -> lowest latency, but fault tolerance is harder and more costly to achieve
Window
A bounded subset of the stream on which computations are performed
Three types of windows
1 - fixed/tumbling windows
2 - sliding/hopping windows
3 - session windows
Fixed/tumbling windows
The window closes when it is full, based on either an item count or a fixed time interval; consecutive windows do not overlap
Sliding/hopping windows
Defined by a window length plus a sliding (hop) interval; consecutive windows can overlap
Session windows
Sequences of temporally related events terminated by a gap of inactivity
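The three window types can be illustrated with a minimal, count- and gap-based sketch in plain Python (the function names and the simplification to in-memory lists are my own, not from the cards):

```python
from collections import deque

def tumbling_windows(events, size):
    """Count-based tumbling windows: non-overlapping chunks of `size` events."""
    window = []
    for e in events:
        window.append(e)
        if len(window) == size:     # window is full -> emit and start a new one
            yield list(window)
            window.clear()

def sliding_windows(events, size, hop):
    """Count-based sliding (hopping) windows: emit every `hop` events,
    each window covering the most recent `size` events."""
    buf = deque(maxlen=size)
    for i, e in enumerate(events, start=1):
        buf.append(e)
        if i >= size and (i - size) % hop == 0:
            yield list(buf)

def session_windows(timestamps, gap):
    """Session windows: a new session starts whenever the time since the
    previous event exceeds `gap` (a period of inactivity)."""
    session = []
    for t in timestamps:
        if session and t - session[-1] > gap:
            yield session
            session = []
        session.append(t)
    if session:
        yield session

print(list(tumbling_windows([1, 2, 3, 4, 5, 6], 3)))   # [[1, 2, 3], [4, 5, 6]]
print(list(sliding_windows([1, 2, 3, 4, 5, 6], 3, 1))) # overlapping windows of 3
print(list(session_windows([1, 2, 3, 10, 11, 30], 5))) # [[1, 2, 3], [10, 11], [30]]
```

Note how tumbling windows partition the stream, sliding windows overlap, and session windows have no fixed size at all.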
Which two kinds of queries are there?
1 - ad-hoc queries
2 - standing queries
Standing queries
Queries that are stored and executed continuously over the stream
Ad-hoc queries
One-time queries, asked once against the current state of the stream
Main differences between batch processing and stream processing?
1 - The input is not controlled by the system
2 - The input timing/rate is often unknown
Multi-query optimization
Sharing common sub-queries across queries so that intermediate results are computed once instead of being recomputed per query
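A tiny sketch of the idea (the two queries and the shared sub-result are illustrative assumptions, not from the cards): both queries need per-key counts, so that sub-computation is done once and reused.

```python
from collections import Counter

events = ["a", "b", "a", "c", "a", "b"]

# Shared sub-computation: per-key counts, computed once over the stream.
counts = Counter(events)

# Query 1: most frequent key. Query 2: number of distinct keys.
# Both reuse `counts` instead of re-scanning the event stream.
top_key = counts.most_common(1)[0][0]
distinct_keys = len(counts)

print(top_key, distinct_keys)  # a 3
```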
Ways to handle overload?
- Back pressure
- Load shedding
- Distributed stream processing
Back pressure
Slows down the sources to avoid data loss (for example via blocking queues)
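Back pressure via a blocking queue can be sketched in a few lines: a bounded buffer makes the fast producer wait for the slow consumer, so no events are dropped (a minimal single-machine sketch, assuming one producer and one consumer thread):

```python
import queue
import threading
import time

buf = queue.Queue(maxsize=2)   # bounded buffer: put() blocks when full
consumed = []

def consumer():
    while True:
        item = buf.get()
        if item is None:       # sentinel -> shut down
            break
        time.sleep(0.01)       # simulate a slow consumer
        consumed.append(item)

t = threading.Thread(target=consumer)
t.start()
for i in range(10):
    buf.put(i)                 # blocks (back pressure) once 2 items are queued
buf.put(None)
t.join()

print(consumed)                # all 10 items arrive; nothing is shed
```

Contrast this with load shedding below, which keeps the source fast by dropping data instead.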
Load shedding variants
1 - random sampling-based shedding
2 - relevance-based shedding
3 - summary-based shedding
Random-sampling-based load shedding
Take a random sample of the stream and compute the output from it (an approximation)
Relevance-based load shedding
Use an algorithm that knows which events are relevant to the query and discards the irrelevant ones
Summary-based load shedding
Instead of processing the full queue, a summary of the queued items is computed and used as input for the next step
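Random-sampling-based shedding is the easiest variant to sketch: process only a fraction of the events and approximate the answer from the sample (function name, `keep_prob` parameter, and the mean-estimation example are my own assumptions):

```python
import random

def approx_mean(stream, keep_prob, seed=0):
    """Random-sampling-based load shedding: keep each event with
    probability `keep_prob`, shed the rest, and approximate the mean
    from the surviving sample."""
    rng = random.Random(seed)
    total, n = 0.0, 0
    for x in stream:
        if rng.random() < keep_prob:   # event survives shedding
            total += x
            n += 1
    return total / n if n else None

# True mean of 0..999 is 499.5; the estimate is close but approximate.
est = approx_mean(range(1000), keep_prob=0.1)
print(est)
```

The trade-off is explicit: ~90% less work in exchange for an approximate result.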
Distributed stream processing
Distribute the stream based on data flow (distributes the query) or on key range (distributes the data stream itself)
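Partitioning the stream itself by key can be sketched as hash routing: every event with the same key goes to the same worker, so per-key state stays local (function name and the use of CRC32 as a stable hash are illustrative assumptions):

```python
import zlib

def partition_by_key(events, num_workers):
    """Key-based partitioning of a stream: route each (key, value) event
    to a worker determined by a stable hash of its key."""
    partitions = [[] for _ in range(num_workers)]
    for key, value in events:
        worker = zlib.crc32(key.encode()) % num_workers  # stable key -> worker
        partitions[worker].append((key, value))
    return partitions

events = [("a", 1), ("b", 2), ("a", 3), ("c", 4)]
parts = partition_by_key(events, 2)
print(parts)  # both "a" events land in the same partition
```

Data-flow distribution would instead split the query into operators and place different operators on different workers.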
Three options for delivery guarantees
1 - at most once -> can cause data loss on failures
2 - at least once -> might create incorrect states because data is processed multiple times
3 - exactly once -> no loss and no duplicates, but the most expensive to guarantee
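The difference between at-least-once and exactly-once state updates can be sketched with deduplication: if redelivered events carry stable IDs, tracking seen IDs makes processing idempotent (event IDs, the sample deliveries, and the function name are illustrative assumptions):

```python
def process_with_dedup(deliveries):
    """At-least-once delivery may redeliver events after a failure;
    skipping already-seen event IDs yields exactly-once state updates."""
    seen = set()
    state = 0
    for event_id, amount in deliveries:
        if event_id in seen:      # duplicate redelivery -> skip
            continue
        seen.add(event_id)
        state += amount
    return state

# Event 1 is redelivered after a failure; without dedup the state would be 60.
deliveries = [(1, 10), (2, 20), (1, 10), (3, 20)]
print(process_with_dedup(deliveries))  # 50
```

The cost of "exactly once" shows up here as the `seen` set, which must itself survive failures in a real system.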