Batch Data Processing Flashcards

Question 1

Q

What should data pipelines support?

Answer

A

Interpretability and observability

Question 2

Q

What are the most important pipeline features?

Answer

A

1 - immutable data
2 - data lineage
3 - test-running features

Question 3

Q

Why is immutable data important?

Answer

A

To make reproducible outcomes possible

Question 4

Q

Why is data lineage important?

Answer

A

For diagnostics: tracing where a piece of data came from when something goes wrong

Question 5

Q

Why are test-running features important?

Answer

A

To validate assumptions that have been made

Question 6

Q

What kinds of tests should the testing step include?

Answer

A

1 - health check
2 - integration test
3 - latency test

Question 7

Q

Health check

Answer

A

Checks if the job has succeeded

Question 8

Q

Integration test

Answer

A

Verifies that mock data makes its way through the data transformation as expected
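
A minimal sketch of such an integration test, assuming a hypothetical transform() step that drops records without an id and normalises names. Neither the function nor the record fields come from the cards; they are stand-ins for whatever transformation the real pipeline applies.

```python
def transform(records):
    """Hypothetical transformation: drop rows without an id, normalise names."""
    return [
        {"id": r["id"], "name": r["name"].strip().lower()}
        for r in records
        if r.get("id") is not None
    ]


def integration_test():
    """Push mock data through the transformation and compare with the expected output."""
    mock_input = [
        {"id": 1, "name": "  Alice "},
        {"id": None, "name": "ghost"},  # should be filtered out
        {"id": 2, "name": "BOB"},
    ]
    expected = [
        {"id": 1, "name": "alice"},
        {"id": 2, "name": "bob"},
    ]
    return transform(mock_input) == expected


passed = integration_test()
print("integration test passed:", passed)
```

The mock input is deliberately small and includes one record that should be rejected, so the test checks both what comes out and what is left behind.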

Question 9

Q

Latency test

Answer

A

Measures the time it takes for the data pipeline to complete
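
One way to sketch a latency test is to time a full run and compare it with a budget. The run_pipeline() job and the 5-second budget below are assumptions for illustration only; a real test would call the actual pipeline and use a threshold agreed for it.

```python
import time

LATENCY_BUDGET_SECONDS = 5.0  # assumed budget, not taken from the cards


def run_pipeline(records):
    """Hypothetical stand-in for the real batch job."""
    return [r * 2 for r in records]


def measure_latency(job, payload):
    """Run the job once and return its result with the elapsed wall-clock time."""
    start = time.perf_counter()
    result = job(payload)
    elapsed = time.perf_counter() - start
    return result, elapsed


result, elapsed = measure_latency(run_pipeline, list(range(1000)))
within_budget = elapsed <= LATENCY_BUDGET_SECONDS
print("within budget:", within_budget)
```

time.perf_counter() is used because it is a monotonic, high-resolution clock intended for measuring short durations.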

Question 10

Q

Benefits of batch processing

Answer

A

1 - Load balancing (shift the time of the job processing to when the computing resources are less busy)
2 - Reducing manual intervention and supervision
3 - Overall high rate of utilisation
4 - Allowing priority differences

Question 11

Q

Combiner function

Answer

A

An extra layer that pre-aggregates values in the mapper itself, before they are sent to the reducer. This can only be done if the reduce function is commutative and associative.
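
The idea can be sketched with a word count in plain Python. This is not the API of any MapReduce framework; map_words, combine and reduce_counts are hypothetical names, and each string in splits stands for the input of one mapper.

```python
from collections import Counter, defaultdict


def map_words(line):
    """Mapper: emit a (word, 1) pair for every word in the line."""
    return [(word, 1) for word in line.split()]


def combine(pairs):
    """Combiner: pre-aggregate the pairs produced by a single mapper."""
    local = Counter()
    for key, value in pairs:
        local[key] += value
    return list(local.items())


def reduce_counts(pairs):
    """Reducer: sum the values for each key across all mappers."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


splits = ["a b a", "b b c"]  # one entry per mapper

without_combiner = [p for line in splits for p in map_words(line)]
with_combiner = [p for line in splits for p in combine(map_words(line))]

# Fewer pairs reach the reducer, but the final result is the same.
same_result = reduce_counts(without_combiner) == reduce_counts(with_combiner)
print(len(without_combiner), len(with_combiner), same_result)
```

Summation is commutative and associative, so partial sums can be merged in any order. A mean is a counterexample: averaging the per-mapper averages of [1, 2] and [3] gives 2.25, while the true mean of [1, 2, 3] is 2.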

(11 cards)