Batch Data Processing Flashcards
What should data pipelines support?
Interpretability and observability
What are the most important pipeline features?
1 - immutable data
2 - data lineage
3 - test running features
Why is immutable data important?
To make reproducible outcomes possible
Why is data lineage important?
For diagnostics: it lets you trace an output back to the inputs and transformations that produced it
Why are test running features important?
To validate assumptions that have been made
Which kinds of tests should the testing step include?
1 - health check
2 - integration test
3 - latency test
Health check
Checks whether the job has succeeded
Integration test
Verifies that mock data makes its way through the data transformation
Latency test
Measures the time it takes for the data pipeline to complete
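Example: the three tests, sketched
A minimal Python sketch of how the health check, integration test and latency test could look. Everything here is an assumption made for illustration: `transform`, `run_pipeline`, the record layout and the one-second threshold are hypothetical and do not come from any particular framework.

```python
import time


def transform(records):
    """Hypothetical transformation: keep positive amounts and double them."""
    return [
        {"id": r["id"], "amount": r["amount"] * 2}
        for r in records
        if r["amount"] > 0
    ]


def run_pipeline(records):
    """Hypothetical job wrapper: reports status, output and elapsed seconds."""
    start = time.perf_counter()
    try:
        output = transform(records)
        status = "succeeded"
    except Exception:
        output = []
        status = "failed"
    elapsed = time.perf_counter() - start
    return {"status": status, "output": output, "elapsed": elapsed}


def health_check(result):
    """Health check: did the job succeed?"""
    return result["status"] == "succeeded"


def integration_test():
    """Integration test: does mock data come out transformed as expected?"""
    mock = [{"id": 1, "amount": 5}, {"id": 2, "amount": -3}]
    result = run_pipeline(mock)
    return result["output"] == [{"id": 1, "amount": 10}]


def latency_test(result, max_seconds=1.0):
    """Latency test: did the run finish within the assumed time budget?"""
    return result["elapsed"] <= max_seconds
```

In a real pipeline the time budget would come from the job's service-level target rather than a hard-coded default.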
Benefits of batch processing
1 - Load balancing (shifting job processing to times when computing resources are less busy)
2 - Reducing manual intervention and supervision
3 - Overall high rate of utilisation
4 - Allowing priority differences
Combiner function
An extra layer that pre-aggregates values in the mapper itself, before they are sent to the reducer. This is only possible if the reduce function is commutative and associative.
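Example: combiner in a word count
A minimal sketch in plain Python (no MapReduce framework) of what a combiner does, assuming a word-count job where the reduce function is a sum. The names `mapper`, `combine`, `reduce_all` and the two input splits are hypothetical. Because addition is commutative and associative, pre-aggregating per split does not change the final result; it only reduces the number of pairs passed on to the reducer.

```python
from collections import defaultdict


def mapper(line):
    """Emit a (word, 1) pair for every word in one input split."""
    return [(word, 1) for word in line.split()]


def combine(pairs):
    """Combiner: sum the values per key locally, on the mapper side."""
    local = defaultdict(int)
    for key, value in pairs:
        local[key] += value
    return list(local.items())


def reduce_all(pairs):
    """Reducer: sum the values per key across all splits."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


splits = ["a b a", "b a"]  # hypothetical input splits

# Same job, without and with the combiner step.
without = reduce_all(p for s in splits for p in mapper(s))
with_combiner = reduce_all(p for s in splits for p in combine(mapper(s)))

# Number of pairs that would be sent from the mappers to the reducer.
shuffled_without = sum(len(mapper(s)) for s in splits)
shuffled_with = sum(len(combine(mapper(s))) for s in splits)
```

A function such as the mean does not qualify as is, since the mean of partial means is generally not the overall mean; it would have to be rewritten in terms of sums and counts first.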