General DE Flashcards
How do you handle job failures in an ETL pipeline?
Prevention is better than cure! But you need to be prepared for failures (Alerting, Graceful Degradation, Backfill Strategy, Post-Mortem); a retry-then-alert sketch follows.
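A minimal sketch of that preparation, assuming a hypothetical send_alert hook you would wire to your paging or chat tool; retries absorb transient failures, and the final re-raise surfaces the failure so the scheduler can mark the run and trigger a backfill:

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("etl")

def send_alert(message: str) -> None:
    # Hypothetical hook: wire this to PagerDuty, Slack, email, etc.
    log.error("ALERT: %s", message)

def run_with_failure_handling(task, max_retries=3, backoff_s=30):
    """Retry a failing task, then alert and re-raise so the run is marked failed."""
    for attempt in range(1, max_retries + 1):
        try:
            return task()
        except Exception as exc:
            log.warning("attempt %d/%d failed: %s", attempt, max_retries, exc)
            if attempt == max_retries:
                send_alert(f"ETL task {task.__name__} failed after {max_retries} attempts: {exc}")
                raise  # surface the failure so the scheduler can trigger a backfill
            time.sleep(backoff_s * attempt)  # linear backoff between attempts
```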
What steps do you take when a data pipeline is running slower than expected?
Know the bottlenecks in advance. On occurrence: examine logs, monitor system resources, optimise queries and batch sizes, then look to scale up or distribute the work. A stage-timing sketch follows.
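A minimal sketch for finding the bottleneck by timing each stage; extract, transform, and load here are stand-ins for your real stage functions:

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(stage, timings):
    """Record wall-clock seconds per stage so the slowest step stands out."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = time.perf_counter() - start

def extract():            # stand-ins for your real stage functions
    return list(range(1000))

def transform(rows):
    return [r * 2 for r in rows]

def load(rows):
    pass

timings = {}
with timed("extract", timings):
    rows = extract()
with timed("transform", timings):
    rows = transform(rows)
with timed("load", timings):
    load(rows)

# Largest first: the top line is the stage to optimise
for stage, seconds in sorted(timings.items(), key=lambda kv: -kv[1]):
    print(f"{stage}: {seconds:.3f}s")
```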
How do you address data quality issues in a large dataset?
Systematic Approach:
Exploratory Analysis (Find NULLs, duplicates, incorrect typing)
Handle Missing Data (find patterns, then imputation or deletion)
Remove Dupes
Standardise Labels and Normalise Numbers
Find and Handle Outliers
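A minimal pandas sketch of these steps on a toy frame; the column names (customer_id, country, amount) are placeholders for your own:

```python
import pandas as pd

# Placeholder data; in practice this is your large dataset
df = pd.DataFrame({
    "customer_id": ["c1", "c2", "c2", None],
    "country": [" uk", "UK ", "UK ", "fr"],
    "amount": [10.0, 250.0, 250.0, None],
})

# Exploratory analysis: NULLs, duplicates, incorrect typing
print(df.isna().sum())
print("duplicate rows:", df.duplicated().sum())
print(df.dtypes)

# Handle missing data: impute numerics, delete rows missing a required key
df["amount"] = df["amount"].fillna(df["amount"].median())
df = df.dropna(subset=["customer_id"])

# Remove dupes
df = df.drop_duplicates()

# Standardise labels and normalise numbers
df["country"] = df["country"].str.strip().str.upper()
df["amount_z"] = (df["amount"] - df["amount"].mean()) / df["amount"].std()

# Find and handle outliers: flag values beyond 3 standard deviations
outliers = df[df["amount_z"].abs() > 3]
print(outliers)
```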
What is your approach to handling schema changes in source systems?
Monitoring and Alerting
Design with Evolution in mind and use a Schema Registry
Employ Backwards Compatibility on old fields (default values etc.)
Employ Conditional Logic on optional fields
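A minimal sketch of schema conformance with backwards-compatible defaults; the field names are illustrative, and the convention here is that a None default marks a required field:

```python
EXPECTED_SCHEMA = {
    "id": None,                  # required: no default, missing value is an error
    "email": None,               # required
    "loyalty_tier": "standard",  # added later: default keeps old records valid
}

def conform(record: dict) -> dict:
    """Coerce an incoming record to the expected schema, flagging surprises."""
    unknown = set(record) - set(EXPECTED_SCHEMA)
    if unknown:
        # New upstream fields: log rather than fail, then review the schema change
        print(f"warning: unexpected fields {unknown}")
    out = {}
    for field, default in EXPECTED_SCHEMA.items():
        if field in record:
            out[field] = record[field]
        elif default is not None:
            out[field] = default  # backwards-compatible default for old records
        else:
            raise ValueError(f"required field missing: {field}")
    return out

print(conform({"id": 7, "email": "a@b.com"}))  # old record: loyalty_tier defaulted
```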
How do you manage data partitioning in large-scale data processing?
Consider:
Sharding (row sub-sets) - Useful for distributed systems
Key-based (all related rows processed together) - Useful for distributed systems
Range-based (usually date) - Useful for finance and time-series
Round-Robin - Useful for data with loose structure
Hash-based - Spreads load evenly like round-robin, but deterministically: the same key always lands in the same partition (sketch below)
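A minimal hash-based partitioning sketch; md5 is used rather than Python's built-in hash() because its output is stable across processes and restarts, which matters in a distributed job:

```python
import hashlib

def partition_for(key: str, num_partitions: int = 8) -> int:
    """Hash-based partitioning: a stable hash keeps all rows for a key together."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_partitions

# Toy rows; customer_id is the partition key
rows = [{"customer_id": "c17", "amount": 10}, {"customer_id": "c42", "amount": 5}]
partitions = {}
for row in rows:
    partitions.setdefault(partition_for(row["customer_id"]), []).append(row)
print(partitions)
```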
What do you do if data ingestion from a third-party API fails?
Check Status
Inspect Error and Logs
Validate Request
Check Rate Limits
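A minimal sketch of those checks using requests, distinguishing rate limits (429, honouring Retry-After), transient server errors (5xx, retried with backoff), and bad requests (4xx, failed loudly because retrying cannot fix them):

```python
import time
import requests

def fetch(url: str, params: dict, max_retries: int = 5):
    """Fetch with status checks, rate-limit awareness, and exponential backoff."""
    for attempt in range(max_retries):
        resp = requests.get(url, params=params, timeout=30)
        if resp.status_code == 429:
            # Rate limited: honour Retry-After if the API provides it
            wait = int(resp.headers.get("Retry-After", 2 ** attempt))
            time.sleep(wait)
            continue
        if 500 <= resp.status_code < 600:
            # Transient server error: back off and retry
            time.sleep(2 ** attempt)
            continue
        resp.raise_for_status()  # a 4xx here means our request is wrong: fail loudly
        return resp.json()
    raise RuntimeError(f"API still failing after {max_retries} attempts: {url}")
```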
What steps do you take when a data job exceeds its allocated time window?
Check ingestion volume
Identify bottlenecks
Optimise queries if needed
Increase resources if needed
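A minimal sketch that ties these together by projecting total runtime from the batches completed so far, assuming a two-hour window and a list of batches; process_batch is a stand-in for the real work:

```python
import time

TIME_BUDGET_S = 2 * 60 * 60  # assumed two-hour window

def run_within_budget(batches, process_batch):
    """Warn early when the projected total runtime will exceed the window."""
    start = time.monotonic()
    for done, batch in enumerate(batches, start=1):
        process_batch(batch)
        elapsed = time.monotonic() - start
        projected = elapsed / done * len(batches)  # naive linear projection
        if projected > TIME_BUDGET_S:
            print(f"warning: projected {projected / 3600:.1f}h for {len(batches)} "
                  f"batches ({done} done in {elapsed / 60:.1f}m); check ingestion "
                  f"volume, then optimise queries or add resources")
```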
How do you address issues with data duplication in a pipeline?
Minimise duplicates at extraction using key- or date-bounded queries
Use sliding windows
Use hashing
Utilise DB constraints
Utilise upserts where appropriate
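A minimal sketch combining content hashing, a DB constraint, and an upsert, using sqlite3 purely for illustration; the events table and its columns are assumptions:

```python
import hashlib
import sqlite3

def row_hash(row: dict) -> str:
    """Stable content hash used as the deduplication key."""
    canonical = "|".join(f"{k}={row[k]}" for k in sorted(row))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

conn = sqlite3.connect(":memory:")
# The PRIMARY KEY constraint is what blocks duplicates at the DB level
conn.execute("CREATE TABLE events (hash TEXT PRIMARY KEY, payload TEXT)")

def upsert(row: dict) -> None:
    # INSERT OR REPLACE: re-delivery of the same row overwrites instead of duplicating
    conn.execute(
        "INSERT OR REPLACE INTO events (hash, payload) VALUES (?, ?)",
        (row_hash(row), str(row)),
    )

upsert({"id": 1, "amount": 10})
upsert({"id": 1, "amount": 10})  # duplicate delivery: no second row
print(conn.execute("SELECT COUNT(*) FROM events").fetchone()[0])  # -> 1
```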
How do you handle and log errors in a distributed data processing job?
Structured error-handling (try-except) that propagates failures to the coordinator or other nodes where appropriate
Centralise all logging
Allow retries to handle transient errors
Implement checkpointing to allow faster retries (sketch below)
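A minimal single-process sketch of the pattern: structured JSON error logs that a central collector can parse, per-item retries for transient errors, and a checkpoint file so a restart resumes where it left off. The checkpoint path and the handle function are assumptions standing in for your real work:

```python
import json
import logging
import os

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("worker")

CHECKPOINT = "checkpoint.json"  # assumed per-worker checkpoint location

def load_checkpoint() -> int:
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return json.load(f)["last_done"]
    return -1

def save_checkpoint(i: int) -> None:
    with open(CHECKPOINT, "w") as f:
        json.dump({"last_done": i}, f)

def handle(item):  # stand-in for the real per-item work
    pass

def process_partition(items, worker_id, max_retries=3):
    """Resume from the checkpoint; log structured records for central collection."""
    for i in range(load_checkpoint() + 1, len(items)):
        for attempt in range(1, max_retries + 1):
            try:
                handle(items[i])
                break
            except Exception as exc:
                # One JSON object per line: easy to ship and aggregate centrally
                log.error(json.dumps({"worker": worker_id, "item": i,
                                      "attempt": attempt, "error": str(exc)}))
                if attempt == max_retries:
                    raise  # give up on this item; the checkpoint preserves progress
        save_checkpoint(i)  # a restart resumes after the last completed item
```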