General DE Flashcards
How do you handle job failures in an ETL pipeline?
Prevention is better than cure! But you need to be prepared for failures (Alerting, Graceful Degradation, Backfill Strategy, Post-Mortem); a retry-then-alert sketch follows.
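A minimal sketch of that preparation, assuming a hypothetical send_alert hook you would wire to your paging or chat tool; retries absorb transient failures, and the final re-raise surfaces the failure so the scheduler can mark the run and trigger a backfill:

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("etl")

def send_alert(message: str) -> None:
    # Hypothetical hook: wire this to PagerDuty, Slack, email, etc.
    log.error("ALERT: %s", message)

def run_with_failure_handling(task, max_retries=3, backoff_s=30):
    """Retry a failing task, then alert and re-raise so the run is marked failed."""
    for attempt in range(1, max_retries + 1):
        try:
            return task()
        except Exception as exc:
            log.warning("attempt %d/%d failed: %s", attempt, max_retries, exc)
            if attempt == max_retries:
                send_alert(f"ETL task {task.__name__} failed after {max_retries} attempts: {exc}")
                raise  # surface the failure so the scheduler can trigger a backfill
            time.sleep(backoff_s * attempt)  # linear backoff between attempts
```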
What steps do you take when a data pipeline is running slower than expected?
Know the bottlenecks in advance. On occurrence: examine logs, monitor system resources, optimise queries and batch sizes, then look to scale up or distribute the work. A stage-timing sketch follows.
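A minimal sketch for finding the bottleneck by timing each stage; extract, transform, and load here are stand-ins for your real stage functions:

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(stage, timings):
    """Record wall-clock seconds per stage so the slowest step stands out."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = time.perf_counter() - start

def extract():            # stand-ins for your real stage functions
    return list(range(1000))

def transform(rows):
    return [r * 2 for r in rows]

def load(rows):
    pass

timings = {}
with timed("extract", timings):
    rows = extract()
with timed("transform", timings):
    rows = transform(rows)
with timed("load", timings):
    load(rows)

# Largest first: the top line is the stage to optimise
for stage, seconds in sorted(timings.items(), key=lambda kv: -kv[1]):
    print(f"{stage}: {seconds:.3f}s")
```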
How do you address data quality issues in a large dataset?
Systematic Approach:
Exploratory Analysis (Find NULLs, duplicates, incorrect typing)
Handle Missing Data (find patterns, then imputation or deletion)
Remove Dupes
Standardise Labels and Normalise Numbers
Find and Handle Outliers
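A minimal pandas sketch of these steps on a toy frame; the column names (customer_id, country, amount) are placeholders for your own:

```python
import pandas as pd

# Placeholder data; in practice this is your large dataset
df = pd.DataFrame({
    "customer_id": ["c1", "c2", "c2", None],
    "country": [" uk", "UK ", "UK ", "fr"],
    "amount": [10.0, 250.0, 250.0, None],
})

# Exploratory analysis: NULLs, duplicates, incorrect typing
print(df.isna().sum())
print("duplicate rows:", df.duplicated().sum())
print(df.dtypes)

# Handle missing data: impute numerics, delete rows missing a required key
df["amount"] = df["amount"].fillna(df["amount"].median())
df = df.dropna(subset=["customer_id"])

# Remove dupes
df = df.drop_duplicates()

# Standardise labels and normalise numbers
df["country"] = df["country"].str.strip().str.upper()
df["amount_z"] = (df["amount"] - df["amount"].mean()) / df["amount"].std()

# Find and handle outliers: flag values beyond 3 standard deviations
outliers = df[df["amount_z"].abs() > 3]
print(outliers)
```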
What is your approach to handling schema changes in source systems?
Monitoring and Alerting
Design with Evolution in mind and use a Schema Registry
Employ Backwards Compatibility on old fields (default values etc.)
Employ Conditional Logic on optional fields
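A minimal sketch of schema conformance with backwards-compatible defaults; the field names are illustrative, and the convention here is that a None default marks a required field:

```python
EXPECTED_SCHEMA = {
    "id": None,                  # required: no default, missing value is an error
    "email": None,               # required
    "loyalty_tier": "standard",  # added later: default keeps old records valid
}

def conform(record: dict) -> dict:
    """Coerce an incoming record to the expected schema, flagging surprises."""
    unknown = set(record) - set(EXPECTED_SCHEMA)
    if unknown:
        # New upstream fields: log rather than fail, then review the schema change
        print(f"warning: unexpected fields {unknown}")
    out = {}
    for field, default in EXPECTED_SCHEMA.items():
        if field in record:
            out[field] = record[field]
        elif default is not None:
            out[field] = default  # backwards-compatible default for old records
        else:
            raise ValueError(f"required field missing: {field}")
    return out

print(conform({"id": 7, "email": "a@b.com"}))  # old record: loyalty_tier defaulted
```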
How do you manage data partitioning in large-scale data processing?
Consider:
Sharding (row sub-sets) - Useful for distributed systems
Key-based (all related rows processed together) - Useful for distributed systems
Range-based (usually date) - Useful for finance and time-series
Round-Robin - Useful for data with loose structure
Hash-based - Spreads load evenly like round-robin, but deterministically: the same key always lands in the same partition (sketch below)
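A minimal hash-based partitioning sketch; md5 is used rather than Python's built-in hash() because its output is stable across processes and restarts, which matters in a distributed job:

```python
import hashlib

def partition_for(key: str, num_partitions: int = 8) -> int:
    """Hash-based partitioning: a stable hash keeps all rows for a key together."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_partitions

# Toy rows; customer_id is the partition key
rows = [{"customer_id": "c17", "amount": 10}, {"customer_id": "c42", "amount": 5}]
partitions = {}
for row in rows:
    partitions.setdefault(partition_for(row["customer_id"]), []).append(row)
print(partitions)
```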
What do you do if data ingestion from a third-party API fails?
Check Status
Inspect Error and Logs
Validate Request
Check Rate Limits
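A minimal sketch of those checks using requests, distinguishing rate limits (429, honouring Retry-After), transient server errors (5xx, retried with backoff), and bad requests (4xx, failed loudly because retrying cannot fix them):

```python
import time
import requests

def fetch(url: str, params: dict, max_retries: int = 5):
    """Fetch with status checks, rate-limit awareness, and exponential backoff."""
    for attempt in range(max_retries):
        resp = requests.get(url, params=params, timeout=30)
        if resp.status_code == 429:
            # Rate limited: honour Retry-After if the API provides it
            wait = int(resp.headers.get("Retry-After", 2 ** attempt))
            time.sleep(wait)
            continue
        if 500 <= resp.status_code < 600:
            # Transient server error: back off and retry
            time.sleep(2 ** attempt)
            continue
        resp.raise_for_status()  # a 4xx here means our request is wrong: fail loudly
        return resp.json()
    raise RuntimeError(f"API still failing after {max_retries} attempts: {url}")
```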
What steps do you take when a data job exceeds its allocated time window?
Check ingestion volume
Identify bottlenecks
Optimise queries if needed
Increase resources if needed
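A minimal sketch that ties these together by projecting total runtime from the batches completed so far, assuming a two-hour window and a list of batches; process_batch is a stand-in for the real work:

```python
import time

TIME_BUDGET_S = 2 * 60 * 60  # assumed two-hour window

def run_within_budget(batches, process_batch):
    """Warn early when the projected total runtime will exceed the window."""
    start = time.monotonic()
    for done, batch in enumerate(batches, start=1):
        process_batch(batch)
        elapsed = time.monotonic() - start
        projected = elapsed / done * len(batches)  # naive linear projection
        if projected > TIME_BUDGET_S:
            print(f"warning: projected {projected / 3600:.1f}h for {len(batches)} "
                  f"batches ({done} done in {elapsed / 60:.1f}m); check ingestion "
                  f"volume, then optimise queries or add resources")
```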
How do you address issues with data duplication in a pipeline?
Minimise duplicates at extraction using key- or date-bounded queries
Use sliding windows
Use hashing
Utilise DB constraints
Utilise upserts where appropriate
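A minimal sketch combining content hashing, a DB constraint, and an upsert, using sqlite3 purely for illustration; the events table and its columns are assumptions:

```python
import hashlib
import sqlite3

def row_hash(row: dict) -> str:
    """Stable content hash used as the deduplication key."""
    canonical = "|".join(f"{k}={row[k]}" for k in sorted(row))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

conn = sqlite3.connect(":memory:")
# The PRIMARY KEY constraint is what blocks duplicates at the DB level
conn.execute("CREATE TABLE events (hash TEXT PRIMARY KEY, payload TEXT)")

def upsert(row: dict) -> None:
    # INSERT OR REPLACE: re-delivery of the same row overwrites instead of duplicating
    conn.execute(
        "INSERT OR REPLACE INTO events (hash, payload) VALUES (?, ?)",
        (row_hash(row), str(row)),
    )

upsert({"id": 1, "amount": 10})
upsert({"id": 1, "amount": 10})  # duplicate delivery: no second row
print(conn.execute("SELECT COUNT(*) FROM events").fetchone()[0])  # -> 1
```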
How do you handle and log errors in a distributed data processing job?
Structured error-handling (try-except) that propagates failures to the coordinator or other nodes where appropriate
Centralise all logging
Allow retries to handle transient errors
Implement checkpointing to allow faster retries (sketch below)
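A minimal single-process sketch of the pattern: structured JSON error logs that a central collector can parse, per-item retries for transient errors, and a checkpoint file so a restart resumes where it left off. The checkpoint path and the handle function are assumptions standing in for your real work:

```python
import json
import logging
import os

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("worker")

CHECKPOINT = "checkpoint.json"  # assumed per-worker checkpoint location

def load_checkpoint() -> int:
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return json.load(f)["last_done"]
    return -1

def save_checkpoint(i: int) -> None:
    with open(CHECKPOINT, "w") as f:
        json.dump({"last_done": i}, f)

def handle(item):  # stand-in for the real per-item work
    pass

def process_partition(items, worker_id, max_retries=3):
    """Resume from the checkpoint; log structured records for central collection."""
    for i in range(load_checkpoint() + 1, len(items)):
        for attempt in range(1, max_retries + 1):
            try:
                handle(items[i])
                break
            except Exception as exc:
                # One JSON object per line: easy to ship and aggregate centrally
                log.error(json.dumps({"worker": worker_id, "item": i,
                                      "attempt": attempt, "error": str(exc)}))
                if attempt == max_retries:
                    raise  # give up on this item; the checkpoint preserves progress
        save_checkpoint(i)  # a restart resumes after the last completed item
```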