ETL and Data Pipelines Flashcards
What is ETL, and how is it different from ELT?
ETL stands for Extract, Transform, Load; ELT stands for Extract, Load, Transform. The key difference is where transformation happens: in ETL, data is transformed on a staging system before being loaded into the target, while in ELT, raw data is loaded into the target first (typically a cloud data warehouse) and transformed there using the warehouse's own compute.
How would you design a scalable ETL pipeline for 1TB+ of daily data?
A scalable pipeline at that volume should use a distributed processing engine (e.g., Spark), columnar storage formats (e.g., Parquet) partitioned by date, incremental rather than full loads, and enough parallelism that extract, transform, and load stages can scale horizontally as data grows.
How do you optimize ETL performance for large datasets?
Key techniques include partitioning data so work can be parallelized, pushing filters and aggregations down to the source, loading in bulk rather than row by row, indexing staging tables for lookups, and minimizing data movement between stages.
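The partition-and-parallelize idea can be sketched in plain Python. This is a minimal illustration, not a production engine; the `transform` step (doubling values) is a hypothetical stand-in for real business logic:

```python
from concurrent.futures import ThreadPoolExecutor

def transform(chunk):
    # Hypothetical transform: double each value.
    return [v * 2 for v in chunk]

def partition(rows, chunk_size):
    # Split the dataset into fixed-size chunks so each can be processed independently.
    return [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]

def run_pipeline(rows, chunk_size=1000, workers=4):
    chunks = partition(rows, chunk_size)
    # map() preserves chunk order, so results reassemble in input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(transform, chunks)
    out = []
    for r in results:
        out.extend(r)
    return out
```

At real scale the same pattern applies, but the chunks become file partitions and the executor becomes a distributed framework like Spark.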
What are the common issues in ETL pipelines, and how do you troubleshoot them?
Common issues include data quality errors, performance bottlenecks, schema drift, and connectivity failures. Troubleshooting relies on structured logging, monitoring and alerting, and isolating the failing stage so its input can be inspected and replayed.
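A sketch of the logging side, assuming a hypothetical load step that rejects rows missing a primary key. Logging each rejection with context, plus a summary line, is what makes failures diagnosable later:

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("etl")

def safe_load(records):
    # Count successes and failures instead of aborting on the first bad row.
    loaded, failed = 0, 0
    for rec in records:
        try:
            if "id" not in rec:
                raise ValueError(f"missing primary key: {rec!r}")
            loaded += 1  # a real loader would write the row here
        except ValueError as exc:
            failed += 1
            log.error("row rejected: %s", exc)
    log.info("load finished: %d ok, %d failed", loaded, failed)
    return loaded, failed
```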
How does incremental data loading work in ETL?
Incremental data loading means loading only data that is new or changed since the last run, typically by tracking a high-water mark such as a last-modified timestamp or an auto-incrementing ID. This reduces load times and resource usage compared with full reloads.
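A minimal high-water-mark sketch, assuming each source row carries an `updated_at` value and the watermark is kept in a small state dict (in practice this state would be persisted, e.g. in a metadata table):

```python
def incremental_load(source_rows, state):
    # Only pick up rows newer than the saved watermark.
    watermark = state.get("last_updated", 0)
    new_rows = [r for r in source_rows if r["updated_at"] > watermark]
    # Advance the watermark so the next run skips these rows.
    if new_rows:
        state["last_updated"] = max(r["updated_at"] for r in new_rows)
    return new_rows
```

Running it twice over the same source returns the new rows once, then nothing, which is exactly the behavior that keeps repeat runs cheap.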
What is data partitioning? Why is it important in ETL workflows?
Data partitioning is the process of dividing a dataset into smaller, more manageable pieces, commonly by a date range or a hash of a key. It matters in ETL because queries can skip irrelevant partitions and independent partitions can be processed in parallel.
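A toy sketch of key-based partitioning, here bucketing rows by a hypothetical `date` field (the same idea underlies Hive-style partitioned directories and Parquet partition columns):

```python
from collections import defaultdict

def partition_by_key(rows, key):
    # Group rows into buckets by the chosen partition key.
    parts = defaultdict(list)
    for row in rows:
        parts[row[key]].append(row)
    return dict(parts)
```

A query for one date then only touches that bucket, and each bucket can be transformed by a separate worker.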
How do you ensure data quality in ETL pipelines?
Ensuring data quality involves validation checks (nulls, types, ranges, uniqueness), cleansing and deduplication, and monitoring data integrity throughout the pipeline, with bad records quarantined for review rather than silently dropped.
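A sketch of the validate-and-quarantine pattern, assuming hypothetical rules on `id`, `email`, and `amount` fields. Invalid rows are set aside with their error list so they can be inspected and reprocessed:

```python
def validate(row):
    # Return a list of rule violations; empty list means the row is clean.
    errors = []
    if not isinstance(row.get("id"), int):
        errors.append("id must be an integer")
    if row.get("email") and "@" not in row["email"]:
        errors.append("malformed email")
    if row.get("amount", 0) < 0:
        errors.append("amount must be non-negative")
    return errors

def split_valid(rows):
    # Route clean rows onward; quarantine bad rows with their errors attached.
    good, quarantine = [], []
    for r in rows:
        errs = validate(r)
        if errs:
            quarantine.append({"row": r, "errors": errs})
        else:
            good.append(r)
    return good, quarantine
```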
What is change data capture (CDC), and how does it work?
Change Data Capture (CDC) is a technique for identifying and capturing changes made to data in a database, typically by reading the transaction log (log-based CDC), using triggers, or comparing timestamps or snapshots. The captured changes stream downstream so targets stay in sync without full reloads.
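The consuming side of CDC can be sketched as applying an ordered stream of change events to a target keyed by primary key. This assumes a simplified event shape with `op`, `key`, and `data` fields; real CDC tools such as Debezium emit richer payloads, but the idea is the same:

```python
def apply_changes(target, events):
    # Replay insert/update/delete events against a keyed target table.
    for ev in events:
        op, key = ev["op"], ev["key"]
        if op in ("insert", "update"):
            target[key] = ev["data"]
        elif op == "delete":
            target.pop(key, None)
    return target
```

Because events are applied in log order, the target converges to the source's current state.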
How would you implement a fault-tolerant ETL pipeline?
A fault-tolerant pipeline combines retries with backoff for transient failures, idempotent loads so reruns are safe, checkpointing so a failed run resumes where it stopped instead of restarting, and data backups or dead-letter handling for records that cannot be processed.
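Two of those mechanisms, retry and checkpointing, sketched minimally. The checkpoint here is an in-memory dict; a real pipeline would persist it durably between runs:

```python
import time

def run_with_retry(task, retries=3, delay=0.0):
    # Retry a flaky task; re-raise once the retry budget is exhausted.
    for attempt in range(1, retries + 1):
        try:
            return task()
        except Exception:
            if attempt == retries:
                raise
            time.sleep(delay)

def resumable_load(rows, load_one, checkpoint):
    # Skip rows already recorded in the checkpoint, so a rerun resumes, not restarts.
    for i, row in enumerate(rows):
        if i < checkpoint.get("next", 0):
            continue
        load_one(row)
        checkpoint["next"] = i + 1
```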
Can you explain how job orchestration works in tools like Apache Airflow?
In Apache Airflow, a workflow is defined as a DAG of tasks in Python; the scheduler triggers each task only after its upstream dependencies have succeeded, and handles scheduling, retries, and monitoring per task so the whole workflow executes in the correct order.
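The core of that ordering behavior is dependency resolution over the DAG. A minimal sketch (not Airflow's actual code) that derives a valid execution order from a task-to-upstream-dependencies map:

```python
def topo_order(deps):
    # deps maps each task to the list of tasks that must finish before it.
    order, done = [], set()

    def visit(task):
        # Ensure all upstream tasks are scheduled before this one.
        for up in deps.get(task, []):
            if up not in done:
                visit(up)
        if task not in done:
            done.add(task)
            order.append(task)

    for t in deps:
        visit(t)
    return order
```

In a real Airflow DAG the dependencies come from expressions like `extract >> transform >> load`, and the scheduler additionally tracks task state, retries, and schedules across runs.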