ETL and Data Pipelines Flashcards

1
Q

What is ETL, and how is it different from ELT?

A

ETL stands for Extract, Transform, Load; ELT stands for Extract, Load, Transform. The key difference is where and when transformation happens: in ETL, data is transformed on a separate processing layer before being loaded into the target system, while in ELT, raw data is loaded into the target (typically a cloud data warehouse or lake) first and transformed there using the target's own compute. ETL suits cases where data must be cleaned or anonymized before landing; ELT suits modern warehouses with cheap storage and scalable compute.
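The difference in ordering can be sketched in a few lines. This is a toy illustration with an in-memory list standing in for the warehouse; none of these names come from a real framework.

```python
# Toy contrast of ETL vs ELT: same transform, different point of application.
raw_rows = [{"name": " Alice ", "age": "30"}, {"name": "Bob", "age": "25"}]

def transform(rows):
    # Clean whitespace and cast types.
    return [{"name": r["name"].strip(), "age": int(r["age"])} for r in rows]

def etl(rows):
    warehouse = []
    warehouse.extend(transform(rows))   # transform first, load clean data
    return warehouse

def elt(rows):
    warehouse = list(rows)              # load raw data as-is
    return transform(warehouse)         # then transform "inside" the warehouse
```

Both produce the same clean result; the trade-off is where the transformation compute runs and whether raw data is retained in the target.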

2
Q

How would you design a scalable ETL pipeline for 1TB+ of daily data?

A

A scalable pipeline for 1TB+ per day should use a distributed processing engine (e.g. Spark), efficient columnar storage formats (e.g. Parquet), a partitioned data layout (e.g. by date), and parallel, idempotent tasks so work can be split across many workers and safely retried on failure.
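The core idea, processing independent partitions in parallel, can be sketched without a cluster. This sketch assumes the data arrives as many independent chunks (e.g. hourly files); `transform_chunk` is a placeholder, and a real 1TB workload would use a distributed engine rather than local threads.

```python
from concurrent.futures import ThreadPoolExecutor

def transform_chunk(chunk):
    # Placeholder transform; a real pipeline would clean/enrich each record.
    return [x * 2 for x in chunk]

def run_pipeline(chunks, workers=4):
    # Each chunk (one file or partition) is processed independently,
    # so throughput scales by adding workers.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(transform_chunk, chunks))
```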

3
Q

How do you optimize ETL performance for large datasets?

A

Key techniques: partition the data so workers operate on independent slices; use indexing and predicate pushdown so filters run at the source rather than after extraction; process in parallel and in batches rather than row by row; and minimize data movement by processing close to where the data lives and avoiding unnecessary intermediate copies.
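One of these techniques, batching to cut per-row overhead, is easy to sketch. This is an illustrative helper, not a real driver API: `execute_many` stands in for whatever bulk-insert call the target database provides.

```python
def load_batched(rows, execute_many, batch_size=1000):
    # Group rows so each database round trip carries many rows
    # instead of paying per-row network and transaction overhead.
    batch = []
    for r in rows:
        batch.append(r)
        if len(batch) == batch_size:
            execute_many(batch)
            batch = []
    if batch:
        execute_many(batch)  # flush the final partial batch
```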

4
Q

What are the common issues in ETL pipelines, and how do you troubleshoot them?

A

Common issues include data quality errors (nulls, duplicates, schema drift), performance bottlenecks, and source/target connectivity failures. Troubleshooting relies on structured logging at each stage, monitoring and alerting on row counts and runtimes, and analyzing error messages to isolate which stage failed and why.
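A simple way to get that per-stage visibility is to wrap every step so row counts and failures are always logged with context. A minimal sketch (the step names and functions are illustrative):

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("etl")

def run_step(name, fn, rows):
    # Wrap each pipeline step: log rows in/out on success,
    # log the full traceback with the step name on failure.
    try:
        out = fn(rows)
        log.info("%s: %d rows in, %d rows out", name, len(rows), len(out))
        return out
    except Exception:
        log.exception("step %r failed", name)
        raise
```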

5
Q

How does incremental data loading work in ETL?

A

Incremental loading means loading only rows that are new or changed since the last run, typically tracked with a watermark column (e.g. an updated_at timestamp or a monotonically increasing ID) or via change data capture. This sharply reduces load times and resource usage compared with full reloads.
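A watermark-based extract can be sketched as below. Integer timestamps and the `updated_at` field name are illustrative; in practice the watermark would be persisted (e.g. in a metadata table) between runs.

```python
def incremental_extract(rows, last_watermark):
    # Pull only rows modified after the saved watermark,
    # then advance the watermark for the next run.
    new_rows = [r for r in rows if r["updated_at"] > last_watermark]
    new_watermark = max(
        (r["updated_at"] for r in new_rows), default=last_watermark
    )
    return new_rows, new_watermark
```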

6
Q

What is data partitioning? Why is it important in ETL workflows?

A

Data partitioning divides a dataset into smaller, independent pieces, commonly by date or by a hash of a key. It matters in ETL because partitions can be processed in parallel, queries can prune partitions they don't need, and a failed partition can be reprocessed without redoing the whole load.
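A minimal sketch of key-based partitioning (the `date` field is just an example partition key):

```python
from collections import defaultdict

def partition_by(rows, key):
    # Group rows by a partition key; each group can then be
    # processed, stored, or reprocessed independently.
    parts = defaultdict(list)
    for r in rows:
        parts[r[key]].append(r)
    return dict(parts)
```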

7
Q

How do you ensure data quality in ETL pipelines?

A

Ensuring data quality involves validation checks (schema, nulls, value ranges, uniqueness), cleansing and deduplication steps, quarantining rows that fail checks rather than silently dropping them, and monitoring integrity metrics such as row counts and checksums throughout the pipeline.

8
Q

What is change data capture (CDC), and how does it work?

A

Change Data Capture (CDC) is a technique for identifying and capturing changes (inserts, updates, deletes) made to data in a source database so that only those changes are propagated downstream. Common implementations read the database's transaction log (log-based CDC), compare snapshots, or use timestamp/version columns; log-based CDC enables near-real-time replication with minimal impact on the source.
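The simplest variant, snapshot-diff CDC, can be sketched directly (log-based CDC instead reads the database's write-ahead log and is not shown here). Snapshots are modeled as dicts keyed by primary key:

```python
def diff_snapshots(old, new):
    # Compare two snapshots keyed by primary key and emit
    # CDC-style change events: (operation, key, row).
    events = []
    for k, row in new.items():
        if k not in old:
            events.append(("insert", k, row))
        elif old[k] != row:
            events.append(("update", k, row))
    for k, row in old.items():
        if k not in new:
            events.append(("delete", k, row))
    return events
```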

9
Q

How would you implement a fault-tolerant ETL pipeline?

A

A fault-tolerant pipeline combines retry mechanisms with backoff for transient failures, idempotent tasks so retries are safe, checkpointing so a failed run resumes from the last good state instead of restarting, and backups or dead-letter queues for data that cannot be processed.
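The retry piece can be sketched as a small wrapper; a production version would use exponential backoff with jitter and retry only known-transient errors, rather than catching everything as here.

```python
import time

def with_retries(fn, attempts=3, delay=0.0):
    # Re-run fn on failure, up to `attempts` times; re-raise the
    # last error if every attempt fails. fn must be idempotent.
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise
            time.sleep(delay)
```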

10
Q

Can you explain how job orchestration works in tools like Apache Airflow?

A

In Apache Airflow, workflows are defined as DAGs (directed acyclic graphs) of tasks. The scheduler resolves dependencies and launches each task only after its upstream tasks succeed, while handling schedule intervals, retries, and backfills; a web UI provides monitoring and manual reruns.
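The core scheduling idea, run tasks in dependency order, can be sketched with the standard library (this is not Airflow's API; Airflow declares the same structure with DAG and operator objects, and adds scheduling, retries, and a UI on top):

```python
from graphlib import TopologicalSorter

def run_dag(tasks, deps):
    # tasks: name -> callable; deps: name -> set of upstream task names.
    # Topologically sort so every task runs after its dependencies.
    order = list(TopologicalSorter(deps).static_order())
    results = {name: tasks[name]() for name in order}
    return order, results
```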
