1. ETL and Data Pipelines Flashcards
What is ETL, and how is it different from ELT?
ETL stands for Extract, Transform, Load; ELT stands for Extract, Load, Transform. The key difference is where and when transformation happens: in ETL, data is transformed in a separate processing layer before being loaded into the target (typical for traditional data warehouses), whereas in ELT, raw data is loaded into the target first and transformed there, using the compute of modern cloud warehouses and data lakes.
How would you design a scalable ETL pipeline for 1TB+ of daily data?
Scalable ETL Pipeline Design (1TB+ Daily)
A scalable ETL pipeline for this volume should combine a distributed processing engine, efficient storage formats, and horizontally scalable infrastructure. The key components are outlined below; a code sketch of the batch path follows the list.
- Distributed Computing Framework
  - Horizontal scaling
  - Clustered infrastructure
- Ingestion Layer
  - Streaming (Kafka, Kinesis) and batch processing
  - Robust error handling
- Storage Strategy
  - Data lake architecture with tiered storage
  - Compressed, columnar formats (Parquet)
- Transformation
  - Modular design
  - Parallel processing
  - Idempotent transformations
- Scaling and Reliability
  - Elastic resource scaling and dynamic resource allocation
  - Fault tolerance
  - Distributed caching
  - Adaptive query execution
  - Continuous monitoring
- Typical Tooling
  - Apache Spark/Flink for processing
  - Cloud storage for the data lake
  - Kafka/Kinesis for ingestion
  - Apache Airflow for orchestration
- Design Principles
  - Modularity, scalability, cost-efficiency, and security
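As a rough illustration of the batch path, here is a minimal PySpark sketch that reads raw JSON from a data lake, applies an idempotent transformation, and writes compressed, partitioned Parquet. The paths and column names (`RAW_PATH`, `event_id`, `event_timestamp`) are hypothetical.

```python
from pyspark.sql import SparkSession, functions as F

# Hypothetical lake paths; in practice these come from pipeline configuration.
RAW_PATH = "s3://my-data-lake/raw/events/"
CURATED_PATH = "s3://my-data-lake/curated/events/"

spark = (
    SparkSession.builder
    .appName("daily-batch-etl")
    .config("spark.sql.adaptive.enabled", "true")   # adaptive query execution
    .getOrCreate()
)

# Extract: read the day's raw JSON from the ingestion layer's landing zone.
raw = spark.read.json(RAW_PATH)

# Transform: keep logic modular and idempotent (pure column expressions plus
# deduplication on a business key) so re-runs produce the same result.
curated = (
    raw.dropDuplicates(["event_id"])
       .withColumn("event_date", F.to_date("event_timestamp"))
       .filter(F.col("event_date").isNotNull())
)

# Load: write compressed, columnar output partitioned by date so downstream
# queries can prune partitions instead of scanning the full dataset.
curated.write.mode("overwrite").partitionBy("event_date").parquet(CURATED_PATH)
```

Partitioning the output by date and enabling adaptive query execution keep both writes and downstream reads manageable as daily volume grows.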
How do you optimize ETL performance for large datasets?
Optimizing ETL performance can be achieved through data partitioning, indexing, parallel processing, and minimizing data movement.
Efficient Data Partitioning – Use partitioning and bucketing in storage (e.g., Hive, Delta, Parquet) to improve query performance.
Parallel Processing – Leverage distributed frameworks like Apache Spark or AWS Glue to process data in parallel.
Incremental Processing – Use CDC (Change Data Capture) or delta processing to avoid full reloads.
Optimized Storage Formats – Store data in columnar formats like Parquet or ORC for better compression and faster reads.
Indexing & Caching – Use indexes, materialized views, and caching layers to speed up lookups.
Memory & Compute Tuning – Adjust batch sizes, optimize shuffle operations, and tune resources (CPU, memory) for efficiency.
Pipeline Monitoring & Auto-scaling – Continuously monitor performance and enable auto-scaling in cloud environments (a brief tuning sketch follows this list).
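A hedged PySpark fragment showing a few of these techniques together: tuned shuffle parallelism, a broadcast join for a small dimension table, selective caching, and partitioned columnar output. Table paths and column names are assumed for illustration.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("etl-tuning-demo").getOrCreate()

# Tune shuffle parallelism instead of relying on the default 200 partitions.
spark.conf.set("spark.sql.shuffle.partitions", "400")

facts = spark.read.parquet("s3://lake/facts/")       # large fact table (assumed path)
dims = spark.read.parquet("s3://lake/dim_country/")  # small dimension table (assumed path)

# Broadcast the small side so the join avoids a full shuffle of the fact table.
enriched = facts.join(F.broadcast(dims), on="country_id", how="left")

# Cache only when the result feeds multiple downstream steps.
enriched.cache()

# Columnar, partitioned output keeps downstream reads fast and cheap.
enriched.write.mode("overwrite").partitionBy("load_date").parquet("s3://lake/enriched/")
```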
What are the common issues in ETL pipelines, and how do you troubleshoot them?
Common issues include data quality errors, performance bottlenecks, and connectivity problems. Troubleshooting involves logging, monitoring, and analyzing error messages.
Data Quality Issues (Missing, Duplicate, Corrupt Data)
Cause: Inconsistent source data, schema changes, or faulty transformations.
Fix: Implement schema validation, deduplication, and data profiling tools.
Performance Bottlenecks (Slow Processing, High Latency)
Cause: Inefficient queries, excessive I/O operations, or resource underutilization.
Fix: Optimize queries, use columnar storage (Parquet/ORC), partition data, and enable parallel processing.
Pipeline Failures (Job Crashes, Incomplete Loads)
Cause: Hardware failures, incorrect configurations, or network issues.
Fix: Implement retry logic, checkpointing, and alerting mechanisms for proactive monitoring.
Schema Drift (Unexpected Source Changes)
Cause: New columns, data type changes, or missing fields in source data.
Fix: Use schema evolution techniques, enable auto-schema detection, and maintain version control.
Data Loss or Duplication
Cause: Improper deduplication logic, incorrect join conditions, or missing idempotency checks.
Fix: Implement primary keys, hashing, and CDC techniques to track changes accurately.
Resource Overuse (High Memory/CPU Consumption)
Cause: Large shuffles, unoptimized joins, or insufficient memory allocation.
Fix: Tune batch sizes, optimize joins (broadcast joins), and configure proper resource limits.
Integration Issues (APIs, Third-party Systems Failing)
Cause: API rate limits, authentication failures, or format mismatches.
Fix: Implement retries, use exponential backoff, and validate response formats before ingestion.
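To make the last point concrete, here is a small sketch of retries with exponential backoff and response validation for an API source; the endpoint, payload shape, and retry limits are assumptions.

```python
import random
import time

import requests  # assumed HTTP client for a REST source

def fetch_with_retries(url, max_attempts=5, base_delay=1.0):
    """Fetch one page of source data, retrying transient failures with
    exponential backoff plus jitter. The endpoint and payload are hypothetical."""
    for attempt in range(1, max_attempts + 1):
        try:
            response = requests.get(url, timeout=30)
            # Treat rate limits and server errors as retryable.
            if response.status_code in (429, 500, 502, 503, 504):
                raise RuntimeError(f"retryable status {response.status_code}")
            response.raise_for_status()
            payload = response.json()
            # Validate the response shape before handing it to the pipeline.
            if "records" not in payload:
                raise ValueError("unexpected response format: missing 'records'")
            return payload["records"]
        except (requests.RequestException, RuntimeError, ValueError):
            if attempt == max_attempts:
                raise  # surface the failure so monitoring/alerting can react
            time.sleep(base_delay * (2 ** (attempt - 1)) + random.uniform(0, 0.5))
```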
How does incremental data loading work in ETL?
Incremental data loading involves loading only the new or changed data since the last ETL run, typically tracked with a high-water mark (such as a timestamp or sequence ID) or via change data capture, which reduces load times and resource usage.
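A minimal sketch of timestamp-based incremental loading, assuming a relational source with an `orders` table and an `updated_at` column (both hypothetical); the returned watermark would be persisted between runs.

```python
import sqlite3  # stand-in for any SQL source; connection details are assumed

def extract_increment(conn, last_watermark):
    """Pull only rows changed since the previous run and return the new
    high-water mark to persist for the next run (table/columns are hypothetical)."""
    cursor = conn.execute(
        "SELECT id, amount, updated_at FROM orders WHERE updated_at > ?",
        (last_watermark,),
    )
    rows = cursor.fetchall()
    # ... transform and load `rows` into the target here ...
    new_watermark = max((row[2] for row in rows), default=last_watermark)
    return rows, new_watermark
```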
What is data partitioning? Why is it important in ETL workflows?
Data partitioning is the process of dividing a large dataset into smaller, more manageable chunks based on specific criteria such as range, hash, or list values. This improves query performance, parallel processing, and storage efficiency in ETL workflows.
Importance in ETL Workflows:
Performance Optimization – Reduces the amount of data scanned in queries, leading to faster processing.
Parallelism – Enables distributed processing across multiple nodes, improving throughput.
Efficient Data Retrieval – Queries can target specific partitions instead of scanning the entire dataset.
Cost Reduction – In cloud environments, partition pruning minimizes resource usage, lowering costs.
Better Data Management – Helps in organizing data logically, improving maintainability and archival processes (a partition-pruning example follows).
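For example, if a dataset was written partitioned by `event_date`, a filter on that column lets the engine prune partitions instead of scanning everything. A short PySpark sketch, with an assumed path and column names:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("partition-pruning-demo").getOrCreate()

# Assume events were written with .partitionBy("event_date") into this path.
events = spark.read.parquet("s3://lake/curated/events/")

# Filtering on the partition column lets Spark prune partitions: only the
# directories for 2024-06-01 are scanned rather than the whole dataset.
one_day = events.filter(F.col("event_date") == "2024-06-01")
one_day.groupBy("country").count().show()
```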
How do you ensure data quality in ETL pipelines?
Ensuring data quality involves implementing validation checks, data cleansing processes, and monitoring data integrity throughout the ETL pipeline.
Data Validation – Use schema enforcement, constraints, and automated tests to catch anomalies.
Deduplication & Cleaning – Handle missing values, standardize formats, and remove duplicates.
Logging & Monitoring – Track ETL jobs, set alerts for failures, and analyze logs for inconsistencies.
Idempotency & Checkpointing – Ensure re-runs don’t cause duplicate processing or data corruption.
Reconciliation & Auditing – Compare source and destination records, maintain data lineage, and generate quality reports.
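A minimal PySpark sketch of automated quality gates before loading, assuming a staging path and an `orders` dataset with `order_id` and `amount` columns (all hypothetical):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dq-checks-demo").getOrCreate()
orders = spark.read.parquet("s3://lake/staging/orders/")  # assumed staging path

# Basic quality gates: nulls in required keys, duplicate business keys, and
# out-of-range values. Fail loudly rather than loading bad data downstream.
null_ids = orders.filter(F.col("order_id").isNull()).count()
dup_ids = orders.groupBy("order_id").count().filter(F.col("count") > 1).count()
bad_amounts = orders.filter(F.col("amount") < 0).count()

if null_ids or dup_ids or bad_amounts:
    raise ValueError(
        f"Data quality check failed: {null_ids} null ids, "
        f"{dup_ids} duplicate ids, {bad_amounts} negative amounts"
    )

# Only clean, deduplicated records proceed to the load step.
orders.dropDuplicates(["order_id"]).write.mode("overwrite").parquet("s3://lake/clean/orders/")
```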
What is change data capture (CDC), and how does it work?
Change Data Capture (CDC) is a technique used to identify and track changes in a database in real-time or near real-time. It enables efficient data synchronization between source and target systems without requiring a full data reload.
How CDC Works:
Log-Based CDC – Monitors database transaction logs (e.g., MySQL binlog, PostgreSQL WAL, SQL Server CDC) to capture inserts, updates, and deletes with minimal overhead.
Trigger-Based CDC – Uses database triggers to capture changes but may introduce performance overhead.
Timestamp-Based CDC – Relies on a timestamp column to fetch only new or modified records since the last extraction.
Snapshot-Based CDC – Compares the entire dataset at intervals but is less efficient for high-volume data.
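As a rough sketch of how a downstream consumer applies captured changes, the snippet below processes a simplified, Debezium-style change event (the real envelope carries more metadata) against an in-memory stand-in for the target table:

```python
import json

# A simplified, Debezium-style change event: "op" is c(reate), u(pdate), or
# d(elete); "before"/"after" hold the row images. Real events carry more metadata.
event = json.loads(
    '{"op": "u",'
    ' "before": {"id": 42, "status": "pending"},'
    ' "after": {"id": 42, "status": "shipped"}}'
)

# In-memory stand-in for the target table, keyed by primary key.
target = {42: {"id": 42, "status": "pending"}}

def apply_change(event, target):
    """Apply one captured change so the target stays in sync with the source."""
    if event["op"] in ("c", "u"):      # insert or update: upsert the new row image
        row = event["after"]
        target[row["id"]] = row
    elif event["op"] == "d":           # delete: drop the old row image
        target.pop(event["before"]["id"], None)

apply_change(event, target)
print(target)  # {42: {'id': 42, 'status': 'shipped'}}
```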
Why CDC Is Important:
Enables real-time ETL and analytics by continuously propagating changes.
Reduces data transfer costs by processing only incremental updates.
Improves data consistency in data lakes, warehouses, and streaming pipelines.
Supports event-driven architectures in systems like Kafka, Debezium, and cloud-based data replication solutions.
CDC is widely used in data replication, ETL optimization, streaming architectures, and microservices to ensure efficient and up-to-date data movement.
How would you implement a fault-tolerant ETL pipeline?
A fault-tolerant ETL pipeline can be implemented using retry mechanisms, checkpointing, idempotent tasks, data backups, and monitoring/alerting, so that failed runs can be resumed or re-executed without data loss or duplication.
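A minimal sketch of checkpointing plus retries for a batch pipeline, assuming batches are processed in order and each batch is idempotent; the checkpoint file and batch model are hypothetical.

```python
import json
import os
import time

CHECKPOINT_FILE = "etl_checkpoint.json"  # assumed checkpoint store (often a table or object store)

def load_checkpoint():
    """Return the last successfully processed batch id, or 0 when starting fresh."""
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE) as f:
            return json.load(f)["last_batch"]
    return 0

def save_checkpoint(batch_id):
    with open(CHECKPOINT_FILE, "w") as f:
        json.dump({"last_batch": batch_id}, f)

def process_batch(batch_id):
    """Placeholder for real extract/transform/load work; it must be idempotent
    so a retried batch does not duplicate data."""
    print(f"processing batch {batch_id}")

def run_pipeline(total_batches, max_retries=3):
    start = load_checkpoint() + 1             # resume after the last good batch
    for batch_id in range(start, total_batches + 1):
        for attempt in range(1, max_retries + 1):
            try:
                process_batch(batch_id)
                save_checkpoint(batch_id)     # record progress only after success
                break
            except Exception:
                if attempt == max_retries:
                    raise                     # surface for alerting once retries are exhausted
                time.sleep(2 ** attempt)      # simple exponential backoff

run_pipeline(total_batches=5)
```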
Can you explain how job orchestration works in tools like Apache Airflow?
In Apache Airflow, workflows are defined as DAGs (directed acyclic graphs) of tasks with explicit dependencies. The scheduler triggers DAG runs on a schedule, executes tasks in dependency order, and handles retries on failure, while the web UI provides monitoring and alerting.
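A minimal Airflow DAG sketch (assuming Airflow 2.x) showing how scheduling, retries, and task ordering are declared; the DAG id and task bodies are placeholders.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

# Placeholder task callables; real pipelines would call extract/transform/load code.
def extract():
    print("extracting source data")

def transform():
    print("transforming staged data")

def load():
    print("loading into the warehouse")

default_args = {
    "retries": 2,                          # automatic retries on task failure
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="daily_etl",                    # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",            # run once per day
    default_args=default_args,
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load_task = PythonOperator(task_id="load", python_callable=load)

    # Dependencies define the execution order the scheduler enforces.
    extract_task >> transform_task >> load_task
```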