Data Engineering Flashcards
What is Data Engineering?
Data engineering is about designing, building, and managing systems that handle data efficiently. It involves collecting, storing, processing, and transforming raw data into meaningful information that businesses can use.
What is the primary focus of Data Engineering?
Data engineering focuses on building data pipelines and managing data infrastructure.
What does a Data Analyst do?
Analyzes and interprets data to extract insights for business decisions
What is the role of Data Scientists?
Analyzes data and builds statistical or machine learning models to derive insights and predictions
How do Data Engineers support Data Scientists?
Ensure data is clean, reliable, and available
What does a DBA (Database Administrator) do?
Manages and optimizes databases
What are examples of databases used in data engineering?
- PostgreSQL
- MySQL
- SQL Server
What are examples of Data Warehouses?
- Snowflake
- Redshift
- BigQuery
What are examples of Data Lakes?
- Azure Data Lake
- AWS S3
- HDFS
What are examples of NoSQL databases?
- MongoDB
- Cassandra
What does Cloud Computing provide?
On-demand access to computing resources
How do data engineers use cloud platforms?
To build scalable data pipelines and warehouses
Difference between Batch and Streaming Data Processing?
- Batch Processing: Processes large amounts of data at scheduled intervals (e.g., ADF, Apache Spark)
- Streaming Processing: Processes data in real time as it arrives (e.g., Kafka, Azure Stream Analytics)
Scenario: Use batch processing for end-of-day sales reports and stream processing for real-time fraud detection.
What is the Ingestion Layer in data architecture?
Collects data from multiple sources (e.g., APIs, databases, logs)
What is the Storage Layer in data architecture?
Stores raw, processed, and structured data (e.g., Data Lake, Data Warehouse).
What is the Processing Layer in data architecture?
Transforms and enriches data (e.g., Spark, Databricks, ADF).
What is the Serving Layer in data architecture?
Exposes data for analysis (e.g., Power BI, Tableau).
What is Slowly Changing Dimension (SCD)?
A method to handle historical changes in dimension tables
What are the types of Slowly Changing Dimensions?
- SCD Type 1: Overwrites old data (e.g., updating a customer address)
- SCD Type 2: Keeps history with new records
- SCD Type 3: Stores previous and current values in separate columns
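A minimal SCD Type 2 sketch using Python's built-in sqlite3 module; the dim_customer table, its columns, and the surrogate-key approach are illustrative assumptions rather than a specific warehouse implementation:

```python
import sqlite3
from datetime import date

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE dim_customer (
        customer_key INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
        customer_id  INTEGER,                            -- business key
        address      TEXT,
        start_date   TEXT,
        end_date     TEXT,
        is_current   INTEGER
    )
""")
conn.execute(
    "INSERT INTO dim_customer (customer_id, address, start_date, end_date, is_current) "
    "VALUES (101, 'Old Street 1', '2023-01-01', NULL, 1)"
)

def apply_scd2(conn, customer_id, new_address):
    """Close the current record and insert a new one, preserving history."""
    today = date.today().isoformat()
    conn.execute(
        "UPDATE dim_customer SET end_date = ?, is_current = 0 "
        "WHERE customer_id = ? AND is_current = 1",
        (today, customer_id),
    )
    conn.execute(
        "INSERT INTO dim_customer (customer_id, address, start_date, end_date, is_current) "
        "VALUES (?, ?, ?, NULL, 1)",
        (customer_id, new_address, today),
    )
    conn.commit()

apply_scd2(conn, 101, "New Avenue 2")
# Both the old and the new address are kept; is_current marks the active row.
```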
What is Data Partitioning?
Splits large datasets into smaller chunks for faster queries and processing
What are the types of Data Partitioning?
- Horizontal Partitioning: Divides by rows (e.g., by date, region)
- Vertical Partitioning: Stores specific columns separately
Scenario: Partitioning sales data by year (year=2023, year=2024) to speed up queries for a specific year.
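A sketch of that partitioned layout with PySpark (assuming pyspark is installed; the input file and column names are hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-demo").getOrCreate()

# Hypothetical sales dataset that includes a 'year' column.
sales = spark.read.csv("sales.csv", header=True, inferSchema=True)

# Writes one folder per year (year=2023/, year=2024/, ...); queries filtered
# on year read only the matching folders (partition pruning).
sales.write.mode("overwrite").partitionBy("year").parquet("output/sales_partitioned")
```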
What are the benefits of Data Partitioning?
Improves query performance, reduces scan time, and enhances parallel processing
Difference between OLTP and OLAP?
- OLTP (Online Transaction Processing): Used for real-time transactions (e.g., banking, e-commerce). Handles transactional data (e.g., inserting, updating records). Used for day-to-day operations
- OLAP: Used for analytics and reporting (e.g., data warehouses). Handles analytical queries on historical data. Used for business intelligence.
- OLTP databases are normalized, while OLAP systems use denormalized schemas for fast queries
What is Change Data Capture (CDC)?
Captures changes in a database and propagates them to a target system.
Used in real-time data replication and incremental ETL
What is the purpose of a schema in databases?
A schema defines the structure of data (e.g., tables, columns, data types). It ensures data consistency and helps in querying and analysis.
What is SQL?
SQL (Structured Query Language) is used to query and manipulate relational databases. Data engineers use it to extract, transform, and analyze data.
Difference between structured and unstructured, semi-structured data?
- Structured Data: Data that is organized in a fixed format (schema), like tables with rows and columns. (e.g., SQL tables)
- Unstructured Data: Data with no predefined structure (e.g., images, videos, free text), which makes it harder to organize and analyze.
- Semi-structured Data: Data that has some structure but doesn’t fit neatly into tables. It often includes tags or markers to separate data elements. (e.g., XML, JSON, NoSQL databases)
What is a NoSQL database?
NoSQL databases store unstructured or semi-structured data and are highly scalable. Examples include MongoDB and Cassandra. Use them for flexible schemas and high-speed data access.
What is version control?
Version control tracks changes to code and scripts. It ensures collaboration, reproducibility, and rollback in case of errors.
What is the role of APIs in data engineering?
APIs allow data engineers to extract data from external systems (e.g., social media platforms, payment gateways).
Difference between primary key and foreign key?
- Primary Key: Uniquely identifies a record
- Foreign Key: Links two tables by referencing the primary key of another table
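A small sqlite3 illustration; the customers and orders tables are made up for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

conn.execute("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,   -- primary key: uniquely identifies each customer
        name        TEXT NOT NULL
    )
""")
conn.execute("""
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL,
        amount      REAL,
        FOREIGN KEY (customer_id) REFERENCES customers (customer_id)  -- links to customers
    )
""")

conn.execute("INSERT INTO customers VALUES (1, 'Aye Aye')")
conn.execute("INSERT INTO orders VALUES (10, 1, 99.50)")    # OK: customer 1 exists
# conn.execute("INSERT INTO orders VALUES (11, 2, 10.00)")  # would fail: no customer 2
```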
What is Data Normalization?
Data normalization organizes data to reduce redundancy and improve integrity. It involves splitting tables and defining relationships between them.
What is Data Replication?
Data replication is copying data to multiple locations to ensure availability and fault tolerance. It’s important for disaster recovery and high availability.
What is a Data Model?
Defines how data is structured and related. To design one:
* Identify entities (e.g., customers, orders).
* Define relationships (e.g., one customer can place many orders).
* Normalize data to reduce redundancy.
What is a Star Schema?
- A star schema is a data warehouse design with a central fact table connected to dimension tables.
- Dimension tables are linked to the fact table using foreign keys.
- The layout looks like a star, with the fact table at the center
- Simple and easy to understand.
- It’s used for fast query performance in analytical workloads
What is a Snowflake Schema?
- Similar to a star schema but more detailed
- Dimension tables are further broken down into sub-dimension tables.
- Data is normalized, reducing redundancy.
- The layout looks like a snowflake due to multiple table layers.
- More complex and scalable
- Ideal for larger datasets requiring detailed analysis
What is Data Governance?
Managing data quality, security, and compliance. It’s important to ensure data is accurate, secure, and used responsibly
What is a Fact Table?
- The central table in the schema.
- Contains the data you want to analyze, like sales numbers or transaction amounts.
- Each row represents a specific event or transaction
What are Dimension Tables?
- Provide context to the data in the fact table.
- Contain descriptive information or attributes.
- Each row in a dimension table represents a unique value or category.
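A compact star-schema sketch in sqlite3 with one fact table and one dimension table; the names and values are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE fact_sales  (sale_id INTEGER PRIMARY KEY,
                              product_key INTEGER,      -- foreign key to dim_product
                              amount REAL);

    INSERT INTO dim_product VALUES (1, 'Books'), (2, 'Games');
    INSERT INTO fact_sales  VALUES (100, 1, 12.0), (101, 1, 8.0), (102, 2, 30.0);
""")

# Typical analytical query: aggregate fact measures grouped by a dimension attribute.
rows = conn.execute("""
    SELECT d.category, SUM(f.amount) AS total_sales
    FROM fact_sales f
    JOIN dim_product d ON d.product_key = f.product_key
    GROUP BY d.category
""").fetchall()
print(rows)  # e.g. [('Books', 20.0), ('Games', 30.0)]
```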
What are Parquet and ORC files?
Columnar storage formats optimized for big data processing. They reduce storage costs and improve query performance
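A quick Parquet example with pandas (writing Parquet needs pyarrow or fastparquet installed; file and column names are arbitrary):

```python
import pandas as pd

df = pd.DataFrame({"order_id": [1, 2, 3], "amount": [9.99, 25.00, 7.50]})

# Columnar, compressed on-disk format.
df.to_parquet("orders.parquet", compression="snappy")

# The columnar layout lets readers load only the columns they need.
amounts = pd.read_parquet("orders.parquet", columns=["amount"])
print(amounts)
```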
What is Data Deduplication?
Removing duplicate records from a dataset. It’s achieved by identifying unique keys or using tools like Apache Spark
Difference between horizontal and vertical scaling?
- Horizontal Scaling: Adding more machines to a system.
- Vertical Scaling: Adding more resources (e.g., CPU, RAM) to a single machine
Difference between Full Load and Incremental Load?
- Full Load: Replaces the entire dataset in the target system
- Incremental Load: Updates only new or changed data
Scenario: Use a full load for initial data migration and incremental loads for daily updates.
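A minimal incremental-load sketch driven by a "last updated" watermark column, using sqlite3; the source and target databases, the orders table, and the updated_at column are all assumptions:

```python
import sqlite3

src = sqlite3.connect("source.db")   # assumed to contain an 'orders' table
tgt = sqlite3.connect("target.db")
tgt.execute(
    "CREATE TABLE IF NOT EXISTS orders (order_id INTEGER PRIMARY KEY, amount REAL, updated_at TEXT)"
)

# 1. Find the high-water mark already loaded into the target.
last = tgt.execute("SELECT COALESCE(MAX(updated_at), '1970-01-01') FROM orders").fetchone()[0]

# 2. Pull only rows changed since then from the source.
changed = src.execute(
    "SELECT order_id, amount, updated_at FROM orders WHERE updated_at > ?", (last,)
).fetchall()

# 3. Upsert the changed rows into the target.
with tgt:
    tgt.executemany(
        "INSERT INTO orders (order_id, amount, updated_at) VALUES (?, ?, ?) "
        "ON CONFLICT(order_id) DO UPDATE SET amount = excluded.amount, updated_at = excluded.updated_at",
        changed,
    )
```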
What is Big Data?
- Refers to extremely large datasets that come from various sources and require advanced tools to store, process, and analyze effectively.
- It includes structured, semi-structured, and unstructured data that can range from terabytes to petabytes or even exabytes
Key Characteristics of Big Data
- Volume: The massive amount of data generated every second from multiple sources.
- Variety: Data comes in different formats like text, images, videos, audio, and sensor data.
- Velocity: Data is generated at high speed, requiring quick processing.
- Veracity: The accuracy and reliability of the data.
- Value: Extracting meaningful insights from data creates business value.
What does ACID stand for?
- Atomicity
- Consistency
- Isolation
- Durability
Together, these properties ensure that database transactions are reliable and consistent.
Atomicity
A transaction is a single unit of work, meaning it must either be completed fully or not at all. For example, if you’re transferring money between two bank accounts, both the debit and credit actions must happen together. If one fails, both should fail to keep the system consistent
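A small atomicity demonstration with sqlite3: the transfer either applies completely or not at all (accounts and amounts are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL CHECK (balance >= 0))")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100.0), (2, 50.0)])
conn.commit()

try:
    with conn:  # opens a transaction; commits on success, rolls back on exception
        conn.execute("UPDATE accounts SET balance = balance - 200 WHERE id = 1")  # violates CHECK
        conn.execute("UPDATE accounts SET balance = balance + 200 WHERE id = 2")
except sqlite3.IntegrityError:
    pass  # the whole transfer was rolled back, nothing was applied

print(conn.execute("SELECT balance FROM accounts ORDER BY id").fetchall())  # [(100.0,), (50.0,)]
```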
Consistency
A transaction should bring the database from one valid state to another. The database should always follow its rules and constraints. For example, if a rule says a person’s age must be over 18, a transaction trying to add someone younger than 18 should be stopped.
Isolation
Transactions should not interfere with each other. Each transaction must happen independently, so that no other transaction affects it. For example, if two people try to withdraw money from the same account at the same time, the system must process the withdrawals one at a time so they do not conflict.
Durability
Once a transaction is completed, it must be permanent, even if there’s a system failure or crash. The changes made by the transaction should be saved and not lost.
What is the role of a data engineer in data security?
Ensures data is secure by implementing encryption, access controls, and monitoring systems.
How to handle duplicate records in a dataset?
Use SQL DISTINCT or ROW_NUMBER() to remove duplicates
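A ROW_NUMBER() deduplication sketch in sqlite3, keeping the most recently loaded row per business key; the orders table and loaded_at column are assumptions (window functions need SQLite 3.25+):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL, loaded_at TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 10.0, "2024-01-01"), (1, 10.0, "2024-01-02"), (2, 5.0, "2024-01-01")],
)

# Rank duplicates per order_id, newest first, and delete everything ranked below 1.
conn.execute("""
    DELETE FROM orders
    WHERE rowid IN (
        SELECT rowid FROM (
            SELECT rowid,
                   ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY loaded_at DESC) AS rn
            FROM orders
        )
        WHERE rn > 1
    )
""")
print(conn.execute("SELECT * FROM orders").fetchall())  # one row per order_id remains
```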
How to optimize a slow SQL query?
- Use Indexes to speed up searches.
- Choose the right join type and select only the necessary columns in JOINs.
- Use Partitioning for large tables.
- Avoid SELECT * (fetch only required columns).
- Check Query Execution Plan for bottlenecks.
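A quick look at the index and execution-plan points using sqlite3 and EXPLAIN QUERY PLAN; the sales table is hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, region TEXT, amount REAL)")

# Without an index this query scans the whole table.
print(conn.execute("EXPLAIN QUERY PLAN SELECT amount FROM sales WHERE region = 'EU'").fetchall())

conn.execute("CREATE INDEX idx_sales_region ON sales (region)")

# With the index, the plan switches to an index search instead of a full scan.
print(conn.execute("EXPLAIN QUERY PLAN SELECT amount FROM sales WHERE region = 'EU'").fetchall())
```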
What is the purpose of using indexes in SQL queries?
Indexes improve the speed of data retrieval operations on a database table.
Indexes allow the database to find rows more quickly without scanning the entire table.
What should you check if a job’s execution time has increased significantly?
- Check data volume
- Review query performance
- Check resource allocation
- Optimize partitions and indexing
- Monitor network & I/O bottlenecks
Assessing these factors can help identify the cause of performance issues.
What is a recommended approach for loading 1 billion records into a database?
- Use bulk inserts instead of row-by-row inserts.
- Use partitioning and indexing to speed up writes.
- Disable indexes and constraints before load and re-enable after.
- Use parallel processing (e.g., Apache Spark, ADF).
- Store data in compressed formats like Parquet for efficiency.
Bulk inserts are more efficient for loading large datasets.
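A small-scale sketch of the bulk-insert idea with sqlite3 (one transaction, batched inserts); the table and generated rows are placeholders, and at billion-row scale a distributed engine or the database's native bulk loader would take this role:

```python
import sqlite3

conn = sqlite3.connect("warehouse.db")
conn.execute("CREATE TABLE IF NOT EXISTS events (id INTEGER, payload TEXT)")

rows = ((i, f"payload-{i}") for i in range(1_000_000))  # stand-in for the real source

# One transaction plus batched inserts is far faster than committing row by row.
with conn:
    conn.executemany("INSERT INTO events VALUES (?, ?)", rows)
```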
How can you ensure data quality in a data pipeline?
- Validation Checks
- Monitoring and Logging
- Unit Testing
- Data Profiling
These steps help maintain data integrity throughout the pipeline.
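A few example validation checks with pandas; the column names and rules are illustrative, and real pipelines often delegate this to a dedicated data-quality tool:

```python
import pandas as pd

def validate_orders(df: pd.DataFrame) -> list[str]:
    """Return a list of data-quality problems found in the batch."""
    problems = []
    if df["order_id"].isna().any():
        problems.append("null order_id values")
    if df["order_id"].duplicated().any():
        problems.append("duplicate order_id values")
    if (df["amount"] < 0).any():
        problems.append("negative amounts")
    return problems

batch = pd.DataFrame({"order_id": [1, 2, 2], "amount": [10.0, -5.0, 3.0]})
issues = validate_orders(batch)
if issues:
    # In a pipeline this would raise, alert, or route bad rows to a quarantine table.
    print("validation failed:", issues)
```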
What are common failures in a data pipeline?
- Network Failures
- Schema Changes
- Data Skew
- Job Failures
Each type of failure requires specific handling strategies.
What is a data warehouse?
A data warehouse is a centralized repository for structured data, optimized for querying and analysis.
It provides a unified view of data for decision-making.
What are best practices for designing a Data Warehouse?
- Use Star Schema or Snowflake Schema
- Optimize storage using columnar formats
- Implement partitioning and clustering
- Design for scalability
These practices enhance efficiency and performance.
What is the difference between a Data Lake and a Data Warehouse?
Data Lake: Stores raw data, unstructured/semi-structured, without a predefined schema, allowing data in any format.
Data Warehouse: Stores structured data, organized into tables with a predefined schema, optimized for analytical queries.
Data Lakehouse (e.g., Delta Lake, Snowflake) combines the features of a data lake and a data warehouse, enabling both storage and analytics.
What is the role of Apache Spark in data engineering?
Apache Spark is a distributed computing framework used for processing large datasets. It is faster than Hadoop MapReduce and supports both batch and real-time processing.
What does Docker do in data engineering?
Docker is used to containerize applications, ensuring consistency across environments. Data engineers use it to deploy ETL pipelines and data processing tools.
How do you handle late-arriving data in a streaming pipeline?
- Use watermarking
- Store late data in a separate table
- Use event-time processing
These strategies help manage data that arrives after the expected time.
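A watermarking sketch with PySpark Structured Streaming; the built-in rate source stands in for a real event stream, and the 10-minute lateness threshold is an arbitrary choice:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import window, col

spark = SparkSession.builder.appName("late-data-demo").getOrCreate()

# The 'rate' source emits rows with a 'timestamp' column; a real job would use the event's own time.
events = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

# Accept events up to 10 minutes late; anything later is dropped from this aggregation
# (and could instead be routed to a separate table for reprocessing).
counts = (
    events
    .withWatermark("timestamp", "10 minutes")
    .groupBy(window(col("timestamp"), "5 minutes"))
    .count()
)

query = counts.writeStream.outputMode("update").format("console").start()
query.awaitTermination(30)  # run briefly for the demo
```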
What is the purpose of using encryption for sensitive data?
Encryption protects sensitive data from unauthorized access by encoding it.
AES-256 is a common encryption standard.
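A small illustration of AES-256 encryption using the third-party cryptography package; the key handling is simplified, and real systems would keep the key in a secrets manager:

```python
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM  # pip install cryptography

key = AESGCM.generate_key(bit_length=256)  # 256-bit key (AES-256)
aesgcm = AESGCM(key)

nonce = os.urandom(12)  # must be unique per message
ciphertext = aesgcm.encrypt(nonce, b"4111-1111-1111-1111", None)  # hypothetical sensitive value
plaintext = aesgcm.decrypt(nonce, ciphertext, None)
assert plaintext == b"4111-1111-1111-1111"
```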
What is the function of Apache Kafka in data engineering?
Apache Kafka is a distributed streaming platform used for building real-time data pipelines. It enables high-throughput, fault-tolerant messaging between systems.
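A minimal producer sketch with the kafka-python client (an assumed third-party library; the broker address and topic name are placeholders):

```python
import json
from kafka import KafkaProducer  # third-party: pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",                       # placeholder broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Each event is appended to the 'orders' topic; downstream consumers read it in real time.
producer.send("orders", {"order_id": 1, "amount": 99.5})
producer.flush()   # block until buffered messages are delivered
producer.close()
```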