Data Engineering Deck Flashcards
What does ETL stand for, and what are the main steps involved?
ETL stands for Extract, Transform and Load. The steps are extracting data from sources, transforming it (cleaning, reformatting), and loading it into a target data warehouse.
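As a concrete illustration, here is a minimal ETL sketch in Python using only the standard library; the file name (`orders.csv`), column names, and the SQLite target are hypothetical stand-ins for real sources and warehouses.

```python
import csv
import sqlite3

def extract(path):
    """Extract: read raw rows from a source CSV file."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    """Transform: clean (drop rows missing an amount) and reformat."""
    return [
        (row["order_id"], row["customer"].strip().title(), float(row["amount"]))
        for row in rows
        if row["amount"]
    ]

def load(records, conn):
    """Load: write the cleaned records into the target warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders (order_id TEXT, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", records)
    conn.commit()

load(transform(extract("orders.csv")), sqlite3.connect("warehouse.db"))
```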
Why is ETL important in data warehousing?
ETL processes ensure that data is structured, cleansed, and prepared before being stored in a data warehouse, allowing for consistent, reliable, and high-quality data for analysis.
What is an OLTP system, and how does it differ from an OLAP system?
**OLTP (Online Transaction Processing)** is optimized for handling daily transactions and data entry, while **OLAP (Online Analytical Processing)** is optimized for querying and reporting, especially for complex analytical queries in data warehouses.
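To make the workload difference concrete, here is a sketch of the two query shapes as SQLite-flavored SQL held in Python strings; the tables and columns are illustrative.

```python
# OLTP: a short, row-level write that touches one order at a time.
oltp_statement = """
INSERT INTO orders (order_id, customer_id, amount, created_at)
VALUES (1001, 42, 59.99, CURRENT_TIMESTAMP);
"""

# OLAP: a wide analytical read that scans and aggregates many rows.
olap_statement = """
SELECT c.region,
       strftime('%Y-%m', o.created_at) AS month,
       SUM(o.amount) AS revenue
FROM orders o JOIN customers c USING (customer_id)
GROUP BY c.region, month
ORDER BY revenue DESC;
"""
```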
What is the difference between a star schema and a snowflake schema in data warehousing?
A **star schema** has a central fact table connected directly to dimension tables, whereas a **snowflake schema** normalizes the dimensions into additional tables, reducing redundancy at the cost of extra joins.
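To make the contrast concrete, here is a sketch of the same product dimension in both styles, written as SQLite-flavored DDL in Python strings; all table and column names are illustrative.

```python
# Star schema: the dimension is denormalized; category is stored inline.
star_product_dim = """
CREATE TABLE dim_product (
    product_id   INTEGER PRIMARY KEY,
    product_name TEXT,
    category     TEXT  -- repeated for every product in the same category
);
"""

# Snowflake schema: the same dimension normalized into two tables.
snowflake_product_dims = """
CREATE TABLE dim_category (
    category_id   INTEGER PRIMARY KEY,
    category_name TEXT
);
CREATE TABLE dim_product (
    product_id   INTEGER PRIMARY KEY,
    product_name TEXT,
    category_id  INTEGER REFERENCES dim_category (category_id)
);
"""
```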
Define “fact table” and “dimension table” in a data warehouse context.
A **fact table** contains quantitative data (metrics) for analysis, like sales figures.
A **dimension table** stores descriptive attributes related to those facts, such as time, location, or customer details.
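Continuing the illustrative schema above, a sketch of a fact table whose foreign keys reference hypothetical dimension tables and whose remaining columns are the measures:

```python
fact_sales_ddl = """
CREATE TABLE fact_sales (
    date_id     INTEGER REFERENCES dim_date (date_id),         -- when
    product_id  INTEGER REFERENCES dim_product (product_id),   -- what
    customer_id INTEGER REFERENCES dim_customer (customer_id), -- who
    quantity    INTEGER,  -- measure
    revenue     REAL      -- measure
);
"""
```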
What is Amazon Redshift, and how is it used in data warehousing?
Amazon Redshift is a **cloud-based data warehouse** service on AWS, optimized for large-scale data analytics and complex queries.
It executes queries quickly by distributing work across nodes with massively parallel processing (MPP).
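One common way to run such a query programmatically is boto3's Redshift Data API; this is a minimal sketch, and the cluster identifier, database, user, and table name are all placeholders.

```python
import boto3

# The Redshift Data API runs SQL without a persistent connection;
# AWS credentials are assumed to be configured in the environment.
client = boto3.client("redshift-data", region_name="us-east-1")

response = client.execute_statement(
    ClusterIdentifier="my-cluster",   # placeholder cluster
    Database="analytics",             # placeholder database
    DbUser="analyst",                 # placeholder DB user
    Sql="SELECT region, SUM(revenue) FROM fact_sales GROUP BY region;",
)
print(response["Id"])  # statement id, used to poll for results
```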
Explain the concept of data warehousing.
Data warehousing involves **collecting and managing data from various sources** to support business intelligence activities. It allows for efficient querying, reporting, and analytics.
Describe the main use cases for SQL in data warehousing.
SQL is used in data warehousing for data extraction, querying, transformation, and loading processes, as well as performing aggregations and joining tables for reporting and analysis.
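A runnable sketch of those uses with Python's built-in sqlite3 module standing in for the warehouse engine; the schema and rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE fact_sales (customer_id INTEGER, revenue REAL);
    INSERT INTO dim_customer VALUES (1, 'EMEA'), (2, 'APAC');
    INSERT INTO fact_sales VALUES (1, 100.0), (1, 50.0), (2, 75.0);
""")

# Join the fact table to a dimension and aggregate for a report.
for region, total in conn.execute("""
    SELECT c.region, SUM(s.revenue) AS total_revenue
    FROM fact_sales s JOIN dim_customer c USING (customer_id)
    GROUP BY c.region
    ORDER BY total_revenue DESC
"""):
    print(region, total)
```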
What is a data pipeline, and why is it important in ETL processes?
A data pipeline is a **series of data processing steps** that move data from a source to a destination, often through ETL processes. It ensures data is consistently and efficiently transferred for analysis.
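A minimal sketch of this idea in Python, modeling a pipeline as a chain of steps where each (hypothetical) step lazily consumes and yields records:

```python
from typing import Callable, Iterable

Step = Callable[[Iterable[dict]], Iterable[dict]]

def pipeline(records: Iterable[dict], steps: list[Step]) -> Iterable[dict]:
    """Chain steps so records flow source -> step -> ... -> destination."""
    for step in steps:
        records = step(records)
    return records

def drop_missing_amounts(rows):
    return (r for r in rows if r.get("amount") is not None)

def amounts_to_cents(rows):
    return ({**r, "amount": int(r["amount"] * 100)} for r in rows)

raw = [{"amount": 1.5}, {"amount": None}, {"amount": 2.0}]
print(list(pipeline(raw, [drop_missing_amounts, amounts_to_cents])))
# -> [{'amount': 150}, {'amount': 200}]
```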
What are some best practices for building ETL pipelines?
Best practices include (see the sketch after this list):
* validating data at each stage
* designing for scalability
* handling errors gracefully
* documenting each step
* optimizing for performance to minimize bottlenecks
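A sketch of the first three practices (validation, graceful error handling, and logging for observability) inside one pipeline stage; the field name and validation rule are hypothetical.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("etl")

def validate(row: dict) -> dict:
    """Validate at the stage boundary: reject malformed records early."""
    if row.get("amount") is None or row["amount"] < 0:
        raise ValueError(f"bad amount in row: {row}")
    return row

def run_stage(rows: list[dict]) -> list[dict]:
    good, quarantined = [], []
    for row in rows:
        try:
            good.append(validate(row))
        except ValueError as exc:
            quarantined.append(row)  # quarantine instead of crashing the run
            log.warning("skipping row: %s", exc)
    log.info("stage done: %d ok, %d quarantined", len(good), len(quarantined))
    return good

print(run_stage([{"amount": 10}, {"amount": -1}, {"amount": None}]))
```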
What is the primary purpose of cloud platforms like AWS in data warehousing?
Cloud platforms provide scalable infrastructure, allowing businesses to store and process large datasets, access data on-demand, and scale resources based on usage.
Explain the purpose of S3 in AWS.
Amazon S3 (Simple Storage Service) is used for **storing and retrieving large volumes of data in the cloud**, serving as a common storage layer for data lakes and data warehouses.
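A minimal sketch using boto3 (the AWS SDK for Python); the bucket and key names are placeholders, and credentials are assumed to be configured in the environment.

```python
import boto3

s3 = boto3.client("s3")

# Upload a local extract into a data-lake bucket (placeholder names).
s3.upload_file("orders.csv", "my-data-lake", "raw/orders/2024-01-01.csv")

# Read it back, e.g. as the extract step of a downstream job.
obj = s3.get_object(Bucket="my-data-lake", Key="raw/orders/2024-01-01.csv")
print(obj["Body"].read()[:100])
```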
What are the benefits of using cloud-based data warehouses, such as Redshift, over on-premise data warehouses?
Cloud-based warehouses offer scalability, flexibility, cost efficiency (pay-as-you-go), and ease of maintenance, as well as advanced features like machine learning integration and automation.
Why is data security important in cloud environments?
Data security protects sensitive information from unauthorized access, ensuring data privacy and regulatory compliance; this is especially critical in cloud environments, where data is reachable over the network.
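As one concrete control, writes to S3 can request server-side encryption; a minimal sketch with boto3, where the bucket and key are placeholders and a bucket policy may already enforce encryption by default.

```python
import boto3

s3 = boto3.client("s3")

# Ask S3 to encrypt the object at rest with a KMS-managed key.
s3.put_object(
    Bucket="my-data-lake",       # placeholder bucket
    Key="secure/customers.csv",  # placeholder key
    Body=b"customer_id,email\n",
    ServerSideEncryption="aws:kms",
)
```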
What is data compliance, and why is it important?
Data compliance ensures that data handling practices align with laws and regulations (e.g., GDPR, HIPAA), protecting user privacy and reducing legal risks.