Data Engineering Deck Flashcards
What does ETL stand for, and what are the main steps involved?
ETL stands for Extract, Transform and Load. The steps are extracting data from sources, transforming it (cleaning, reformatting), and loading it into a target data warehouse.
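As a concrete illustration, here is a minimal ETL sketch in Python using only the standard library; the file name (`orders.csv`), column names, and the SQLite target are hypothetical stand-ins for real sources and warehouses.

```python
import csv
import sqlite3

def extract(path):
    """Extract: read raw rows from a source CSV file."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    """Transform: clean (drop rows missing an amount) and reformat."""
    return [
        (row["order_id"], row["customer"].strip().title(), float(row["amount"]))
        for row in rows
        if row["amount"]
    ]

def load(records, conn):
    """Load: write the cleaned records into the target warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders (order_id TEXT, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", records)
    conn.commit()

load(transform(extract("orders.csv")), sqlite3.connect("warehouse.db"))
```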
Why is ETL important in data warehousing?
ETL processes ensure that data is structured, cleansed, and prepared before being stored in a data warehouse, allowing for consistent, reliable, and high-quality data for analysis.
What is an OLTP system, and how does it differ from an OLAP system?
**OLTP (Online Transaction Processing)** is optimized for handling daily transactions and data entry, while **OLAP (Online Analytical Processing)** is optimized for querying and reporting, especially for complex analytical queries in data warehouses.
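To make the workload difference concrete, here is a sketch of the two query shapes as SQLite-flavored SQL held in Python strings; the tables and columns are illustrative.

```python
# OLTP: a short, row-level write that touches one order at a time.
oltp_statement = """
INSERT INTO orders (order_id, customer_id, amount, created_at)
VALUES (1001, 42, 59.99, CURRENT_TIMESTAMP);
"""

# OLAP: a wide analytical read that scans and aggregates many rows.
olap_statement = """
SELECT c.region,
       strftime('%Y-%m', o.created_at) AS month,
       SUM(o.amount) AS revenue
FROM orders o JOIN customers c USING (customer_id)
GROUP BY c.region, month
ORDER BY revenue DESC;
"""
```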
What is the difference between a star schema and a snowflake schema in data warehousing?
A **star schema** has a central fact table connected directly to dimension tables, whereas a **snowflake schema** normalizes the dimensions into additional tables, reducing redundancy at the cost of extra joins.
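To make the contrast concrete, here is a sketch of the same product dimension in both styles, written as SQLite-flavored DDL in Python strings; all table and column names are illustrative.

```python
# Star schema: the dimension is denormalized; category is stored inline.
star_product_dim = """
CREATE TABLE dim_product (
    product_id   INTEGER PRIMARY KEY,
    product_name TEXT,
    category     TEXT  -- repeated for every product in the same category
);
"""

# Snowflake schema: the same dimension normalized into two tables.
snowflake_product_dims = """
CREATE TABLE dim_category (
    category_id   INTEGER PRIMARY KEY,
    category_name TEXT
);
CREATE TABLE dim_product (
    product_id   INTEGER PRIMARY KEY,
    product_name TEXT,
    category_id  INTEGER REFERENCES dim_category (category_id)
);
"""
```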
Define “fact table” and “dimension table” in a data warehouse context.
A **fact table** contains quantitative data (metrics) for analysis, like sales figures.
A **dimension table** stores descriptive attributes related to those facts, such as time, location, or customer details.
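Continuing the illustrative schema above, a sketch of a fact table whose foreign keys reference hypothetical dimension tables and whose remaining columns are the measures:

```python
fact_sales_ddl = """
CREATE TABLE fact_sales (
    date_id     INTEGER REFERENCES dim_date (date_id),         -- when
    product_id  INTEGER REFERENCES dim_product (product_id),   -- what
    customer_id INTEGER REFERENCES dim_customer (customer_id), -- who
    quantity    INTEGER,  -- measure
    revenue     REAL      -- measure
);
"""
```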
What is Amazon Redshift, and how is it used in data warehousing?
Amazon Redshift is a **cloud-based data warehouse** service on AWS, optimized for large-scale data analytics and complex queries.
It executes queries quickly by distributing work across nodes with massively parallel processing (MPP).
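One common way to run such a query programmatically is boto3's Redshift Data API; this is a minimal sketch, and the cluster identifier, database, user, and table name are all placeholders.

```python
import boto3

# The Redshift Data API runs SQL without a persistent connection;
# AWS credentials are assumed to be configured in the environment.
client = boto3.client("redshift-data", region_name="us-east-1")

response = client.execute_statement(
    ClusterIdentifier="my-cluster",   # placeholder cluster
    Database="analytics",             # placeholder database
    DbUser="analyst",                 # placeholder DB user
    Sql="SELECT region, SUM(revenue) FROM fact_sales GROUP BY region;",
)
print(response["Id"])  # statement id, used to poll for results
```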
Explain the concept of data warehousing.
Data warehousing involves **collecting and managing data from various sources** to support business intelligence activities. It allows for efficient querying, reporting, and analytics.
Describe the main use cases for SQL in data warehousing.
SQL is used in data warehousing for data extraction, querying, transformation, and loading processes, as well as performing aggregations and joining tables for reporting and analysis.
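A runnable sketch of those uses with Python's built-in sqlite3 module standing in for the warehouse engine; the schema and rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE fact_sales (customer_id INTEGER, revenue REAL);
    INSERT INTO dim_customer VALUES (1, 'EMEA'), (2, 'APAC');
    INSERT INTO fact_sales VALUES (1, 100.0), (1, 50.0), (2, 75.0);
""")

# Join the fact table to a dimension and aggregate for a report.
for region, total in conn.execute("""
    SELECT c.region, SUM(s.revenue) AS total_revenue
    FROM fact_sales s JOIN dim_customer c USING (customer_id)
    GROUP BY c.region
    ORDER BY total_revenue DESC
"""):
    print(region, total)
```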
What is a data pipeline, and why is it important in ETL processes?
A data pipeline is a **series of data processing steps** that move data from a source to a destination, often through ETL processes. It ensures data is consistently and efficiently transferred for analysis.
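A minimal sketch of this idea in Python, modeling a pipeline as a chain of steps where each (hypothetical) step lazily consumes and yields records:

```python
from typing import Callable, Iterable

Step = Callable[[Iterable[dict]], Iterable[dict]]

def pipeline(records: Iterable[dict], steps: list[Step]) -> Iterable[dict]:
    """Chain steps so records flow source -> step -> ... -> destination."""
    for step in steps:
        records = step(records)
    return records

def drop_missing_amounts(rows):
    return (r for r in rows if r.get("amount") is not None)

def amounts_to_cents(rows):
    return ({**r, "amount": int(r["amount"] * 100)} for r in rows)

raw = [{"amount": 1.5}, {"amount": None}, {"amount": 2.0}]
print(list(pipeline(raw, [drop_missing_amounts, amounts_to_cents])))
# -> [{'amount': 150}, {'amount': 200}]
```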
What are some best practices for building ETL pipelines?
Best practices include (see the sketch after this list):
* validating data at each stage
* designing for scalability
* handling errors gracefully
* documenting each step
* optimizing for performance to minimize bottlenecks
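A sketch of the first three practices (validation, graceful error handling, and logging for observability) inside one pipeline stage; the field name and validation rule are hypothetical.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("etl")

def validate(row: dict) -> dict:
    """Validate at the stage boundary: reject malformed records early."""
    if row.get("amount") is None or row["amount"] < 0:
        raise ValueError(f"bad amount in row: {row}")
    return row

def run_stage(rows: list[dict]) -> list[dict]:
    good, quarantined = [], []
    for row in rows:
        try:
            good.append(validate(row))
        except ValueError as exc:
            quarantined.append(row)  # quarantine instead of crashing the run
            log.warning("skipping row: %s", exc)
    log.info("stage done: %d ok, %d quarantined", len(good), len(quarantined))
    return good

print(run_stage([{"amount": 10}, {"amount": -1}, {"amount": None}]))
```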
What is the primary purpose of cloud platforms like AWS in data warehousing?
Cloud platforms provide scalable infrastructure, allowing businesses to store and process large datasets, access data on-demand, and scale resources based on usage.
Explain the purpose of S3 in AWS.
Amazon S3 (Simple Storage Service) is used for **storing and retrieving large volumes of data in the cloud**, serving as a common storage layer for data lakes and data warehouses.
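A minimal sketch using boto3 (the AWS SDK for Python); the bucket and key names are placeholders, and credentials are assumed to be configured in the environment.

```python
import boto3

s3 = boto3.client("s3")

# Upload a local extract into a data-lake bucket (placeholder names).
s3.upload_file("orders.csv", "my-data-lake", "raw/orders/2024-01-01.csv")

# Read it back, e.g. as the extract step of a downstream job.
obj = s3.get_object(Bucket="my-data-lake", Key="raw/orders/2024-01-01.csv")
print(obj["Body"].read()[:100])
```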
What are the benefits of using cloud-based data warehouses, such as Redshift, over on-premise data warehouses?
Cloud-based warehouses offer scalability, flexibility, cost efficiency (pay-as-you-go), and ease of maintenance, as well as advanced features like machine learning integration and automation.
Why is data security important in cloud environments?
Data security protects sensitive information from unauthorized access, ensuring data privacy and regulatory compliance; this is especially critical in cloud environments, where data is reachable over the network.
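As one concrete control, writes to S3 can request server-side encryption; a minimal sketch with boto3, where the bucket and key are placeholders and a bucket policy may already enforce encryption by default.

```python
import boto3

s3 = boto3.client("s3")

# Ask S3 to encrypt the object at rest with a KMS-managed key.
s3.put_object(
    Bucket="my-data-lake",       # placeholder bucket
    Key="secure/customers.csv",  # placeholder key
    Body=b"customer_id,email\n",
    ServerSideEncryption="aws:kms",
)
```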
What is data compliance, and why is it important?
Data compliance ensures that data handling practices align with laws and regulations (e.g., GDPR, HIPAA), protecting user privacy and reducing legal risks.