Data Engineering Deck Flashcards

You may prefer our related Brainscape-certified flashcards:
1
Q

What does ETL stand for, and what are the main steps involved?

A

ETL stands for Extract, Transform and Load. The steps are extracting data from sources, transforming it (cleaning, reformatting), and loading it into a target data warehouse.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

Why is ETL important in data warehousing?

A

ETL processes ensure that data is structured, cleansed, and prepared before being stored in a data warehouse, allowing for consistent, reliable, and high-quality data for analysis.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

What is an OLTP system, and how does it differ from an OLAP system?

A

**OLTP (Online Transaction Processing) **is optimized for handling daily transactions and data entry, while OLAP (Online Analytical Processing) is optimized for querying and reporting, especially for complex analytical queries in data warehouses.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

What is the difference between a star schema and a snowflake schema in data warehousing?

A

A star schema has a central fact table connected to dimension tables directly, whereas a **snowflake schema **normalizes the dimensions, creating additional tables and potentially reducing redundancy.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

Define “fact table” and “dimension table” in a data warehouse context.

A

A **fact table **contains quantitative data (metrics) for analysis, like sales figures.

Dimension tables store descriptive attributes related to facts, such as time, location, or customer details.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

What is Amazon Redshift, and how is it used in data warehousing?

A

Amazon Redshift is a **cloud-based data warehouse **service on AWS, optimized for large-scale data analytics and complex queries.

It allows for fast querying using parallel processing.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

Explain the concept of data warehousing.

A

Data warehousing involves** collecting and managing data from various sources** to support business intelligence activities. It allows for efficient querying, reporting, and analytics.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

Describe the main use cases for SQL in data warehousing.

A

SQL is used in data warehousing for data extraction, querying, transformation, and loading processes, as well as performing aggregations and joining tables for reporting and analysis.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

What is a data pipeline, and why is it important in ETL processes?

A

A data pipeline is a **series of data processing steps **that move data from a source to a destination, often through ETL processes. It ensures data is consistently and efficiently transferred for analysis.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

What are some best practices for building ETL pipelines?

A

Best practices include :
* data validation at each stage
* ensuring scalability
* handling errors gracefully
* documenting each step
* optimizing for performance to minimize bottlenecks.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

What is the primary purpose of cloud platforms like AWS in data warehousing?

A

Cloud platforms provide scalable infrastructure, allowing businesses to store and process large datasets, access data on-demand, and scale resources based on usage.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

Explain the purpose of S3 in AWS.

A

Amazon S3 (Simple Storage Service) is used for** storing and retrieving large volumes of data in the cloud**, serving as a common storage layer for data lakes and data warehouses.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

What are the benefits of using cloud-based data warehouses, such as Redshift, over on-premise data warehouses?

A

Cloud-based warehouses offer scalability, flexibility, cost efficiency (pay-as-you-go), and ease of maintenance, as well as advanced features like machine learning integration and automation.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

Why is data security important in cloud environments?

A

Data security protects sensitive information from unauthorized access, ensuring data privacy and compliance with regulations, which is critical in cloud environments due to remote access.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

What is data compliance, and why is it important?

A

Data compliance ensures that data handling practices align with laws and regulations (e.g., GDPR, HIPAA), protecting user privacy and reducing legal risks.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

Describe an ETL tool you have used or are familiar with.

A

AWS Glue is an ETL service on AWS that automates data discovery, transformation, and loading processes, and integrates well with other AWS services for building data pipelines.

17
Q

What is the role of statistics in data analytics?

A

Statistics help in making inferences, identifying trends, and validating hypotheses in data analysis, providing a foundation for understanding data patterns and relationships.

18
Q

Explain the difference between a primary key and a foreign key.

A

A primary key uniquely identifies each record in a table

while a** foreign key **creates a relationship between two tables, linking a record in one table to a record in another.

19
Q

Explain the concept of normalization in database management.

A

Normalization organizes a database to reduce redundancy and improve data integrity by **dividing large tables into smaller ones and establishing relationships between them.

20
Q

What is Amazon DynamoDB?

A

Amazon DynamoDB is a NoSQL database service provided by AWS, designed for fast, scalable data storage and retrieval, commonly used for applications that require low-latency access.

21
Q

What is an index in a database, and why is it useful?

A

An index is a database structure that improves the speed of data retrieval operations by creating a data structure that allows for faster search and sorting of records.

22
Q

Define “fact table” and “dimension table” in data warehousing.

A

A fact table stores quantitative data for analysis (e.g., sales amounts), while dimension tables store descriptive attributes (e.g., time, location) related to facts for contextual analysis.

23
Q
A