Data Engineering Fundamentals - Data Warehouses vs Data Lakes Flashcards

1
Q

A centralised repository, optimised for analytics, where data from different sources is stored in a structured format.

a) Data warehouse
b) Data Lake

A

Data warehouse

2
Q

Designed for complex queries.
Data is cleaned, transformed and loaded.

Optimised for read-heavy operations.

a) Data warehouse
b) Data Lake

A

a) Data Warehouse

3
Q

What database offering would be commonly used for a Data Warehouse in AWS?

a) Redshift
b) NoSQL

A

a) Amazon Redshift

4
Q

A storage repository that holds vast amounts of raw data in its native format, including structured, semi-structured and unstructured data.

a) Data warehouse
b) Data Lake

A

b) Data Lake

5
Q

Can store large volumes of raw data without predefined schemas.

a) Data warehouse
b) Data Lake

A

b) Data Lake

6
Q

Often stores data as-is, with no predefined format; can be queried for data transformation and exploration purposes.

a) Data warehouse
b) Data Lake

A

b) Data Lake

7
Q

What AWS service would you typically use for Data Lake storage?

a) S3
b) RDS
c) Redshift

A

a) S3 is typically used for Data Lake storage.

You would then use AWS Glue to crawl the data and build a catalog, and Amazon Athena to use that catalog to figure out the data and how to query it.
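
The Glue-then-Athena flow can be loosely mimicked in plain Python: a catalog-free, toy sketch (not the AWS API; the bucket key, file contents, and `query_lake` helper are all made up for illustration) showing the idea of querying raw files in place.

```python
import csv
import io

# Toy stand-in for raw files sitting in an S3 data lake bucket
# (hypothetical bucket/key name, for illustration only).
lake = {
    "s3://my-lake/sales/2024.csv": "order_id,amount\n1,9.99\n2,24.50\n",
}

def query_lake(key, predicate):
    """Parse the raw file at read time and filter rows, loosely
    mimicking how Athena queries files in place rather than
    loading them into a warehouse first."""
    reader = csv.DictReader(io.StringIO(lake[key]))
    return [row for row in reader if predicate(row)]

big_orders = query_lake("s3://my-lake/sales/2024.csv",
                        lambda r: float(r["amount"]) > 10)
print(big_orders)  # rows come back as plain dicts; no table was created up front
```

In the real services, a Glue crawler would infer the column names and types into the Data Catalog, and Athena would use that catalog instead of re-parsing headers on every query.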

8
Q

Schema on Write (predefined schema before writing data)

ETL

a) Data warehouse
b) Data Lake

A

a) Data Warehouse

9
Q

Schema on Read (schema defined at the time of reading data)

ELT

a) Data warehouse
b) Data Lake

A

b) Data Lake
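
The schema-on-write vs schema-on-read distinction from these two cards can be sketched in a few lines of Python. This is an illustrative toy, not any AWS API; the `schema`, `write_warehouse`, and `read_lake` names are made up.

```python
# Predefined schema: field name -> expected/target type (illustrative).
schema = {"user_id": int, "email": str}

def write_warehouse(table, record):
    """Schema on write: validate against the predefined schema
    BEFORE storing, rejecting anything that does not conform."""
    for field, ftype in schema.items():
        if not isinstance(record.get(field), ftype):
            raise ValueError(f"{field} must be {ftype.__name__}")
    table.append(record)

def read_lake(raw_records):
    """Schema on read: store anything as-is; apply the schema
    (casting fields to their types) only when the data is read."""
    for rec in raw_records:
        yield {field: ftype(rec[field]) for field, ftype in schema.items()}

warehouse = []
write_warehouse(warehouse, {"user_id": 1, "email": "a@example.com"})  # accepted

# The lake happily holds a record with a string id and an extra field...
lake = [{"user_id": "2", "email": "b@example.com", "extra": "kept as-is"}]
parsed = list(read_lake(lake))  # ...types are only coerced at read time
```

ETL pairs naturally with schema on write (shape the data before loading), while ELT pairs with schema on read (load raw, transform when querying).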

10
Q

You would use __________ when data integration from different sources is important and BI and analytics are the primary use cases. You have structured data sources and require fast, complex queries.

a) Data warehouse
b) Data Lake

A

a) Data warehouse

11
Q

You would use ______________ when you have a mix of structured, semi-structured and unstructured data.

When you need a scalable and cost-effective solution to store massive amounts of data.

Future needs for data are uncertain and you want flexibility in storage and processing.

a) Data warehouse
b) Data Lake

A

b) Data Lake

12
Q

_____________ is a hybrid data architecture that combines the best of data lakes and data warehouses.

Supports both structured and unstructured data, allows for both schema on write and schema on read, and provides capabilities for both detailed analytics and ML.
Typically built on top of a cloud or distributed architecture.

a) Data Lake
b) Data warehouse lake
c) Data lakehouse

A

c) Data Lakehouse

13
Q

____________ would be the recommended AWS service for a Data Lakehouse set-up.

Consists of S3 and Redshift Spectrum.

a) AWS Lakehouse
b) AWS Lake Formation

A

b) AWS Lake Formation

A hybrid between a Data warehouse and a Data lake.

14
Q

__________ is where individual teams own "data products" within a given domain.

These data products serve various “use cases” around the organisation.

a) Data Lake
b) Data Mesh
c) Data Domain

A

b) Data Mesh

You would query the domains instead of the raw data.
Each domain would then have federated governance with central standards.

15
Q

________ pipelines: stands for Extract, Transform and Load.

The process used to move data from source systems into a data warehouse.

a) ELT
b) ETL

A

ETL

16
Q

E stands for ______________ in ETL.

Retrieve raw data from source systems. Ensure data integrity during the ______ phase.
Can be done in real-time or in batches, depending on requirements.

a) enormous
b) extract
c) enhance

A

b) extract

17
Q

T in ETL stands for __________

Where you convert data into a format suitable for the target data warehouse.

Can involve various operations such as:
- Data cleansing, Data enrichment, Format changes, Aggregations, Encoding or decoding data, handling missing values.

A

Transform

18
Q

L in ETL stands for __________

Move the transformed data into the target data warehouse or another data repository.

Can be done in batches or in a streaming manner.

Ensure that data maintains its integrity during the loading phase.

A

Load
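
The three phases described in cards 16-18 can be sketched as a minimal pipeline. This is a pure-Python toy, assuming made-up source data and helper names; a real pipeline would read from actual source systems and load into a warehouse such as Redshift.

```python
# Stand-ins for a source system (raw CSV-ish rows) and a target warehouse.
source_system = ["  Alice ,42", "BOB,17", "carol, 99 "]
data_warehouse = []

def extract(source):
    """Extract: retrieve raw records from the source system."""
    return list(source)

def transform(rows):
    """Transform: cleanse names, cast types, drop rows that fail
    a simple data-quality rule (adults only, for illustration)."""
    out = []
    for row in rows:
        name, age = row.split(",")
        age = int(age.strip())
        if age >= 18:
            out.append({"name": name.strip().title(), "age": age})
    return out

def load(records, target):
    """Load: move the transformed records into the target warehouse."""
    target.extend(records)

load(transform(extract(source_system)), data_warehouse)
print(data_warehouse)
# [{'name': 'Alice', 'age': 42}, {'name': 'Carol', 'age': 99}]
```

Swapping the order of the last two phases (load raw rows first, transform inside the target) would make this ELT rather than ETL.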

19
Q

What service would you use to automate ETL pipelines?

a) AWS RDS
b) AWS Glue

A

AWS Glue

You would then use an orchestration service, such as:

EventBridge, Amazon Managed Workflows for Apache Airflow (MWAA), Lambda, or Glue Workflows.