Data Engineering Fundamentals - Data Warehouses vs Data Lakes Flashcards

1
Q

A centralised repository, optimised for analytics, where data from different sources is stored in a structured format.

a) Data warehouse
b) Data Lake

A

Data warehouse

2
Q

Designed for complex queries.
Data is cleaned, transformed and loaded.

Optimised for read-heavy operations.

a) Data warehouse
b) Data Lake

A

a) Data Warehouse

3
Q

What database offering would be commonly used for a Data Warehouse in AWS?

a) Redshift
b) NoSQL

A

a) Amazon Redshift

4
Q

A storage repository that holds vast amounts of raw data in its native format, including structured, semi-structured and unstructured data.

a) Data warehouse
b) Data Lake

A

b) Data Lake

5
Q

Can store large volumes of raw data without predefined schemas.

a) Data warehouse
b) Data Lake

A

b) Data Lake

6
Q

Often stores data as-is, with no predefined format; can be queried for data transformation and exploration purposes.

a) Data warehouse
b) Data Lake

A

b) Data Lake

7
Q

What AWS service would you typically use for Data Lake storage?

a) S3
b) RDS
c) Redshift

A

a) S3 is typically used for Data Lake storage.

You would then use AWS Glue to crawl the data and build a catalog, and Amazon Athena to use that catalog to figure out the data and how to query it.
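
The Glue-then-Athena flow can be loosely mimicked in plain Python: a catalog-free, toy sketch (not the AWS API; the bucket key, file contents, and `query_lake` helper are all made up for illustration) showing the idea of querying raw files in place.

```python
import csv
import io

# Toy stand-in for raw files sitting in an S3 data lake bucket
# (hypothetical bucket/key name, for illustration only).
lake = {
    "s3://my-lake/sales/2024.csv": "order_id,amount\n1,9.99\n2,24.50\n",
}

def query_lake(key, predicate):
    """Parse the raw file at read time and filter rows, loosely
    mimicking how Athena queries files in place rather than
    loading them into a warehouse first."""
    reader = csv.DictReader(io.StringIO(lake[key]))
    return [row for row in reader if predicate(row)]

big_orders = query_lake("s3://my-lake/sales/2024.csv",
                        lambda r: float(r["amount"]) > 10)
print(big_orders)  # rows come back as plain dicts; no table was created up front
```

In the real services, a Glue crawler would infer the column names and types into the Data Catalog, and Athena would use that catalog instead of re-parsing headers on every query.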

8
Q

Schema on Write (predefined schema before writing data)

ETL

a) Data warehouse
b) Data Lake

A

a) Data Warehouse

9
Q

Schema on Read (schema defined at the time of reading data)

ELT

a) Data warehouse
b) Data Lake

A

b) Data Lake
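
The schema-on-write vs schema-on-read distinction from these two cards can be sketched in a few lines of Python. This is an illustrative toy, not any AWS API; the `schema`, `write_warehouse`, and `read_lake` names are made up.

```python
# Predefined schema: field name -> expected/target type (illustrative).
schema = {"user_id": int, "email": str}

def write_warehouse(table, record):
    """Schema on write: validate against the predefined schema
    BEFORE storing, rejecting anything that does not conform."""
    for field, ftype in schema.items():
        if not isinstance(record.get(field), ftype):
            raise ValueError(f"{field} must be {ftype.__name__}")
    table.append(record)

def read_lake(raw_records):
    """Schema on read: store anything as-is; apply the schema
    (casting fields to their types) only when the data is read."""
    for rec in raw_records:
        yield {field: ftype(rec[field]) for field, ftype in schema.items()}

warehouse = []
write_warehouse(warehouse, {"user_id": 1, "email": "a@example.com"})  # accepted

# The lake happily holds a record with a string id and an extra field...
lake = [{"user_id": "2", "email": "b@example.com", "extra": "kept as-is"}]
parsed = list(read_lake(lake))  # ...types are only coerced at read time
```

ETL pairs naturally with schema on write (shape the data before loading), while ELT pairs with schema on read (load raw, transform when querying).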

10
Q

You would use __________ when data integration from different sources is important and BI and analytics are the primary use cases. You have structured data sources and require fast, complex queries.

a) Data warehouse
b) Data Lake

A

a) Data warehouse

11
Q

You would use ______________ when you have a mix of structured, semi-structured and unstructured data.

When you need a scalable and cost-effective solution to store massive amounts of data.

Future needs for data are uncertain and you want flexibility in storage and processing.

a) Data warehouse
b) Data Lake

A

b) Data Lake

12
Q

_____________ is a hybrid data architecture that combines the best of data lakes and data warehouses.

Supports both structured and unstructured data, allows for both schema on write and schema on read, and provides capabilities for both detailed analytics and ML.
Typically built on top of a cloud or distributed architecture.

a) Data Lake
b) Data warehouse lake
c) Data lakehouse

A

c) Data Lakehouse

13
Q

____________ would be the recommended AWS service for a Data Lakehouse set-up.

Consists of S3 and Redshift Spectrum.

a) AWS Lakehouse
b) AWS Lake Formation

A

b) AWS Lake Formation

A hybrid between a Data warehouse and a Data lake.

14
Q

__________ is where individual teams own "data products" within a given domain.

These data products serve various “use cases” around the organisation.

a) Data Lake
b) Data Mesh
c) Data Domain

A

b) Data Mesh

You would query the domains instead of the raw data.
Each domain would then have federated governance with central standards.

15
Q

________ pipelines: stands for Extract, Transform and Load.

The process used to move data from source systems into a data warehouse.

a) ELT
b) ETL

A

ETL

16
Q

E stands for ______________ in ETL.

Retrieve raw data from source systems. Ensure data integrity during the ______ phase.
Can be done in real-time or in batches, depending on requirements.

a) enormous
b) extract
c) enhance

A

b) extract

17
Q

T in ETL stands for __________

Where you convert data into a format suitable for the target data warehouse.

Can involve various operations such as:
- Data cleansing, Data enrichment, Format changes, Aggregations, Encoding or decoding data, handling missing values.

A

Transform

18
Q

L in ETL stands for __________

Move the transformed data into the target data warehouse or another data repository.

Can be done in batches or in a streaming manner.

Ensure that data maintains its integrity during the loading phase.

A

Load
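
The three phases described in cards 16-18 can be sketched as a minimal pipeline. This is a pure-Python toy, assuming made-up source data and helper names; a real pipeline would read from actual source systems and load into a warehouse such as Redshift.

```python
# Stand-ins for a source system (raw CSV-ish rows) and a target warehouse.
source_system = ["  Alice ,42", "BOB,17", "carol, 99 "]
data_warehouse = []

def extract(source):
    """Extract: retrieve raw records from the source system."""
    return list(source)

def transform(rows):
    """Transform: cleanse names, cast types, drop rows that fail
    a simple data-quality rule (adults only, for illustration)."""
    out = []
    for row in rows:
        name, age = row.split(",")
        age = int(age.strip())
        if age >= 18:
            out.append({"name": name.strip().title(), "age": age})
    return out

def load(records, target):
    """Load: move the transformed records into the target warehouse."""
    target.extend(records)

load(transform(extract(source_system)), data_warehouse)
print(data_warehouse)
# [{'name': 'Alice', 'age': 42}, {'name': 'Carol', 'age': 99}]
```

Swapping the order of the last two phases (load raw rows first, transform inside the target) would make this ELT rather than ETL.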

19
Q

What service would you use to automate ETL pipelines?

a) AWS RDS
b) AWS Glue

A

AWS Glue

You would then use an orchestration service, such as:

EventBridge, Amazon Managed Workflows for Apache Airflow (MWAA), Lambda, or Glue Workflows.