Data Engineering Fundamentals - Data Warehouse vs Data Lake Flashcards
Centralised repository, optimised for analytics, where data from different sources is stored in a structured format.
a) Data warehouse
b) Data Lake
a) Data warehouse
Designed for complex queries.
Data is cleaned, transformed and loaded.
Optimised for read heavy operations.
a) Data warehouse
b) Data Lake
a) Data Warehouse
What database offering would be commonly used for a Data Warehouse in AWS?
a) Redshift
b) NoSQL
a) Amazon Redshift
A storage repository that holds vast amounts of raw data in its native format, including structured, semi-structured and unstructured data.
a) Data warehouse
b) Data Lake
b) Data Lake
Can store large volumes of raw data without predefined schemas.
a) Data warehouse
b) Data Lake
b) Data Lake
Often stores data as is, with no predefined format; can be queried for data transformation and exploration purposes.
a) Data warehouse
b) Data Lake
b) Data Lake
What AWS service would you typically use for Data Lake storage?
a) S3
b) RDS
c) Redshift
a) S3. Amazon S3 is typically used as the storage layer for a data lake.
You would then use AWS Glue to extract the structure of the data into a catalog, and Athena to use that catalog to work out what the data looks like and how to query it.
Schema on Write (predefined schema before writing data)
ETL
a) Data warehouse
b) Data Lake
a) Data Warehouse
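A minimal sketch of schema on write, assuming an in-memory SQLite table stands in for the warehouse (the sales table, its columns and the load_row helper are hypothetical). The schema is declared before any data is written, so a row that violates it is rejected at load time.

```python
import sqlite3

# In-memory SQLite stands in for a warehouse table (illustrative only).
conn = sqlite3.connect(":memory:")

# The schema is fixed before any data is written.
conn.execute(
    """
    CREATE TABLE sales (
        order_id INTEGER PRIMARY KEY,
        region   TEXT NOT NULL,
        amount   REAL NOT NULL CHECK (amount >= 0)
    )
    """
)


def load_row(row):
    """Try to write one row; return True if it satisfied the schema."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO sales (order_id, region, amount) VALUES (?, ?, ?)",
                row,
            )
        return True
    except sqlite3.IntegrityError:
        # The row violated a constraint, so it never reaches the table.
        return False


accepted = load_row((1, "EU", 120.0))
rejected = load_row((2, None, 50.0))  # missing region breaks NOT NULL
row_count = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
```

Only the valid row ends up in the table; the bad row has to be fixed upstream (the "T" in ETL) before it can be loaded.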
Schema on Read (schema defined at the time of reading data)
ELT
a) Data warehouse
b) Data Lake
b) Data Lake
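A minimal sketch of schema on read, assuming raw JSON lines stand in for files in a data lake (the records and the read_with_schema helper are hypothetical). The records are stored untouched, with inconsistent fields and types, and a schema is applied only when they are read.

```python
import json

# Raw records are stored as-is, one JSON document per line (illustrative data).
raw_lines = [
    '{"order_id": 1, "region": "EU", "amount": "120.0"}',
    '{"order_id": 2, "amount": 50}',
    '{"order_id": 3, "region": "US", "amount": 75.5, "coupon": "SPRING"}',
]


def read_with_schema(lines):
    """Apply a schema while reading: pick fields, cast types, fill defaults."""
    for line in lines:
        record = json.loads(line)
        yield {
            "order_id": int(record["order_id"]),
            "region": record.get("region", "UNKNOWN"),
            "amount": float(record["amount"]),
        }


rows = list(read_with_schema(raw_lines))
total_amount = sum(row["amount"] for row in rows)
```

Nothing was rejected at write time; a different reader could apply a different schema to the same raw lines, for example one that keeps the coupon field.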
You would use __________ when data integration from different sources is important and BI and analytics are the primary use cases. You have structured data sources and require fast, complex queries.
a) Data warehouse
b) Data Lake
a) Data warehouse
You would use ______________ when you have a mix of structured, semi-structured and unstructured data.
When you need a scalable and cost-effective solution to store massive amounts of data.
Future needs for data are uncertain and you want flexibility in storage and processing.
a) Data warehouse
b) Data Lake
b) Data Lake
_____________ is a hybrid data architecture that combines the best of data lakes and data warehouses.
Supports both structured and unstructured data, allows for schema on write and schema on read, and provides capabilities for both detailed analytics and ML.
Typically built on top of cloud or distributed architecture.
a) Data Lake
b) Data warehouse lake
c) Data lakehouse
c) Data lakehouse
____________ would be the recommended AWS service for a Data Lakehouse set-up.
Consists of S3 and Redshift Spectrum.
a) AWS Lakehouse
b) AWS Lake Formation
b) AWS Lake Formation
Hybrid between a data warehouse and a data lake.
__________ is where individual teams own “data products” within a given domain.
These data products serve various “use cases” around the organisation.
a) Data Lake
b) Data Mesh
c) Data Domain
b) Data Mesh
You would query the domains' data products instead of the raw data.
Each domain would then have federated governance with central standards.
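One way to picture this, as a rough sketch only: every name here (DataProduct, Catalog, REQUIRED_METADATA, the sales domain) is hypothetical and not taken from any real data mesh tool. The domain team owns the product and decides how it is served, while a central catalog only enforces shared standards.

```python
from dataclasses import dataclass
from typing import Callable

# Central standard every domain must meet (hypothetical rule for illustration).
REQUIRED_METADATA = {"owner", "description"}


@dataclass
class DataProduct:
    domain: str
    name: str
    metadata: dict
    read: Callable[[], list]  # the domain team decides how data is served


class Catalog:
    """Central registry: enforces shared standards, does not own the data."""

    def __init__(self):
        self._products = {}

    def register(self, product):
        missing = REQUIRED_METADATA - set(product.metadata)
        if missing:
            raise ValueError(f"missing metadata: {sorted(missing)}")
        self._products[(product.domain, product.name)] = product

    def get(self, domain, name):
        return self._products[(domain, name)]


catalog = Catalog()
catalog.register(
    DataProduct(
        domain="sales",
        name="daily_orders",
        metadata={"owner": "sales-team", "description": "Orders per day"},
        read=lambda: [{"day": "2024-01-01", "orders": 42}],
    )
)

# Consumers ask the domain's data product rather than reading raw storage.
orders = catalog.get("sales", "daily_orders").read()
```

A product registered without the required metadata is refused by the catalog, which is the "central standards" half of federated governance.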
________ pipelines; stands for extract, transform and load.
Process used to move data from source systems into a data warehouse.
a) ELT
b) ETL
b) ETL
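The three steps can be sketched in a few lines, assuming a small CSV string as the source system and in-memory SQLite as the warehouse (the data, table and function names are all hypothetical). The point is the order: data is cleaned before it is loaded.

```python
import csv
import io
import sqlite3

# Hypothetical source extract; in practice this comes from a source system.
SOURCE_CSV = """order_id,region,amount
1,eu,120.00
2,us,
3,US,75.50
"""


def extract(text):
    """Extract: read raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows with no amount, normalise region, cast types."""
    cleaned = []
    for row in rows:
        if not row["amount"]:
            continue
        cleaned.append(
            (int(row["order_id"]), row["region"].upper(), float(row["amount"]))
        )
    return cleaned


def load(conn, rows):
    """Load: write the cleaned rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales "
        "(order_id INTEGER, region TEXT, amount REAL)"
    )
    with conn:
        conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)


conn = sqlite3.connect(":memory:")
load(conn, transform(extract(SOURCE_CSV)))
totals = dict(conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region"))
```

In an ELT pipeline the raw rows would be loaded first and the transform step would run afterwards, inside the target system.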