Data Engineering Basics Flashcards
What are the three types of data?
Structured, unstructured, and semi-structured
What is the definition of structured data?
Data that is organized according to a defined schema. It has a consistent structure of rows and columns and is typically found in relational databases.
What is the definition of unstructured data?
Data that does not have a predefined structure. Examples include videos, audio files, images, emails, and word processing documents.
What is the definition of semi-structured data?
Data that has some structure in the form of tags, hierarchies, or other patterns, but no rigid schema. XML and JSON are good examples.
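The difference between semi-structured and structured data can be illustrated with a minimal Python sketch. The records and field names below are hypothetical: both lines are valid JSON, but they do not share a fixed set of fields, so fitting them into rows and columns means filling the gaps.

```python
import json

# Hypothetical records: each is valid JSON, but the fields differ per record.
raw_records = [
    '{"id": 1, "name": "Ada", "tags": ["admin", "dev"]}',
    '{"id": 2, "name": "Grace", "address": {"city": "Arlington"}}',
]

records = [json.loads(line) for line in raw_records]

# Flatten to fixed columns (id, name, city), using None where a field is absent.
# This is roughly what loading into a structured table would require.
rows = [
    (r["id"], r["name"], r.get("address", {}).get("city"))
    for r in records
]

print(rows)
```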
What is the definition of volume in data engineering terms?
It refers to the amount or size of the data, measured in units such as megabytes (MB), gigabytes (GB), or petabytes (PB).
What is the definition of velocity in data engineering terms?
It refers to the speed at which new data is generated, collected, and processed.
What is the definition of variety in data engineering terms?
It refers to the different types, structures, and sources of data, such as structured, semi-structured, and unstructured data.
What is the definition of a data warehouse?
It is a centralized repository optimized for analysis where data from different sources is stored in a structured format.
What are some characteristics of a data warehouse?
Designed for complex queries
Loaded via an ETL process
Optimized for read-heavy operations
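As a rough illustration of the kind of read-heavy analytical query a warehouse is built for, the sketch below uses Python's built-in sqlite3 module as a stand-in. SQLite is not a data warehouse, and the table and column names are hypothetical; the point is only the shape of the query, a join plus an aggregation over structured tables.

```python
import sqlite3

# In-memory SQLite database standing in for a warehouse (an assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE fact_sales (sale_id INTEGER PRIMARY KEY, product_id INTEGER, amount REAL);
INSERT INTO dim_product VALUES (1, 'books'), (2, 'games');
INSERT INTO fact_sales VALUES (1, 1, 10.0), (2, 1, 15.0), (3, 2, 40.0);
""")

# Analytical query: join the fact table to a dimension table and aggregate.
totals = conn.execute("""
SELECT p.category, SUM(s.amount) AS revenue
FROM fact_sales AS s
JOIN dim_product AS p ON p.product_id = s.product_id
GROUP BY p.category
ORDER BY p.category
""").fetchall()
conn.close()

print(totals)
```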
What is the definition of a data lake?
A storage repository that holds vast amounts of raw data in its native format, including structured, semi-structured, and unstructured data. Examples include Amazon S3 and HDFS.
What are some characteristics of a data lake?
No predefined schema
Data is loaded as-is, not preprocessed
Supports batch, real-time, and streaming processing
Can be queried for data transformation or exploration
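The "no predefined schema" and "loaded as-is" characteristics are often described as schema-on-read. A minimal sketch, assuming a temporary local directory as a stand-in for lake storage such as S3 or HDFS, with hypothetical file names and fields:

```python
import csv
import json
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as lake_dir:
    lake = Path(lake_dir)

    # Load step: files land in their native format, with no preprocessing.
    (lake / "orders.csv").write_text("order_id,amount\n1,10.50\n2,4.25\n")
    (lake / "clicks.json").write_text('{"user": "u1", "page": "/home"}\n')

    # Read step: column names and types are applied only when the data is queried.
    with open(lake / "orders.csv", newline="") as f:
        orders = [
            {"order_id": int(row["order_id"]), "amount": float(row["amount"])}
            for row in csv.DictReader(f)
        ]
    clicks = [
        json.loads(line)
        for line in (lake / "clicks.json").read_text().splitlines()
    ]

print(orders)
print(clicks)
```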
What is the difference between ELT and ETL?
ETL is used with data warehouses. You extract the data, transform it, and then load it.
ELT is used with data lakes. You extract the data, load it in raw form, and then transform it as needed.
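The ordering difference can be sketched in Python with the built-in sqlite3 module. The in-memory databases stand in for a warehouse and a lake, and the source rows, table names, and the `clean` helper are hypothetical. In the ETL path the rows are cleaned before loading; in the ELT path the raw text is loaded first and transformed later with SQL inside the target store.

```python
import sqlite3

# Hypothetical raw extract: every value arrives as text.
source_rows = [("1", " Alice ", "10.5"), ("2", "bob", "4")]

def clean(row):
    """Transform step: cast types and trim whitespace."""
    order_id, name, amount = row
    return int(order_id), name.strip(), float(amount)

# ETL: transform in the pipeline, then load the finished rows.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE orders (order_id INTEGER, customer TEXT, amount REAL)")
warehouse.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)", [clean(r) for r in source_rows]
)
etl_result = warehouse.execute("SELECT * FROM orders ORDER BY order_id").fetchall()
warehouse.close()

# ELT: load the raw text first, then transform inside the target store.
lake = sqlite3.connect(":memory:")
lake.execute("CREATE TABLE raw_orders (order_id TEXT, customer TEXT, amount TEXT)")
lake.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", source_rows)
elt_result = lake.execute("""
    SELECT CAST(order_id AS INTEGER), TRIM(customer), CAST(amount AS REAL)
    FROM raw_orders
    ORDER BY 1
""").fetchall()
lake.close()

print(etl_result)
print(elt_result)
```

Both paths end with the same cleaned rows; what differs is where and when the transformation runs.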
What is the downside of a data warehouse?
It is less agile. Because the schema is defined up front, new requirements can force changes to both the schema and the data already loaded.
What is traditionally more cost-effective, a data lake or data warehouse?
A data lake, although its storage costs can grow to exceed those of a data warehouse.
What is a data lakehouse?
A hybrid of a data warehouse and a data lake that combines the lake's flexible storage with warehouse features such as ACID transactions. An example is AWS Lake Formation.