Data Engineering Fundamentals Flashcards
Data Engineering Fundamentals
What are the three main types of data?
Structured, Unstructured, and Semi-structured
Give an example of structured data.
Database table, CSV file, Excel spreadsheet
What is a key characteristic of unstructured data?
It doesn’t have a predefined schema or structure.
Give an example of unstructured data.
Image, audio file, free-text body of an email
What is semi-structured data?
Data with some organization, such as tags or hierarchies, but without the rigid schema of structured data (for example, JSON or XML).
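To make the structured vs. semi-structured distinction concrete, here is a minimal sketch using only the Python standard library. The sample records and field names are invented for illustration. The CSV rows all share one fixed set of columns, while the JSON records carry their own keys and may nest or omit fields.

```python
import csv
import io
import json

# Structured: every row has the same columns, fixed in advance by the header.
# (Sample data is hypothetical.)
csv_text = "id,name,city\n1,Ada,London\n2,Lin,Taipei\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: keys act as tags; records may nest values or omit fields.
json_text = """
[
  {"id": 1, "name": "Ada", "tags": ["admin"]},
  {"id": 2, "name": "Lin", "address": {"city": "Taipei"}}
]
"""
records = json.loads(json_text)

# All CSV rows share exactly one column layout...
csv_layouts = {tuple(row.keys()) for row in rows}
# ...while each JSON record can have a different set of keys.
json_layouts = [sorted(record.keys()) for record in records]

print(csv_layouts)
print(json_layouts)
```

An image or audio file, by contrast, has no field names at all to inspect, which is what makes it unstructured.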
Properties of Data
What are the three Vs of Big Data?
Volume, Velocity, Variety
What does “Volume” refer to in the context of data?
The amount or size of data.
What does “Velocity” refer to in the context of data?
The speed at which data is generated and processed.
What does “Variety” refer to in the context of data?
The different types and sources of data.
Data Warehouses vs. Data Lakes
What is a data warehouse optimized for?
Complex queries and analysis of structured data.
What is a key characteristic of a data lake?
It can store vast amounts of raw data in its native format.
What is the schema approach for a data warehouse?
Schema-on-write (schema is defined before data is loaded)
What is the schema approach for a data lake?
Schema-on-read (schema is applied when data is read, not when it is stored)
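The two schema approaches can be sketched as a pair of toy in-memory stores. This is an illustration under stated assumptions, not how any particular warehouse or lake product works: `SCHEMA`, `validate`, `Warehouse`, and `Lake` are hypothetical names invented here. The warehouse rejects bad records at load time, while the lake accepts anything and applies the schema only when the data is read.

```python
import json

# Hypothetical schema: field name -> type used to coerce the value.
SCHEMA = {"id": int, "amount": float}


def validate(record, schema=SCHEMA):
    """Return the record coerced to the schema, or raise ValueError."""
    missing = set(schema) - set(record)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return {field: cast(record[field]) for field, cast in schema.items()}


class Warehouse:
    """Schema-on-write: records are validated before they are stored."""

    def __init__(self):
        self.rows = []

    def load(self, record):
        self.rows.append(validate(record))  # raises if the record does not fit


class Lake:
    """Schema-on-read: raw text is stored as-is; the schema is applied on read."""

    def __init__(self):
        self.blobs = []

    def load(self, raw):
        self.blobs.append(raw)  # no checks at write time

    def read(self):
        rows = []
        for raw in self.blobs:
            try:
                rows.append(validate(json.loads(raw)))
            except (ValueError, TypeError):
                continue  # skip blobs that do not fit the schema
        return rows


warehouse = Warehouse()
warehouse.load({"id": "1", "amount": "9.5"})

lake = Lake()
lake.load('{"id": 3, "amount": "2.5"}')
lake.load("not json at all")
lake.load('{"id": 4}')
print(warehouse.rows, lake.read(), len(lake.blobs))
```

The trade-off shown here is that the warehouse pays the validation cost up front and never holds bad rows, whereas the lake keeps every raw blob and defers the decision about what counts as valid to the reader.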
Data Lakehouse and Data Mesh
What is a data lakehouse?
A hybrid architecture combining features of data lakes and data warehouses.
What is a key concept of a data mesh?
Domain-based data management: each business domain owns and manages its own data rather than handing it to a central team.
ETL Pipelines
What does ETL stand for?
Extract, Transform, Load
What happens in the “Transform” stage of an ETL pipeline?
Data is cleaned, converted, and prepared for the target system.
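The three ETL stages can be sketched end to end with the Python standard library. This is a minimal illustration, assuming a small CSV string as the source and an in-memory SQLite database as the target; `RAW_CSV`, the `payments` table, and the cleaning rules are all invented for the example.

```python
import csv
import io
import sqlite3

# Hypothetical source data with messy names and one missing amount.
RAW_CSV = "name,amount\n alice ,10.50\nBOB,\ncarol,7\n"


def extract(text):
    """Extract: read raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: clean, convert, and prepare rows for the target table."""
    cleaned = []
    for row in rows:
        if not row["amount"]:
            continue  # drop rows with no amount
        name = row["name"].strip().title()  # normalize whitespace and case
        cleaned.append((name, float(row["amount"])))  # convert text to a number
    return cleaned


def load(rows, conn):
    """Load: write the prepared rows into the target system."""
    conn.execute("CREATE TABLE IF NOT EXISTS payments (name TEXT, amount REAL)")
    conn.executemany("INSERT INTO payments VALUES (?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
result = conn.execute("SELECT name, amount FROM payments ORDER BY name").fetchall()
print(result)
```

Keeping each stage in its own function means the cleaning rules in `transform` can be tested without touching the source or the database.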