Data Engineering Fundamentals Flashcards
Data Engineering Fundamentals
What are the three main types of data?
Structured, Unstructured, and Semi-structured
Give an example of structured data.
Database table, CSV file, Excel spreadsheet
What is a key characteristic of unstructured data?
It doesn’t have a predefined schema or structure.
Give an example of unstructured data.
Image, audio file, free-text email body
What is semi-structured data?
Data with some organization, like tags or hierarchies, but not as rigid as structured data.
Properties of Data
What are the three Vs of Big Data?
Volume, Velocity, Variety
What does “Volume” refer to in the context of data?
The amount or size of data.
What does “Velocity” refer to in the context of data?
The speed at which data is generated and processed.
What does “Variety” refer to in the context of data?
The different types and sources of data.
Data Warehouses vs. Data Lakes
What is a data warehouse optimized for?
Complex queries and analysis of structured data.
What is a key characteristic of a data lake?
It can store vast amounts of raw data in its native format.
What is the schema approach for a data warehouse?
Schema-on-write (schema is defined before data is loaded)
What is the schema approach for a data lake?
Schema-on-read (schema is defined when data is read)
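The difference between the two schema approaches can be sketched with the Python standard library. This is a minimal illustration, not how any particular warehouse or lake product works: the `events` table, the `read_events` helper, and the sample records are hypothetical, and SQLite stands in for a warehouse.

```python
import json
import sqlite3

# Schema-on-write: the table definition exists before any row is loaded,
# so a row that violates it is rejected at load time.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER NOT NULL, action TEXT NOT NULL)")
conn.execute("INSERT INTO events VALUES (?, ?)", (1, "login"))
try:
    conn.execute("INSERT INTO events VALUES (?, ?)", (2, None))  # violates NOT NULL
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

# Schema-on-read: raw records are stored as-is, and structure is applied
# only when a consumer reads them.
raw_lines = [
    '{"user_id": 1, "action": "login"}',
    '{"user_id": 2}',  # the missing field is still stored
]

def read_events(lines):
    """Apply a schema at read time, filling a default for missing fields."""
    for line in lines:
        record = json.loads(line)
        yield int(record["user_id"]), record.get("action", "unknown")

parsed = list(read_events(raw_lines))
```

In the first half the bad row never reaches storage; in the second half it is stored and the reader decides how to interpret it.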
Data Lakehouse and Data Mesh
What is a data lakehouse?
A hybrid architecture combining features of data lakes and data warehouses.
What is a key concept of a data mesh?
Domain-based data management.
ETL Pipelines
What does ETL stand for?
Extract, Transform, Load
What happens in the “Transform” stage of an ETL pipeline?
Data is cleaned, converted, and prepared for the target system.
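The three ETL stages can be sketched as three small functions. This is a toy example under stated assumptions: the inline `SOURCE_CSV` data, the `purchases` table, and the cleaning rules are all hypothetical, and an in-memory SQLite database stands in for the target system.

```python
import csv
import io
import sqlite3

# Hypothetical source data; a real pipeline would read from a file or an API.
SOURCE_CSV = """name,signup_date,amount
 Alice ,2024-01-05,10.50
bob,2024-01-06,
Carol,2024-01-07,7
"""

def extract(text):
    """Extract: read raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: clean names, convert types, drop rows with no amount."""
    cleaned = []
    for row in rows:
        if not row["amount"]:
            continue
        name = row["name"].strip().title()
        cleaned.append((name, row["signup_date"], float(row["amount"])))
    return cleaned

def load(rows, conn):
    """Load: write the prepared rows into the target table."""
    conn.execute("CREATE TABLE purchases (name TEXT, signup_date TEXT, amount REAL)")
    conn.executemany("INSERT INTO purchases VALUES (?, ?, ?)", rows)

conn = sqlite3.connect(":memory:")
load(transform(extract(SOURCE_CSV)), conn)
loaded = conn.execute("SELECT name, amount FROM purchases ORDER BY name").fetchall()
```

Keeping each stage as its own function makes it possible to test the transform rules without touching the source or the target.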
Data Formats
What is CSV commonly used for?
Storing data in a tabular format, often for spreadsheets or databases.
What is JSON commonly used for?
Data interchange, especially in web applications.
What is a key advantage of Avro?
It stores data and its schema together.
What is Parquet optimized for?
Analytics and efficient querying of large datasets.
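A short round trip shows one practical difference between CSV and JSON. The `record` below is hypothetical, and the `|` separator used to flatten the list is an arbitrary choice for this sketch. Avro and Parquet are not shown because they need third-party libraries.

```python
import csv
import io
import json

record = {"id": 1, "name": "Ada", "tags": ["admin", "dev"]}

# JSON keeps types and nesting intact across a round trip.
json_round_trip = json.loads(json.dumps(record))

# CSV is flat text: the nested list must be flattened, and every value
# comes back as a string.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "name", "tags"])
writer.writeheader()
writer.writerow(
    {"id": record["id"], "name": record["name"], "tags": "|".join(record["tags"])}
)
buffer.seek(0)
csv_round_trip = next(csv.DictReader(buffer))
```

After the CSV round trip, `id` is the string `"1"` rather than the integer `1`, so the reader has to restore types itself.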
Data Modeling and Lineage
What is data lineage?
Tracking the flow and transformation of data.
What is a star schema?
A data modeling technique, often used in data warehouses, in which a central fact table references the dimension tables that surround it.
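A tiny star schema can be sketched in SQLite through Python. The table names (`fact_sales`, `dim_product`, `dim_date`) and the rows are hypothetical and chosen only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, year INTEGER);
-- The fact table holds measures plus a key into each dimension.
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id    INTEGER REFERENCES dim_date(date_id),
    amount     REAL
);
INSERT INTO dim_product VALUES (1, 'books'), (2, 'games');
INSERT INTO dim_date    VALUES (10, 2023), (11, 2024);
INSERT INTO fact_sales  VALUES (1, 10, 5.0), (1, 11, 7.0), (2, 11, 20.0);
""")

# A typical query joins the fact table to its dimensions and aggregates.
totals = conn.execute("""
    SELECT p.category, d.year, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    JOIN dim_date    AS d ON d.date_id = f.date_id
    GROUP BY p.category, d.year
    ORDER BY p.category, d.year
""").fetchall()
```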
Database Performance Optimization
What is the purpose of indexing in a database?
To speed up data retrieval.
What is database partitioning?
Dividing a large table or database into smaller, more manageable parts, for example by date range or by key.
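The effect of an index can be observed by comparing SQLite's query plan before and after creating one. The `orders` table, the `idx_orders_customer` index, and the `plan` helper are hypothetical names for this sketch; the exact wording of the plan text varies between SQLite versions, so the code only captures it. Partitioning is not shown here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    [(i % 100, float(i)) for i in range(1000)],
)

query = "SELECT * FROM orders WHERE customer_id = 42"

def plan(sql):
    """Return the text of SQLite's query plan for a statement."""
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

plan_before = plan(query)  # no index yet, so SQLite scans the whole table
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_after = plan(query)   # the plan now refers to idx_orders_customer
```

Printing `plan_before` and `plan_after` shows the switch from a full scan to an index search.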
Data Sampling
What is the goal of stratified sampling?
To ensure representation of different subgroups in a sample.
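A minimal sketch of proportional stratified sampling, assuming an in-memory list of records. The `stratified_sample` function, its parameters, and the `users` population are hypothetical; keeping at least one row per stratum is a design choice of this sketch, not a general rule.

```python
import random
from collections import defaultdict

def stratified_sample(rows, key, fraction, seed=0):
    """Sample the same fraction from each subgroup (stratum) of rows."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[key(row)].append(row)
    sample = []
    for group in strata.values():
        # Keep at least one row so small strata are still represented.
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample

# Hypothetical population: 10 "paid" users and 90 "free" users.
users = [{"id": i, "plan": "paid" if i < 10 else "free"} for i in range(100)]
sample = stratified_sample(users, key=lambda u: u["plan"], fraction=0.2)
```

A simple random sample of 20 users could easily contain no paid users at all; sampling within each stratum keeps the 10:90 ratio in the result.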
Data Skew
What is data skew?
An imbalance of data across partitions in a distributed system.
Give an example of a technique to address data skew.
Adaptive partitioning, salting, repartitioning.
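Salting can be sketched without a distributed engine by simulating how keys map to partitions. Everything here is an assumption made for illustration: the toy `partition_for` hash, the partition and salt counts, and the skewed `keys` workload. Real systems use their own partitioners.

```python
import random
from collections import Counter

NUM_PARTITIONS = 4
SALT_BUCKETS = 4

def partition_for(key):
    """Assign a key to a partition with a simple deterministic hash."""
    return sum(key.encode()) % NUM_PARTITIONS

def salted(key, rng):
    """Append a random salt so one hot key becomes several distinct keys."""
    return f"{key}#{rng.randrange(SALT_BUCKETS)}"

rng = random.Random(0)
# Hypothetical skewed workload: one hot key dominates.
keys = ["hot_customer"] * 1000 + ["a", "b", "c", "d"] * 10

plain_load = Counter(partition_for(k) for k in keys)
salted_load = Counter(partition_for(salted(k, rng)) for k in keys)
```

Without salting, every `hot_customer` row lands in one partition. With salting the rows spread across partitions, at the cost of a second aggregation step that strips the salt and combines the partial results per original key.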
Data Validation
What is data completeness?
Ensuring all required data is present.
What is data consistency?
Ensuring data values are consistent across different datasets.
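Both checks can be sketched as small functions over in-memory records. The field names, the `missing_fields` and `inconsistent_ids` helpers, and the sample data are hypothetical; consistency is shown here as one specific case, an order that refers to a customer missing from the customer dataset.

```python
REQUIRED_FIELDS = ("order_id", "customer_id", "amount")

def missing_fields(record):
    """Completeness: list required fields that are absent or empty."""
    return [f for f in REQUIRED_FIELDS if record.get(f) in (None, "")]

def inconsistent_ids(orders, customers):
    """Consistency: find orders whose customer_id is unknown to the customer dataset."""
    known = {c["customer_id"] for c in customers}
    return [o["order_id"] for o in orders if o.get("customer_id") not in known]

customers = [{"customer_id": 1}, {"customer_id": 2}]
orders = [
    {"order_id": "A1", "customer_id": 1, "amount": 9.99},
    {"order_id": "A2", "customer_id": 3, "amount": 5.00},  # unknown customer
    {"order_id": "A3", "customer_id": 2, "amount": None},  # missing amount
]

incomplete = {o["order_id"]: missing_fields(o) for o in orders if missing_fields(o)}
orphans = inconsistent_ids(orders, customers)
```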
SQL
What does the COUNT() function do in SQL?
Counts rows: COUNT(*) counts all rows, while COUNT(column) counts only rows where that column is not NULL.
What does the GROUP BY clause do in SQL?
Groups rows based on the values in one or more columns.
What is the purpose of pivoting in SQL?
To turn distinct values stored in rows into separate columns, usually with an aggregate per column.
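COUNT, GROUP BY, and pivoting can be tried together against an in-memory SQLite database. The `sales` table and its rows are hypothetical. The pivot uses conditional aggregation with CASE, which is one portable way to pivot; some databases also offer a dedicated PIVOT operator.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (region TEXT, quarter TEXT, amount REAL);
INSERT INTO sales VALUES
    ('east', 'Q1', 10), ('east', 'Q2', 20), ('east', 'Q2', NULL),
    ('west', 'Q1', 5),  ('west', 'Q2', 15);
""")

# COUNT(*) counts rows; COUNT(amount) skips rows where amount is NULL.
counts = conn.execute(
    "SELECT region, COUNT(*), COUNT(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()

# Pivot with conditional aggregation: one row per region, one column per quarter.
pivot = conn.execute("""
    SELECT region,
           SUM(CASE WHEN quarter = 'Q1' THEN amount ELSE 0 END) AS q1,
           SUM(CASE WHEN quarter = 'Q2' THEN amount ELSE 0 END) AS q2
    FROM sales
    GROUP BY region
    ORDER BY region
""").fetchall()
```

The `east` region has three rows but only two non-NULL amounts, which is why the two counts differ for it.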
Git
What does git init do?
Initializes a new Git repository.
What does git add do?
Adds changes to the staging area.
What does git commit do?
Records changes to the repository.
What does git branch do?
Lists, creates, or deletes branches.
What does git checkout do?
Switches to a different branch, or restores files in the working tree.
What does git merge do?
Combines changes from different branches.
What does git push do?
Sends local commits to a remote repository.
What does git pull do?
Fetches changes from a remote repository and merges them into the current branch.
What does git stash do?
Temporarily shelves uncommitted changes so they can be reapplied later.
What does git rebase do?
Reapplies commits on top of a different base commit, such as the tip of another branch.