Data Engineering Basics Flashcards
What are the three types of data?
Structured, unstructured, and semi-structured
What is the definition of structured data?
Data that is organized according to a defined schema. Typically found in relational databases. It has a consistent structure and uses rows and columns.
What is the definition of unstructured data?
Data that does not have a predefined structure. Examples include videos, audio files, images, emails, and word processing documents.
What is the definition of semi-structured data?
It has some structure in the form of tags, hierarchies, or other patterns. XML and JSON are good examples of this.
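A minimal sketch of working with semi-structured data, assuming a made-up JSON record (the field names are hypothetical). It shows that values are reached through keys and nesting rather than fixed columns, and that a field may simply be absent:

```python
import json

# Hypothetical record: fields are tagged and nested, and not every
# record has to carry the same set of fields.
raw = '{"id": 1, "name": "Ada", "tags": ["vip"], "address": {"city": "London"}}'

record = json.loads(raw)

# Navigate the hierarchy by key rather than by fixed column position.
city = record["address"]["city"]

# Optional fields can be missing, so read them defensively.
phone = record.get("phone")
```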
What is the definition of volume in data engineering terms?
It refers to the amount or size of the data, typically measured in units such as MB, GB, or PB.
What is the definition of velocity in data engineering terms?
It refers to the speed at which new data is generated, collected, and processed.
What is the definition of variety in data engineering terms?
It refers to the different types, structures, and sources of data, such as structured, semi-structured, and unstructured data.
What is the definition of a data warehouse?
It is a centralized repository optimized for analysis where data from different sources is stored in a structured format.
What are some characteristics of a data warehouse?
Designed for complex queries
Loaded via an ETL process
Optimized for read-heavy operations.
What is the definition of a data lake?
A storage repository that holds vast amounts of raw data in its native format, including structured, semi-structured, and unstructured data. Think about S3 or HDFS.
What are some characteristics of a data lake?
No predefined schema
Data is loaded as-is, not preprocessed
Supports batch, real-time, and streaming processing
Can be queried for data transformation or exploration
What is the difference between ELT and ETL?
ETL is used with data warehouses. You extract the data, transform it, and then load it.
ELT is used with data lakes. You extract the data, load it as-is, and then transform it as needed.
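The difference in ordering can be sketched as follows. This is only an illustration: the `extract` and `transform` helpers, the sample rows, and the use of plain lists as stand-ins for a warehouse and a lake are all assumptions.

```python
def extract():
    # Hypothetical source rows; amounts arrive as strings.
    return [
        {"order_id": 1, "amount": "19.99"},
        {"order_id": 2, "amount": "5.00"},
    ]

def transform(rows):
    # Cast amounts to float so downstream queries can aggregate them.
    return [{**row, "amount": float(row["amount"])} for row in rows]

def etl(warehouse):
    # ETL: transform before loading, so the target only holds cleaned rows.
    warehouse.extend(transform(extract()))

def elt(lake):
    # ELT: load raw rows first, transform later when a consumer needs them.
    lake.extend(extract())
    return transform(lake)

warehouse, lake = [], []
etl(warehouse)
curated = elt(lake)
```

After running this, `warehouse` holds numeric amounts, while `lake` still holds the raw strings and `curated` is derived from it on demand.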
What is the downside of a data warehouse?
It is less agile because changing requirements could require schema and data changes.
What is traditionally more cost-effective, a data lake or data warehouse?
A data lake, but storage costs could exceed data warehouse costs.
What is a data lakehouse?
A hybrid of a data warehouse and a data lake. It can provide ACID transactions. An example is AWS Lake Formation.
What is the difference between ODBC and JDBC?
JDBC is platform independent, but requires you to use Java. ODBC is platform dependent, but you can use it with any language.
What is Avro?
It is a binary format that stores both the data and its schema together. Good for big data and real-time processing systems.
What is Parquet?
A columnar storage format optimized for analytics. Good for large datasets with an analytics engine.
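To see why a columnar layout suits analytics, here is a small pure-Python illustration of row-oriented versus column-oriented storage. It is not the Parquet format itself, and the sample rows are hypothetical; it only shows that an aggregate over one column needs to touch just that column's values.

```python
# Row-oriented layout: one dict per record (hypothetical data).
rows = [
    {"order_id": 1, "region": "EU", "amount": 10.0},
    {"order_id": 2, "region": "US", "amount": 25.5},
    {"order_id": 3, "region": "EU", "amount": 4.5},
]

# Column-oriented layout: one list per column.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# An aggregate over a single column reads only that column's list.
total = sum(columns["amount"])
```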
What is data lineage?
A visual representation that traces the flow and transformation of data through its lifecycle, from source to final destination. This helps track errors back to the source. May be required for compliance.
What is schema evolution?
The ability to adapt and change the schema of a dataset over time without disrupting existing processes or systems. Maintains backward compatibility. An example is the AWS Glue Schema Registry.
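One common way to keep backward compatibility is to give new fields a default value, so records written with the older schema can still be read. A minimal sketch of that idea, assuming a hypothetical schema whose second version adds an `email` field (this does not use the Glue Schema Registry API):

```python
# Version 2 of a hypothetical schema adds "email" with a default value.
SCHEMA_V2_DEFAULTS = {"id": None, "name": None, "email": "unknown"}

def read_with_v2(record):
    # Fill fields missing from older records with the schema defaults;
    # values present in the record take precedence.
    return {**SCHEMA_V2_DEFAULTS, **record}

old_record = {"id": 1, "name": "Ada"}  # written with version 1
new_record = {"id": 2, "name": "Bo", "email": "bo@example.com"}

rows = [read_with_v2(r) for r in (old_record, new_record)]
```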
What is stratified sampling?
You divide your population into subgroups (strata) and randomly sample each one to ensure representation of all subgroups.
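A sketch of stratified sampling using only the standard library. The `stratified_sample` helper, the `region` field, and the 20% fraction are assumptions chosen for illustration:

```python
import random
from collections import defaultdict

def stratified_sample(items, key, fraction, seed=0):
    """Randomly sample the same fraction from every stratum."""
    rng = random.Random(seed)

    # Group the population into strata by the given key.
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)

    sample = []
    for group in strata.values():
        # Take at least one item so small strata stay represented.
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample

# Hypothetical population: 25 EU orders and 75 US orders.
orders = [{"id": i, "region": "EU" if i % 4 == 0 else "US"} for i in range(100)]
sample = stratified_sample(orders, key=lambda o: o["region"], fraction=0.2)
```

Each region contributes in proportion to its size, so the smaller EU group is not left out by chance.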
What is systematic sampling?
You select items from an ordered population at a fixed interval. An example is picking every 4th order.
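The every-4th-order example can be sketched with list slicing. The order IDs are hypothetical, and in practice the starting point is often chosen at random within the first interval:

```python
orders = list(range(1, 21))  # hypothetical order IDs 1..20
step = 4

# Start at the 4th order and then take every 4th one after it.
sample = orders[step - 1::step]
```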
What is data skew?
It refers to the unequal distribution or imbalance of data across various nodes or partitions.
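A small sketch of how skew shows up when rows are partitioned by key. The customer IDs, the modulo partitioner, and the partition count are assumptions for illustration; one "hot" key sends most rows to a single partition:

```python
from collections import Counter

def partition_counts(keys, num_partitions):
    # Assign each integer key to a partition, then count rows per partition.
    return Counter(k % num_partitions for k in keys)

# Hypothetical workload: customer 7 accounts for 90 of 100 rows.
keys = [7] * 90 + list(range(10))
counts = partition_counts(keys, num_partitions=4)

# Ratio of the busiest partition to the average; 1.0 means perfectly even.
skew_ratio = max(counts.values()) / (len(keys) / 4)
```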
What is data completeness?
Ensures all required data is present and essential parts are not missing.
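A completeness check can be as simple as listing the required fields that are missing or empty. The field names and the `missing_fields` helper below are hypothetical:

```python
REQUIRED = ("order_id", "customer_id", "amount")

def missing_fields(record):
    # A field counts as missing if it is absent, None, or an empty string.
    return [f for f in REQUIRED if record.get(f) in (None, "")]

complete = {"order_id": 1, "customer_id": 42, "amount": 9.5}
incomplete = {"order_id": 2, "customer_id": None}
```

Calling `missing_fields(incomplete)` reports both the null `customer_id` and the absent `amount`.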
What is data consistency?
Ensures data values are consistent across datasets and do not contradict each other.
What is data accuracy?
Ensures data is correct and reliable.
What is data integrity?
Ensures data maintains its correctness and consistency over its lifecycle and across systems.
What git command creates a repository?
git init
What git command lists all branches?
git branch
What git command switches to a specific branch?
git checkout
What git command merges branches?
git merge