Data Engineering Basics Flashcards by Chris Lombardi

What are the three types of data?

Structured, Unstructured, Semistructured

What is the definition of structured data?

Data that is organized according to a defined schema, with a consistent structure of rows and columns. Typically found in relational databases.

What is the definition of unstructured data?

Data that does not have a predefined structure. Examples include videos, audio files, images, emails, and word processing documents.

What is the definition of semi-structured data?

It has some structure in the form of tags, hierarchies, or other patterns. XML and JSON are good examples of this.
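
As a rough illustration of the "tags and hierarchies" idea, the sketch below parses a small JSON document with Python's standard json module. The record and its field names (order_id, customer, coupon) are made up for this example.

```python
import json

# Hypothetical order record: nested and tagged, but with no fixed table schema.
raw = '{"order_id": 17, "customer": {"name": "Ada", "tags": ["vip"]}, "note": null}'

record = json.loads(raw)

# Fields are reached by key and hierarchy rather than by column position.
customer_name = record["customer"]["name"]
tags = record["customer"].get("tags", [])

# Optional fields may be absent from some records, so read them defensively.
coupon = record.get("coupon")
```

The structure travels with the data itself, which is why two records in the same file can carry different fields.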

What is the definition of volume in data engineering terms?

It refers to the amount or size of the data, measured in units such as MB, GB, or PB.

What is the definition of velocity in data engineering terms?

It refers to the speed at which new data is generated, collected, and processed.

What is the definition of variety in data engineering terms?

It refers to the different types, structures, and sources of data, such as structured, semi-structured, and unstructured data.

What is the definition of a data warehouse?

It is a centralized repository optimized for analysis where data from different sources is stored in a structured format.

What are some characteristics of a data warehouse?

Designed for complex queries
Loaded via an ETL process
Optimized for read-heavy operations.

What is the definition of a data lake?

A storage repository that holds vast amounts of raw data in its native format, including structured, semi-structured, and unstructured data. Think of S3 or HDFS.

What are some characteristics of a data lake?

No predefined schema
Data is loaded as-is, not preprocessed
Supports batch, real-time, and streaming processing
Can be queried for data transformation or exploration

What is the difference between ELT and ETL?

ETL is used with data warehouses. You extract the data, transform it, and then load it.

ELT is used with data lakes. You extract the data, load it as-is, and then transform it as needed.
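
The ordering difference can be sketched with Python's built-in sqlite3 module standing in for the target store. This is only an illustration under assumptions: the table names (orders_etl, orders_raw, orders_elt), the sample rows, and the clean_amount helper are hypothetical, and a real warehouse or lake would use its own engine.

```python
import sqlite3

# Hypothetical source rows: (order_id, amount as text with stray whitespace).
source_rows = [(1, " 10.50 "), (2, "3.25"), (3, " 7 ")]

def clean_amount(text):
    """Transform step: strip whitespace and convert to a number."""
    return float(text.strip())

def run_etl(conn, rows):
    # ETL: transform in the pipeline, then load only the cleaned result.
    conn.execute("CREATE TABLE orders_etl (order_id INTEGER, amount REAL)")
    cleaned = [(order_id, clean_amount(amount)) for order_id, amount in rows]
    conn.executemany("INSERT INTO orders_etl VALUES (?, ?)", cleaned)

def run_elt(conn, rows):
    # ELT: load the raw values untouched, then transform inside the store.
    conn.execute("CREATE TABLE orders_raw (order_id INTEGER, amount TEXT)")
    conn.executemany("INSERT INTO orders_raw VALUES (?, ?)", rows)
    conn.execute(
        "CREATE TABLE orders_elt AS "
        "SELECT order_id, CAST(TRIM(amount) AS REAL) AS amount FROM orders_raw"
    )

conn = sqlite3.connect(":memory:")
run_etl(conn, source_rows)
run_elt(conn, source_rows)

etl_total = conn.execute("SELECT SUM(amount) FROM orders_etl").fetchone()[0]
elt_total = conn.execute("SELECT SUM(amount) FROM orders_elt").fetchone()[0]
```

Both paths reach the same totals. The ELT path also keeps orders_raw, so a different transformation can be applied later without extracting the data again.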

What is the downside of a data warehouse?

It is less agile: new requirements can force changes to both the schema and the stored data.

What is traditionally more cost-effective, a data lake or data warehouse?

A data lake, but storage costs could exceed data warehouse costs.

What is a data lakehouse?

A hybrid of a data warehouse and a data lake. It can provide ACID transactions. An example is AWS Lake Formation.

What is the difference between ODBC and JDBC?

JDBC is platform independent, but requires you to use Java. ODBC is platform dependent, but you can use it with any language.

What is Avro?

It is a binary format that stores both the data and its schema together. Good for big data and real-time processing systems.

What is Parquet?

A columnar storage format optimized for analytics. Good for large datasets with an analytics engine.
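
To see why a columnar layout suits analytics, the sketch below contrasts row-oriented and column-oriented storage using plain Python structures. It does not read or write real Parquet files, and the sample rows are hypothetical.

```python
# Hypothetical rows, as a row-oriented store would hold them.
rows = [
    {"order_id": 1, "region": "east", "amount": 10.0},
    {"order_id": 2, "region": "west", "amount": 4.0},
    {"order_id": 3, "region": "east", "amount": 6.0},
]

# Column-oriented layout: one contiguous list per column.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# An aggregate such as SUM(amount) touches a single column
# and never reads order_id or region.
total_amount = sum(columns["amount"])
```

Values within one column also share a type and tend to repeat, which generally makes a columnar file easier to compress than a row-by-row one.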

What is data lineage?

A visual representation that traces the flow and transformation of data through its lifecycle, from source to final destination. This helps trace errors back to their source. May be required for compliance.

What is schema evolution?

The ability to adapt and change the schema of a dataset over time without disrupting existing processes or systems. It maintains backward compatibility. An example is the AWS Glue Schema Registry.
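
One common way to keep backward compatibility is to add new fields with default values, so that records written under the old schema can still be read. The sketch below illustrates that idea in plain Python. It is not how the Glue Schema Registry is implemented, and SCHEMA_V1, SCHEMA_V2, and read_record are hypothetical names.

```python
# Hypothetical schemas: field name -> default value (None means required).
SCHEMA_V1 = {"order_id": None, "amount": None}
SCHEMA_V2 = {"order_id": None, "amount": None, "currency": "USD"}  # new field

def read_record(record, schema):
    """Read a record with the given schema, filling defaults for missing fields."""
    result = {}
    for field, default in schema.items():
        if field in record:
            result[field] = record[field]
        elif default is not None:
            result[field] = default
        else:
            raise ValueError(f"missing required field: {field}")
    return result

old_record = {"order_id": 1, "amount": 9.5}                      # written with v1
new_record = {"order_id": 2, "amount": 3.0, "currency": "EUR"}   # written with v2

upgraded = read_record(old_record, SCHEMA_V2)
```

Here the v1 record gains the default currency when read with the v2 schema, so older data keeps working after the change.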

What is stratified sampling?

You divide your population into subgroups (strata) and randomly sample each one to ensure representation of all subgroups.
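
A minimal sketch of this procedure using only the standard library. The stratified_sample helper, the fixed seed, and the sample population are assumptions made for illustration.

```python
import random
from collections import defaultdict

def stratified_sample(items, key, fraction, seed=0):
    """Randomly sample the same fraction from each stratum (at least one item)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical population: 90 "east" orders and 10 "west" orders.
orders = [{"id": i, "region": "east" if i < 90 else "west"} for i in range(100)]
sample = stratified_sample(orders, key=lambda o: o["region"], fraction=0.1)
```

With a 10% fraction this draws 9 east orders and 1 west order. A simple random sample of 10 could miss the smaller west group entirely.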

What is systematic sampling?

You select items from an ordered list at a fixed interval. An example is picking every 4th order.
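
The "every 4th order" example can be sketched with list slicing. The systematic_sample helper and the order IDs are hypothetical.

```python
def systematic_sample(items, step, start=0):
    """Return every `step`-th item, beginning at index `start`."""
    if step < 1:
        raise ValueError("step must be at least 1")
    return items[start::step]

orders = list(range(1, 13))  # hypothetical order IDs 1..12
every_fourth = systematic_sample(orders, step=4, start=3)
```

In practice the starting index is often chosen at random within the first interval, so that the same positions are not always selected.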

What is data skew?

It refers to the unequal distribution or imbalance of data across various nodes or partitions.
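
Skew often appears when records are partitioned by a key and one key is far more frequent than the others. The sketch below assumes a hypothetical workload and simple hash partitioning over four partitions; partition_for is an illustrative helper, not a particular engine's partitioner.

```python
import zlib
from collections import Counter

def partition_for(key, num_partitions):
    # Deterministic hash so the assignment does not vary between runs.
    return zlib.crc32(key.encode("utf-8")) % num_partitions

# Hypothetical workload: one "hot" customer produces most of the events.
keys = ["customer-1"] * 80 + [f"customer-{i}" for i in range(2, 22)]

counts = Counter(partition_for(k, 4) for k in keys)
largest = max(counts.values())
skew_ratio = largest / (len(keys) / 4)  # 1.0 would be a perfectly even spread
```

All 80 events for the hot key land in the same partition, so that partition does most of the work while the others sit largely idle.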

What is data completeness?

Ensures all required data is present and essential parts are not missing.
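
A completeness check can be sketched as a scan for required fields that are absent or empty. The field names, the sample records, and the rule that None and the empty string both count as missing are assumptions for this example.

```python
REQUIRED_FIELDS = ("order_id", "customer_id", "amount")  # hypothetical

def missing_fields(record, required=REQUIRED_FIELDS):
    """Return the required fields that are absent, None, or empty strings."""
    return [f for f in required if record.get(f) in (None, "")]

records = [
    {"order_id": 1, "customer_id": "c1", "amount": 10.0},
    {"order_id": 2, "customer_id": "", "amount": 5.0},
    {"order_id": 3, "amount": None},
]

# Map each incomplete record's ID to the fields it is missing.
incomplete = {r["order_id"]: missing_fields(r) for r in records if missing_fields(r)}

# Share of records with every required field present.
completeness = 1 - len(incomplete) / len(records)
```

Only one of the three records is complete here, so the completeness score is about 0.33.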