Data Engineering Fundamentals Flashcards by Hải Yến Trịnh

What are the three main types of data?

Structured, Unstructured, and Semi-structured

Give an example of structured data.

Database table, CSV file, Excel spreadsheet

What is a key characteristic of unstructured data?

It doesn’t have a predefined schema or structure.

Give an example of unstructured data.

Image, audio file, email

What is semi-structured data?

Data with some organization, like tags or hierarchies, but not as rigid as structured data.
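
As a small illustration, the sketch below parses two JSON records whose fields differ. The record contents and field names are made up for the example.

```python
import json

# Two hypothetical records: both are tagged key/value data, but they do not
# share a fixed set of columns the way rows in a table would.
raw_records = [
    '{"id": 1, "name": "Ada", "tags": ["admin", "ops"]}',
    '{"id": 2, "name": "Lin", "address": {"city": "Hanoi"}}',
]

records = [json.loads(line) for line in raw_records]

# The union of keys shows the loose, self-describing structure.
all_keys = sorted({key for record in records for key in record})
print(all_keys)

# Optional, nested fields have to be read defensively.
cities = [record.get("address", {}).get("city") for record in records]
print(cities)
```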

Properties of Data

What are the three Vs of Big Data?

Volume, Velocity, Variety

What does “Volume” refer to in the context of data?

The amount or size of data.

What does “Velocity” refer to in the context of data?

The speed at which data is generated and processed.

What does “Variety” refer to in the context of data?

The different types and sources of data.

Data Warehouses vs. Data Lakes

What is a data warehouse optimized for?

Complex queries and analysis of structured data.

What is a key characteristic of a data lake?

It can store vast amounts of raw data in its native format.

What is the schema approach for a data warehouse?

Schema-on-write (schema is defined before data is loaded)

What is the schema approach for a data lake?

Schema-on-read (schema is defined when data is read)
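
The contrast between the two schema approaches can be sketched in a few lines. This is a toy model, not how any particular warehouse or lake product works: `SCHEMA`, `warehouse`, `lake`, and the helper functions are all hypothetical names introduced for the example.

```python
import json

# Hypothetical schema: field name -> expected Python type.
SCHEMA = {"user_id": int, "amount": float}


def validate(record: dict) -> dict:
    """Raise ValueError if the record does not match SCHEMA exactly."""
    if set(record) != set(SCHEMA):
        raise ValueError(f"unexpected fields: {sorted(record)}")
    for field, expected_type in SCHEMA.items():
        if not isinstance(record[field], expected_type):
            raise ValueError(f"{field} must be {expected_type.__name__}")
    return record


# Schema-on-write: validate before storing, so bad rows never land.
warehouse = []


def write_to_warehouse(record: dict) -> None:
    warehouse.append(validate(record))


# Schema-on-read: store the raw text as-is, apply the schema when reading.
lake = []


def write_to_lake(raw_line: str) -> None:
    lake.append(raw_line)


def read_from_lake() -> list:
    rows = []
    for raw_line in lake:
        data = json.loads(raw_line)
        # Coerce to the schema at read time; extra fields are ignored.
        rows.append({"user_id": int(data["user_id"]), "amount": float(data["amount"])})
    return rows


write_to_warehouse({"user_id": 1, "amount": 9.5})
write_to_lake('{"user_id": "2", "amount": "3.25", "note": "raw"}')
print(warehouse)
print(read_from_lake())
```

The lake accepts the loosely typed line without complaint, while the same record would be rejected by `write_to_warehouse`.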

Data Lakehouse and Data Mesh

What is a data lakehouse?

A hybrid architecture combining features of data lakes and data warehouses.

What is a key concept of a data mesh?

Domain-oriented, decentralized data ownership, with each domain team managing its data as a product.

ETL Pipelines

What does ETL stand for?

Extract, Transform, Load

What happens in the “Transform” stage of an ETL pipeline?

Data is cleaned, converted, and prepared for the target system.
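
A minimal sketch of all three stages, using only the Python standard library. The source data, the `customers` table, and the cleaning rules are assumptions chosen for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical source data; in practice this would come from a file or API.
SOURCE_CSV = """name,signup_date,spend
 alice ,2024-01-05,10.50
BOB,2024-02-11,
carol,2024-03-20,7
"""


def extract(text: str) -> list:
    """Extract: read raw rows from the CSV source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows: list) -> list:
    """Transform: trim and normalize names, default missing spend to 0.0."""
    cleaned = []
    for row in rows:
        name = row["name"].strip().title()
        spend = float(row["spend"]) if row["spend"].strip() else 0.0
        cleaned.append((name, row["signup_date"], spend))
    return cleaned


def load(rows: list, conn: sqlite3.Connection) -> None:
    """Load: write the cleaned rows into the target table."""
    conn.execute("CREATE TABLE customers (name TEXT, signup_date TEXT, spend REAL)")
    conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(SOURCE_CSV)), conn)
result = conn.execute("SELECT name, spend FROM customers ORDER BY name").fetchall()
print(result)  # [('Alice', 10.5), ('Bob', 0.0), ('Carol', 7.0)]
```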

Data Formats

What is CSV commonly used for?

Storing data in a tabular format, often for spreadsheets or databases.

What is JSON commonly used for?

Data interchange, especially in web applications.

What is a key advantage of Avro?

It stores data and its schema together.

What is Parquet optimized for?

Analytics and efficient querying of large datasets; it stores data column by column, so queries can read only the columns they need.

Data Modeling and Lineage

What is data lineage?

Tracking the flow and transformation of data.

What is a star schema?

A data modeling technique often used in data warehouses, in which a central fact table references surrounding dimension tables.
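
A tiny star schema can be sketched with SQLite. The table names (`fact_sales`, `dim_product`, `dim_date`) and the sample rows are illustrative assumptions, not part of the deck.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables hold descriptive attributes.
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER);

-- The fact table holds measures plus foreign keys to each dimension.
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id INTEGER REFERENCES dim_date(date_id),
    amount REAL
);

INSERT INTO dim_product VALUES (1, 'books'), (2, 'games');
INSERT INTO dim_date VALUES (10, 2023), (11, 2024);
INSERT INTO fact_sales VALUES (1, 10, 5.0), (1, 11, 7.0), (2, 11, 20.0);
""")

# A typical star-schema query: join the fact table to its dimensions,
# then aggregate the measure by dimension attributes.
sales_by_category_year = conn.execute("""
    SELECT p.category, d.year, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    JOIN dim_date AS d ON d.date_id = f.date_id
    GROUP BY p.category, d.year
    ORDER BY p.category, d.year
""").fetchall()
print(sales_by_category_year)
```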

Database Performance Optimization

What is the purpose of indexing in a database?

To speed up data retrieval.
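
One way to see the effect is to compare SQLite's query plan before and after creating an index. The `events` table and the index name are hypothetical, and the exact plan wording varies between SQLite versions, so the sketch prints the plans rather than assuming their text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, user_id INTEGER, payload TEXT)")
conn.executemany(
    "INSERT INTO events (user_id, payload) VALUES (?, ?)",
    [(i % 100, "x") for i in range(1000)],
)

QUERY = "SELECT id FROM events WHERE user_id = 42"


def query_plan(connection: sqlite3.Connection) -> str:
    """Return SQLite's description of how it would run QUERY."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + QUERY).fetchall()
    # The human-readable detail is the last column of each plan row.
    return " ".join(row[-1] for row in rows)


plan_before = query_plan(conn)  # typically a full table scan
conn.execute("CREATE INDEX idx_events_user_id ON events (user_id)")
plan_after = query_plan(conn)   # typically a search using the new index

print(plan_before)
print(plan_after)
```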

What is database partitioning?

Dividing a large table or database into smaller, more manageable parts, for example by date range or by a hash of a key.
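
Hash partitioning is one common scheme. The sketch below assigns rows to partitions in plain Python; the partition count and the choice of `crc32` as the hash are assumptions for the example, and real databases use their own partitioning functions.

```python
from collections import defaultdict
from zlib import crc32

NUM_PARTITIONS = 4  # assumed partition count for the example


def partition_for(key: str) -> int:
    """Map a key to a partition with a stable hash (crc32)."""
    return crc32(key.encode("utf-8")) % NUM_PARTITIONS


rows = [{"user": f"user{i}", "value": i} for i in range(20)]

partitions = defaultdict(list)
for row in rows:
    partitions[partition_for(row["user"])].append(row)

# Every row lands in exactly one partition, and the same key always
# maps to the same partition.
sizes = {pid: len(part) for pid, part in sorted(partitions.items())}
print(sizes)
```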

Data Sampling

What is the goal of stratified sampling?

To ensure representation of different subgroups in a sample.
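
A minimal sketch of proportional stratified sampling, assuming a hypothetical population of "free" and "paid" users. The helper name `stratified_sample` and the rule of taking at least one item per group are choices made for the example.

```python
import random
from collections import defaultdict


def stratified_sample(items, key, fraction, seed=0):
    """Sample roughly `fraction` of each subgroup, at least one per group."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for item in items:
        groups[key(item)].append(item)

    sample = []
    for members in groups.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical population: 90 "free" users and 10 "paid" users.
population = [("free", i) for i in range(90)] + [("paid", i) for i in range(10)]
sample = stratified_sample(population, key=lambda item: item[0], fraction=0.1)

counts = defaultdict(int)
for plan, _ in sample:
    counts[plan] += 1
print(dict(counts))  # {'free': 9, 'paid': 1}
```

A simple random sample of 10 could easily contain no paid users at all; sampling within each group keeps the small subgroup represented.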

Data Skew

What is data skew?

An imbalance of data across partitions in a distributed system.

Give an example of a technique to address data skew.

Adaptive partitioning, salting, repartitioning.
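
Salting can be sketched without a distributed engine. In the toy model below, `partition_of` is a deterministic stand-in for a real hash partitioner, and the key distribution and salt count are assumptions.

```python
import random
from collections import Counter

NUM_PARTITIONS = 4
NUM_SALTS = 4  # assumed number of salt values

# Hypothetical skewed data: one "hot" key dominates.
keys = ["hot"] * 1000 + ["a", "b", "c", "d"] * 10


def partition_of(key: str) -> int:
    # Deterministic stand-in for a hash partitioner.
    return sum(key.encode("utf-8")) % NUM_PARTITIONS


# Without salting, every "hot" record lands in the same partition.
plain = Counter(partition_of(key) for key in keys)

# With salting, a random suffix spreads the hot key over several partitions.
# Downstream aggregation must strip the salt and combine partial results.
rng = random.Random(0)
salted = Counter(partition_of(f"{key}_{rng.randrange(NUM_SALTS)}") for key in keys)

print("largest partition without salt:", max(plain.values()))
print("largest partition with salt:   ", max(salted.values()))
```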

Data Validation

What is data completeness?

Ensuring all required data is present.

What is data consistency?

Ensuring data values are consistent across different datasets.
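
Both checks can be sketched as small functions. The required fields and the two example datasets (`crm`, `billing`) are hypothetical.

```python
REQUIRED_FIELDS = ("order_id", "customer_id", "total")  # assumed schema


def missing_fields(record: dict) -> list:
    """Completeness: list required fields that are absent or empty."""
    return [f for f in REQUIRED_FIELDS if record.get(f) in (None, "")]


def inconsistent_keys(left: dict, right: dict) -> list:
    """Consistency: keys present in both datasets whose values disagree."""
    return sorted(k for k in left.keys() & right.keys() if left[k] != right[k])


order = {"order_id": 7, "customer_id": None, "total": 19.99}
print(missing_fields(order))  # ['customer_id']

# Hypothetical customer emails as seen by two different systems.
crm = {"c1": "a@example.com", "c2": "b@example.com"}
billing = {"c1": "a@example.com", "c2": "b@old.example.com", "c3": "c@example.com"}
print(inconsistent_keys(crm, billing))  # ['c2']
```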

SQL

What does the COUNT() function do in SQL?

Counts rows. COUNT(*) counts every row; COUNT(column) counts only rows where that column is not NULL.

What does the GROUP BY clause do in SQL?

Groups rows based on the values in one or more columns.
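
The two cards above can be tried out with SQLite from Python. The `orders` table and its rows are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, coupon TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("north", "SAVE10"), ("north", None), ("south", None), ("south", None)],
)

# COUNT(*) counts rows; COUNT(coupon) skips rows where coupon is NULL.
totals = conn.execute("SELECT COUNT(*), COUNT(coupon) FROM orders").fetchone()
print(totals)  # (4, 1)

# GROUP BY produces one output row per distinct region.
per_region = conn.execute(
    "SELECT region, COUNT(*), COUNT(coupon) FROM orders "
    "GROUP BY region ORDER BY region"
).fetchall()
print(per_region)  # [('north', 2, 1), ('south', 2, 0)]
```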

What is the purpose of pivoting in SQL?

To turn distinct values from rows into separate columns, usually with an aggregate per column.
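
Some engines provide a dedicated PIVOT operator. The sketch below instead uses conditional aggregation, which also works in SQLite; the `sales` table and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, quarter TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("pen", "Q1", 10.0), ("pen", "Q2", 15.0), ("ink", "Q1", 4.0)],
)

# Portable pivot: one conditional aggregate per output column.
pivoted = conn.execute("""
    SELECT product,
           SUM(CASE WHEN quarter = 'Q1' THEN amount ELSE 0.0 END) AS q1,
           SUM(CASE WHEN quarter = 'Q2' THEN amount ELSE 0.0 END) AS q2
    FROM sales
    GROUP BY product
    ORDER BY product
""").fetchall()
print(pivoted)  # [('ink', 4.0, 0.0), ('pen', 10.0, 15.0)]
```

The quarter values that were spread over rows in `sales` become the `q1` and `q2` columns of the result.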

Git

What does git init do?

Initializes a new Git repository.

What does git add do?

Adds changes to the staging area.

What does git commit do?

Records changes to the repository.

What does git branch do?

Lists, creates, or deletes branches.

What does git checkout do?

Switches to a different branch or commit; it can also restore files in the working tree.

What does git merge do?

Combines changes from different branches.

What does git push do?

Sends local commits to a remote repository.

What does git pull do?

Fetches changes from a remote repository and merges them.

What does git stash do?

Temporarily shelves uncommitted changes so they can be reapplied later.

What does git rebase do?

Reapplies commits onto a different base branch.

Data Engineering Fundamentals (43 cards)