Data Engineering Fundamentals Flashcards

1
Q

What are the three main types of data?

A

Structured, Unstructured, and Semi-structured

2
Q

Give an example of structured data.

A

Database table, CSV file, Excel spreadsheet

3
Q

What is a key characteristic of unstructured data?

A

It doesn’t have a predefined schema or structure.

4
Q

Give an example of unstructured data.

A

Image, audio file, email

5
Q

What is semi-structured data?

A

Data with some organization, like tags or hierarchies, but not as rigid as structured data.
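A small illustration of the answer above, using JSON as the semi-structured format. The records and field names are made up for this sketch: both describe a user, yet each carries a different set of fields, which a rigid table would not allow.

```python
import json

# Hypothetical records: same kind of entity, different fields and nesting.
raw = """
[
  {"id": 1, "name": "Ada", "tags": ["admin"]},
  {"id": 2, "name": "Lin", "address": {"city": "Oslo"}}
]
"""

records = json.loads(raw)

# Collect the keys each record carries; a structured table would require
# every row to have exactly the same columns.
keys_per_record = [sorted(r.keys()) for r in records]
print(keys_per_record)
```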

6
Q

What are the three Vs of Big Data?

Properties of Data

A

Volume, Velocity, Variety

7
Q

What does “Volume” refer to in the context of data?

A

The amount or size of data.

8
Q

What does “Velocity” refer to in the context of data?

A

The speed at which data is generated and processed.

9
Q

What does “Variety” refer to in the context of data?

A

The different types and sources of data.

10
Q

What is a data warehouse optimized for?

Data Warehouses vs. Data Lakes

A

Complex queries and analysis of structured data.

11
Q

What is a key characteristic of a data lake?

A

It can store vast amounts of raw data in its native format.

12
Q

What is the schema approach for a data warehouse?

A

Schema-on-write (schema is defined before data is loaded)

13
Q

What is the schema approach for a data lake?

A

Schema-on-read (schema is defined when data is read)
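The two schema approaches can be contrasted in a toy sketch. The `SCHEMA` dictionary, the function names, and the sample records are assumptions for illustration only; real warehouses and lakes enforce schemas through their own engines.

```python
import json

SCHEMA = {"order_id": int, "amount": float}  # hypothetical schema

def write_with_schema(store, record):
    """Schema-on-write: reject records that do not match before storing."""
    for field, typ in SCHEMA.items():
        if not isinstance(record.get(field), typ):
            raise ValueError(f"bad or missing field: {field}")
    store.append(record)

def read_with_schema(raw_lines):
    """Schema-on-read: raw text was stored as-is; apply types when reading."""
    for line in raw_lines:
        doc = json.loads(line)
        yield {"order_id": int(doc["order_id"]), "amount": float(doc["amount"])}

# Warehouse-style: the record is validated at load time.
warehouse = []
write_with_schema(warehouse, {"order_id": 1, "amount": 9.5})

# Lake-style: the raw line is kept untouched and interpreted only on read.
lake = ['{"order_id": "2", "amount": "3.25", "note": "raw"}']
parsed = list(read_with_schema(lake))
```

The trade-off this hints at: schema-on-write catches bad records early, while schema-on-read defers that cost and risk to query time.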

14
Q

What is a data lakehouse?

Data Lakehouse and Data Mesh

A

A hybrid architecture combining features of data lakes and data warehouses.

15
Q

What is a key concept of a data mesh?

A

Domain-based data management: each business domain owns its data and serves it to others as a product.

16
Q

What does ETL stand for?

ETL Pipelines

A

Extract, Transform, Load

17
Q

What happens in the “Transform” stage of an ETL pipeline?

A

Data is cleaned, converted, and prepared for the target system.
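A minimal sketch of the three ETL stages, assuming an in-memory CSV string as the source and an in-memory SQLite table as the target. The `payments` table and the cleaning rules are invented for the example.

```python
import csv
import io
import sqlite3

# Extract: read rows from a CSV source (an in-memory string here).
source = io.StringIO("name,amount\n alice ,10\nBOB,\ncarol,7.5\n")
rows = list(csv.DictReader(source))

# Transform: trim and normalize names, drop rows with no amount, cast types.
cleaned = [
    (row["name"].strip().title(), float(row["amount"]))
    for row in rows
    if row["amount"].strip()
]

# Load: write the prepared rows into the target table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payments (name TEXT, amount REAL)")
conn.executemany("INSERT INTO payments VALUES (?, ?)", cleaned)
loaded = conn.execute("SELECT name, amount FROM payments ORDER BY name").fetchall()
```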

18
Q

What is CSV commonly used for?

Data Formats

A

Storing and exchanging tabular data as plain text, often between spreadsheets and databases.

19
Q

What is JSON commonly used for?

A

Data interchange, especially in web applications.

20
Q

What is a key advantage of Avro?

A

It stores data and its schema together.

21
Q

What is Parquet optimized for?

A

Analytics and efficient querying of large datasets.
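Parquet is a columnar format, which is a large part of why it suits analytics. The toy sketch below is not the Parquet file format itself; it only contrasts a row-oriented layout with a column-oriented one, using made-up data.

```python
# Row-oriented layout (CSV-like): one complete record after another.
rows = [
    {"id": 1, "country": "NO", "amount": 10.0},
    {"id": 2, "country": "SE", "amount": 4.0},
    {"id": 3, "country": "NO", "amount": 6.0},
]

# Column-oriented layout (the idea behind Parquet): one list per column.
columns = {name: [r[name] for r in rows] for name in rows[0]}

# An aggregate over one column only needs that column's values,
# not the id or country of every record.
total_amount = sum(columns["amount"])
```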

22
Q

What is data lineage?

Data Modeling and Lineage

A

Tracking where data originates and how it moves and is transformed as it flows through systems.

23
Q

What is a star schema?

A

A data modeling technique common in data warehouses, in which a central fact table references the surrounding dimension tables.
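A small star schema can be sketched with SQLite. The table names (`fact_sales`, `dim_product`, `dim_date`) and the data are hypothetical; the point is only that measures live in the fact table and descriptive attributes in the dimensions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, year INTEGER);
CREATE TABLE fact_sales  (
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id    INTEGER REFERENCES dim_date(date_id),
    amount     REAL
);
INSERT INTO dim_product VALUES (1, 'books'), (2, 'games');
INSERT INTO dim_date    VALUES (10, 2023), (11, 2024);
INSERT INTO fact_sales  VALUES (1, 10, 5.0), (1, 11, 7.0), (2, 11, 20.0);
""")

# A typical analytical query: join the fact table to a dimension
# and aggregate the measure by a dimension attribute.
sales_by_category = conn.execute("""
    SELECT p.category, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    GROUP BY p.category
    ORDER BY p.category
""").fetchall()
```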

24
Q

What is the purpose of indexing in a database?

Database Performance Optimization

A

To speed up data retrieval.
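One way to see an index at work is to compare SQLite's query plan before and after creating it. The `orders` table and the index name are assumptions for this sketch, and the exact plan wording varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL)")
conn.executemany(
    "INSERT INTO orders (customer, total) VALUES (?, ?)",
    [(f"cust{i % 100}", float(i)) for i in range(1000)],
)

def plan(sql):
    """Return SQLite's query-plan text (last column of each plan row)."""
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT * FROM orders WHERE customer = 'cust7'"
plan_before = plan(query)  # without an index: a scan of the whole table
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer)")
plan_after = plan(query)   # with the index: a search through idx_orders_customer
```

Indexes are not free: each one has to be maintained on every write, so they are usually added for columns that are filtered or joined on often.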

25
Q

What is database partitioning?

A

Dividing a large table or database into smaller, more manageable parts (partitions), typically by a key such as a date range or a hash.
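A toy sketch of partitioning by month, using made-up events. Real databases do this through their own partitioning features; the dictionary here only stands in for separate partitions.

```python
from collections import defaultdict

events = [("2024-01-03", "a"), ("2024-01-20", "b"), ("2024-02-11", "c")]

# Partition by month: each partition holds one month's rows.
partitions = defaultdict(list)
for day, payload in events:
    partitions[day[:7]].append((day, payload))

# A query for February only needs to read the "2024-02" partition.
february = partitions["2024-02"]
```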

26
Q

What is the goal of stratified sampling?

Data Sampling

A

To ensure representation of different subgroups in a sample.
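A minimal sketch of proportional stratified sampling. The function name, the "at least one per group" rule, and the sample population are choices made for this example, not a standard API.

```python
import random
from collections import defaultdict

def stratified_sample(items, key, fraction, seed=0):
    """Sample the same fraction from every subgroup (stratum)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))  # keep at least one per group
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical population: 90 "free" users and 10 "paid" users.
population = [("free", i) for i in range(90)] + [("paid", i) for i in range(10)]
sample = stratified_sample(population, key=lambda row: row[0], fraction=0.1)
counts = {g: sum(1 for row in sample if row[0] == g) for g in ("free", "paid")}
```

A plain random sample of 10 rows could easily contain no "paid" users at all; sampling per stratum guarantees the small group is represented.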

27
Q

What is data skew?

Data Skew

A

An imbalance of data across partitions in a distributed system.

28
Q

Give an example of a technique to address data skew.

A

Adaptive partitioning, salting, repartitioning.
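Salting can be sketched with a simple hash partitioner. The partition count, the number of salts, and the key names are hypothetical, and `zlib.crc32` stands in for whatever hash a real engine uses.

```python
import random
import zlib
from collections import Counter

NUM_PARTITIONS = 4
NUM_SALTS = 8  # hypothetical; tune to the severity of the skew

def partition_of(key):
    """Deterministic hash partitioner (crc32 is stable across runs)."""
    return zlib.crc32(key.encode()) % NUM_PARTITIONS

# One very frequent key plus many rare ones.
keys = ["hot_user"] * 1000 + [f"user{i}" for i in range(100)]

# Without salting, every "hot_user" row lands in the same partition.
plain = Counter(partition_of(k) for k in keys)

# With salting, a random suffix spreads the hot key over several partitions.
rng = random.Random(0)
salted = Counter(partition_of(f"{k}#{rng.randrange(NUM_SALTS)}") for k in keys)
```

In a real aggregation or join, salted keys need a second step that strips the salt and combines the partial results per original key.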

29
Q

What is data completeness?

Data Validation

A

Ensuring all required data is present.

30
Q

What is data consistency?

A

Ensuring the same data values agree across different datasets.
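Completeness and consistency checks can be sketched in a few lines. The required fields, the two datasets, and their contents are invented for this example.

```python
REQUIRED = ("id", "email")  # hypothetical required fields

orders = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": None},
    {"id": 3, "email": "c@old.example"},
]
crm_emails = {1: "a@example.com", 2: "b@example.com", 3: "c@new.example"}

# Completeness: every required field is present and not empty.
incomplete = [
    r["id"] for r in orders
    if any(r.get(field) in (None, "") for field in REQUIRED)
]

# Consistency: values that exist in both datasets must agree.
inconsistent = [
    r["id"] for r in orders
    if r.get("email") and crm_emails.get(r["id"]) != r["email"]
]
```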

31
Q

What does the COUNT() function do in SQL?

SQL

A

Counts rows: COUNT(*) counts all rows, while COUNT(column) counts only the rows where that column is not NULL.

32
Q

What does the GROUP BY clause do in SQL?

A

Groups rows based on the values in one or more columns.
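GROUP BY and COUNT() are often used together. The sketch below runs them through Python's sqlite3 module on a made-up `visits` table, and also shows how COUNT(*) and COUNT(column) differ when a column contains NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (country TEXT, referrer TEXT)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?)",
    [("NO", "ads"), ("NO", None), ("SE", "search"), ("NO", "search")],
)

# One output row per country. COUNT(*) counts rows; COUNT(referrer)
# skips rows where referrer is NULL.
per_country = conn.execute("""
    SELECT country, COUNT(*), COUNT(referrer)
    FROM visits
    GROUP BY country
    ORDER BY country
""").fetchall()
```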

33
Q

What is the purpose of pivoting in SQL?

A

To turn row values into columns, for example one column per category, usually combined with an aggregate.
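Some databases offer a dedicated PIVOT clause; a portable alternative is conditional aggregation, sketched here with SQLite. The `sales` table and its values are assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, quarter TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("north", "Q1", 10), ("north", "Q2", 15), ("south", "Q1", 7), ("south", "Q2", 3)],
)

# Each quarter value becomes its own column; one row per region remains.
pivoted = conn.execute("""
    SELECT region,
           SUM(CASE WHEN quarter = 'Q1' THEN amount ELSE 0 END) AS q1,
           SUM(CASE WHEN quarter = 'Q2' THEN amount ELSE 0 END) AS q2
    FROM sales
    GROUP BY region
    ORDER BY region
""").fetchall()
```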

34
Q

What does git init do?

Git

A

Initializes a new Git repository.

35
Q

What does git add do?

A

Adds changes to the staging area.

36
Q

What does git commit do?

A

Records the staged changes as a new commit in the local repository.

37
Q

What does git branch do?

A

Lists, creates, or deletes branches.

38
Q

What does git checkout do?

A

Switches to a different branch; it can also restore files in the working tree.

39
Q

What does git merge do?

A

Integrates changes from another branch into the current branch.

40
Q

What does git push do?

A

Sends local commits to a remote repository.

41
Q

What does git pull do?

A

Fetches changes from a remote repository and merges them into the current branch.

42
Q

What does git stash do?

A

Temporarily shelves uncommitted changes so the working tree is clean; they can be reapplied later.

43
Q

What does git rebase do?

A

Reapplies commits on top of a different base, such as the tip of another branch, creating new commits in the process.