Data Engineering Intro Flashcards

Question 1

Q

The five V’s of big data

Answer

A

volume - how much?
variety - what kind?
velocity - how frequent?
veracity - how accurate?
value - how useful?

Question 2

Q

Data engineer tasks

Answer

A

Ingest and store data
Set up databases
Build data pipelines
Strong software skills

Question 3

Q

Data scientist tasks

Answer

A

Exploit data
Access databases
Use pipeline outputs
Strong analytical skills

Question 4

Q

Data pipelines

Answer

A

Extracting
Transforming
Combining
Validating
Loading
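
The five stages above can be sketched as plain Python functions. This is a minimal illustration, not a production pipeline: the sample records, the field names, and the in-memory list standing in for the target store are all hypothetical.

```python
# Hypothetical raw inputs standing in for two source systems.
RAW_ORDERS = [
    {"order_id": "1", "customer_id": "a", "amount": " 19.90 "},
    {"order_id": "2", "customer_id": "b", "amount": "5.00"},
    {"order_id": "3", "customer_id": "z", "amount": "-4.00"},
]
RAW_CUSTOMERS = [
    {"customer_id": "a", "name": "Ada"},
    {"customer_id": "b", "name": "Bob"},
]


def extract():
    """Extracting: read raw records from the sources."""
    return list(RAW_ORDERS), list(RAW_CUSTOMERS)


def transform(orders):
    """Transforming: clean strings and cast fields to proper types."""
    return [
        {
            "order_id": int(o["order_id"]),
            "customer_id": o["customer_id"],
            "amount": float(o["amount"].strip()),
        }
        for o in orders
    ]


def combine(orders, customers):
    """Combining: attach the customer name to each order."""
    names = {c["customer_id"]: c["name"] for c in customers}
    return [{**o, "customer_name": names.get(o["customer_id"])} for o in orders]


def validate(rows):
    """Validating: keep rows with a known customer and a non-negative amount."""
    return [
        r for r in rows
        if r["customer_name"] is not None and r["amount"] >= 0
    ]


def load(rows, target):
    """Loading: append the validated rows to the target store."""
    target.extend(rows)
    return len(rows)


def run_pipeline(target):
    orders, customers = extract()
    rows = validate(combine(transform(orders), customers))
    return load(rows, target)


warehouse = []  # hypothetical stand-in for a real target database
loaded = run_pipeline(warehouse)
```

Order 3 is rejected at the validating stage because its customer is unknown and its amount is negative, so only two rows reach the target.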

Question 5

Q

ETL

Answer

A

Popular framework

extract data
transform extracted data
load transformed data to another database

Question 6

Q

Structured data

Answer

A

Easy to search and organize
Consistent model, rows, and columns
Defined types
Can be grouped to form relations
Stored in relational databases
About 20% of the data is structured
Created and queried using SQL
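
As a small illustration of the points above, the sketch below uses Python's built-in sqlite3 module to create two tables with defined column types, relate them through a key, and query them with SQL. The table names, columns, and rows are made up for the example.

```python
import sqlite3

# In-memory database so the example leaves no files behind.
conn = sqlite3.connect(":memory:")

# Defined types and a consistent row/column model.
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
)
conn.execute(
    "CREATE TABLE orders ("
    " id INTEGER PRIMARY KEY,"
    " customer_id INTEGER NOT NULL REFERENCES customers(id),"
    " amount REAL NOT NULL)"
)

conn.executemany(
    "INSERT INTO customers (id, name) VALUES (?, ?)",
    [(1, "Ada"), (2, "Bob")],
)
conn.executemany(
    "INSERT INTO orders (id, customer_id, amount) VALUES (?, ?, ?)",
    [(1, 1, 20.0), (2, 1, 5.0), (3, 2, 7.5)],
)

# The shared key lets the two tables be grouped into a relation.
totals = conn.execute(
    """
    SELECT c.name, SUM(o.amount)
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.id
    GROUP BY c.name
    ORDER BY c.name
    """
).fetchall()
conn.close()
```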

Question 7

Q

Semi-structured data

Answer

A

Relatively easy to search and organize
Consistent model, less rigid implementation: different observations have different sizes
Different types
Can be grouped, but needs more work
Stored in NoSQL databases; common formats: JSON, XML, YAML
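
A short sketch of what this looks like with JSON, using Python's standard json module. The records are invented for the example: they share a rough model (id, name) but differ in size and in the type of the id field, so searching them means handling missing keys.

```python
import json

# Hypothetical records: same general model, different sizes and types.
raw = """[
  {"id": 1, "name": "Ada", "email": "ada@example.com"},
  {"id": 2, "name": "Bob", "phones": ["555-0100", "555-0101"]},
  {"id": "3", "name": "Cy"}
]"""

records = json.loads(raw)

# Different observations have different sizes.
field_counts = [len(r) for r in records]

# The same field can hold different types.
id_types = sorted({type(r["id"]).__name__ for r in records})

# Grouping needs more work: missing keys must be handled explicitly.
with_email = [r["name"] for r in records if r.get("email")]
without_email = [r["name"] for r in records if not r.get("email")]
```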

Question 8

Q

Unstructured data

Answer

A

Does not follow a model, can’t be contained in rows and columns
Difficult to search and organize
Usually text, sound, pictures, or videos
Usually stored in data lakes, can appear in data warehouses or databases
Most data is unstructured
Use AI to search and organize unstructured data
Add descriptive text (such as tags or labels) to make it semi-structured

Question 9

Q

Data lake

Answer

A

Stores all the raw data
Can be petabytes (1 petabyte = 1 million GB)
Stores all data structures
Cost-effective
Difficult to analyze
Requires an up-to-date data catalog
Used by data scientists
Big data, real-time analytics

Question 10

Q

Data warehouse

Answer

A

Specific data for specific use
Relatively small
Stores mainly structured data
More costly to update
Optimized for data analysis
Also used by data analysts and business analysts

Question 11

Q

Data catalog for data lakes

Answer

A

Record:
What is the source of this data?
Where is this data used?
Who is the owner of the data?
How often is the data updated?
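
One way to picture a catalog entry is as a small record with one field per question above. The sketch below is only an illustration: the class name, field names, and the sample dataset are hypothetical, and real data catalogs are usually dedicated tools rather than a Python dictionary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogEntry:
    dataset: str            # path or name of the data in the lake
    source: str             # what is the source of this data?
    used_by: tuple          # where is this data used?
    owner: str              # who is the owner of the data?
    update_frequency: str   # how often is the data updated?


catalog = {}


def register(entry):
    """Add or replace the catalog entry for a dataset."""
    catalog[entry.dataset] = entry


register(
    CatalogEntry(
        dataset="raw/clickstream",
        source="web server logs",
        used_by=("recommendation model",),
        owner="platform team",
        update_frequency="hourly",
    )
)

owner = catalog["raw/clickstream"].owner
```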

Question 12

Q

Scheduling

Answer

A

Can apply to any task listed in data processing
Scheduling is the glue of your system
Holds each piece together and organizes how they work together
Runs tasks in a specific order and resolves all dependencies
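
The idea of running tasks in order while resolving dependencies can be sketched with graphlib.TopologicalSorter from the Python standard library (Python 3.9+). The task names and the dependency graph are hypothetical, and a real scheduler would also handle timing, retries, and failures.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks that must finish before it starts.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "validate": {"transform"},
    "load": {"validate"},
}

run_log = []

# Hypothetical task bodies: each one just records that it ran.
tasks = {name: (lambda n=name: run_log.append(n)) for name in dependencies}

# static_order() yields every task after all of its dependencies.
for name in TopologicalSorter(dependencies).static_order():
    tasks[name]()
```

Because each task here depends on the previous one, there is only one valid order, and run_log ends up listing the tasks from extract to load.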

Question 13

Q

Parallel computing

Answer

A

Split tasks up into several smaller subtasks
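
A minimal sketch of the split step, using concurrent.futures from the standard library. The workload (summing squares) and the helper names are made up for the example. A thread pool is used to keep the sketch simple; for CPU-bound work in CPython, a process pool or several machines would typically be used instead.

```python
from concurrent.futures import ThreadPoolExecutor


def split(data, n_chunks):
    """Split a list into at most n_chunks pieces of near-equal size."""
    size = max(1, -(-len(data) // n_chunks))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def subtask(chunk):
    """One smaller task: sum of squares over a single chunk."""
    return sum(x * x for x in chunk)


numbers = list(range(1, 101))
chunks = split(numbers, 4)

# Run the subtasks concurrently, then merge the partial results.
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(subtask, chunks))

total = sum(partials)
```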