Data Engineering Intro Flashcards

1
Q

The five V’s of big data

A
  • volume - how much?
  • variety - what kind?
  • velocity - how frequent?
  • veracity - how accurate?
  • value - how useful?
2
Q

Data engineer tasks

A

Ingest and store data
Set up databases
Build data pipelines
Strong software skills

3
Q

Data scientist tasks

A

Exploit data
Access databases
Use pipeline outputs
Strong analytical skills

4
Q

Data pipelines

A
Extracting
Transforming
Combining
Validating
Loading
5
Q

ETL

A

A popular framework for designing data pipelines

  • extract data
  • transform extracted data
  • load transformed data to another database
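The three ETL steps can be sketched in a few lines of Python. This is only an illustration: the `RAW_ROWS` records, the `payments` table, and the function names are made up for the example, and an in-memory SQLite database stands in for the target database.

```python
import sqlite3

# Hypothetical raw records standing in for an extracted source.
RAW_ROWS = [
    {"name": " Ada ", "amount": "10.50"},
    {"name": "Grace", "amount": "3"},
]

def extract():
    """Extract: return the raw records (here, an in-memory list)."""
    return list(RAW_ROWS)

def transform(rows):
    """Transform: trim names and convert amounts from text to float."""
    return [(r["name"].strip(), float(r["amount"])) for r in rows]

def load(rows, conn):
    """Load: write the transformed rows into a target table."""
    conn.execute("CREATE TABLE IF NOT EXISTS payments (name TEXT, amount REAL)")
    conn.executemany("INSERT INTO payments VALUES (?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")  # assumed target database
load(transform(extract()), conn)
loaded = conn.execute("SELECT name, amount FROM payments ORDER BY name").fetchall()
```

Keeping each step in its own function makes it easier to test or replace one stage without touching the others.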
6
Q

Structured data

A
Easy to search and organize
Consistent model, rows, and columns
Defined types
Can be grouped to form relations
Stored in relational databases
About 20% of all data is structured
Created and queried using SQL
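A small sketch of what "defined types" and "relations" look like in practice, using Python's built-in `sqlite3` module. The `customers` and `orders` tables and their contents are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Each column has a defined type; orders relate to customers by customer_id.
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(id),
    total REAL NOT NULL
);
INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
INSERT INTO orders VALUES (1, 1, 20.0), (2, 1, 5.0), (3, 2, 7.5);
""")

# Query across the relation with SQL: total spent per customer.
totals = conn.execute("""
    SELECT c.name, SUM(o.total)
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.id
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()
```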
7
Q

Semi-structured data

A
Relatively easy to search and organize
Consistent model, less rigid implementation: different observations can have different sizes
Different types
Can be grouped, but needs more work
Stored in NoSQL databases; common formats: JSON, XML, YAML
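To illustrate "different observations can have different sizes", here is a minimal JSON example parsed with Python's standard `json` module. The records and field names are hypothetical.

```python
import json

# Two observations following the same loose model, but with different sizes:
# the second record has an extra "age" field and fewer favorites.
raw = """
[
  {"user": "ada", "favorites": ["chess", "math"]},
  {"user": "grace", "favorites": ["compilers"], "age": 45}
]
"""

records = json.loads(raw)

# Number of fields per observation differs from record to record.
field_counts = [len(r) for r in records]

# Grouping into rows "needs more work": missing fields must be handled.
rows = [(r["user"], r.get("age")) for r in records]
```

Here `r.get("age")` returns `None` for records without the field, which is one simple way to flatten such data into rows.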
8
Q

Unstructured data

A

Does not follow a model, can’t be contained in rows and columns
Difficult to search and organize
Usually text, sound, pictures, or videos
Usually stored in data lakes, can appear in data warehouses or databases
Most data is unstructured
Use AI to search and organize unstructured data
Add some information (e.g. tags or labels) to make it semi-structured

9
Q

Data lake

A
Stores all the raw data
Can be petabytes in size (1 petabyte = 1 million GB)
Stores all data structures
Cost-effective
Difficult to analyze
Requires an up-to-date data catalog
Used by data scientists
Big data, real-time analytics
10
Q

Data warehouse

A
Specific data for specific use
Relatively small
Stores mainly structured data
More costly to update
Optimized for data analysis
Used by data analysts and business analysts
11
Q

Data catalog for data lakes

A
Record:
What is the source of this data
Where is this data used
Who is the owner of the data
How often is the data updated
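One way to picture a catalog entry is as a small record holding the four answers above. The sketch below is an assumption for illustration only: the `CatalogEntry` class, its field names, and the `raw_clicks` dataset are not taken from any particular catalog tool.

```python
from dataclasses import dataclass

@dataclass
class CatalogEntry:
    dataset: str           # name of the dataset in the lake
    source: str            # what is the source of this data
    used_by: list          # where is this data used
    owner: str             # who is the owner of the data
    update_frequency: str  # how often is the data updated

# Hypothetical in-memory catalog keyed by dataset name.
catalog = {}

def register(entry):
    """Add or overwrite the catalog entry for a dataset."""
    catalog[entry.dataset] = entry

register(CatalogEntry(
    dataset="raw_clicks",
    source="web server logs",
    used_by=["recommendation model"],
    owner="data engineering team",
    update_frequency="hourly",
))
```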
12
Q

Scheduling

A

Can apply to any task listed in data processing
Scheduling is the glue of your system
Holds each piece together and organizes how they work together
Runs tasks in a specific order and resolves all dependencies
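"Runs tasks in a specific order and resolves all dependencies" can be sketched with a topological sort. This example uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9+). The task names are hypothetical, and a real scheduler does much more (timing, retries, monitoring).

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on (hypothetical pipeline).
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "validate": {"transform"},
    "load": {"validate"},
}

# static_order() yields tasks so that every dependency comes first.
order = list(TopologicalSorter(dependencies).static_order())

for task in order:
    print(f"running {task}")
```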

13
Q

Parallel computing

A

Splits a task into several smaller subtasks that can run at the same time
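A minimal sketch of the idea: one large sum is split into four chunks, each chunk is handled by a separate worker, and the partial results are combined. It uses a thread pool from `concurrent.futures` to keep the example simple. For CPU-heavy work in CPython, a process pool is the more usual choice. The data and chunk size are arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor

def chunk_sum(chunk):
    """Subtask: sum one slice of the data."""
    return sum(chunk)

data = list(range(1, 101))

# Split the big task into four smaller subtasks of 25 numbers each.
chunks = [data[i:i + 25] for i in range(0, len(data), 25)]

# Run the subtasks on up to four workers, then combine the partial results.
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_sums = list(pool.map(chunk_sum, chunks))

total = sum(partial_sums)
```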
