Data Engineering Intro Flashcards
The five V’s of big data
- volume - how much?
- variety - what kind?
- velocity - how frequent?
- veracity - how accurate?
- value - how useful?
Data engineer tasks
Ingest and store data
Set up databases
Build data pipelines
Strong software skills
Data scientist tasks
Exploit data
Access databases
Use pipeline outputs
Strong analytical skills
Data pipelines
Extracting
Transforming
Combining
Validating
Loading
ETL
Popular framework
- extract data
- transform extracted data
- load transformed data to another database
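The three ETL steps can be sketched as three small functions. This is a minimal illustration, not a production pipeline: the CSV text, the `sales` table, and the function names are all made up for the example, and an in-memory SQLite database stands in for the target database.

```python
import csv
import io
import sqlite3

# Hypothetical raw input; in practice this would come from a file, an API,
# or a source database.
RAW_CSV = "name,amount\nalice,10.5\nbob,4\n"

def extract(text):
    """Extract: read raw rows from the source as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: capitalize names and convert amounts to floats."""
    return [(row["name"].title(), float(row["amount"])) for row in rows]

def load(records, conn):
    """Load: write the transformed records into the target database."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (name TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", records)
    conn.commit()

def run_etl(text, conn):
    """Run the steps in order: extract, then transform, then load."""
    load(transform(extract(text)), conn)

conn = sqlite3.connect(":memory:")  # stand-in for the target database
run_etl(RAW_CSV, conn)
rows = conn.execute("SELECT name, amount FROM sales ORDER BY name").fetchall()
```

Keeping each step in its own function makes it easy to test or replace one step without touching the others.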
Structured data
Easy to search and organize
Consistent model, rows, and columns
Defined types
Can be grouped to form relations
Stored in relational databases
About 20% of the data is structured
Created and queried using SQL
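A small sketch of structured data, using Python's built-in `sqlite3` module as a stand-in for a relational database. The `customers` and `orders` tables and their contents are invented for illustration; they show defined column types, a relation between two tables, and a SQL query over both.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Each column has a defined type; orders.customer_id relates orders to customers.
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, "
    "customer_id INTEGER REFERENCES customers(id), "
    "total REAL NOT NULL)"
)

conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ada"), (2, "Grace")])
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 1, 20.0), (2, 1, 5.0), (3, 2, 7.5)],
)

# Query across the relation: total order value per customer.
totals = conn.execute(
    """
    SELECT c.name, SUM(o.total)
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.id
    GROUP BY c.name
    ORDER BY c.name
    """
).fetchall()
```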
Semi-structured data
Relatively easy to search and organize
Consistent model, less rigid implementation: different observations have different sizes
Different types
Can be grouped, but needs more work
Stored in NoSQL databases; common formats: JSON, XML, YAML
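A short sketch of why semi-structured data needs more work to group. The JSON records below are invented for the example: they share a loose model, but each observation has a different set of fields, and `id` is not always the same type.

```python
import json

# Hypothetical records: same general shape, different fields per observation.
raw = """
[
  {"id": 1, "name": "Ada", "tags": ["admin", "dev"]},
  {"id": 2, "name": "Grace", "email": "grace@example.com"},
  {"id": "3", "name": "Linus"}
]
"""
records = json.loads(raw)

# Collect every field that appears in at least one record.
columns = sorted({key for record in records for key in record})

# Force the records into rows and columns, filling missing fields with None.
rows = [[record.get(column) for column in columns] for record in records]
```

The extra step of discovering the columns and filling the gaps is the "more work" that structured data avoids.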
Unstructured data
Does not follow a model, can’t be contained in rows and columns
Difficult to search and organize
Usually text, sound, pictures, or videos
Usually stored in data lakes, can appear in data warehouses or databases
Most data is unstructured
Use AI to search and organize unstructured data
Add descriptive text (such as tags) to make it semi-structured
Data lake
Stores all the raw data
Can be petabytes (1 petabyte = 1 million GB)
Stores all data structures
Cost-effective
Difficult to analyze
Requires an up-to-date data catalog
Used by data scientists
Big data, real-time analytics
Data warehouse
Specific data for specific use
Relatively small
Stores mainly structured data
More costly to update
Optimized for data analysis
Also used by data analysts and business analysts
Data catalog for data lakes
Record:
- What is the source of this data?
- Where is this data used?
- Who is the owner of the data?
- How often is the data updated?
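One way to picture a catalog entry is as a small record with one field per question. This is only a sketch: `CatalogEntry`, `register`, `datasets_owned_by`, and the sample dataset are hypothetical names for the example, not part of any particular catalog tool.

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    dataset: str                                  # name or path of the data in the lake
    source: str                                   # what is the source of this data?
    used_by: list = field(default_factory=list)   # where is this data used?
    owner: str = ""                               # who is the owner of the data?
    update_frequency: str = ""                    # how often is the data updated?

catalog = {}

def register(entry):
    """Add or update the catalog entry for a dataset."""
    catalog[entry.dataset] = entry

def datasets_owned_by(owner):
    """Return the names of all datasets recorded for the given owner."""
    return sorted(name for name, entry in catalog.items() if entry.owner == owner)

register(CatalogEntry(
    dataset="raw/clickstream",
    source="web tracking service",
    used_by=["recommendation model"],
    owner="data-platform team",
    update_frequency="hourly",
))
```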
Scheduling
Can apply to any task listed in data processing
Scheduling is the glue of your system
Holds each piece and organizes how they work together
Runs tasks in a specific order and resolves all dependencies
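Running tasks in order while resolving dependencies can be sketched with a topological sort. The example uses `graphlib.TopologicalSorter` from the Python standard library (3.9+); the task names and the `run_in_order` helper are invented for illustration, and real schedulers also handle timing, retries, and failures.

```python
from graphlib import TopologicalSorter

# Hypothetical tasks: each key lists the tasks that must finish before it runs.
dependencies = {
    "extract_sales": set(),
    "extract_customers": set(),
    "combine": {"extract_sales", "extract_customers"},
    "load": {"combine"},
}

def run_in_order(deps, actions):
    """Run every task after all of its dependencies; return the order used."""
    order = list(TopologicalSorter(deps).static_order())
    for name in order:
        actions[name]()
    return order

log = []
# Each action just records that it ran; real tasks would do actual work.
actions = {name: (lambda n=name: log.append(n)) for name in dependencies}
order = run_in_order(dependencies, actions)
```

The two extract tasks do not depend on each other, so either may come first; only the constraints in `dependencies` are guaranteed.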
Parallel computing
Split tasks up into several smaller subtasks
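A minimal sketch of splitting one task into subtasks: a large sum is cut into chunks, each chunk is handed to a worker, and the partial results are combined. The helper names are made up for the example. A thread pool is used here only to keep the sketch short and runnable anywhere; for CPU-heavy work in Python, a process pool (`ProcessPoolExecutor`) is the more usual choice.

```python
from concurrent.futures import ThreadPoolExecutor

def split(items, n_chunks):
    """Split a list into at most n_chunks pieces of roughly equal size."""
    size = max(1, -(-len(items) // n_chunks))  # ceiling division, at least 1
    return [items[i:i + size] for i in range(0, len(items), size)]

def parallel_sum(items, n_workers=4):
    """Sum each chunk in a separate worker, then combine the partial sums."""
    chunks = split(items, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_sums = list(pool.map(sum, chunks))
    return sum(partial_sums)

total = parallel_sum(list(range(1, 101)))
```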