ETL Flashcards
Extract
Businesses store historical information or stream real-time data into many systems. This information is scattered across different software and structured in various formats. The extraction phase entails identifying the required data sources and gathering data from them.
Transform
When the data is gathered from its sources, it is usually placed in temporary storage called a staging area. While in the staging area, the data is formatted in accordance with defined standards and models. For example, financial figures in different formats ($34.50, 0.90 cents, 01,65) are converted into a single coherent format: $34.50, $0.90, $1.65 (see the sketch below).
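A minimal Python sketch of this kind of standardization; the `normalize_currency` helper and the three input formats are illustrative assumptions, not a general-purpose parser.

```python
def normalize_currency(raw: str) -> str:
    """Standardize mixed currency strings into "$X.XX" (handles only the three illustrative formats)."""
    raw = raw.strip().lower()
    if raw.endswith("cents"):                    # e.g. "0.90 cents"
        value = float(raw.replace("cents", "").strip())
    elif raw.startswith("$"):                    # e.g. "$34.50"
        value = float(raw[1:])
    else:                                        # e.g. "01,65" (comma as decimal separator)
        value = float(raw.replace(",", "."))
    return f"${value:.2f}"

print([normalize_currency(s) for s in ["$34.50", "0.90 cents", "01,65"]])
# ['$34.50', '$0.90', '$1.65']
```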
Load
The final stage of the ETL process is loading the structured and formatted data into a database. If the amount of data is small, any kind of database can be used. A specific type of database used in BI is called a data warehouse.
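A minimal sketch of the load step, using Python's built-in sqlite3 as a stand-in for a warehouse; the table name, columns, and rows are assumptions.

```python
import sqlite3

# Transformed records from the staging area (illustrative data).
rows = [("2024-01-05", 34.50), ("2024-01-06", 0.90)]

conn = sqlite3.connect("warehouse.db")  # stand-in for a real data warehouse
conn.execute("CREATE TABLE IF NOT EXISTS orders (order_date TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
conn.commit()
conn.close()
```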
Project Flow
Project Planning Phase
Project Design Phase
Data Modeling
Development Phase
Testing Phase
Production Management
Data warehousing
Specific data for specific use
Relatively small
Stores mainly structured data
More costly to update
Optimized for data analysis
Data Lake
Stores all raw data
Can be petabytes in size (1 petabyte = 1 million GB)
Stores all data structures
Cost-effective
Difficult to analyze (AI/ML works best)
Requires cataloging
Structured data
Easy to search and organize
Consistent model, rows, and columns
Defined types
Can be grouped to form relations
Stored in relational databases
About 20% of all data is structured
Created and queried using SQL
Semi structured data
Relatively easy to search and organize
Consistent model with a less rigid implementation: different observations can have different sizes
Different types
Can be grouped, but needs more work
Stored in NoSQL databases; common formats: JSON, XML, YAML
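For example, JSON records can follow one loose model while individual observations carry different fields; a small sketch using Python's standard json module (the records are made up).

```python
import json

# Two observations under one loose model, with different sizes (made-up records).
records = [
    '{"name": "Ada", "email": "ada@example.com"}',
    '{"name": "Grace", "email": "grace@example.com", "phones": ["555-0100", "555-0199"]}',
]

for raw in records:
    obj = json.loads(raw)
    # Optional fields are handled per record instead of being enforced by a schema.
    print(obj["name"], obj.get("phones", []))
```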
Unstructured data
Does not follow a model, can’t be contained in rows and columns
Difficult to search and organize
Usually text, sound, pictures, or videos
Usually stored in data lakes
Use AI to search and organize unstructured data
SQL Database
If your data is structured use SQL
SQL is a great fit for transaction-oriented systems such as customer relationship management tools, accounting software, and e-commerce platforms
Scales vertically: increase the capacity of a single server to scale the database
NoSQL database
Not Only SQL
Do not follow the relational model
Does not require data normalization
Schema Free
Scales horizontally: add more servers to power your growing database
Types: key-value based, column-oriented, graph-based, document-based
Normalized
Efficiently organizing data in a database
Minimize duplicate data
Minimize or avoid data modification issues
Store only related data in each table
Simplify queries
Primary key
Primary Key: uniquely identifies each row in a table. Only one primary key per table. No null or duplicate values
Foreign Key
A foreign key is a set of one or more columns in a table that refers to the primary key in another table. There is no special code, configuration, or table definition you need to put in place to officially “designate” a foreign key.
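A minimal SQLite sketch (table and column names are illustrative): the REFERENCES clause documents the relationship, though as noted above, the column works as a foreign key even without it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,  -- unique, non-null, one per table
    name        TEXT NOT NULL
);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    -- Foreign key: refers to the primary key of another table.
    -- (SQLite only enforces it when PRAGMA foreign_keys = ON.)
    customer_id INTEGER REFERENCES customers(customer_id)
);
""")
```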
OLTP - online transaction processing
Captures and maintains transaction data in a database.
Row-oriented (stored per record, added per transaction)
Emphasis on fast processing of transactional operations such as INSERT, UPDATE, and DELETE commands.
Works in real time.
The original source of data where business data is being recorded in real-time
Faster processing because queries are small and simple
Transactional database
Control and run essential business operations in real time.
Normalized databases for efficiency.
Examples: retail, banking
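An illustrative OLTP-style transaction in SQLite: a small, row-oriented write committed in real time (the accounts schema and amounts are assumptions).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100.0), (2, 50.0)])

# A typical short OLTP transaction: transfer $20 between two rows.
with conn:  # commits on success, rolls back on error
    conn.execute("UPDATE accounts SET balance = balance - 20 WHERE id = 1")
    conn.execute("UPDATE accounts SET balance = balance + 20 WHERE id = 2")
```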
OLAP - online analytical processing
Denormalized databases for analysis
Analytical databases
Data is integrated from different OLTP systems
Complex query processing may take hours. Full loads from OLTP systems in batches are also time-intensive
Plan, solve problems, support decisions, discover hidden insights.
Based on SELECT commands to aggregate data for reporting
Column-oriented (queries read a subset of columns)
Examples - sales, inventory
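An illustrative OLAP-style query: a SELECT that reads a subset of columns and aggregates across many rows for reporting (the sales table is an assumption).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                 [("EU", "a", 10.0), ("EU", "b", 5.0), ("US", "a", 7.5)])

# Analytical query: reads a subset of columns and aggregates across many rows.
for region, total in conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region"):
    print(region, total)   # EU 15.0 / US 7.5
```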
API
Application Programming Interface; it defines how a client and a server communicate. Common HTTP methods include GET, POST, PUT, and DELETE
Server sends back JSON
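A minimal client-side sketch using Python's standard library; the endpoint URL is a placeholder.

```python
import json
from urllib.request import urlopen

# GET request to a placeholder endpoint; the server responds with JSON.
with urlopen("https://api.example.com/users/1") as response:
    user = json.load(response)
print(user)
```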
Database
A very general term, loosely defined as organized data stored on and accessed from a computer
Data warehouse is a type of database
Data catalog for data lakes
What is the source of the data?
Where is the data used?
Who is the owner of the data?
How often is this data updated?
Good practice in terms of data governance
Ensures reproducibility
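One lightweight way to record these answers is a per-dataset metadata entry; the fields below mirror the questions above, and the dataset itself is made up.

```python
import json

# Illustrative catalog entry answering the questions above for one made-up dataset.
catalog_entry = {
    "dataset": "clickstream_raw",
    "source": "web app event logs",
    "used_in": ["marketing dashboard", "churn model"],
    "owner": "data platform team",
    "update_frequency": "hourly",
}
print(json.dumps(catalog_entry, indent=2))
```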
Batches
Group records at intervals
Streams
Send individual records right away
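A toy contrast in Python: grouping records into fixed-size batches versus handling each record as it arrives (the batch size and records are assumptions).

```python
def batched(records, size):
    """Group records into lists of `size` (the batch approach)."""
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch

print(list(batched(range(7), 3)))  # [[0, 1, 2], [3, 4, 5], [6]]

# Streaming approach: act on each record as soon as it arrives.
for record in range(3):
    print("processed", record)
```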
Cloud computing benefits
Rented
Don’t need physical space to hold them
Use just the resources we need when we need them
The closer to the user the better
Database reliability: data replication
Parallel computing benefits
Extra processing power
Reduced memory footprint
Solve larger problems in less time
Parallel computing disadvantages
The extra costs (i.e., increased execution time) come from data transfers, synchronization, communication, and thread creation/destruction. These costs can sometimes be quite large and may actually exceed the gains from parallelization.
Harder to debug
Clusters require better cooling technologies
Need several processing units
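A minimal multiprocessing sketch; for a task this small, the process start-up and data-transfer overhead described above can easily outweigh any speed-up.

```python
from multiprocessing import Pool

def square(n: int) -> int:
    return n * n

if __name__ == "__main__":
    # Work is split across 4 worker processes, then gathered back.
    with Pool(processes=4) as pool:
        print(pool.map(square, range(10)))
```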
Five V’s of Data
Volume (how much)
Variety (what kind)
Velocity (how frequent)
Veracity (how accurate)
Value (how useful)
Data Engineer Tasks
Gather data from different sources
Optimize databases for analysis
Remove corrupt data
An engineer who develops, constructs, tests, and maintains architectures such as databases and large-scale processing systems
Tools for data engineering
Databases - MySQL, PostgreSQL
Scheduling - Airflow, Oozie
Processing- Spark, Hive
Cloud computing disadvantages
Peak vs. quiet moments: hard to optimize
Big three of cloud computing
AWS, Azure, Google Cloud
Star Schema
One or more fact tables referencing any number of dimension tables
Facts: things that happened, e.g., product orders
Dimensions: information about the world, e.g., customer info
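A minimal star-schema sketch in SQLite: one fact table joined to one dimension table; all names and rows are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, country TEXT);
CREATE TABLE fact_orders  (order_id INTEGER PRIMARY KEY,
                           customer_id INTEGER,  -- points at the dimension
                           amount REAL);
INSERT INTO dim_customer VALUES (1, 'US'), (2, 'EU');
INSERT INTO fact_orders  VALUES (10, 1, 99.0), (11, 2, 25.0), (12, 1, 10.0);
""")

# Typical star-schema query: facts aggregated by a dimension attribute.
for country, total in conn.execute("""
        SELECT c.country, SUM(f.amount)
        FROM fact_orders f JOIN dim_customer c USING (customer_id)
        GROUP BY c.country"""):
    print(country, total)
```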
Some examples of when not to normalize
When using several aggregate functions, joins can get expensive, especially if you have a ton of tables
When rapid application development is more important than elegant design, e.g., during a prototyping phase
When using a NoSQL database or storing unstructured data