ETL Flashcards
Extract
Businesses store historical information or stream real-time data into many systems. This information is scattered across different software and structured in various formats. The extraction phase entails identifying the required data sources and gathering data from them.
Transform
When the data is gathered from its sources, it is usually placed in temporary storage called a staging area. While in the staging area, the data is formatted in accordance with defined standards and models. For example, financial figures in different formats ($34.50, 0.90 cents, 01,65) are converted into a single coherent format: $34.50, $0.90, $1.65 (see the sketch below).
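A minimal Python sketch of this kind of standardization; the `normalize_currency` helper and the three input formats are illustrative assumptions, not a general-purpose parser.

```python
def normalize_currency(raw: str) -> str:
    """Standardize mixed currency strings into "$X.XX" (handles only the three illustrative formats)."""
    raw = raw.strip().lower()
    if raw.endswith("cents"):                    # e.g. "0.90 cents"
        value = float(raw.replace("cents", "").strip())
    elif raw.startswith("$"):                    # e.g. "$34.50"
        value = float(raw[1:])
    else:                                        # e.g. "01,65" (comma as decimal separator)
        value = float(raw.replace(",", "."))
    return f"${value:.2f}"

print([normalize_currency(s) for s in ["$34.50", "0.90 cents", "01,65"]])
# ['$34.50', '$0.90', '$1.65']
```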
Load
The final stage of the ETL process is loading the structured and formatted data into a database. If the amount of data is small, any kind of database can be used. A specific type of database used in BI is called a data warehouse.
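A minimal sketch of the load step, using Python's built-in sqlite3 as a stand-in for a warehouse; the table name, columns, and rows are assumptions.

```python
import sqlite3

# Transformed records from the staging area (illustrative data).
rows = [("2024-01-05", 34.50), ("2024-01-06", 0.90)]

conn = sqlite3.connect("warehouse.db")  # stand-in for a real data warehouse
conn.execute("CREATE TABLE IF NOT EXISTS orders (order_date TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
conn.commit()
conn.close()
```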
Project Flow
Project Planning Phase
Project Design Phase
Data Modeling
Development Phase
Testing Phase
Production Management
Data warehousing
Specific data for specific use
Relatively small
Stores mainly structured data
More costly to update
Optimized for data analysis
Data Lake
Stores all raw data
Can be petabytes in size (1 petabyte = 1 million GB)
Stores all data structures
Cost-effective
Difficult to analyze (AI/ML works best)
Requires cataloging
Structured data
Easy to search and organize
Consistent model, rows, and columns
Defined types
Can be grouped to form relations
Stored in relational databases
About 20% of all data is structured
Created and queried using SQL
Semi structured data
Relatively easy to search and organize
Consistent model with a less rigid implementation: different observations can have different sizes
Different types
Can be grouped, but needs more work
Stored in NoSQL databases; common formats: JSON, XML, YAML
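For example, JSON records can follow one loose model while individual observations carry different fields; a small sketch using Python's standard json module (the records are made up).

```python
import json

# Two observations under one loose model, with different sizes (made-up records).
records = [
    '{"name": "Ada", "email": "ada@example.com"}',
    '{"name": "Grace", "email": "grace@example.com", "phones": ["555-0100", "555-0199"]}',
]

for raw in records:
    obj = json.loads(raw)
    # Optional fields are handled per record instead of being enforced by a schema.
    print(obj["name"], obj.get("phones", []))
```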
Unstructured data
Does not follow a model, can’t be contained in rows and columns
Difficult to search and organize
Usually text, sound, pictures, or videos
Usually stored in data lakes
Use AI to search and organize unstructured data
SQL Database
If your data is structured use SQL
SQL is a great fit for transaction-oriented systems such as customer relationship management tools, accounting software, and e-commerce platforms
Scales vertically: increase the capacity of a single server to scale the database
NoSQL database
Not Only SQL
Do not follow the relational model
Does not require data normalization
Schema Free
Scales horizontally: add more servers to power your growing database
Types: key-value based, column-oriented, graph-based, document-based
Normalized
Efficiently organizing data in a database
Minimize duplicate data
Minimize or avoid data modification issues
Store only related data in each table
Simplify queries
Primary key
Primary Key: uniquely identifies each row in a table. Only one primary key per table. No null or duplicate values
Foreign Key
A foreign key is a set of one or more columns in a table that refers to the primary key in another table. There is no special code, configuration, or table definition you need to put in place to officially “designate” a foreign key.
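A minimal SQLite sketch (table and column names are illustrative): the REFERENCES clause documents the relationship, though as noted above, the column works as a foreign key even without it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,  -- unique, non-null, one per table
    name        TEXT NOT NULL
);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    -- Foreign key: refers to the primary key of another table.
    -- (SQLite only enforces it when PRAGMA foreign_keys = ON.)
    customer_id INTEGER REFERENCES customers(customer_id)
);
""")
```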
OLTP - online transaction processing
Captures and maintains transaction data in a database.
Row-oriented (stored per record, added per transaction)
Emphasis on fast processing of transactional operations such as INSERT, UPDATE, and DELETE commands.
Works in real time.
The original source of data where business data is being recorded in real-time
Faster processing because queries are small and simple
Transactional database
Control and run essential business operations in real time.
Normalized databases for efficiency.
Examples: retail, banking
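An illustrative OLTP-style transaction in SQLite: a small, row-oriented write committed in real time (the accounts schema and amounts are assumptions).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100.0), (2, 50.0)])

# A typical short OLTP transaction: transfer $20 between two rows.
with conn:  # commits on success, rolls back on error
    conn.execute("UPDATE accounts SET balance = balance - 20 WHERE id = 1")
    conn.execute("UPDATE accounts SET balance = balance + 20 WHERE id = 2")
```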
OLAP - online analytical processing
Denormalized databases for analysis
Analytical databases
Data is integrated from different OLTP systems
Complex query processing may take hours. Full loads from OLTP systems in batches are also time-intensive
Plan, solve problems, support decisions, discover hidden insights.
Based on SELECT commands to aggregate data for reporting
Column-oriented (queries read a subset of columns)
Examples - sales, inventory
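An illustrative OLAP-style query: a SELECT that reads a subset of columns and aggregates across many rows for reporting (the sales table is an assumption).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                 [("EU", "a", 10.0), ("EU", "b", 5.0), ("US", "a", 7.5)])

# Analytical query: reads a subset of columns and aggregates across many rows.
for region, total in conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region"):
    print(region, total)   # EU 15.0 / US 7.5
```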
API
Application Programming Interface; it defines how a client and a server communicate. Common HTTP methods include GET, POST, PUT, and DELETE
Server sends back JSON
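A minimal client-side sketch using Python's standard library; the endpoint URL is a placeholder.

```python
import json
from urllib.request import urlopen

# GET request to a placeholder endpoint; the server responds with JSON.
with urlopen("https://api.example.com/users/1") as response:
    user = json.load(response)
print(user)
```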
Database
A very general term, loosely defined as organized data stored on and accessed from a computer
Data warehouse is a type of database
Data catalog for data lakes
What is the source of the data?
Where is the data used?
Who is the owner of the data?
How often is this data updated?
Good practice in terms of data governance
Ensures reproducibility
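One lightweight way to record these answers is a per-dataset metadata entry; the fields below mirror the questions above, and the dataset itself is made up.

```python
import json

# Illustrative catalog entry answering the questions above for one made-up dataset.
catalog_entry = {
    "dataset": "clickstream_raw",
    "source": "web app event logs",
    "used_in": ["marketing dashboard", "churn model"],
    "owner": "data platform team",
    "update_frequency": "hourly",
}
print(json.dumps(catalog_entry, indent=2))
```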
Batches
Group records at intervals
Streams
Send individual records right away
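A toy contrast in Python: grouping records into fixed-size batches versus handling each record as it arrives (the batch size and records are assumptions).

```python
def batched(records, size):
    """Group records into lists of `size` (the batch approach)."""
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch

print(list(batched(range(7), 3)))  # [[0, 1, 2], [3, 4, 5], [6]]

# Streaming approach: act on each record as soon as it arrives.
for record in range(3):
    print("processed", record)
```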
Cloud computing benefits
Rented
Don’t need physical space to hold them
Use just the resources we need when we need them
The closer to the user the better
Database reliability: data replication
Parallel computing benefits
Extra processing power
Reduced memory footprint
Solve larger problems in less time
Parallel computing disadvantages
The extra costs (i.e., increased execution time) come from data transfers, synchronization, communication, and thread creation/destruction. These costs can sometimes be quite large and may actually exceed the gains from parallelization.
Harder to debug
Clusters require better cooling technologies
Need several processing units
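A minimal multiprocessing sketch; for a task this small, the process start-up and data-transfer overhead described above can easily outweigh any speed-up.

```python
from multiprocessing import Pool

def square(n: int) -> int:
    return n * n

if __name__ == "__main__":
    # Work is split across 4 worker processes, then gathered back.
    with Pool(processes=4) as pool:
        print(pool.map(square, range(10)))
```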
Five V’s of Data
Volume (how much)
Variety (what kind)
Velocity (how frequent)
Veracity (how accurate)
Value (how useful)
Data Engineer Tasks
Gather data from different sources
Optimize databases for analysis
Remove corrupt data
An engineer who develops, constructs, tests, and maintains architectures such as databases and large-scale processing systems
Tools for data engineering
Databases - MySQL, PostgreSQL
Scheduling - Airflow, Oozie
Processing- Spark, Hive
Cloud computing disadvantages
Peak vs. quiet moments: hard to optimize
Big three of cloud computing
AWS, Azure, Google Cloud
Star Schema
One or more fact tables referencing any number of dimension tables
Facts: things that happened, e.g., product orders
Dimensions: information about the world, e.g., customer info
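A minimal star-schema sketch in SQLite: one fact table joined to one dimension table; all names and rows are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, country TEXT);
CREATE TABLE fact_orders  (order_id INTEGER PRIMARY KEY,
                           customer_id INTEGER,  -- points at the dimension
                           amount REAL);
INSERT INTO dim_customer VALUES (1, 'US'), (2, 'EU');
INSERT INTO fact_orders  VALUES (10, 1, 99.0), (11, 2, 25.0), (12, 1, 10.0);
""")

# Typical star-schema query: facts aggregated by a dimension attribute.
for country, total in conn.execute("""
        SELECT c.country, SUM(f.amount)
        FROM fact_orders f JOIN dim_customer c USING (customer_id)
        GROUP BY c.country"""):
    print(country, total)
```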
Some examples of when not to normalize
When using several aggregate functions, joins can get expensive, especially if you have a ton of tables
When rapid application development is more important than elegant design, e.g., during a prototyping phase
When using a NoSQL database or storing unstructured data