ETL Flashcards

1
Q

Extract

A

Businesses store historical information or stream real-time data into many systems. This information is scattered across different software and is structured in various formats. The extraction phase entails defining required data sources and gathering data from them

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

Transform

A

When the data is gathered from it’s sources, it’s usually placed in a temporary storage called a staging area. While placed in this area the data is formatted in accordance with defined standards and models. For example, financial numerics of different formats $34.50, 0.90 cents, 01,65 will be changed into a single coherent format: $34.50, $0.90, $1.65.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

Load

A

Final stage of ETL process is loading the structured and formatted data into a database. if the amount of data is small any kind of database can be used. A specific type of database used in BI, is called a Data warehouse

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

Project Flow

A

Project Planning Phase
Project Design Phase

Data Modeling

Development Phase
Testing Phase
Production management

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

Data warehousing

A

Specific data for specific use
relatively small

stores mainly structured data
more costly to update
optimized for data analysis

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

Data Lake

A

Stores all raw data
Can be petabytes (1 mill GB)

Stores all data structures
Cost-effective
Difficult to analyze(AI/ML is best)
requires cataloging

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

Structured data

A

Easy to search and organize
Consistent model, rows, and columns

Defined types
Can be grouped to form relations
Stored in relational databases
About 20% of the data is structured
Created and queried using SQL
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

Semi structured data

A
Relatively easy to search and organize
Consistent model, less rigid implementation: different observations have different sizes
Different types
Can be grouped but needs more work
NoSQL databases: JSON, XML, YAML
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

Unstructured data

A

Does not follow a model, can’t be contained in rows and columns
difficult to search and organize
usually text, sound, pictures, or videos
Usually stored in data lakes
Use AI to search and organize unstructured data

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

SQL Database

A

If your data is structured use SQL
SQL is a great fit for transaction oriented systems such as customer relationship management tools, accounting software, and e-commerce platform
scale vertically - increase the capacity of a single server to scale database

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

NoSQL database

A

Not Only SQL
Do not follow the relational model
Does not require data normalization
Schema Free
scale horizontally- add more servers to power your growing database
types - key-value pair based, column-oriented graph, graph-based, document based

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

Normalized

A

Efficiently organizing data in a database
Minimize duplicate data
minimize or avoid data modification issues
only storing related data in a table
simplify queries

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

Primary key

A

Primary Key: uniquely identifies each row in a table. Only one primary key per table. No null or duplicate values

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

Foreign Key

A

A foreign key is a set of one or more columns in a table that refers to the primary key in another table. There aren’t any special code, configurations, or table definitions you need to place to officially “designate” a foreign key.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

OLTP- online transaction processing

A

captures and maintain transaction data in a database.
Row Oriented (stored per recored added per transaction)
emphasis on fast processing transactional processes like Insert Update and delete commands.
Works in real time.
The original source of data where business data is being recorded in real-time
Faster processing because queries are small and simple
Transactional database
Control and run essential business operations in real time.
Normalized databases for efficiency.
Example = retail, banking

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

OLAP-online analytical processing

A

Denormalized databases for analysis
Analytical databases
Data is integrated from different OLTP systems
Complex query processing may take hours. Full-load from OLTP in batches is also time-intensive
Plan, solve problems, support decisions, discover hidden insights.
Based on select commands to aggregate data for reporting
column oriented (queries about subset of columns)
Examples - sales, inventory

17
Q

API

A

Application Programming Interface, stands for the communication between a server and a client. Common functionalities include GET, POST, UPDATE, DELETE

Server sends back JSON

18
Q

Database

A

Very general term that loosely defined as organized data stored and accessed on a computer
Data warehouse is a type of database

19
Q

Data catalog for data lakes

A

What is the source of the data?
Where is the data used?

who is the owner of the data?
How often is this data updated?
Good practice in terms of data governance
Ensures reproducibility

20
Q

Batches

A

Group records at intervals

21
Q

Streams

A

Send individual records right away

22
Q

Cloud computing Benefits

A

Rented
Don’t need space to hold to hold them

Use just the resources we need when we need them
The closer to the user the better
database reliability : data replication

23
Q

Parallel computing benefits

A

Extra processing power
Reduced memory footprint
Solve Larger Problems in a short point of time.

24
Q

Parallel computing disadvantages

A

The extra cost (i.e. increased execution time) incurred are due to data transfers, synchronization, communication, thread creation/destruction, etc. These costs can sometimes be quite large, and may actually exceed the gains due to parallelization.

Harder to debug

Better cooling technologies are required in case of clusters.

Need several processing units

25
Q

Five V’s of Data

A
Volume (how much)
Variety (what kind)
Velocity (how frequent)
Veracity (how accurate)
Value (how useful)
26
Q

Data Engineer Tasks

A

Gather data from different sources
Optimize database for analyses
Remove corrupt data

An engineer that develops, constructs, tests, and maintains architectures such as databases and large-scale processing systems

27
Q

Tools for data engineering

A

Databases - MySQL, PostGres
Scheduling - Airflow, Oozie
Processing- Spark, Hive

28
Q

Cloud computing disadvantages

A

Peak vs. quite moments: hard to optimize

29
Q

Big three of could computing

A

AWS, AZURE, Google

30
Q

Star Schema

A

on or more fact tables referencing any number of dimension table

Facts: things that happened eg Product orders
Dimensions: information on the world eg: Customer info

31
Q

some examples of when not to normalize

A

When using several aggregate functions joins can get expensive especially if you have a TON of tables

When rapid application development is more important than elegant design ex) prototyping phase

if using a NoSQL database or storing unstructured data