Data Warehousing & ETL Flashcards

1
Q

Data Warehouse

Definition & Usage/Benefits

A

Large warehouse of data that is built on top of a database or multiple databases, RDMS or MDBMS (database is the platform and warehouse is the usage). Data is recorded in the operational systems (database transactions) and is then copied to the warehouse.

Stores data in order to make data-driven decisions and houses needed data in a single location. Allows organizations to focus on analysis without needing to constantly integrate and transform the data.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

Data Warehouse Rules

5 & their defintions

A
  1. Integrated environment: data comes from a number of different source systems are that send data to the warehouse.
  2. Subject Oriented: reorganize the data by subject area.
  3. Time variant: historical and current data.
  4. Non-volatile: refresh data in batches, data warehouse stays as-is in between refreshes. Allows for strategic planning in-between refreshes.
  5. Supports Data Driven Decision Making
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

Data Lake

definition & the 3 properties of big data w/ definitions

A

Data Lake is built for big data environments and does not need to sit on top of a relational database, unlike a warehouse.

3 V’s (volume, velocity, variety).

Volume: larger amount of data than a warehouse.

Velocity: processing these larger data volumes faster than traditional methods.

Variety: can hold structured, unstructured, and semi-structured data (holds the data as-is), while a warehouse usually only handles structured data.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

Data Virtualization

definitions & when it is utilized - 3 components

A

An approach to data management that provides real time access to the source system of the data (not copying the data to a warehouse).

Used in niche use cases: 1. simple transformations, 2. smaller number of data sources, 3. not as concerned about response time of queries and reports.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

Operational Data Store (ODS)

Definition
(When is it used?)
(How is it used as an intermediary?)

A

A design/architecture approach focused on current operational data/decision making. Often used for real-time feeds (current information, right now).
When ODS is an intermediary between the warehouse and the data source it is used as the staging layer.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

Staging Layer

Definition, Two Types & Definitions

A

A Layer within warehouse prior to transformations. The “E” in ETL.

Non-Persistent: data is erased from the staging layer, only new data sits in the staging layer, but historical data sits in the transformation layer.

Persistent: historical data is not erased in the staging layer.

(refer to notes if more information is needed advantages/disadvantages)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

Different Warehouse Designs/Architectures

4 different types & defintions

A

Centralized : everything in one single database (multiple tables).

Component-Based Data Warehousing: a combination of multiple databases and data marts.

Data Mart Only: more narrowly focused data warehouse.

Cubes: type of multidimensional data management system (MDBMS). Usually have RDMS + MDBMS Specialized dimensional database, best for smaller scale DW and DM

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

Centralized Warehouse Designs/Architecture

benefits, requirements (not prior to implementation) & drawbacks

A

Benefits: utilizing modern tech is more likely to succeed, more streamlined implementation and architectural simplicity.

Requirements: high cross-org cooperation, high data governance.

Risk: changes can ripple throughout the warehouse, causing unforseen consequences.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

Component-Based Warehouse Designs/Architecture

4 benefits; 2 drawbacks

A

Benefits: allows for decomposition, mix and match technology, can overcome org challenges, and bolt together components

Risk: bolting together different components can be difficult to implement - inconsistent components, cross integration can be difficult.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

Two Types of Component-Based Warehouses

A

Architected: Data warehouse & data marts or data marts only.

Non-Architected: utilized when organization disagrees about the business rules, isolated, use only as a last resort.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

Two Types of Data Marts Warehouses

3 characteristics for both

A

Dependent data marts: pulls data from the warehouse, data is mostly uniform across different data marts, architecturally more straight forward.

Independent data mart - pulls in data directly from source applications, little uniformity across marts, architecture is more spaghetti in nature. Usually less data sources that a data warehouse.

ex. SAP implementation where you have a application that has its own data mart, some (not all) data is sent from the application to the data warehouse (reverse order of data transfer - instead or from warehouse to mart).

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

ETL v ELT

what is required prior to implementing ETL?

A

ETL (Extract, Transform, Load): used in traditional data warehousing to make data uniform across data sources. Requires business analysis and data modeling prior to implementation.

ELT (Extract, Load, Transform): blast data into a big data environment and then transform at some later time.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

Incremental ETL (Definition)

The 4 Patterns & Defintions
(What is Initial ETL?)

A

Incrementally refresh the data when new/modified data becomes available with special handlings for deleting data (removing customer, keep in the historical record).

Patterns:

  1. append: add onto old information
  2. in-place update: making changes to portions of existing data
  3. complete replacement: all data overwritten
  4. rolling append: maintain only a certain duration of historical data, wipe oldest data that is no longer within the time window

Initial ETL is a one-time loading of the data.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

Common Data Transformation Models

The 6 & Their Definitions

A
  1. data value unification: common taxonomy for similar fields (ex. similar grade lettering, United States [A, B, C…] v. United Kingdom [9, 8, 7…]
  2. data type and size unification: CHAR (50) across all data source’s full name fields
  3. de-duplication: removing duplicates
  4. dropping columns (vertical slicing): removing unnecessary data attributes
  5. value-based row filtering (horizontal slicing): removing unnecessary data fields based on predefined business rules, maybe aggregations as well
  6. correcting known errors
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

In the context of data warehousing, what are facts & dimensions?

Where are they stored?

A

Facts: measurements (numeric and quantifiable). Stored in fact tables.

Dimensions: context for measurements. Stored in dimension tables.

Ex. A Student’s GPA is 3.8.
Facts: 3.8
Dimensions: Student’s Name, GPA

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

Additive, Non-Additive, Semi-Additive Facts

What are these

A

Additive: facts that can be added together under all circumstances (overhead cost - total employee’s salaries).

Non-Additive: facts that cannot be summed together (all student’s combined GPA, summed up averages, percentages, ratios, margins - metrics created by division)

Semi-Additive: sometimes these metrics can be added together.

17
Q

How to handle non-additive facts?

3 ways

A
  1. store underlying measurements in a fact table instead of the ratios or percentages
  2. store non-additive facts in individual rows for easy access
  3. calculate aggregate averages, ratio, percentages (sum all values together, then calculate)
18
Q

Star Schema

3 principals; 4 characteristics

A

Principals:
1. all dimensions of a given hierarchy are in one dimension table

  1. only one level from the fact table
  2. visually represents a star

Characteristics:
1. generally fewer database joins

  1. primary-foreign key relationship are more straightforward
  2. typically more database storage
  3. generally “de-normalized” (in this context just means potentially repetition in dimension tables)
19
Q

Snowflake Schema

3 principals; 4 characteristics

A

Principals:
1. each dimension level in a hierarchy has its own dimension table

  1. one or more levels from the fact table
  2. visually represents a snowflake

Characteristics:
1. generally more database joins

  1. primary-foreign key relationship are less straightforward
  2. typically less database storage
  3. generally “normalized” (in this context just means less repetition in dimension tables)
20
Q

Natural Keys

definition and guidance on use - 3 principals

A

Definition: a type of unique key that exist in the external world and travels from source system to the warehouse, also referred to as a busines key

Ex. purchase order number, flight number, login name

Guidance:

  • generally not good for foreign and primary keys
  • keep them in the dimension tables
  • remove from fact tables
21
Q

Relationship between Fact and Dimension Tables

A
  1. Primary Keys are housed in the dimension tables.
  2. Foreign keys are housed in the fact tables. The primary key of the fact table is the combination of all foreign keys relating back to the dimension tables.

Fact tables can be linked other fact tables through dimension tables.

22
Q

Fact Differences between Star & Snowflake Schemas

A

Star Schemas: dimension hierarchies are maintained in a signle table for each hierarchy (ex. school, department, class in a single table, but date would be in a different table). Primary and foreign key relationships are connected at the lowest level in the hierarchy.

Snowflake Schemas: dimension hierarchies are broken into individual tables (school, department, class in different dimension tables). The fact table is related to the lowest level dimension and that lowest dimension then relates to all other dimension in the hierarchy.

23
Q

Four Types of Fact Tables & Definitions

A

Transaction_Grained: records business transactions.

Periodic Snapshot: measurements and regular intervals, either aggregates of transactions (ex. payments for a given week) or levels not related to a transaction (ex. inventory of a given week).

Accumulating Snapshot: progress of a business transaction through defined stages (how many days for a financial transaction to process).

Factless: record the occurence of a transaction without anything significant to measure (register for a class) or recording or a relationship among parties (student/advisor).

24
Q

Slowly Changing Dimensions (SCD)

Defintion & Common Usage & Considerations

A

type 1: overwrite old data - inplace (single cell) update

usage: error corrections
considerations: track history of data changes (audit)

type 2: unlimited history - data updates are reflected as a new row of data, maintain old record as a row

usage: report on current changes and old records (ex. change in location of student)
considerations: best to handle reporting through multi-step queries, identify current version of the data with active flag (Y,N) and/or effective date (when an event took affect)

type 3: limited history - add new column to reflect changes in data

usage: reorganization with a company (sales division switch from north/south to east/central/west)
considerations: query would be extremely complext so build into table architecture

25
Q

Best Practices in ETL Design (3)

A
  1. Limit the amount of incoming data
  2. Process dimension before fact tables (primary keys sit in the dimension table)
  3. Look for parellel processing opportunities (jobs running in parellel)
26
Q

Incremental ETL (5 General Gteps)

A
  1. data prep
  2. transformations
  3. process new dimension rows
  4. process type one (SCDs)
  5. process type two (SCDs)