Data Warehousing & ETL Flashcards by Jonathan Woolridge

Data Warehouse

Definition & Usage/Benefits

Large warehouse of data that is built on top of a database or multiple databases, RDMS or MDBMS (database is the platform and warehouse is the usage). Data is recorded in the operational systems (database transactions) and is then copied to the warehouse.

Stores data in order to make data-driven decisions and houses needed data in a single location. Allows organizations to focus on analysis without needing to constantly integrate and transform the data.

How well did you know this?

Not at all

Perfectly

Data Warehouse Rules

5 & their defintions

Integrated environment: data comes from a number of different source systems are that send data to the warehouse.
Subject Oriented: reorganize the data by subject area.
Time variant: historical and current data.
Non-volatile: refresh data in batches, data warehouse stays as-is in between refreshes. Allows for strategic planning in-between refreshes.
Supports Data Driven Decision Making

How well did you know this?

Not at all

Perfectly

Data Lake

definition & the 3 properties of big data w/ definitions

Data Lake is built for big data environments and does not need to sit on top of a relational database, unlike a warehouse.

3 V’s (volume, velocity, variety).

Volume: larger amount of data than a warehouse.

Velocity: processing these larger data volumes faster than traditional methods.

Variety: can hold structured, unstructured, and semi-structured data (holds the data as-is), while a warehouse usually only handles structured data.

How well did you know this?

Not at all

Perfectly

Data Virtualization

definitions & when it is utilized - 3 components

An approach to data management that provides real time access to the source system of the data (not copying the data to a warehouse).

Used in niche use cases: 1. simple transformations, 2. smaller number of data sources, 3. not as concerned about response time of queries and reports.

How well did you know this?

Not at all

Perfectly

Operational Data Store (ODS)

Definition
(When is it used?)
(How is it used as an intermediary?)

A design/architecture approach focused on current operational data/decision making. Often used for real-time feeds (current information, right now).
When ODS is an intermediary between the warehouse and the data source it is used as the staging layer.

How well did you know this?

Not at all

Perfectly

Staging Layer

Definition, Two Types & Definitions

A Layer within warehouse prior to transformations. The “E” in ETL.

Non-Persistent: data is erased from the staging layer, only new data sits in the staging layer, but historical data sits in the transformation layer.

Persistent: historical data is not erased in the staging layer.

(refer to notes if more information is needed advantages/disadvantages)

How well did you know this?

Not at all

Perfectly

Different Warehouse Designs/Architectures

4 different types & defintions

Centralized : everything in one single database (multiple tables).

Component-Based Data Warehousing: a combination of multiple databases and data marts.

Data Mart Only: more narrowly focused data warehouse.

Cubes: type of multidimensional data management system (MDBMS). Usually have RDMS + MDBMS Specialized dimensional database, best for smaller scale DW and DM

How well did you know this?

Not at all

Perfectly

Centralized Warehouse Designs/Architecture

benefits, requirements (not prior to implementation) & drawbacks

Benefits: utilizing modern tech is more likely to succeed, more streamlined implementation and architectural simplicity.

Requirements: high cross-org cooperation, high data governance.

Risk: changes can ripple throughout the warehouse, causing unforseen consequences.

How well did you know this?

Not at all

Perfectly

Component-Based Warehouse Designs/Architecture

4 benefits; 2 drawbacks

Benefits: allows for decomposition, mix and match technology, can overcome org challenges, and bolt together components

Risk: bolting together different components can be difficult to implement - inconsistent components, cross integration can be difficult.

How well did you know this?

Not at all

Perfectly

Two Types of Component-Based Warehouses

Architected: Data warehouse & data marts or data marts only.

Non-Architected: utilized when organization disagrees about the business rules, isolated, use only as a last resort.

How well did you know this?

Not at all

Perfectly

Two Types of Data Marts Warehouses

3 characteristics for both

Dependent data marts: pulls data from the warehouse, data is mostly uniform across different data marts, architecturally more straight forward.

Independent data mart - pulls in data directly from source applications, little uniformity across marts, architecture is more spaghetti in nature. Usually less data sources that a data warehouse.

ex. SAP implementation where you have a application that has its own data mart, some (not all) data is sent from the application to the data warehouse (reverse order of data transfer - instead or from warehouse to mart).

How well did you know this?

Not at all

Perfectly

ETL v ELT

what is required prior to implementing ETL?

ETL (Extract, Transform, Load): used in traditional data warehousing to make data uniform across data sources. Requires business analysis and data modeling prior to implementation.

ELT (Extract, Load, Transform): blast data into a big data environment and then transform at some later time.

How well did you know this?

Not at all

Perfectly

Incremental ETL (Definition)

The 4 Patterns & Defintions
(What is Initial ETL?)

Incrementally refresh the data when new/modified data becomes available with special handlings for deleting data (removing customer, keep in the historical record).

Patterns:

append: add onto old information
in-place update: making changes to portions of existing data
complete replacement: all data overwritten
rolling append: maintain only a certain duration of historical data, wipe oldest data that is no longer within the time window

Initial ETL is a one-time loading of the data.

How well did you know this?

Not at all

Perfectly

Common Data Transformation Models

The 6 & Their Definitions

data value unification: common taxonomy for similar fields (ex. similar grade lettering, United States [A, B, C…] v. United Kingdom [9, 8, 7…]
data type and size unification: CHAR (50) across all data source’s full name fields
de-duplication: removing duplicates
dropping columns (vertical slicing): removing unnecessary data attributes
value-based row filtering (horizontal slicing): removing unnecessary data fields based on predefined business rules, maybe aggregations as well
correcting known errors

How well did you know this?

Not at all

Perfectly

In the context of data warehousing, what are facts & dimensions?

Where are they stored?

Facts: measurements (numeric and quantifiable). Stored in fact tables.

Dimensions: context for measurements. Stored in dimension tables.

Ex. A Student’s GPA is 3.8.
Facts: 3.8
Dimensions: Student’s Name, GPA

How well did you know this?

Not at all

Perfectly

Additive, Non-Additive, Semi-Additive Facts

What are these

Additive: facts that can be added together under all circumstances (overhead cost - total employee’s salaries).

Non-Additive: facts that cannot be summed together (all student’s combined GPA, summed up averages, percentages, ratios, margins - metrics created by division)

Semi-Additive: sometimes these metrics can be added together.

How to handle non-additive facts?

3 ways

store underlying measurements in a fact table instead of the ratios or percentages
store non-additive facts in individual rows for easy access
calculate aggregate averages, ratio, percentages (sum all values together, then calculate)

Star Schema

3 principals; 4 characteristics

Principals:
1. all dimensions of a given hierarchy are in one dimension table

only one level from the fact table
visually represents a star

Characteristics:
1. generally fewer database joins

primary-foreign key relationship are more straightforward
typically more database storage
generally “de-normalized” (in this context just means potentially repetition in dimension tables)

Snowflake Schema

3 principals; 4 characteristics

Principals:
1. each dimension level in a hierarchy has its own dimension table

one or more levels from the fact table
visually represents a snowflake

Characteristics:
1. generally more database joins

primary-foreign key relationship are less straightforward
typically less database storage
generally “normalized” (in this context just means less repetition in dimension tables)

Natural Keys

definition and guidance on use - 3 principals

Definition: a type of unique key that exist in the external world and travels from source system to the warehouse, also referred to as a busines key

Ex. purchase order number, flight number, login name

Guidance:

generally not good for foreign and primary keys
keep them in the dimension tables
remove from fact tables

Relationship between Fact and Dimension Tables

Primary Keys are housed in the dimension tables.
Foreign keys are housed in the fact tables. The primary key of the fact table is the combination of all foreign keys relating back to the dimension tables.

Fact tables can be linked other fact tables through dimension tables.

Fact Differences between Star & Snowflake Schemas

Star Schemas: dimension hierarchies are maintained in a signle table for each hierarchy (ex. school, department, class in a single table, but date would be in a different table). Primary and foreign key relationships are connected at the lowest level in the hierarchy.

Snowflake Schemas: dimension hierarchies are broken into individual tables (school, department, class in different dimension tables). The fact table is related to the lowest level dimension and that lowest dimension then relates to all other dimension in the hierarchy.

Four Types of Fact Tables & Definitions

Transaction_Grained: records business transactions.

Periodic Snapshot: measurements and regular intervals, either aggregates of transactions (ex. payments for a given week) or levels not related to a transaction (ex. inventory of a given week).

Accumulating Snapshot: progress of a business transaction through defined stages (how many days for a financial transaction to process).

Factless: record the occurence of a transaction without anything significant to measure (register for a class) or recording or a relationship among parties (student/advisor).

Slowly Changing Dimensions (SCD)

Defintion & Common Usage & Considerations

type 1: overwrite old data - inplace (single cell) update

usage: error corrections
considerations: track history of data changes (audit)

type 2: unlimited history - data updates are reflected as a new row of data, maintain old record as a row

usage: report on current changes and old records (ex. change in location of student)
considerations: best to handle reporting through multi-step queries, identify current version of the data with active flag (Y,N) and/or effective date (when an event took affect)

type 3: limited history - add new column to reflect changes in data

usage: reorganization with a company (sales division switch from north/south to east/central/west)
considerations: query would be extremely complext so build into table architecture

Best Practices in ETL Design (3)

1. Limit the amount of incoming data 2. Process dimension before fact tables (primary keys sit in the dimension table) 3. Look for parellel processing opportunities (jobs running in parellel)

Incremental ETL (5 General Gteps)

1. data prep 2. transformations 3. process new dimension rows 4. process type one (SCDs) 5. process type two (SCDs)