Data Warehousing & ETL Flashcards
Data Warehouse
Definition & Usage/Benefits
Large warehouse of data that is built on top of a database or multiple databases, RDMS or MDBMS (database is the platform and warehouse is the usage). Data is recorded in the operational systems (database transactions) and is then copied to the warehouse.
Stores data in order to make data-driven decisions and houses needed data in a single location. Allows organizations to focus on analysis without needing to constantly integrate and transform the data.
Data Warehouse Rules
5 & their defintions
- Integrated environment: data comes from a number of different source systems are that send data to the warehouse.
- Subject Oriented: reorganize the data by subject area.
- Time variant: historical and current data.
- Non-volatile: refresh data in batches, data warehouse stays as-is in between refreshes. Allows for strategic planning in-between refreshes.
- Supports Data Driven Decision Making
Data Lake
definition & the 3 properties of big data w/ definitions
Data Lake is built for big data environments and does not need to sit on top of a relational database, unlike a warehouse.
3 V’s (volume, velocity, variety).
Volume: larger amount of data than a warehouse.
Velocity: processing these larger data volumes faster than traditional methods.
Variety: can hold structured, unstructured, and semi-structured data (holds the data as-is), while a warehouse usually only handles structured data.
Data Virtualization
definitions & when it is utilized - 3 components
An approach to data management that provides real time access to the source system of the data (not copying the data to a warehouse).
Used in niche use cases: 1. simple transformations, 2. smaller number of data sources, 3. not as concerned about response time of queries and reports.
Operational Data Store (ODS)
Definition
(When is it used?)
(How is it used as an intermediary?)
A design/architecture approach focused on current operational data/decision making. Often used for real-time feeds (current information, right now).
When ODS is an intermediary between the warehouse and the data source it is used as the staging layer.
Staging Layer
Definition, Two Types & Definitions
A Layer within warehouse prior to transformations. The “E” in ETL.
Non-Persistent: data is erased from the staging layer, only new data sits in the staging layer, but historical data sits in the transformation layer.
Persistent: historical data is not erased in the staging layer.
(refer to notes if more information is needed advantages/disadvantages)
Different Warehouse Designs/Architectures
4 different types & defintions
Centralized : everything in one single database (multiple tables).
Component-Based Data Warehousing: a combination of multiple databases and data marts.
Data Mart Only: more narrowly focused data warehouse.
Cubes: type of multidimensional data management system (MDBMS). Usually have RDMS + MDBMS Specialized dimensional database, best for smaller scale DW and DM
Centralized Warehouse Designs/Architecture
benefits, requirements (not prior to implementation) & drawbacks
Benefits: utilizing modern tech is more likely to succeed, more streamlined implementation and architectural simplicity.
Requirements: high cross-org cooperation, high data governance.
Risk: changes can ripple throughout the warehouse, causing unforseen consequences.
Component-Based Warehouse Designs/Architecture
4 benefits; 2 drawbacks
Benefits: allows for decomposition, mix and match technology, can overcome org challenges, and bolt together components
Risk: bolting together different components can be difficult to implement - inconsistent components, cross integration can be difficult.
Two Types of Component-Based Warehouses
Architected: Data warehouse & data marts or data marts only.
Non-Architected: utilized when organization disagrees about the business rules, isolated, use only as a last resort.
Two Types of Data Marts Warehouses
3 characteristics for both
Dependent data marts: pulls data from the warehouse, data is mostly uniform across different data marts, architecturally more straight forward.
Independent data mart - pulls in data directly from source applications, little uniformity across marts, architecture is more spaghetti in nature. Usually less data sources that a data warehouse.
ex. SAP implementation where you have a application that has its own data mart, some (not all) data is sent from the application to the data warehouse (reverse order of data transfer - instead or from warehouse to mart).
ETL v ELT
what is required prior to implementing ETL?
ETL (Extract, Transform, Load): used in traditional data warehousing to make data uniform across data sources. Requires business analysis and data modeling prior to implementation.
ELT (Extract, Load, Transform): blast data into a big data environment and then transform at some later time.
Incremental ETL (Definition)
The 4 Patterns & Defintions
(What is Initial ETL?)
Incrementally refresh the data when new/modified data becomes available with special handlings for deleting data (removing customer, keep in the historical record).
Patterns:
- append: add onto old information
- in-place update: making changes to portions of existing data
- complete replacement: all data overwritten
- rolling append: maintain only a certain duration of historical data, wipe oldest data that is no longer within the time window
Initial ETL is a one-time loading of the data.
Common Data Transformation Models
The 6 & Their Definitions
- data value unification: common taxonomy for similar fields (ex. similar grade lettering, United States [A, B, C…] v. United Kingdom [9, 8, 7…]
- data type and size unification: CHAR (50) across all data source’s full name fields
- de-duplication: removing duplicates
- dropping columns (vertical slicing): removing unnecessary data attributes
- value-based row filtering (horizontal slicing): removing unnecessary data fields based on predefined business rules, maybe aggregations as well
- correcting known errors
In the context of data warehousing, what are facts & dimensions?
Where are they stored?
Facts: measurements (numeric and quantifiable). Stored in fact tables.
Dimensions: context for measurements. Stored in dimension tables.
Ex. A Student’s GPA is 3.8.
Facts: 3.8
Dimensions: Student’s Name, GPA
Additive, Non-Additive, Semi-Additive Facts
What are these
Additive: facts that can be added together under all circumstances (overhead cost - total employee’s salaries).
Non-Additive: facts that cannot be summed together (all student’s combined GPA, summed up averages, percentages, ratios, margins - metrics created by division)
Semi-Additive: sometimes these metrics can be added together.
How to handle non-additive facts?
3 ways
- store underlying measurements in a fact table instead of the ratios or percentages
- store non-additive facts in individual rows for easy access
- calculate aggregate averages, ratio, percentages (sum all values together, then calculate)
Star Schema
3 principals; 4 characteristics
Principals:
1. all dimensions of a given hierarchy are in one dimension table
- only one level from the fact table
- visually represents a star
Characteristics:
1. generally fewer database joins
- primary-foreign key relationship are more straightforward
- typically more database storage
- generally “de-normalized” (in this context just means potentially repetition in dimension tables)
Snowflake Schema
3 principals; 4 characteristics
Principals:
1. each dimension level in a hierarchy has its own dimension table
- one or more levels from the fact table
- visually represents a snowflake
Characteristics:
1. generally more database joins
- primary-foreign key relationship are less straightforward
- typically less database storage
- generally “normalized” (in this context just means less repetition in dimension tables)
Natural Keys
definition and guidance on use - 3 principals
Definition: a type of unique key that exist in the external world and travels from source system to the warehouse, also referred to as a busines key
Ex. purchase order number, flight number, login name
Guidance:
- generally not good for foreign and primary keys
- keep them in the dimension tables
- remove from fact tables
Relationship between Fact and Dimension Tables
- Primary Keys are housed in the dimension tables.
- Foreign keys are housed in the fact tables. The primary key of the fact table is the combination of all foreign keys relating back to the dimension tables.
Fact tables can be linked other fact tables through dimension tables.
Fact Differences between Star & Snowflake Schemas
Star Schemas: dimension hierarchies are maintained in a signle table for each hierarchy (ex. school, department, class in a single table, but date would be in a different table). Primary and foreign key relationships are connected at the lowest level in the hierarchy.
Snowflake Schemas: dimension hierarchies are broken into individual tables (school, department, class in different dimension tables). The fact table is related to the lowest level dimension and that lowest dimension then relates to all other dimension in the hierarchy.
Four Types of Fact Tables & Definitions
Transaction_Grained: records business transactions.
Periodic Snapshot: measurements and regular intervals, either aggregates of transactions (ex. payments for a given week) or levels not related to a transaction (ex. inventory of a given week).
Accumulating Snapshot: progress of a business transaction through defined stages (how many days for a financial transaction to process).
Factless: record the occurence of a transaction without anything significant to measure (register for a class) or recording or a relationship among parties (student/advisor).
Slowly Changing Dimensions (SCD)
Defintion & Common Usage & Considerations
type 1: overwrite old data - inplace (single cell) update
usage: error corrections
considerations: track history of data changes (audit)
type 2: unlimited history - data updates are reflected as a new row of data, maintain old record as a row
usage: report on current changes and old records (ex. change in location of student)
considerations: best to handle reporting through multi-step queries, identify current version of the data with active flag (Y,N) and/or effective date (when an event took affect)
type 3: limited history - add new column to reflect changes in data
usage: reorganization with a company (sales division switch from north/south to east/central/west)
considerations: query would be extremely complext so build into table architecture