Data Warehousing & ETL Flashcards
Data Warehouse
Definition & Usage/Benefits
Large warehouse of data that is built on top of a database or multiple databases, RDMS or MDBMS (database is the platform and warehouse is the usage). Data is recorded in the operational systems (database transactions) and is then copied to the warehouse.
Stores data in order to make data-driven decisions and houses needed data in a single location. Allows organizations to focus on analysis without needing to constantly integrate and transform the data.
Data Warehouse Rules
5 & their defintions
- Integrated environment: data comes from a number of different source systems are that send data to the warehouse.
- Subject Oriented: reorganize the data by subject area.
- Time variant: historical and current data.
- Non-volatile: refresh data in batches, data warehouse stays as-is in between refreshes. Allows for strategic planning in-between refreshes.
- Supports Data Driven Decision Making
Data Lake
definition & the 3 properties of big data w/ definitions
Data Lake is built for big data environments and does not need to sit on top of a relational database, unlike a warehouse.
3 V’s (volume, velocity, variety).
Volume: larger amount of data than a warehouse.
Velocity: processing these larger data volumes faster than traditional methods.
Variety: can hold structured, unstructured, and semi-structured data (holds the data as-is), while a warehouse usually only handles structured data.
Data Virtualization
definitions & when it is utilized - 3 components
An approach to data management that provides real time access to the source system of the data (not copying the data to a warehouse).
Used in niche use cases: 1. simple transformations, 2. smaller number of data sources, 3. not as concerned about response time of queries and reports.
Operational Data Store (ODS)
Definition
(When is it used?)
(How is it used as an intermediary?)
A design/architecture approach focused on current operational data/decision making. Often used for real-time feeds (current information, right now).
When ODS is an intermediary between the warehouse and the data source it is used as the staging layer.
Staging Layer
Definition, Two Types & Definitions
A Layer within warehouse prior to transformations. The “E” in ETL.
Non-Persistent: data is erased from the staging layer, only new data sits in the staging layer, but historical data sits in the transformation layer.
Persistent: historical data is not erased in the staging layer.
(refer to notes if more information is needed advantages/disadvantages)
Different Warehouse Designs/Architectures
4 different types & defintions
Centralized : everything in one single database (multiple tables).
Component-Based Data Warehousing: a combination of multiple databases and data marts.
Data Mart Only: more narrowly focused data warehouse.
Cubes: type of multidimensional data management system (MDBMS). Usually have RDMS + MDBMS Specialized dimensional database, best for smaller scale DW and DM
Centralized Warehouse Designs/Architecture
benefits, requirements (not prior to implementation) & drawbacks
Benefits: utilizing modern tech is more likely to succeed, more streamlined implementation and architectural simplicity.
Requirements: high cross-org cooperation, high data governance.
Risk: changes can ripple throughout the warehouse, causing unforseen consequences.
Component-Based Warehouse Designs/Architecture
4 benefits; 2 drawbacks
Benefits: allows for decomposition, mix and match technology, can overcome org challenges, and bolt together components
Risk: bolting together different components can be difficult to implement - inconsistent components, cross integration can be difficult.
Two Types of Component-Based Warehouses
Architected: Data warehouse & data marts or data marts only.
Non-Architected: utilized when organization disagrees about the business rules, isolated, use only as a last resort.
Two Types of Data Marts Warehouses
3 characteristics for both
Dependent data marts: pulls data from the warehouse, data is mostly uniform across different data marts, architecturally more straight forward.
Independent data mart - pulls in data directly from source applications, little uniformity across marts, architecture is more spaghetti in nature. Usually less data sources that a data warehouse.
ex. SAP implementation where you have a application that has its own data mart, some (not all) data is sent from the application to the data warehouse (reverse order of data transfer - instead or from warehouse to mart).
ETL v ELT
what is required prior to implementing ETL?
ETL (Extract, Transform, Load): used in traditional data warehousing to make data uniform across data sources. Requires business analysis and data modeling prior to implementation.
ELT (Extract, Load, Transform): blast data into a big data environment and then transform at some later time.
Incremental ETL (Definition)
The 4 Patterns & Defintions
(What is Initial ETL?)
Incrementally refresh the data when new/modified data becomes available with special handlings for deleting data (removing customer, keep in the historical record).
Patterns:
- append: add onto old information
- in-place update: making changes to portions of existing data
- complete replacement: all data overwritten
- rolling append: maintain only a certain duration of historical data, wipe oldest data that is no longer within the time window
Initial ETL is a one-time loading of the data.
Common Data Transformation Models
The 6 & Their Definitions
- data value unification: common taxonomy for similar fields (ex. similar grade lettering, United States [A, B, C…] v. United Kingdom [9, 8, 7…]
- data type and size unification: CHAR (50) across all data source’s full name fields
- de-duplication: removing duplicates
- dropping columns (vertical slicing): removing unnecessary data attributes
- value-based row filtering (horizontal slicing): removing unnecessary data fields based on predefined business rules, maybe aggregations as well
- correcting known errors
In the context of data warehousing, what are facts & dimensions?
Where are they stored?
Facts: measurements (numeric and quantifiable). Stored in fact tables.
Dimensions: context for measurements. Stored in dimension tables.
Ex. A Student’s GPA is 3.8.
Facts: 3.8
Dimensions: Student’s Name, GPA