Data Warehouse Design - Chapter 1.4 Flashcards
What happens in the data staging layer?
The ETL processes that extract, integrate and clean data from operational sources to feed the data warehouse layer.
When does ETL happen?
When a data warehouse is populated for the first time. After that, it occurs every time the data warehouse is regularly updated.
How is the ETL stage often defined in the three-layer architecture and why?
Reconciliation.
- As the ETL processes feed the reconciled data layer, which in turn feeds the data warehouse. This is considered the most complex stage.
Which four stages are present in ETL?
- Extraction (capture)
- Cleansing (cleaning/scrubbing)
- Transformation
- Loading
*Note: The cleansing and transformation stages are often blurred together.
What is the main difference between the cleansing and the transformation stage?
Cleansing focusses on rectifying data values, whereas transformation is focussed on managing data formats.
Which three types of data extraction are possible?
- Static extraction
Making a snapshot of operational data, done when you populate the data warehouse for the first time.
- Incremental extraction
Used to update data warehouses. Based on a log maintained by the operational DBMS.
- Source-driven extraction
You rewrite your operational applications to give notifications of all the data changes.
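A minimal sketch of static versus incremental extraction. The in-memory table, the change log layout, and the function names are hypothetical and only serve as illustration:

```python
# Hypothetical operational table at the time of the first load: key -> row.
source_table = {
    1: {"customer": "Alice", "city": "Rome"},
    2: {"customer": "Bob", "city": "Milan"},
}

# Hypothetical change log kept by the operational DBMS:
# (sequence number, operation, key, row). Entries 3 and 4 happen after the first load.
change_log = [
    (1, "insert", 1, {"customer": "Alice", "city": "Rome"}),
    (2, "insert", 2, {"customer": "Bob", "city": "Milan"}),
    (3, "update", 2, {"customer": "Bob", "city": "Turin"}),
    (4, "delete", 1, None),
]


def static_extraction(table):
    """Return a full snapshot (copy) of the operational data."""
    return {key: dict(row) for key, row in table.items()}


def incremental_extraction(log, last_seen):
    """Return only the log entries recorded after the previous extraction."""
    return [entry for entry in log if entry[0] > last_seen]


snapshot = static_extraction(source_table)                  # first population
changes = incremental_extraction(change_log, last_seen=2)  # later update
```

Here `last_seen` is the highest log sequence number already extracted, so each later run picks up only what changed since the previous one.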
What happens in the cleansing stage of ETL?
Remove inconsistencies and mistakes that make the data dirty to improve data quality.
What are the most frequent mistakes with data that require cleansing?
- Duplicate data
- Inconsistent values that are logically associated
- Missing data
- Unexpected use of fields
- Impossible or wrong values
- Inconsistent values for a single entity because different practices were used
- Inconsistent values for one individual entity because of typing mistakes
What are the main data cleansing features in ETL and what do they do?
- Rectification (Rectify mistakes)
- Homogenization (Make them similar)
-> Using specific dictionaries to rectify typing mistakes and to recognize synonyms.
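A minimal sketch of dictionary-based homogenization. The dictionary contents and the `homogenize` helper are hypothetical examples, not taken from the chapter:

```python
# Hypothetical dictionary mapping known misspellings and synonyms
# to a single canonical value.
CITY_DICTIONARY = {
    "new york": "New York",
    "new yrok": "New York",   # typing mistake
    "nyc": "New York",        # synonym
    "big apple": "New York",  # synonym
}


def homogenize(value, dictionary):
    """Return the canonical form of a value, or the trimmed value if it is unknown."""
    key = value.strip().lower()
    return dictionary.get(key, value.strip())


raw_values = ["NYC", " new yrok", "New York", "Boston"]
clean_values = [homogenize(v, CITY_DICTIONARY) for v in raw_values]
```

Values that the dictionary does not know, such as "Boston" here, pass through unchanged apart from trimming.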
What is rule-based cleansing?
A technique to achieve rectification and homogenization. It enforces domain-specific rules and defines appropriate associations between values.
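A minimal sketch of rule-based cleansing with one domain rule and one association rule. The rules, field names, and the `apply_rules` helper are assumptions chosen for illustration:

```python
# Hypothetical association rule: each known city belongs to exactly one country.
CITY_TO_COUNTRY = {"Rome": "Italy", "Paris": "France"}


def apply_rules(record):
    """Return a corrected copy of the record and the list of issues found."""
    cleaned = dict(record)
    issues = []

    # Domain rule (assumed): age must be present and plausible for a person.
    age = cleaned.get("age")
    if age is None or not 0 <= age <= 120:
        issues.append("age missing or out of range")
        cleaned["age"] = None

    # Association rule: the country must match the one associated with the city.
    expected = CITY_TO_COUNTRY.get(cleaned.get("city"))
    if expected is not None and cleaned.get("country") != expected:
        issues.append("country does not match city")
        cleaned["country"] = expected

    return cleaned, issues


cleaned, issues = apply_rules({"city": "Rome", "country": "France", "age": 230})
```

The impossible age is blanked rather than guessed, while the country is rectified because the association rule determines the correct value.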
What happens in the transformation phase?
Data is converted from its operational source format into a specific data warehouse format.
-> In a three-layer architecture, this results in the reconciled layer.
Which data aspects are rectified in the transformation phase?
- Loose texts that may hide valuable information
(BigDeal LtD might hide that it is a limited partnership company)
- Different formats can be used for individual data
(A date saved as a string or as three integers)
What are the three main transformation processes in the transformation phase?
- Conversion and normalization that operate on both storage formats and units of measure to make data uniform.
- Matching that associates equivalent fields in different sources
- Selection that reduces the number of source fields and records.
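A minimal sketch of the three processes applied to two sources. The source layouts, field names, and unit choice are hypothetical:

```python
from datetime import date

# Two hypothetical sources that describe sales with different
# field names, date formats, and units of measure.
source_a = [{"cust_name": "Alice", "sale_date": "2024-03-05", "weight_kg": 2.0, "fax": "n/a"}]
source_b = [{"customer": "Bob", "day": 7, "month": 3, "year": 2024, "weight_lb": 10.0}]

LB_TO_KG = 0.45359237  # kilograms per international pound


def transform_a(row):
    return {
        "customer": row["cust_name"],                       # matching: cust_name -> customer
        "sale_date": date.fromisoformat(row["sale_date"]),  # conversion: string -> date
        "weight_kg": row["weight_kg"],
        # selection: the "fax" field is not carried over
    }


def transform_b(row):
    return {
        "customer": row["customer"],
        "sale_date": date(row["year"], row["month"], row["day"]),  # conversion: three integers -> date
        "weight_kg": round(row["weight_lb"] * LB_TO_KG, 3),        # normalization: pounds -> kilograms
    }


reconciled = [transform_a(r) for r in source_a] + [transform_b(r) for r in source_b]
```

After the transformation, both sources share the same field names, date type, and unit, so their rows can be stored side by side.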
Which transformation technique is used when populating a data warehouse after the transformation phase?
Denormalization.
Data warehouse data is typically denormalized, so aggregation is required to sum up the data properly for the end user.
Which two ways of loading (into the data warehouse) exist?
- Refresh
Replacing all the older data by rewriting the data warehouse completely. Normally done at the beginning, when the data warehouse is populated for the first time.
- Update
Only the changes in the source data are added. Normally done for regular updates.
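A minimal sketch contrasting the two loading modes. The warehouse is modeled as a plain list of rows, and the function names are hypothetical:

```python
def refresh(warehouse, snapshot):
    """Replace the entire warehouse contents with a full snapshot."""
    warehouse.clear()
    warehouse.extend(snapshot)


def update(warehouse, changes):
    """Append only the changed rows; existing rows are left untouched."""
    warehouse.extend(changes)


warehouse = [{"sale_id": 0, "amount": 5}]  # stale content from an earlier load
refresh(warehouse, [{"sale_id": 1, "amount": 10}, {"sale_id": 2, "amount": 20}])
update(warehouse, [{"sale_id": 3, "amount": 30}])
```

After the refresh the stale row is gone, and the update only adds the new row on top of what was already loaded.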