Data warehousing process that consists of: - Extraction - Transformation - Load

- Simple data type conversion. - Complex data type conversion. - Currency data type conversion. - Language data type conversion.

1. Data must be validated and filtered -> garbage in = garbage out (GIGO). 2. Data in the real world is dirty. - incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data. - noisy: containing errors or outliers. - inconsistent: containing discrepancies in code or names. 3. It is often neglected or casually handled. 4. Problems exposed when data is summarized. 5. Prevention is better, but who does the cleaning?

1. Generation 0: do-it-yourself (generate purpose programming yourself). 2. Generation 1: code generators. 3. Generation 2: ETL engines -> highly optimized, metadata driven.

1. Informatica PowerCenter 2. Bluesky Integration Studio 3. SAP Data Integrator 4. IBM WebSphere Data Integration Suite SAS enterprise data integration 5. Oracle Warehouse Builder 6. MS SQL Server Integration Services

- Static reports generated with SQL or SQL-generators via the meta-layer. - Answers typical business questions. - When a business question changes a little, the SQL query needs to be changed.

- A very popular method of reporting is reporting via a meta-layer query and reporting tool. + used when a company does not have SQL knowledgr. + creates a middle layer between the database and business. + can be done by Microsoft Access for example. - Reporting via SQL

- Enables users to interactively analyze multi-dimensional data from multiple perspectives. - Answers simple and complex business questions.

Lecture 3 Flashcards by Sjoerd Span

Data lake

A database that holds raw data in its narrative format. The data lake approach copies all the data that might be relevant. Whereas the data warehouse approach cleans the data, the data lake approach doesn’t clean the data. Mostly, the two exist next to each other because different employees have different needs. This approach is also called schema on read.

How well did you know this?

Not at all

Perfectly

ETL

Data warehousing process that consists of:

Extraction
Transformation
Load

How well did you know this?

Not at all

Perfectly

Extraction

Extracting data from a production database.

How well did you know this?

Not at all

Perfectly

Transformation

Converting the extracted data from its source from into its target form, so that it can be placed into a data warehouse. Transformation activities: transfer, cleaning, integration and extraction.

How well did you know this?

Not at all

Perfectly

Load

Putting the data into the warehouse.

How well did you know this?

Not at all

Perfectly

ETL data conversion

Simple data type conversion.
Complex data type conversion.
Currency data type conversion.
Language data type conversion.

How well did you know this?

Not at all

Perfectly

ETL cleaning

Data must be validated and filtered -> garbage in = garbage out (GIGO).
Data in the real world is dirty.
- incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data.
- noisy: containing errors or outliers.
- inconsistent: containing discrepancies in code or names.
It is often neglected or casually handled.
Problems exposed when data is summarized.
Prevention is better, but who does the cleaning?

How well did you know this?

Not at all

Perfectly

ETL load

Integral load:
- All the records are taken and processes by comparing them with the records in the data warehouse.
- Extreme case: load deletes DW tables and replaces them with complete new versions (= full load).
- When no time stamps are present it is difficult to do an incremental load.
Delta/Incremental load:
- Only the changed records are taken and processed systematically in the DW.

How well did you know this?

Not at all

Perfectly

ETL tools

Generation 0: do-it-yourself (generate purpose programming yourself).
Generation 1: code generators.
Generation 2: ETL engines -> highly optimized, metadata driven.

How well did you know this?

Not at all

Perfectly

ETL tool vendors

Informatica PowerCenter
Bluesky Integration Studio
SAP Data Integrator
IBM WebSphere Data Integration Suite SAS enterprise data integration
Oracle Warehouse Builder
MS SQL Server Integration Services

How well did you know this?

Not at all

Perfectly

BI front-end applications

Query and reporting
OLAP
Data mining models

How well did you know this?

Not at all

Perfectly

Query and reporting

Static reports generated with SQL or SQL-generators via the meta-layer.
Answers typical business questions.
When a business question changes a little, the SQL query needs to be changed.

How well did you know this?

Not at all

Perfectly

Methods of reporting

A very popular method of reporting is reporting via a meta-layer query and reporting tool.
+ used when a company does not have SQL knowledgr.
+ creates a middle layer between the database and business.
+ can be done by Microsoft Access for example.
Reporting via SQL

How well did you know this?

Not at all

Perfectly

OLAP

Enables users to interactively analyze multi-dimensional data from multiple perspectives.
Answers simple and complex business questions.

How well did you know this?

Not at all

Perfectly

Multi-dimensional data

Data can be classified into measures and dimensions. Aggregating measures up to certain dimensions creates a multi-dimensional view on the data (= data cube).

How well did you know this?

Not at all

Perfectly

Measures

Study These Flashcards

Summable information concerning a business process.

Dimensions

Study These Flashcards

Represent different perspectives on viewing measures and dimensions are organized hierarchically. Hierarchy represents different levels of aggregation.

Fact

Study These Flashcards

When a cell in a data cube has a value.

OLAP operators

Study These Flashcards

Roll up: aggregating measures to a higher dimension level -> rollup on time from quarters to year.
Drill down: reverse of roll up -> drill down on time from quarters to months.
Slice (& dice): selecting subset of cells that satisfy a certain selection condition -> slice for time = Q1. Reverse is unslice and undice.
Pivot (or rotate): rotates data axes.

Data mining models

Study These Flashcards

Objective is to discover new valuable knowledge in databases.
Answers questions like:
+ What customers are likely to respond to my promotion.
+ Which customer is at risk of leaving.

Modeling OLAP databases, define

Study These Flashcards

Measures
Dimensions
Dimension hierarchies

Star schema

Study These Flashcards

A fact table in the middle connected to a set of dimension tables. It is a simple model that limits the number of joins for a better query performance.
In a star schema the primary key is composed of the primary keys of the different tables.
There is data redundancy due to denormalization (a lot is put in one table with a lot of rows). This improves query performance and less joins are needed.

Snowflake schema (normalized)

Study These Flashcards

A refinement of Star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake. A Snowflake schema is used when the data volume is getting too large.

Why ? Tackles disadvantages of the Star schema, because it is less restricted, smaller data volumes, and the possibility to store aggregations.
However, you lose advantages of the Star schema. Is is not easily understood by end-users, high number of joins and no predictable/standard framework.

Sparsity

Study These Flashcards

Arises when not every cell in the cube is filled with data.

Density

of actual rows / # of potential rows

How to deal with changes in dimension attributes, or store type changes, etc.

Solutions by Kimball: 1. Overwrite old values. 2. Add 'version number' field as part of the key. 3. Add new column(s) to dimension table.

How to deal with infrequent changes

Overwrite old value = restating. - Kills information - Use for correction of errors - Sometimes history is not important...

How to deal with small individual changes

Add 'version number' field as part of the key. - Insert new record with new version number. - Optional: use additional fields like 'date from' - 'date to' - Time-sliced version of customer/store/... - Most popular solution - No referential integrity anymore, or refer with complete key.

How to deal with suddenly big changes

Add new column(s) to dimension table. - Save old values in new columns - New values and old values in table - Analysis with old columns is still possible - Eventually, remove added column(s)

Lecture 3 Flashcards

(29 cards)