Lecture 3 Flashcards
Data lake
A database that holds raw data in its narrative format. The data lake approach copies all the data that might be relevant. Whereas the data warehouse approach cleans the data, the data lake approach doesn’t clean the data. Mostly, the two exist next to each other because different employees have different needs. This approach is also called schema on read.
ETL
Data warehousing process that consists of:
- Extraction
- Transformation
- Load
Extraction
Extracting data from a production database.
Transformation
Converting the extracted data from its source from into its target form, so that it can be placed into a data warehouse. Transformation activities: transfer, cleaning, integration and extraction.
Load
Putting the data into the warehouse.
ETL data conversion
- Simple data type conversion.
- Complex data type conversion.
- Currency data type conversion.
- Language data type conversion.
ETL cleaning
- Data must be validated and filtered -> garbage in = garbage out (GIGO).
- Data in the real world is dirty.
- incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data.
- noisy: containing errors or outliers.
- inconsistent: containing discrepancies in code or names. - It is often neglected or casually handled.
- Problems exposed when data is summarized.
- Prevention is better, but who does the cleaning?
ETL load
Integral load:
- All the records are taken and processes by comparing them with the records in the data warehouse.
- Extreme case: load deletes DW tables and replaces them with complete new versions (= full load).
- When no time stamps are present it is difficult to do an incremental load.
Delta/Incremental load:
- Only the changed records are taken and processed systematically in the DW.
ETL tools
- Generation 0: do-it-yourself (generate purpose programming yourself).
- Generation 1: code generators.
- Generation 2: ETL engines -> highly optimized, metadata driven.
ETL tool vendors
- Informatica PowerCenter
- Bluesky Integration Studio
- SAP Data Integrator
- IBM WebSphere Data Integration Suite SAS enterprise data integration
- Oracle Warehouse Builder
- MS SQL Server Integration Services
BI front-end applications
- Query and reporting
- OLAP
- Data mining models
Query and reporting
- Static reports generated with SQL or SQL-generators via the meta-layer.
- Answers typical business questions.
- When a business question changes a little, the SQL query needs to be changed.
Methods of reporting
- A very popular method of reporting is reporting via a meta-layer query and reporting tool.
+ used when a company does not have SQL knowledgr.
+ creates a middle layer between the database and business.
+ can be done by Microsoft Access for example. - Reporting via SQL
OLAP
- Enables users to interactively analyze multi-dimensional data from multiple perspectives.
- Answers simple and complex business questions.
Multi-dimensional data
Data can be classified into measures and dimensions. Aggregating measures up to certain dimensions creates a multi-dimensional view on the data (= data cube).