ETL Process Flashcards
-takes the raw data from operational systems and prepares it for the data warehouse
extract, transformation,and load (ETL)
-need handle different extract schedules
-need to join data from multiple sources for a single dimension.
-need to determine correct to use when multiple versions of a field are avaiable from different sources.
ETL Requirements
-reading and understanding source data and copying needed data into data staging for further processing.
-implemented using gateways, standard interfaces, third-party products or custom development
extraction
(logical extraction approaches)
-entire data source extrated-no need to track changes to data.
full
(logical extraction approaches)
-only data that has changed during specified time interval is extracted
-more efficient, but requires identifying chages to data in operational source systems.
incremental
(extraction in databases)
-captures changes made to tables either synchronously (using triggers) or asynchronously (usin redo log files).
change data capture
(extraction in databases)
-can be used if operational system had these indicatimg date/time a row was last modified.
timestamps
(extraction in databases)
-can be used if source system use range partitioning with a date key
partitioning
(extraction in databases)
-typicallt used in conjuction with timestamp columns - triiger updates timestamp column upon each change on the source table
trigger
online
-can be done via a reference to anohter dataase or schema.
offiline
-extraction process access data in structures outside source system.
–flat files (via SQL Plus, OCI or Pro C programs)
–SSIS packages
–export data utility (show example)
–backup files (via SQL Server Backup Utility)
Extraction in SQL Server
online
-extraction process connects directly to source system to extract data.
-can be done via a databse link or direct schema reference.
offiline
-extraction process access data in structures outside source system.
–flat files (via SQL Plus, OCI or Pro C programs)
–SSIS packages
–export data utility (show example)
–backup files (via SQL Server Backup Utility)
extraction in Oracle
(extraction)
-applications such a informatca and datastage can connect directly to Oracle using Native Oracle drivers.
-queries are designed in a GUI interface, then run by a server engine when
third-party tools
(common types of business rules -transformation)
-only keep records with certain data values or ranges
filtering
(common types of business rules -transformation)
-totaling recording into a summary record
summarizing
(common types of business rules -transformation)
-combining two or more source system record into one output or target record.
merging
(common types of business rules -transformation)
-converting codes to English description or data warehouse standard descriptions
transposing
(common types of business rules -transformation)
-converting codes from differen sources to data warehouse standard values
codifying
(common types of business rules -transformation)
-applying mathemtical formulas to rpoduce data to be stored in the data warehouse
derivations
(common types of business rules -transformation)
-ensuring the quality of the data-that it is consistent, is of a known, recognized value, and conforms to the metadata definition for it.
cleansing
-can be just about any type of data conversion
-during load are most efficient.
-can be done when fact data is inserted or as an update after the fact record has been loaded.
-don’t ignore or limit transformations because they are difficult!
-the difficult it is, the more important it is for the data warehouse designer to calculate if if possible.
transformation
(transformation applications)
-each transformation implemented as a separate SQL operation, creating a temporary staging table.
multistage
(transformation applications)
-“transform-whole-load”-transformations performed during loading process, eliminating need to create temporary staging tables.
pipelined
process of loading transformed records into the warehouse.
-additional preprocessing may be done during load.
–sort
–aggregate
–compute views
–build indexes
–partition
-huge volumes of data to be loaded, yet small time window (usually at night) when the warehouse can be taken off-line.
-typically handled by batch load utlities
load
(loading in oracle)
-SQL Loader
-external tables
-OCI and Direct Path APIs
-export/import
loading mechanisms