Data Warehousing Flashcards
How did the data warehousing evolution work?
Since 1970, automate business processes to offer more efficient and cost-effective services.
This resulted in growing data in operational databases
What is a data warehouse ?
Information warehouse. A SUBJECT-ORIENTED, INTEGREATED, TIME-VARIANT AND NON-VOLATILE collection of data in support of management decision making process.
Define all the types of data stored in a data warehouse
SUBJECT- ORIENTED DATA: Organised around major subjects of the enterprise(sales, customer products). To store decision-support data.
INTEGRATED DATA: Integrates corporate application-oriented data from different source systems which includes data that is inconsistent.
Integrated data must be consistent to present a unified view to the users.
TIME-VARIANT DATA: Data is accurate and valid at some time interval.
Time-variance is also shown in extended time that data is held. Data represents a series of snapshots.
NON-VOLATILE DATA:data in warehouse in not updated in real time but is refreshed from OS on regular basis. New data is added as a SUPPLEMENT. DB absorbs data and integrates it with the previous data.
Examples of data warehouse queries and it’s examples
Types of queries that DW is expected to answer ranges from simple to highly complex and dependent on type of end-user access tools, which are OLAP, Data mining, Reporting, query, and application development tools and Executive information systems (EIS).
Examples are: * What was the total revenue for Scotland in the third quarter of 2004?
* What was the total revenue for property sales for each type of property in
Great Britain in 2003?
* What are the three most popular areas in each city for the renting of
property in 2004 and how does this compare with the figures for the
previous two years?
What are the benefits and problems of data warehousing
Benefits:
*Potential high returns on investment.
*Competitive advantage
*Increased productivity of corporate decision-makers.
Problems:
1.Hidden problems with source systems
2.Data ownership
3.High maintenance
4.Increased user-end demands
5.Required data not captured
Comparison of OLTP and Data Warehousing
Refer to the image
What is a data webhouse
CLICKSTREAM- data generated by behavioral data as individuals interact through web browsers with remote web sites.
It is a distributed data warehouse with no central repository. Implemented over web to harness clickstream.
What are the operation data sources and store (ODS) ?
Data sources:
MAINFRAME DATA: held in 1st generation hierarchical and network DB.
DEPARTMENTAL DATA: Held in propriety file systems(VSAM, RMS) and relational DBMS(oracle).
PRIVATE DATA: Held in servers.
EXTERNAL SYSTEMS: Internet or DB with organizations suppliers or customers.
ODS: A repository of current and integrated data used for analysis. Structured and supplied same way as DW. STAGING AREA for data to be moved to DW.
This is created when Operational sys incapable of achieving requirements.
What are the functions of Load, Warehouse and Query Manager ?
LOAD MANAGER: Performs ETL(Extraction, Transformation and Loading) of data into the warehouse.
Size and complexity varies and constructed using a combination of vendor loading and custom built tools.
WAREHOUSE MANAGER: All operations with management of data in DW. Constructed using vendor data management tools and custom-built programs.
Operations performed are: Analysis of data to ensure consistency, Generation of DENORMALIZATIONS AND AGGREGATIONS, Backing up and archiving data, creation of INDEXES AND VIEWS. Also generates query profiles.
QUERY MANAGER:All operation with management of user queries.
Constructed using vendor end-user data access tools, data
warehouse monitoring tools,
database facilities, and custom-built programs.
Operations perfomed: DIRECTING QUERIES AND SCHEDULING EXECUTION OF QUERIES.
What is Detailed Data, Lightly and highly summarized and Archive data ?
DETAILED: stores all data in DB Schema.
Data is not stored online but aggregated to next level of detail. Data is added to DW to SUPPLEMENT the aggregated data.
LIGHTLY AND HIGHLY SUMMARIZED DATA: * Stores all the pre-defined lightly and highly aggregated data
generated by the warehouse manager.
* Transient, as it will be subject to change on an on-going basis in order to respond to changing query profiles.
Summary data is used to speed up performance and its updated continuously as new data.
ARCHIVE DATA: stores detailed and summarized data for the purposes of archiving and backup. Necessary to backup online summary data. This data is transferred to storage archives like magnetic tape/ optical disk.
What is metadata
This area stores all metadata(data about data) definitions used by all processes in warehouse.
Used for these purposes:
ETL-map data sources to common view of info.
Warehouse management-automate production of summary tables.
Query-direct query to most appropriate data source.
Structure of metadata differ based on the purpose- multiple copies of metadata within warehouse.
Copy management tools- understand mapping rules to convert source data into common form.
End user access tools-how to build a query.
Write a note about end-user access tools
DW- used to provide strategic decision making and it must support AD HOC and routine analysis.
High performance- pre-planning the requirements for joins, summations etc.
What are the associated processes with information flows in DW ?
1.Inflow- ETL from source to DW.
2. Upflow- Adding value to DW through summarizing, packaging.
3.Downflow- Archiving and recovering data in warehouse.
4.Outflow- making data available to end users.
5.Metaflow- management of metadata.
What are the DBMS requirements are what are the parallel DB techno ?
DBMS requirements:
*Load performance, processing.
*Query performance
*data quality management
*mass user scalability
DB parallel techno:
*use multiple nodes to work on the same problem.
*divide individual tasks into smaller parts
*Parallel DBMS - parallel queries,
parallel data loading and table scanning as well as data archiving and back up.
Functions on administration and management tools
*Monitoring data loading from multiple sources
*Data quality and integrity checks.
*Security management
*Archiving data
*Managing and updating metadata.