Data Warehousing Flashcards
Data warehouse
Decision support database that is maintained separately from the organization’s operational database. It supports information processing by providing a solid platform of consolidated, historical data for analysis.
Subject-Oriented Data Warehouses
Organized around major subjects like customers, products and sales. It focuses on modelling and analysis of data for decision makers which is not on daily operations or transaction processing. Provides a simple and concise view around particular subject issues by excluding data that isn’t useful in the decision support process.
Integrated Data Warehouses
constructed by integrating multiple heterogeneous data sources such as relational DBs, flat files, Excel file and transaction records. Data cleaning and data integration techniques are applied to ensure consistency in naming conventions, encoding structures, attribute measures etc. among different data sources. When data is moved to the warehouse it is converted.
Time Variant in Data Warehouses
the time horizon for data warehouses is much longer than of operational systems. As they provide historical data from the past 5-10 years. Every structure in the data warehouse contains an element of time, explicitly or implicitly, but the key of operational data may or may not contain a time element.
Non-Volatile Data Warehouse
Physically separate store of data transformed from the operational environment. Operational update of data doesn’t occur in the data warehouse environment. It doesn’t require transaction processing, recovery and concurrency control mechanisms. It requires only two operations in accessing data: initial loading of data and access of data.
heterogeneous DBMS
integrates disparate databases to provide a unified query interface for users. This system relies on a query-driven approach
A data warehouse is an _____ driven system. What does it focus on
update driven system that integrates and stores information from heterogeneous sources in advance. It focuses on high performance for querying and analysis
Operational DBMS
focuses on Online Transaction Processing (OLTP) while data warehouse focuses on Online Analytical Processing (OLAP)
Fact Table
Contains measurable data or metrics about business processes ex sales
Dimension Table
Contains descriptive attributes that label and filter data ex. Product Names
Why do we need a separate warehouse?
- Performance: OLTP systems are tuned for transaction processing whilst OLAP is optimized for complex analytical queries, multidimensional views and data aggregation
- Historical Data: Operation databases typically lack historical data, which is critical for decision making
- Data Consolidation: Decision-making often requires aggregated and summarized data from multiple heterogeneous sources.
- Data Quality: Ensures consistency in data representation, codes and formats from diverse sources
Data Warehouse Design Overview
Key Design Perspectives
1 Top-Down View: Identifies relevant information needed for the data warehouse.
2 Data Source View: Exposes data being captured and managed by operational systems
3 Data Warehouse View: Defines fact tables and dimension tables for analytical use.
4 Business Query View: Represents how end users will query and interact with the data
Design Process
Top Down: Focuses on overall planning and design
Bottom Up: Starts with prototypes and experiments
Waterfall: Structured, step by step process
Spiral: Iterative development with quick functional prototypes.
Design Steps
1 Choose a business process: Identify processes to model. Ex orders.
2 Define the grain: decide the atomic level of data granularity (ex. Individual orders or daily summaries)
3 Select Dimensions: Identify descriptive attributes (ex. Time or location)
4 Choose Measures: Define numerical metrics (ex. Sales amount) for the fact table.
Enterprise Warehouse
Collects all subject info spanning the organization. It features a unified approach for organizing and representing data, the ability to classify data according to subject and give access to those divisions (sales etc.). Normalized design.
DataMart
A DataMart is the access layer of the data warehouse environment. Used to get data out to the users. The data mart is a subset of the data warehouse that is usually oriented to a specific business line or team. Data marts represent small slices of the data warehouse.
Virtual Warehouse
A set of SQL views over operational databases. Only some of the possible summary views may be persisted.
Data Warehouse Tools and Utilities
- Data Extraction: Get data from multiple heterogeneous and external sources
- Data Cleaning: Detect errors in the data and rectify where possible
- Data Transformation: Convert data from legacy or host format to warehouse format.
- Load: Sort, summarize, consolidate, compute views, check integrity, and build indices and partitions.
- Refresh: Propagate the updates from the data source to the warehouse.
Metadata is the data defining warehouse objects. It stores:
- Description of the structure of the data warehouse (schemas, views…)
- Operational Metadata (Data lineage, currency of data…)
- Technical Metadata (DB system names, tables etc…)
- Algorithms used for summarization
- The mapping from operational environment to the data warehouse
- Data related to system performance (Data Warehouse Schema, View…)
- Business Data (Business terms and definitions, ownership of data…)
OLAP Server Architecture
- ROLAP
- MOLAP
- HOLAP
- Specialized SQL Servers
Relational OLAP (ROLAP)
Use relational or extended relational DBMS to store and manage warehouse data and OLAP middleware. Include optimization of DBMS backend, implementation of aggregation based navigation logic and more tools. Scalable.
Multidimensional OLAP (MOLAP)
Sparse array-based multidimensional storage engine. Fast indexing to pre-computed summarized data.
Hybrid OLAP (HOLAP)
Flexibility ex. Low level: relational, high level: array
Specialized SQL servers
ex Redbricks, specialized support for SQL queries over star/snowflake schemas.
Data Warehouse Usage
- Information Processing: Supports querying, basic statistical analysis, and reporting
- Analytical Processing: Multidimensional analysis of data warehouse data. Supports basic OLAP operations.
Data Mining:
Knowledge discovery from hidden patterns
Supports association, constructing analytical models, performing classification and prediction and presenting the mining results using visualization tools.
Table Descriptive Metadata
Physical Name
Logical Name
Type: Fact, Dimension, Bridge
Role: Legacy, OLTP, Stage
DBMS: DB2, Informix, MSSQL Server, Oracle
Location
Definition
Column Descriptive Metadata
Physical name
Logical name
Order in table
Datatype
Length
Decimal positions
Nullable/Required
Default Value
Edit Rules
Definition
Extract-Transform-Loads (ETL)
Extract - The process by which data is extracted from the data source
Transform - The transformation of the source data into a format relevant to the solution
Load - The loading of data into the warehouse
Develop the process backwards, start from business rules and code the ETL routines accordingly.
ETL Strengths
- Development Time: Since we are designing output backwards, only relevant data is extracted and processed.
- Targeted Data: Since we know the outcome of the load process, only presentation-relevant data is present in the warehouse. Also simplifies security regime and administration overhead.
- Tools Availability: A prolific number of tools are available that provide ETL functionality.
ETL Weaknesses
- Short-sighted requirements analysis: Targeting only relevant data means that future requirements will require changes to current ETL routines, maybe even re-designs.
- Hardware: Use of ETL processes will require additional hardware to accommodate the requirements of the ETL engine.
- Learning curve: 3rd party tools imply learning new scripting languages and applications. Lack of experience may lead to choosing difficult approaches.
Extract Load Transform (ELT)
Extract the data into a Staging Database
Apply data integrity and business rules check in the Staging Database
Load clean data into the warehouse
Transformations are done within the data warehouse
ELT Benefits
- Removes dependency between the load and transformation process.
- Extract and load process can include elements that may be needed in the future
- Break the project into smaller chunks (more predictable and manageable)
- Data integrity checks are performed earlier, so only cleaned and checked data is in the warehouse
ELT Strengths
- Project Management: Encourages specific and isolated tasks, therefore the project is manageable. Minimize risks by removing interdependencies between stages, and improving the change and maintenance phases due to process isolation.
- Flexible and Future Proof: All data from data sources can be loaded in the warehouse. Since transformation is a separate process, it is possible to accommodate future requirements.
- Utilizing existing hardware: Ability to use the tools provided by ODBMS. Alternatively, ELT can run on the same hardware as the data warehouse.
- Utilize existing skill set: ELT utilizes the existing tools.
ELT Weaknesses
Against the norm: ELT is an emerging technology, requires a paradigm shift.
Tools available: Not as many ELT tools as ETL
Data Staging
The construction site of the warehouse. Required by most scenarios, connected to a wide variety of sources. Used to clean, aggregate, compute and validate data
2 Models of remote staging
Staging area within the data warehouse environment and staging area outside the data warehouse.
Onsite staging model: Data staging area within the operational environment (possibly affecting the operational system).
Why do we need data marts?
Provide users with flexible access to frequently accessed data
Tailor data to match the specific needs of targeted user groups.
Improve end user response time for queries
Clearly define the scope of potential users, enabling targeted support.
Deliver data in structured forms suitable for end user tools and analytics.
Simplify and accelerate implementation compared to a corporate data warehouse
Lower implementation costs relative to a full-scale data warehouse.
Standalone Data Mart
Focuses exclusively on a single subject area or department, independent of a broader data warehouse. Collects data from multiple transactional systems for a specific business need. May use dimensional or entity-relationship models. Used by being queried by analytical or business intelligence tools.
Standalone Data Mart Adv.
Quick to build and deliver results
Lower costs compared to corporate data warehouses
Visible results in a short time frame