Chapter 9 Flashcards
What is a data warehouse?
A subject-oriented, integrated, time-variant, non-updateable collection of data used in support of management decision-making processes.
(Subject-oriented = customers, patients, etc)
(Integrated = consistent naming conventions formats, encoding sturctures; from multiple data sources)
(Time-variant = can study trends and changes)
(Non-updatable = read-only, periodically refreshed)
What is a data mart?
A data warehouse that is limited in scope
Why is there a need for data warehousing?
- For an integrated, company-wide view of high-quality information (from disparate databases)
- For seperation of operational and informational systems and data (for improved performance)
What are some issues with company-wide view?
- Inconsistent key structures
- Synonyms
- Free-form vs. structured fields
- Inconsistent data values
- Missing data
What organizational trends drive data warehouses?
- There is no single system of records
- They have multiple systems that are not synchronized
- They have an organizational need to analyze activities in a balanced way
- Customer relationship management
- Supplier relationship management
What is an operational system?
A system that is used to run a business in real time, based on current data; also called a system of record
What is an informational system?
A system designed to support decision making based on historical point-in-time and prediction data for complex queries or data-mining applications
Comparison of operational and inofrmational systems
What are the data warehouse architectures?
- Independent data mart
- Dependent data mart and operational data store
- Logical data mart with real-time data warehouse
- three-layer architecture
(all involve some form of extract, transform and load (ETL)
Independent Data Mart visual
What are the limitations of an independent data mart?
- Seperate ETL process for each data mart -> redundant data and processing
- Inconsistency between data marts
- Difficult to drill down for related facts between data marts
- Excessive scaling costs as more applications are built
- High cost for obtaining consistency between marts
Dependent data mart visual
logical data mart visual
Differences in Data warehouse and Data mart
Visual of three-layer data architecture for a data warehouse
What is an event?
a database action (creat/update/delete) that results from a transaction
Status vs event visual
What happens with transient data?
Changes to existing records are written over previous records, thus destroying the previous data content.
What happens with periodic data?
They are never physically altered or delected once they have been added to the store.
Besides transient data coverting to periodic data what other six changes must happen when going from a data mart to data warehouse?
- New descriptive attributes
- New business activity attributes
- New classes of descriptive attributes
- Descriptive attributes become more refined
- Descriptive data are related to one another
- New source of data
What is derived data?
Data that have been selected, formatted, and aggregate for end-user decision support applications
What are the objectives of derived data?
- Ease of use for decision support applications
- Fast response to predefined user queries
- Customized data for particular target audiences
- Ad-hoc query support
- Data mining capabilities
What are the characteristics of derived data?
- Detailed (mostly periodic) data
- Aggregate (for summary)
- Distributed (to departmental servers)
What is the most common data model?
Dimensional model (usually implemented as a star schema)
Star schema visual
Start schema example
What does surrogate mean?
Non-intelligent and non-business related
Why should dimension table keys be surrogate?
- Business keys may change over time
- Helps keep track of nonkey attribute values for a given production key
- Surrogate keys are simpler and shorter
- Surrogate keys can be same length and format for all keys
What are the fact table granularities?
Transactional grain - finest level
Aggregrate grain - more summarized
What are the advatantages of a finer grain?
- Better market basket analysis capability
- More dimenstion tables, more rows in a fact table
( In web-based commerce, finest graunlarity is a click)
What are the durations of a database?
- Natural duration - 13 months or 5 quarters
- Financial institutions may need longer durations
- Older data is more difficult to source and cleanse
What determines the size of a fact table?
- It depends on the number of dimensions and the grain of the fact table
- Number of rows = product of number of possible values for each dimension associated witht he fact table
Example of how to determine the size of a fact table
Benefits of multiple facts tables
- Can improve performance
- Often used to store facts for different combinations of dimensions
- Conformed dimensions
What is a factless facts table?
- It has no nonkey data, but foreign keys for associated dimensions
- Used for; Tracking events, Inventory coverage
Conformed dimensions example
Example of a factless fact table
What are multivalued Dimensions?
- Facts qualified by a set of values for the same business subject
- Normalization involves creating a table for an associative entity between dimensions
What are hierarchies?
- Sometimes a dimension forms a natural, fixed depth hierarchy
- Design options
What are design options for hierarchies?
- Include all information for each level in a single denormalized table
- Normalize the dimension into a nested set of 1:M table relationships
What is a Helper table?
An associative entity that implements a M:N relationship between deminsion and fact.
Multivalued Dimension example
What do dimension hierarchies provide?
Levels of aggregation for users wanting summary information in a data warehouse.
Example of fixed product hierarchy
How are Kimball’s approaches to maintaining knowledge of the past in Slowly Changing Dimensions (SCD)?
- Type 1: Just replace old data with new (lose historical data)
- Type 2: For each changing attribute, create a current value field and several old-valued fields (multivalued)
- Type 3: Create a new dimension table row each time the dimension object changes, with all dimension characteristics at the time of change. Most common approach
Type 2 SCD example
What should be used for rapidly changing attributes (hot attributes) instead of Type 2 SCD which creates too many rows and too much redundant data?
Dimension segmentation
What are 10 essential rules for dimensional modeling?
- Use atomic facts - Honor hierarchies
- Create single-process fact tables - Decode dimension tables
- Include a date dimension for each fact table - Use surrogate keys
- Enforce consistent grain - Conform dimensions
- Disallow null keys in fact tables - Balance requirements with actual data
What is the future of data warehousing?
Integrations of Big data and Analytics
What are some problems with the integration of big data and analytics?
- Big data has huge volume and is often unstructured
How is the speed of processing an issue with the integration of big data and analytics with data warehousing?
- Design/purchase storage, database, and networking aspects in tandem
- Use in-memory databases (RAM instead of disk)
- Add analytics capabilities closer to the original data sources instead of seperate data warehouses
What Cost of Data Storage considerations are there with Data warehousing/big data integration?
- Moving data warehouse to the cloud
- Columnar databases optimize storage