CH1 Flashcards
What is a data warehouse
subject-oriented, integrated, time-varying, non-volatile collection of data that is used primarily in organizational decision making
subject oriented
Focused on business related subjects and organizational activity
Ex: Customers, employees, products, suppliers
Integrated
Data from different sources are stored in a consistent format
Requires addressing naming conflicts, unit discrepancies
. same naming format, concept, etc.
TIme-Varient
Maintains historical data
Each source may contain data at different time points –day, week, month etc. Used to analyze trends.
nonvolatile
Users cannot change data after it is entered. It is periodically updated.
Structured for query and analysis
Database vs. Data Warehouse
database for querying, for transaction processing.
Data warehouse takes the data from all different databases and creates a layer optimized for and dedicated to analytics.
A database is designed to handle transactions (OLTP). It is not structured to do analytics well.
A data warehouse, on the other hand, is structured to make analytics fast and easy.
Database you don’t do analysis
Need for Data Warehousing
Integrated, company-wide view of high-quality information (from disparate databases)
Separation of operational and informational systems and data (for improved performance)
Need for a Company-Wide View
Data in operational systems are fragmented and inconsistent.
Data is generally distributed on a variety of incompatible hardware and software.
Data must be consolidated to provide a single corporate view
Issues to resolve
Inconsistent key structures! Synonyms Free-form vs. structured fields Inconsistent data values Missing data
Organizational Trends Motivating Data Warehouses
No single system of records Multiple systems not synchronized Organizational need to analyze activities in a balanced way Customer relationship management Supplier relationship management
Data Warehouse Versus Data Mart
DataMart is subset of Data Warehouse. single subject.
DataMart is specific, decentralized, organic, some history, hightly denormalized.. Datamart could be a table in a database. segmented data(partitioned).
Logical Data Mart and Real-Time Data Warehouse
capture real time information
Independent Data Mart
Dependent Data Mart and Operational Data Store
Dependent: Dependent data marts are created by drawing data directly from operational, external or both sources.
Independent: Independent data mart is created without the use of a central data warehouse.
Hybrid: This type of data marts can take data from data warehouses or operational systems.
all datamarts
All involve some form of extraction, transformation and loading (ETL)
why use a datamart?
Data Mart helps to enhance user’s response time due to reduction in volume of data
It provides easy access to frequently requested data.
Data mart are simpler to implement when compared to corporate Datawarehouse. At the same time, the cost of implementing Data Mart is certainly lower compared with implementing a full data warehouse.
Compared to Data Warehouse, a datamart is agile. In case of change in model, datamart can be built quicker due to a smaller size.
A Datamart is defined by a single Subject Matter Expert. On the contrary data warehouse is defined by interdisciplinary SME from a variety of domains. Hence, Data mart is more open to change compared to Datawarehouse.
Data is partitioned and allows very granular access control privileges.
Data can be segmented and stored on different hardware/software platforms.
Dependent data mart with operational data store: a three-level architecture
ETL . see slide 16
near real time
can only be done if data is coming in clean
Extraction
connect via gatewarys and interfaces, odbc, oractle connect, etc.
data cleaning and transforming
Field length, anomalies, missing data etc.
Data migration -Include simple transforms – (field names etc.)
Data scrubbing - Use domain specific knowledge (zip codes)
Data auditing – discover rules, relationships, suspicious patterns
Load
Check integrity constrains (foreign key, primary key, null etc.)
Sorting, summarization, partition, index
Refresh
When and how
When - Depends on user needs, traffic, database server capabilities
How -Triggers (Oracle), Sniff transaction logs (MS)
Event
Event = a database action (create/update/delete) that results from a transaction
Status
after event status updated
transitory data
snapshot, whatever update before is lost.
transient data
With transient data, changes to existing records are written over previous records, thus destroying the previous data content
Periodic Data
Periodic data are never physically altered or deleted once they have been added to the store
creaet second colum called Action column. C=current. U= update, you are updating values. so your storing the older rows(historical). D=Deleted data.
Database Limitation
Relational databases consists of tables Tables are flat one-dimensional Cross-tabs can be 2-dimensional How to represent multi-dimensional tables? The job of data warehousing/OLAP
Dimensional Modeling
Is a preferred technique to present analytic data
Contains same information as a normalized model
Deliver data that’s understandable to the user
Deliver fast query performance
Dimensional Modeling types
Star Schemas: Dimensional modeling implemented in relational DBMS (ROLAP)
OLAP (online analytical processing): Dimensional modeling in a multi-dimensional DB environment
OLAP cubes deliver superior query performance because of pre-calculations, indexing strategies, and other optimizations
ROLAP
Relational Olap model. Star Schemaa Dimensional modeling implemented in relational dbms. similar to rdms.
Star Schema
multiple tables that point to 1 fact table in relational dbms
Dimensional table
Dimensions are the subjects, who what when why
Fact table
collect data about subjects(dimensions)
Dimensions - context
who what where when hos of measurement, example total amount of sales, qty, etc.
dimensional model
characterize in fact or dimension tables.
same grain?
measurements rows in a fact table must be at the same grain. When creating table, at what detail of specification do you want to save the data. weekly, monthly, or individual transactions. they have to match.
Dimension Tables
Dimension tables often have many columns (attribute)
Each dimension table contains data for one dimension
Dimension table often represent hierarchical relationships
Product roll up into brands and then into categories
Each dimension is defined by a single primary key (PK)
PK serves as the basis of referential integrity with the given fact table to which it is joined.
Online Analytical Processing (OLAP) Tools
The use of a set of graphical tools that provides users with multidimensional views of their data and allows them to analyze the data using simple windowing techniques
OLAP provides advanced query capabilities to the warehouse that standard SQL cannot
Complex queries that need to aggregate data can take hours to run
End users cannot be expected to issue SQL statements
Cube
A multidimensional structure consisting of “Data Cubes”
Dimension
Sides of a cube
Measure
Facts in a fact table
Aggregation
Projection of the cube
Cube shape
A Cube need not be in the shape of a cube at all.
It can have as many dimensions as necessary.
Each dimension can have as many “members” as necessary.
A cube can have as many dimensions as necessary.
what goes inside the cube
measures from fact table
What is a fact table?
The Fact Table is the table that provides the data for the elements of the Cube. There can be only one Fact Table per Cube!
What is an Aggregation
Is a Projection of the cube.
An Aggregation is a Projection of the Cube
An Aggregation Collapses Dimensions. Summation. similar to SQL “Group BY” Clause
OLAP Provides 4 types of Aggregation
sum, count, min, max
cube slicing
come up with 2-D view of data by filtering (fixing) a dimension. like a where clause, slice based on a condition. example slice between the males and femails to compare.
dicing
come up with a small cube (sub-cube) by selecting a subset of all dimensions
drill down
going from summary to more detailed views.
roll up
going from detailed views to a summary view.
pivot(rotate)
to rotate the cube across a dimension to see various faces.
add slide
61
operational systems vs. bi systems
information. This asset
is almost always used for two purposes: operational record keeping and analytical
decision making. Simply speaking, the operational systems are where you put the
data in, and the DW/BI system is where you get the data out
What are operational Systems?
The operational systems are optimized to process transactions quickly.
These systems almost always deal with one transaction record at a time. They predictably
perform the same operational tasks over and over, executing the organization’s
business processes. Given this execution focus, operational systems typically do not
maintain history, but rather update data to refl ect the most current state.
BI Users
Users of a DW/BI system, on the other hand, watch the wheels of the organization
turn to evaluate performance. They count the new orders and compare them
with last week’s orders, and ask why the new customers signed up, and what the
customers complained about. They worry about whether operational processes are
working correctly. Although they need detailed data to support their constantly
changing questions, DW/BI users almost never deal with one transaction at a time.
These systems are optimized for high-performance queries as users’ questions often
require that hundreds or hundreds of thousands of transactions be searched and
compressed into an answer set. To further complicate matters, users of a DW/BI
system typically demand that historical context be preserved to accurately evaluate
the organization’s performance over time
Business management requirements for BI/DataWarehouse
The DW/BI system must make
information easily accessible. The contents
of the DW/BI system must be understandable
The DW/BI system must present information consistently. The data in the
DW/BI system must be credible. Data must be carefully assembled from a
variety of sources, cleansed, quality assured, and released only when it is fi t
for user consumption.
The DW/BI system must adapt to change. User needs, business conditions,
data, and technology are all subject to change
The DW/BI system must present information in a timely way
The DW/BI system must be a secure bastion that protects the information
assets
The DW/BI system must serve as the authoritative and trustworthy foundation
for improved decision making
The business community must accept the DW/BI system to deem it successful
Dimensional modeling is widely accepted as the preferred technique
for presenting analytic data because it addresses two simultaneous requirements
■ Deliver data that’s understandable to the business users.
■ Deliver fast query performance.
difference between 3NF and dimensional model
Both 3NF and dimensional models can
be represented in ERDs because both consist of joined relational tables; the key
diff erence between 3NF and dimensional models is the degree of normalization.
Because both model types can be presented as ERDs, we refrain from referring to
3NF models as ER models; instead, we call them normalized models to minimize
confusion.
same info as normalized model?
A dimensional model contains the same information as a normalized
model, but packages the data in a format that delivers user understandability, query
performance, and resilience to change.
Star Schemas Versus OLAP Cubes
Dimensional models implemented in relational database management systems are
referred to as star schemas because of their resemblance to a star-like structure.
Dimensional models implemented in multidimensional database environments are
referred to as online analytical processing (OLAP) cubes, as illustrated in Figure 1-1.
loading data in OLAP Cube
When data is loaded into an OLAP cube, it is stored and indexed using formats
and techniques that are designed for dimensional data. Performance aggregations
or precalculated summary tables are often created and managed by the OLAP cube
engine.
analytics with OLAP
OLAP cubes also provide more analytically
robust functions that exceed those available with SQL. The downside is that you
pay a load performance price for these capabilities, especially with large data sets.
what does a fact represent?
a business measure
how and were store measurement data?
You should strive to store the
low-level measurement data resulting from a business process in a single dimensional
model. Because measurement data is overwhelmingly the largest set of data,
it should not be replicated in multiple places for multiple organizational functions
around the enterprise. Allowing business users from multiple organizations to access
a single centralized repository for each set of measurement data ensures the use of
consistent data throughout the enterprise
what does each row in a fact table represent?
Each row in a fact table corresponds to a measurement event. The data on each
row is at a specifi c level of detail, referred to as the grain, such as one row per product on a sales transaction
one to one measurement
The idea that a measurement event in the physical world has a one-to-one
relationship to a single row in the corresponding fact table is a bedrock principle
for dimensional modeling. Everything else builds from this foundation.
most useful facts
The most useful facts are numeric and additive, such as dollar sales amount.
Additivity is crucial because BI applications rarely retrieve a single fact table
row. Rather, they bring back hundreds, thousands, or even millions of fact rows at
a time, and the most useful thing to do with so many rows is to add them up.
semi additive and non additive facts
Semi-additive facts, such as account balances, cannot be summed across
the time dimension. Non-additive facts, such as unit prices, can never be added.
fact table
grains fall into one of three categories
ransaction, periodic snapshot, and accumulating
snapshot. Transaction grain fact tables are the most common
fact tables and foreign keys
All fact tables have two or more foreign keys (refer to the FK notation in Figure 1-2)
that connect to the dimension tables’ primary keys.
what is a fact table primary key made of?
The fact table generally has its own primary key composed of a subset of the foreign
keys. This key is often called a composite key. Every table that has a composite
key is a fact table. Fact tables express many-to-many relationships. All others are
dimension tables.
what does dimension table describe?
They
describe the “who, what, where, when, how, and why” associated with the event.
Dimension tables tend to have fewer rows than fact tables, but can be wide with
many large text columns.
Each dimension is defi ned by a single primary key (refer
to the PK notation in Figure 1-3), which serves as the basis for referential integrity
with any given fact table to which it is joined
What are dimension tables primary source of?
Dimension attributes serve as the primary source of query constraints, groupings,
and report labels. In a query or report request, attributes are identifi ed as the
by words. For example, when a user wants to see dollar sales by brand, brand must
be available as a dimension attribute
what should attributes consist of in dimension tables?
Attributes should
consist of real words rather than cryptic abbreviations. You should strive to minimize
the use of codes in dimension tables by replacing them with more verbose
14 Chapter 1
textual attributes.In many ways, the data warehouse is only as good as the dimension attributes; the
analytic power of the DW/BI environment is directly proportional to the quality and
depth of the dimension attributes
how to tell if it’s a dimension or should be in fact table
You often make the decision
by asking whether the column is a measurement that takes on lots of values and
participates in calculations (making it a fact) or is a discretely valued description
that is more or less constant and participates in constraints and row labels (making
it a dimensional attribute). For example, the standard cost for a product seems like
a constant attribute of the product but may be changed so often that you decide it
is more like a measured fact.
do fact tables or dimension tables represent fierarchical relationships?
dimension tables often represent hierarchical relationships.
For example, products roll up into brands and then into categories. For each
row in the product dimension, you should store the associated brand and category
description. The hierarchical descriptive information is stored redundantly in the
spirit of ease of use and query performance. You should resist the perhaps habitual
urge to normalize data by storing only the brand code in the product dimension and
creating a separate brand lookup table, and likewise for the category description in a
separate category lookup table
snowflaking
This normalization is called snowfl aking. Instead of
third normal form, dimension tables typically are highly denormalized with fl attened
many-to-one relationships within a single dimension table. Because dimension tables
typically are geometrically smaller than fact tables, improving storage effi ciency by
normalizing or snowfl aking has virtually no impact on the overall database size. You
should almost always trade off dimension table space for simplicity and accessibility
what data has the most dimensionality
This book illustrates repeatedly that the most granular or atomic data has the
most dimensionality. Atomic data that has not been aggregated is the most expressive
data; this atomic data should be the foundation for every fact table design to
withstand business users’ ad hoc attacks in which they pose unexpected queries.
With dimensional models, you can add completely new dimensions to the schema
as long as a single value of that dimension is defi ned for each existing fact row.
Likewise, you can add new facts to the fact table, assuming that the level of detail
is consistent with the existing fact table. You can supplement preexisting dimension
tables with new, unanticipated attributes. In each case, existing tables can be
changed in place either by simply adding new data rows in the table or by executing
an SQL ALTER TABLE command.
SELECT store.district_name, product.brand, sum(sales_facts.sales_dollars) AS "Sales Dollars" FROM store, product, date, sales_facts WHERE date.month_name="January" AND date.year=2013 AND store.store_key = sales_facts.store_key AND product.product_key = sales_facts.product_key AND date.date_key = sales_facts.date_key GROUP BY store.district_name, product.bran
If you study this code snippet line-by-line, the fi rst two lines under the SELECT
statement identify the dimension attributes in the report, followed by the aggregated
metric from the fact table. The FROM clause identifi es all the tables involved
in the query. The fi rst two lines in the WHERE clause declare the report’s fi lter, and
the remainder declare the joins between the dimension and fact tables. Finally, the
GROUP BY clause establishes the aggregation within the report.
Operational Source Systems
These are the operational systems of record that capture the business’s transactions.
Think of the source systems as outside the data warehouse because presumably you
have little or no control over the content and format of the data in these operational
systems. The main priorities of the source systems are processing performance and availability.
Operational queries against source systems are narrow, one-record-at-a-time
Data Warehousing, Business Intelligence, and Dimensional Modeling Primer 19
queries that are part of the normal transaction fl ow and severely restricted in their
demands on the operational system
The extract, transformation, and load (ETL)
environment consists
of a work area, instantiated data structures, and a set of processes. The ETL system
is everything between the operational source systems and the DW/BI presentation
area
3 step process of ETL
1.Extraction is the fi rst step in the process of getting data into the data warehouse
environment. Extracting means reading and understanding the source data and
copying the data needed into the ETL system for further manipulation. At this
point, the data belongs to the data warehouse.
2.After the data is extracted to the ETL system, there are numerous potential transformations,
such as cleansing the data (correcting misspellings, resolving domain conflicts, dealing with missing elements, or parsing into standard formats), combining
data from multiple sources, and de-duplicating data. The ETL system addsvalue to the data with these cleansing and conforming tasks by changing the data
and enhancing.
3.fi nal step of the ETL process is the physical structuring and loading of data
into the presentation area’s target dimensional models. Because the primary mission
of the ETL system is to hand off the dimension and fact tables in the delivery
step, these subsystems are critical