Class 2 - Chapter 3 Flashcards
Characteristics of Data Warehouse
Subject-oriented
Focused on business related subjects and organizational activity
Ex: Customers, employees, products, suppliers
Integrated
Data from different sources are stored in a consistent format
Requires addressing naming conflicts, unit discrepancies
Time-variant
Maintains historical data
Each source may contain data at different time points –day, week, month etc. Used to analyze trends.
Nonvolatile
Users cannot change data after it is entered. It is periodically updated.
Structured for query and analysis
3 different data Warehouse Architectures
independant Data Mart- each data mart has own
ETL process. not centralized storage location. stored in different datamarts, users have access to all data
Dependenat Data Mart and Operational Data Store
ETL process where data is parititoned, same format, standardization because loaded from same systems.
Logical Dat Mart and Real-Time Data Warehouse. real time update of the data. you get data from sources and save in data warehouse, different sections for different data mart. It’s not actual data but a view. more of a dashboard thing.
OLAP
multidimensional view of the data
What is aggregation?
is like the SQL “Group BY’ summing up a 2 dimension view. so if you had 3 dimensions, you aggregate 1 dimension to make 2 dimension.
slice
fix one attribute of one dimension. filter male. fix one and becomes 2 dimensional view.
dice
take subset of the data. you have cube that you slice and create a subcube.
drill down
data has heirarchies, you have say date dimension. you can drill down to week.
rollup
you aggregate
foru step dimensional modeling process.
Step 1?
select business process. talk to users. sequence of tasks to create outputs from input. production process, procurement process.
foru step dimensional modeling process.
Step 12?
Declare
Business process?
Business Process
What is a business process?
Sequence of tasks or activities that take a set of inputs and convert them into desired output.
Procurement, Fulfillment, Production, etc.
Common characteristics
Action verbs
Supported by IT systems
Input/Output/Metrics/KPI
Examples: Retail -> sales, Healthcare -> treatment
. Interview the actors. operations person, etc.
How to select Business Process?
QUICK WINS
Activities that give the best return based on the effort. They are valuable and fundamental to your success.
MAJOR PROJECTS
Activities what provide long term returns but may be more complex to execute.
FILL-INS
Unimportant activities including everyday tasks.
TIME WASTERS
These activities are time-consuming and require resources that are better spent on other things
Step 2: Declare the Grain
Grain: Level of detail associated with fact table measurements
What level of granularity should be used?
“How do you describe each row in a fact table?”
Note: Include lowest level (atomic) of detail as it can be constrained and rolled up in every possible way.
Step 2: Declare the Grain (Contd.)
Examples
One row per scan of individual product on customer sales transaction
One row per product under promotion per store each day.
One row per bank account each month.
One row per individual boarding pass scanned at an airport gate
Insurance?
Transportation? Ride Share Service?
Education?
fact table
has all the measures, the grain.
Step 3: Identify Dimensions
Context of measurements
Who, what, why, where, when, how
Data types – fact vs dimensions
Natural Key
– System generated operational identifier.
Step 4: Identify Facts
What is the process measuring?
Key performance metrics
Typical facts are numeric additive figures
Business requirements vs source data
Examples? Quantity sold, current balance, unit price.
Note: facts in a design must be true to the grain defines in Step 2.
Fully Additive facts:
Measures can be summed across any dimensions Examples?
Sales, Cost
Semi Additive Facts
Semi-Additive: Measures can be summed across some dimensions, but not all. Examples?
Balance amount across time dimension
grain
level of detail you want to capture, should be same across all rows in fact table.
Non Additive Facts
Non-Additive: Measures cannot be summed across any dimension. Example?
Ratios, unit price, salary,
Dimensional table
doesnt usually have facts, more like product description, location, when, where purchased. connected to fact table using pk/fk. fact table has fk of other tables.
Wall mart example:
What are the data captured by Walmart retail stores? Point of Sale data: Sales Transactions Products Sold Promotions Customer Goods delivery data: Vendor Products Time of delivery Product return data: Products returned Damage/exchange
Retail: Step 1 Select Business Process
What are the business requirements?
Logistics of ordering, stocking, and selling products while maximizing profit.
Start with most critical process as well as the most feasible.
Process: POS Retail Sales Transactions Process
Which products are selling?
in which store?
on which days?
under which promotion?
These will be fore dimension tables
Retail: Step 2 Declare Grain
“What level of data detail should be made available in a dimensional model?”
Use most detailed atomic information provides maximum analytical flexibility because it can be rolled up in every single way possible.
In many situations users are not given access to atomic data which reduces its capability.
Most granular data in retail -> POS transactions at product level.
Retail: Step 3 Identify Dimensions.
Typically what are the dimensions for a retail store?
Typically what are the dimensions for a retail store? Product Store Date/Time Payment Method Promotion Cashier
Retail: Step 4 Identify Facts
“Facts must be true to grain” Quantity Regular unit price Discount unit price Net unit price Extended discount price Extended Sales dollar amount Extended cost dollar amount Extended gross profit
what is retail dimension table
dimensions: date, payment method, product, promotion, cashier . in star in center is Retail Sales Facts
Modeling Time Dimension
Is date dimension needed?
Almost all businesses capture time series of performance metrics.
It is frequently used for partitioning data.
Unlike other dimensions you can build the date dimension in advance.
There are many date attributes not supported by the SQL date function such as week numbers, fiscal periods, seasons, holidays, and weekends.
Is date dimension needed?
Yes, mostly. Date dimension that can be created before you create other tables.Its standardized on the different databases. You need to standardize date format. Traverse over date dimensin. can capture time series of perfrmance metrics. you can partition the data.
what is Date Dimension
What should date dimension capture? Day, Month, Year Weekday, Weekend Day of Week Day of Month Calendar vs Fiscal
Date vs. Time of Day dimensions
Date -> 20yrs -> 7300 approx.
Time of Day -> 20 x 365 x 24 x 60 = 10,512,000
what is Product Dimension
Product dimension contains every single SKU (Stock Keeping Unit)
Hierarchy: Product
Numeric Values as Attributes or Facts
Should Standard List Price be in Product dimension or in the fact table?
Standard List Price change infrequently.
Standard List Price is non-additive, however, if multiplied by quantity, it gives sales amount which is additive.
Note:
If the numeric value is used primarily for calculation purpose, it likely belongs to the fact table.
If the stable numeric value is used predominantly for filtering and grouping, it should be treated as a dimensional attribute.
what is Store Dimension
Describes every store
Primary geographic dimension
Geographic Hierarchy
Internal District Hierarchy
what is Promotion Dimension
Most interesting dimension, often called causal dimension
Promotion conditions: price reduction, end aisle display, news paper ad, coupons
Whether a promotion is effective?
Need to capture each combination of promotion
Modeled as separate dimensions or combined
Null values in fact or dimension tables
what is causal dimension
Most interesting dimension, often called causal dimension
Promotion conditions: price reduction, end aisle display, news paper ad, coupons
what is Schema Extensibility
Example: Retailer adds frequent shopper program – how would this be supported?
Add new dimension table, surrogate keys: Prior to Frequent shopper Program, Frequent shopper Not Identified
Modeling at most granular level allows extensibility Extension schemas by adding New dimensions New dimension attributes New measured facts
What are Factless Fact Table
Some business processes track events without any real measures. If the event happens, we get an entry in the system; if not, there is no row.
Example: Employee hiring and firing, student attending a class.
The fact tables that track these events typically do not have any actual fact measurements, so they are called factless fact tables.
We usually add a column called EventCount containing number 1 to count the number of events by summing the EventCount fact.
Do you need a Promotion Dimension
Yes, because in product dimension you would have
a lot of rows with NULL values. How do you connect with Fact Table.
DD
attribute that doesnt have it’s own dimension. POST Transaction# . generally system generated values. can be used as serigate key.
Factless Fact Tables – Promotion Coverage
What products were on promotion but did not sell?
Add a Promotion Coverage fact table to include product, date, store and promotion attributes for products which are under promotion.
Factless Fact Tables – Promotion Coverage
The Promotion Coverage fact table is a factless fact table. Each row of this table shows the product on promotion regardless of whether the product sold.
The grain is different from fact table. not just what was sold but was on promotion.
Question: Find the products which were on promotion but not sold.
Step1: Query the Promotion Coverage table to identify the products under promotion on a given day.
Step 2: Determine what products sold from POS sales fact table on that day.
Step 3: Find the difference between above two lists of products.
Fact and Dimension Table Keys
Use surrogate keys as primary keys
Integer keys, sequential
Natural key: operational system identifier -> store number, time stamp
What are advantages of surrogate keys
Advantages of surrogate keys Buffer from operational changes Integrate multiple source systems Improve performance Handle null or unknown conditions Dimension attribute change tracking
snowflake schema
hierarchy. you are normalizing the tables.
Advantages of Normalization of DimTables
Saves space by reducing redundancy
Need to update only once.
Disadvantages of normalization of DimTables
Model becomes complex
Numerous joins make query slow
slicing query example:
-Slicing: This query generates a two dimensional slice by fixing (filtering) one of the dimension value.
SELECT CustomerID, ProductID, StoreID, Quantity FROM FactPurchase_t WHERE CustomerID=1;
dicing query example:
–Dicing: This query selects a subset over all dimensions.
SELECT CustomerID, ProductID, StoreID, Quantity FROM FactPurchase_t WHERE CustomerID=1 and ProductID=11 and StoreID=101;
all types rollup. cube
[dbo].[DimCustomer_t]
grouping sets
instead of union, to merge into reports
-The GROUPING SETS operator generates a result set equivalent to that generated by a UNION ALL of multiple simple GROUP BY clauses.
SELECT CustomerID, ProductID, StoreID, SUM(Quantity) AS TotalCount FROM FactPurchase_t GROUP BY GROUPING SETS(CustomerID, ProductID, StoreID);