Data Warehousing - Data Modeling Flashcards
What is a ‘dimension’ and what are it’s primary functions?
Dimensions are companions to facts. They categorize facts and measures to enable the analysis of data using simple queries. They focus on questions of who, when, where and what.
The primary key of a dimension should be a surrogate key and is typically an integer. This primary key is then related to facts to put the facts into context.
A dimension is a structure/table that categorizes ‘facts’ and ‘measures’ in order to enable users to answer business questions. They provide structured labeling information to otherwise unordered numeric measures.
Commonly used dimensions are people, products, place and time. It is a data-set of individual, non-overlapping data elements.
Dimension tables usually have a relatively small number of records compared to fact tables, but each record may have a very large number of attributes to describe the fact data.
What are the different ‘types’ of dimensions?
1) Slowly changing dimension
2) Conformed dimension
3) Junk dimension
4) Degenerate dimension
5) Role-playing dimension
6) Outrigger dimension
7) Shrunken dimension
8) Calendar date dimension
What is a ‘fact’ table?
Fact tables contain quantitative measurements, or facts, of a business process, commonly associated with points in time. They focus on answering questions like how much and how many.
They are used in trends, comparisons, aggregations, and groupings. They feed analysis and visualization tools to allow insights to be discovered about the functional area.
- It is located at the center of a star schema or a snowflake schema surrounded by dimension tables.
- Where multiple fact tables are used, these are arranged as a fact constellation schema.
- A fact table typically has two types of columns: those that contain facts and those that are a foreign key to dimension tables.
- The primary key of a fact table is usually a composite key that is made up of all of its foreign keys.
- Fact tables contain the content of the data warehouse and store different types of measures like additive, non-additive, and semi-additive measures.
- A fact table is a collection of keys and measures. The keys relate each row in the fact table to an associated row in a dimension table.
What is the ‘grain’ of a fact table?
The grain of a fact table represents the most atomic level by which the facts may be defined.
The grain of a SALES fact table might be stated as “Sales volume by Day by Product by Store”
A fact can be fine grained and represent a single event or transaction or it can be course grained and aggregate measurements over a period of time.
What is a ‘measure’?
A measure is a quantitative attribute of a fact (in a fact table) that is not a foreign key that creates a relationship to a dimension.
Measures are the core of the dimensional model and are data elements that can be summed, averaged, or mathematically manipulated.
Measures are data elements such as OrderQuantity, DiscountAmount, and TaxAmount that are used to determine all types of statistical information, such as the percentage of sales that were discounted or the total sales tax collected per region.
What is the ‘star schema’?
The simplest style of data warehouse schema, the star schema consists of one or more fact tables referencing any number of dimension tables.
The star schema gets its name from the physical model’s resemblance to a star shape with a fact table at its center and the dimension tables surrounding it representing the star’s points.
The star schema separates business process data into facts, which hold the measurable, quantitative data about a business, and dimensions which are descriptive attributes related to fact data.
What are the benefits and disadvantages of the star schema?
Benefits
- due to denormalization:
- simpler queries
- simpler business reporting logic
- query performance gains
- fast aggregations
- building OLAP cubes
Disadvantages
- data integrity no enforced, de-normalized state
What is the ‘Snowflake’ schema?
A variant of the Star Schema but the dimension tables are de-normalized and join with multiple related dimension tables.
The snowflake structure materialized when the dimensions of a star schema are detailed and highly structured, having several levels of relationship, and the child tables have multiple parent tables. The snowflake effect affects only the dimension tables and does not affect the fact tables.
Normalization tends to increase the number of tables that need to be joined in order to perform a given query, but reduces the space required to hold the data and the number of places where it needs to be updated if the data changes
From a space storage point of view, dimensional tables are typically small compared to fact tables. This often negates the potential storage-space benefits of the star schema as compared to the snowflake schema.
What is ‘dimensional modeling’?
Dimensional modeling is made up of both logical and physical modeling. Measures are the core of the dimensional model and are data elements that can be summed, averaged, or mathematically manipulated. Fact tables are at the center of the star schema (which is a physically implemented dimensional model), and data marts are made up of multiple fact tables.
Fact table > measures and keys > dimension tables
Datamart = a collection of Fact tables
Explain dimensional vs ER modeling?
Dimensional modeling is a logical design technique. Unlike ER modeling, which consists of conceptual, logical, and physical data modeling, dimensional modeling is made up of only logical and physical modeling.
There are sharp contrasts between ER and dimensional modeling. ER modeling is a design discipline that seeks to represent business rules as highly detailed relationships between business elements that are materialized as database tables. You can extrapolate the business rules from the types and cardinalities of the relationships in an ER model. The primary goal of ER modeling is to remove all non-key data redundancy.
Dimensional modeling, however, seeks to represent data in a logical, understandable manner. In dimensional modeling, you can control data redundancy by conforming dimension and fact tables. A table that’s been conformed can be used in more than one dimensional data model and is the same no matter how it’s used. The relationships in a dimensional model don’t represent business rules; instead, they’re navigational paths used to help write reports or create graphs.
What is ETL?
Extraction, transformation, and loading
How would you design a fact table?
- identify the business process for analysis (sales)
- identify the measures of facts (sales dollar)
- identify the dimensions for facts (product, location, time and org dimensions)
- list the columns that describe each dimension
- determine the lowest grain of summary (sales dollars)
What are the different types of fact tables?
Transactional Fact Table
- One row per fact occurring at a certain point in time
- “one row per line in a transaction”, e.g., every line on a receipt
Periodic Snapshot Fact Table
- One row per entity in the fact table
- takes a “picture of the moment”, where the moment could be any defined period of time, e.g. a performance summary of a salesman over the previous month. A periodic snapshot table is dependent on the transactional table
Accumulating Snapshot Fact Table
- holds a ‘snapshot’ of all data for a specific point in time
- used to show the activity of a process that has a well-defined beginning and end, e.g., the processing of an order
- An order moves through specific steps until it is fully processed. As steps towards fulfilling the order are completed, the associated row in the fact table is updated. An accumulating snapshot table often has multiple date columns, each representing a milestone in the process. Therefore, it’s important to have an entry in the associated date dimension that represents an unknown date, as many of the milestone dates are unknown at the time of the creation of the row.
Temporal Snapshot Fact Table
- each row represents a fact that occurred during a time interval not at a point in time
- llows to have the equivalent of daily snapshots without really having daily snapshots. It introduces the concept of Time Intervals into a fact table, allowing to save a lot of space, optimizing performances while allowing the end user to have the logical equivalent of the “picture of the moment” they are interested in.
What are the two popular design approaches when designing a data warehouse?
- Inmon vs Kimball
- top down vs bottom up
Bill Inmom
- top-down design
- normalized data model is designed first
- then the dimensional data marts are created from the data warehouse
Ralph Kimball
- bottom-up design
- the data marts facilitating reports and analysis are created first
- those are then combined to create data marts and a broad data warehouse
- remember data marts are a collection of fact tables and their related dimension tables
What are some industry recommendations when choosing between the two popular design approaches, Inmon vs Kimball?
Insurance - need for overall picture with interrelated data elements are best suited for the Inmon approach.
Marketing - specialized, no enterprise data warehouse needed, only data marts required so Kimball’s approach is suitable
CRM in banks - either, depending on the situation and need to link customer to details across business units. If limited scope then Kimball, if not then Inmon is better,
Manufacturing - multiple functions require enterprise model, so Inmon’s is ideal.