Data Warehousing - Data Modeling Flashcards
What is a ‘dimension’ and what are it’s primary functions?
Dimensions enable business intelligence users to analyze data using simple queries. They focus on questions of: who, when, where and what.
The primary key of a dimension should be a surrogate key and is typically an integer. This primary key is then related to facts to put the facts into context.
A dimension is a structure/table that categorizes ‘facts’ and ‘measures’ in order to enable users to answer business questions. They provide structured labeling information to otherwise unordered numeric measures.
Commonly used dimensions are: people, products, place and time. It is a data-set of individual, non-overlapping data elements.
The primary functions of dimensions are to provide:
1) Filtering
2) Grouping
3) Labeling
These three functions are also called: ‘Slice and Dice’
Slicing is the filtering of data, Dicing is the grouping of data.
Dimension tables usually have a relatively small number of records compared to fact tables, but each record may have a very large number of attributes to describe the fact data.
What are the different ‘types’ of dimensions?
1) Conformed dimension
- dimensions that share same key and attributes, structure. Identical or a subset.
2) Junk dimension
A Junk Dimension is a dimension table consisting of attributes that do not belong in the fact table or in any of the existing dimension tables. The nature of these attributes is usually text or various flags, e.g. non-generic comments or just simple yes/no or true/false indicators. These kinds of attributes are typically remaining when all the obvious dimensions in the business process have been identified and thus the designer is faced with the challenge of where to put these attributes that do not belong in the other dimensions.
3) Degenerate dimension
- is a key (like invoice #) that has no attributes and does not join to a dimension table
- common when the grain of a fact table represents a single transaction item or line item because the degenerate dimension represents the unique identifier of the parent.
4) Role-playing dimension
- recycled for multiple use, like Date (created, hired, etc)
What is a ‘fact’ table?
Facts contain quantitative measurements. They focus on the answering the questions: how much and how many.
Fact tables are at the center of the star schema (which is a physically implemented dimensional model), and data marts are made up of multiple fact tables.
A fact table is a collection of keys and measures. The keys relate each row in the fact table to an associated row in a dimension table.
A Fact table consists of the measurements, metrics or facts of a business process.
A fact table typically has two types of columns: those that contain facts and those that are a foreign key to dimension tables. The primary key of a fact table is usually a composite key that is made up of all of its foreign keys. Fact tables contain the content of the data warehouse and store different types of measures like additive, non additive, and semi additive measures.
What is the ‘grain’ of a fact table?
The grain of a fact table represents the most atomic level by which the facts may be defined.
The grain of a SALES fact table might be stated as “Sales volume by Day by Product by Store”
A fact can be fine grained and represent a single event or transaction or it can be course grained and aggregate measurements over a period of time.
What is a ‘measure’?
Measures are the core of the dimensional model and are data elements that can be summed, averaged, or mathematically manipulated.
-Measures (anything that is not a ‘key’ column), are the operational data that is made ready for analysis.
Measures are data elements such as OrderQuantity, DiscountAmount, and TaxAmount that are used to determine all types of statistical information, such as the percentage of sales that were discounted or the total sales tax collected per region.
What is the ‘star schema’?
The simplest style of data mart schema and is the approach most widely used to develop data warehouses and dimensional data marts.
The star schema consists of one or more fact tables referencing any number of dimension tables.
The star schema gets its name from the physical model’s resemblance to a star shape with a fact table at its center and the dimension tables surrounding it representing the star’s points.
The star schema separates business process data into facts, which hold the measurable, quantitative data about a business, and dimensions which are descriptive attributes related to fact data.
What are the benefits and disadvantages of the star schema?
Benefits
- due to denormalization:
- simpler queries
- simpler business reporting logic
- query performance gains
- fast aggregations
- building OLAP cubes
Disadvantages
- data integrity no enforced, de-normalized state
What is the ‘Snowflake’ schema?
Same as the Star Schema but the dimension tables are de-normalized and join with small dimension tables.
Normalization tends to increase the number of tables that need to be joined in order to perform a given query, but reduces the space required to hold the data and the number of places where it needs to be updated if the data changes
From a space storage point of view, dimensional tables are typically small compared to fact tables. This often negates the potential storage-space benefits of the star schema as compared to the snowflake schema.
What is ‘dimensional modeling’?
Dimensional modeling is made up of both logical and physical modeling. Measures are the core of the dimensional model and are data elements that can be summed, averaged, or mathematically manipulated. Fact tables are at the center of the star schema (which is a physically implemented dimensional model), and data marts are made up of multiple fact tables.
Fact table > measures and keys > dimension tables
Datamart = a collection of Fact tables
Explain dimensional vs ER modeling?
Dimensional modeling is a logical design technique. Unlike ER modeling, which consists of conceptual, logical, and physical data modeling, dimensional modeling is made up of only logical and physical modeling.
There are sharp contrasts between ER and dimensional modeling. ER modeling is a design discipline that seeks to represent business rules as highly detailed relationships between business elements that are materialized as database tables. You can extrapolate the business rules from the types and cardinalities of the relationships in an ER model. The primary goal of ER modeling is to remove all non-key data redundancy.
Dimensional modeling, however, seeks to represent data in a logical, understandable manner. In dimensional modeling, you can control data redundancy by conforming dimension and fact tables. A table that’s been conformed can be used in more than one dimensional data model and is the same no matter how it’s used. The relationships in a dimensional model don’t represent business rules; instead, they’re navigational paths used to help write reports or create graphs.
What is ETL?
Extraction, transformation, and loading
How would you design a fact table?
- identify the business process for analysis (sales)
- identify the measures of facts (sales dollar)
- identify the dimensions for facts (product, location, time and org dimensions)
- list the columns that describe each dimension
- determine the lowest grain of summary (sales dollars)
What are the different types of fact tables?
Transactional Fact Table
- One row per fact occurring at a certain point in time
- “one row per line in a transaction”, e.g., every line on a receipt
Periodic Snapshot Fact Table
- One row per entity in the fact table
- takes a “picture of the moment”, where the moment could be any defined period of time, e.g. a performance summary of a salesman over the previous month. A periodic snapshot table is dependent on the transactional table
Accumulating Snapshot Fact Table
- holds a ‘snapshot’ of all data for a specific point in time
- used to show the activity of a process that has a well-defined beginning and end, e.g., the processing of an order
- An order moves through specific steps until it is fully processed. As steps towards fulfilling the order are completed, the associated row in the fact table is updated. An accumulating snapshot table often has multiple date columns, each representing a milestone in the process. Therefore, it’s important to have an entry in the associated date dimension that represents an unknown date, as many of the milestone dates are unknown at the time of the creation of the row.
Temporal Snapshot Fact Table
- each row represents a fact that occurred during a time interval not at a point in time
- llows to have the equivalent of daily snapshots without really having daily snapshots. It introduces the concept of Time Intervals into a fact table, allowing to save a lot of space, optimizing performances while allowing the end user to have the logical equivalent of the “picture of the moment” they are interested in.
What are the two popular design approaches when designing a data warehouse?
- Inmon vs Kimball
- top down vs bottom up
Bill Inmom
- top-down design
- normalized data model is designed first
- then the dimensional data marts are created from the data warehouse
Ralph Kimball
- bottom-up design
- the data marts facilitating reports and analysis are created first
- those are then combined to create data marts and a broad data warehouse
- remember data marts are a collection of fact tables and their related dimension tables
What are some industry recommendations when choosing between the two popular design approaches, Inmon vs Kimball?
Insurance - need for overall picture with interrelated data elements are best suited for the Inmon approach.
Marketing - specialized, no enterprise data warehouse needed, only data marts required so Kimball’s approach is suitable
CRM in banks - either, depending on the situation and need to link customer to details across business units. If limited scope then Kimball, if not then Inmon is better,
Manufacturing - multiple functions require enterprise model, so Inmon’s is ideal.