Data Warehousing - Data Modeling Flashcards

1
Q

What is a ‘dimension’ and what are it’s primary functions?

A

Dimensions are companions to facts. They categorize facts and measures to enable the analysis of data using simple queries. They focus on questions of who, when, where and what.

The primary key of a dimension should be a surrogate key and is typically an integer. This primary key is then related to facts to put the facts into context.

A dimension is a structure/table that categorizes ‘facts’ and ‘measures’ in order to enable users to answer business questions. They provide structured labeling information to otherwise unordered numeric measures.

Commonly used dimensions are people, products, place and time. It is a data-set of individual, non-overlapping data elements.

Dimension tables usually have a relatively small number of records compared to fact tables, but each record may have a very large number of attributes to describe the fact data.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

What are the different ‘types’ of dimensions?

A

1) Slowly changing dimension
2) Conformed dimension
3) Junk dimension
4) Degenerate dimension
5) Role-playing dimension
6) Outrigger dimension
7) Shrunken dimension
8) Calendar date dimension

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

What is a ‘fact’ table?

A

Fact tables contain quantitative measurements, or facts, of a business process, commonly associated with points in time. They focus on answering questions like how much and how many.

They are used in trends, comparisons, aggregations, and groupings. They feed analysis and visualization tools to allow insights to be discovered about the functional area.

  • It is located at the center of a star schema or a snowflake schema surrounded by dimension tables.
  • Where multiple fact tables are used, these are arranged as a fact constellation schema.
  • A fact table typically has two types of columns: those that contain facts and those that are a foreign key to dimension tables.
  • The primary key of a fact table is usually a composite key that is made up of all of its foreign keys.
  • Fact tables contain the content of the data warehouse and store different types of measures like additive, non-additive, and semi-additive measures.
  • A fact table is a collection of keys and measures. The keys relate each row in the fact table to an associated row in a dimension table.
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

What is the ‘grain’ of a fact table?

A

The grain of a fact table represents the most atomic level by which the facts may be defined.

The grain of a SALES fact table might be stated as “Sales volume by Day by Product by Store”

A fact can be fine grained and represent a single event or transaction or it can be course grained and aggregate measurements over a period of time.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

What is a ‘measure’?

A

A measure is a quantitative attribute of a fact (in a fact table) that is not a foreign key that creates a relationship to a dimension.

Measures are the core of the dimensional model and are data elements that can be summed, averaged, or mathematically manipulated.

Measures are data elements such as OrderQuantity, DiscountAmount, and TaxAmount that are used to determine all types of statistical information, such as the percentage of sales that were discounted or the total sales tax collected per region.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

What is the ‘star schema’?

A

The simplest style of data warehouse schema, the star schema consists of one or more fact tables referencing any number of dimension tables.

The star schema gets its name from the physical model’s resemblance to a star shape with a fact table at its center and the dimension tables surrounding it representing the star’s points.

The star schema separates business process data into facts, which hold the measurable, quantitative data about a business, and dimensions which are descriptive attributes related to fact data.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

What are the benefits and disadvantages of the star schema?

A

Benefits

  • due to denormalization:
  • simpler queries
  • simpler business reporting logic
  • query performance gains
  • fast aggregations
  • building OLAP cubes

Disadvantages
- data integrity no enforced, de-normalized state

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

What is the ‘Snowflake’ schema?

A

A variant of the Star Schema but the dimension tables are de-normalized and join with multiple related dimension tables.

The snowflake structure materialized when the dimensions of a star schema are detailed and highly structured, having several levels of relationship, and the child tables have multiple parent tables. The snowflake effect affects only the dimension tables and does not affect the fact tables.

Normalization tends to increase the number of tables that need to be joined in order to perform a given query, but reduces the space required to hold the data and the number of places where it needs to be updated if the data changes

From a space storage point of view, dimensional tables are typically small compared to fact tables. This often negates the potential storage-space benefits of the star schema as compared to the snowflake schema.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

What is ‘dimensional modeling’?

A

Dimensional modeling is made up of both logical and physical modeling. Measures are the core of the dimensional model and are data elements that can be summed, averaged, or mathematically manipulated. Fact tables are at the center of the star schema (which is a physically implemented dimensional model), and data marts are made up of multiple fact tables.

Fact table > measures and keys > dimension tables
Datamart = a collection of Fact tables

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

Explain dimensional vs ER modeling?

A

Dimensional modeling is a logical design technique. Unlike ER modeling, which consists of conceptual, logical, and physical data modeling, dimensional modeling is made up of only logical and physical modeling.

There are sharp contrasts between ER and dimensional modeling. ER modeling is a design discipline that seeks to represent business rules as highly detailed relationships between business elements that are materialized as database tables. You can extrapolate the business rules from the types and cardinalities of the relationships in an ER model. The primary goal of ER modeling is to remove all non-key data redundancy.

Dimensional modeling, however, seeks to represent data in a logical, understandable manner. In dimensional modeling, you can control data redundancy by conforming dimension and fact tables. A table that’s been conformed can be used in more than one dimensional data model and is the same no matter how it’s used. The relationships in a dimensional model don’t represent business rules; instead, they’re navigational paths used to help write reports or create graphs.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

What is ETL?

A

Extraction, transformation, and loading

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

How would you design a fact table?

A
  • identify the business process for analysis (sales)
  • identify the measures of facts (sales dollar)
  • identify the dimensions for facts (product, location, time and org dimensions)
  • list the columns that describe each dimension
  • determine the lowest grain of summary (sales dollars)
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

What are the different types of fact tables?

A

Transactional Fact Table

  • One row per fact occurring at a certain point in time
  • “one row per line in a transaction”, e.g., every line on a receipt

Periodic Snapshot Fact Table

  • One row per entity in the fact table
  • takes a “picture of the moment”, where the moment could be any defined period of time, e.g. a performance summary of a salesman over the previous month. A periodic snapshot table is dependent on the transactional table

Accumulating Snapshot Fact Table

  • holds a ‘snapshot’ of all data for a specific point in time
  • used to show the activity of a process that has a well-defined beginning and end, e.g., the processing of an order
  • An order moves through specific steps until it is fully processed. As steps towards fulfilling the order are completed, the associated row in the fact table is updated. An accumulating snapshot table often has multiple date columns, each representing a milestone in the process. Therefore, it’s important to have an entry in the associated date dimension that represents an unknown date, as many of the milestone dates are unknown at the time of the creation of the row.

Temporal Snapshot Fact Table

  • each row represents a fact that occurred during a time interval not at a point in time
  • llows to have the equivalent of daily snapshots without really having daily snapshots. It introduces the concept of Time Intervals into a fact table, allowing to save a lot of space, optimizing performances while allowing the end user to have the logical equivalent of the “picture of the moment” they are interested in.
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

What are the two popular design approaches when designing a data warehouse?

A
  • Inmon vs Kimball
  • top down vs bottom up

Bill Inmom

  • top-down design
  • normalized data model is designed first
  • then the dimensional data marts are created from the data warehouse

Ralph Kimball

  • bottom-up design
  • the data marts facilitating reports and analysis are created first
  • those are then combined to create data marts and a broad data warehouse
  • remember data marts are a collection of fact tables and their related dimension tables
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

What are some industry recommendations when choosing between the two popular design approaches, Inmon vs Kimball?

A

Insurance - need for overall picture with interrelated data elements are best suited for the Inmon approach.

Marketing - specialized, no enterprise data warehouse needed, only data marts required so Kimball’s approach is suitable

CRM in banks - either, depending on the situation and need to link customer to details across business units. If limited scope then Kimball, if not then Inmon is better,

Manufacturing - multiple functions require enterprise model, so Inmon’s is ideal.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

What are the ways (5) of analyzing the information requirements for data warehouses?

A

data-driven approach

  • best known is the semi-automatic method the ‘Dimensional Fact Model’
  • creates model from exisitng data sources

requirement-driven approach
* represent aspects of process-driven, goal-driven and user-driven

user-driven approach

  • business users are interviewed to define the requirements
  • ‘View point’ method, many peoplpe with different needs are involved

goal-driven approach

  • modeling of organizations and decisiong
  • top-down, bottom-up and integration
  • goal-question-metric approach for top-down

process-driven approach

  • identify most important business processes that requires measurement and control
  • metric-driven, AsIs and ToBe

integration approach

  • combination of multiple approaches is often best
  • user-driven, data-driven and goal-driven is recommended
17
Q

What is a data warehouse?

A

A datawarehouse is not “a database with a dimensional design”. A DW is a Subject Oriented, Integrated, Nonvolatile and Time Variant collection of data. A data warehouse architecture may or may not include “dimensional” designs - typically only for Marts, not necessarily as part of the DW

18
Q

What is OLAP?

A

On-Line Analytical Processing

It’s a method to analyze data.

OLAP is a set of operations that one can do on a data set, such as pivoting, slicing, dicing, drilling. For example, one can do OLAP operations with Excel PivotTables.

19
Q

What are the most popular OLAP operations?

A

1) Roll Up, drill up or aggregation
- group columns and add up the values
- less detailed data, summary

2) Roll Down or drill down
- reverse of roll up, more detailed data
- descending the concept hierarchy (week to day)

3) Slicing
- selection from ONE dimension

4) Dicing
- selection from TWO or more dimensions

5) Pivot (or Rotate)
- rotates the data axes to view the data from different perspectives
- groups data with different dimensions

20
Q

What is data modeling and what are the different types?

A

Data modeling is the process of creating and extending data models which are visual representations of data and its organization. The ERD Diagram (Entity Relationship Diagram) is the most popular type of data model. Data models exist at multiple levels including:

The Conceptual Data Model describes data from a high level. It defines the problem rather than the solution from the business point of view. It includes entities and their relationships. Typically the conceptual data model is developed first.

The Logical Data Model describes a logical solution to a data project. It provides more details than the conceptual data model and is nearly ready for the creation of a database. These details include attributes, the individual pieces of information that will be included. Typically the logical data model is developed second.

The Physical Data Model describes the implementation of data in a physical database. It is the blueprint for the database. Typically the physical data model is developed third.

21
Q

What are the different models that Cubes in a Data Warehouse may use for storage?

A

Cubes in a data warehouse are stored in three different modes. A relational storage model is called Relational Online Analytical Processing mode or ROLAP, while a Multidimensional Online Analytical processing mode is called MOLAP. When dimensions are stored in a combination of the two modes then it is known as Hybrid Online Analytical Processing mode or HOLAP.

22
Q

What are the uses, advantages or disadvantages when using ROLAP, MOLAP or HOLAP?

A

ROLAP: can handle large amounts of data and leverages relational database funtionalities. Poor performance and limited SQL functionalities.

MOLAP: traditional OLAP model, stored as multi-dimensional ‘Cubes’ not relational models. Excellent query performance. Handles limited amounts of data, proprietary technology and knowledge base.

HOLAP: combines strengths of each; MOLAP for summary data, ROLAP for drilling into details.

23
Q

What is Dimensionality Reduction?

A

The process of reducing the number of random variables under consideration by obtaining a set of principal variables. It can be divided into feature selection and feature extraction.

Feature selection
- tries to find a subset of the original variables (also called features or attributes). There are three strategies: the filter strategy (e.g. information gain), the wrapper strategy (e.g. search guided by accuracy), and the embedded strategy (features are selected to add or be removed while building the model based on the prediction errors).

Feature extraction
- Transforms the data in the high-dimensional space to a space of fewer dimensions. The data transformation may be linear, as in principal component analysis (PCA), but many nonlinear dimensionality reduction techniques also exist.[4][5] For multidimensional data, tensor representation can be used in dimensionality reduction through multilinear subspace learning.

24
Q

What are some success metrics to consider when analyzing and designing for a data warehouse?

A

Define measurable goals around:

  • performance
  • infrastructure/license/maintenance costs
  • usability/functionality.
25
Q

What are the primary functions of a dimension?

A

The primary functions of dimensions are to provide:

1) Filtering
2) Grouping
3) Labeling

These three functions are also called: ‘Slice and Dice’
Slicing is the filtering of data, Dicing is the grouping of data.

26
Q

What is a slowly changing dimension?

A

A set of data attributes that change slowly over a period of time rather than changing regularly e.g. address or name.

These attributes can change over a period of time and that will get combined as a slowly changing dimension.

These dimension can be classified into several types:

  • Type 0 (Retain original): Attributes never change. No history.
  • Type 1 (Overwrite): Old values are overwritten with new values for attribute. No history.
  • Type 2 (Add new row): For a new value, a new row is created with either a start date / end date or version. This creates a history.
  • Type 3 (Add new attribute): For a new value, a new columm is created. History is limited to the number of columns designated for storing historical data.
  • Type 4 (Add history table): One table keep the current value, where as the history is saved in a second table. This creates a history.
  • Type 5 (Combined Approach 1 + 4): Combination of type 1 and type 4. History is created through a second history table.
  • Type 6 (Combined Approach 1 + 2 + 3): Combination of type 1, type 2 and type 3. History is created through separate row and attributes.
  • Type 7 (Hybrid Approach): Both surrogate and natural key are used.[4]
27
Q

What is a Conformed dimension?

A

A conformed dimension is a dimension that has the same meaning to every fact with which it relates.

A conformed dimension is a set of data attributes that have been physically referenced in multiple database tables using the same key value to refer to the same structure, attributes, domain values, definitions and concepts. A conformed dimension cuts across many facts.

Dimensions are conformed when they are either exactly the same (including keys) or one is a proper subset of the other. Most important, the row headers produced in two different answer sets from the same conformed dimension(s) must be able to match perfectly.’

Conformed dimensions are either identical or strict mathematical subsets of the most granular, detailed dimension. Dimension tables are not conformed if the attributes are labeled differently or contain different values. Conformed dimensions come in several different flavors. At the most basic level, conformed dimensions mean exactly the same thing with every possible fact table to which they are joined. The date dimension table connected to the sales facts is identical to the date dimension connected to the inventory facts.[5]

28
Q

What is a Junk Dimension?

A

A Junk Dimension is a dimension table consisting of attributes that do not belong in the fact table or in any of the existing dimension tables.

The nature of these attributes is usually text or various flags, e.g. non-generic comments or just simple yes/no or true/false indicators.

These kinds of attributes are typically remaining when all the obvious dimensions in the business process have been identified and thus the designer is faced with the challenge of where to put these attributes that do not belong in the other dimensions.

29
Q

What is a Degenerate dimension?

A

Degenerate dimensions, also called fact dimensions, are standard dimensions that are constructed from attribute columns in fact tables instead of from attribute columns in dimension tables.

This is because useful dimensional data is sometimes stored in a fact table to reduce duplication, especially when you have a very large fact table.

  • is a key (like invoice #) that has no attributes and does not join to a dimension table
  • common when the grain of a fact table represents a single transaction item or line item because the degenerate dimension represents the unique identifier of the parent.
30
Q

What is a Role-playing dimension?

A

Role-playing dimensions are dimensions that are recycled for multiple use, like Date (created, hired, etc).

Dimensions are often recycled for multiple applications within the same database. For instance, a “Date” dimension can be used for “Date of Sale”, as well as “Date of Delivery”, or “Date of Hire”.

This is often referred to as a “role-playing dimension”. This can be implemented using a view over the same dimension table.

31
Q

What is an Outrigger dimension?

A

As per Kimball, when a dimension in the DWH is linked with another dimension table, the secondary table is call an outrigger dimension.

For example, a bank account dimension can reference a separate dimension representing the date the account was opened.

These secondary dimension references are called outrigger dimensions. Outrigger dimensions are permissible, but should be used sparingly.

In most cases, the correlations between dimensions should be demoted to a fact table, where both dimensions are represented as separate foreign keys.

  • Usually dimension tables do not reference other dimensions via foreign keys. When this happens, the referenced dimension is called an outrigger dimension. - Outrigger dimensions should be considered a data warehouse anti-pattern: it is considered a better practice to use some fact tables that relate the two dimensions
32
Q

What is a Shrunken dimension?

A

A shrunken dimension entity is a perfect subset of a more detailed, granular dimension entity.

In this case, the attributes that are common to both the detailed and shrunken subset dimension have the same attribute names, definitions, and domain values.

  • Conformed dimensions are said to be a shrunken dimension when it includes a subset of the rows and/or columns of the original dimension.
33
Q

What is a Calendar date dimension?

A

A special type of dimension can be used to represent dates with a granularity of a day. Dates would be referenced in a fact table as foreign keys to a date dimension. The date dimension primary key could be a surrogate key or a number using the format YYYYMMDD.

The date dimension can include other attributes like the week of the year, or flags representing work days, holidays, etc. It could also include special rows representing: not known dates, or yet to be defined dates.

The date dimension should be initialized with all the required dates, say the next 10 years of dates, or more if required, or past dates if events in the past are handled. Time instead is usually best represented as a timestamp in the fact table.

Calendar date dimensions are attached to virtually every fact table to allow navigation of the fact table through familiar dates, months, fiscal periods, and special days on the calendar.

You would never want to compute Easter in SQL, but rather want to look it up in the calendar date dimension. The calendar date dimension typically has many attributes describing characteristics such as week number, month name, fiscal period, and national holiday indicator.

To facilitate partitioning, the primary key of a date dimension can be more meaningful, such as an integer representing YYYYMMDD, instead of a sequentially-assigned surrogate key. However, the date dimension table needs a special row to represent unknown or to-be-determined dates. If a smart date key is used, filtering and grouping should be based on the dimension table’s attributes, not the smart key.

When further precision is needed, a separate date/time stamp can be added to the fact table. The date/time stamp is not a foreign key to a dimension table, but rather is a standalone column. If business users constrain or group on time-of-day attributes, such as day part grouping or shift number, then you would add a separate time-of-day dimension foreign key to the fact table.

34
Q

Difference between Facts and Dimensions?

A

Fact tables are the core tables of a data warehouse. They contain quantitative information, commonly associated with points in time. They are used in trends, comparisons, aggregations, and groupings. They feed analysis and visualization tools to allow insights to be discovered about the functional area.

Dimensions, on the other hand, are collections of reference information about the facts in a data warehouse. Dimensions categorize and describe the facts recorded in a data warehouse to provide meaningful, categorized, and descriptive answers to business questions.