Data Warehousing - Data Modeling Flashcards

1
Q

What is a ‘dimension’ and what are it’s primary functions?

A

Dimensions enable business intelligence users to analyze data using simple queries. They focus on questions of: who, when, where and what.

The primary key of a dimension should be a surrogate key and is typically an integer. This primary key is then related to facts to put the facts into context.

A dimension is a structure/table that categorizes ‘facts’ and ‘measures’ in order to enable users to answer business questions. They provide structured labeling information to otherwise unordered numeric measures.

Commonly used dimensions are: people, products, place and time. It is a data-set of individual, non-overlapping data elements.

The primary functions of dimensions are to provide:

1) Filtering
2) Grouping
3) Labeling

These three functions are also called: ‘Slice and Dice’
Slicing is the filtering of data, Dicing is the grouping of data.

Dimension tables usually have a relatively small number of records compared to fact tables, but each record may have a very large number of attributes to describe the fact data.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

What are the different ‘types’ of dimensions?

A

1) Conformed dimension
- dimensions that share same key and attributes, structure. Identical or a subset.

2) Junk dimension
A Junk Dimension is a dimension table consisting of attributes that do not belong in the fact table or in any of the existing dimension tables. The nature of these attributes is usually text or various flags, e.g. non-generic comments or just simple yes/no or true/false indicators. These kinds of attributes are typically remaining when all the obvious dimensions in the business process have been identified and thus the designer is faced with the challenge of where to put these attributes that do not belong in the other dimensions.

3) Degenerate dimension
- is a key (like invoice #) that has no attributes and does not join to a dimension table
- common when the grain of a fact table represents a single transaction item or line item because the degenerate dimension represents the unique identifier of the parent.

4) Role-playing dimension
- recycled for multiple use, like Date (created, hired, etc)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

What is a ‘fact’ table?

A

Facts contain quantitative measurements. They focus on the answering the questions: how much and how many.

Fact tables are at the center of the star schema (which is a physically implemented dimensional model), and data marts are made up of multiple fact tables.

A fact table is a collection of keys and measures. The keys relate each row in the fact table to an associated row in a dimension table.

A Fact table consists of the measurements, metrics or facts of a business process.

A fact table typically has two types of columns: those that contain facts and those that are a foreign key to dimension tables. The primary key of a fact table is usually a composite key that is made up of all of its foreign keys. Fact tables contain the content of the data warehouse and store different types of measures like additive, non additive, and semi additive measures.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

What is the ‘grain’ of a fact table?

A

The grain of a fact table represents the most atomic level by which the facts may be defined.

The grain of a SALES fact table might be stated as “Sales volume by Day by Product by Store”

A fact can be fine grained and represent a single event or transaction or it can be course grained and aggregate measurements over a period of time.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

What is a ‘measure’?

A

Measures are the core of the dimensional model and are data elements that can be summed, averaged, or mathematically manipulated.

-Measures (anything that is not a ‘key’ column), are the operational data that is made ready for analysis.

Measures are data elements such as OrderQuantity, DiscountAmount, and TaxAmount that are used to determine all types of statistical information, such as the percentage of sales that were discounted or the total sales tax collected per region.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

What is the ‘star schema’?

A

The simplest style of data mart schema and is the approach most widely used to develop data warehouses and dimensional data marts.

The star schema consists of one or more fact tables referencing any number of dimension tables.

The star schema gets its name from the physical model’s resemblance to a star shape with a fact table at its center and the dimension tables surrounding it representing the star’s points.

The star schema separates business process data into facts, which hold the measurable, quantitative data about a business, and dimensions which are descriptive attributes related to fact data.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

What are the benefits and disadvantages of the star schema?

A

Benefits

  • due to denormalization:
  • simpler queries
  • simpler business reporting logic
  • query performance gains
  • fast aggregations
  • building OLAP cubes

Disadvantages
- data integrity no enforced, de-normalized state

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

What is the ‘Snowflake’ schema?

A

Same as the Star Schema but the dimension tables are de-normalized and join with small dimension tables.

Normalization tends to increase the number of tables that need to be joined in order to perform a given query, but reduces the space required to hold the data and the number of places where it needs to be updated if the data changes

From a space storage point of view, dimensional tables are typically small compared to fact tables. This often negates the potential storage-space benefits of the star schema as compared to the snowflake schema.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

What is ‘dimensional modeling’?

A

Dimensional modeling is made up of both logical and physical modeling. Measures are the core of the dimensional model and are data elements that can be summed, averaged, or mathematically manipulated. Fact tables are at the center of the star schema (which is a physically implemented dimensional model), and data marts are made up of multiple fact tables.

Fact table > measures and keys > dimension tables
Datamart = a collection of Fact tables

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

Explain dimensional vs ER modeling?

A

Dimensional modeling is a logical design technique. Unlike ER modeling, which consists of conceptual, logical, and physical data modeling, dimensional modeling is made up of only logical and physical modeling.

There are sharp contrasts between ER and dimensional modeling. ER modeling is a design discipline that seeks to represent business rules as highly detailed relationships between business elements that are materialized as database tables. You can extrapolate the business rules from the types and cardinalities of the relationships in an ER model. The primary goal of ER modeling is to remove all non-key data redundancy.

Dimensional modeling, however, seeks to represent data in a logical, understandable manner. In dimensional modeling, you can control data redundancy by conforming dimension and fact tables. A table that’s been conformed can be used in more than one dimensional data model and is the same no matter how it’s used. The relationships in a dimensional model don’t represent business rules; instead, they’re navigational paths used to help write reports or create graphs.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

What is ETL?

A

Extraction, transformation, and loading

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

How would you design a fact table?

A
  • identify the business process for analysis (sales)
  • identify the measures of facts (sales dollar)
  • identify the dimensions for facts (product, location, time and org dimensions)
  • list the columns that describe each dimension
  • determine the lowest grain of summary (sales dollars)
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

What are the different types of fact tables?

A

Transactional Fact Table

  • One row per fact occurring at a certain point in time
  • “one row per line in a transaction”, e.g., every line on a receipt

Periodic Snapshot Fact Table

  • One row per entity in the fact table
  • takes a “picture of the moment”, where the moment could be any defined period of time, e.g. a performance summary of a salesman over the previous month. A periodic snapshot table is dependent on the transactional table

Accumulating Snapshot Fact Table

  • holds a ‘snapshot’ of all data for a specific point in time
  • used to show the activity of a process that has a well-defined beginning and end, e.g., the processing of an order
  • An order moves through specific steps until it is fully processed. As steps towards fulfilling the order are completed, the associated row in the fact table is updated. An accumulating snapshot table often has multiple date columns, each representing a milestone in the process. Therefore, it’s important to have an entry in the associated date dimension that represents an unknown date, as many of the milestone dates are unknown at the time of the creation of the row.

Temporal Snapshot Fact Table

  • each row represents a fact that occurred during a time interval not at a point in time
  • llows to have the equivalent of daily snapshots without really having daily snapshots. It introduces the concept of Time Intervals into a fact table, allowing to save a lot of space, optimizing performances while allowing the end user to have the logical equivalent of the “picture of the moment” they are interested in.
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

What are the two popular design approaches when designing a data warehouse?

A
  • Inmon vs Kimball
  • top down vs bottom up

Bill Inmom

  • top-down design
  • normalized data model is designed first
  • then the dimensional data marts are created from the data warehouse

Ralph Kimball

  • bottom-up design
  • the data marts facilitating reports and analysis are created first
  • those are then combined to create data marts and a broad data warehouse
  • remember data marts are a collection of fact tables and their related dimension tables
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

What are some industry recommendations when choosing between the two popular design approaches, Inmon vs Kimball?

A

Insurance - need for overall picture with interrelated data elements are best suited for the Inmon approach.

Marketing - specialized, no enterprise data warehouse needed, only data marts required so Kimball’s approach is suitable

CRM in banks - either, depending on the situation and need to link customer to details across business units. If limited scope then Kimball, if not then Inmon is better,

Manufacturing - multiple functions require enterprise model, so Inmon’s is ideal.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

What are the ways (5) of analyzing the information requirements for data warehouses?

A

data-driven approach

  • best known is the semi-automatic method the ‘Dimensional Fact Model’
  • creates model from exisitng data sources

requirement-driven approach
* represent aspects of process-driven, goal-driven and user-driven

user-driven approach

  • business users are interviewed to define the requirements
  • ‘View point’ method, many peoplpe with different needs are involved

goal-driven approach

  • modeling of organizations and decisiong
  • top-down, bottom-up and integration
  • goal-question-metric approach for top-down

process-driven approach

  • identify most important business processes that requires measurement and control
  • metric-driven, AsIs and ToBe

integration approach

  • combination of multiple approaches is often best
  • user-driven, data-driven and goal-driven is recommended
17
Q

What is a data warehouse?

A

A datawarehouse is not “a database with a dimensional design”. A DW is a Subject Oriented, Integrated, Nonvolatile and Time Variant collection of data. A data warehouse architecture may or may not include “dimensional” designs - typically only for Marts, not necessarily as part of the DW

18
Q

What is OLAP?

A

On-Line Analytical Processing

It’s a method to analyze data.

OLAP is a set of operations that one can do on a data set, such as pivoting, slicing, dicing, drilling. For example, one can do OLAP operations with Excel PivotTables.

19
Q

What are the most popular OLAP operations?

A

1) Roll Up, drill up or aggregation
- group columns and add up the values
- less detailed data, summary

2) Roll Down or drill down
- reverse of roll up, more detailed data
- descending the concept hierarchy (week to day)

3) Slicing
- selection from ONE dimension

4) Dicing
- selection from TWO or more dimensions

5) Pivot (or Rotate)
- rotates the data axes to view the data from different perspectives
- groups data with different dimensions

20
Q

What is data modeling and what are the different types?

A

Data modeling is the process of creating and extending data models which are visual representations of data and its organization. The ERD Diagram (Entity Relationship Diagram) is the most popular type of data model. Data models exist at multiple levels including:

The Conceptual Data Model describes data from a high level. It defines the problem rather than the solution from the business point of view. It includes entities and their relationships. Typically the conceptual data model is developed first.

The Logical Data Model describes a logical solution to a data project. It provides more details than the conceptual data model and is nearly ready for the creation of a database. These details include attributes, the individual pieces of information that will be included. Typically the logical data model is developed second.

The Physical Data Model describes the implementation of data in a physical database. It is the blueprint for the database. Typically the physical data model is developed third.

21
Q

What are the different models that Cubes in a Data Warehouse may use for storage?

A

Cubes in a data warehouse are stored in three different modes. A relational storage model is called Relational Online Analytical Processing mode or ROLAP, while a Multidimensional Online Analytical processing mode is called MOLAP. When dimensions are stored in a combination of the two modes then it is known as Hybrid Online Analytical Processing mode or HOLAP.

22
Q

What are the uses, advantages or disadvantages when using ROLAP, MOLAP or HOLAP?

A

ROLAP: can handle large amounts of data and leverages relational database funtionalities. Poor performance and limited SQL functionalities.

MOLAP: traditional OLAP model, stored as multi-dimensional ‘Cubes’ not relational models. Excellent query performance. Handles limited amounts of data, proprietary technology and knowledge base.

HOLAP: combines strengths of each; MOLAP for summary data, ROLAP for drilling into details.

23
Q

What is Dimensionality Reduction?

A

The process of reducing the number of random variables under consideration by obtaining a set of principal variables. It can be divided into feature selection and feature extraction.

Feature selection
- tries to find a subset of the original variables (also called features or attributes). There are three strategies: the filter strategy (e.g. information gain), the wrapper strategy (e.g. search guided by accuracy), and the embedded strategy (features are selected to add or be removed while building the model based on the prediction errors).

Feature extraction
- Transforms the data in the high-dimensional space to a space of fewer dimensions. The data transformation may be linear, as in principal component analysis (PCA), but many nonlinear dimensionality reduction techniques also exist.[4][5] For multidimensional data, tensor representation can be used in dimensionality reduction through multilinear subspace learning.

24
Q

What are some success metrics to consider when analyzing and designing for a data warehouse?

A

Define measurable goals around:

  • performance
  • infrastructure/license/maintenance costs
  • usability/functionality.