T3 – Data Warehousing Flashcards

Question 1

Q

What is the definition of the data warehouse?

Answer

A

“The data warehouse is a collection of integrated, subject-oriented databases designed to support DSS functions, where each unit of data is non-volatile and relevant to some moment in time”

Question 2

Q

From the Characteristics of Data Warehousing: what is subject oriented?

Answer

A

data are organized by detailed subject, containing information relevant for decision support.

This ability to define a data warehouse by subject matter makes the data warehouse subject oriented.

e.g. sales, products, or customers. For example, to learn more about your company’s sales data, you can build a warehouse that concentrates on sales. Using this warehouse, you can answer questions like “Who was our best customer for this item last year?”

Question 3

Q

From the Characteristics of Data Warehousing: what is integrated?

Answer

A

data that is gathered into the data warehouse from a variety of sources and merged into a coherent whole. Integration is closely related to subject orientation. Data warehouses must put data from disparate sources into a consistent format. They must resolve such problems as naming conflicts and inconsistencies among units of measure. When they achieve this, they are said to be integrated.

Question 4

Q

From the Characteristics of Data Warehousing: what is time variant?

Answer

A

In order to discover trends in business, analysts need large amounts of data. This is very much in contrast to online transaction processing (OLTP) systems, where performance requirements demand that historical data be moved to an archive. A data warehouse’s focus on change over time is what is meant by the term time variant.

Question 5

Q

From the Characteristics of Data Warehousing: what is nonvolatile?

Answer

A

Nonvolatile means that, once entered into the warehouse, data should not change. This is logical because the purpose of a warehouse is to enable you to analyze what has occurred

Question 6

Q

What in short are the 4 defining characteristics of the DWH?

Answer

A

4 Defining Characteristics:

integrated: data will be connected
subject-oriented: measured things depend on subject we are treating
non-volatile: once data is loaded in data warehouse, it is not updated (tasks will be performed on top of data)
relevant to some moment in time: uploading data over time, creating history

Question 7

Q

What are three main types of data warehouses?

Answer

A

Data marts
Operational Data Stores (ODS)
Enterprise Data Warehouses (EDW)

Question 8

Q

What is a data mart?

Answer

A

Whereas a data warehouse combines databases across an entire enterprise, a data mart is usually smaller and focuses on a particular subject or department.
A data mart is a subset of a data warehouse, typically consisting of a single subject area (e.g. marketing, operations).
Two possibilities:
- Dependent data mart
- Independent data mart

Question 9

Q

What is a dependent data mart?

Answer

A

= subset that is created directly from the DWH
Has the advantages of using a consistent data model & provides quality data
Dependent data marts support the concept of a single enterprise-wide data model, but the DWH must be constructed first
A dependent data mart ensures that the end user is viewing the same version of the data that are accessed by all other DWH users

Question 10

Q

What is an independent data mart?

Answer

A

Lower-cost, scaled-down version of a DWH
A small DM designed for a strategic busines unit or a department, but is source is not an EDW
potential disadvantage: each department has their own and e.g. different departments might end up with different numbers -> as e.g. marketing calculates a specific number differently then finance (there is no universal standard)

Question 11

Q

What are OPERATIONAL DATA STORES (ODS)?

Answer

A

A type of database often used as an interim area for a DWH
Unlike the static content of a DWH, the contents of an ODS are updated through the course of business operations.
An ODS is used for short-term decisions involving mission-critical applications rather then for the medium and long-term decisions associated with a EDW
An ODS consolidates data from multiple source systems and provides a near-real time, integrated view of volatile, current data.

Question 12

Q

What are OPER MARTS?

Answer

A

An operational data mart.
An oper mart is a small-scale data mart typically used by a single department or functional area in an organization.
Oper marts are created when operational data need to be analyzed multidimensionally.
The data for an oper mart comes from an ODS

Question 13

Q

What is an ENTERPRISE DATA WAREHOUSE
(EDW)?

Answer

A

An enterprise data warehouse (EDW) is a large-scale DWH that is used across the enterprise for decision support.
The large-scale nature provides integration of data from many sources into a standard format for effective BI and decision support applications.
EDW are used to provide data for many types of DSS, including CRM, SCM, BPM, BAZM, PLM, KMS, etc.

Question 14

Q

What is Metadata?

Answer

A

Data about data.
In a DWH, metadata describe the contents of a DWH and the manner of its acquisition and use
We can classify metadata in: technical vs business metadata
We can also make a distinction between:
- Syntactic metadata – (i.e. data describing the syntax of the data)
- Structural metadata – (i.e. data describing the structure of the data)
- Semantic metadata – (i.e. data describing the meaning of the data in a specific domain)
Metadata design, creation and usage can involve ethical questions

Question 15

Q

What are the Two Data Warehousing Strategies?

Answer

A

Enterprise-wide warehouse, top down, the Inmon methodology
Data mart, bottom up, the Kimball methodology

Question 16

Q

What is the Data Mart Strategy (bottom up, Kimball)?

Answer

A

The most common approach
Begins with a single mart and architected marts are added over time for more subject areas
Relatively inexpensive and easy to implement
Can be used as a proof of concept for data warehousing
Can perpetuate the “silos of information” problem
Can postpone difficult decisions and activities
Requires an overall integration plan

Question 17

Q

What is the Enterprise-Wide Strategy (top down, Inmon)?

Answer

A

A comprehensive warehouse is built initially
An initial dependent data mart is built from subset of the data in the warehouse
Additional data marts are built from subsets of the warehouse data
Complex project: expensive, time consuming & prone to failure
When successful: results in an integrated, scalable warehouse

Note: if organization has ERP and that is the only system available, it is easy to implement the enterprise-wide strategy

Question 18

Q

What is the difference between OLTP and DWH in terms of indexes?

Answer

A

OLTP: few
DWH: many

What are Indexes?
Indexes are data structures which hold field values from the indexed column(s) and pointers to the related record(s). This data structure is then sorted and binary searches are performed to quickly find the record.

Question 19

Q

What is the difference between OLTP and DWH in terms of joins?

Answer

A

OLTP: many
DWH: some

Question 20

Q

What is the difference between OLTP and DWH in terms of duplicated data?

Answer

A

OLTP: normalized DBMS (=database management system)
DWH: denormalized DBMS (=database management system)

Normalization = minimize the redundancy

Question 21

Q

What is the difference between OLTP and DWH in terms of derived data and aggregates?

Answer

A

OLTP: rare
DWH: common

Question 22

Q

What is dimensional modelling?

Answer

A

A retrieval-based system that supports high-volume query access
The so called dimensional modeling data base design technique was specifically conceived for data warehouse design, answering the limitation found in the traditional approaches to data base design.
This approach was defined initially by Kimball (1996), who claimed that he wasn’t inventing nothing new, just documenting what people were doing for many years.

Question 23

Q

What are the objectives of dimensional modelling?

Answer

A

To produce data base structures the end user easily understands and can query them.
To optimize queries performance (in opposition to update performance)

Question 24

Q

What are facts?

Answer

A

Facts are the important entity: a sale

Facts have measures that can be aggregated: sales price

Question 25

Q

What are dimensions?

Answer

A

Dimensions describe facts

A sale has the dimensions Product, Store and Time

Question 26

Q

What are the goals for dimensional modeling?

Answer

A

Surround facts with as much context (dimensions) as possible
But you should not try to model all relationships in the data (unlike ER modeling!)

Question 27

Q

What is a normalized database system?

Answer

A

normalized database system = whenever there is a change in e.g. address, it will only be changed ONCE in ONE of the tables, other tables only refer to that table then -> enforcing referential integrity

OPTIMIZED for transactional data systems

Question 28

Q

What are the characteristics of a cube?

Answer

A

A “cube” may have many dimensions!
Theoretically no limit for the number of dimensions (Typically 4-12 dimensions, But only 2-3 dimensions can be viewed at a time)
Dimensionality reduced by queries via projection/aggregation
A cube consists of cells
- A given combination of dimension values
- empty cell = no data for this combination

Question 29

Q

What is a sparse cube vs a dense cube?

Answer

A

Sparse cube: few non-empty cells
Dense cube: many non-empty cells

Note: Cubes become sparse at high dimensionality

Question 30

Q

What are Fact Tables?

Answer

A

A fact table contains a large number of rows that correspond to observed business or facts.
A fact table contains:
- attributes needed to perform decision analysis,
- descriptive attributes used for query reporting,
- foreign keys to link to dimension tables.
The decision analysis attributes consist of:
- performance measures,
- operational metrics,
- aggregated measures,
- and all the other metrics needed to analyze the organization’s performance.

In other words, the fact table primarily addresses what the data warehouse supports for decision analysis

Question 31

Q

What are FACTS?

Answer

A

Facts represent the subject of the desired analysis
- The “important” in the business that should be analyzed
A fact is most often identified via its dimension values
- A fact is a non-empty cell
- Some models give facts an explicit identity
Generally a fact should
- Be attached to exactly one dimension value in each dimension
- Only be attached to dimension values in the bottom levels

Question 32

Q

What is an Event fact?

Answer

A

Event fact (transaction), a fact for every business event (sale)

Question 33

Q

What is a “fact-less” fact?

Answer

A

“Fact-less” facts:

A fact per event (customer contact)
No numerical measures
An event has happened for a given dimension value combination

Question 34

Q

What is a snapshot fact?

Answer

A

Snapshot fact

A fact for every dimension combination at given time intervals
Captures current status (inventory)
Cannot be summed over time

Question 35

Q

What are cumulative snapshot facts?

Answer

A

Cumulative snapshot facts

A fact for every dimension combination at given time intervals
Captures cumulative status up to now (sales in year to date)
e.g. sales

Question 36

Q

What is the granularity of facts?

Answer

A

granularity = decision on the lowest level of detail when recording the data. (you can always summarize to a higher level, but not go into more detail later)

What does a single fact mean?
Level of detail
Given by combination of bottom levels
Example: “total sales per store per day per product”

Question 37

Q

What are measures?

Answer

A

Measures represent the fact property that the users want to study and optimize
Example: total sales price

Question 38

Q

What are the two components of a measure?

Answer

A

A measure has two components

Numerical value: (sales price)
Aggregation formula(SUM): used for aggregating/combining a number of measure values into one

Measure value determined by dimension value combination
Measure value is meaningful for all aggregation levels

Question 39

Q

What are the three types of measures that exist?

Answer

A

additive, semi-additive, non-additive

Question 40

Q

What are a property, example and occurance of a semi-additive measure?

Answer

A

Property: Cannot be aggregated over some dimensions, typically time
Example: Inventory
Occurrence: Often occur in snapshot facts

Question 41

Q

What are a property, example and occurance of an additive measure?

Answer

A

Property: Can be aggregated over all dimensions
Example: sales price
Occurrence: Often occur in event facts

Question 42

Q

What are a property, example and occurance of a non-additive measure?

Answer

A

Property: Cannot be aggregated over any dimensions
Example: average sales price
Occurrence: occur in all types of facts

Question 43

Q

What are dimension tables?

Answer

A

Surrounding the central fact tables (and linked via foreign keys)
contain classification & aggregation information about the central fact rows.
contain attributes that describe the fact table data, address how it will be analyzed.
have a one-to-many relationship with rows in the central fact table
examples: location, time, and size

Question 44

Q

What are hierarchies in dimensions?

Answer

A

Dimensions have hierarchies with levels

Typically 3-5 levels (of detail)
Dimension values are organized in a tree structure
- Product: Product -> Type -> Category
- Store: Store -> Area -> City -> County
- Time: Day -> Month -> Quarter -> Year
Dimensions have a bottom level and a top level (ALL)

Question 45

Q

What are attributes in dimensions?

Answer

A

Levels may have attributes
- Simple, non-hierarchical information
- Day has Workday as attribute
Generally, dimension attributes are used as labels for reports and query criteria e.g. Country=’USA’.
Dimension attributes can contain one or more hierarchical relations.

Question 46

Q

What are two typical DWH schemas?

Answer

A

Star schema

Snowflake schema

Question 47

Q

What are the Star Schema Advantages and Disadvantages?

Answer

A

Advantages:

Easily understandable, even by non-technical users
Supports increased performance and lower query times
Easily expansible, answering the needs for future changes
Less Complexity

Disadvantages:

More Redundancy

General:

Dimensional Schema
Desnormalized Schema
Kimball’s Reference

Question 48

Q

What are the Snowflake Schema Advantages and
Disadvantages?

Answer

A

(+) Less Redundancy
(-) More Complexity
Relational Schema
Normalized Schema
Inmon’s Reference