T3 – Data Warehousing Flashcards

1
Q

What is the definition of the data warehouse?

A

“The data warehouse is a collection of integrated, subject-oriented databases designed to support DSS functions, where each unit of data is non-volatile and relevant to some moment in time”

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

From the Characteristics of Data Warehousing: what is subject oriented?

A

data are organized by detailed subject, containing information relevant for decision support.

This ability to define a data warehouse by subject matter makes the data warehouse subject oriented.

e.g. sales, products, or customers. For example, to learn more about your company’s sales data, you can build a warehouse that concentrates on sales. Using this warehouse, you can answer questions like “Who was our best customer for this item last year?”

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

From the Characteristics of Data Warehousing: what is integrated?

A

data that is gathered into the data warehouse from a variety of sources and merged into a coherent whole. Integration is closely related to subject orientation. Data warehouses must put data from disparate sources into a consistent format. They must resolve such problems as naming conflicts and inconsistencies among units of measure. When they achieve this, they are said to be integrated.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

From the Characteristics of Data Warehousing: what is time variant?

A

In order to discover trends in business, analysts need large amounts of data. This is very much in contrast to online transaction processing (OLTP) systems, where performance requirements demand that historical data be moved to an archive. A data warehouse’s focus on change over time is what is meant by the term time variant.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

From the Characteristics of Data Warehousing: what is nonvolatile?

A

Nonvolatile means that, once entered into the warehouse, data should not change. This is logical because the purpose of a warehouse is to enable you to analyze what has occurred

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

What in short are the 4 defining characteristics of the DWH?

A

4 Defining Characteristics:

  • integrated: data will be connected
  • subject-oriented: measured things depend on subject we are treating
  • non-volatile: once data is loaded in data warehouse, it is not updated (tasks will be performed on top of data)
  • relevant to some moment in time: uploading data over time, creating history
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

What are three main types of data warehouses?

A
  • Data marts
  • Operational Data Stores (ODS)
  • Enterprise Data Warehouses (EDW)
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

What is a data mart?

A
  • Whereas a data warehouse combines databases across an entire enterprise, a data mart is usually smaller and focuses on a particular subject or department.
  • A data mart is a subset of a data warehouse, typically consisting of a single subject area (e.g. marketing, operations).
  • Two possibilities:
    • Dependent data mart
    • Independent data mart
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

What is a dependent data mart?

A
  • = subset that is created directly from the DWH
  • Has the advantages of using a consistent data model & provides quality data
  • Dependent data marts support the concept of a single enterprise-wide data model, but the DWH must be constructed first
  • A dependent data mart ensures that the end user is viewing the same version of the data that are accessed by all other DWH users
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

What is an independent data mart?

A
  • Lower-cost, scaled-down version of a DWH
  • A small DM designed for a strategic busines unit or a department, but is source is not an EDW
  • potential disadvantage: each department has their own and e.g. different departments might end up with different numbers -> as e.g. marketing calculates a specific number differently then finance (there is no universal standard)
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

What are OPERATIONAL DATA STORES (ODS)?

A
  • A type of database often used as an interim area for a DWH
  • Unlike the static content of a DWH, the contents of an ODS are updated through the course of business operations.
  • An ODS is used for short-term decisions involving mission-critical applications rather then for the medium and long-term decisions associated with a EDW
  • An ODS consolidates data from multiple source systems and provides a near-real time, integrated view of volatile, current data.
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

What are OPER MARTS?

A
  • An operational data mart.
  • An oper mart is a small-scale data mart typically used by a single department or functional area in an organization.
  • Oper marts are created when operational data need to be analyzed multidimensionally.
  • The data for an oper mart comes from an ODS
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

What is an ENTERPRISE DATA WAREHOUSE
(EDW)?

A
  • An enterprise data warehouse (EDW) is a large-scale DWH that is used across the enterprise for decision support.
  • The large-scale nature provides integration of data from many sources into a standard format for effective BI and decision support applications.
  • EDW are used to provide data for many types of DSS, including CRM, SCM, BPM, BAZM, PLM, KMS, etc.
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

What is Metadata?

A
  • Data about data.
  • In a DWH, metadata describe the contents of a DWH and the manner of its acquisition and use
  • We can classify metadata in: technical vs business metadata
  • We can also make a distinction between:
    • Syntactic metadata – (i.e. data describing the syntax of the data)
    • Structural metadata – (i.e. data describing the structure of the data)
    • Semantic metadata – (i.e. data describing the meaning of the data in a specific domain)
  • Metadata design, creation and usage can involve ethical questions
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

What are the Two Data Warehousing Strategies?

A
  • Enterprise-wide warehouse, top down, the Inmon methodology
  • Data mart, bottom up, the Kimball methodology
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

What is the Data Mart Strategy (bottom up, Kimball)?

A
  • The most common approach
  • Begins with a single mart and architected marts are added over time for more subject areas
  • Relatively inexpensive and easy to implement
  • Can be used as a proof of concept for data warehousing
  • Can perpetuate the “silos of information” problem
  • Can postpone difficult decisions and activities
  • Requires an overall integration plan
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
17
Q

What is the Enterprise-Wide Strategy (top down, Inmon)?

A
  • A comprehensive warehouse is built initially
  • An initial dependent data mart is built from subset of the data in the warehouse
  • Additional data marts are built from subsets of the warehouse data
  • Complex project: expensive, time consuming & prone to failure
  • When successful: results in an integrated, scalable warehouse

Note: if organization has ERP and that is the only system available, it is easy to implement the enterprise-wide strategy

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
18
Q

What is the difference between OLTP and DWH in terms of indexes?

A
  • OLTP: few
  • DWH: many

What are Indexes?
Indexes are data structures which hold field values from the indexed column(s) and pointers to the related record(s). This data structure is then sorted and binary searches are performed to quickly find the record.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
19
Q

What is the difference between OLTP and DWH in terms of joins?

A
  • OLTP: many
  • DWH: some
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
20
Q

What is the difference between OLTP and DWH in terms of duplicated data?

A
  • OLTP: normalized DBMS (=database management system)
  • DWH: denormalized DBMS (=database management system)

Normalization = minimize the redundancy

21
Q

What is the difference between OLTP and DWH in terms of derived data and aggregates?

A
  • OLTP: rare
  • DWH: common
22
Q

What is dimensional modelling?

A
  • A retrieval-based system that supports high-volume query access
  • The so called dimensional modeling data base design technique was specifically conceived for data warehouse design, answering the limitation found in the traditional approaches to data base design.
  • This approach was defined initially by Kimball (1996), who claimed that he wasn’t inventing nothing new, just documenting what people were doing for many years.
23
Q

What are the objectives of dimensional modelling?

A
  • To produce data base structures the end user easily understands and can query them.
  • To optimize queries performance (in opposition to update performance)
24
Q

What are facts?

A

Facts are the important entity: a sale

Facts have measures that can be aggregated: sales price

25
Q

What are dimensions?

A

Dimensions describe facts

A sale has the dimensions Product, Store and Time

26
Q

What are the goals for dimensional modeling?

A
  • Surround facts with as much context (dimensions) as possible
  • But you should not try to model all relationships in the data (unlike ER modeling!)
27
Q

What is a normalized database system?

A

normalized database system = whenever there is a change in e.g. address, it will only be changed ONCE in ONE of the tables, other tables only refer to that table then -> enforcing referential integrity

OPTIMIZED for transactional data systems

28
Q

What are the characteristics of a cube?

A
  • A “cube” may have many dimensions!
  • Theoretically no limit for the number of dimensions (Typically 4-12 dimensions, But only 2-3 dimensions can be viewed at a time)
  • Dimensionality reduced by queries via projection/aggregation
  • A cube consists of cells
    • A given combination of dimension values
    • empty cell = no data for this combination
29
Q

What is a sparse cube vs a dense cube?

A
  • Sparse cube: few non-empty cells
  • Dense cube: many non-empty cells

Note: Cubes become sparse at high dimensionality

30
Q

What are Fact Tables?

A
  • A fact table contains a large number of rows that correspond to observed business or facts.
  • A fact table contains:
    • attributes needed to perform decision analysis,
    • descriptive attributes used for query reporting,
    • foreign keys to link to dimension tables.
  • The decision analysis attributes consist of:
    • performance measures,
    • operational metrics,
    • aggregated measures,
    • and all the other metrics needed to analyze the organization’s performance.

In other words, the fact table primarily addresses what the data warehouse supports for decision analysis

31
Q

What are FACTS?

A
  • Facts represent the subject of the desired analysis
    • The “important” in the business that should be analyzed
  • A fact is most often identified via its dimension values
    • A fact is a non-empty cell
    • Some models give facts an explicit identity
  • Generally a fact should
    • Be attached to exactly one dimension value in each dimension
    • Only be attached to dimension values in the bottom levels
32
Q

What is an Event fact?

A

Event fact (transaction), a fact for every business event (sale)

33
Q

What is a “fact-less” fact?

A

“Fact-less” facts:

  • A fact per event (customer contact)
  • No numerical measures
  • An event has happened for a given dimension value combination
34
Q

What is a snapshot fact?

A

Snapshot fact

  • A fact for every dimension combination at given time intervals
  • Captures current status (inventory)
  • Cannot be summed over time
35
Q

What are cumulative snapshot facts?

A

Cumulative snapshot facts

  • A fact for every dimension combination at given time intervals
  • Captures cumulative status up to now (sales in year to date)
  • e.g. sales
36
Q

What is the granularity of facts?

A

granularity = decision on the lowest level of detail when recording the data. (you can always summarize to a higher level, but not go into more detail later)

  • What does a single fact mean?
  • Level of detail
  • Given by combination of bottom levels
  • Example: “total sales per store per day per product”
37
Q

What are measures?

A

Measures represent the fact property that the users want to study and optimize
Example: total sales price

38
Q

What are the two components of a measure?

A

A measure has two components

  • Numerical value: (sales price)
  • Aggregation formula(SUM): used for aggregating/combining a number of measure values into one

Measure value determined by dimension value combination
Measure value is me
aningful for all aggregation levels

39
Q

What are the three types of measures that exist?

A

additive, semi-additive, non-additive

40
Q

What are a property, example and occurance of a semi-additive measure?

A
  • Property: Cannot be aggregated over some dimensions, typically time
  • Example: Inventory
  • Occurrence: Often occur in snapshot facts
41
Q

What are a property, example and occurance of an additive measure?

A
  • Property: Can be aggregated over all dimensions
  • Example: sales price
  • Occurrence: Often occur in event facts
42
Q

What are a property, example and occurance of a non-additive measure?

A
  • Property: Cannot be aggregated over any dimensions
  • Example: average sales price
  • Occurrence: occur in all types of facts
43
Q

What are dimension tables?

A
  • Surrounding the central fact tables (and linked via foreign keys)
  • contain classification & aggregation information about the central fact rows.
  • contain attributes that describe the fact table data, address how it will be analyzed.
  • have a one-to-many relationship with rows in the central fact table
  • examples: location, time, and size
44
Q

What are hierarchies in dimensions?

A

Dimensions have hierarchies with levels

  • Typically 3-5 levels (of detail)
  • Dimension values are organized in a tree structure
    • Product: Product -> Type -> Category
    • Store: Store -> Area -> City -> County
    • Time: Day -> Month -> Quarter -> Year
  • Dimensions have a bottom level and a top level (ALL)
45
Q

What are attributes in dimensions?

A
  • Levels may have attributes
    • Simple, non-hierarchical information
    • Day has Workday as attribute
  • Generally, dimension attributes are used as labels for reports and query criteria e.g. Country=’USA’.
  • Dimension attributes can contain one or more hierarchical relations.
46
Q

What are two typical DWH schemas?

A

Star schema

Snowflake schema

47
Q

What are the Star Schema Advantages and Disadvantages?

A

Advantages:

  • Easily understandable, even by non-technical users
  • Supports increased performance and lower query times
  • Easily expansible, answering the needs for future changes
  • Less Complexity

Disadvantages:

  • More Redundancy

General:

  • Dimensional Schema
  • Desnormalized Schema
  • Kimball’s Reference
48
Q

What are the Snowflake Schema Advantages and
Disadvantages?

A
  • (+) Less Redundancy
  • (-) More Complexity
  • Relational Schema
  • Normalized Schema
  • Inmon’s Reference