CH1 Flashcards

1
Q

What is a data warehouse

A
subject-oriented, 
integrated,
time-varying, 
non-volatile 
collection of data that is used primarily in organizational decision making
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

subject oriented

A

Focused on business related subjects and organizational activity
Ex: Customers, employees, products, suppliers

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

Integrated

A

Data from different sources are stored in a consistent format
Requires addressing naming conflicts, unit discrepancies
. same naming format, concept, etc.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

TIme-Varient

A

Maintains historical data

Each source may contain data at different time points –day, week, month etc. Used to analyze trends.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

nonvolatile

A

Users cannot change data after it is entered. It is periodically updated.
Structured for query and analysis

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

Database vs. Data Warehouse

A

database for querying, for transaction processing.
Data warehouse takes the data from all different databases and creates a layer optimized for and dedicated to analytics.

A database is designed to handle transactions (OLTP). It is not structured to do analytics well.

A data warehouse, on the other hand, is structured to make analytics fast and easy.
Database you don’t do analysis

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

Need for Data Warehousing

A

Integrated, company-wide view of high-quality information (from disparate databases)
Separation of operational and informational systems and data (for improved performance)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

Need for a Company-Wide View

A

Data in operational systems are fragmented and inconsistent.
Data is generally distributed on a variety of incompatible hardware and software.
Data must be consolidated to provide a single corporate view

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

Issues to resolve

A
Inconsistent key structures!
Synonyms
Free-form vs. structured fields
Inconsistent data values
Missing data
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

Organizational Trends Motivating Data Warehouses

A
No single system of records
Multiple systems not synchronized
Organizational need to analyze activities in a balanced way
Customer relationship management
Supplier relationship management
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

Data Warehouse Versus Data Mart

A

DataMart is subset of Data Warehouse. single subject.
DataMart is specific, decentralized, organic, some history, hightly denormalized.. Datamart could be a table in a database. segmented data(partitioned).

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

Logical Data Mart and Real-Time Data Warehouse

A

capture real time information

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

Independent Data Mart

Dependent Data Mart and Operational Data Store

A

Dependent: Dependent data marts are created by drawing data directly from operational, external or both sources.
Independent: Independent data mart is created without the use of a central data warehouse.
Hybrid: This type of data marts can take data from data warehouses or operational systems.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

all datamarts

A

All involve some form of extraction, transformation and loading (ETL)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

why use a datamart?

A

Data Mart helps to enhance user’s response time due to reduction in volume of data
It provides easy access to frequently requested data.
Data mart are simpler to implement when compared to corporate Datawarehouse. At the same time, the cost of implementing Data Mart is certainly lower compared with implementing a full data warehouse.
Compared to Data Warehouse, a datamart is agile. In case of change in model, datamart can be built quicker due to a smaller size.
A Datamart is defined by a single Subject Matter Expert. On the contrary data warehouse is defined by interdisciplinary SME from a variety of domains. Hence, Data mart is more open to change compared to Datawarehouse.
Data is partitioned and allows very granular access control privileges.
Data can be segmented and stored on different hardware/software platforms.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

Dependent data mart with operational data store: a three-level architecture

A

ETL . see slide 16

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
17
Q

near real time

A

can only be done if data is coming in clean

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
18
Q

Extraction

A

connect via gatewarys and interfaces, odbc, oractle connect, etc.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
19
Q

data cleaning and transforming

A

Field length, anomalies, missing data etc.
Data migration -Include simple transforms – (field names etc.)
Data scrubbing - Use domain specific knowledge (zip codes)
Data auditing – discover rules, relationships, suspicious patterns

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
20
Q

Load

A

Check integrity constrains (foreign key, primary key, null etc.)
Sorting, summarization, partition, index

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
21
Q

Refresh

A

When and how
When - Depends on user needs, traffic, database server capabilities
How -Triggers (Oracle), Sniff transaction logs (MS)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
22
Q

Event

A

Event = a database action (create/update/delete) that results from a transaction

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
23
Q

Status

A

after event status updated

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
24
Q

transitory data

A

snapshot, whatever update before is lost.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
25
Q

transient data

A

With transient data, changes to existing records are written over previous records, thus destroying the previous data content

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
26
Q

Periodic Data

A

Periodic data are never physically altered or deleted once they have been added to the store
creaet second colum called Action column. C=current. U= update, you are updating values. so your storing the older rows(historical). D=Deleted data.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
27
Q

Database Limitation

A
Relational databases consists of tables
Tables are flat
one-dimensional
Cross-tabs can be 2-dimensional
How to represent multi-dimensional tables?
The job of data warehousing/OLAP
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
28
Q

Dimensional Modeling

A

Is a preferred technique to present analytic data

Contains same information as a normalized model

Deliver data that’s understandable to the user
Deliver fast query performance

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
29
Q

Dimensional Modeling types

A

Star Schemas: Dimensional modeling implemented in relational DBMS (ROLAP)

OLAP (online analytical processing): Dimensional modeling in a multi-dimensional DB environment

OLAP cubes deliver superior query performance because of pre-calculations, indexing strategies, and other optimizations

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
30
Q

ROLAP

A

Relational Olap model. Star Schemaa Dimensional modeling implemented in relational dbms. similar to rdms.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
31
Q

Star Schema

A

multiple tables that point to 1 fact table in relational dbms

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
32
Q

Dimensional table

A

Dimensions are the subjects, who what when why

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
33
Q

Fact table

A

collect data about subjects(dimensions)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
34
Q

Dimensions - context

A

who what where when hos of measurement, example total amount of sales, qty, etc.

35
Q

dimensional model

A

characterize in fact or dimension tables.

36
Q

same grain?

A

measurements rows in a fact table must be at the same grain. When creating table, at what detail of specification do you want to save the data. weekly, monthly, or individual transactions. they have to match.

37
Q

Dimension Tables

A

Dimension tables often have many columns (attribute)

Each dimension table contains data for one dimension

Dimension table often represent hierarchical relationships
Product roll up into brands and then into categories

Each dimension is defined by a single primary key (PK)

PK serves as the basis of referential integrity with the given fact table to which it is joined.

38
Q

Online Analytical Processing (OLAP) Tools

A

The use of a set of graphical tools that provides users with multidimensional views of their data and allows them to analyze the data using simple windowing techniques

OLAP provides advanced query capabilities to the warehouse that standard SQL cannot
Complex queries that need to aggregate data can take hours to run
End users cannot be expected to issue SQL statements

39
Q

Cube

A

A multidimensional structure consisting of “Data Cubes”

40
Q

Dimension

A

Sides of a cube

41
Q

Measure

A

Facts in a fact table

42
Q

Aggregation

A

Projection of the cube

43
Q

Cube shape

A

A Cube need not be in the shape of a cube at all.
It can have as many dimensions as necessary.
Each dimension can have as many “members” as necessary.
A cube can have as many dimensions as necessary.

44
Q

what goes inside the cube

A

measures from fact table

45
Q

What is a fact table?

A

The Fact Table is the table that provides the data for the elements of the Cube. There can be only one Fact Table per Cube!

46
Q

What is an Aggregation

A

Is a Projection of the cube.
An Aggregation is a Projection of the Cube
An Aggregation Collapses Dimensions. Summation. similar to SQL “Group BY” Clause

47
Q

OLAP Provides 4 types of Aggregation

A

sum, count, min, max

48
Q

cube slicing

A

come up with 2-D view of data by filtering (fixing) a dimension. like a where clause, slice based on a condition. example slice between the males and femails to compare.

49
Q

dicing

A

come up with a small cube (sub-cube) by selecting a subset of all dimensions

50
Q

drill down

A

going from summary to more detailed views.

51
Q

roll up

A

going from detailed views to a summary view.

52
Q

pivot(rotate)

A

to rotate the cube across a dimension to see various faces.

53
Q

add slide

54
Q

operational systems vs. bi systems

A

information. This asset
is almost always used for two purposes: operational record keeping and analytical
decision making. Simply speaking, the operational systems are where you put the
data in, and the DW/BI system is where you get the data out

55
Q

What are operational Systems?

A

The operational systems are optimized to process transactions quickly.
These systems almost always deal with one transaction record at a time. They predictably
perform the same operational tasks over and over, executing the organization’s
business processes. Given this execution focus, operational systems typically do not
maintain history, but rather update data to refl ect the most current state.

56
Q

BI Users

A

Users of a DW/BI system, on the other hand, watch the wheels of the organization
turn to evaluate performance. They count the new orders and compare them
with last week’s orders, and ask why the new customers signed up, and what the
customers complained about. They worry about whether operational processes are
working correctly. Although they need detailed data to support their constantly
changing questions, DW/BI users almost never deal with one transaction at a time.
These systems are optimized for high-performance queries as users’ questions often
require that hundreds or hundreds of thousands of transactions be searched and
compressed into an answer set. To further complicate matters, users of a DW/BI
system typically demand that historical context be preserved to accurately evaluate
the organization’s performance over time

57
Q

Business management requirements for BI/DataWarehouse

A

The DW/BI system must make
information easily accessible. The contents
of the DW/BI system must be understandable

The DW/BI system must present information consistently. The data in the
DW/BI system must be credible. Data must be carefully assembled from a
variety of sources, cleansed, quality assured, and released only when it is fi t
for user consumption.
The DW/BI system must adapt to change. User needs, business conditions,
data, and technology are all subject to change
The DW/BI system must present information in a timely way
The DW/BI system must be a secure bastion that protects the information
assets
The DW/BI system must serve as the authoritative and trustworthy foundation
for improved decision making
The business community must accept the DW/BI system to deem it successful

58
Q

Dimensional modeling is widely accepted as the preferred technique
for presenting analytic data because it addresses two simultaneous requirements

A

■ Deliver data that’s understandable to the business users.

■ Deliver fast query performance.

59
Q

difference between 3NF and dimensional model

A

Both 3NF and dimensional models can
be represented in ERDs because both consist of joined relational tables; the key
diff erence between 3NF and dimensional models is the degree of normalization.
Because both model types can be presented as ERDs, we refrain from referring to
3NF models as ER models; instead, we call them normalized models to minimize
confusion.

60
Q

same info as normalized model?

A

A dimensional model contains the same information as a normalized
model, but packages the data in a format that delivers user understandability, query
performance, and resilience to change.

61
Q

Star Schemas Versus OLAP Cubes

A

Dimensional models implemented in relational database management systems are
referred to as star schemas because of their resemblance to a star-like structure.
Dimensional models implemented in multidimensional database environments are
referred to as online analytical processing (OLAP) cubes, as illustrated in Figure 1-1.

62
Q

loading data in OLAP Cube

A

When data is loaded into an OLAP cube, it is stored and indexed using formats
and techniques that are designed for dimensional data. Performance aggregations
or precalculated summary tables are often created and managed by the OLAP cube
engine.

63
Q

analytics with OLAP

A

OLAP cubes also provide more analytically
robust functions that exceed those available with SQL. The downside is that you
pay a load performance price for these capabilities, especially with large data sets.

64
Q

what does a fact represent?

A

a business measure

65
Q

how and were store measurement data?

A

You should strive to store the
low-level measurement data resulting from a business process in a single dimensional
model. Because measurement data is overwhelmingly the largest set of data,
it should not be replicated in multiple places for multiple organizational functions
around the enterprise. Allowing business users from multiple organizations to access
a single centralized repository for each set of measurement data ensures the use of
consistent data throughout the enterprise

66
Q

what does each row in a fact table represent?

A

Each row in a fact table corresponds to a measurement event. The data on each
row is at a specifi c level of detail, referred to as the grain, such as one row per product on a sales transaction

67
Q

one to one measurement

A

The idea that a measurement event in the physical world has a one-to-one
relationship to a single row in the corresponding fact table is a bedrock principle
for dimensional modeling. Everything else builds from this foundation.

68
Q

most useful facts

A

The most useful facts are numeric and additive, such as dollar sales amount.
Additivity is crucial because BI applications rarely retrieve a single fact table
row. Rather, they bring back hundreds, thousands, or even millions of fact rows at
a time, and the most useful thing to do with so many rows is to add them up.

69
Q

semi additive and non additive facts

A

Semi-additive facts, such as account balances, cannot be summed across
the time dimension. Non-additive facts, such as unit prices, can never be added.

70
Q

fact table

grains fall into one of three categories

A

ransaction, periodic snapshot, and accumulating

snapshot. Transaction grain fact tables are the most common

71
Q

fact tables and foreign keys

A

All fact tables have two or more foreign keys (refer to the FK notation in Figure 1-2)
that connect to the dimension tables’ primary keys.

72
Q

what is a fact table primary key made of?

A

The fact table generally has its own primary key composed of a subset of the foreign
keys. This key is often called a composite key. Every table that has a composite
key is a fact table. Fact tables express many-to-many relationships. All others are
dimension tables.

73
Q

what does dimension table describe?

A

They
describe the “who, what, where, when, how, and why” associated with the event.
Dimension tables tend to have fewer rows than fact tables, but can be wide with
many large text columns.
Each dimension is defi ned by a single primary key (refer
to the PK notation in Figure 1-3), which serves as the basis for referential integrity
with any given fact table to which it is joined

74
Q

What are dimension tables primary source of?

A

Dimension attributes serve as the primary source of query constraints, groupings,
and report labels. In a query or report request, attributes are identifi ed as the
by words. For example, when a user wants to see dollar sales by brand, brand must
be available as a dimension attribute

75
Q

what should attributes consist of in dimension tables?

A

Attributes should
consist of real words rather than cryptic abbreviations. You should strive to minimize
the use of codes in dimension tables by replacing them with more verbose
14 Chapter 1
textual attributes.In many ways, the data warehouse is only as good as the dimension attributes; the
analytic power of the DW/BI environment is directly proportional to the quality and
depth of the dimension attributes

76
Q

how to tell if it’s a dimension or should be in fact table

A

You often make the decision
by asking whether the column is a measurement that takes on lots of values and
participates in calculations (making it a fact) or is a discretely valued description
that is more or less constant and participates in constraints and row labels (making
it a dimensional attribute). For example, the standard cost for a product seems like
a constant attribute of the product but may be changed so often that you decide it
is more like a measured fact.

77
Q

do fact tables or dimension tables represent fierarchical relationships?

A

dimension tables often represent hierarchical relationships.
For example, products roll up into brands and then into categories. For each
row in the product dimension, you should store the associated brand and category
description. The hierarchical descriptive information is stored redundantly in the
spirit of ease of use and query performance. You should resist the perhaps habitual
urge to normalize data by storing only the brand code in the product dimension and
creating a separate brand lookup table, and likewise for the category description in a
separate category lookup table

78
Q

snowflaking

A

This normalization is called snowfl aking. Instead of
third normal form, dimension tables typically are highly denormalized with fl attened
many-to-one relationships within a single dimension table. Because dimension tables
typically are geometrically smaller than fact tables, improving storage effi ciency by
normalizing or snowfl aking has virtually no impact on the overall database size. You
should almost always trade off dimension table space for simplicity and accessibility

79
Q

what data has the most dimensionality

A

This book illustrates repeatedly that the most granular or atomic data has the
most dimensionality. Atomic data that has not been aggregated is the most expressive
data; this atomic data should be the foundation for every fact table design to
withstand business users’ ad hoc attacks in which they pose unexpected queries.
With dimensional models, you can add completely new dimensions to the schema
as long as a single value of that dimension is defi ned for each existing fact row.
Likewise, you can add new facts to the fact table, assuming that the level of detail
is consistent with the existing fact table. You can supplement preexisting dimension
tables with new, unanticipated attributes. In each case, existing tables can be
changed in place either by simply adding new data rows in the table or by executing
an SQL ALTER TABLE command.

80
Q
SELECT
store.district_name,
product.brand,
sum(sales_facts.sales_dollars) AS "Sales Dollars"
FROM
store,
product,
date,
sales_facts
WHERE
date.month_name="January" AND
date.year=2013 AND
store.store_key = sales_facts.store_key AND
product.product_key = sales_facts.product_key AND
date.date_key = sales_facts.date_key
GROUP BY
store.district_name,
product.bran
A

If you study this code snippet line-by-line, the fi rst two lines under the SELECT
statement identify the dimension attributes in the report, followed by the aggregated
metric from the fact table. The FROM clause identifi es all the tables involved
in the query. The fi rst two lines in the WHERE clause declare the report’s fi lter, and
the remainder declare the joins between the dimension and fact tables. Finally, the
GROUP BY clause establishes the aggregation within the report.

81
Q

Operational Source Systems

A

These are the operational systems of record that capture the business’s transactions.
Think of the source systems as outside the data warehouse because presumably you
have little or no control over the content and format of the data in these operational
systems. The main priorities of the source systems are processing performance and availability.
Operational queries against source systems are narrow, one-record-at-a-time
Data Warehousing, Business Intelligence, and Dimensional Modeling Primer 19
queries that are part of the normal transaction fl ow and severely restricted in their
demands on the operational system

82
Q

The extract, transformation, and load (ETL)

A

environment consists
of a work area, instantiated data structures, and a set of processes. The ETL system
is everything between the operational source systems and the DW/BI presentation
area

83
Q

3 step process of ETL

A

1.Extraction is the fi rst step in the process of getting data into the data warehouse
environment. Extracting means reading and understanding the source data and
copying the data needed into the ETL system for further manipulation. At this
point, the data belongs to the data warehouse.
2.After the data is extracted to the ETL system, there are numerous potential transformations,
such as cleansing the data (correcting misspellings, resolving domain conflicts, dealing with missing elements, or parsing into standard formats), combining
data from multiple sources, and de-duplicating data. The ETL system addsvalue to the data with these cleansing and conforming tasks by changing the data
and enhancing.
3.fi nal step of the ETL process is the physical structuring and loading of data
into the presentation area’s target dimensional models. Because the primary mission
of the ETL system is to hand off the dimension and fact tables in the delivery
step, these subsystems are critical