Data Warehouse Flashcards
What is a single, complete, and consistent store of data obtained from a variety of different sources, made available to end users in a way they can understand and use in a business context?
A Data Warehouse
Why do we use data warehouses?
- Consolidation of information resources
- Improved query performance
- Separate research and decision support functions from the operational systems
- Foundation for data mining, data visualization, advanced reporting, and OLAP tools
What is a data warehouse used for?
- Knowledge discovery
- Making consolidated reports
- Finding relationships and correlations
- Data mining
The queryable source of data in the enterprise that comprises the union of all of its constituent data marts is called what?
Data Warehouse
A logical subset of the complete data warehouse that is often
viewed as a restriction of the data warehouse to a single business
process or to a group of related business processes targeted toward a
particular business group is called what?
Data Mart
What is a point of integration for operational systems that were developed independently of each other?
Operational Data Store (ODS)
What are Kimball methodologies?
- Logical data warehouse (BUS), made up of subject areas (data marts)
- Business driven, users have active participation
- Decentralized data marts (not required to be a separate physical data store)
- Independent dimensional data marts optimized for reporting/analytics
- Integrated via Conformed Dimensions (provides consistency across data sources)
- 2-tier (data mart, cube), less ETL, no data duplication
What are Inmon methodologies?
- Enterprise data model (CIF) realized as an enterprise data warehouse (EDW)
- IT Driven, users have passive participation
- Centralized atomic normalized tables (off limit to end users)
- Later create dependent data marts that are separate physical subsets of data and can be used for multiple purposes
- Integration via enterprise data model
- 3-tier (data warehouse, data mart, cube), duplication of data
What are the differences between ETL vs ELT?
Extract, Transform, and Load (ETL)
* Transform while hitting source system
* No staging tables
* Processing done by ETL tools (SSIS)
Extract, Load, Transform (ELT)
* Uses staging tables
* Processing done by target database engine (SSIS: Execute T-SQL Statement task instead of Data Flow Transform tasks)
* Use for big volumes of data
* Use when source and target databases are the same
* Use with the Analytics Platform System (APS)
ELT is often the better choice, since the database engine is more efficient than SSIS at transformations
* Best use of database engine: Transformations
* Best use of SSIS: Data pipeline and workflow management
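The ELT pattern above can be sketched with Python's built-in sqlite3 standing in for the target database engine. This is illustrative only (table and column names are made up): raw data is extracted and loaded into a staging table unchanged, then a set-based SQL statement does the transform inside the engine, mirroring an Execute T-SQL Statement task instead of a Data Flow.

```python
import sqlite3

# Extract + Load: raw rows land in a staging table unchanged.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stg_sales (order_id INTEGER, amount TEXT)")
raw_rows = [(1, " 10.50 "), (2, "3.25")]  # as extracted from a source system
conn.executemany("INSERT INTO stg_sales VALUES (?, ?)", raw_rows)

# Transform: the target database engine does the work in set-based SQL.
conn.execute("""
    CREATE TABLE fact_sales AS
    SELECT order_id, CAST(TRIM(amount) AS REAL) AS amount
    FROM stg_sales
""")
total = conn.execute("SELECT SUM(amount) FROM fact_sales").fetchone()[0]
print(total)  # 13.75
```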
What is a unique identifier not derived from source system?
Surrogate Keys
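A minimal sketch of a surrogate key, again using sqlite3 (the `customer_code` natural key is a made-up example): the warehouse assigns its own integer key on load, independent of any identifier from the source system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE dim_customer (
        customer_sk   INTEGER PRIMARY KEY,  -- surrogate key, assigned on load
        customer_code TEXT,                 -- natural key from the source system
        name          TEXT
    )
""")
conn.execute("INSERT INTO dim_customer (customer_code, name) VALUES ('C-001', 'Ada')")
conn.execute("INSERT INTO dim_customer (customer_code, name) VALUES ('C-002', 'Grace')")
keys = [row[0] for row in conn.execute(
    "SELECT customer_sk FROM dim_customer ORDER BY customer_sk")]
print(keys)  # [1, 2]
```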
Describe a star schema
- Data-modeling technique
- Maps multidimensional decision support data into a relational database
- Creates a near equivalent of a multidimensional database schema from relational data
- Yields an easily implemented model for multidimensional data analysis while preserving the relational structures on which the operational database is built
- The basic star schema has four components: facts, dimensions, attributes, and attribute hierarchies
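A tiny star schema sketched in sqlite3 (all table and column names are illustrative): a central fact table holds measures and foreign keys, the dimension tables hold context, and a typical star-join query slices the facts by dimension attributes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_date    (date_sk INTEGER PRIMARY KEY, year INTEGER);
    CREATE TABLE dim_product (product_sk INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE fact_sales  (date_sk INTEGER, product_sk INTEGER, amount REAL);

    INSERT INTO dim_date    VALUES (1, 2023), (2, 2024);
    INSERT INTO dim_product VALUES (1, 'Books'), (2, 'Games');
    INSERT INTO fact_sales  VALUES (1, 1, 5.0), (2, 1, 7.0), (2, 2, 3.0);
""")

# Star-join: aggregate the fact, filtered by dimension attributes.
row = conn.execute("""
    SELECT SUM(f.amount)
    FROM fact_sales f
    JOIN dim_date d    ON d.date_sk = f.date_sk
    JOIN dim_product p ON p.product_sk = f.product_sk
    WHERE d.year = 2024 AND p.category = 'Books'
""").fetchone()
print(row[0])  # 7.0
```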
Describe facts
- Numeric measurements that represent a specific business aspect or activity
- Normally stored in a fact table that is the center of the star schema
- The fact table contains facts linked through their dimensions
- Metrics are facts computed at run time
Describe Dimensions
- Qualifying characteristics that provide additional perspectives to a given fact
- Decision support data is almost always viewed in relation to other data
- Facts are studied via their dimensions
- Dimensions are stored in dimension tables
- Each dimension record is related to thousands of fact records
- Dimensions are represented as physical tables in the data warehouse database
- Dimension tables are smaller than fact tables
Describe Slowly Changing Dimensions (SCD)
- Dimensions that change slowly over time, rather than on a regular, time-based schedule
- In a data warehouse there is a need to track changes in dimension attributes in order to report historical data
- Implementing one of the SCD types enables users to assign the proper dimension attribute value for a given date
- Examples of such dimensions: customer, geography, employee
How do you handle SCD types?
- Type 0 - The passive method
- Type 1 - Overwriting the old value
- Type 2 - Creating a new additional record
- Type 3 - Adding a new column
- Type 4 - Using historical table
- Type 6 - Combine approaches of types 1,2,3 (1+2+3=6)
SCD Type 0
- The passive method. No special action is performed upon dimensional changes. Some dimension data can remain the same as it was when first inserted; other data may be overwritten.
SCD Type 1
- Overwriting the old value. No history of dimension changes is kept in the database; the old dimension value is simply overwritten by the new one. This type is easy to maintain and is often used for data whose changes are caused by processing corrections (e.g. removal of special characters, correcting spelling errors).
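A minimal Type 1 sketch in sqlite3 (names and values are made up): a processing correction simply overwrites the attribute in place, so no history survives.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_customer (customer_sk INTEGER PRIMARY KEY, city TEXT)")
conn.execute("INSERT INTO dim_customer VALUES (1, 'Pittsburg')")  # misspelled in source

# Correction arrives: overwrite in place; the old value is gone.
conn.execute("UPDATE dim_customer SET city = 'Pittsburgh' WHERE customer_sk = 1")
city = conn.execute("SELECT city FROM dim_customer WHERE customer_sk = 1").fetchone()[0]
print(city)  # Pittsburgh
```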
SCD Type 2
- Creating a new additional record. All history of dimension changes is kept in the database.
- You capture an attribute change by adding a new row with a new surrogate key to the dimension table. Both the prior and new rows contain the natural key (or other durable identifier) as an attribute.
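A minimal Type 2 sketch in sqlite3 (table, column names, and the current-flag convention are illustrative): a change expires the old row and inserts a new row with a new surrogate key, while both rows keep the same natural key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE dim_customer (
        customer_sk   INTEGER PRIMARY KEY,
        customer_code TEXT,     -- natural (durable) key
        city          TEXT,
        current_flag  INTEGER
    )
""")
conn.execute("INSERT INTO dim_customer VALUES (1, 'C-001', 'London', 1)")

# Customer moves: expire the old row, then insert the new current row.
conn.execute("""UPDATE dim_customer SET current_flag = 0
                WHERE customer_code = 'C-001' AND current_flag = 1""")
conn.execute("""INSERT INTO dim_customer (customer_code, city, current_flag)
                VALUES ('C-001', 'Paris', 1)""")

rows = conn.execute(
    "SELECT city, current_flag FROM dim_customer ORDER BY customer_sk").fetchall()
print(rows)  # [('London', 0), ('Paris', 1)]
```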
SCD Type 3
- Adding a new column. Usually only the current and previous value of the dimension is kept in the database. The new value is loaded into the 'current/new' column and the old one into the 'old/previous' column. Generally speaking, the history is limited to the number of columns created for storing historical data. This is the least commonly needed technique.
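A minimal Type 3 sketch in sqlite3 (column names are illustrative): a change shifts the current value into the prior column, so only one level of history survives.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE dim_customer (
        customer_sk  INTEGER PRIMARY KEY,
        current_city TEXT,
        prior_city   TEXT
    )
""")
conn.execute("INSERT INTO dim_customer VALUES (1, 'London', NULL)")

# On change: shift current into prior, store the new value as current.
conn.execute("""
    UPDATE dim_customer
    SET prior_city = current_city, current_city = 'Paris'
    WHERE customer_sk = 1
""")
row = conn.execute("SELECT current_city, prior_city FROM dim_customer").fetchone()
print(row)  # ('Paris', 'London')
```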
SCD Type 4
- Using a historical table. A separate historical table tracks all of a dimension's attribute changes, while the 'main' dimension table keeps only the current data, e.g. customer and customer_history tables.
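A minimal Type 4 sketch in sqlite3 (names are illustrative): before each change, the old row is archived to the history table, and the main dimension table is then overwritten so it holds only current data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customer         (customer_sk INTEGER PRIMARY KEY, city TEXT);
    CREATE TABLE dim_customer_history (customer_sk INTEGER, city TEXT);

    INSERT INTO dim_customer VALUES (1, 'London');
""")

# On change: archive the old value, then overwrite the main table.
conn.execute("""INSERT INTO dim_customer_history
                SELECT customer_sk, city FROM dim_customer WHERE customer_sk = 1""")
conn.execute("UPDATE dim_customer SET city = 'Paris' WHERE customer_sk = 1")

current = conn.execute("SELECT city FROM dim_customer").fetchone()[0]
history = conn.execute("SELECT city FROM dim_customer_history").fetchall()
print(current, history)  # Paris [('London',)]
```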
SCD Type 6
- Combines the approaches of types 1, 2, and 3 (1+2+3=6). The dimension table gains additional columns:
- current_type - keeps the current value of the attribute. All history records for a given item of the attribute have the same current value.
- historical_type - keeps the historical value of the attribute. All history records for a given item of the attribute can have different values.
- start_date - keeps the start date of the attribute's 'effective date' history.
- end_date - keeps the end date of the attribute's 'effective date' history.
- current_flag - marks the most recent record.
Describe Attributes
- Each dimension table contains attributes.
- Attributes are often used to search, filter or classify facts.
- Dimensions provide descriptive characteristics about the facts through their attributes
- The data warehouse designer must define common business attributes that will be used by the data analyst to narrow the search
- No mathematical limit to the number of dimensions
- Slice and dice: focus on slices of the data cube for more detailed analysis
Describe Attribute Hierarchies
- Provide top-down data organization
- Two purposes:
- Aggregation
- Drill-down/roll-up data analysis
- Determine how the data are extracted and represented
- Stored in the DBMS’s data dictionary
- Used by OLAP tool to access warehouse properly
What is a more complex variation of the star schema, used in a data warehouse, in which the tables that describe the dimensions are normalized?
Snowflake schema
Describe Fact Constellation Schema
- For each star schema it is possible to construct a fact constellation schema (for example, by splitting the original star schema into more star schemas, each of which describes facts at another level of the dimension hierarchies).
- The fact constellation architecture contains multiple fact tables that share many dimension tables.
Describe a Data Lake
- A data lake is a storage repository that holds a vast amount of raw data in its native format until it is needed.
- A data lake includes structured data from relational databases (rows and columns), semi-structured data (CSV, logs, XML, JSON), unstructured data (emails, documents, PDFs), and even binary data (images, audio, video), thus creating a centralized data store accommodating all forms of data.