Mid Term Flashcards

1
Q

Data Mart Advantages

A

Faster implementation

Earlier return on investment

Less risk of failure

Gives the project team time to learn and grow.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

Data Warehouse Advantages

A

A truly corporate effort, an enterprise view of data

Inherently architected, not a union of disparate data marts

Single, central storage of data about the content

Centralized rules and control

May see quick results if implemented with iterations

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

Data Warehouse Disadvantages

A

Takes longer to build even with an iterative method

High exposure to risk of failure

Needs high level of cross-functional skills,

High outlay without proof of concept

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

Data Mart/Bottom-Up Disadvantages

A

Each data mart has its own narrow view of data

Permeates redundant data in every data mart

Perpetuates inconsistent and irreconcilable data

Proliferates unmanageable interfaces (Not required)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

Phases of data warehouse development lifecycle.

A

Planning, Analysis, Design, Implementation

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

What is the most costly lifecycle activity

A

ETL - You spend at least 75% of your time on that.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

What does ETL stand for?

A

Extract, Transform, Load

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

Which Data Warehouse Lifecycle phase is analogous to logical design?

A

Dimensional modeling

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

Which two data warehouse lifecycle phases have parallel levels of activities?

A

Design and implementation, which are the last two phases.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

The data warehouse architecture that does not rely on the creation of a new data structure is the what?

A

Federated.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

What is a data mart?

A

A data mart is a subset of a data warehouse focused on a particular line of business, department, or subject area.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

What is a data warehouse?

A

Central repositories of integrated data from one or more disparate sources.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

What activities are required in the analysis phase?

A

Analytic themes, which breaks requirements into categories;

bus matrix, which is the business processes mapped to the data needed;

Information packet development, maps which dimensions are needed for each fact table.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

What are the different types of data architectures, and what do they mean?

A

Federated: a collection of independent database systems are united into a loosely coupled federation in order to share and exchange information.

Hub and Spoke: The communication is not made between pairs of applications but between each application (spoke) and the central hub.

Data Mart Bus: A bus architecture is composed of a set of tightly integrated data marts that get their power from conformed dimensions and fact tables. A bus architecture uses top-down planning and a grid of business functions and dimensions to deliver a set of tightly integrated data marts.

Centralized Enterprise-Wide: A centralized data architecture means the data from each domain/subject (i.e. payroll, operations, finance) is copied to one location (i.e. a data lake under one storage account), and that the data from the multiple domains/subjects are combined to create centralized data models and unified views.

Independent Data Marts

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

What are junk dimensions?

A

Junk dimensions are used for dimensions that only involve a few responses, and which don’t fit squarely in any other dimensions.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

What is a degenerate dimension?

A

When dimensions are stored in a fact table because there are so many.

17
Q

What are different options for slowly changing dimensions?

A

Zero option: do nothing, which is easiest, but you have inaccurate data.

One option: overwrite it.

Two: you retain the former value, but you create a new row. Problematic because you can’t duplicate a surrogate key. So instead, you have to identify the new information via a business key.

Three: instead of creating a new row, you create an additional column for “current” data.

18
Q

What are the basic dimensional model schemas?

A

Star schema and snowflake schema.

19
Q

What is Business Intelligence?

A

Business intelligence encompasses both the applications and methods used for the gathering, cleansing and storing of data, as well as the transformation of that data into usable knowledge capable of informing key business decisions and strategies.

20
Q

What is integrated data?

A

Integrated data refers to data that is pulled from many disparate data sources and organized to depict a useful, unified view of the business or parts of the business. The practice of data integration refers to the act of pulling data together to construct a unified view of the business data, and it can be done in multiple different ways.

21
Q

Who was Ralph Kimball?

A

Ralph Kimball is a top authority within the study of data warehousing, and a proponent of the bottom-up approach to data warehousing described above. He published the book “The Data Warehousing Toolkit” in 1996.

22
Q

What occurs in the Data Warehousing planning phase?

A

Define Initial project scope, Project Goals/Objectives, Project Success

Organize Team, Team roles, Team members

Organize Project, Project cost, Project schedule

23
Q

What are the requirements of Data Warehousing analysis?

A

Interview Summaries - Prose summarizing interviews

Analytics - Analysis requirements grouped into “categories”

DW Bus Matrix - Business processes mapped to data needed (dimensions used, eg time, student)

DM Information Package - Prioritized processes

24
Q

What is a degenerate dimension?

A

An attribute (dimension) stored in fact table

Attribute does NOT link to a dimension table

Typically a high-cardinality attribute

Typically groups records

Examples: PO#, Order#, Shipment#, …
Often used for drill-downs and/or data mining (e.g. Market Basket Analysis)

25
Q

What are the two basic dimensional model schemas?

A

Snowflake Schema and Star Schema

26
Q

What is a snowflake schema?

A

Fact table has direct relationship with some dimension tables, and indirect relationship with other(s)

Multi-table dimensions i.e., “Normalized” dimensions

Advantages: Good when large number of rows in Dims, Avoids data redundancy, Ease of maintaining

27
Q

What is a star schema?

A

Fact table has direct relationship with each of the dimension tables

“Single-table” dimensions i.e., denormalized

Advantages: Faster load/query/analysis performance, Less complexity; more intuitive to users, Good when relatively few rows in Dims

28
Q

What options exist for slowly changing dimensions?

A
  1. Do Nothing
  2. Overwrite Record
  3. Retain All History (add new rows) – new SK
  4. Retain Some History (add new columns)
29
Q

What are the key steps/components of the ETL process?

A

Extract: Copying and integrating data from OLTP and other data sources in preparation for cleansing and loading into the DW

Transform: Cleaning and converting data to prepare it for loading into the DW

Load: Putting cleansed and converted data into the DW

30
Q

What is the staging area and why is it important?

A

Information hub, facilitating the enriching stages that data goes through to populate a DW

Advantages: Separates source systems and DW, Minimizes ETL impact on source AND DW systems, Potential added support for operational reporting
e.g., ODS

Can consist of multiple “hubs”: “upload” area, “staging” area(s) , “DW load images”

31
Q

Describe common types of data transformations that occur during ETL

A

Format Revisions (Varchar (2) varchar (30))

Key Restructuring, Lookup (SK’s, AK’s)

Handling of Null Values (“unknown”; “-1” rows)

Decoding fields (m/f male/female)

Calculated, Derived values (FName + LName AS
Fullname)

Merging of Data (DimClass table = CLASS fields + COURSE fields)

Splitting of single fields (Addr Street, City, State, Zip)

Character set conversion (Ascii Unicode)

Units of measurement conversion (Meters Yards)

Date/time conversion (10/2/2022 02-OCT-2022)

Summarization (individualItemSale salesByProduct)

Deduplication (eliminating duplicated record)

32
Q

What is data profiling and why is it used?

A

Systematic analysis of the content of a data source
Profile each source table/dataset

Goals: Anticipate potential data quality issues upfront, Build quality corrections and controls into ETL process, Manual and/or Tool-assisted (SSIS Data Profiler)

33
Q

What are two ways to document ETL design?

A

Physical DW Design spreadsheet

ETL Map

34
Q

What are the two categories of extracting source data?

A

Static Data Capture - Point-in-time snapshot, Initial Loads and periodic refreshes

Revised Data Capture - Only data that has been added, updated, deleted since last load; Ongoing incremental loads - Updates existing data

35
Q

What are commonly used SQL commands for creating and transforming data?

A

INSERT…SELECT [DISTINCT]…
UPDATE…FROM…
CASE…WHEN… (can be used to expand values)s

36
Q

What is a multi-dimensional (OLAP) cube?

A

An OLAP (Online analytical processing) cube is a multi-dimensional array of data.

An OLAP Cube is a data structure that allows fast analysis of data according to the multiple Dimensions that define a business problem. A multidimensional cube for reporting sales might be, for example, composed of 7 Dimensions: Salesperson, Sales Amount, Region, Product, Region, Month, Year.