Data Warehousing Flashcards

1
Q

Data warehouse

A

Decision support database that is maintained separately from the organization’s operational database. It supports information processing by providing a solid platform of consolidated, historical data for analysis.

2
Q

Subject-Oriented Data Warehouses

A

Organized around major subjects such as customers, products and sales. Focuses on the modelling and analysis of data for decision makers, not on daily operations or transaction processing. Provides a simple and concise view of particular subject issues by excluding data that is not useful in the decision support process.

3
Q

Integrated Data Warehouses

A

Constructed by integrating multiple heterogeneous data sources such as relational DBs, flat files, Excel files and transaction records. Data cleaning and data integration techniques are applied to ensure consistency in naming conventions, encoding structures, attribute measures, etc. among the different data sources. Data is converted as it is moved into the warehouse.

4
Q

Time Variant in Data Warehouses

A

The time horizon for data warehouses is much longer than that of operational systems, as they provide historical data from the past 5-10 years. Every structure in the data warehouse contains an element of time, explicitly or implicitly, whereas the key of operational data may or may not contain a time element.

5
Q

Non-Volatile Data Warehouse

A

Physically separate store of data transformed from the operational environment. Operational update of data doesn’t occur in the data warehouse environment. It doesn’t require transaction processing, recovery and concurrency control mechanisms. It requires only two operations in accessing data: initial loading of data and access of data.

6
Q

heterogeneous DBMS

A

Integrates disparate databases to provide a unified query interface for users. This system relies on a query-driven approach.

7
Q

A data warehouse is an _____ driven system. What does it focus on?

A

Update-driven: it integrates and stores information from heterogeneous sources in advance. It focuses on high performance for querying and analysis.

8
Q

Operational DBMS

A

Focuses on Online Transaction Processing (OLTP), while a data warehouse focuses on Online Analytical Processing (OLAP).

9
Q

Fact Table

A

Contains measurable data or metrics about business processes, e.g. sales.

10
Q

Dimension Table

A

Contains descriptive attributes used to label and filter data, e.g. product names.

11
Q

Why do we need a separate warehouse?

A
  1. Performance: OLTP systems are tuned for transaction processing whilst OLAP is optimized for complex analytical queries, multidimensional views and data aggregation
  2. Historical Data: Operational databases typically lack historical data, which is critical for decision making
  3. Data Consolidation: Decision-making often requires aggregated and summarized data from multiple heterogeneous sources.
  4. Data Quality: Ensures consistency in data representation, codes and formats from diverse sources
12
Q

Data Warehouse Design Overview
Key Design Perspectives

A

1 Top-Down View: Identifies relevant information needed for the data warehouse.
2 Data Source View: Exposes data being captured and managed by operational systems
3 Data Warehouse View: Defines fact tables and dimension tables for analytical use.
4 Business Query View: Represents how end users will query and interact with the data

13
Q

Design Process

A

Top Down: Focuses on overall planning and design
Bottom Up: Starts with prototypes and experiments
Waterfall: Structured, step by step process
Spiral: Iterative development with quick functional prototypes.

14
Q

Design Steps

A

1 Choose a business process: Identify the processes to model, e.g. orders.
2 Define the grain: Decide the atomic level of data granularity (e.g. individual orders or daily summaries).
3 Select Dimensions: Identify descriptive attributes (e.g. time or location).
4 Choose Measures: Define numerical metrics (e.g. sales amount) for the fact table (see the schema sketch after this list).
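
A minimal sketch of these four steps as a star schema, using Python's built-in sqlite3; the table and column names (fact_orders, dim_time, dim_location, sales_amount) are illustrative assumptions, not part of the card.

```python
import sqlite3

# Illustrative star schema for an "orders" business process (names are assumptions).
conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Step 3: dimension tables hold the descriptive attributes
CREATE TABLE dim_time     (time_key INTEGER PRIMARY KEY, order_date TEXT, month TEXT, year INTEGER);
CREATE TABLE dim_location (loc_key  INTEGER PRIMARY KEY, city TEXT, country TEXT);

-- Steps 1, 2 and 4: one fact row per individual order (the chosen grain),
-- with sales_amount as the numerical measure
CREATE TABLE fact_orders (
    time_key     INTEGER REFERENCES dim_time(time_key),
    loc_key      INTEGER REFERENCES dim_location(loc_key),
    sales_amount REAL
);
""")
```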

15
Q

Enterprise Warehouse

A

Collects all subject information spanning the entire organization. Features a unified approach to organizing and representing data, the ability to classify data by subject, and the ability to give the relevant divisions (sales, etc.) access to those classifications. Normalized design.

16
Q

DataMart

A

A DataMart is the access layer of the data warehouse environment. Used to get data out to the users. The data mart is a subset of the data warehouse that is usually oriented to a specific business line or team. Data marts represent small slices of the data warehouse.

17
Q

Virtual Warehouse

A

A set of SQL views over operational databases. Only some of the possible summary views may be persisted.

18
Q

Data Warehouse Tools and Utilities

A
  • Data Extraction: Get data from multiple heterogeneous and external sources
  • Data Cleaning: Detect errors in the data and rectify where possible
  • Data Transformation: Convert data from legacy or host format to warehouse format.
  • Load: Sort, summarize, consolidate, compute views, check integrity, and build indices and partitions.
  • Refresh: Propagate the updates from the data source to the warehouse.
19
Q

Metadata is the data defining warehouse objects. It stores:

A
  • Description of the structure of the data warehouse (schemas, views…)
  • Operational Metadata (Data lineage, currency of data…)
  • Technical Metadata (DB system names, tables etc…)
  • Algorithms used for summarization
  • The mapping from operational environment to the data warehouse
  • Data related to system performance (Data Warehouse Schema, View…)
  • Business Data (Business terms and definitions, ownership of data…)
20
Q

OLAP Server Architecture

A
  • ROLAP
  • MOLAP
  • HOLAP
  • Specialized SQL Servers
21
Q

Relational OLAP (ROLAP)

A

Uses a relational or extended-relational DBMS to store and manage warehouse data, with OLAP middleware on top. Includes optimization of the DBMS back end, implementation of aggregation-based navigation logic, and additional tools and services. Highly scalable.

22
Q

Multidimensional OLAP (MOLAP)

A

Uses a sparse, array-based multidimensional storage engine with fast indexing into pre-computed summarized data.
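
A minimal sparse-cube sketch in Python: cells are stored only for populated dimension combinations, and one summary level is pre-computed; the dimensions and figures are invented for illustration.

```python
from collections import defaultdict

# Sparse cube: only populated (product, region, month) cells are stored.
cube = {
    ("laptop", "EU", "Jan"): 120.0,
    ("laptop", "US", "Jan"): 90.0,
    ("phone",  "EU", "Feb"): 60.0,
}

# Pre-computed summary (roll-up over region), the kind of aggregate MOLAP keeps ready.
by_product_month = defaultdict(float)
for (product, _region, month), amount in cube.items():
    by_product_month[(product, month)] += amount

print(by_product_month[("laptop", "Jan")])   # -> 210.0
```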

23
Q

Hybrid OLAP (HOLAP)

A

Offers flexibility, e.g. low-level (detailed) data stored relationally, high-level (aggregated) data stored in arrays.

24
Q

Specialized SQL servers

A

e.g. Red Brick: specialized support for SQL queries over star/snowflake schemas.

25
Q

Data Warehouse Usage

A
  • Information Processing: Supports querying, basic statistical analysis, and reporting
  • Analytical Processing: Multidimensional analysis of data warehouse data. Supports basic OLAP operations.
26
Q

Data Mining:

A

Knowledge discovery from hidden patterns.
Supports association analysis, construction of analytical models, classification and prediction, and presentation of the mining results using visualization tools.

27
Q

Table Descriptive Metadata

A

Physical Name
Logical Name
Type: Fact, Dimension, Bridge
Role: Legacy, OLTP, Stage
DBMS: DB2, Informix, MSSQL Server, Oracle
Location
Definition

28
Q

Column Descriptive Metadata

A

Physical name
Logical name
Order in table
Datatype
Length
Decimal positions
Nullable/Required
Default Value
Edit Rules
Definition

29
Q

Extract-Transform-Loads (ETL)

A

Extract - The process by which data is extracted from the data source
Transform - The transformation of the source data into a format relevant to the solution
Load - The loading of data into the warehouse
Develop the process backwards: start from the business rules and code the ETL routines accordingly.
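
A minimal extract/transform/load sketch in Python, assuming a CSV source file and a sqlite3 warehouse table; the file name, column names and the currency-conversion rule are hypothetical.

```python
import csv
import sqlite3

def extract(path):
    # Extract: read raw rows from the (hypothetical) source file
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    # Transform: apply a business rule before loading (here: yen -> dollars, rate is made up)
    return [(r["order_id"], float(r["amount_jpy"]) * 0.0067) for r in rows]

def load(rows, conn):
    # Load: write the transformed rows into the warehouse fact table
    conn.execute("CREATE TABLE IF NOT EXISTS fact_sales (order_id TEXT, amount_usd REAL)")
    conn.executemany("INSERT INTO fact_sales VALUES (?, ?)", rows)
    conn.commit()

if __name__ == "__main__":
    warehouse = sqlite3.connect("warehouse.db")
    load(transform(extract("orders.csv")), warehouse)
```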

30
Q

ETL Strengths

A
  • Development Time: Since we are designing output backwards, only relevant data is extracted and processed.
  • Targeted Data: Since we know the outcome of the load process, only presentation-relevant data is present in the warehouse. Also simplifies security regime and administration overhead.
  • Tools Availability: A prolific number of tools are available that provide ETL functionality.
31
Q

ETL Weaknesses

A
  • Short-sighted requirements analysis: Targeting only relevant data means that future requirements will require changes to current ETL routines, maybe even re-designs.
  • Hardware: Use of ETL processes will require additional hardware to accommodate the requirements of the ETL engine.
  • Learning curve: 3rd party tools imply learning new scripting languages and applications. Lack of experience may lead to choosing difficult approaches.
32
Q

Extract Load Transform (ELT)

A

Extract the data into a Staging Database
Apply data integrity and business rules check in the Staging Database
Load clean data into the warehouse
Transformations are done within the data warehouse

33
Q

ELT Benefits

A
  • Removes dependency between the load and transformation process.
  • Extract and load process can include elements that may be needed in the future
  • Break the project into smaller chunks (more predictable and manageable)
  • Data integrity checks are performed earlier, so only cleaned and checked data is in the warehouse
34
Q

ELT Strengths

A
  • Project Management: Encourages specific and isolated tasks, therefore the project is manageable. Minimize risks by removing interdependencies between stages, and improving the change and maintenance phases due to process isolation.
  • Flexible and Future Proof: All data from data sources can be loaded in the warehouse. Since transformation is a separate process, it is possible to accommodate future requirements.
  • Utilizing existing hardware: Ability to use the tools provided by ODBMS. Alternatively, ELT can run on the same hardware as the data warehouse.
  • Utilize existing skill set: ELT utilizes the existing tools.
35
Q

ELT Weaknesses

A

Against the norm: ELT is an emerging technology and requires a paradigm shift.
Tools availability: There are not as many ELT tools available as there are ETL tools.

36
Q

Data Staging

A

The construction site of the warehouse. Required by most scenarios, connected to a wide variety of sources. Used to clean, aggregate, compute and validate data

37
Q

2 Models of remote staging

A

Remote staging: the staging area sits either within the data warehouse environment or outside the data warehouse entirely.
Onsite staging model: the data staging area sits within the operational environment (possibly affecting the operational system).

38
Q

Why do we need data marts?

A

Provide users with flexible access to frequently accessed data
Tailor data to match the specific needs of targeted user groups.
Improve end user response time for queries
Clearly define the scope of potential users, enabling targeted support.
Deliver data in structured forms suitable for end user tools and analytics.
Simplify and accelerate implementation compared to a corporate data warehouse
Lower implementation costs relative to a full-scale data warehouse.

39
Q

Standalone Data Mart

A

Focuses exclusively on a single subject area or department, independent of a broader data warehouse. Collects data from multiple transactional systems for a specific business need. May use dimensional or entity-relationship models. Accessed by querying it with analytical or business intelligence tools.

40
Q

Standalone Data Mart Adv.

A

Quick to build and deliver results
Lower costs compared to corporate data warehouses
Visible results in a short time frame

41
Q

Standalone Data Mart Disadv.

A

Requires multiple ETL processes for each data mart.
Risk of data duplication across marts.
Not future-proof. May not scale if departmental needs grow.
Limited to single department analysis, lacking cross-departmental insight.

42
Q

Dependent Data Mart

A

Derives its data from existing data warehouses, where data is pre-aggregated, restructured and summarized before being moved to the data mart. Addresses performance and scalability issues by offloading some processing from the data warehouse.

43
Q

Dependent Data Mart Benefits

A

Resolves performance bottlenecks by offloading data processing to the data mart.
Enhances security through departmental ownership of data marts.
Suitable for tracking KPIs over extended periods of time.

44
Q

Exploiting Data Warehouse Data

A

A data warehouse enables the creation of a wide variety of reports by solving data integration and feed challenges. To transform warehouse data into actionable insight we have to:
Organize data into meaningful models for representation
Develop user specific interfaces to meet diverse business needs

45
Q

Modeling the Data Warehouse

A

1 Define the business model: Understand requirements to determine scope and structure.
2 Create the dimensional model: Design star or snowflake schemas for organizing data into facts and dimensions.
3 Model summaries: Identify common aggregations to enhance performance.
4 Create the physical model: Define the physical implementation, including indexing, partitions, and storage optimization.

46
Q

Aggregates

A

Operations performed on a dataset to return a single value. They are crucial for summarizing large datasets in data warehouses and are commonly used in OLAP operations. Examples:
Sum: Total of all values
Count: Number of records
Average: Mean value
Min/max: Minimum or maximum value
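
A small illustration of these aggregates over a toy fact table, using Python's built-in sqlite3; the table and values are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO fact_sales VALUES (?, ?)",
                 [("EU", 100.0), ("EU", 250.0), ("US", 75.0)])

# Each aggregate collapses many rows into a single summary value per group.
for row in conn.execute("""
    SELECT region, SUM(amount), COUNT(*), AVG(amount), MIN(amount), MAX(amount)
    FROM fact_sales GROUP BY region
"""):
    print(row)   # e.g. ('EU', 350.0, 2, 175.0, 100.0, 250.0)
```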

47
Q

Relational OLAP: Dimensional Modelling using an RDBMS

A

Special schema design: Star, Snowflake
Special Indexes: Bitmap, Multi-Table join
Proven Technologies (RDBMS) tend to outperform specialized MDDB especially on large data sets.
Products such as IBM DB2, Oracle, Redbrick etc.

48
Q

Star Schema

A

A single fact table containing detail and summary data. The fact table's primary key is composed of one key column per dimension, and each key is system-generated. Each dimension is a single, highly denormalized table.

49
Q

Star Schema Benefits

A

Easy to understand, easy to define hierarchies, reduces number of physical joins, low maintenance, very simple metadata.

50
Q

Snowflake Schema

A

A normalized star schema.
Components: A Fact table with metric attributes and the foreign keys necessary to join the dimension tables, and dimension tables with reference attributes and metric attributes (normalized).

51
Q

Snowflake Schema Adv

A

May improve performance as smaller tables are joined. Easier to maintain and greater flexibility.

52
Q

Snowflake Schema Disadv

A

Increases the number of tables to work with. Increases complexity of queries.

53
Q

Star vs Snowflake Schema

A

Snowflake is easier to maintain because it has no redundancy.
Snowflake is less easy to understand as queries are more complex.
Snowflake has more foreign keys, hence longer query execution time
Snowflake has a higher number of joins.
Star only has a single dimension table for each dimension.
Star should be used when dimension tables contain fewer rows; snowflake is better suited to larger dimension tables.
Dimension tables are normalized in snowflake but denormalized in star. Fact tables always denormalized.
Snowflake is bottom up, star is top down.

54
Q

MDDB

A

A special-purpose data model. Facts are stored in multi-dimensional arrays, and dimensions are used to index the arrays; sometimes built on top of a relational DB.

55
Q

MOLAP: Dimensional Modelling using the multi-dimensional model

A

The application layer of MOLAP stores data in a multidimensional structure
The presentation layer provides the multidimensional view
Efficient storage and processing
Complexity hidden from user but not developer.
Analysis using pre-aggregated summaries and precalculated measures.

56
Q

Important Points about MOLAP

A

Precalculated or pre-consolidated transactional data improves speed
However, MDDBs require an enormous amount of overhead in processing time and storage.
Application design is the definition of dimensions and calculation rules.

57
Q

Hybrid OLAP (HOLAP) Advantages

A

Best of both worlds
Storing detailed data in RDBMS
Storing aggregate data in MDBMS
User access via MOLAP tools

58
Q

Technology Choice

A

Performance: How fast will the system appear to the end user? Key point.
Data volume and scalability: While MDD servers can handle up to 50GB of storage, RDBMS servers can handle hundreds of GBs and TBs.

59
Q

Use RDBMS/ROLAP solution for data mart if:

A

Data is over 100GB
Read only
Historical data at lowest level of granularity
Detailed access, long-running queries
Data assigned to lowest level elements

60
Q

Use MDD/MOLAP solution for data mart if:

A

You require write access
Your data is under 50GB
Time to implement is 60-90 days
Lowest level already aggregated
Data access on aggregated level
Developing a general purpose application for inventory movement or asset management

61
Q

Use HOLAP for data mart if:

A

OLAP on aggregated and detailed data
Different user groups
Ease of use and detailed data

62
Q

Implementing the Data Warehouse

A

Four main processes involved in implementing a data warehouse:
Monitor: send data from sources
Integrate: Loading, Cleansing
Process: Query processing, indexing
Manage: metadata, design

63
Q

Monitoring

A

Source types such as relational, flat file etc. Monitoring techniques used such as:
Periodic Snapshots
Database Triggers
Log shipping
Data shipping
Transaction shipping
Polling
Application level monitoring

64
Q

Issues with monitoring

A

Frequency: Periodic or triggered on event
Data transformation: Convert data to uniform format, remove & add fields
Standards, e.g. ODBC.

65
Q

Integration

A

Data cleaning, data loading and derived data.

66
Q

Reasons for dirty data:

A

Dummy values
Absence of data
Multipurpose fields
Cryptic data
Contradicting data
Inappropriate use of address lines
Violation of business rules
Reused primary keys
Non-unique identifiers
Data Integration problems

67
Q

Data Cleaning

A

Migration, e.g. yen to dollars
Scrubbing: using domain-specific knowledge
Fusion, e.g. mailing-list and customer merging
Auditing: discovering rules and relationships (like data mining)

68
Q

Parsing

A

Parsing locates and identifies individual data elements in the source files and then isolates these data elements in the target files. Ex. Parsing City and State.

69
Q

Correcting

A

Corrects the parsed individual data components using sophisticated data algorithms and secondary data sources, e.g. adding a zip code.

70
Q

Standardizing

A

Applies conversion routines to transform data into its preferred format using both standard and custom business rules, e.g. adding a pre-name.

71
Q

Matching

A

Searching for and matching records within and across the parsed, corrected and standardized data, based on predefined business rules, to eliminate duplications, e.g. identifying similar names and addresses.
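
A minimal matching sketch using Python's standard difflib; the 0.85 similarity threshold is an arbitrary assumption standing in for a real business rule.

```python
from difflib import SequenceMatcher

def is_match(a: str, b: str, threshold: float = 0.85) -> bool:
    # Treat two standardized name strings as the same customer if they
    # are sufficiently similar (the threshold is a stand-in business rule).
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

print(is_match("John A. Smith", "John Smith"))    # likely a duplicate
print(is_match("John A. Smith", "Joan Schmidt"))  # likely distinct (below the threshold)
```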

72
Q

Consolidating

A

Analyzing and identifying relationships between matched records and consolidating/merging them into one representation.

73
Q

Householding

A

Identifying all members of a household. Ensures that you are able to target one household (ex in marketing). Can lead to substantial savings.

74
Q

Enrichment

A

Bring data from external sources to augment/enrich operational data.

75
Q

Scoring

A

Computation of a probability of an event ex. Chance a customer will defect to a different brand.

76
Q

Loading Data

A

Physically move to the data warehouse
Incremental vs refresh
Offline vs Online
Frequency of loading

77
Q

Refresh Techniques

A

Full extract from base tables. Reading the entire source table is too expensive, but it may be the only choice for legacy systems.

78
Q

Detecting changes

A

Create a snapshot log table that records the ids and timestamps of updated rows in the source data. Detect changes either by defining after-row triggers that update the snapshot log when the source table changes, or by using the regular transaction log to detect changes to the source data.
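
A minimal snapshot-log sketch with an after-row trigger, using Python's built-in sqlite3; the table names and columns are illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers    (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
CREATE TABLE snapshot_log (id INTEGER, changed_at TEXT DEFAULT CURRENT_TIMESTAMP);

-- After-row trigger: record the id (and timestamp) of every updated source row
CREATE TRIGGER customers_au AFTER UPDATE ON customers
FOR EACH ROW
BEGIN
    INSERT INTO snapshot_log (id) VALUES (NEW.id);
END;
""")

conn.execute("INSERT INTO customers VALUES (1, 'Alice', 'Valletta')")
conn.execute("UPDATE customers SET city = 'Sliema' WHERE id = 1")

# A later refresh job reads snapshot_log to extract only the changed rows.
print(conn.execute("SELECT * FROM snapshot_log").fetchall())
```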

79
Q

Index Structures

A

Traditional Access Methods: B-Trees, Hash tables, R-Trees
Popular in Warehouses: Inverted index, Bitmap Indexes, Join indexes

80
Q

Inverted Indexes

A

Each index term is associated with an inverted list, which contains the list of documents and the word occurrences within those documents, along with other information. Each entry is called a posting; the part of a posting that refers to a specific document or location is called a pointer. Each document in the collection is given a unique number, and lists are usually document-ordered (sorted by document number).
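
A minimal inverted-index sketch in Python: each term maps to a document-ordered posting list of (document number, positions) pairs, matching the description above; the documents are invented.

```python
from collections import defaultdict

docs = {1: "star schema fact table", 2: "snowflake schema dimension table"}

# term -> posting list; each posting is (doc number, positions of the term in that doc)
index = defaultdict(list)
for doc_id in sorted(docs):                    # keep posting lists document-ordered
    words = docs[doc_id].split()
    for term in set(words):
        positions = [i for i, w in enumerate(words) if w == term]
        index[term].append((doc_id, positions))

print(index["schema"])   # postings in both documents
print(index["fact"])     # posting in document 1 only
```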

81
Q

Bitmap Indexes

A

A collection of bitmaps, one for each distinct value of the column. Each bitmap has N bits, where N is the number of rows in the table. The bit corresponding to value v for row r is set iff r has value v for the indexed attribute. Good if the domain cardinality is small. Bit vectors can be compressed (around 30:1).
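
A minimal bitmap-index sketch in Python over a low-cardinality column; Python integers stand in for the bit vectors, and the data is invented.

```python
# Rows of a fact table; 'region' has small domain cardinality, so it suits a bitmap index.
regions = ["EU", "US", "EU", "ASIA", "US", "EU"]

# One bitmap per distinct value: bit r is set iff row r has that value.
bitmaps = {}
for r, value in enumerate(regions):
    bitmaps[value] = bitmaps.get(value, 0) | (1 << r)

def rows(bitmap):
    # Decode a bitmap back into the matching row numbers.
    return [r for r in range(len(regions)) if bitmap & (1 << r)]

print(rows(bitmaps["EU"]))                    # rows with region = EU -> [0, 2, 5]
print(rows(bitmaps["EU"] | bitmaps["US"]))    # OR of two bitmaps answers region IN (EU, US)
```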

82
Q

Join Indexes

A

Pre-computed joins. A join index between a fact table and a dimension table correlates a dimension tuple with the fact tuples that have the same value on the common dimensional attribute. E.g. a join index on the city dimension of a calls fact table correlates, for each city, the calls made from that city. Join indexes can also span multiple dimension tables, e.g. a join index on the city and time dimensions of the fact table.
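
A minimal join-index sketch in Python: the index pre-computes, for each city key, which rows of the calls fact table join to it; the city names and fact rows are invented.

```python
from collections import defaultdict

# Dimension and fact data (invented for illustration).
city_dim = {10: "Valletta", 20: "Mdina"}
calls_fact = [  # (row id, city_key, duration)
    (0, 10, 120), (1, 20, 45), (2, 10, 300),
]

# Join index: dimension key -> fact row ids sharing the same city_key.
join_index = defaultdict(list)
for row_id, city_key, _duration in calls_fact:
    join_index[city_key].append(row_id)

# Answering "calls from Valletta" uses the pre-computed index instead of a join scan.
valletta_key = 10
print(join_index[valletta_key])   # -> [0, 2]
```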

83
Q

Volume Analysis

A

Estimates of average and max values for object sizes. Number of instances per entity. Realistic storage requirements (day one and growth over time). Headings: Row max, Row Avg, Record size, table max, table avg.

84
Q

Usage Analysis (Internal Constraints)

A

Identify major transactions to hit each table and their frequency.
Insertions/updates/retrievals/deletes, or a mixture of all four.
Categorize volatile tables (transaction tables) vs non-volatile tables (reference tables), e.g. by turnover/population over a year.

85
Q

Transactional Analysis (Other Constraints)

A

Documenting critical transactions to hit DB.
Impact on systems resources.
Objects accessed and type of access.
Any constraints placed on processing (ex. time)

86
Q

Access Requirements

A

Type of access + type of usage = headache, e.g. indexes both increase and decrease performance. Time sharing or avoidance (forbidden queries).

87
Q

Integrity Analysis

A

Traditional DB territory:
Entity Constraints (primary key: no duplicates)
Referential Integrity (foreign key reference is not null and valid)
Domain Constraints (e.g. credit limit is numeric with accepted values 1000, 2000, 5000)
Enforcing these is a big hit on performance, which cascades upwards.

88
Q

Security Analysis

A

Outside the frame of the DB are physical access to the building, logon procedures, etc. It is difficult to contemplate DB security on its own, e.g. at the operating-system level. Large-scale DBMSs provide specific utilities (preventing unauthorized access to data and preventing misuse of DBMS functions). Use usage analysis to derive user profiles with associated access rights.

89
Q

Data Warehouse and Security

A

A data warehouse by nature is an open, accessible system
The aim of a data warehouse is to make large amounts of data easily accessible to users, this enables them to extract information about the business as a whole
Any security restrictions can be seen as obstacles to that goal
They can become constraints on the design of the warehouse

90
Q

Security Requirements

A

Describe all security conditions that have to be considered in a data warehouse environment. The first step for this is to classify the security objects and security subjects of the environment. Security objects can be classified by sensitivity (public, confidential) or job function (accounting, personnel).

91
Q

Legal Requirements

A

Which arrangements have to be made to be allowed to hold legally sensitive data? Which data is subject to legal restrictions? Which separate handling does this data require concerning storage, access and maintenance? Which analysis may be performed on this data? If data is held online for trend analysis in summarized form, do legal restrictions apply?

92
Q

Audit Requirements

A

Security audit information is the basis for further reviews and examinations; these are done to test the adequacy of system controls and to recommend changes to the security policy.
The following activities are of interest for security audits: connections, disconnections, access to data, change of data, deletion of data.

93
Q

Network Requirements

A

An important part of the security requirements concerns the transfer of data from the source system (usually an operational system) into the data warehouse. The data is mostly transmitted over a network, so precautions must be taken to maintain its confidentiality and integrity.

94
Q

Security Model Requirements

A

Data Filtration and encryption prior to data warehouse.
Classical Security Model
Hybrid
View Based

95
Q

Classic Security Model

A

Covers all stages, from requirement analysis to the physical schema. During this process flow, security measures are applied so that the proper and secure data warehouse schema can be developed. This may be fruitful if implemented in transactional databases, but as far as data warehouses are concerned it may be unsuitable.

96
Q

Password Management

A

Accounts get locked after failed attempts
Password history so passwords cannot be reused
Characteristics of user’s accounts are set in profile (Password lifetime, number of failed attempts, frequency of changes, minimum length).

97
Q

Standard Roles

A

Before the users are active, we need to assign roles.
CONNECT: access to DB and any other granted rights
RESOURCE: creation of objects
DBA: unlimited space quotas and the ability to grant privileges to other users.

98
Q

Grant Role

A

CONNECT and RESOURCE don’t automatically give access to other people’s resources.
Need to grant access.
Object privilege = Select, Insert, Update, Delete
Can be granted on certain columns and not on others

99
Q

Security Layers

A

There is an old debate regarding the best level at which to implement security. Databases contain data of increasing value. They are increasingly open to the outside.
Possible layers of security: By user, By table (using policies), At row level

100
Q

Three Aspects of Security

A

Confidentiality: Information is not disclosed to unauthorized users
Integrity: Modifications of data must be legitimate
Availability: Authorized users should not be refused access (may happen by mistake)

101
Q

Costs of Security

A

Significant performance impact
Complexity due to multiple levels
Possible errors from conflicting policies
Combinations of measures may also have unpredictable effects on users' queries
May conflict with constraints set across tables (such as referential integrity)