Data Warehousing Flashcards

1
Q

Data warehouse

A

Decision support database that is maintained separately from the organization’s operational database. It supports information processing by providing a solid platform of consolidated, historical data for analysis.

2
Q

Subject-Oriented Data Warehouses

A

Organized around major subjects such as customers, products and sales. Focuses on the modelling and analysis of data for decision makers, not on daily operations or transaction processing. Provides a simple and concise view of particular subject issues by excluding data that is not useful in the decision support process.

3
Q

Integrated Data Warehouses

A

Constructed by integrating multiple heterogeneous data sources such as relational DBs, flat files, Excel files and transaction records. Data cleaning and data integration techniques are applied to ensure consistency in naming conventions, encoding structures, attribute measures, etc. among the different data sources. Data is converted as it is moved into the warehouse.

4
Q

Time Variant in Data Warehouses

A

The time horizon for data warehouses is much longer than that of operational systems, as they provide historical data from the past 5-10 years. Every structure in the data warehouse contains an element of time, explicitly or implicitly, whereas the key of operational data may or may not contain a time element.

5
Q

Non-Volatile Data Warehouse

A

Physically separate store of data transformed from the operational environment. Operational update of data doesn’t occur in the data warehouse environment. It doesn’t require transaction processing, recovery and concurrency control mechanisms. It requires only two operations in accessing data: initial loading of data and access of data.

6
Q

heterogeneous DBMS

A

Integrates disparate databases to provide a unified query interface for users. This system relies on a query-driven approach.

7
Q

A data warehouse is an _____ driven system. What does it focus on?

A

Update-driven: it integrates and stores information from heterogeneous sources in advance. It focuses on high performance for querying and analysis.

8
Q

Operational DBMS

A

Focuses on Online Transaction Processing (OLTP), while a data warehouse focuses on Online Analytical Processing (OLAP).

9
Q

Fact Table

A

Contains measurable data or metrics about business processes, e.g. sales.

10
Q

Dimension Table

A

Contains descriptive attributes used to label and filter data, e.g. product names.

11
Q

Why do we need a separate warehouse?

A
  1. Performance: OLTP systems are tuned for transaction processing whilst OLAP is optimized for complex analytical queries, multidimensional views and data aggregation
  2. Historical Data: Operational databases typically lack historical data, which is critical for decision making
  3. Data Consolidation: Decision-making often requires aggregated and summarized data from multiple heterogeneous sources.
  4. Data Quality: Ensures consistency in data representation, codes and formats from diverse sources
12
Q

Data Warehouse Design Overview
Key Design Perspectives

A

1 Top-Down View: Identifies relevant information needed for the data warehouse.
2 Data Source View: Exposes data being captured and managed by operational systems
3 Data Warehouse View: Defines fact tables and dimension tables for analytical use.
4 Business Query View: Represents how end users will query and interact with the data

13
Q

Design Process

A

Top Down: Focuses on overall planning and design
Bottom Up: Starts with prototypes and experiments
Waterfall: Structured, step by step process
Spiral: Iterative development with quick functional prototypes.

14
Q

Design Steps

A

1 Choose a business process: Identify the processes to model, e.g. orders.
2 Define the grain: Decide the atomic level of data granularity (e.g. individual orders or daily summaries).
3 Select Dimensions: Identify descriptive attributes (e.g. time or location).
4 Choose Measures: Define numerical metrics (e.g. sales amount) for the fact table (see the schema sketch after this list).
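
A minimal sketch of these four steps as a star schema, using Python's built-in sqlite3; the table and column names (fact_orders, dim_time, dim_location, sales_amount) are illustrative assumptions, not part of the card.

```python
import sqlite3

# Illustrative star schema for an "orders" business process (names are assumptions).
conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Step 3: dimension tables hold the descriptive attributes
CREATE TABLE dim_time     (time_key INTEGER PRIMARY KEY, order_date TEXT, month TEXT, year INTEGER);
CREATE TABLE dim_location (loc_key  INTEGER PRIMARY KEY, city TEXT, country TEXT);

-- Steps 1, 2 and 4: one fact row per individual order (the chosen grain),
-- with sales_amount as the numerical measure
CREATE TABLE fact_orders (
    time_key     INTEGER REFERENCES dim_time(time_key),
    loc_key      INTEGER REFERENCES dim_location(loc_key),
    sales_amount REAL
);
""")
```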

15
Q

Enterprise Warehouse

A

Collects all subject information spanning the entire organization. Features a unified approach to organizing and representing data, the ability to classify data by subject, and the ability to give the relevant divisions (sales, etc.) access to those classifications. Normalized design.

16
Q

DataMart

A

A DataMart is the access layer of the data warehouse environment. Used to get data out to the users. The data mart is a subset of the data warehouse that is usually oriented to a specific business line or team. Data marts represent small slices of the data warehouse.

17
Q

Virtual Warehouse

A

A set of SQL views over operational databases. Only some of the possible summary views may be persisted.

18
Q

Data Warehouse Tools and Utilities

A
  • Data Extraction: Get data from multiple heterogeneous and external sources
  • Data Cleaning: Detect errors in the data and rectify where possible
  • Data Transformation: Convert data from legacy or host format to warehouse format.
  • Load: Sort, summarize, consolidate, compute views, check integrity, and build indices and partitions.
  • Refresh: Propagate the updates from the data source to the warehouse.
19
Q

Metadata is the data defining warehouse objects. It stores:

A
  • Description of the structure of the data warehouse (schemas, views…)
  • Operational Metadata (Data lineage, currency of data…)
  • Technical Metadata (DB system names, tables etc…)
  • Algorithms used for summarization
  • The mapping from operational environment to the data warehouse
  • Data related to system performance (Data Warehouse Schema, View…)
  • Business Data (Business terms and definitions, ownership of data…)
20
Q

OLAP Server Architecture

A
  • ROLAP
  • MOLAP
  • HOLAP
  • Specialized SQL Servers
21
Q

Relational OLAP (ROLAP)

A

Uses a relational or extended-relational DBMS to store and manage warehouse data, with OLAP middleware on top. Includes optimization of the DBMS back end, implementation of aggregation-based navigation logic, and additional tools and services. Highly scalable.

22
Q

Multidimensional OLAP (MOLAP)

A

Uses a sparse, array-based multidimensional storage engine with fast indexing into pre-computed summarized data.
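
A minimal sparse-cube sketch in Python: cells are stored only for populated dimension combinations, and one summary level is pre-computed; the dimensions and figures are invented for illustration.

```python
from collections import defaultdict

# Sparse cube: only populated (product, region, month) cells are stored.
cube = {
    ("laptop", "EU", "Jan"): 120.0,
    ("laptop", "US", "Jan"): 90.0,
    ("phone",  "EU", "Feb"): 60.0,
}

# Pre-computed summary (roll-up over region), the kind of aggregate MOLAP keeps ready.
by_product_month = defaultdict(float)
for (product, _region, month), amount in cube.items():
    by_product_month[(product, month)] += amount

print(by_product_month[("laptop", "Jan")])   # -> 210.0
```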

23
Q

Hybrid OLAP (HOLAP)

A

Offers flexibility, e.g. low-level (detailed) data stored relationally, high-level (aggregated) data stored in arrays.

24
Q

Specialized SQL servers

A

e.g. Red Brick: specialized support for SQL queries over star/snowflake schemas.

25
Q

Data Warehouse Usage

A
  • Information Processing: Supports querying, basic statistical analysis, and reporting
  • Analytical Processing: Multidimensional analysis of data warehouse data. Supports basic OLAP operations.
26
Q

Data Mining:

A

Knowledge discovery from hidden patterns.
Supports association analysis, construction of analytical models, classification and prediction, and presentation of the mining results using visualization tools.

27
Q

Table Descriptive Metadata

A

Physical Name
Logical Name
Type: Fact, Dimension, Bridge
Role: Legacy, OLTP, Stage
DBMS: DB2, Informix, MSSQL Server, Oracle
Location
Definition

28
Q

Column Descriptive Metadata

A

Physical name
Logical name
Order in table
Datatype
Length
Decimal positions
Nullable/Required
Default Value
Edit Rules
Definition

29
Q

Extract-Transform-Loads (ETL)

A

Extract - The process by which data is extracted from the data source
Transform - The transformation of the source data into a format relevant to the solution
Load - The loading of data into the warehouse
Develop the process backwards: start from the business rules and code the ETL routines accordingly.
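
A minimal extract/transform/load sketch in Python, assuming a CSV source file and a sqlite3 warehouse table; the file name, column names and the currency-conversion rule are hypothetical.

```python
import csv
import sqlite3

def extract(path):
    # Extract: read raw rows from the (hypothetical) source file
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    # Transform: apply a business rule before loading (here: yen -> dollars, rate is made up)
    return [(r["order_id"], float(r["amount_jpy"]) * 0.0067) for r in rows]

def load(rows, conn):
    # Load: write the transformed rows into the warehouse fact table
    conn.execute("CREATE TABLE IF NOT EXISTS fact_sales (order_id TEXT, amount_usd REAL)")
    conn.executemany("INSERT INTO fact_sales VALUES (?, ?)", rows)
    conn.commit()

if __name__ == "__main__":
    warehouse = sqlite3.connect("warehouse.db")
    load(transform(extract("orders.csv")), warehouse)
```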

30
Q

ETL Strengths

A
  • Development Time: Since we are designing output backwards, only relevant data is extracted and processed.
  • Targeted Data: Since we know the outcome of the load process, only presentation-relevant data is present in the warehouse. Also simplifies security regime and administration overhead.
  • Tools Availability: A prolific number of tools are available that provide ETL functionality.
31
Q

ETL Weaknesses

A
  • Short-sighted requirements analysis: Targeting only relevant data means that future requirements will require changes to current ETL routines, maybe even re-designs.
  • Hardware: Use of ETL processes will require additional hardware to accommodate the requirements of the ETL engine.
  • Learning curve: 3rd party tools imply learning new scripting languages and applications. Lack of experience may lead to choosing difficult approaches.
32
Q

Extract Load Transform (ELT)

A

Extract the data into a Staging Database
Apply data integrity and business rules check in the Staging Database
Load clean data into the warehouse
Transformations are done within the data warehouse

33
Q

ELT Benefits

A
  • Removes dependency between the load and transformation process.
  • Extract and load process can include elements that may be needed in the future
  • Break the project into smaller chunks (more predictable and manageable)
  • Data integrity checks are performed earlier, so only cleaned and checked data is in the warehouse
34
Q

ELT Strengths

A
  • Project Management: Encourages specific and isolated tasks, therefore the project is manageable. Minimize risks by removing interdependencies between stages, and improving the change and maintenance phases due to process isolation.
  • Flexible and Future Proof: All data from data sources can be loaded in the warehouse. Since transformation is a separate process, it is possible to accommodate future requirements.
  • Utilizing existing hardware: Ability to use the tools provided by ODBMS. Alternatively, ELT can run on the same hardware as the data warehouse.
  • Utilize existing skill set: ELT utilizes the existing tools.
35
Q

ELT Weaknesses

A

Against the norm: ELT is an emerging technology and requires a paradigm shift.
Tools availability: There are not as many ELT tools available as there are ETL tools.

36
Q

Data Staging

A

The construction site of the warehouse. Required by most scenarios, connected to a wide variety of sources. Used to clean, aggregate, compute and validate data

37
Q

2 Models of remote staging

A

Remote staging: the staging area sits either within the data warehouse environment or outside the data warehouse entirely.
Onsite staging model: the data staging area sits within the operational environment (possibly affecting the operational system).

38
Q

Why do we need data marts?

A

Provide users with flexible access to frequently accessed data
Tailor data to match the specific needs of targeted user groups.
Improve end user response time for queries
Clearly define the scope of potential users, enabling targeted support.
Deliver data in structured forms suitable for end user tools and analytics.
Simplify and accelerate implementation compared to a corporate data warehouse
Lower implementation costs relative to a full-scale data warehouse.

39
Q

Standalone Data Mart

A

Focuses exclusively on a single subject area or department, independent of a broader data warehouse. Collects data from multiple transactional systems for a specific business need. May use dimensional or entity-relationship models. Accessed by querying it with analytical or business intelligence tools.

40
Q

Standalone Data Mart Adv.

A

Quick to build and deliver results
Lower costs compared to corporate data warehouses
Visible results in a short time frame

41
Q

Standalone Data Mart Disadv.

A

Requires multiple ETL processes for each data mart.
Risk of data duplication across marts.
Not future-proof. May not scale if departmental needs grow.
Limited to single department analysis, lacking cross-departmental insight.

42
Q

Dependent Data Mart

A

Derives its data from existing data warehouses, where data is pre-aggregated, restructured and summarized before being moved to the data mart. Addresses performance and scalability issues by offloading some processing from the data warehouse.

43
Q

Dependent Data Mart Benefits

A

Resolves performance bottlenecks by offloading data processing to the data mart.
Enhances security through departmental ownership of data marts.
Suitable for tracking KPIs over extended periods of time.

44
Q

Exploiting Data Warehouse Data

A

A data warehouse enables the creation of a wide variety of reports by solving data integration and feed challenges. To transform warehouse data into actionable insight we have to:
Organize data into meaningful models for representation
Develop user specific interfaces to meet diverse business needs

45
Q

Modeling the Data Warehouse

A

1 Define the business model: Understand requirements to determine scope and structure.
2 Create the dimensional model: Design star or snowflake schemas for organizing data into facts and dimensions.
3 Model summaries: Identify common aggregations to enhance performance.
4 Create the physical model: Define the physical implementation, including indexing, partitions, and storage optimization.

46
Q

Aggregates

A

Operations performed on a dataset to return a single value. They are crucial for summarizing large datasets in data warehouses and are commonly used in OLAP operations. Examples:
Sum: Total of all values
Count: Number of records
Average: Mean value
Min/max: Minimum or maximum value
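
A small illustration of these aggregates over a toy fact table, using Python's built-in sqlite3; the table and values are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO fact_sales VALUES (?, ?)",
                 [("EU", 100.0), ("EU", 250.0), ("US", 75.0)])

# Each aggregate collapses many rows into a single summary value per group.
for row in conn.execute("""
    SELECT region, SUM(amount), COUNT(*), AVG(amount), MIN(amount), MAX(amount)
    FROM fact_sales GROUP BY region
"""):
    print(row)   # e.g. ('EU', 350.0, 2, 175.0, 100.0, 250.0)
```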

47
Q

Relational OLAP: Dimensional Modelling using an RDBMS

A

Special schema design: Star, Snowflake
Special Indexes: Bitmap, Multi-Table join
Proven Technologies (RDBMS) tend to outperform specialized MDDB especially on large data sets.
Products such as IBM DB2, Oracle, Redbrick etc.

48
Q

Star Schema

A

A single fact table containing detail and summary data. The fact table's primary key is composed of one key column per dimension, and each key is system-generated. Each dimension is a single, highly denormalized table.

49
Q

Star Schema Benefits

A

Easy to understand, easy to define hierarchies, reduces number of physical joins, low maintenance, very simple metadata.

50
Q

Snowflake Schema

A

A normalized star schema.
Components: A Fact table with metric attributes and the foreign keys necessary to join the dimension tables, and dimension tables with reference attributes and metric attributes (normalized).

51
Q

Snowflake Schema Adv

A

May improve performance as smaller tables are joined. Easier to maintain and greater flexibility.

52
Q

Snowflake Schema Disadv

A

Increases the number of tables to work with. Increases complexity of queries.

53
Q

Star vs Snowflake Schema

A

Snowflake is easier to maintain because it has no redundancy.
Snowflake is less easy to understand as queries are more complex.
Snowflake has more foreign keys, hence longer query execution time
Snowflake has a higher number of joins.
Star only has a single dimension table for each dimension.
Star should be used when dimension tables contain fewer rows; snowflake is better suited to larger dimension tables.
Dimension tables are normalized in snowflake but denormalized in star. Fact tables always denormalized.
Snowflake is bottom up, star is top down.

54
Q

MDDB

A

A special-purpose data model. Facts are stored in multi-dimensional arrays, and dimensions are used to index the arrays; sometimes built on top of a relational DB.

55
Q

MOLAP: Dimensional Modelling using the multi-dimensional model

A

The application layer of MOLAP stores data in a multidimensional structure
The presentation layer provides the multidimensional view
Efficient storage and processing
Complexity hidden from user but not developer.
Analysis using pre-aggregated summaries and precalculated measures.

56
Q

Important Points about MOLAP

A

Precalculated or pre-consolidated transactional data improves speed
However, MDDBs require an enormous amount of overhead in processing time and storage.
Application design is the definition of dimensions and calculation rules.

57
Q

Hybrid OLAP (HOLAP) Advantages

A

Best of both worlds
Storing detailed data in RDBMS
Storing aggregate data in MDBMS
User access via MOLAP tools

58
Q

Technology Choice

A

Performance: How fast will the system appear to the end user? Key point.
Data volume and scalability: While MDD servers can handle up to 50GB of storage, RDBMS servers can handle hundreds of GBs and TBs.

59
Q

Use RDBMS/ROLAP solution for data mart if:

A

Data is over 100GB
Read only
Historical data at lowest level of granularity
Detailed access, long-running queries
Data assigned to lowest level elements

60
Q

Use MDD/MOLAP solution for data mart if:

A

You require write access
Your data is under 50GB
Time to implement is 60-90 days
Lowest level already aggregated
Data access on aggregated level
Developing a general purpose application for inventory movement or asset management

61
Q

Use HOLAP for data mart if:

A

OLAP on aggregated and detailed data
Different user groups
Ease of use and detailed data

62
Q

Implementing the Data Warehouse

A

Four main processes involved in implementing a data warehouse:
Monitor: send data from sources
Integrate: Loading, Cleansing
Process: Query processing, indexing
Manage: metadata, design

63
Q

Monitoring

A

Source types such as relational, flat file etc. Monitoring techniques used such as:
Periodic Snapshots
Database Triggers
Log shipping
Data shipping
Transaction shipping
Polling
Application level monitoring

64
Q

Issues with monitoring

A

Frequency: Periodic or triggered on event
Data transformation: Convert data to uniform format, remove & add fields
Standards, e.g. ODBC.

65
Q

Integration

A

Data cleaning, data loading and derived data.

66
Q

Reasons for dirty data:

A

Dummy values
Absence of data
Multipurpose fields
Cryptic data
Contradicting data
Inappropriate use of address lines
Violation of business rules
Reused primary keys
Non-unique identifiers
Data Integration problems

67
Q

Data Cleaning

A

Migration, e.g. yen to dollars
Scrubbing: using domain-specific knowledge
Fusion, e.g. mailing-list and customer merging
Auditing: discovering rules and relationships (like data mining)

68
Q

Parsing

A

Parsing locates and identifies individual data elements in the source files and then isolates these data elements in the target files. Ex. Parsing City and State.

69
Q

Correcting

A

Corrects the parsed individual data components using sophisticated data algorithms and secondary data sources, e.g. adding a zip code.

70
Q

Standardizing

A

Applies conversion routines to transform data into its preferred format using both standard and custom business rules, e.g. adding a pre-name.

71
Q

Matching

A

Searching for and matching records within and across the parsed, corrected and standardized data, based on predefined business rules, to eliminate duplications, e.g. identifying similar names and addresses.
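
A minimal matching sketch using Python's standard difflib; the 0.85 similarity threshold is an arbitrary assumption standing in for a real business rule.

```python
from difflib import SequenceMatcher

def is_match(a: str, b: str, threshold: float = 0.85) -> bool:
    # Treat two standardized name strings as the same customer if they
    # are sufficiently similar (the threshold is a stand-in business rule).
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

print(is_match("John A. Smith", "John Smith"))    # likely a duplicate
print(is_match("John A. Smith", "Joan Schmidt"))  # likely distinct (below the threshold)
```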

72
Q

Consolidating

A

Analyzing and identifying relationships between matched records and consolidating/merging them into one representation.

73
Q

Householding

A

Identifying all members of a household. Ensures that you are able to target one household (ex in marketing). Can lead to substantial savings.

74
Q

Enrichment

A

Bring data from external sources to augment/enrich operational data.

75
Q

Scoring

A

Computation of a probability of an event ex. Chance a customer will defect to a different brand.

76
Q

Loading Data

A

Physically move to the data warehouse
Incremental vs refresh
Offline vs Online
Frequency of loading

77
Q

Refresh Techniques

A

Full extract from base tables. Reading the entire source table is too expensive, but it may be the only choice for legacy systems.

78
Q

Detecting changes

A

Create a snapshot log table that records the ids and timestamps of updated rows in the source data. Detect changes either by defining after-row triggers that update the snapshot log when the source table changes, or by using the regular transaction log to detect changes to the source data.
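
A minimal snapshot-log sketch with an after-row trigger, using Python's built-in sqlite3; the table names and columns are illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers    (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
CREATE TABLE snapshot_log (id INTEGER, changed_at TEXT DEFAULT CURRENT_TIMESTAMP);

-- After-row trigger: record the id (and timestamp) of every updated source row
CREATE TRIGGER customers_au AFTER UPDATE ON customers
FOR EACH ROW
BEGIN
    INSERT INTO snapshot_log (id) VALUES (NEW.id);
END;
""")

conn.execute("INSERT INTO customers VALUES (1, 'Alice', 'Valletta')")
conn.execute("UPDATE customers SET city = 'Sliema' WHERE id = 1")

# A later refresh job reads snapshot_log to extract only the changed rows.
print(conn.execute("SELECT * FROM snapshot_log").fetchall())
```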

79
Q

Index Structures

A

Traditional Access Methods: B-Trees, Hash tables, R-Trees
Popular in Warehouses: Inverted index, Bitmap Indexes, Join indexes

80
Q

Inverted Indexes

A

Each index term is associated with an inverted list, which contains the list of documents and the word occurrences within those documents, along with other information. Each entry is called a posting; the part of a posting that refers to a specific document or location is called a pointer. Each document in the collection is given a unique number, and lists are usually document-ordered (sorted by document number).
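
A minimal inverted-index sketch in Python: each term maps to a document-ordered posting list of (document number, positions) pairs, matching the description above; the documents are invented.

```python
from collections import defaultdict

docs = {1: "star schema fact table", 2: "snowflake schema dimension table"}

# term -> posting list; each posting is (doc number, positions of the term in that doc)
index = defaultdict(list)
for doc_id in sorted(docs):                    # keep posting lists document-ordered
    words = docs[doc_id].split()
    for term in set(words):
        positions = [i for i, w in enumerate(words) if w == term]
        index[term].append((doc_id, positions))

print(index["schema"])   # postings in both documents
print(index["fact"])     # posting in document 1 only
```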

81
Q

Bitmap Indexes

A

A collection of bitmaps, one for each distinct value of the column. Each bitmap has N bits, where N is the number of rows in the table. The bit corresponding to value v for row r is set iff r has value v for the indexed attribute. Good if the domain cardinality is small. Bit vectors can be compressed (around 30:1).
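
A minimal bitmap-index sketch in Python over a low-cardinality column; Python integers stand in for the bit vectors, and the data is invented.

```python
# Rows of a fact table; 'region' has small domain cardinality, so it suits a bitmap index.
regions = ["EU", "US", "EU", "ASIA", "US", "EU"]

# One bitmap per distinct value: bit r is set iff row r has that value.
bitmaps = {}
for r, value in enumerate(regions):
    bitmaps[value] = bitmaps.get(value, 0) | (1 << r)

def rows(bitmap):
    # Decode a bitmap back into the matching row numbers.
    return [r for r in range(len(regions)) if bitmap & (1 << r)]

print(rows(bitmaps["EU"]))                    # rows with region = EU -> [0, 2, 5]
print(rows(bitmaps["EU"] | bitmaps["US"]))    # OR of two bitmaps answers region IN (EU, US)
```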

82
Q

Join Indexes

A

Pre-computed joins. A join index between a fact table and a dimension table correlates a dimension tuple with the fact tuples that have the same value on the common dimensional attribute. E.g. a join index on the city dimension of a calls fact table correlates, for each city, the calls made from that city. Join indexes can also span multiple dimension tables, e.g. a join index on the city and time dimensions of the fact table.
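
A minimal join-index sketch in Python: the index pre-computes, for each city key, which rows of the calls fact table join to it; the city names and fact rows are invented.

```python
from collections import defaultdict

# Dimension and fact data (invented for illustration).
city_dim = {10: "Valletta", 20: "Mdina"}
calls_fact = [  # (row id, city_key, duration)
    (0, 10, 120), (1, 20, 45), (2, 10, 300),
]

# Join index: dimension key -> fact row ids sharing the same city_key.
join_index = defaultdict(list)
for row_id, city_key, _duration in calls_fact:
    join_index[city_key].append(row_id)

# Answering "calls from Valletta" uses the pre-computed index instead of a join scan.
valletta_key = 10
print(join_index[valletta_key])   # -> [0, 2]
```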

83
Q

Volume Analysis

A

Estimates of average and max values for object sizes. Number of instances per entity. Realistic storage requirements (day one and growth over time). Headings: Row max, Row Avg, Record size, table max, table avg.

84
Q

Usage Analysis (Internal Constraints)

A

Identify major transactions to hit each table and their frequency.
Insertions/updates/retrievals/deletes, or a mixture of all four.
Categorize volatile tables (transaction tables) vs non-volatile tables (reference tables), e.g. by turnover/population over a year.

85
Q

Transactional Analysis (Other Constraints)

A

Documenting critical transactions to hit DB.
Impact on systems resources.
Objects accessed and type of access.
Any constraints placed on processing (ex. time)

86
Q

Access Requirements

A

Type of access + type of usage = headache, e.g. indexes both increase and decrease performance. Time sharing or avoidance (forbidden queries).

87
Q

Integrity Analysis

A

Traditional DB territory:
Entity Constraints (primary key: no duplicates)
Referential Integrity (foreign key reference is not null and valid)
Domain Constraints (e.g. credit limit is numeric with accepted values 1000, 2000, 5000)
Enforcing these is a big hit on performance, which cascades upwards.

88
Q

Security Analysis

A

Outside the frame of the DB are physical access to the building, logon procedures, etc. It is difficult to contemplate DB security on its own, e.g. at the operating-system level. Large-scale DBMSs provide specific utilities (preventing unauthorized access to data and preventing misuse of DBMS functions). Use usage analysis to derive user profiles with associated access rights.

89
Q

Data Warehouse and Security

A

A data warehouse by nature is an open, accessible system
The aim of a data warehouse is to make large amounts of data easily accessible to users, this enables them to extract information about the business as a whole
Any security restrictions can be seen as obstacles to that goal
They can become constraints on the design of the warehouse

90
Q

Security Requirements

A

Describe all security conditions that have to be considered in a data warehouse environment. The first step for this is to classify the security objects and security subjects of the environment. Security objects can be classified by sensitivity (public, confidential) or job function (accounting, personnel).

91
Q

Legal Requirements

A

Which arrangements have to be made to be allowed to hold legally sensitive data? Which data is subject to legal restrictions? Which separate handling does this data require concerning storage, access and maintenance? Which analysis may be performed on this data? If data is held online for trend analysis in summarized form, do legal restrictions apply?

92
Q

Audit Requirements

A

Security audit information is the basis for further reviews and examinations; these are done to test the adequacy of system controls and to recommend changes to the security policy.
The following activities are of interest for security audits: connections, disconnections, access to data, change of data, deletion of data.

93
Q

Network Requirements

A

An important part of the security requirements concerns the transfer of data from the source system (usually an operational system) into the data warehouse. The data is mostly transmitted over a network, so precautions must be taken to maintain its confidentiality and integrity.

94
Q

Security Model Requirements

A

Data Filtration and encryption prior to data warehouse.
Classical Security Model
Hybrid
View Based

95
Q

Classic Security Model

A

Covers all stages, from requirement analysis to the physical schema. During this process flow, security measures are applied so that the proper and secure data warehouse schema can be developed. This may be fruitful if implemented in transactional databases, but as far as data warehouses are concerned it may be unsuitable.

96
Q

Password Management

A

Accounts get locked after failed attempts
Password history so passwords cannot be reused
Characteristics of user’s accounts are set in profile (Password lifetime, number of failed attempts, frequency of changes, minimum length).

97
Q

Standard Roles

A

Before the users are active, we need to assign roles.
CONNECT: access to DB and any other granted rights
RESOURCE: creation of objects
DBA: unlimited space quotas and the ability to grant privileges to other users.

98
Q

Grant Role

A

CONNECT and RESOURCE don’t automatically give access to other people’s resources.
Need to grant access.
Object privilege = Select, Insert, Update, Delete
Can be granted on certain columns and not on others

99
Q

Security Layers

A

There is an old debate regarding the best level at which to implement security. Databases contain data of increasing value. They are increasingly open to the outside.
Possible layers of security: By user, By table (using policies), At row level

100
Q

Three Aspects of Security

A

Confidentiality: Information is not disclosed to unauthorized users
Integrity: Modifications of data must be legitimate
Availability: Authorized users should not be refused access (may happen by mistake)

101
Q

Costs of Security

A

Significant performance impact
Complexity due to multiple levels
Possible errors from conflicting policies
Combinations of measures may also have unpredictable effects on users' queries
May conflict with constraints set across tables (such as referential integrity)