data warehousing Flashcards

Question 1

Q

data warehouse v database

Answer

A

it is a database but there is no frequent updating, it houses historical data, usually kept separate from the organisations operational database, helps in the integration of a diversity of applications, can use the data to help make stategic decisions

Question 2

Q

data warehousing features

Answer

A

subject oriented - provides information around a subject rather than ongoing operations. focuses on modelling and analysis of data.
integrated - constructed by integrating data from many different sources such as relatoinal databases or flat files.
time-variant - the data collected is identified with a particular time period
non-volatile - previous data is not erased when new data is added

Question 3

Q

types of data warehouse

Answer

A

information processing- allows processing of data stored in it. can be processed by means of queryong, basic statistical analysis, reporting using crosstabs, tables, charts or graphs.
analytical processing - supports analytical processing of
the information stored in it by means of OLAP operations
– slice and dice, drill down, drill up and pivoting
* Data mining – supports knowledge discovery by finding
hidden patterns and associations, constructing analytical
models, performing classification and prediction. Results
can be presented using visualisation tools.

Question 4

Q

functions of data warehouse tools

Answer

A

data extraction - gather data from many different sources
data cleaning - finding and correcting errors in data
data transformation - convert the data to warehouse format
data loading - sorting, summarising, consolidating, checking integrity and building indices and patterns
refreshing - updating data sources to warehouses

Question 5

Q

process flow in a data warehouse

Answer

A

extract and load the data, clean and transform the data, backup and archive the data, manage queries and direct them to the appropriate data sources

Question 6

Q

ETL vs ELT

Answer

A

Extract Transform Load –
extract data from source
systems, convert the data
to a structure suitable for
querying and analysis and
load data into data
warehouse
uses a separate ‘staging’
database and applies a
series of rules or functions
to the data before loading
Extract Load Transform –
different approach to
loading data
Takes data from various
sources and loads it
directly into the target
system
The system then
transforms the data on
demand to enable
analysis

Question 7

Q

data warehouse

Answer

A

Use a dimensional data model
* Data organised into dimension tables and fact
tables using star and snowflake schemas.
* Traditional OLTP databases automate transactional
operations – strive to eliminate data duplication
(normalisation)
* For a data warehouse - need denormalisation or
grouping of data
* Increases query performance

Question 8

Q

metadata

Answer

A

Data about data e.g. the index of a book is
metadata for the contents of a book
Summarised data that leads us to detailed data
Roadmap to the data warehouse
Defines the warehouse objects
Acts as a directory – helps to locate the contents of
a data warehouse

Question 9

Q

designing a data warehouse

Answer

A

Two main approaches
(1) Entity relationship modelling and normalisation
- Entities and relationships
- Normalisation 3NF, 4NF
(B. Inmon Approach)

Question 10

Q

why normalise data

Answer

A

Normalisation is a process for converting complex
data structures into simple and stable data structures
* Models are
* robust and stable
* have minimal redundancy
* 3NF
* No repeating groups
* No partial dependencies
* No transitive dependencies
* Avoid anomalies

Question 11

Q

dimensional modelling

Answer

A

Purpose – structure data for easy and efficient retrieval
and analysis
* ER modelling creates a single model of data required to
support the organisation’s processes
* Whereas Dimensional Modelling allows for different
individual models of interest so we can make decisions
* Model for sales
* Model for inventory
* Denormalised data structure – fewer tables, better
performance

Components of dimensional model:
* Fact Tables – contain measurements of business e.g.
sales, purchase orders, shipment (normalised)
* Can be quite large (50 billion records, 1-5 terabytes)
* Dimension Tables – store the descriptions of the
dimensions of the business e.g. product, customer,
vendor, store.
* Smaller than Fact Tables (denormalised)

Question 12

Q

denormalisation

Answer

A

The more normalised your data is, the more complex the
queries needed to read the data because a single query
combines data from many tables. This puts a huge strain
on computing resources.
The data in a data warehouse does not need to be
organized for quick transactions. Therefore, data
warehouses normally use a denormalised data structure.
A denormalised data structure uses fewer tables because
it groups data and doesn’t exclude data redundancies.
Denormalisation offers better performance when reading
data for analytical purposes.

Question 13

Q

kimball vs inmon

Answer

A

Two approaches to data warehouse design – Bill
Inmon and Ralph Kimball
* Conflicting opinions on how data warehouses
should be structured – gives rise to two schools of
thought.
* Inmon – “Top-Down” Design
* Kimball – “Bottom Up” Design

Bill Inmon
Top-Down Approach
Provides a definite and consistent view of information as information from the data warehouse is used to create Data Marts
Strong model and hence preferred by big companies
Time, Cost and Maintenance is high
Enterprise Data Warehouse

Ralph Kimball
Bottom-Up Approach
Reports can be generated easily as Data marts are created first and it is relatively easy to interact with Data Marts.
Not as strong but data warehouse can be extended and the number of data marts can be created
Time, Cost and Maintenance are low.
Dimensional Data Warehouse

Question 14

Q

types of fact tables

Answer

A

Three kinds
* Transaction Fact Tables – record information related
to events – individual product sales
* Snapshot Fact Table – record information that
applies to specific moments in time – year-end
accounts
* Accumulating snapshot tables – record information
to a running tally of data – year to year sales figures

Question 15

Q

dimension tables

Answer

A

Time dimension – time, date, month, year
* Geography dimension – address/location information
* Product dimension – products, product
numbers
* Customer dimension – Customer name,
numbers, contact information, address
* Range dimension – range of values for
time, price and other quantities

Question 16

Q

how dimension tables and fact tables work together

Answer

A

Dimension tables list surrogate primary key –
consists of a single column integer related to the
natural primary key e.g. you can assign an ID to
each store (in a dimension table of stores) and its
row of information e.g. name, size, location etc.
* Wherever you list the Store ID number on the Fact
Table, it will map to that specific row of store data
on the dimension table

Question 17

Q

where the data is kept

Answer

A

Data warehouses don’t replace or alter the data in
the original sources, e.g. the RDBMS in daily use
* The data warehouse stores another copy of the data
and is updated periodically from the varying sources
OR it integrates data as needed from the sources
* Data warehouses shouldn’t be used to replace a
database (of any architecture)

Question 18

Q

business intelligence software

Answer

A

Other examples include:
* Data cleansing – software to try and cleanse dirty
data
* Local information systems – some overlap with
geographic information systems – to report
statistical data with a geographic aim

Not all organisations will use all of the different
types of business intelligence software
* Will be used often by larger organisations though
not exclusively
* Cost reasons – originally, such software would have been
expensive, but now some, like Apache Hive, are free
* Need the volume of data to actually store in it
* Need time to perform the analytics and make decisions
– smaller organisations may be more focused on getting
their product(s) to market

Question 19

Q

data warehouse real-world example

Answer

A

Apache Hive: https://hive.apache.org/
* Built on top of various other Apache products including
Hadoop which is designed to run queries over a
distributed data store comprised of many computers
(and is a big data application)
* Features a relational database of metadata at its core
* The website is very detailed for anyone wanting more
info but remember it’s probably not something to try at
home as you probably don’t have that much data

Question 20

Q

data warehouse organisation

Answer

A

Last time we saw two approaches, Inmon and
Kimball
* Inmon’s approach was similar to E-R modelling
we’ve looked at before
* Kimball’s dimensional modelling approach features
a different type of organisation – star schemas or
snowflake schemas

Question 21

Q

denormalisation in star schemas

Answer

A

Goal is to speed up read queries and analysis
* Traditional normalisation removes duplicate copies of the
same data
* made write commands faster
* Many dimensions makes read and analyse slower
* Star Schema pulls the fact data from the dimension tables,
duplicates the information and stores it in the fact table.
* The fact table connects all the information sources
together.
* Slows down write commands

Question 22

Q

benefits of star schemas

Answer

A

Queries are simpler - all
the data connects
through the fact table
the multiple dimension
tables are treated as
one large table of
information
* Easier business insights
– simplifies pulling
business reports
* Better performing
queries – removes
bottlenecks of
normalised schemas,
faster queries, read
only performance
better
* Can use Star Schemas
to build OLAP cubes

Question 23

Q

challengers of star schemas

Answer

A

Decreased data
integrity – due to
denormalisation
* Less capable of
handling complex
queries
* No Many to Many
relationships – simple
dimension schema does
not work well for this.

Question 24

Q

the snowflake schema

Answer

A

The purpose of a snowflake schema is to normalise the data in a star schema. This helps to
solve slow down of write commands
* It is a multi-dimensional structure with fact tables at the core that connect the information
found in dimension tables. The difference here is that the dimension tables in the snowflake
schema divide themselves into more than one table – creating the snowflake pattern
* The snowflake schema splits the fact table into a series of normalised dimension tables.
Normalising creates more dimension tables with multiple joins and reduces data integrity
issues. However, querying is more challenging using this schema because queries need to dig
deeper to access the relevant data.
* In the snowflake schema, dimensions are normalised into multiple related tables, whereas
the star schema’s dimensions are denormalized with each dimension represented by a single
table. A complex snowflake shape emerges when the dimensions of a snowflake schema are
elaborate, having multiple levels of relationships, and the child tables have multiple parent
tables
* Normalising attributes results in storage savings, the trade-off being additional complexity in
source query joins

Question 25

Q

OLAP servers

Answer

A

OLAP servers are based on a multi-dimensional
view of data (data cubes)
* So discuss OLAP operations in multi-dimensional
data
* Roll-up
* Drill-down
* Slice and dice
* Pivot (rotate)

Question 26

Q

OLAP servers roll-up

Answer

A

Performs aggregation on
a data cube by:
* Climbing up a concept
hierarchy for a dimension
* Dimension reduction –
data grouped by city
rather than country in
this case
* One or more dimensions
of the data cube are
removed

Question 27

Q

OLAP servers drill-down

Answer

A

Reverse of roll-up.
* Two ways:
* Stepping down a concept
hierarchy for a dimension
* Introducing a new
dimension (month here)
* One or more dimensions
are added to data cube
* Go from less detailed
data to highly detailed
data

Question 28

Q

OLAP - slice

Answer

A

Selects one particular
dimension from a cube
and provides a new
sub-cube
* Slice here – time Q1

Question 29

Q

OLAP - dice

Answer

A

Selects two or more
dimensions and
provides a new sub-
cube
* Here involves location,
time and item.

Question 30

Q

OLAP - pivot

Answer

A

Rotate data axes to
provide alternative
presentation of data
* So here looking at item
types on the left and
cities on the bottom