data warehousing Flashcards

1
Q

data warehouse v database

A

it is a database but there is no frequent updating, it houses historical data, usually kept separate from the organisations operational database, helps in the integration of a diversity of applications, can use the data to help make stategic decisions

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

data warehousing features

A

subject oriented - provides information around a subject rather than ongoing operations. focuses on modelling and analysis of data.
integrated - constructed by integrating data from many different sources such as relatoinal databases or flat files.
time-variant - the data collected is identified with a particular time period
non-volatile - previous data is not erased when new data is added

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

types of data warehouse

A

information processing- allows processing of data stored in it. can be processed by means of queryong, basic statistical analysis, reporting using crosstabs, tables, charts or graphs.
analytical processing - supports analytical processing of
the information stored in it by means of OLAP operations
– slice and dice, drill down, drill up and pivoting
* Data mining – supports knowledge discovery by finding
hidden patterns and associations, constructing analytical
models, performing classification and prediction. Results
can be presented using visualisation tools.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

functions of data warehouse tools

A

data extraction - gather data from many different sources
data cleaning - finding and correcting errors in data
data transformation - convert the data to warehouse format
data loading - sorting, summarising, consolidating, checking integrity and building indices and patterns
refreshing - updating data sources to warehouses

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

process flow in a data warehouse

A

extract and load the data, clean and transform the data, backup and archive the data, manage queries and direct them to the appropriate data sources

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

ETL vs ELT

A
  • Extract Transform Load –
    extract data from source
    systems, convert the data
    to a structure suitable for
    querying and analysis and
    load data into data
    warehouse
  • uses a separate ‘staging’
    database and applies a
    series of rules or functions
    to the data before loading
  • Extract Load Transform –
    different approach to
    loading data
  • Takes data from various
    sources and loads it
    directly into the target
    system
  • The system then
    transforms the data on
    demand to enable
    analysis
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

data warehouse

A

Use a dimensional data model
* Data organised into dimension tables and fact
tables using star and snowflake schemas.
* Traditional OLTP databases automate transactional
operations – strive to eliminate data duplication
(normalisation)
* For a data warehouse - need denormalisation or
grouping of data
* Increases query performance

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

metadata

A
  • Data about data e.g. the index of a book is
    metadata for the contents of a book
  • Summarised data that leads us to detailed data
  • Roadmap to the data warehouse
  • Defines the warehouse objects
  • Acts as a directory – helps to locate the contents of
    a data warehouse
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

designing a data warehouse

A

Two main approaches
(1) Entity relationship modelling and normalisation
- Entities and relationships
- Normalisation 3NF, 4NF
(B. Inmon Approach)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

why normalise data

A

Normalisation is a process for converting complex
data structures into simple and stable data structures
* Models are
* robust and stable
* have minimal redundancy
* 3NF
* No repeating groups
* No partial dependencies
* No transitive dependencies
* Avoid anomalies

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

dimensional modelling

A

Purpose – structure data for easy and efficient retrieval
and analysis
* ER modelling creates a single model of data required to
support the organisation’s processes
* Whereas Dimensional Modelling allows for different
individual models of interest so we can make decisions
* Model for sales
* Model for inventory
* Denormalised data structure – fewer tables, better
performance

Components of dimensional model:
* Fact Tables – contain measurements of business e.g.
sales, purchase orders, shipment (normalised)
* Can be quite large (50 billion records, 1-5 terabytes)
* Dimension Tables – store the descriptions of the
dimensions of the business e.g. product, customer,
vendor, store.
* Smaller than Fact Tables (denormalised)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

denormalisation

A
  • The more normalised your data is, the more complex the
    queries needed to read the data because a single query
    combines data from many tables. This puts a huge strain
    on computing resources.
  • The data in a data warehouse does not need to be
    organized for quick transactions. Therefore, data
    warehouses normally use a denormalised data structure.
  • A denormalised data structure uses fewer tables because
    it groups data and doesn’t exclude data redundancies.
    Denormalisation offers better performance when reading
    data for analytical purposes.
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

kimball vs inmon

A

Two approaches to data warehouse design – Bill
Inmon and Ralph Kimball
* Conflicting opinions on how data warehouses
should be structured – gives rise to two schools of
thought.
* Inmon – “Top-Down” Design
* Kimball – “Bottom Up” Design

Bill Inmon
Top-Down Approach
Provides a definite and consistent view of information as information from the data warehouse is used to create Data Marts
Strong model and hence preferred by big companies
Time, Cost and Maintenance is high
Enterprise Data Warehouse

Ralph Kimball
Bottom-Up Approach
Reports can be generated easily as Data marts are created first and it is relatively easy to interact with Data Marts.
Not as strong but data warehouse can be extended and the number of data marts can be created
Time, Cost and Maintenance are low.
Dimensional Data Warehouse

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

types of fact tables

A

Three kinds
* Transaction Fact Tables – record information related
to events – individual product sales
* Snapshot Fact Table – record information that
applies to specific moments in time – year-end
accounts
* Accumulating snapshot tables – record information
to a running tally of data – year to year sales figures

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

dimension tables

A

Time dimension – time, date, month, year
* Geography dimension – address/location information
* Product dimension – products, product
numbers
* Customer dimension – Customer name,
numbers, contact information, address
* Range dimension – range of values for
time, price and other quantities

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

how dimension tables and fact tables work together

A

Dimension tables list surrogate primary key –
consists of a single column integer related to the
natural primary key e.g. you can assign an ID to
each store (in a dimension table of stores) and its
row of information e.g. name, size, location etc.
* Wherever you list the Store ID number on the Fact
Table, it will map to that specific row of store data
on the dimension table

16
Q

where the data is kept

A

Data warehouses don’t replace or alter the data in
the original sources, e.g. the RDBMS in daily use
* The data warehouse stores another copy of the data
and is updated periodically from the varying sources
OR it integrates data as needed from the sources
* Data warehouses shouldn’t be used to replace a
database (of any architecture)

17
Q

business intelligence software

A

Other examples include:
* Data cleansing – software to try and cleanse dirty
data
* Local information systems – some overlap with
geographic information systems – to report
statistical data with a geographic aim

Not all organisations will use all of the different
types of business intelligence software
* Will be used often by larger organisations though
not exclusively
* Cost reasons – originally, such software would have been
expensive, but now some, like Apache Hive, are free
* Need the volume of data to actually store in it
* Need time to perform the analytics and make decisions
– smaller organisations may be more focused on getting
their product(s) to market

18
Q

data warehouse real-world example

A

Apache Hive: https://hive.apache.org/
* Built on top of various other Apache products including
Hadoop which is designed to run queries over a
distributed data store comprised of many computers
(and is a big data application)
* Features a relational database of metadata at its core
* The website is very detailed for anyone wanting more
info but remember it’s probably not something to try at
home as you probably don’t have that much data

19
Q

data warehouse organisation

A

Last time we saw two approaches, Inmon and
Kimball
* Inmon’s approach was similar to E-R modelling
we’ve looked at before
* Kimball’s dimensional modelling approach features
a different type of organisation – star schemas or
snowflake schemas

20
Q

denormalisation in star schemas

A

Goal is to speed up read queries and analysis
* Traditional normalisation removes duplicate copies of the
same data
* made write commands faster
* Many dimensions makes read and analyse slower
* Star Schema pulls the fact data from the dimension tables,
duplicates the information and stores it in the fact table.
* The fact table connects all the information sources
together.
* Slows down write commands

21
Q

benefits of star schemas

A

Queries are simpler - all
the data connects
through the fact table
the multiple dimension
tables are treated as
one large table of
information
* Easier business insights
– simplifies pulling
business reports
* Better performing
queries – removes
bottlenecks of
normalised schemas,
faster queries, read
only performance
better
* Can use Star Schemas
to build OLAP cubes

22
Q

challengers of star schemas

A

Decreased data
integrity – due to
denormalisation
* Less capable of
handling complex
queries
* No Many to Many
relationships – simple
dimension schema does
not work well for this.

23
Q

the snowflake schema

A

The purpose of a snowflake schema is to normalise the data in a star schema. This helps to
solve slow down of write commands
* It is a multi-dimensional structure with fact tables at the core that connect the information
found in dimension tables. The difference here is that the dimension tables in the snowflake
schema divide themselves into more than one table – creating the snowflake pattern
* The snowflake schema splits the fact table into a series of normalised dimension tables.
Normalising creates more dimension tables with multiple joins and reduces data integrity
issues. However, querying is more challenging using this schema because queries need to dig
deeper to access the relevant data.
* In the snowflake schema, dimensions are normalised into multiple related tables, whereas
the star schema’s dimensions are denormalized with each dimension represented by a single
table. A complex snowflake shape emerges when the dimensions of a snowflake schema are
elaborate, having multiple levels of relationships, and the child tables have multiple parent
tables
* Normalising attributes results in storage savings, the trade-off being additional complexity in
source query joins

24
Q

OLAP servers

A

OLAP servers are based on a multi-dimensional
view of data (data cubes)
* So discuss OLAP operations in multi-dimensional
data
* Roll-up
* Drill-down
* Slice and dice
* Pivot (rotate)

25
Q

OLAP servers roll-up

A

Performs aggregation on
a data cube by:
* Climbing up a concept
hierarchy for a dimension
* Dimension reduction –
data grouped by city
rather than country in
this case
* One or more dimensions
of the data cube are
removed

26
Q

OLAP servers drill-down

A

Reverse of roll-up.
* Two ways:
* Stepping down a concept
hierarchy for a dimension
* Introducing a new
dimension (month here)
* One or more dimensions
are added to data cube
* Go from less detailed
data to highly detailed
data

27
Q

OLAP - slice

A

Selects one particular
dimension from a cube
and provides a new
sub-cube
* Slice here – time Q1

28
Q

OLAP - dice

A

Selects two or more
dimensions and
provides a new sub-
cube
* Here involves location,
time and item.

29
Q

OLAP - pivot

A

Rotate data axes to
provide alternative
presentation of data
* So here looking at item
types on the left and
cities on the bottom