data warehousing Flashcards
data warehouse v database
it is a database but there is no frequent updating, it houses historical data, usually kept separate from the organisations operational database, helps in the integration of a diversity of applications, can use the data to help make stategic decisions
data warehousing features
subject oriented - provides information around a subject rather than ongoing operations. focuses on modelling and analysis of data.
integrated - constructed by integrating data from many different sources such as relatoinal databases or flat files.
time-variant - the data collected is identified with a particular time period
non-volatile - previous data is not erased when new data is added
types of data warehouse
information processing- allows processing of data stored in it. can be processed by means of queryong, basic statistical analysis, reporting using crosstabs, tables, charts or graphs.
analytical processing - supports analytical processing of
the information stored in it by means of OLAP operations
– slice and dice, drill down, drill up and pivoting
* Data mining – supports knowledge discovery by finding
hidden patterns and associations, constructing analytical
models, performing classification and prediction. Results
can be presented using visualisation tools.
functions of data warehouse tools
data extraction - gather data from many different sources
data cleaning - finding and correcting errors in data
data transformation - convert the data to warehouse format
data loading - sorting, summarising, consolidating, checking integrity and building indices and patterns
refreshing - updating data sources to warehouses
process flow in a data warehouse
extract and load the data, clean and transform the data, backup and archive the data, manage queries and direct them to the appropriate data sources
ETL vs ELT
- Extract Transform Load –
extract data from source
systems, convert the data
to a structure suitable for
querying and analysis and
load data into data
warehouse - uses a separate ‘staging’
database and applies a
series of rules or functions
to the data before loading - Extract Load Transform –
different approach to
loading data - Takes data from various
sources and loads it
directly into the target
system - The system then
transforms the data on
demand to enable
analysis
data warehouse
Use a dimensional data model
* Data organised into dimension tables and fact
tables using star and snowflake schemas.
* Traditional OLTP databases automate transactional
operations – strive to eliminate data duplication
(normalisation)
* For a data warehouse - need denormalisation or
grouping of data
* Increases query performance
metadata
- Data about data e.g. the index of a book is
metadata for the contents of a book - Summarised data that leads us to detailed data
- Roadmap to the data warehouse
- Defines the warehouse objects
- Acts as a directory – helps to locate the contents of
a data warehouse
designing a data warehouse
Two main approaches
(1) Entity relationship modelling and normalisation
- Entities and relationships
- Normalisation 3NF, 4NF
(B. Inmon Approach)
why normalise data
Normalisation is a process for converting complex
data structures into simple and stable data structures
* Models are
* robust and stable
* have minimal redundancy
* 3NF
* No repeating groups
* No partial dependencies
* No transitive dependencies
* Avoid anomalies
dimensional modelling
Purpose – structure data for easy and efficient retrieval
and analysis
* ER modelling creates a single model of data required to
support the organisation’s processes
* Whereas Dimensional Modelling allows for different
individual models of interest so we can make decisions
* Model for sales
* Model for inventory
* Denormalised data structure – fewer tables, better
performance
Components of dimensional model:
* Fact Tables – contain measurements of business e.g.
sales, purchase orders, shipment (normalised)
* Can be quite large (50 billion records, 1-5 terabytes)
* Dimension Tables – store the descriptions of the
dimensions of the business e.g. product, customer,
vendor, store.
* Smaller than Fact Tables (denormalised)
denormalisation
- The more normalised your data is, the more complex the
queries needed to read the data because a single query
combines data from many tables. This puts a huge strain
on computing resources. - The data in a data warehouse does not need to be
organized for quick transactions. Therefore, data
warehouses normally use a denormalised data structure. - A denormalised data structure uses fewer tables because
it groups data and doesn’t exclude data redundancies.
Denormalisation offers better performance when reading
data for analytical purposes.
kimball vs inmon
Two approaches to data warehouse design – Bill
Inmon and Ralph Kimball
* Conflicting opinions on how data warehouses
should be structured – gives rise to two schools of
thought.
* Inmon – “Top-Down” Design
* Kimball – “Bottom Up” Design
Bill Inmon
Top-Down Approach
Provides a definite and consistent view of information as information from the data warehouse is used to create Data Marts
Strong model and hence preferred by big companies
Time, Cost and Maintenance is high
Enterprise Data Warehouse
Ralph Kimball
Bottom-Up Approach
Reports can be generated easily as Data marts are created first and it is relatively easy to interact with Data Marts.
Not as strong but data warehouse can be extended and the number of data marts can be created
Time, Cost and Maintenance are low.
Dimensional Data Warehouse
types of fact tables
Three kinds
* Transaction Fact Tables – record information related
to events – individual product sales
* Snapshot Fact Table – record information that
applies to specific moments in time – year-end
accounts
* Accumulating snapshot tables – record information
to a running tally of data – year to year sales figures
dimension tables
Time dimension – time, date, month, year
* Geography dimension – address/location information
* Product dimension – products, product
numbers
* Customer dimension – Customer name,
numbers, contact information, address
* Range dimension – range of values for
time, price and other quantities