Warehousing and BI Flashcards
Why do we use a data warehouse
- standardise from multiple sources
- saves time when building reports & analysis
- allows new ways to report and analyse
- prevent adverse impact on operational systems
- enable comparison without underlying data changing
OLAP
Online analytical Processing (large amounts records and heavy duty queries, small number of knowledge areas)
OLTP
Online transaction processing (lots of data coming from lots of different areas)
Data warehouse
(big warehouse where we store all kinds of information from all of the different source systems in our organization)
subject orientated, integrated, time variant and non-volatile collection of summary and detailed historical data used to support the strategic decision making processes for the corporation.
Once data is in a data warehouse…
it does not change (non-volatile)
How does warehoused data vary from transactional data?
- subject orientated, only a specific subject area
- integrates multiple sources of data
- contains data over time periods, not just most recent data
- data is non-volatile, it will not change once its in the warehouse
Data Mart
Data stored for particular type of analytics, highly processed and easy to navigate.
Cubes
Multi dimensional storage (software e.g., PowerBI, tableau and hardware)
Data lake vs warehouse
lake is broader and nearer real time, often has data pipelined through, limited processing (no cleaning, integration etc).
Data lake data is used by
A data scientist
Active vs classic data warehousing difference
active add operational BI (i.e., closer to real time, operational services etc)
Big Data points of definition
- Viscosity (how difficult to use and integrate)
- Volatility (how often changes are made, how long is data useful)
- Veracity (how trustworthy)
- Variability (the forms that data is stored, inconsistent across data sets)
Veracity
How trustworthy data is
Data warehousing (processes)
- extract processes
- cleansing processes
- transformation processes
- load processes
- associated control processes
- the use of meta data
3 types of datawarehouse
conventional, dimensional virtual
Conventional Data warehouse (Inmon)
source -> transformation -> staging area -> data warehouse (normalised, atomic) -> data marts (normalised) / cubes (denormalised) -> analysis
entire datawarehouse, pull out data marts
Dimensional data warehouse (kimball)
sources -> transformation -> staging area -> data marts (all as one data warehouse) -> cubes
lots of datamarts ready for use, assembled into data warehouse. Denormalised.
Dimensions vs facts
Do analysis on facts, item/thing being descirbed
Do analysis by dimensio, background/context e.g., time
Another word for a dimensional model
star schema
Dimensional modelling is (normalised/denormalised)
denormalised
What makes up a dimensional model
dimension tables (nouns)
fact tables (events/measures)
Hierarchies for dimensions are…
stored in the dimensional table itself so there is no need for a separate hierachical table.
What is hierarchical data
Data where items are linked to each other in parent-child relationships e.g., animals, mammals, primates
Additive facts
facts that can be added up across multiple dimensions without loosing value
e.g., number of units sold across product and customer etc