02 Data Warehouse Architecture, DWH Tool overview Flashcards
What is the definition of a Data Warehouse?
- A DWH is a subject-oriented, integrated, nonvolatile, and time-variant collection of data in support of management decisions (Inmon).
- Subject-oriented: organized around analytical requirements, in contrast to application-oriented operational databases.
- Integrated: joins data from multiple sources, resolving differences in data formats.
- Nonvolatile: data is durable; once loaded it is not modified, and the scope of data only expands.
- Time-variant: older data can still be retrieved; every piece of information is paired with a point in time.
- A DWH is a copy of transaction data specifically structured for query and analysis (Kimball).
Graphically explain the (7) main DWH Architecture components:
- Analysis and visualization: dashboards, standard reports, spreadsheets, ad-hoc queries, data mining, planning.
- Data Marts
- Web-based Access
- DWH Extension: MapReduce, Hadoop, HDFS (Hadoop Distributed File System)
- DWH Core: metadata, data basis, OLAP Server
- Data Migration (ETL): selection, extraction, cleaning, transformation, load/update
- Operational Data Sources: internal data sources, external data sources (Streams, HDFS, Logs).
From the DWH Architecture, classify and mention relevant examples for the Operational Data Sources:
Operational Data Sources can be divided into:
-
Internal Data Sources:
- ERP systems
- CRM systems
- BPM systems
- ODS (operational data stores)
- Master data management
- Other DWHs
- Flat files (CSVs, text files, others)
-
External Data Sources:
- Streams
- HDFS (Hadoop Distributed File System)
- Web feeds
- Flat files (CSVs, text files, others)
From the DWH Architecture, describe the steps and mention some examples for the Data Migration (ETL):
-
ETL steps (see the sketch below):
- Selection (of sources)
- Extraction (of the actual data for further processing)
- Cleansing (data unification rules, standardized values, validation)
- Transformation (converting data to the same units, joining data, aggregates, sortings, calculations…)
- Loading/Updating (correct full loads)
- About 70% of DWH development hours go to the ETL process
- Scenarios: data aggregation, currency conversion, errors, deduplication, batch processing
- Metadata generation/repository:
- Technical (system, source data, source frequency)
- Business (display name, content descriptions)
- Process (task logs, disk reads, query time, rows processed)
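- A minimal ETL sketch in Python, assuming a CSV flat file as the source, an illustrative `EUR_RATES` conversion table, and a plain dict as the warehouse target (all hypothetical, for illustration only):

```python
import csv
import io

# Hypothetical source: a flat file (CSV) chosen in the selection step.
RAW_CSV = """customer,country,amount,currency
alice,DE,100.0,EUR
bob,US,250.0,USD
alice,DE,-5.0,EUR
"""

EUR_RATES = {"EUR": 1.0, "USD": 0.9}  # assumed conversion rates

def extract(raw):
    """Extraction: pull the actual rows out of the source."""
    return list(csv.DictReader(io.StringIO(raw)))

def cleanse(rows):
    """Cleansing: apply a unification rule, drop invalid records."""
    return [r for r in rows if float(r["amount"]) > 0]

def transform(rows):
    """Transformation: convert to the same unit (EUR) and aggregate."""
    totals = {}
    for r in rows:
        eur = float(r["amount"]) * EUR_RATES[r["currency"]]
        totals[r["country"]] = totals.get(r["country"], 0.0) + eur
    return totals

def load(totals, warehouse):
    """Loading/updating: write the aggregates into the target."""
    warehouse.update(totals)

warehouse = {}
load(transform(cleanse(extract(RAW_CSV))), warehouse)
print(warehouse)  # {'DE': 100.0, 'US': 225.0}
```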
What are some motives for real-time ETL? Also explain some of its trade-offs.
- Users demand higher levels of freshness (relevant for decision making, BI)
- Globalization of the economy (operating in several time zones, night-time update windows disappear)
- New types of data sources (live web feeds, stock market prices, more real time info available)
- Affordable technical realizations (better loading performance thanks to faster technologies => columnar storage, in-memory computing)
- It also has trade-offs:
- Possibly incomplete records (replacing batch files with reads from message queues or transaction logs; see the sketch below)
- Restricting data quality screening to some columns
- Allowing new/early facts to be posted with old dimensions (outdated master data records, danger of missing referential integrity)
- Data staging is eliminated
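- A sketch of the queue-based pattern, using Python's `queue` module as a stand-in for a real message broker; the fact arriving before its dimension record illustrates the referential-integrity risk mentioned above (all names and data are invented):

```python
import queue

# Stand-in for a message broker or transaction log feeding real-time ETL.
events = queue.Queue()
events.put({"type": "fact", "product_id": 42, "qty": 3})   # early fact
events.put({"type": "dim",  "product_id": 42, "name": "Widget"})

dimensions = {}   # master data records
facts = []        # fact rows, possibly posted against old/missing dimensions

while not events.empty():
    msg = events.get()
    if msg["type"] == "dim":
        dimensions[msg["product_id"]] = msg["name"]
    else:
        # Real-time trade-off: the fact is loaded even though its dimension
        # record has not arrived yet (danger of missing referential integrity).
        if msg["product_id"] not in dimensions:
            print("warning: fact references unknown product", msg["product_id"])
        facts.append(msg)

print(facts, dimensions)
```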
Mention some reasons to separate OLTP (transactions) and OLAP (analytical, DWH):
-
Response times:
- Bad query performance on operational DBs
- Read-oriented data storage in DWHs
-
Long-term data storage:
- Analyzing data over a period of time
-
Independent data access:
- No interference of operational systems
-
Standardization of data formats, data harmonization:
- Enhancing data quality
Mention some key characteristics of OLTP:
- OLTP = On Line Transaction Processing
- Application: Operational, ERP, CRM, legacy Apps.
- Typical users: Staff
- Time horizon: Weeks, Months
- Refresh: Immediate
- Data model: Entity-Relationship
- Schema: Normalized
- Emphasis: Update (constant transactions)
Mention some key characteristics of OLAP:
- OLAP = On Line Analytic Processing
- Application: MIS, DSS, DWH
- Typical users: Managers, Executives
- Time horizon: Years
- Refresh: Periodic
- Data model: Multi-dimensional
- Schema: Star (sketched below)
- Emphasis: Retrieval (read)
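- A minimal illustration of the star schema and the read emphasis, using Python's built-in `sqlite3` (the table and column names are made up for the example):

```python
import sqlite3

# A minimal star schema: one fact table plus one dimension table.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE fact_sales  (product_id INTEGER, amount REAL);
INSERT INTO dim_product VALUES (1, 'Books'), (2, 'Games');
INSERT INTO fact_sales  VALUES (1, 10.0), (1, 5.0), (2, 30.0);
""")

# OLAP emphasis: retrieval (read) over a star join, aggregated per category.
for row in con.execute("""
    SELECT d.category, SUM(f.amount)
    FROM fact_sales f JOIN dim_product d USING (product_id)
    GROUP BY d.category
"""):
    print(row)   # ('Books', 15.0), ('Games', 30.0)
```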
Mention the (3) elements of the DWH Core Layer:
- Metadata
- DWH Data Basis
- OLAP Server
Mention the elements of the multidimensional model (data cubes):
- Dimensions (general axis names: products, time, location)
- Reference objects (names on the dimensions axis: México, USA, Canada)
- Facts (stored content on each individual X,Y,Z coordinate)
- Ratios (measures derived or computed from the facts)
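- A toy data cube in Python showing the same elements: dimension names, reference objects as coordinates, facts as the stored values, and a simple roll-up aggregation (all data invented for illustration):

```python
# Dimensions with their reference objects (axis names and axis labels).
dimensions = {
    "product":  ["Laptop", "Phone"],
    "time":     ["2023", "2024"],
    "location": ["México", "USA", "Canada"],
}

# Facts: stored content at each (product, time, location) coordinate.
cube = {
    ("Laptop", "2023", "USA"):    120,
    ("Laptop", "2024", "USA"):    150,
    ("Phone",  "2024", "México"):  80,
}

# A simple roll-up: aggregate the facts along the location dimension.
by_product_year = {}
for (product, year, _location), units in cube.items():
    by_product_year[(product, year)] = by_product_year.get((product, year), 0) + units

print(by_product_year)
```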
What are Data Marts?
- Data marts are specialized DWHs targeted to a particular functional area or user group.
- Can be derived from an enterprise DWH or collected directly from data sources.
- Easier to build than an enterprise DWH.
- Why a data mart approach?
- Semantic partitioning
- Archiving
- Load balancing
- Data protection
What are the (2) main data mart design approaches?
-
Top-down design
- User requirements are merged before the design process begins; an enterprise-wide DWH is created first.
- Separate data marts exist for each business area or process.
- Data marts are closer to reporting.
-
Bottom-up design
- Separate schemas are designed for each data mart based on the requirements of its business area.
- Later, these schemas are merged into a global schema for the DWH.
Mention some key differences between Traditional BI and Data Discovery:
-
Traditional BI:
- IT department
- Mega vendors, large independent vendors
- Top-down/Central approach
- Interface based on Dashboards/KPIs
- Reporting as main use
- Query on existing repositories (data storage)
- Deployed by Consultants
-
Data Discovery:
- Business-users oriented
- Small, fast-growing independent vendors
- Bottom-up/Decentralized approach
- Interface based on Visualizations
- Analysis as main use
- Dedicated repository (data storage)
- Deployed by Users
What are some drawbacks of data mart approaches?
- No reconcilability of data (the single point of truth is lost)
- Reconciling data discrepancies can be difficult
- A growing number of extracts from the DWH is needed
- Change propagation (changing one data mart may affect others) -> chances of error quickly grow
- Non-extensibility (it might be necessary to start from scratch after big organizational changes)
Why can MapReduce be used to extend a DWH?
- Traditional and NoSQL DBs are unfit for analyzing very large quantities of data.
- Having a MapReduce component allows for Big Data analytics tasks.
- MapReduce can run complex analytical tasks in parallel.
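- A minimal MapReduce sketch in plain Python: map emits key/value pairs, the shuffle groups them by key, and reduce aggregates per key (the sales data is illustrative; a real Hadoop job would distribute these phases across nodes):

```python
from collections import defaultdict
from itertools import chain

# Input split across "nodes": sales records as (country, amount) pairs.
splits = [
    [("DE", 100.0), ("US", 250.0)],
    [("DE", 40.0), ("US", 10.0), ("DE", 1.0)],
]

def map_phase(record):
    # Map: emit (key, value) pairs; runs independently per record/split.
    country, amount = record
    yield country, amount

def reduce_phase(key, values):
    # Reduce: aggregate all values that were shuffled to the same key.
    return key, sum(values)

# Shuffle: group mapper output by key (done by the framework in Hadoop).
grouped = defaultdict(list)
for key, value in chain.from_iterable(map_phase(r) for split in splits for r in split):
    grouped[key].append(value)

print([reduce_phase(k, vs) for k, vs in grouped.items()])
# [('DE', 141.0), ('US', 260.0)]
```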