02 Data Warehouse Architecture, DWH Tool overview Flashcards
What is the definition of a Data Warehouse?
- A DWH is a collection of subject-oriented, integrated, nonvolatile and time-varying data to support management decisions.
- Subject-oriented: organized around analytical requirements, in contrast to operational (application-oriented) databases.
- Integrated: joining data from multiple sources, solving differences in data format.
- Nonvolatile: data is durable; modification is disabled, the scope of the data only expands.
- Time-varying: information is paired with time, so older data can be retrieved.
- A DWH is a copy of transaction data specifically structured for query and analysis.
Graphically explain the (7) main DWH Architecture components:
- Analysis and visualization: dashboards, standard reports, spreadsheets, ad-hoc queries, data mining, planning.
- Data Marts
- Web-based Access
- DWH Extension: Map-reduce, Hadoop, HDFS (Hadoop Distributed File System)
- DWH Core: metadata, data basis, OLAP Server
- Data Migration (ETL): selection, extraction, cleaning, transformation, load/update
- Operational Data Sources: internal data sources, external data sources (Streams, HDFS, Logs).
From the DWH Architecture, classify and mention relevant examples for the Operational Data Sources:
Operational Data Sources can be divided into:
Internal Data Sources:
- ERP systems
- CRM systems
- BPM systems
- ODS (operational data stores)
- Master data management
- Other DWHs
- Flat files (CSVs, text files, others)
External Data Sources:
- Streams
- HDFS (Hadoop Distributed File System)
- Web feeds
- Flat files (CSVs, text files, others)
From the DWH Architecture, describe the steps and mention some examples for the Data Migration (ETL):
ETL steps (see the sketch after this answer):
- Selection (of sources)
- Extraction (of the actual data for further processing)
- Cleansing (data unification rules, standardized values, validation)
- Transformation (converting data to the same units, joining data, aggregates, sorting, calculations…)
- Loading/Updating (correct full loads)
- About 70% of DWH dev hours go to the ETL process
- Scenarios: data aggregation, currency conversion, errors, deduplication, batch processing
- Metadata generation/repository:
- Technical (system, source data, source frequency)
- Business (display name, content descriptions)
- Process (task logs, disk reads, query time, rows processed)
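A minimal sketch of the five ETL steps in Python; the CSV layout, table names, and exchange rate are invented for illustration, not from the flashcards:

```python
import csv
import io
import sqlite3

# 1. Selection: pick the sources to process (inline CSV stands in for real files).
sources = {
    "sales_eu": "country,currency,amount\nde,EUR,100\nfr,EUR,\n",
    "sales_us": "country,currency,amount\nus,USD,250\n",
}

def extract(raw):
    # 2. Extraction: read the raw records for further processing.
    return csv.DictReader(io.StringIO(raw))

def clean(row):
    # 3. Cleansing: standardize values, validate, drop bad records.
    row["country"] = row["country"].strip().upper()
    return row if row["amount"] else None

def transform(row, eur_to_usd=1.1):
    # 4. Transformation: convert everything to the same unit (USD here).
    amount = float(row["amount"])
    if row["currency"] == "EUR":
        amount *= eur_to_usd
    return (row["country"], round(amount, 2))

def load(rows, con):
    # 5. Loading/Updating: append to the warehouse fact table.
    con.executemany("INSERT INTO sales_fact VALUES (?, ?)", rows)
    con.commit()

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales_fact (country TEXT, amount_usd REAL)")
for raw in sources.values():
    load([transform(r) for r in map(clean, extract(raw)) if r], con)
print(con.execute("SELECT * FROM sales_fact").fetchall())
```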
What are some motives for real-time ETL? Also explain some of its tradeoffs.
- Users demand higher levels of freshness (relevant for decision making, BI)
- Globalization of the economy (operating in several time zones, night-time update windows disappear)
- New types of data sources (live web feeds, stock market prices, more real time info available)
- Affordable technical realizations (better loading performance thanks to faster technologies => column stores, in-memory computing)
- Also has Tradeoffs:
- Possible incomplete records (replacing batch files with reads from message queues or transaction logs; sketched below)
- Restricting data quality screening to some columns
- Allowing new/early facts to be posted with old dimensions (outdated master data records, danger of missing referential integrity)
- Data staging is eliminated
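A toy sketch of the message-queue idea, using Python's stdlib queue in place of a real broker; the event layout, batch size, and sentinel are assumptions for the demo, not from the flashcards:

```python
import queue
import threading
import time

events = queue.Queue()  # stands in for a real message broker

def producer():
    # The source system emits events continuously instead of a nightly batch file.
    for i in range(5):
        events.put({"order_id": i, "amount": 10.0 * i})
        time.sleep(0.1)
    events.put(None)  # sentinel: end of stream (only for this demo)

def micro_batch_loader(batch_size=2):
    # Consume events as they arrive and load small batches right away;
    # note there is no staging area and the final batch may be incomplete.
    batch = []
    while True:
        msg = events.get()
        if msg is None:
            break
        batch.append(msg)
        if len(batch) >= batch_size:
            print("loading micro-batch:", batch)
            batch.clear()
    if batch:
        print("loading final (possibly incomplete) batch:", batch)

threading.Thread(target=producer).start()
micro_batch_loader()
```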
Mention some reasons to separate OLTP (transactions) and OLAP (analytical, DWH):
Response times:
- Bad query performance on operational DBs
- Read-oriented data storage in DWHs
Long-term data storage:
- Analyzing data over a period of time
Independent data access:
- No interference with operational systems
Standardization of data formats, data harmonization:
- Enhancing data quality
Mention some key characteristics of OLTP:
- OLTP = Online Transaction Processing
- Application: Operational, ERP, CRM, legacy Apps.
- Typical users: Staff
- Time horizon: Weeks, Months
- Refresh: Immediate
- Data model: Entity-Relationship
- Schema: Normalized
- Emphasis: Update (constant transactions)
Mention some key characteristics of OLAP:
- OLAP = Online Analytical Processing
- Application: MIS, DSS, DWH
- Typical users: Managers, Executives
- Time horizon: Years
- Refresh: Periodic
- Data model: Multi-dimensional
- Schema: Star (see the star-schema query sketch after this list)
- Emphasis: Retrieval (read)
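To make the Data model/Schema rows concrete, a minimal star-schema aggregation using Python's sqlite3; all table and column names are invented for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Star schema: one fact table referencing denormalized dimension tables.
CREATE TABLE dim_time    (time_id INTEGER PRIMARY KEY, year INTEGER);
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE fact_sales  (time_id INTEGER, product_id INTEGER, revenue REAL);
INSERT INTO dim_time    VALUES (1, 2023), (2, 2024);
INSERT INTO dim_product VALUES (1, 'Books'), (2, 'Games');
INSERT INTO fact_sales  VALUES (1, 1, 100), (1, 2, 50), (2, 1, 80);
""")

# Typical OLAP emphasis: read-heavy aggregation across the star schema.
for row in con.execute("""
    SELECT t.year, p.category, SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_time t    ON f.time_id = t.time_id
    JOIN dim_product p ON f.product_id = p.product_id
    GROUP BY t.year, p.category
"""):
    print(row)
```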
Mention the (3) elements of the DWH Core Layer:
- Metadata
- DWH Data Basis
- OLAP Server
Mention the elements of the multidimensional model (data cubes):
- Dimensions (general axis names: products, time, location)
- Reference objects (names on a dimension's axis: México, USA, Canada)
- Facts (stored content at each individual X,Y,Z coordinate)
- Ratios (measures computed from facts; see the cube sketch below)
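A tiny illustration of these elements using a Python dict as the cube; the values are made up:

```python
# Dimensions: product, time, location; reference objects are their members.
cube = {
    # (product, year, country) -> fact (units sold)
    ("Laptop", 2023, "Mexico"): 120,
    ("Laptop", 2023, "USA"):    340,
    ("Laptop", 2024, "Canada"):  90,
    ("Phone",  2024, "USA"):    510,
}

# A fact is the value stored at one (x, y, z) coordinate:
print(cube[("Laptop", 2023, "USA")])  # -> 340

# A ratio/measure aggregates facts across reference objects of one dimension:
usa_total = sum(v for (p, y, c), v in cube.items() if c == "USA")
print(usa_total)  # -> 850
```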
What are Data Marts?
- Data marts are specialized DWHs targeted to a particular functional area or user group.
- Can be derived from an enterprise DWH or collected directly from data sources.
- Easier to build than an enterprise DWH.
- Why a data mart approach?
- Semantic partitioning
- Archiving
- Load balancing
- Data protection
What are the (2) main data mart design approaches?
Top-down design:
- The requirements of all users are merged before the design process begins; an enterprise DWH is created first.
- Separate data marts are then derived for each business area or process.
- Data marts are closer to reporting.
Bottom-up design:
- Separate schemas are designed for each data mart based on the requirements of its business area.
- Later, these schemas are merged into a global schema for the DWH.

Mention some key differences (comparison points) between Traditional BI and Data Discovery:
Traditional BI:
- IT department
- Megavendors, large independent vendors
- Top-down/Central approach
- Interface based on Dashboards/KPIs
- Reporting as main use
- Query on existing repositories (data storage)
- Deployed by Consultants
Data Discovery:
- Oriented to business users
- Small, fast-growing independent vendors
- Bottom-up/Decentralized approach
- Interface based on Visualizations
- Analysis as main use
- Dedicated repository (data storage)
- Deployed by Users
What are some drawbacks of data mart approaches?
- No reconcilability of data (the single point of truth is lost)
- Reconciling data discrepancies can be difficult
- An increasing number of extracts from the DWH is needed
- Change propagation (changing one data mart may affect others) -> chances of error quickly grow
- Non-extensibility (might be needed to start from scratch after big organizational changes)
MapReduce can be used to extend a DWH, why?
- Traditional and NoSQL DBs are unfit to deal with the analysis of large quantities of data.
- Having a MapReduce component allows for Big Data analytics tasks.
- MapReduce can run complex analytical tasks in parallel (see the sketch below).
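A minimal map/shuffle/reduce in plain Python to show the pattern; this is not Hadoop code, and a real cluster would distribute the mapper and reducer calls across nodes:

```python
from collections import defaultdict
from functools import reduce

logs = ["us 10", "de 5", "us 7", "de 3"]  # toy input records

# Map: emit (key, value) pairs; each call is independent, so it parallelizes.
def mapper(line):
    country, amount = line.split()
    return (country, int(amount))

# Shuffle: group the emitted values by key.
groups = defaultdict(list)
for key, value in map(mapper, logs):
    groups[key].append(value)

# Reduce: aggregate each group; different groups can also run in parallel.
totals = {k: reduce(lambda a, b: a + b, vs) for k, vs in groups.items()}
print(totals)  # {'us': 17, 'de': 8}
```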
Mention the (3) typical recipients of reports and which kind of report each of them receives:
- CEO/Management: dashboards
- Data Scientists: dashboards, standard reports, spreadsheets, ad-hoc queries, data mining and planning.
- Business Users: standard reports, spreadsheets, ad-hoc queries, planning

What are the things to consider when designing a report?
- Purpose: analytics, reporting
- Update frequency: monthly, daily, hourly, right-time, real-time
- Audience/Recipients: executives, business analysts
- Output format: screen (size), paper (size), mobile devices, web
- Data type: quantitative (booking, revenues, orders) or non-quantitative (top 10 customers, issues for investigation)
Why is data visualization becoming increasingly important/valuable?
Also mention some benefits of it.
- Tools and practices for data visualization, data discovery, and visual analysis enable non-technical users to make effective use of data and reduce time to insight.
- In BigData times it is essential to provide graphical representations of data and analytical concepts.
- Visual analytics enable interactions with data and engagement in analytical processes.
- Moreover, data aggregations and summaries remain critical to support visual reporting and analytics.
- Overall top benefits sought by companies:
- Improved operational efficiency
- Faster response to business change
- Ability to identify new business opportunities
- Higher productivity
Mention some Information Dashboard Design principles, elaborate on the Gestalt principles:
Information Dashboard Design general considerations:
- Encoding data for rapid perception
- Using attributes of color
- Using attributes of form (quantitative vs qualitative/categorical data)
Gestalt principles:
- Proximity (objects that are closer together are perceived as more related than objects that are further apart)
- Similarity (objects that look similar to one another are perceived as a coherent group or pattern, a single unit)
- Enclosure (creates groups: elements are perceived as part of a group if they are located within the same closed region; everything outside the enclosure is seen as separate)
- Closure (occurs when an object is incomplete or a space is not completely enclosed; the viewer perceives the complete shape by filling in the missing information)
- Continuity (occurs when the eye is compelled to move through one object and continue to another one smoothly)
- Connection (elements that are visually connected are perceived as more related than elements with no connections at all)

Define and elaborate on Tufte's design principle of the data-ink ratio:
Tufte's design principle of the data-ink ratio: every bit of ink on a graphic requires a reason, and almost always that reason should be that the ink presents new information. Applied to data pixels:
- Reduce non-data pixels: eliminate all unnecessary non-data pixels; de-emphasize and regularize those that remain. Examples: getting rid of frames and using simple axes, avoiding variations in color, using blank space instead of borders or lines, using light colors in grid lines.
- Enhance data pixels: eliminate all unnecessary data pixels; highlight the most important ones that remain. Examples: using visual emphasis through colors, varying color intensities to highlight data dynamically (see the sketch below).
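A sketch applying both rules with matplotlib; the dataset and styling choices are invented, and matplotlib is just one possible tool:

```python
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr"]
revenue = [120, 135, 128, 160]

fig, ax = plt.subplots()
bars = ax.bar(months, revenue, color="lightgray")

# Reduce non-data pixels: drop the frame, keep simple axes, light grid lines.
for side in ("top", "right"):
    ax.spines[side].set_visible(False)
ax.grid(axis="y", color="0.9")
ax.set_axisbelow(True)  # draw the grid behind the bars

# Enhance data pixels: visually emphasize only the most important value.
bars[revenue.index(max(revenue))].set_color("steelblue")

ax.set_ylabel("Revenue")
plt.show()
```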