02 Data Warehouse Architecture, DWH Tool overview Flashcards
What is the definition of a Data Warehouse?
- A DWH is a subject-oriented, integrated, nonvolatile, and time-variant collection of data in support of management decisions (Inmon).
- Subject-oriented: organized around analytical requirements, in contrast to application-oriented operational databases.
- Integrated: joins data from multiple sources, resolving differences in data formats.
- Nonvolatile: data is durable; once loaded it is not modified, and the scope of data only expands.
- Time-variant: older data can still be retrieved; every piece of information is paired with a point in time.
- A DWH is a copy of transaction data specifically structured for query and analysis (Kimball).
Graphically explain the (7) main DWH Architecture components:
- Analysis and visualization: dashboards, standard reports, spreadsheets, ad-hoc queries, data mining, planning.
- Data Marts
- Web-based Access
- DWH Extension: MapReduce, Hadoop, HDFS (Hadoop Distributed File System)
- DWH Core: metadata, data basis, OLAP Server
- Data Migration (ETL): selection, extraction, cleaning, transformation, load/update
- Operational Data Sources: internal data sources, external data sources (Streams, HDFS, Logs).
From the DWH Architecture, classify and mention relevant examples for the Operational Data Sources:
Operational Data Sources can be divided into:
-
Internal Data Sources:
- ERP systems
- CRM systems
- BPM systems
- ODS (operational data stores)
- Master data management
- Other DWHs
- Flat files (CSVs, text files, others)
-
External Data Sources:
- Streams
- HDFS (Hadoop Distributed File System)
- Web feeds
- Flat files (CSVs, text files, others)
From the DWH Architecture, describe the steps and mention some examples for the Data Migration (ETL):
-
ETL steps (see the sketch below):
- Selection (of sources)
- Extraction (of the actual data for further processing)
- Cleansing (data unification rules, standardized values, validation)
- Transformation (converting data to the same units, joining data, aggregates, sortings, calculations…)
- Loading/Updating (correct full loads)
- About 70% of DWH development hours go to the ETL process
- Scenarios: data aggregation, currency conversion, errors, deduplication, batch processing
- Metadata generation/repository:
- Technical (system, source data, source frequency)
- Business (display name, content descriptions)
- Process (task logs, disk reads, query time, rows processed)
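- A minimal ETL sketch in Python, assuming a CSV flat file as the source, an illustrative `EUR_RATES` conversion table, and a plain dict as the warehouse target (all hypothetical, for illustration only):

```python
import csv
import io

# Hypothetical source: a flat file (CSV) chosen in the selection step.
RAW_CSV = """customer,country,amount,currency
alice,DE,100.0,EUR
bob,US,250.0,USD
alice,DE,-5.0,EUR
"""

EUR_RATES = {"EUR": 1.0, "USD": 0.9}  # assumed conversion rates

def extract(raw):
    """Extraction: pull the actual rows out of the source."""
    return list(csv.DictReader(io.StringIO(raw)))

def cleanse(rows):
    """Cleansing: apply a unification rule, drop invalid records."""
    return [r for r in rows if float(r["amount"]) > 0]

def transform(rows):
    """Transformation: convert to the same unit (EUR) and aggregate."""
    totals = {}
    for r in rows:
        eur = float(r["amount"]) * EUR_RATES[r["currency"]]
        totals[r["country"]] = totals.get(r["country"], 0.0) + eur
    return totals

def load(totals, warehouse):
    """Loading/updating: write the aggregates into the target."""
    warehouse.update(totals)

warehouse = {}
load(transform(cleanse(extract(RAW_CSV))), warehouse)
print(warehouse)  # {'DE': 100.0, 'US': 225.0}
```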
What are some motives for real-time ETL? Also explain some of its trade-offs.
- Users demand higher levels of freshness (relevant for decision making, BI)
- Globalization of the economy (operating in several time zones, night-time update windows disappear)
- New types of data sources (live web feeds, stock market prices, more real time info available)
- Affordable technical realizations (better loading performance thanks to faster technologies => columnar storage, in-memory computing)
- It also has trade-offs:
- Possibly incomplete records (replacing batch files with reads from message queues or transaction logs; see the sketch below)
- Restricting data quality screening to some columns
- Allowing new/early facts to be posted with old dimensions (outdated master data records, danger of missing referential integrity)
- Data staging is eliminated
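- A sketch of the queue-based pattern, using Python's `queue` module as a stand-in for a real message broker; the fact arriving before its dimension record illustrates the referential-integrity risk mentioned above (all names and data are invented):

```python
import queue

# Stand-in for a message broker or transaction log feeding real-time ETL.
events = queue.Queue()
events.put({"type": "fact", "product_id": 42, "qty": 3})   # early fact
events.put({"type": "dim",  "product_id": 42, "name": "Widget"})

dimensions = {}   # master data records
facts = []        # fact rows, possibly posted against old/missing dimensions

while not events.empty():
    msg = events.get()
    if msg["type"] == "dim":
        dimensions[msg["product_id"]] = msg["name"]
    else:
        # Real-time trade-off: the fact is loaded even though its dimension
        # record has not arrived yet (danger of missing referential integrity).
        if msg["product_id"] not in dimensions:
            print("warning: fact references unknown product", msg["product_id"])
        facts.append(msg)

print(facts, dimensions)
```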
Mention some reasons to separate OLTP (transactions) and OLAP (analytical, DWH):
-
Response times:
- Bad query performance on operational DBs
- Read-oriented data storage in DWHs
-
Long-term data storage:
- Analyzing data over a period of time
-
Independent data access:
- No interference of operational systems
-
Standardization of data formats, data harmonization:
- Enhancing data quality
Mention some key characteristics of OLTP:
- OLTP = On Line Transaction Processing
- Application: Operational, ERP, CRM, legacy Apps.
- Typical users: Staff
- Time horizon: Weeks, Months
- Refresh: Immediate
- Data model: Entity-Relationship
- Schema: Normalized
- Emphasis: Update (constant transactions)
Mention some key characteristics of OLAP:
- OLAP = On Line Analytic Processing
- Application: MIS, DSS, DWH
- Typical users: Managers, Executives
- Time horizon: Years
- Refresh: Periodic
- Data model: Multi-dimensional
- Schema: Star (sketched below)
- Emphasis: Retrieval (read)
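- A minimal illustration of the star schema and the read emphasis, using Python's built-in `sqlite3` (the table and column names are made up for the example):

```python
import sqlite3

# A minimal star schema: one fact table plus one dimension table.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE fact_sales  (product_id INTEGER, amount REAL);
INSERT INTO dim_product VALUES (1, 'Books'), (2, 'Games');
INSERT INTO fact_sales  VALUES (1, 10.0), (1, 5.0), (2, 30.0);
""")

# OLAP emphasis: retrieval (read) over a star join, aggregated per category.
for row in con.execute("""
    SELECT d.category, SUM(f.amount)
    FROM fact_sales f JOIN dim_product d USING (product_id)
    GROUP BY d.category
"""):
    print(row)   # ('Books', 15.0), ('Games', 30.0)
```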
Mention the (3) elements of the DWH Core Layer:
- Metadata
- DWH Data Basis
- OLAP Server
Mention the elements of the multidimensional model (data cubes):
- Dimensions (general axis names: products, time, location)
- Reference objects (names on the dimensions axis: México, USA, Canada)
- Facts (stored content on each individual X,Y,Z coordinate)
- Ratios (measures derived or computed from the facts)
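- A toy data cube in Python showing the same elements: dimension names, reference objects as coordinates, facts as the stored values, and a simple roll-up aggregation (all data invented for illustration):

```python
# Dimensions with their reference objects (axis names and axis labels).
dimensions = {
    "product":  ["Laptop", "Phone"],
    "time":     ["2023", "2024"],
    "location": ["México", "USA", "Canada"],
}

# Facts: stored content at each (product, time, location) coordinate.
cube = {
    ("Laptop", "2023", "USA"):    120,
    ("Laptop", "2024", "USA"):    150,
    ("Phone",  "2024", "México"):  80,
}

# A simple roll-up: aggregate the facts along the location dimension.
by_product_year = {}
for (product, year, _location), units in cube.items():
    by_product_year[(product, year)] = by_product_year.get((product, year), 0) + units

print(by_product_year)
```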
What are Data Marts?
- Data marts are specialized DWHs targeted to a particular functional area or user group.
- Can be derived from an enterprise DWH or collected directly from data sources.
- Easier to build than an enterprise DWH.
- Why a data mart approach?
- Semantic partitioning
- Archiving
- Load balancing
- Data protection
What are the (2) main data mart design approaches?
-
Top-down design
- User requirements are merged before the design process begins; an enterprise-wide DWH is created first.
- Separate data marts exist for each business area or process.
- Data marts are closer to reporting.
-
Bottom-up design
- Separate schemas are designed for each data mart based on the requirements of its business area.
- Later, these schemas are merged into a global schema for the DWH.
Mention some key differences between Traditional BI and Data Discovery:
-
Traditional BI:
- IT department
- Mega vendors, large independent vendors
- Top-down/Central approach
- Interface based on Dashboards/KPIs
- Reporting as main use
- Query on existing repositories (data storage)
- Deployed by Consultants
-
Data Discovery:
- Business-users oriented
- Small, fast-growing independent vendors
- Bottom-up/Decentralized approach
- Interface based on Visualizations
- Analysis as main use
- Dedicated repository (data storage)
- Deployed by Users
What are some drawbacks of data mart approaches?
- No reconcilability of data (the single point of truth is lost)
- Reconciling data discrepancies can be difficult
- A growing number of extracts from the DWH is needed
- Change propagation (changing one data mart may affect others) -> chances of error quickly grow
- Non-extensibility (it might be necessary to start from scratch after big organizational changes)
Why can MapReduce be used to extend a DWH?
- Traditional and NoSQL DBs are unfit for analyzing very large quantities of data.
- Having a MapReduce component allows for Big Data analytics tasks.
- MapReduce can run complex analytical tasks in parallel.
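- A minimal MapReduce sketch in plain Python: map emits key/value pairs, the shuffle groups them by key, and reduce aggregates per key (the sales data is illustrative; a real Hadoop job would distribute these phases across nodes):

```python
from collections import defaultdict
from itertools import chain

# Input split across "nodes": sales records as (country, amount) pairs.
splits = [
    [("DE", 100.0), ("US", 250.0)],
    [("DE", 40.0), ("US", 10.0), ("DE", 1.0)],
]

def map_phase(record):
    # Map: emit (key, value) pairs; runs independently per record/split.
    country, amount = record
    yield country, amount

def reduce_phase(key, values):
    # Reduce: aggregate all values that were shuffled to the same key.
    return key, sum(values)

# Shuffle: group mapper output by key (done by the framework in Hadoop).
grouped = defaultdict(list)
for key, value in chain.from_iterable(map_phase(r) for split in splits for r in split):
    grouped[key].append(value)

print([reduce_phase(k, vs) for k, vs in grouped.items()])
# [('DE', 141.0), ('US', 260.0)]
```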