Data Warehouse Flashcards
What is a single, complete, and consistent store of data obtained from a variety of different sources, made available to end users in a way they can understand and use in a business context?
A Data Warehouse
Why do we use data warehouses?
- Consolidation of information resources
- Improved query performance
- Separate research and decision support functions from the operational systems
- Foundation for data mining, data visualization, advanced reporting, and OLAP tools
What is a data warehouse used for?
- Knowledge discovery
- Making consolidated reports
- Finding relationships and correlations
- Data mining
The queryable source of data in the enterprise that comprises the union of all of its constituent data marts is called what?
Data Warehouse
A logical subset of the complete data warehouse that is often
viewed as a restriction of the data warehouse to a single business
process or to a group of related business processes targeted toward a
particular business group is called what?
Data Mart
What is a point of integration for operational systems that were developed independently of each other?
Operational Data Store (ODS)
What are Kimball methodologies?
- Logical data warehouse (BUS), made up of subject areas (data marts)
- Business driven, users have active participation
- Decentralized data marts (not required to be a separate physical data store)
- Independent dimensional data marts optimized for reporting/analytics
- Integrated via Conformed Dimensions (provides consistency across data sources)
- 2-tier (data mart, cube), less ETL, no data duplication
What are Inmon methodologies?
- Enterprise data model (CIF) realized as an enterprise data warehouse (EDW)
- IT Driven, users have passive participation
- Centralized atomic normalized tables (off limit to end users)
- Later create dependent data marts that are separate physical subsets of data and can be used for multiple purposes
- Integration via enterprise data model
- 3-tier (data warehouse, data mart, cube), duplication of data
What are the differences between ETL vs ELT?
Extract, Transform, and Load (ETL)
* Transform while hitting source system
* No staging tables
* Processing done by ETL tools (SSIS)
Extract, Load, Transform (ELT)
* Uses staging tables
* Processing done by target database engine (SSIS: Execute T-SQL Statement task instead of Data Flow Transform tasks)
* Use for big volumes of data
* Use when source and target databases are the same
* Use with the Analytics Platform System (APS)
ELT is often the better choice, since the database engine is more efficient than SSIS at transformations
* Best use of database engine: Transformations
* Best use of SSIS: Data pipeline and workflow management
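The ELT pattern above can be sketched with Python's built-in sqlite3 standing in for the target database engine. This is illustrative only (table and column names are made up): raw data is extracted and loaded into a staging table unchanged, then a set-based SQL statement does the transform inside the engine, mirroring an Execute T-SQL Statement task instead of a Data Flow.

```python
import sqlite3

# Extract + Load: raw rows land in a staging table unchanged.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stg_sales (order_id INTEGER, amount TEXT)")
raw_rows = [(1, " 10.50 "), (2, "3.25")]  # as extracted from a source system
conn.executemany("INSERT INTO stg_sales VALUES (?, ?)", raw_rows)

# Transform: the target database engine does the work in set-based SQL.
conn.execute("""
    CREATE TABLE fact_sales AS
    SELECT order_id, CAST(TRIM(amount) AS REAL) AS amount
    FROM stg_sales
""")
total = conn.execute("SELECT SUM(amount) FROM fact_sales").fetchone()[0]
print(total)  # 13.75
```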
What is a unique identifier not derived from source system?
Surrogate Keys
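A minimal sketch of a surrogate key, again using sqlite3 (the `customer_code` natural key is a made-up example): the warehouse assigns its own integer key on load, independent of any identifier from the source system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE dim_customer (
        customer_sk   INTEGER PRIMARY KEY,  -- surrogate key, assigned on load
        customer_code TEXT,                 -- natural key from the source system
        name          TEXT
    )
""")
conn.execute("INSERT INTO dim_customer (customer_code, name) VALUES ('C-001', 'Ada')")
conn.execute("INSERT INTO dim_customer (customer_code, name) VALUES ('C-002', 'Grace')")
keys = [row[0] for row in conn.execute(
    "SELECT customer_sk FROM dim_customer ORDER BY customer_sk")]
print(keys)  # [1, 2]
```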
Describe a star schema
- Data-modeling technique
- Maps multidimensional decision support data into a relational database
- Creates a near equivalent of a multidimensional database schema from relational data
- Yields an easily implemented model for multidimensional data analysis while preserving the relational structures on which the operational database is built
- The basic star schema has four components: facts, dimensions, attributes, and attribute hierarchies
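A tiny star schema sketched in sqlite3 (all table and column names are illustrative): a central fact table holds measures and foreign keys, the dimension tables hold context, and a typical star-join query slices the facts by dimension attributes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_date    (date_sk INTEGER PRIMARY KEY, year INTEGER);
    CREATE TABLE dim_product (product_sk INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE fact_sales  (date_sk INTEGER, product_sk INTEGER, amount REAL);

    INSERT INTO dim_date    VALUES (1, 2023), (2, 2024);
    INSERT INTO dim_product VALUES (1, 'Books'), (2, 'Games');
    INSERT INTO fact_sales  VALUES (1, 1, 5.0), (2, 1, 7.0), (2, 2, 3.0);
""")

# Star-join: aggregate the fact, filtered by dimension attributes.
row = conn.execute("""
    SELECT SUM(f.amount)
    FROM fact_sales f
    JOIN dim_date d    ON d.date_sk = f.date_sk
    JOIN dim_product p ON p.product_sk = f.product_sk
    WHERE d.year = 2024 AND p.category = 'Books'
""").fetchone()
print(row[0])  # 7.0
```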
Describe facts
- Numeric measurements that represent a specific business aspect or activity
- Normally stored in a fact table that is the center of the star schema
- The fact table contains facts linked through their dimensions
- Metrics are facts computed at run time
Describe Dimensions
- Qualifying characteristics that provide additional perspectives to a given fact
- Decision support data is almost always viewed in relation to other data
- Facts are studied via their dimensions
- Dimensions are stored in dimension tables
- Each dimension record is related to thousands of fact records
- Dimensions are represented as physical tables in the data warehouse database
- Dimension tables are smaller than fact tables
Describe Slowly Changing Dimensions (SCD)
- Dimensions that change slowly over time, rather than on a regular, time-based schedule
- In a data warehouse there is a need to track changes in dimension attributes in order to report historical data
- Implementing one of the SCD types enables users to assign the proper dimension attribute value for a given date
- Examples of such dimensions: customer, geography, employee
How do you handle SCD types?
- Type 0 - The passive method
- Type 1 - Overwriting the old value
- Type 2 - Creating a new additional record
- Type 3 - Adding a new column
- Type 4 - Using historical table
- Type 6 - Combine approaches of types 1,2,3 (1+2+3=6)
SCD Type 0
- The passive method. No special action is performed upon dimensional changes. Some dimension data can remain the same as it was when first inserted; other data may be overwritten.
SCD Type 1
- Overwriting the old value. No history of dimension changes is kept in the database; the old dimension value is simply overwritten by the new one. This type is easy to maintain and is often used for data whose changes are caused by processing corrections (e.g. removal of special characters, correcting spelling errors).
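A minimal Type 1 sketch in sqlite3 (names and values are made up): a processing correction simply overwrites the attribute in place, so no history survives.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_customer (customer_sk INTEGER PRIMARY KEY, city TEXT)")
conn.execute("INSERT INTO dim_customer VALUES (1, 'Pittsburg')")  # misspelled in source

# Correction arrives: overwrite in place; the old value is gone.
conn.execute("UPDATE dim_customer SET city = 'Pittsburgh' WHERE customer_sk = 1")
city = conn.execute("SELECT city FROM dim_customer WHERE customer_sk = 1").fetchone()[0]
print(city)  # Pittsburgh
```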
SCD Type 2
- Creating a new additional record. All history of dimension changes is kept in the database.
- You capture an attribute change by adding a new row with a new surrogate key to the dimension table. Both the prior and new rows contain the natural key (or other durable identifier) as an attribute.
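A minimal Type 2 sketch in sqlite3 (table, column names, and the current-flag convention are illustrative): a change expires the old row and inserts a new row with a new surrogate key, while both rows keep the same natural key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE dim_customer (
        customer_sk   INTEGER PRIMARY KEY,
        customer_code TEXT,     -- natural (durable) key
        city          TEXT,
        current_flag  INTEGER
    )
""")
conn.execute("INSERT INTO dim_customer VALUES (1, 'C-001', 'London', 1)")

# Customer moves: expire the old row, then insert the new current row.
conn.execute("""UPDATE dim_customer SET current_flag = 0
                WHERE customer_code = 'C-001' AND current_flag = 1""")
conn.execute("""INSERT INTO dim_customer (customer_code, city, current_flag)
                VALUES ('C-001', 'Paris', 1)""")

rows = conn.execute(
    "SELECT city, current_flag FROM dim_customer ORDER BY customer_sk").fetchall()
print(rows)  # [('London', 0), ('Paris', 1)]
```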
SCD Type 3
- Adding a new column. Usually only the current and previous value of the dimension is kept in the database. The new value is loaded into the 'current/new' column and the old one into the 'old/previous' column. Generally speaking, the history is limited to the number of columns created for storing historical data. This is the least commonly needed technique.
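A minimal Type 3 sketch in sqlite3 (column names are illustrative): a change shifts the current value into the prior column, so only one level of history survives.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE dim_customer (
        customer_sk  INTEGER PRIMARY KEY,
        current_city TEXT,
        prior_city   TEXT
    )
""")
conn.execute("INSERT INTO dim_customer VALUES (1, 'London', NULL)")

# On change: shift current into prior, store the new value as current.
conn.execute("""
    UPDATE dim_customer
    SET prior_city = current_city, current_city = 'Paris'
    WHERE customer_sk = 1
""")
row = conn.execute("SELECT current_city, prior_city FROM dim_customer").fetchone()
print(row)  # ('Paris', 'London')
```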
SCD Type 4
- Using a historical table. A separate historical table tracks all of a dimension's attribute changes, while the 'main' dimension table keeps only the current data, e.g. customer and customer_history tables.
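A minimal Type 4 sketch in sqlite3 (names are illustrative): before each change, the old row is archived to the history table, and the main dimension table is then overwritten so it holds only current data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customer         (customer_sk INTEGER PRIMARY KEY, city TEXT);
    CREATE TABLE dim_customer_history (customer_sk INTEGER, city TEXT);

    INSERT INTO dim_customer VALUES (1, 'London');
""")

# On change: archive the old value, then overwrite the main table.
conn.execute("""INSERT INTO dim_customer_history
                SELECT customer_sk, city FROM dim_customer WHERE customer_sk = 1""")
conn.execute("UPDATE dim_customer SET city = 'Paris' WHERE customer_sk = 1")

current = conn.execute("SELECT city FROM dim_customer").fetchone()[0]
history = conn.execute("SELECT city FROM dim_customer_history").fetchall()
print(current, history)  # Paris [('London',)]
```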
SCD Type 6
- Combines the approaches of types 1, 2, and 3 (1+2+3=6). The dimension table gains additional columns:
- current_type - keeps the current value of the attribute. All history records for a given item of the attribute have the same current value.
- historical_type - keeps the historical value of the attribute. All history records for a given item of the attribute can have different values.
- start_date - keeps the start date of the attribute's 'effective date' history.
- end_date - keeps the end date of the attribute's 'effective date' history.
- current_flag - marks the most recent record.
Describe Attributes
- Each dimension table contains attributes.
- Attributes are often used to search, filter or classify facts.
- Dimensions provide descriptive characteristics about the facts through their attributes
- The data warehouse designer must define common business attributes that will be used by the data analyst to narrow the search
- No mathematical limit to the number of dimensions
- Slice and dice: focus on slices of the data cube for more detailed analysis
Describe Attribute Hierarchies
- Provide top-down data organization
- Two purposes:
- Aggregation
- Drill-down/roll-up data analysis
- Determine how the data are extracted and represented
- Stored in the DBMS’s data dictionary
- Used by OLAP tool to access warehouse properly
What is a more complex variation of the star schema, used in a data warehouse, in which the tables that describe the dimensions are normalized?
Snowflake schema
Describe Fact Constellation Schema
- For each star schema it is possible to construct a fact constellation schema (for example, by splitting the original star schema into more star schemas, each of which describes facts at another level of the dimension hierarchies).
- The fact constellation architecture contains multiple fact tables that share many dimension tables.
Describe a Data Lake
- A data lake is a storage repository that holds a vast amount of raw data in its native format until it is needed.
- A data lake includes structured data from relational databases (rows and columns), semi-structured data (CSV, logs, XML, JSON), unstructured data (emails, documents, PDFs), and even binary data (images, audio, video), thus creating a centralized data store accommodating all forms of data.