Data Warehouse Flashcards
Characteristics
- Object-Oriented:
○ Main object types: .COM and .NET- Integrated:
○ Integration in a data warehouse involves bringing together similar data from different databases into a reliable and shared entity.
○ The integration process ensures consistent naming conventions, format, and codes, enabling effective analysis. - Time-Variant:
○ The data warehouse contains not only the current data but also a historical perspective, allowing users to analyze and understand changes and trends over time. - Non-Volatile:
○ Data in a data warehouse is permanent and not erased or deleted when new data is inserted.
○ Data is read-only, refreshed at intervals.
- Integrated:
Data Operations
○ Two primary data operations in a data warehouse are Data Loading and Data Access.
○ Data loading involves extracting, transforming, and loading data into the warehouse. (ETL Operation)
○ Data access enables users to retrieve and analyze integrated data for decision-making and reporting.
Other Characteristics
→ Concurrent Use:
○ Data warehouses support concurrent use, allowing many different users to access and utilize the data simultaneously.
○ This is particularly useful in scenarios where multiple users want to interact with the data at the same time, such as when booking airline tickets to the same destination
→ Metadata: ○ Metadata is crucial in a data warehouse environment. ○ It provides information about the data, such as its source, meaning, relationships, and transformations. ○ Metadata helps users understand and manage the data within the warehouse. → Multi-dimensional Data Model: ○ Data warehouses often use a multi-dimensional data model, which means they organize data into dimensions and facts. ○ Dimensions are the perspectives by which data can be analyzed (e.g., time, geography), and facts are the measurable metrics (e.g., sales, revenue). → Aggregated Data: ○ Data in warehouses is often aggregated, meaning that it is collected and summarized from various sources to provide a higher-level perspective and support efficient querying and reporting. ○ Aggregations can include summaries, averages, totals, or other statistical measures. → Dimension Tables: ○ Data warehouses organize information into dimension tables, where each table represents data from a specific perspective. ○ For instance, understanding the effect of inflation on the Euro from a time-based perspective. → Synchronization Periodically: ○ Data warehouses synchronize periodically, typically through periodic transactions rather than after every transaction. ○ This approach is chosen to prevent potential crashes that might occur if synchronization happens too frequently. → Software Independent: ○ Definition: Data warehouses are software-independent, meaning they are not tied to a specific database software, and they can handle various data, relationships, schemas, and database languages.
OLAP and Cube Operations
→ OLAP stands for Online Analytical Processing, which is a technology that enables multi-dimensional analysis of business data.
→ OLAP is used to support business intelligence and decision-making processes.
→ Grouping of data in a multidimensional matrix is called data cubes.
→ A data cubes are used to handle multidimensional data that allows for complex analysis and querying.
→ Operations on data cubes typically involve manipulating and aggregating data along different dimensions.
Data cube Operations
- Roll-up:
○ Roll-up involves summarizing data at a higher level of abstraction.
○ It is the process of moving from detailed data to more aggregated levels.
○ For example, rolling up monthly sales data to quarterly or yearly totals.- Drill-down:
○ Drill-down is the opposite of roll-up.
○ It involves breaking down aggregated data into more detailed levels.
○ For instance, drilling down from yearly sales to quarterly or monthly details. - Slice:
○ Slicing involves selecting a single layer or “slice” of the cube by fixing one dimension at a specific value.
○ This operation provides a 2D view of the data cube.
○ For example, selecting sales data for a particular region and a specific time period. - Dice:
○ Dicing involves selecting a sub-cube by fixing values for two or more dimensions.
○ This operation allows you to focus on a specific subset of the data.
○ For example, selecting sales data for a particular region, product category, and time period. - Pivot (Rotate):
○ An imaging operation that rotates the data to present it from a different angle. Pivoting rotates the data on a specific axis.
○ This operation can be useful for analyzing data from different angles.
- Drill-down:
Commands used in slicing
- Statistical Functions (MIN(), MAX(), SUM(), …)
- WHERE
- GROUP BY
- ORDER BY
- DISTINCT
N dimensions data model (Data warehouse Schema Types)
- Star Schema:
○ Description: In a star schema, there is a central computer (or node) that connects to multiple computers around it. These peripheral computers represent different dimensions of the data.- Snowflake Schema:
○ Description: Similar to the star schema, the snowflake schema also has a central device, but the computers around it can have additional computers around them, forming subnetworks. This structure resembles the branches of a snowflake. - Galaxy Schema:
○ Description: In a galaxy schema, everything is connected without a central node. All computers or nodes are interconnected, forming a network without a specific center.
- Snowflake Schema:
Business Intelligence
encompasses a range of activities and tools aimed at transforming raw data into meaningful insights for business decision-making.
Steps for Business Intelligence
- Basic Querying and Reporting:
→ Querying: Involves retrieving specific data from a database using SQL or other query languages.
→ Reporting: Involves presenting the queried data in a structured format, often through tables and charts. Reporting tools allow users to create and schedule regular reports.- Business Analysis (BA):
→ Involves the examination and interpretation of data to support decision-making.
→ Business analysts use various techniques and tools to analyze trends, patterns, and relationships within the data to provide insights that can guide strategic decisions. - Data Mining:
→ Data mining is the process of discovering patterns, trends, correlations, or valuable insights from large sets of data.
→ It involves using various techniques, algorithms, and methods to extract useful and previously unknown information from raw or unstructured data.
→ It involves the use of a range of techniques, including slicing, dicing, roll-up, pivoting, statistical techniques, and machine learning, contributing to comprehensive data analysis and to uncover hidden information within the data. - Dashboards:
→ Provide a visual representation of key performance indicators (KPIs) and other important metrics.
→ Dashboards consolidate and display data in a way that is easy to understand, allowing users to monitor the health and performance of their business at a glance.
→ They often include charts, graphs, and other visual elements.
- Business Analysis (BA):
Goals of Data Warehouse
- Prediction: Involves forecasting future trends or outcomes based on historical data.
- Descriptive: Focuses on summarizing and presenting historical data to understand past events.
- Prediction:
→ To forecast future trends, outcomes, or events based on historical data and patterns
→ Good methods:
§ Probability: Involve calculating probabilities to estimate the likelihood of specific events or outcomes.
§ Fourier Transformation: is a mathematical technique used to analyze patterns in time-dependent data, often in the frequency domain. It’s valuable for understanding periodic trends and cycles. Fourier analysis can enhance the accuracy of predictions by capturing cyclic behavior in the data.
§ Extension of Data: involves incorporating additional historical data or expanding the dataset to enhance the predictive model’s robustness. This can include gathering more historical records, leading to more accurate predictions. - Descriptive:
→ To provide a detailed and comprehensive understanding of historical data.
→ Involves summarizing and presenting historical data to describe what has happened in the past.
→ Benefit: Understanding historical patterns and trends helps in gaining insights into the factors that have influenced business performance.
Data Warehouse vs Data Models
(Dominant Operations, Goals, Type of model, Technology)
- Dominant Operations:
→ Data warehouse: Data cube operations (slicing, dicing,…)
→ Data Model: DISTINCT, GROUP BY, SELECT, WHERE, SORT (not for data mining)- Goals:
→ Data warehouse: Prediction and descriptive
→ Data Model: normalization, independence, data integrity, referential integrity, structuring data - Type of model:
→ Data warehouse: star, snowflake, and galaxy models
→ Data Model: relational, entity, and hierarchical models - Technology:
→ Data warehouse: Software independent, concurrent users, meta data, aggregated data…
Data Model: relations, constraints, DBMS, SCAN, SEEK
- Goals:
OLTP vs Warehousing
(Dominant Operations, Goals, Type of model, Technology)
OLTP:
Dominant Operations
- Maintenance Operations: Involves inserting, modifying, and deleting data as part of transactional processes.
Optimal Goals
Improving Maintenance Efficiency: Achieving acceptable transaction processing times for routine maintenance operations (insertion, modification, deletion).
Model
Normalized model with a focus on maintaining data integrity, often in a 2D structure
DB Technology
- Fundamentally Access by Content: Designed for content-based access, which may involve scanning data.
- Navigation by Foreign Key: Utilizes foreign keys for navigating relationships between tables.
VS.
Data Warehousing:
Dominant Operations
- Complex Queries: Involves querying large datasets for analytical purposes, often with aggregations, filtering, and grouping.
- Grouping and Ranking: Organizing and ranking individuals (objects, events) based on specified criteria and hierarchical structures.
Optimal Goals
Improving Query Efficiency: Achieving acceptable response times for complex queries and reporting.
Model
Multi-dimensional model with N dimensions and data blocks, supporting efficient querying and analysis.
DB Technology
- Relational Database with Star Structure: Tables exhibit relationships in a star schema, optimized for analytical queries.
- Multidimensional Database: Utilizes a multidimensional structure for faster access, often based on navigation indicators.