5. Data Warehousing Flashcards
What is a data warehouse?
A data warehouse is a centralized repository for storing and managing large volumes of data from multiple sources.
It differs from an operational (OLTP) database in that it is optimized for analysis and reporting rather than transaction processing.
How does a data warehouse differ from an operational database?
A data warehouse is designed for analytical (OLAP) queries and reporting, while an operational database is optimized for transaction processing (OLTP).
Warehouse queries typically scan and aggregate large volumes of historical data, whereas transactional queries read or write a few rows at a time.
What is a star schema?
A star schema is a dimensional schema in which a central fact table joins directly to the surrounding dimension tables, so the diagram resembles a star.
Because each dimension is a single denormalized table, a query needs only one join per dimension.
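As an illustration, here is a minimal sketch of a star schema using Python's built-in sqlite3 module. The table and column names (fact_sales, dim_product, dim_date) and the sample rows are hypothetical, chosen only to show one fact table joined directly to its dimensions.

```python
import sqlite3

def build_star_schema(conn):
    # One central fact table plus two dimension tables (all names are illustrative).
    conn.executescript("""
        CREATE TABLE dim_date (
            date_key INTEGER PRIMARY KEY,
            calendar_date TEXT,
            year INTEGER
        );
        CREATE TABLE dim_product (
            product_key INTEGER PRIMARY KEY,
            product_name TEXT,
            category TEXT          -- kept in the same table: denormalized
        );
        CREATE TABLE fact_sales (
            date_key INTEGER REFERENCES dim_date(date_key),
            product_key INTEGER REFERENCES dim_product(product_key),
            quantity INTEGER,
            amount REAL
        );
    """)
    conn.executemany(
        "INSERT INTO dim_date VALUES (?, ?, ?)",
        [(20240101, "2024-01-01", 2024), (20240102, "2024-01-02", 2024)],
    )
    conn.executemany(
        "INSERT INTO dim_product VALUES (?, ?, ?)",
        [(1, "Widget", "Hardware"), (2, "Gadget", "Hardware"), (3, "Manual", "Books")],
    )
    conn.executemany(
        "INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
        [(20240101, 1, 2, 20.0), (20240101, 3, 1, 15.0), (20240102, 2, 1, 30.0)],
    )

def sales_by_category(conn):
    # A single join reaches the descriptive attribute used for grouping.
    return conn.execute("""
        SELECT p.category, SUM(f.amount)
        FROM fact_sales AS f
        JOIN dim_product AS p ON p.product_key = f.product_key
        GROUP BY p.category
        ORDER BY p.category
    """).fetchall()

conn = sqlite3.connect(":memory:")
build_star_schema(conn)
print(sales_by_category(conn))  # [('Books', 15.0), ('Hardware', 50.0)]
```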
What is a snowflake schema?
A snowflake schema is a variation of the star schema that normalizes dimension tables into multiple related tables.
This reduces data redundancy but adds joins, which can complicate and slow queries.
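The extra join can be seen in a small sketch, again with Python's sqlite3 module. The names dim_category, dim_product, and fact_sales are hypothetical; the point is only that the category attribute now lives in its own table, so grouping by it takes two joins.

```python
import sqlite3

def build_snowflake_schema(conn):
    # The product dimension is normalized: category moves to its own table.
    conn.executescript("""
        CREATE TABLE dim_category (
            category_key INTEGER PRIMARY KEY,
            category_name TEXT
        );
        CREATE TABLE dim_product (
            product_key INTEGER PRIMARY KEY,
            product_name TEXT,
            category_key INTEGER REFERENCES dim_category(category_key)
        );
        CREATE TABLE fact_sales (
            product_key INTEGER REFERENCES dim_product(product_key),
            amount REAL
        );
        INSERT INTO dim_category VALUES (1, 'Hardware'), (2, 'Books');
        INSERT INTO dim_product VALUES (10, 'Widget', 1), (11, 'Gadget', 1), (12, 'Manual', 2);
        INSERT INTO fact_sales VALUES (10, 20.0), (11, 30.0), (12, 15.0);
    """)

def sales_by_category_snowflake(conn):
    # Two joins are needed: fact -> product -> category.
    return conn.execute("""
        SELECT c.category_name, SUM(f.amount)
        FROM fact_sales AS f
        JOIN dim_product AS p ON p.product_key = f.product_key
        JOIN dim_category AS c ON c.category_key = p.category_key
        GROUP BY c.category_name
        ORDER BY c.category_name
    """).fetchall()

snowflake_conn = sqlite3.connect(":memory:")
build_snowflake_schema(snowflake_conn)
print(sales_by_category_snowflake(snowflake_conn))  # [('Books', 15.0), ('Hardware', 50.0)]
```

Each category name is stored once, which is the redundancy saving; the cost is the additional join in every query that needs it.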
How do you design a dimensional model for reporting?
Design a dimensional model by selecting the business process, declaring the grain of the fact table, choosing the dimensions, and identifying the facts (measures).
Let the users' reporting requirements drive each of these choices.
What is the role of fact tables in a warehouse?
Fact tables store the quantitative measurements of a business process, with one row per event at the declared grain.
They typically contain numeric metrics and foreign keys to dimension tables.
What is the role of dimension tables in a warehouse?
Dimension tables provide context to the data in fact tables, containing descriptive attributes for analysis.
They help in filtering and grouping data.
What are surrogate keys in a data warehouse?
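Surrogate keys are system-generated identifiers with no business meaning (usually sequential integers), used as the primary keys of dimension tables instead of natural keys from source systems.
They insulate the warehouse from changes to source keys, keep joins on compact single-column keys, and let several versions of the same entity coexist for history tracking.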
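A minimal sketch of surrogate key assignment in Python follows. The class name SurrogateKeyGenerator and the idea of keying the lookup on a (source system, natural key) pair are assumptions made for illustration; real warehouses usually rely on a database sequence or identity column.

```python
from itertools import count

class SurrogateKeyGenerator:
    """Hands out sequential integer keys, one per distinct business key."""

    def __init__(self):
        self._next_key = count(1)   # 1, 2, 3, ...
        self._lookup = {}           # (source_system, natural_key) -> surrogate key

    def key_for(self, source_system, natural_key):
        business_key = (source_system, natural_key)
        if business_key not in self._lookup:
            self._lookup[business_key] = next(self._next_key)
        return self._lookup[business_key]

generator = SurrogateKeyGenerator()
print(generator.key_for("crm", "C-100"))  # 1
print(generator.key_for("erp", "C-100"))  # 2: same natural key, different source
print(generator.key_for("crm", "C-100"))  # 1: already known, key is reused
```

The second call shows why natural keys alone can be unsafe: two source systems may reuse the same identifier for different entities.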
How do you handle slowly changing dimensions (SCD)?
Handle slowly changing dimensions with strategies such as Type 1 (overwrite the old value), Type 2 (add a new row to preserve full history), or Type 3 (add a column holding the previous value, which keeps limited history).
Choose the method based on business requirements.
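A Type 2 change can be sketched in plain Python as follows. The CustomerRow fields (valid_from, valid_to, is_current) follow a common convention but are assumptions here, as is the helper name apply_type2_change.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

@dataclass
class CustomerRow:
    customer_key: int           # surrogate key, unique per version
    customer_id: str            # natural key from the source system
    city: str                   # the tracked attribute
    valid_from: date
    valid_to: Optional[date]    # None while the row is current
    is_current: bool

def apply_type2_change(rows: List[CustomerRow], customer_id: str,
                       new_city: str, change_date: date) -> List[CustomerRow]:
    """Expire the current version (if any) and append a new one."""
    current = next(
        (r for r in rows if r.customer_id == customer_id and r.is_current), None
    )
    if current is not None and current.city == new_city:
        return rows  # nothing changed, so no new version is needed
    if current is not None:
        current.valid_to = change_date
        current.is_current = False
    next_key = max((r.customer_key for r in rows), default=0) + 1
    rows.append(CustomerRow(next_key, customer_id, new_city, change_date, None, True))
    return rows

history: List[CustomerRow] = []
apply_type2_change(history, "C-1", "Boston", date(2023, 1, 1))
apply_type2_change(history, "C-1", "Denver", date(2024, 6, 1))
for row in history:
    print(row.customer_key, row.city, row.valid_from, row.valid_to, row.is_current)
```

Under Type 1 the same change would simply overwrite city on the existing row, and the Boston value would be lost.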
What is ETL testing in the context of a data warehouse?
ETL testing validates the Extract, Transform, Load (ETL) processes to ensure data accuracy and integrity.
It verifies that data arrives in the warehouse complete, correctly transformed, and free of duplicates.
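A few common reconciliation checks can be sketched in Python. The function name reconcile, its parameters, and the dictionary-based rows are hypothetical; in practice the same comparisons are usually run as SQL against the source and target systems.

```python
def reconcile(source_rows, target_rows, key, measure, tolerance=1e-9):
    """Compare source and target rows; return a list of issues (empty means pass)."""
    issues = []

    # Completeness: the number of rows loaded should match the number extracted.
    if len(source_rows) != len(target_rows):
        issues.append(
            f"row count mismatch: source={len(source_rows)}, target={len(target_rows)}"
        )

    # Uniqueness: a key should not appear more than once in the target.
    target_keys = [row[key] for row in target_rows]
    duplicates = sorted({k for k in target_keys if target_keys.count(k) > 1})
    if duplicates:
        issues.append(f"duplicate keys in target: {duplicates}")

    # Coverage: every source key should be present in the target.
    missing = sorted({row[key] for row in source_rows} - set(target_keys))
    if missing:
        issues.append(f"keys missing from target: {missing}")

    # Accuracy: the total of a numeric measure should survive the load.
    source_total = sum(row[measure] for row in source_rows)
    target_total = sum(row[measure] for row in target_rows)
    if abs(source_total - target_total) > tolerance:
        issues.append(f"total mismatch: source={source_total}, target={target_total}")

    return issues

source = [{"id": 1, "amount": 10.0}, {"id": 2, "amount": 5.5}]
bad_target = [{"id": 1, "amount": 10.0}, {"id": 1, "amount": 10.0}]
for issue in reconcile(source, bad_target, key="id", measure="amount"):
    print(issue)
```

Here the row counts agree, so a count check alone would pass; the duplicate, missing-key, and total checks are what catch the faulty load.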
How do you choose between Redshift, BigQuery, and Snowflake?
Choose based on factors like workload type, scalability, cost, and integration with existing tools.
Each platform has unique strengths and pricing models.
What is query optimization in a data warehouse?
Query optimization is the process of improving the performance of database queries to reduce execution time and resource usage.
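Techniques include partitioning, clustering or sort keys, materialized views, and rewriting queries; traditional indexes play a smaller role in columnar warehouses than in transactional databases.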
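To illustrate why partitioning helps, here is a toy Python sketch of partition pruning. The functions partition_by_month and monthly_total and the sample orders are hypothetical; a real warehouse does the equivalent at the storage layer.

```python
from collections import defaultdict

def partition_by_month(rows):
    """Group rows into partitions keyed by 'YYYY-MM' taken from order_date."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row["order_date"][:7]].append(row)
    return partitions

def monthly_total(partitions, month):
    """Sum amounts for one month, touching only that month's partition."""
    scanned = partitions.get(month, [])
    return sum(row["amount"] for row in scanned), len(scanned)

orders = [
    {"order_date": "2024-01-15", "amount": 10.0},
    {"order_date": "2024-01-20", "amount": 5.0},
    {"order_date": "2024-02-03", "amount": 7.5},
]
partitions = partition_by_month(orders)
total, rows_scanned = monthly_total(partitions, "2024-01")
print(total, rows_scanned)  # 15.0 2
```

Only two of the three rows are read, because the filter on month lets the query skip the February partition entirely.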
How does columnar storage improve performance in a data warehouse?
Columnar storage improves performance by storing each column's values together rather than storing whole rows together, so a query reads only the columns it references and runs of similar values compress well.
This is particularly beneficial for analytical queries.
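Both effects can be shown with a small Python sketch. The helpers to_columns and run_length_encode and the sample rows are illustrative assumptions; run-length encoding is only one of several compression schemes a columnar engine might apply.

```python
from itertools import groupby

def to_columns(rows):
    """Pivot a list of row dictionaries into one list per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns

def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]

rows = [
    {"order_id": 1, "country": "US", "amount": 10.0},
    {"order_id": 2, "country": "US", "amount": 20.0},
    {"order_id": 3, "country": "US", "amount": 30.0},
    {"order_id": 4, "country": "DE", "amount": 5.0},
    {"order_id": 5, "country": "DE", "amount": 15.0},
]
columns = to_columns(rows)

# An aggregate over one column reads only that column's values.
print(sum(columns["amount"]))                 # 80.0

# Repeated values sit next to each other, so they compress well.
print(run_length_encode(columns["country"]))  # [('US', 3), ('DE', 2)]
```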