Data Warehousing with Apache Hive Flashcards

Question

Which schema should be used based on business requirements?

Answer 1

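Star schemas suit straightforward, large-volume querying, because a central fact table joins directly to denormalized dimension tables. Snowflake schemas suit scenarios that need normalized dimension tables to maintain data integrity. Galaxy schemas (also called fact constellations) fit complex analytical needs that span several business domains, with multiple fact tables sharing dimensions.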
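
As a rough illustration of the star schema case only, the sketch below joins a fact table to one denormalized dimension and aggregates. The table names, columns, and values are hypothetical, and plain Python dictionaries stand in for warehouse tables.

```python
from collections import defaultdict

# Hypothetical dimension table: one denormalized row per product.
product_dim = {
    1: {"name": "Kettle", "category": "Kitchen"},
    2: {"name": "Toaster", "category": "Kitchen"},
    3: {"name": "Lamp", "category": "Lighting"},
}

# Hypothetical fact table: each row references a dimension row by key.
sales_fact = [
    {"product_key": 1, "amount": 30.0},
    {"product_key": 2, "amount": 45.0},
    {"product_key": 3, "amount": 20.0},
    {"product_key": 1, "amount": 30.0},
]


def sales_by_category(facts, products):
    """Join each fact row to its dimension row and sum amounts per category."""
    totals = defaultdict(float)
    for row in facts:
        category = products[row["product_key"]]["category"]
        totals[category] += row["amount"]
    return dict(totals)


totals = sales_by_category(sales_fact, product_dim)
```

In a snowflake schema the category would typically live in its own table, so the same query would need a second lookup from product to category.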

Answer 2

A data mart is a subset of a data warehouse focused on a specific business line or team. It contains data relevant to a particular group, like sales, finance, or marketing, making it more manageable and specific than the broader data warehouse.

Answer 3

The main purpose of a data mart is to provide business users with access to relevant data tailored to their specific needs. It simplifies data analysis by offering a more focused view of data, which is particularly useful for department-specific reporting and analysis.

Answer 4

There are three main types of data marts: independent, dependent, and hybrid. Independent data marts are created without a data warehouse. Dependent data marts are sourced from an existing data warehouse. Hybrid data marts combine both approaches.

Answer 5

Data marts offer improved performance for specific queries, ease of use for end-users, lower cost than full-scale data warehouses, and quicker implementation. They are tailored to specific needs, providing more relevant insights.

Answer 6

A data mart is a subset of a data warehouse designed for a specific line of business or purpose. While a data warehouse contains a broad view of the company's data for all departments, a data mart focuses on specific needs, making it smaller and more focused.

Answer 7

Slowly Changing Dimensions (SCDs) are dimensions in a data warehouse whose attributes change infrequently, but whose changes must be managed to keep historical data accurate. Such changes can come from updates in business processes or from alterations in how the data is interpreted.

Answer 8

SCD Type 1 involves overwriting the old data with new data, without keeping historical data. This is suitable when it's not necessary to keep track of historical changes.
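
A minimal sketch of the Type 1 overwrite, assuming a hypothetical in-memory customer dimension keyed by customer ID (the names and values are illustrative, not from any particular system):

```python
# Hypothetical customer dimension keyed by customer_id.
customer_dim = {
    101: {"name": "Ada", "city": "Leeds"},
}


def apply_scd_type1(dim, key, changes):
    """Overwrite the changed attributes in place; the old values are lost."""
    dim[key].update(changes)
    return dim[key]


apply_scd_type1(customer_dim, 101, {"city": "York"})
```

After the call there is still exactly one row for the customer, and nothing records that the city was ever "Leeds".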

Answer 9

SCD Type 2 involves keeping multiple rows to store historical data, with new records added for changes. This method maintains the history of dimensional changes, useful for tracking trends over time.

Answer 10

SCD Type 3 keeps the current and previous value in the same row, in separate columns. This limits the history to a fixed number of changes (as many as there are previous-value columns), and the method is less commonly used.
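
A minimal sketch of Type 3, assuming a single hypothetical "previous" column next to the current one:

```python
# Hypothetical dimension row with one current and one previous-value column.
customer_row = {"customer_id": 101, "current_city": "Leeds", "previous_city": None}


def apply_scd_type3(row, new_city):
    """Shift the current value into the 'previous' column, then overwrite it."""
    if new_city != row["current_city"]:
        row["previous_city"] = row["current_city"]
        row["current_city"] = new_city
    return row


apply_scd_type3(customer_row, "York")
apply_scd_type3(customer_row, "Bath")
```

After the second change the row holds "Bath" and "York"; the original "Leeds" is gone, which is the limited history described above.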

Answer 11

SCD Type 0 refers to a dimension attribute that never changes once it has been loaded into the data warehouse. It's a passive method where historical data remains as it was at the time of its initial load, with no updates or changes allowed. This type is used for permanent or historical data that must be preserved in its original state, such as birth dates or original account numbers, which are significant for historical reporting and analysis and should not be altered.
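
One way to picture Type 0 is an update routine that skips protected attributes. The sketch below assumes a hypothetical set of Type 0 column names and chooses to ignore attempted changes silently; rejecting them with an error would be an equally reasonable design.

```python
# Hypothetical set of attributes treated as Type 0 (fixed after first load).
TYPE0_ATTRIBUTES = {"birth_date", "original_account_number"}


def apply_update(row, changes):
    """Apply changes, ignoring any that touch Type 0 attributes."""
    for column, value in changes.items():
        if column in TYPE0_ATTRIBUTES:
            continue  # Type 0: keep the originally loaded value
        row[column] = value
    return row


customer = {
    "birth_date": "1990-04-12",
    "original_account_number": "AC-001",
    "city": "Leeds",
}
apply_update(customer, {"birth_date": "1991-01-01", "city": "York"})
```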

Answer 12

A surrogate key is an artificial or synthetic key used in a database table. It's a unique identifier for each row in the table, typically assigned by the database system itself. Surrogate keys are not derived from application data, unlike natural keys which are derived from meaningful data. They are often used in data warehousing to provide a unique identifier for each record, regardless of any changes to the actual data. Surrogate keys are useful for simplifying relationships between tables and improving query performance.
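
A minimal sketch of surrogate key assignment, assuming sequential integers and a hypothetical generator class (in practice the database or load tool usually assigns these). It maps each natural key to one surrogate key, which fits a dimension that is not versioned.

```python
import itertools


class SurrogateKeyGenerator:
    """Hands out sequential integer keys, one per distinct natural key."""

    def __init__(self, start=1):
        self._counter = itertools.count(start)
        self._by_natural_key = {}

    def key_for(self, natural_key):
        # Reuse the existing surrogate key if this natural key was seen before.
        if natural_key not in self._by_natural_key:
            self._by_natural_key[natural_key] = next(self._counter)
        return self._by_natural_key[natural_key]


keys = SurrogateKeyGenerator()
first = keys.key_for("CUST-ALPHA")
second = keys.key_for("CUST-BETA")
again = keys.key_for("CUST-ALPHA")
```

The surrogate values carry no business meaning, so a later change to the natural key's format would not affect tables that join on the surrogate.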

Answer 13

SCD Type 2 is used to track and store the entire history of data changes. It involves adding new records to the dimension table with each change, preserving historical data for accurate analysis over time.

Answer 14

SCD Type 2 includes features like maintaining multiple historical records for each change, using surrogate keys for uniqueness, and possibly date/time stamps to record when changes occurred.

Answer 15

Implementing SCD Type 2 involves creating new entries in the dimension table for each change, along with effective and expiration dates for each record, and a flag to indicate the most current record.
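
The steps above can be sketched as follows. The column names, the far-future sentinel date, and the half-open validity interval (effective date inclusive, expiration date exclusive) are assumptions made for illustration.

```python
from datetime import date

HIGH_DATE = date(9999, 12, 31)  # assumed sentinel meaning "not yet expired"


def apply_scd_type2(dim_rows, customer_id, new_city, change_date):
    """Expire the current row for customer_id and append a new current row."""
    for row in dim_rows:
        if row["customer_id"] == customer_id and row["is_current"]:
            if row["city"] == new_city:
                return dim_rows  # nothing changed, keep history as is
            row["expiration_date"] = change_date
            row["is_current"] = False
    dim_rows.append({
        "surrogate_key": len(dim_rows) + 1,  # rows are never deleted, so unique
        "customer_id": customer_id,
        "city": new_city,
        "effective_date": change_date,
        "expiration_date": HIGH_DATE,
        "is_current": True,
    })
    return dim_rows


def city_as_of(dim_rows, customer_id, as_of):
    """Return the city that was in effect on the given date, or None."""
    for row in dim_rows:
        if (row["customer_id"] == customer_id
                and row["effective_date"] <= as_of < row["expiration_date"]):
            return row["city"]
    return None


customer_dim = []
apply_scd_type2(customer_dim, 101, "Leeds", date(2023, 1, 1))
apply_scd_type2(customer_dim, 101, "York", date(2024, 6, 1))
```

With both rows kept, `city_as_of` can answer point-in-time questions, while filtering on `is_current` returns only the latest version of each customer.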

Answer 16

SCD Type 2 is crucial for detailed historical analysis, allowing businesses to track changes over time and understand trends and patterns in historical data, leading to more informed decision-making.

Answer 17

Considerations include the impact on storage and query performance, the complexity of managing historical data, and the need for robust data management practices to handle the volume and complexity of the data.

(41 cards)