2. Cardinality Estimation and Histograms Flashcards
How is the reduction size calculated in database projections?
- Projection: Selecting a subset of columns from a table.
- Old Tuple Size: Total size of a tuple with all original columns.
- New Tuple Size: Total size of a tuple after projection (only selected columns).
- Reduction Factor: Measures size reduction due to projection.
- Formula: Reduction Factor = New Tuple Size / Old Tuple Size
- Example: If the original tuple size is 100 bytes and the new tuple size (after selecting half of the columns) is 50 bytes, then the Reduction Factor is 50 / 100 = 0.5. This indicates a 50% reduction in size.
Tip: The reduction factor depends on the number of columns selected and their individual sizes.
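A minimal Python sketch of this calculation (the table, column names, and byte sizes are hypothetical):

```python
# Hypothetical per-column sizes (in bytes) for a table R.
column_sizes = {"id": 4, "name": 50, "email": 30, "created_at": 16}

def projection_reduction_factor(all_columns, projected_columns):
    """Reduction factor = new tuple size / old tuple size."""
    old_size = sum(all_columns.values())
    new_size = sum(all_columns[c] for c in projected_columns)
    return new_size / old_size

# Projecting onto two of the four columns.
print(projection_reduction_factor(column_sizes, ["id", "name"]))  # 54 / 100 = 0.54
```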
What is the Selection Operation in databases and how is its reduction size calculated?
- Selection Operation: Used to filter rows in a database based on a specific condition, denoted as σ. It selects tuples that satisfy a given criterion.
- Notation: σA=c(R) selects rows from table R where attribute A equals value c.
- Reduction Size Formula:
- Formula: Reduction Factor = 1 / V(R,A); estimated result size = (1 / V(R,A)) * T(R)
- V(R,A): Column cardinality (number of distinct values) of attribute A in table R.
- T(R): Total number of tuples in R.
- Cardinality & Reduction Factor Example:
- If attribute A (e.g., ‘gender’) has 2 distinct values (‘male’ and ‘female’), its cardinality V(R,A) is 2.
- For σgender=’male’(R), Reduction Factor = 1 / 2 = 0.5 (50% of tuples).
- Uniformity Assumption: Assumes equal distribution of values in a column.
- Practical Example:
- Table R with 100 tuples, attribute with 2 distinct values.
- Expected Result: With a reduction factor of 0.5, expect to retrieve 50 tuples (50%).
Note: Reduction factor calculation assumes uniform distribution of attribute values.
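A small Python sketch of the same estimate (function name and numbers are illustrative):

```python
def selection_equality_estimate(t_r, v_r_a):
    """Estimated size of sigma_{A=c}(R) under the uniformity assumption.

    t_r   -- T(R), total number of tuples in R
    v_r_a -- V(R,A), number of distinct values of attribute A
    """
    reduction_factor = 1 / v_r_a
    return reduction_factor * t_r

# The card's example: 100 tuples, attribute with 2 distinct values.
print(selection_equality_estimate(100, 2))  # 50.0 tuples expected
```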
What are the different types of Selection Operations and their reduction factors in databases?
- Basic Selection (σA = c(R)):
- Reduction Factor: 1 / V(R,A), where V(R,A) is the cardinality of attribute A in R.
- Selection with ‘Greater Than’ Condition (σA>c(R)):
- Non-Arithmetic Types: Estimated reduction factor is 1 / 3.
- Arithmetic Types: Reduction factor = (highVal(A) - c) / (highVal(A) - lowVal(A)); multiply by T(R) for the estimated result size.
- Selection with ‘Not Equal’ Condition (σA ≠ c(R)):
- Size estimate: Typically taken as T(R) itself (reduction factor ≈ 1); a finer estimate is T(R) * (1 - 1 / V(R,A)).
- Selection with Disjunction (σC1 ∨ C2(R)):
- Size estimate: min {T(R), size(σC1(R)) + size(σC2(R))}.
- Selection with Conjunction (σC1 ∧ C2(R)):
- Size estimate: rfC1 x rfC2 x T(R), assuming independence between C1 and C2.
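The different cases can be sketched as small helper functions (names and example numbers are illustrative, not taken from a real optimizer):

```python
def rf_equality(v_r_a):
    """Reduction factor for A = c: 1 / V(R,A)."""
    return 1 / v_r_a

def rf_greater_than(c, low_val, high_val):
    """Reduction factor for A > c on an arithmetic attribute."""
    return (high_val - c) / (high_val - low_val)

def size_disjunction(t_r, size_c1, size_c2):
    """Size estimate for sigma_{C1 OR C2}(R)."""
    return min(t_r, size_c1 + size_c2)

def size_conjunction(t_r, rf_c1, rf_c2):
    """Size estimate for sigma_{C1 AND C2}(R), assuming independence."""
    return rf_c1 * rf_c2 * t_r

# Example: T(R) = 1000, age in [18, 68], predicate: age > 43 AND gender = 'male'.
rf_age = rf_greater_than(43, 18, 68)      # 0.5
rf_gender = rf_equality(2)                # 0.5
print(size_conjunction(1000, rf_age, rf_gender))  # 250.0
```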
How do you estimate the result size for the union of two tables in a database?
- Maximum Size: Sum of the tuple counts of both tables, TR + TS.
- Minimum Size: Occurs when all tuples in one table are also in the other. Calculated as max{TR, TS}.
- Estimated Size Formula:
- Estimated size is the average of the maximum and minimum sizes.
- Formula: ((TR + TS) + max{TR, TS}) / 2.
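For instance, as a one-line sketch (the example tuple counts are made up):

```python
def union_size_estimate(t_r, t_s):
    """Average of the maximum (T(R) + T(S)) and the minimum (max{T(R), T(S)})."""
    return ((t_r + t_s) + max(t_r, t_s)) / 2

print(union_size_estimate(100, 60))  # (160 + 100) / 2 = 130.0
```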
How is the result size estimated for the intersection of two tables in a database?
- Minimum Size: 0, assuming no overlap between the tuples of the two tables.
- Maximum Size: Smaller of the two tuple counts, min{TR, TS}.
- Estimated Size Formula:
- Average of minimum and maximum sizes.
- Formula: (0 + min{TR, TS}) / 2.
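A matching sketch for the intersection estimate (example counts are made up):

```python
def intersection_size_estimate(t_r, t_s):
    """Average of the minimum (0) and the maximum (min{T(R), T(S)})."""
    return (0 + min(t_r, t_s)) / 2

print(intersection_size_estimate(100, 60))  # 30.0
```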
How is the result size estimated for the difference of two tables (R−S) in a database?
- Minimum Size: TR - TS, assuming every tuple of S also appears in R (S ⊆ R).
- Maximum Size: TR, occurring if no tuples in R are in S.
- Estimated Size Formula:
- Average of the minimum and maximum sizes.
- Formula: |R - S| = (TR + (TR - TS)) / 2 = TR - (½) TS
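And for the difference (example counts are made up):

```python
def difference_size_estimate(t_r, t_s):
    """Average of the minimum (T(R) - T(S)) and the maximum (T(R)), i.e. T(R) - T(S)/2."""
    return (t_r + (t_r - t_s)) / 2

print(difference_size_estimate(100, 60))  # 100 - 30 = 70.0
```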
How is the result size estimated for a join on attribute Y between two tables? Which assumptions are made about the data?
Basic Estimation (Uniformity & Independence):
- For tables R and S joined on attribute Y (R.y = S.y).
- Formula (for a single joining value of Y): (1 / V(Y, R)) * TR * (1 / V(Y, S)) * TS
- V(Y, R) and V(Y, S): Column cardinalities of Y in R and S.
- TR and TS: Total number of tuples in R and S.
Summing Over All Values of Y:
- Extend calculation over all joinable values of Y by summing the formula.
Primary Key - Foreign Key Join (PK-FK Join):
- If Y is a PK-FK join, size is determined by distinct values of Y in the FK relation.
- Estimated Result Size: min(V(Y,R), V(Y, S)) * (1 / V(Y, R)) * TR * (1 / V(Y, S)) * TS.
- Assumes FK values in one table are a subset of PK values in the other.
Formalization for PK-FK Join:
- |R ⋈ S| = min{ TS * (TR / V(Y, R)), TR * (TS / V(Y, S)) } = (TR * TS) / max{ V(Y, R), V(Y, S) }
Assumptions:
- Independence of columns.
- Uniform distribution of values in a single column.
- Inclusion assumption for FK: all FK values present in PK column.
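A sketch of the join estimate under these assumptions (the PK-FK figures below are hypothetical):

```python
def join_size_estimate(t_r, t_s, v_y_r, v_y_s):
    """|R join S| ~= T(R) * T(S) / max{V(Y,R), V(Y,S)} under uniformity,
    independence, and the inclusion (containment) assumption."""
    return t_r * t_s / max(v_y_r, v_y_s)

# Hypothetical PK-FK join: Y is the key of R, so V(Y,R) = T(R) = 1000.
print(join_size_estimate(1000, 50000, 1000, 800))  # 50000.0, i.e. T(S)
```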
What are histograms and their key characteristics in database systems?
- Definition: Histograms are data structures used for approximating the distribution of values for an attribute in a database.
- Purpose: They help in estimating reduction factors more accurately than basic estimation techniques.
- Key Characteristics:
- Size: Typically small, designed to fit on a single disk page.
- Accuracy: Generally, histograms have an error margin of less than 5%, striking a balance between accuracy and computational efficiency.
What are Equi-Width Histograms and how can they be used for cardinality estimation?
Description: Histograms that divide the range of an attribute A into equal sub-ranges or buckets.
Implementation:
- Example: For a range 1-100, create 10 buckets each covering a range of 10 units (1-10, 11-20, etc.).
Cardinality Estimation:
- Estimate output cardinality of a range query by summing the number of tuples in relevant buckets.
- Assumes uniform distribution within each bucket.
- Example: For a query like ‘age > 35’, if the bucket 30-40 has 100 tuples, estimate 50 tuples for the range 35-40.
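A rough sketch of building an equi-width histogram and answering an ‘age > 35’ style query (the values and bucket boundaries are made up; real systems keep such histograms in the catalog):

```python
def build_equi_width(values, low, high, num_buckets):
    """Equi-width histogram: equal value ranges, varying tuple counts."""
    width = (high - low) / num_buckets
    counts = [0] * num_buckets
    for v in values:
        idx = min(int((v - low) / width), num_buckets - 1)
        counts[idx] += 1
    return width, counts

def estimate_greater_than(c, low, width, counts):
    """Estimate |sigma_{A>c}| assuming uniform distribution within each bucket."""
    total = 0.0
    for i, count in enumerate(counts):
        b_low, b_high = low + i * width, low + (i + 1) * width
        if c <= b_low:
            total += count                               # bucket entirely above c
        elif c < b_high:
            total += count * (b_high - c) / width        # partial bucket
    return total

ages = [23, 31, 34, 36, 38, 39, 41, 47, 52, 65]
width, counts = build_equi_width(ages, 0, 100, 10)       # buckets 0-10, 10-20, ...
print(estimate_greater_than(35, 0, width, counts))       # 6.5 (actual answer: 7)
```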
What is the complexity and error of Equi-Width histograms?
Complexity:
- Construction: Single pass, O(N) time, O(B) size.
- Range-query Processing: Need to scan the histogram, O(B).
- Maintenance: Determine the bucket and increment/decrement in O(1) time.
Error:
- Can be arbitrarily bad; the error cannot be bounded.
What are the limitations and potential errors in using Equi-Width Histograms in databases?
Limitations:
- Assumes uniform distribution within each bucket, which might not always hold true in real data distributions.
Potential Error:
- Error in cardinality estimation can be arbitrarily bad due to uneven distribution of values within buckets.
- Unable to limit the error margin effectively in diverse data distributions.
What are Equi-depth Histograms and how can they be used for cardinality estimation?
- Description: Histograms dividing the range of an attribute A into sub-ranges where each bucket contains approximately the same number of tuples.
- Cardinality Estimation:
- Use buckets to approximate the number of tuples within a query range.
- Difference from Equi-Width:
- Buckets are defined based on tuple count, not value range.
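A minimal sketch of equi-depth bucket construction (values and bucket count are illustrative):

```python
def build_equi_depth(values, num_buckets):
    """Split the sorted values into num_buckets chunks of roughly equal size.
    Each bucket stores (low, high, count); sorting gives the O(N log N) cost."""
    sorted_vals = sorted(values)
    n = len(sorted_vals)
    buckets = []
    for i in range(num_buckets):
        chunk = sorted_vals[i * n // num_buckets:(i + 1) * n // num_buckets]
        if chunk:
            buckets.append((chunk[0], chunk[-1], len(chunk)))
    return buckets

ages = [23, 31, 34, 36, 38, 39, 41, 47, 52, 65, 66, 70]
for low, high, count in build_equi_depth(ages, 4):
    print(f"[{low}, {high}]: {count} tuples")  # ~3 tuples per bucket
```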
What is the complexity and error of Equi-depth histograms?
Complexity:
- Construction: Requires sorting, O(N log N) time, O(B) size.
- Range-query Processing: Scan the histogram, O(B).
- Maintenance: Reconstruct each time or adjust bucket counts, merge/split as needed.
Error:
- Estimated error can be at most ±|R|/B, providing a more predictable error margin.
What are the important considerations when using histograms in database systems?
Bucketing Scheme:
- How to partition: Buckets can overlap or be recursive to better reflect data structure.
Bucket Statistics:
- What to store: Beyond bucket boundaries and counts, additional information can be included.
Intra-Bucket Approximation Scheme:
- How to estimate answers: Typically uses a continuous-value assumption for frequency counts, but other methods are also viable.
Class of Queries Answered:
- Suitable for range-count and point-count queries.
- Not applicable for general SQL; only specific query types are supported.