2. Cardinality Estimation and Histograms Flashcards
How is the reduction size calculated in database projections?
- Projection: Selecting a subset of columns from a table.
- Old Tuple Size: Total size of a tuple with all original columns.
- New Tuple Size: Total size of a tuple after projection (only selected columns).
- Reduction Factor: Measures size reduction due to projection.
- Formula: Reduction Factor = New Tuple Size / Old Tuple Size
- Example: If the original tuple size is 100 bytes and the new tuple size (after selecting half of the columns) is 50 bytes, then the Reduction Factor is 50 / 100 = 0.5. This indicates a 50% reduction in size.
Tip: The reduction factor depends on the number of columns selected and their individual sizes.
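A minimal Python sketch of this calculation (the table, column names, and byte sizes are hypothetical):

```python
# Hypothetical per-column sizes (in bytes) for a table R.
column_sizes = {"id": 4, "name": 50, "email": 30, "created_at": 16}

def projection_reduction_factor(all_columns, projected_columns):
    """Reduction factor = new tuple size / old tuple size."""
    old_size = sum(all_columns.values())
    new_size = sum(all_columns[c] for c in projected_columns)
    return new_size / old_size

# Projecting onto two of the four columns.
print(projection_reduction_factor(column_sizes, ["id", "name"]))  # 54 / 100 = 0.54
```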
What is the Selection Operation in databases and how is its reduction size calculated?
- Selection Operation: Used to filter rows in a database based on a specific condition, denoted as σ. It selects tuples that satisfy a given criterion.
- Notation: σA=c(R) selects rows from table R where attribute A equals value c.
- Reduction Size Formula:
- Formula: Reduction Factor = 1 / V(R,A); estimated result size = (1 / V(R,A)) * T(R)
- V(R,A): Column cardinality (number of distinct values) of attribute A in table R.
- T(R): Total number of tuples in R.
- Cardinality & Reduction Factor Example:
- If attribute A (e.g., ‘gender’) has 2 distinct values (‘male’ and ‘female’), its cardinality V(R,A) is 2.
- For σgender=’male’(R), Reduction Factor = 1 / 2 = 0.5 (50% of tuples).
- Uniformity Assumption: Assumes equal distribution of values in a column.
- Practical Example:
- Table R with 100 tuples, attribute with 2 distinct values.
- Expected Result: With a reduction factor of 0.5, expect to retrieve 50 tuples (50%).
Note: Reduction factor calculation assumes uniform distribution of attribute values.
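A small Python sketch of the same estimate (function name and numbers are illustrative):

```python
def selection_equality_estimate(t_r, v_r_a):
    """Estimated size of sigma_{A=c}(R) under the uniformity assumption.

    t_r   -- T(R), total number of tuples in R
    v_r_a -- V(R,A), number of distinct values of attribute A
    """
    reduction_factor = 1 / v_r_a
    return reduction_factor * t_r

# The card's example: 100 tuples, attribute with 2 distinct values.
print(selection_equality_estimate(100, 2))  # 50.0 tuples expected
```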
What are the different types of Selection Operations and their reduction factors in databases?
- Basic Selection (σA = c(R)):
- Reduction Factor: 1 / V(R,A), where V(R,A) is the cardinality of attribute A in R.
- Selection with ‘Greater Than’ Condition (σA>c(R)):
- Non-Arithmetic Types: Estimated reduction factor is 1 / 3.
- Arithmetic Types: Reduction factor = (highVal(A) - c) / (highVal(A) - lowVal(A)); multiply by T(R) for the estimated result size.
- Selection with ‘Not Equal’ Condition (σA ≠ c(R)):
- Size estimate: Typically taken as T(R) itself (reduction factor ≈ 1); a finer estimate is T(R) * (1 - 1 / V(R,A)).
- Selection with Disjunction (σC1 ∨ C2(R)):
- Size estimate: min {T(R), size(σC1(R)) + size(σC2(R))}.
- Selection with Conjunction (σC1 ∧ C2(R)):
- Size estimate: rfC1 x rfC2 x T(R), assuming independence between C1 and C2.
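The different cases can be sketched as small helper functions (names and example numbers are illustrative, not taken from a real optimizer):

```python
def rf_equality(v_r_a):
    """Reduction factor for A = c: 1 / V(R,A)."""
    return 1 / v_r_a

def rf_greater_than(c, low_val, high_val):
    """Reduction factor for A > c on an arithmetic attribute."""
    return (high_val - c) / (high_val - low_val)

def size_disjunction(t_r, size_c1, size_c2):
    """Size estimate for sigma_{C1 OR C2}(R)."""
    return min(t_r, size_c1 + size_c2)

def size_conjunction(t_r, rf_c1, rf_c2):
    """Size estimate for sigma_{C1 AND C2}(R), assuming independence."""
    return rf_c1 * rf_c2 * t_r

# Example: T(R) = 1000, age in [18, 68], predicate: age > 43 AND gender = 'male'.
rf_age = rf_greater_than(43, 18, 68)      # 0.5
rf_gender = rf_equality(2)                # 0.5
print(size_conjunction(1000, rf_age, rf_gender))  # 250.0
```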
How do you estimate the result size for the union of two tables in a database?
- Maximum Size: Sum of the tuple counts of both tables, TR + TS.
- Minimum Size: Occurs when all tuples in one table are also in the other. Calculated as max{TR, TS}.
- Estimated Size Formula:
- Estimated size is the average of the maximum and minimum sizes.
- Formula: ((TR + TS) + max{TR, TS}) / 2.
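For instance, as a one-line sketch (the example tuple counts are made up):

```python
def union_size_estimate(t_r, t_s):
    """Average of the maximum (T(R) + T(S)) and the minimum (max{T(R), T(S)})."""
    return ((t_r + t_s) + max(t_r, t_s)) / 2

print(union_size_estimate(100, 60))  # (160 + 100) / 2 = 130.0
```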
How is the result size estimated for the intersection of two tables in a database?
- Minimum Size: 0, assuming no overlap between the tuples of the two tables.
- Maximum Size: Smaller of the two tuple counts, min{TR, TS}.
- Estimated Size Formula:
- Average of minimum and maximum sizes.
- Formula: (0 + min{TR, TS}) / 2.
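A matching sketch for the intersection estimate (example counts are made up):

```python
def intersection_size_estimate(t_r, t_s):
    """Average of the minimum (0) and the maximum (min{T(R), T(S)})."""
    return (0 + min(t_r, t_s)) / 2

print(intersection_size_estimate(100, 60))  # 30.0
```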
How is the result size estimated for the difference of two tables (R−S) in a database?
- Minimum Size: TR - TS, assuming every tuple of S also appears in R (S ⊆ R).
- Maximum Size: TR, occurring if no tuples in R are in S.
- Estimated Size Formula:
- Average of the minimum and maximum sizes.
- Formula: |R - S| = (TR + (TR - TS)) / 2 = TR - (½) TS
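And for the difference (example counts are made up):

```python
def difference_size_estimate(t_r, t_s):
    """Average of the minimum (T(R) - T(S)) and the maximum (T(R)), i.e. T(R) - T(S)/2."""
    return (t_r + (t_r - t_s)) / 2

print(difference_size_estimate(100, 60))  # 100 - 30 = 70.0
```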
How is the result size estimated for a join on attribute Y between two tables? Which assumptions are made about the data?
Basic Estimation (Uniformity & Independence):
- For tables R and S joined on attribute Y (R.y = S.y).
- Formula (for a single joining value of Y): (1 / V(Y, R)) * TR * (1 / V(Y, S)) * TS
- V(Y, R) and V(Y, S): Column cardinalities of Y in R and S.
- TR and TS: Total number of tuples in R and S.
Summing Over All Values of Y:
- Extend calculation over all joinable values of Y by summing the formula.
Primary Key - Foreign Key Join (PK-FK Join):
- If Y is a PK-FK join, size is determined by distinct values of Y in the FK relation.
- Estimated Result Size: min(V(Y,R), V(Y, S)) * (1 / V(Y, R)) * TR * (1 / V(Y, S)) * TS.
- Assumes FK values in one table are a subset of PK values in the other.
Formalization for PK-FK Join:
- |R ⋈ S| = min{ TS * (TR / V(Y, R)), TR * (TS / V(Y, S)) } = (TR * TS) / max{ V(Y, R), V(Y, S) }
Assumptions:
- Independence of columns.
- Uniform distribution of values in a single column.
- Inclusion assumption for FK: all FK values present in PK column.
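A sketch of the join estimate under these assumptions (the PK-FK figures below are hypothetical):

```python
def join_size_estimate(t_r, t_s, v_y_r, v_y_s):
    """|R join S| ~= T(R) * T(S) / max{V(Y,R), V(Y,S)} under uniformity,
    independence, and the inclusion (containment) assumption."""
    return t_r * t_s / max(v_y_r, v_y_s)

# Hypothetical PK-FK join: Y is the key of R, so V(Y,R) = T(R) = 1000.
print(join_size_estimate(1000, 50000, 1000, 800))  # 50000.0, i.e. T(S)
```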
What are histograms and their key characteristics in database systems?
- Definition: Histograms are data structures used for approximating the distribution of values for an attribute in a database.
- Purpose: They help in estimating reduction factors more accurately than basic estimation techniques.
- Key Characteristics:
- Size: Typically small, designed to fit on a single disk page.
- Accuracy: Generally, histograms have an error margin of less than 5%, striking a balance between accuracy and computational efficiency.
What are Equi-Width Histograms and how can they be used for cardinality estimation?
Description: Histograms that divide the range of an attribute A into equal sub-ranges or buckets.
Implementation:
- Example: For a range 1-100, create 10 buckets each covering a range of 10 units (1-10, 11-20, etc.).
Cardinality Estimation:
- Estimate output cardinality of a range query by summing the number of tuples in relevant buckets.
- Assumes uniform distribution within each bucket.
- Example: For a query like ‘age > 35’, if the bucket 30-40 has 100 tuples, estimate 50 tuples for the range 35-40.
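A rough sketch of building an equi-width histogram and answering an ‘age > 35’ style query (the values and bucket boundaries are made up; real systems keep such histograms in the catalog):

```python
def build_equi_width(values, low, high, num_buckets):
    """Equi-width histogram: equal value ranges, varying tuple counts."""
    width = (high - low) / num_buckets
    counts = [0] * num_buckets
    for v in values:
        idx = min(int((v - low) / width), num_buckets - 1)
        counts[idx] += 1
    return width, counts

def estimate_greater_than(c, low, width, counts):
    """Estimate |sigma_{A>c}| assuming uniform distribution within each bucket."""
    total = 0.0
    for i, count in enumerate(counts):
        b_low, b_high = low + i * width, low + (i + 1) * width
        if c <= b_low:
            total += count                               # bucket entirely above c
        elif c < b_high:
            total += count * (b_high - c) / width        # partial bucket
    return total

ages = [23, 31, 34, 36, 38, 39, 41, 47, 52, 65]
width, counts = build_equi_width(ages, 0, 100, 10)       # buckets 0-10, 10-20, ...
print(estimate_greater_than(35, 0, width, counts))       # 6.5 (actual answer: 7)
```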
What is the complexity and error of Equi-Width histograms?
Complexity:
- Construction: Single pass, O(N) time, O(B) size.
- Range-query Processing: Need to scan the histogram, O(B).
- Maintenance: Determine the bucket and increment/decrement in O(1) time.
Error:
- Can be arbitrarily bad; the error cannot be bounded.
What are the limitations and potential errors in using Equi-Width Histograms in databases?
Limitations:
- Assumes uniform distribution within each bucket, which might not always hold true in real data distributions.
Potential Error:
- Error in cardinality estimation can be arbitrarily bad due to uneven distribution of values within buckets.
- Unable to limit the error margin effectively in diverse data distributions.
What are Equi-depth Histograms and how can they be used for cardinality estimation?
- Description: Histograms dividing the range of an attribute A into sub-ranges where each bucket contains approximately the same number of tuples.
- Cardinality Estimation:
- Use buckets to approximate the number of tuples within a query range.
- Difference from Equi-Width:
- Buckets are defined based on tuple count, not value range.
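A minimal sketch of equi-depth bucket construction (values and bucket count are illustrative):

```python
def build_equi_depth(values, num_buckets):
    """Split the sorted values into num_buckets chunks of roughly equal size.
    Each bucket stores (low, high, count); sorting gives the O(N log N) cost."""
    sorted_vals = sorted(values)
    n = len(sorted_vals)
    buckets = []
    for i in range(num_buckets):
        chunk = sorted_vals[i * n // num_buckets:(i + 1) * n // num_buckets]
        if chunk:
            buckets.append((chunk[0], chunk[-1], len(chunk)))
    return buckets

ages = [23, 31, 34, 36, 38, 39, 41, 47, 52, 65, 66, 70]
for low, high, count in build_equi_depth(ages, 4):
    print(f"[{low}, {high}]: {count} tuples")  # ~3 tuples per bucket
```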
What is the complexity and error of Equi-depth histograms?
Complexity:
- Construction: Requires sorting, O(N log N) time, O(B) size.
- Range-query Processing: Scan the histogram, O(B).
- Maintenance: Reconstruct each time or adjust bucket counts, merge/split as needed.
Error:
- Estimated error can be at most ±|R|/B, providing a more predictable error margin.
What are the important considerations when using histograms in database systems?
Bucketing Scheme:
- How to partition: Buckets can overlap or be recursive to better reflect data structure.
Bucket Statistics:
- What to store: Beyond bucket boundaries and counts, additional information can be included.
Intra-Bucket Approximation Scheme:
- How to estimate answers: Typically uses a continuous-value assumption for frequency counts, but other methods are also viable.
Class of Queries Answered:
- Suitable for range-count and point-count queries.
- Not applicable for general SQL; only specific query types are supported.