2. Cardinality Estimation and Histograms Flashcards

1
Q

How is the reduction size calculated in database projections?

A
  • Projection: Selecting a subset of columns from a table.
  • Old Tuple Size: Total size of a tuple with all original columns.
  • New Tuple Size: Total size of a tuple after projection (only selected columns).
  • Reduction Factor: Measures size reduction due to projection.
    • Formula: Reduction Factor = New Tuple Size / Old Tuple Size
  • Example: If the original tuple size is 100 bytes and the new tuple size (after selecting half of the columns) is 50 bytes, then the Reduction Factor is 50 / 100 = 0.5. This indicates a 50% reduction in size.

Tip: The reduction factor depends on the number of columns selected and their individual sizes.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

What is the Selection Operation in databases and how is its reduction size calculated?

A
  • Selection Operation: Used to filter rows in a database based on a specific condition, denoted as σ. It selects tuples that satisfy a given criterion.
  • Notation: σA=c(R) selects rows from table R where attribute A equals value c.
  • Reduction Size Formula:
    • Formula: Reduction Factor = 1 / V(R,A) * T(R)
    • V(R,A): Column cardinality (number of distinct values) of attribute A in table R.
    • T(R): Total number of tuples in R.
  • Cardinality & Reduction Factor Example:
    • If attribute A (e.g., ‘gender’) has 2 distinct values (‘male’ and ‘female’), its cardinality V(R,A) is 2.
    • For σgender=’male’(R), Reduction Factor = 1 / 2 = 0.5 (50% of tuples).
  • Uniformity Assumption: Assumes equal distribution of values in a column.
  • Practical Example:
    • Table R with 100 tuples, attribute with 2 distinct values.
    • Expected Result: With a reduction factor of 0.5, expect to retrieve 50 tuples (50%).

Note: Reduction factor calculation assumes uniform distribution of attribute values.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

What are the different types of Selection Operations and their reduction factors in databases?

A
  • Basic Selection (σA = c(R)):
    • Reduction Factor: 1 / V(R,A), where V(R,A) is the cardinality of attribute A in R.
  • Selection with ‘Greater Than’ Condition (σA>c(R)):
    • Non-Arithmetic Types: Estimated reduction factor is 1 / 3.
    • Arithmetic Types: Calculated as ((highVal(A) - c) / (highVal(A) - lowVal(A))) * TR).
  • Selection with ‘Not Equal’ Condition (σA ≠ c(R)):
    • Reduction Factor: Typically taken as T(R), the total number of tuples in R.
  • Selection with Disjunction (σC1 ∨ C2(R)):
    • Size estimate: min {T(R), size(σC1(R)) + size(σC2(R))}.
  • Selection with Conjunction (σC1 ∧ C2(R)):
    • Size estimate: rfC1 x rfC2 x T(R), assuming independence between C1 and C2.
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

How do you estimate the result size for the union of two tables in a database?

A
  • Maximum Size: Sum of the tuple counts of both tables, TR + TS.
  • Minimum Size: Occurs when all tuples in one table are also in the other. Calculated as max{TR, TS}.
  • Estimated Size Formula:
    • Estimated size is the average of the maximum and minimum sizes.
    • Formula: ((TR + TS) + max{TR, TS}) / 2.
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

How is the result size estimated for the intersection of two tables in a database?

A
  • Minimum Size: 0, assuming no overlap between the tuples of the two tables.
  • Maximum Size: Smaller of the two tuple counts, min{TR, TS}.
  • Estimated Size Formula:
    • Average of minimum and maximum sizes.
    • Formula: (0 + min{TR, TS}) / 2.
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

How is the result size estimated for the difference of two tables (R−S) in a database?

A
  • Minimum Size: TR - TS, assuming S completely subsumes R.
  • Maximum Size: TR, occurring if no tuples in R are in S.
  • Estimated Size Formula:
    • Average of the minimum and maximum sizes.
    • Formula: |R - S| = (TR + (TR - TS)) / 2
      = TR - (½) TS
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

How is the result size estimated for a join on attribute Y between two tables? Which assumptions are made about the data?

A

Basic Estimation (Uniformity & Independence):
- For tables R and S joined on attribute Y (R.y = S.y).
- Formula: = (1 / V(Y, R)) * TR * (1 / V(Y, S)) * TS
- V(Y, R) and V(Y, S): Column cardinalities of Y in R and S.
- TR and TS: Total number of tuples in R and S.

Summing Over All Values of Y:
- Extend calculation over all joinable values of Y by summing the formula.

Primary Key - Foreign Key Join (PK-FK Join):
- If Y is a PK-FK join, size is determined by distinct values of Y in the FK relation.
- Estimated Result Size: min(V(Y,R), V(Y, S)) * (1 / V(Y, R)) * TR * (1 / V(Y, S)) * TS.
- Assumes FK values in one table are a subset of PK values in the other.

Formalization for PK-FK Join:
- |R ⋈ S| = min{ TR (1 / V(Y, R)), TS (1 / V(Y, S)) }

Assumptions:
- Independence of columns.
- Uniform distribution of values in a single column.
- Inclusion assumption for FK: all FK values present in PK column.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

What are histograms and their key characteristics in database systems?

A
  • Definition: Histograms are data structures used for approximating the distribution of values for an attribute in a database.
  • Purpose: They help in estimating reduction factors more accurately than basic estimation techniques.
  • Key Characteristics:
    • Size: Typically small, designed to fit on a single disk page.
    • Accuracy: Generally, histograms have an error margin of less than 5%, striking a balance between accuracy and computational efficiency.
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

What are Equi-Width Histograms and how can they be used for cardinality estimation?

A

Description: Histograms that divide the range of an attribute A into equal sub-ranges or buckets.

Implementation:
- Example: For a range 1-100, create 10 buckets each covering a range of 10 units (1-10, 11-20, etc.).

Cardinality Estimation:
- Estimate output cardinality of a range query by summing the number of tuples in relevant buckets.
- Assumes uniform distribution within each bucket.
- Example: For a query like ‘age > 35’, if the bucket 30-40 has 100 tuples, estimate 50 tuples for the range 35-40.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

What is the complexity and error of Equi-withh histograms?

A

Complexity:
- Construction: Single pass, O(N) time, O(B) size.
- Range-query Processing: Need to scan the histogram (B).
- Maintenance: Determine the bucket and increment/decrement in O(1) time.

Error:
- Error: Can be arbitrarily bad; the error cannot be limited.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

What are the limitations and potential errors in using Equi-Width Histograms in databases?

A

Limitations:
- Assumes uniform distribution within each bucket, which might not always hold true in real data distributions.

Potential Error:
- Error in cardinality estimation can be arbitrarily bad due to uneven distribution of values within buckets.
- Unable to limit the error margin effectively in diverse data distributions.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

What are Equi-depth Histograms and how can they be used for cardinality estimation?

A

Title: Understanding Equi-Depth Histograms in Databases

Front:
What are Equi-Depth Histograms and how do they function in databases?

Back:
- Description: Histograms dividing the range of an attribute A into sub-ranges where each bucket contains approximately the same number of tuples.
- Cardinality Estimation:
- Use buckets to approximate the number of tuples within a query range.
- Difference from Equi-Width:
- Buckets are defined based on tuple count, not value range.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

What is the complexity and error of Equi-depth histograms?

A

Complexity:
- Construction: Requires sorting, O(N log N) time, O(B) size.
- Range-query Processing: Scan the histogram, O(B).
- Maintenance: Reconstruct each time or adjust bucket counts, merge/split as needed.

Error:
- Estimated error can be at most ±|R|/B, providing a more predictable error margin.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

What are the important considerations when using histograms in database systems?

A

Bucketing Scheme:
- How to partition: Buckets can overlap or be recursive to better reflect data structure.

Bucket Statistics:
- What to store: Beyond bucket boundaries and counts, additional information can be included.

Intra-Bucket Approximation Scheme:
- How to estimate answers: Typically uses a continuous-value assumption for frequency counts, but other methods are also viable.

Class of Queries Answered:
- Suitable for range-count and point-count queries.
- Not applicable for general SQL; only specific query types are supported.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly