9. Query Optimization Flashcards
The main goal of query optimization
finding the query plan that minimizes the number of I/Os it takes to execute the query.
What is a query plan? Which language is used to express it?
A query plan is just a sequence of operations that will get us the correct result for a query. We use relational algebra to express it.
Operator’s selectivity - def.
Why do we need to know it?
The selectivity of an operator is an approximation for what percentage of pages will make it through the operator onto the operator above it.
We need to count it to estimate a query plan’s cost.
Selectivity formulas for the following selection conditions:
X=a, X=Y, X>a, cond1 AND cond2
capital letters are for columns and lowercase letters represent constants
- X=a: 1/(unique vals in X)
- X=Y: 1/max(unique vals in X, unique vals in Y)
- X>a: (max(X) - a) / (max(X) - min(X) + 1)
- cond1 AND cond2: Selectivity(cond1) * Selectivity(cond2)
Selectivity and the total number of tuples of equijoin
(tables A and B on the condition A.id = B.id)
Total number: |A|∗|B| / max(unique vals f or A.id, unique vals f or B.id)
Selectivity: 1 / max(unique vals f or A.id, unique vals f or B.id)
3 main heuristics to prune non-optimal query plans
- Push down projects (π) and selects (σ) as far as they can go
- Only consider left deep plans
- Do not consider cross joins unless they are the only option
Why query optimizers commonly consider only left deep plans?
these plans can be fully pipelined, meaning that we can pass the pages up one at a time to the next join operator – we don’t actually have to write the result of a join to disk.
What is a System R?
The System R Optimizer is the cost-based query optimizer. It pioneered several optimization techniques, including using dynamic programming for bottom-up join tree construction, and the concept of interesting orderings for exploiting ordering in intermediate results.
Pass 1 of System R -
goal, result
The first pass of System R determines how to access separate tables optimally or interestingly.
Optimally = min number of I/Os
Interestingly = produce output in an interesting (useful for further computations) order
Result: For each table produce a list of access patterns that are either optimal or interesting and prune out others
two options for how to access tables during the pass 1 of system R
- Full Scan
- Index Scan (for every index the table has built on it)
For both of these scans, we only return a row if it matches all of the single table conditions pertaining to its table because of the push-down selects heuristic.
Which orders are considered interesting in analyzing table access plans in System R pass 1?
An interesting order is when the table is sorted on a column that is either:
- Used in an ORDER BY
- Used in a GROUP BY
- Used in a downstream join (a join that hasn’t yet been evaluated. For pass 1, this is all joins).
Can a full table scan produce an interesting order for System R pass 1?
A full scan will never produce an interesting order because its output is not sorted.
Let’s say we are evaluating the following query:
SELECT * FROM players
INNER JOIN teams ON players.teamid = teams.id
ORDER BY fname;
And we have the following potential access patterns:
- Full Scan players (100 I/Os)
- Index Scan players.age (90 I/Os)
- Index Scan players.teamid (120 I/Os)
- Full Scan teams (300 I/Os)
- Index Scan teams.record (400 I/Os)
Which access patterns will be pruned during System R pass 1 and which will move on?
Patterns 2, 3, and 4 will move on. Patterns 2 and 4 are the optimal pattern for their respective table, and pattern 3 has an interesting order because teamid is used in a downstream join.
System R passes 2..n -
the goal, which data do they use as input?
The rest of the passes of the System R algorithm are concerned with joining the tables together. For each pass i, we attempt to join i tables together, using the results from pass i-1 and pass 1.
For example, on pass 2 we will attempt to join two tables together, each from pass 1.
On pass 5, we will attempt to join a total of 5 tables together. We will get 4 of those tables from pass 4 (which figured out how to join 4 tables together), and we will get the remaining table from pass 1.
What is produced by R system pass i?
Pass i will produce at least one query plan for all sets of tables of length i that can be joined without a cross join (assuming there is at least one such set). Just like in pass 1, it will advance the optimal plan for each set, and also the optimal plan for each interesting order for each set (if one exists).