DBMS - Query Optimization and Query Processing Flashcards

1
Q

Query Language

A

Dedicated, High level language for accessing and updating the contents of a database

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

Query Models

A

Closely associated to their underlying data model

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

SELECT Statement based on what model

A

Relational model

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

Query Languages are either

A

Declarative or Procedural (RA)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

If a query is specified in text

A

DBMS transforms it to a number of graphs

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

Why are declarative queries good

A
  • Design the meaning of a query rather than implementation details
  • Shift job of finding a better implementation to runtime
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

Disadvantage of Declarative queries

A

DBMS needs to dedicate more resources to compile

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

Declarative constructs must be converted to

A

Procedural constructs using RA

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

Are there any declarative queries that cannot be converted to procedural?

A

SELECT and GROUP BY for example

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

Canonical Translation from query to RA

A
  1. Run a sequence of products through all tables listed in the FROM clause
  2. Sift result with the WHERE clause conjuncts
  3. Project the output
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

Can the canonical translation be proven correctly

A

Yes and easily

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

Access Plan

A

Also referred to instead of the canonical translation and any RA expression in Query Processing

Access Plans are procedural and follow a data flow mechanism

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

Query Optimizer’s Role

A

To manipulate access plans into better plans. Better is not neccesarily the best since QO is np hard

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

What is the output of a query optimizer called

A

Query Execution Plan

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

DBMS components of QO and QP

A

Query Language interface, buffer manager (RAM buffer), Storage Manager, File Structures.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

QO Environment of Interest

A
  • One storage manager
  • If multicore, parallelism is provided by other layers
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
17
Q

Compilation of a non procedural query have some added procedures, ex:

A
  • Typical scanning, parsing, type checking
  • QP and QO converts query into detailed representation (query graph) and choosing execution plan that safely and efficiently implements data access.
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
18
Q

QO Basic Plan

A

In Non-Trivial queries, there exists a number of possible solutions. For general queries it is an np problem. Heuristics and statistics may be used.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
19
Q

When a solution requires extensive work

A

base a quick solution on heuristics to offer an approximate solution that effectively has limited optimality and completeness.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
20
Q

Search space is

A

all the possible permutations of a query. Huge for most general queries.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
21
Q

Algebraic select can be implemented with

A

serial scans, B Tree, hash, index…

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
22
Q

How to decide which options to prune in search space

A

Use statistics

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
23
Q

Why do we use heuristics and statistics

A

To evaluate potential of access plans to be selected as the optimal execution plan. QO must search and generate access plans, estimate their cost of parts and whole plan.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
24
Q

No optimizer really optimizes

A

SQL QO can give good query execution plans, but not guaranteed to be the best.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
25
Output of QO
passed to query processor to execute
26
Normalization
Assume the input query is restricted. - Does Lexical and syntax analysis, grammar and type checks - If good query is converted to normal form (PRENEX) - Query can be transformed to CNF or DNF
27
CNF
Prefers AND
28
DNF
Prefers ORs
29
PRENEX
A query from where all quantifiers precede a quantifier free qualification.
30
Simplification
One can use different queries that will return the same result quicker for simplification. Use boolean rules.
31
Semantic Analysis
Refute incorrect queries (ex. contradictions)
32
An example of wrong semantics
Forgetting to use JOIN in a multi table query
33
How to generate canonical relational algebra expression
Each algebra construct shown by a relational algebraic tree. Data flow directed from leaves to root
34
RAT
A relational algebra tree describes a query where a leaf is a base relation, an internal node represents an intermediate relation obtained by applying a relational operation, and root represents result of query.
35
RAT restructuring
Can manipulate RATs to get an easier structure to compute an efficient access plan for. Uses algebraic restructuring rules. Although the expression is semantically identical, the cost may be different.
36
Algebraic restructuring rules
Based on the algebraic properties of the operands; commutativity + associativity.
37
Two common meta transformations on RAT restructuring
Pushing Down / Pushing Up the tree
38
QO aided with costs
Choose processing strategy that aligns with given cost function.
39
Cost function
Uses details of physical DB and statistics about database. - Number of basic DB ops are directly supported - A query execution strategy dictates access to base relations, choice of algorithms for basic ops and ordering of ops.
40
Process Trees
Can help model execution strategies. Tree based modeling representations with base tables as leaves and basic operation holding the internal nodes. PT helps us quantify cost of intermediate relations from basic ops and potentially exclude some.
41
How many graphs are created for each query in a Processing Tree
2-3 graphs
42
What is a variant of a process tree
JOIN processing tree.
43
Selectivity
Defined as the ratio of number of records that satisfy the condition to total number of records in the file relation (between 0 and 1)
44
Estimates of selectivity
Kept in DBMS catalog and used by optimizer FOR PK: S1 - 1/[table tuples] FOR Secondary Key: s1 = 1/i
45
Data distribution if non-PK values
- Uniform is ideal - If not, can have serious repercussions - In last decade, DBMS keep statistics through histogram of attribute data's dist.
46
Selectivity cond1 and cond2
s1(c1) * s1(c2)
47
Cost Estimation
Traverse PT and sum each operation's CPU and IO cost. Cost of intermediate results based on statistical info regarding the physical relations and formulas that predict the cardinality of the result of each operation type
48
Basic Statistics in cost estimation
- Domain cardinality (each attribute) - Num. of distinct values present for each attribute - The min and max values of each numerical attribute - Cardinality of each relation
49
Selectivity factor
Proportion of tuples in a DB that satisfy a given condition. Difficult to predict. Most optimal plans are insensitive to the inaccuracy of the join selectiveness.
50
System R approach
Exhaustive search approach performs static optimisation based on statistical info.
51
Algorithm Factors of System R
- Costs CPU, IO ops and existing paths - Restricts push selection down heuristic - Considers the "interesting ordering" of the result tuple
52
System R; Choosing best candidate
- Ignore PT with cartesian product - Select PT that have the cheapest join - Select and join at worst are lumped together
53
Cost of Formula (system R)
IOs + Write*instruction
54
Algorithm of System R
- Cost effective - Complex
55
JOINS
associative and commutative. Prefer left deep processing trees.
56
Sorting is misused
QO might choose an ordered data placement processing tree knowing that eventual operations will benefit.
57
Operators Implementation (Relational)
Not 1-1 but useful and more common combinations of RA operands.
58
Do access method implementations exist for constructs with no algebraic equivalent
Yes ex. aggregate queries
59
Costing Operators
- File scans + Index Scans (Block Based) Complexity defined wrt. relational cardinality. Ignore indexing, clustering and merging for now. Basic Binary and Unary algebraic operands.
60
Serial File Scan of Table
Read each file block in main memory and extract every tuple. Check each tuple against comparison condition.
61
Binary search over sequential file
use the binary chop if searching attribute matches the binary tree ordering attribute.
62
Primary Key Index search over serial file
Index to pinpoint a single record if searching attribute matches primary index ordering attribute. Sequential scan for all valid records.
63
Cluster based retrieval of a number of records
If search attribute matches the clustering key, then retrieve all blocks with same value
64
What do we do to select conjunctive queries
If an indexed approach exists for each conjunct then go for it and compare all conjuncts results. Otherwise we need to use a full table scan.
65
Nested (Inner-Outer) Loop Join
For each tuple in 1st table, serially access each tuple in 2nd table and form joined tuple if they both match their common attribute values. COST: T1 + (T1*T2) block IO ops
66
Sort-Merge Join
First sort each table on their respective matching attributes, then merge the two sorted tables by matching the values from both tables. COST: (2*T1*logT1)+(2*T2*logT2)+T1+T2
67
Hash Join
First phase entails hashing the first and second tables on the joining attribute. When inserting second table keys, check for key matching with first table entries
68
Semi-Join
Replaces the join of two relations by the join of their semi-joins, same complexity as selections
69
Join with an Index
Indexes on the join attributes of the two relations.
70
Projects
- If duplication elimination not needed, then projection is straightforward and expanse is proportional to the cardinality of a relation - If duplication elim. is needed then weed out duplicates with sorting or hashing algorithms
71
Set Operands
Like, Cartesian Product, Union, Set Diff, Intersection. Expensive to implement as extensive sorting/hashing is needed
72
External Sorts
Heavily used in QP. Occurs in GROUP BY, ORDER BY, JOIN...
73
Popular External Sort
Sort-Merge algorithm, depends on RAM buffers which hold an unpacked sort block
74
Costs of external sorting a file of blocks with buffer space nB is
- Sorting phase and sort internally - nR=roundup(b/nB) - Merging phases goes over the sorted runs and merged in a number of passes. dM is the number of sorted files that can be merged in one pass - Total costs (2*b)+(2*b*(logdM(nR))
75
Aggregate
If an aggregate is needed on an attribute that is indexed then if index is ordered and dense we can pick end points for min+max.
76
Group By
We can use both sorting and hashing
77
Observations
- First RDBMS had big QP efficiency problems - QP goal is to reduce time taken for main memory computation and IO ops for non-procedural queries. - Good, up to date DB stats help QO a lot - QO processes involving general queries with joins, unions and aggregate functions are much harder to tackle
78
Troublesome queries, How to analyze statement's execution
- Are table's indexes being used where they should be? If not then WHERE clause may be wrong - Are indexes being used when they should not be? Use FULL table scan hint. - What is cost of SQL Statement? See the value given in the position column of the first row of the explained SQL construct in the "explain plan" - Amount of overhead from SELECT or UPDATE. - Is statement using parallel architecture.