WEEK 8 Flashcards
Scenario S1
Access Routine S1
Scenario S2
Access Routine S2
Scenario S3
Access Routine S3
- Can use hash search
- Since we are doing equality search
- On a key field
a) Hash the search value to find pointer
to the bucket
b) May involve search of overflow
buckets
Scenario S4
- In our sample data, all values for age are unique, but in a realistic data set we would not expect this to be so
- In this case we may be retrieving multiple records
- S4 Primary index, may be multiple records if condition involves <, <=, =>, > on the search field then we find record satisfying the equality condition (age = 36) and then retrieve all subsequent records in ordered file
Scenario S5
Access Routine S5
* Clustering index search
* One index entry for each distinct value of clustering field
* Binary search of index file to find first block of data containing matching search record
Scenario S6
- Secondary index (B+ tree) equality and others;
- multiple records may match
- Operates in much the same way as S4
- Use index to find required record(s) in main file
- This time traverse tree rather than search
ordered file
- This time traverse tree rather than search
- Retrieval behaviour depends on operator
- < : Find first occurrence of search value, retrieve all preceding
records - > : Find last occurrence of value, retrieve all subsequent records
- <= : Find last occurrence of value, retrieve it and all preceding
records - > = : Find first occurrence of value, retrieve it and all subsequent
records
S6 example - consider leaf nodes
- Remember a B+ tree usually has pointer from each leaf block to next leaf block
- Once first matching record located, can traverse as a linked list to obtain other records for >= comparison
- If nodes have back pointer, can traverse as doubly linked-list for <= comparison
What is the Conjunctive Selection Conditions
- Boolean expressions involving AND operator
- General form
- (simple condition) AND (simple condition) AND…
- S7: If we can use one of methods S2-S6 for an attribute in one of the simple conditions; do so; examine the records thus retrieved to see if they meet other conditions
-S8: Composite index: if two or more attributes involved in equality conditions also appear in a composite index, use the index directly
Scenario S9
Access Routine S9
- Intersection of record pointers: If secondary indices available on all fields involved in equality conditions and indices include record pointers (i.e. inverted files): retrieve set of record pointers from each index involved and take intersection as final result set
Optimisation
- Query optimisation for select needed for conjunctive conditions when different simple conditions offer an access path
- Access path that results in the fewest record retrieval operations should be chosen
Disjunctive conditions (OR)
- Little can be done to optimise
- If any one of the simple conditions involve an attribute with no access path, we must use serial access (brute force)
Join Operations
- One of the most time consuming operations in query processing
- M*N operation
a. Cartesian product followed by a select
Methods for implementing Joins
J1: Nested-loop join
- Default (brute force) algorithm
- Used when no access paths available on
either file in the join
- For each record t in R (outer loop)
- Retrieve every record s from S (inner loop)
- Test whether the two records satisfy the join condition
- t[A] = s[B]
Explain J2: Single loop join
Using an access structure to retrieve the matching records
a) If index (or hash key) exists for one of the two join attributes e.g. B or file S
- Retrieve each record of t in R (loop over file R)
- Use access structure for B to retrieve all matching records from S that satisfy s[B] = t[A]
Explain J3: Sort-merge join
- Possible if records of R and S are physically ordered by the value of the join attribute A, B
- Can implement join in most efficient way possible
- Both files are scanned concurrently in order of the join attributes
a. Match up records that have same values
for A and B - Pairs of file blocks copied into memory buffers in order and records of each file only scanned once
Explain J4: Hash-join
- Records (or record pointers) of R and S are hashed to same hash file using same hash function on attributes R[A] AND S[B] as hash keys
- Single pass through each file fills the hash buckets
- Each bucket then examined from records R and S that match to produce the join result
PROJECT Operations
If <attribute> include a key of R, then result has same number of tuples as R but only the attributes given in <attribute></attribute></attribute>
- If not, we must detect duplicate tuples and eliminate them
- Two possible elimination techniques are:
- Sort the result and then eliminated duplicates (which will now be duplicate tuples)
- Hash each projected tuple,check against those already in the bucket and discard if a double
- Remember in SQL, we use the DISTINCT keyword to force the elimination of project duplicates
Composite index
- Composite (or concatenated) index is one based on more than one attribute. It can be clustering or otherwise
- E.g. Person (NI_No, Surname, Forename, DOB…)
- We can specify a composite index on surname, forename
- Advantages over single attribute indexes:
a. Consider the query “how many people are there with last name ‘Smith’ and first name ‘John’?
b. A query on all attributes of a compound query will likely return far fewer records than a query on only some of those attributes.
Query processing
- What happens when we execute a query on a database?
- Like all languages, high level statements must be interpreted (or compiled) prior to execution
- Runtime database processor performs sequence of operations before returning query result
Query processing
Query process operations
Scanning, parsing and validation
- Scanner
- Identifies language tokens such as SQL keywords, attribute names, relations etc in text of query
- Parser
- Checks query syntax to determine if it is formulated according to the grammar rules of
the query language - Validation
- Check all attribute and relation names are valid and semantically meaningful names in
the database schema - Internal representation of query
- Usually a query tree, may be a query graph
Query processing
Query Optimiser, Code Generator, Runtime database processor
- Query optimiser
- Responsible for devising an execution plan for retrieving the result of the query from the DB
- Query can have several possible execution plans, optimiser chooses a suitable one
- Code generator
- Generates the code to execute the plan produced by the optimiser
- Runtime database processor
- Runs query code to produce the query result
- Code may be executed in compiled or interpreted mode
- Error message generated if runtime error occurs
Query optimisation
Process of selecting a suitable execution plan for the query, Three Main Techniques
- Process of selecting a suitable execution plan for the query
- May or may not be ‘optimal’ plan
- Seeks to provide a reasonably efficient strategy
- Finding optimal strategy is usually too costly except for simplest of queries
- Three main techniques
- Cost based
- Semantics based
- Heuristic based
- May be used in conjunction with each other
Cost based optimisation
- Search solution space to find the solution that minimises cost
- Compare costs of executing a query using different execution strategies and choose the lowest cost estimate
- Cost functions are estimates, so a solution that is not the optimal may be chosen
Cost based optimisation
* What is cost and how do we calculate it?
Cost: The amount of time to complete an operation
Cost based optimisation
Cost components
- Access cost to secondary storage
- Storage cost
- Computation cost
- Memory usage cost
- Communication cost
Cost components
Access cost to secondary storage
- Cost of searching for, reading and writing data blocks residing on
secondary storage - Depends on type of access structures on the datafile
- Also on whether block allocation is contiguous (more relevant for HDD than SSD)
Cost components
Storage cost
Cost of storing intermediate files generated by an execution strategy
Cost components
Computation cost
- Cost of performing in memory operations on data buffers during execution
- Include operations such as searching for and sorting records, merging records for JOIN and performing computations on field values
Cost components
Memory cost
Cost pertaining to number of memory buffers needed during execution
Cost components
Communication cost
- Cost of shipping query and it’s results between originating client and database (and any inter-database communication)
Cost minimisation aims for differing
scenarios
- Large databases
- Minimise access cost to secondary storage
- Minimise no of block transfers between disk and main memory : reduce volume of data moved
- Smaller databases
- Minimise computation cost
- Most of data involved in query can be stored completely in main memory
- Distributed databases
- Minimise communication cost
Semantic query optimisation
- Use constraints specified on the schema to inform query execution
- E.g. unique attributes, nullable fields
- Consider:
- SELECT * FROM Customers WHERE CName IS NULL
- If Cname is declared as NOT NULL then this query will never return any results
- Semantic optimisation says don’t run the query at all!
- As well as schema constraints, custom constraints can be implemented
Heuristic based optimisation
Query process operations
Intermediate query form is usually a ‘query tree’ whose leaves are the relations involved in the query and whose nodes are the RA operators
Query tree example - sample data
Canonical query tree
Query tree notation
- Leaf nodes: input relations of query
- Internal nodes: RA operations
- Execution of query tree
- Execute an internal node operation whenever it’s operands are available
- Replace internal node by resulting relation
- Execution terminates when route node executed and query result is obtained
Relational operators as a query mechanism
Multi relational queries
* Data we require may not be present in any single existing relation.
* The relational operators allow us to transform and combine existing relations in order to produce a single relation which holds the answer to our question
* “What are the names of the crew members on board the C-37D?”
- “What are the names of the crew members on board the C37D?”
- This involves the Pnames and Snames fields; we do not currently have a single relation holding these fields. In order to answer this query, we must combine all three relations in the database portion.
- Personnel -> PID, Pname, Rank, Bplace
- Ship -> SID, Sname, Class, Crew
Optimisation of high level query language
expressions
- Low level navigational languages (hierarchic and network) hand optimised by the programmer.
- High level relational languages Declarative (what rather than how) requires optimisation
Heuristic optimisation of query trees
- Main (most obvious?) heuristic is to apply select and project before joins
- Select and project usually reduce the size of a relation, they never increase it
- As join is M * N operation where M,N are size of relations involved, decreasing M and/or N is an optimisation
Heuristic Optimisation of query trees
Example
Intermediate step
- Second two splits are more complex
- They refer to two relations each
- Can’t apply these directly
- Have to first combine the two relations they refer to
End of step 1
Step2
Final steps
* Finally we can replace any cartesian product operations with JOIN operations
General heuristic rules
- Select and Project operations moved down the tree
- Most restrictive joins and selects also moved down
- Project optimisations are also applied (not shown)
Heuristic query optimisation
Summary
- Previous example shows that a query tree can be transformed step by step into another query tree that is more efficient to execute
- Must make sure that transformation steps always lead to an equivalent query tree
- Query optimiser needs equivalence preserving transformation rules
- See Elmasri and Navathe for list of a number of transformation rules and a simple optimisation algorithm employing those rule