Query processing Flashcards
How are each of the acid properties fulfilled?
A: undo/redo logging
C: serialisable schedules
I: serialisable schedules and isolation levels
D: redo/undo logging, recoverable schedules
What is query processing used for?
Here we’re taking in queries and making a sequence of operations we can actually do on a database and then executing these on the physical hard disk in the execution engine
What does it mean that SQL queries are declarative
They tell the DBMS what we want, not how to get it
How Does Query Processing Work?
- Preprocessing might involve figuring out how tables are involved etc..
- Logical query (for the purpose of this course is relational algebra)
-Optimising by modifying it.
-physical query plan involves choosing between different algorithms
What is a query plan?
A relational algebra expression that is obtained from an SQL Query
- they tell us exactly how to compute the result to a query
How are query plans typically represented?
- As trees
- leaves are input relations
- inner nodes are operators
- they are evaluated from the leaves to the root (so bottom-up approach)
Why do DBMSs use relational algebra over SQL for query plans?
- because we can then use equivalence laws to generate more optimised query plans
How do we apply the select operation on a relation R
For each tuple t in R, if t satisfies condition, output t
- this requires reading the whole file
- so is very slow
How is a naive computation of a natural join carried out?
- A nested loop is used to do the natural join, so for each tuple in a table you look at each tuple in the other table and if they have the same value for all the common attributes you output the tuple obtained by combining these two tuples (rows)
- for each tuple you have to read the entire file on the other relation to check if they match
- this is very very slow
What is the run time of the Naïve Computation of Joins on relations S and R
|R| * |S| = number of tuples in R times number of tuples in S
What are equijoins?
Equijoins are just a variant of natural joins, but equijoins allow you to have two columns with the same output whereas natural join removes one of them
What is the run time of an equijoin?
- |R| + |S| + run time equal to size of the output
What are the rules for doing an equijoin?
- start at first row in attribute A, if the values of A and B match, we output this to the joined table, if they don’t match, we increase A (as in go to the new row down) until it’s equal to B
- when A and B are equal, we output this to the joined table, then we increase B
- if when we increase B they are still equal we output this to the joined table. If they’re not equal we go back to increasing A.
What happens if we’re doing the equijoin of R and S and column A has a lot of duplicated?
The size of the output will be very big (not efficient)
What sorting algorithm do we use to sort R by A and S by B and how does this affect the run time?
- we use a merge sort which gives a run time of O(log2(R))
- much faster than a nested loop
What is the run time of an equijoin (including sorting) if there’s not too many duplicates in A?
O(|R|log2|R| + |S|log2|S|)
- we can leave out the size of the output because this is linear so will be insignificant as the size of R and S increase (only if there aren’t too many duplicate rows in A)
Which algorithm is typically faster? merging sort algorithm or nested loop?
Merging sort algorithm
- exponential as opposed to quadratic
Why is it important that we take into account that relations are stored on the hard disk?
- RAM is often no big enough to hold all of a database in it at once
- though each time you access the disk it takes a lot of time because you have to find the exact location of the item you need
Can we read an entire relation in a faster way?
Yes, selection can be done faster if we know where to find the rows for G401
- this can be done by sorting and indexing
What does an index do?
an index addresses one or more attributes and corresponds to some table.
What is a primary index?
It points to the location of record on the disk and the data is sorted based on value
- typically the primary index is exactly what you define with primary keys
What is a secondary index?
- It points to the location of records on a disk
What is the run time in regards to searching for an item in the table of indexes?
- We can use a binary search for this which has time log2K where K is the size of the table
What is the run time for then going through each row that is listed in the index?
- this is just the number of items associated to the index specified
How can we make querying even faster if we’re using an index and the condition specifies two different attributes?
- instead of just using an index for only one of the items then having to go through all of those rows
- we can search through two different index tables, each one applying to one of the two specified attributes, then take the intersection of the union of the rows selected from both index tables.
What are advanced indicies?
Index tables that feature indexes that apply to multiple attributes
What are the two common types of indexes? and what are the SQL commands used for them?
- B+ Trees
CREATE INDEX ON students USING btree (programme, year) - Hash Tables
CREATE INDEX ON students USING hash (lower(name));
When is it good to use B+trees for indexing?
- If the selection we are after specifies the range (and this is one of the most widely used index)
- so for instance a range such as having the attributes program = G401 and year < 1
B+ trees are a specific kind of index, what index is this and what does it mean?
- B+ trees are a specific kind of index known as a multilevel index which means there are indexes over indexes.
- In a multilevel index you can have any number of these levels and just branch out more and more.
- Each level i index points to i+1 and the final level of index points out to the hard disk with the actual records where the information is stored.
What is stored in the leaves of B+ trees?
- a leaf is formed of two rows
- On the top row we have values and on the bottom row we have their corresponding pointers which point to tuples with that value. - We also have a next leaf pointer
- These values, that was also the case for the multilevel index, are sorted increasingly in order.