Query Processing (2019) Flashcards
Query Processing:
Basic Steps
-
Query
- User inputs a query
-
Parser and Translator
- Translates query from human readable form to something DBMS understands, such as Relational Algebra Expression
-
Optimizer
- Finds optimized equivalent to RA expression
- Creates Execution Plan
-
Evaluation Engine
- Runs the Execution Plan to perform the query
-
Query Output
- Results returned to user
Query Processing:
Diagram of the Basic Steps
*See Diagram

Query Processing Steps:
Query
User inputs a query in some
human readable language,
such as SQL.
This query is passed to the Parser and Translator
Query Processing Steps:
Parser and Translator
- Recieves Query in some human readable form, such as SQL.
- Performs Syntax Checks
- Translates to a language that the DBMS understands internally(RA)
- Generates Relational Algebra Expression
Query Processing Steps:
Optimizer
- Recieves Relational Algebra Expression from Parser/Translator
- Finds an equivalent statement that is more optimal
- Same results, with less time and resources used
- From this optimal statement, generates an Execution Plan to be run by the Evaluation Engine
Query Processing Steps:
Evaluation Engine
- Receives an Execution Plan from the Optimizer
- Performs it on the Database, obtaining results to be output to the user
- Returns Query Output to the User
Cost of a Query
- Typical Measure is Total Elapsed Time
- Factors:
- Disk Access
- CPU
- Network Communication
- Disk access is usually the “bottleneck”
Disk Access:
How Relations are Stored on Disk
- DBMS stores its data in Files
- Files are broken into Blocks
- The number of tuples per block varies
- B(R) is the number of blocks used for relation R
- T(R) is the number of tuples used in relation R
Three Factors of
Disk Access cost
- Number of Seeks
- Number of Block Reads
- Number of Block Writes
Disk Access:
Why are Block Writes
more expensive than
Block Reads?
A block must be read again after writing,
so the actual time is:
(Time to Write) + (Time to Read)
Seek Time:
Basic Idea
The time spent
spinning the disk
to get to the specified data area
Seek Time:
Two types of Indices on Disk
- Clustered Index
- Data is stored in sorted order
- Data is very close together on disk
- Less Seek Time -> Faster
- Unclustered Index
- Data is stored farther apart
- More Seek Time -> Slower
Types of
External Memory Joins
(Extended) (9)
- Nested Join
- Nested Loop Join
- Block-Nest Loop Join
- Index Nested Loop Join
- Improved Nest Loop Join
- Reservoir Sampling
- Reservoir Join Sampling
- Hash Join
- Partitioned Hash Join
Three Types of
Join Algorithms (External Memory)
Nested Loop Joins
Hash Joins
Sort-Merge Joins
External Memory Join Algorithms:
Nested Loop Joins
General Cost
Nested Loop Joins
cost = O( |R| |S| )
External Memory Join Algorithms:
Hash Joins
General Cost
Hash Joins
Cost = O( |S| )
if |R| < |S|
*I believe R = read cost, S=search Cost
External Memory Join Algorithms:
Sort-Merge Joins
General Cost
Sort-Merge Joins
Cost = O( |S| * log(|s|) )
if |R| < |S|
*I believe R = read cost, S=search Cost
Query Costs:
External Memory Joins
- Happens when at least one relation does not fit into memory
- I/O Cost is the dominating factor
- Call the memory limit M
- Cost is depended on the Memory Join Algorithm used
- Internal Memory Joins are Simple:
- B(s) + B(s)
Nested Loop Join:
Overview
For each block of a relation R, and for each tuple r in block:
For each block of a relation S, and for each tuple s in block:
Output rs if the join condition evaluates to true
- R is called the Outer Table
- S is called the Inner Table
Cost = B(R) +T(R)*B(S)
Memory: 4 Blocks, if double buffering
Block-Nest Loop Join:
Overview
For each block of R and for each block of S:
For each tuple r of R and s of S:
Output rs if join condition evaluates to true.
R is the outer table, S is the inner table
Cost:
B(R) + B(R)*B(S)
Memory:
4 Blocks (If double buffering)
Improved Nest Loop Join
Outer table R and inner table S
- Use available memory, read M-2 blocks from table R
- Read blocks of S sequentially, join with R in Main Memory
Cost:
B(R) + B(R)*B(S) / (M-2)
Memory:
M Blocks
Nested Join
Joining Outer table R and inner table S
(Used if B(R) + B(S) < M) (fits in memory)
- Scan R, sort in Main Memory
- Scan S, sort in Main Memory
- Merge R and S
Cost: B(R) + B(S)
Index Nested Join
R has an index over the Join Attribute
- Scan S:
- For each tuple of S, find matching tuples of R
V(R,a) = number of distinct values of a in R
Cost:
- If data clustered
- Cost = B(S) + T(S)*B(R) / V(R,a)
- If data unclustered
- Cost = B(S) + T(S)*T(R) / V(R,a)
Selection:
Two Types of Scans
+ Their Costs
Scanning Without Index:
Cost = B(R) + Seek Time
Scanning With Index:
If there is an index over the attribute, use a function F to determine the # of I/O operations
Data Clustered:
Cost = F(R,a)*B(R)
Data Unclustered:
Cost = F(R,a)*T(R)
Selection on Indexed Attribute:
I/O Cost Functions for possible
selection conditions
Equals : σa=v (R)
F(R,a) = 1 / V(R,a)
Less Than: σa<v> (R)</v>
F(R,a) = (v - min(R,a) ) / (max(R,a) - min(R,a) )
Range: σx<a> (R)</a>
F(R,a) = (v - x ) / (max(R,a) - min(R,a) )
Assumption about I/O Cost Function F
for most DBMS Applications
Most DBMSs use a B+ Tree,
so for searching on a Clustered Index,
we can use the Height of the B+ tree.
Cost = H + B(R)
Sampling:
Overview
- Can be used to save time on large Databases
- Sample from tables, join based on the sample instead
- Less accurate, but much faster
- Two Types:
- Reservoir Sampling
- Reservoir Join Sampling
Sampling:
Reservoir Sampling
W <- 0 start with 0 weight
Initialize reservoir array A[1..r] with dummy values.
While tuple streams in:
- Get next tuple t with weight w(t)
- W <- W + w(t) //increase the weight
- for j=1 to r
- set A[j] to t with probability w(t)/W
Sampling:
Reservoir Join Sampling
W <- 0 start with 0 weight
Initialize reservoir array A[1..r] with dummy values.
For each Candidate Network(CN) and every tuple t in CN:
If A has dummy values,
Insert k copies of t into A
Else
W <- W + w(t)
for i = 1 to k:
Insert t into A[i] with probability w(t)/W
Terminate if copy of t is inserted
Hash Join
Overview
- Scan R, build buckets in Main Memory
- Scan S, then Join
Notes:
- Build the hash table on whichever table is smaller: min( B(R), B(S) )
- Only works if min( B(R), B(S) ) <= M-2
- It has to fit inside of main memory
Cost = B(R) + B(S)
Partitioned Hash Join
Used when a relation cannot fit into memory.
- Using the first Hash Function, H1 :
- Hash R into M-1 buckets, store on disk
- Hash S into M-1 buckets, store on disk
- Read one partition(bucket) of R into Main Memory
- Join like normal hash join, using a second, separate hash function, H2
Cost:
3B(R) + 3B(S)
Memory:
Smaller bucket must fit in memory
Bucket size = B(R) / (M-1)