Typical Measure is Total Elapsed Time Factors: Disk Access CPU Network Communication Disk access is usually the "bottleneck"

Query Processing (2019) Flashcards by Emma Frost

Query Processing:

Basic Steps

Query
- User inputs a query
Parser and Translator
- Translates query from human readable form to something DBMS understands, such as Relational Algebra Expression
Optimizer
- Finds optimized equivalent to RA expression
- Creates Execution Plan
Evaluation Engine
- Runs the Execution Plan to perform the query
Query Output
- Results returned to user

How well did you know this?

Not at all

Perfectly

Query Processing:

Diagram of the Basic Steps

*See Diagram

How well did you know this?

Not at all

Perfectly

Query Processing Steps:

Query

User inputs a query in some

human readable language,

such as SQL.

This query is passed to the Parser and Translator

How well did you know this?

Not at all

Perfectly

Query Processing Steps:

Parser and Translator

Recieves Query in some human readable form, such as SQL.
Performs Syntax Checks
Translates to a language that the DBMS understands internally(RA)
Generates Relational Algebra Expression

How well did you know this?

Not at all

Perfectly

Query Processing Steps:

Optimizer

Recieves Relational Algebra Expression from Parser/Translator
Finds an equivalent statement that is more optimal
- Same results, with less time and resources used
From this optimal statement, generates an Execution Plan to be run by the Evaluation Engine

How well did you know this?

Not at all

Perfectly

Query Processing Steps:

Evaluation Engine

Receives an Execution Plan from the Optimizer
Performs it on the Database, obtaining results to be output to the user
Returns Query Output to the User

How well did you know this?

Not at all

Perfectly

Cost of a Query

Typical Measure is Total Elapsed Time
Factors:
- Disk Access
- CPU
- Network Communication
Disk access is usually the “bottleneck”

How well did you know this?

Not at all

Perfectly

Disk Access:

How Relations are Stored on Disk

DBMS stores its data in Files
Files are broken into Blocks
The number of tuples per block varies
B(R) is the number of blocks used for relation R
T(R) is the number of tuples used in relation R

How well did you know this?

Not at all

Perfectly

Three Factors of

Disk Access cost

Number of Seeks
Number of Block Reads
Number of Block Writes

How well did you know this?

Not at all

Perfectly

Disk Access:

Why are Block Writes

more expensive than

Block Reads?

A block must be read again after writing,

so the actual time is:

(Time to Write) + (Time to Read)

How well did you know this?

Not at all

Perfectly

Seek Time:

Basic Idea

The time spent

spinning the disk

to get to the specified data area

How well did you know this?

Not at all

Perfectly

Seek Time:

Two types of Indices on Disk

Clustered Index
- Data is stored in sorted order
- Data is very close together on disk
- Less Seek Time -> Faster
Unclustered Index
- Data is stored farther apart
- More Seek Time -> Slower

How well did you know this?

Not at all

Perfectly

Types of

External Memory Joins

(Extended) (9)

Nested Join
Nested Loop Join
Block-Nest Loop Join
Index Nested Loop Join
Improved Nest Loop Join
Reservoir Sampling
Reservoir Join Sampling
Hash Join
Partitioned Hash Join

How well did you know this?

Not at all

Perfectly

Three Types of

Join Algorithms (External Memory)

Nested Loop Joins

Hash Joins

Sort-Merge Joins

How well did you know this?

Not at all

Perfectly

External Memory Join Algorithms:

Nested Loop Joins

General Cost

Nested Loop Joins

cost = O( |R| |S| )

How well did you know this?

Not at all

Perfectly

External Memory Join Algorithms:

Hash Joins

General Cost

Study These Flashcards

Hash Joins

Cost = O( |S| )

if |R| < |S|

*I believe R = read cost, S=search Cost

External Memory Join Algorithms:

Sort-Merge Joins

General Cost

Study These Flashcards

Sort-Merge Joins

Cost = O( |S| * log(|s|) )

if |R| < |S|

*I believe R = read cost, S=search Cost

Query Costs:

External Memory Joins

Study These Flashcards

Happens when at least one relation does not fit into memory
I/O Cost is the dominating factor
Call the memory limit M
Cost is depended on the Memory Join Algorithm used
Internal Memory Joins are Simple:
- B(s) + B(s)

Nested Loop Join:

Overview

Study These Flashcards

For each block of a relation R, and for each tuple r in block:

For each block of a relation S, and for each tuple s in block:

Output rs if the join condition evaluates to true

R is called the Outer Table
S is called the Inner Table

Cost = B(R) +T(R)*B(S)

Memory: 4 Blocks, if double buffering

Block-Nest Loop Join:

Overview

Study These Flashcards

For each block of R and for each block of S:

For each tuple r of R and s of S:

Output rs if join condition evaluates to true.

R is the outer table, S is the inner table

Cost:

B(R) + B(R)*B(S)

Memory:

4 Blocks (If double buffering)

Improved Nest Loop Join

Study These Flashcards

Outer table R and inner table S

Use available memory, read M-2 blocks from table R
Read blocks of S sequentially, join with R in Main Memory

Cost:

B(R) + B(R)*B(S) / (M-2)

Memory:

M Blocks

Nested Join

Study These Flashcards

Joining Outer table R and inner table S

(Used if B(R) + B(S) < M) (fits in memory)

Scan R, sort in Main Memory
Scan S, sort in Main Memory
Merge R and S

Cost: B(R) + B(S)

Index Nested Join

Study These Flashcards

R has an index over the Join Attribute

Scan S:
- For each tuple of S, find matching tuples of R

V(R,a) = number of distinct values of a in R

Cost:

If data clustered
- Cost = B(S) + T(S)*B(R) / V(R,a)
If data unclustered
- Cost = B(S) + T(S)*T(R) / V(R,a)

Selection:

Two Types of Scans

+ Their Costs

Study These Flashcards

Scanning Without Index:

Cost = B(R) + Seek Time

Scanning With Index:

If there is an index over the attribute, use a function F to determine the # of I/O operations

Data Clustered:

Cost = F(R,a)*B(R)

Data Unclustered:

Cost = F(R,a)*T(R)

Selection on Indexed Attribute: I/O Cost Functions for possible selection conditions

_Equals_ : **σ_a=v (R)** **F(R,a)** = 1 / **V(R,a)** Less Than: **σ_{a (R)}** **F(R,a) = (v - min(R,a) ) / (max(R,a) - min(R,a) )** Range: **σ_{x (R)}** **F(R,a) = (v - x ) / (max(R,a) - min(R,a) )**

Assumption about I/O Cost Function F for most DBMS Applications

Most DBMSs use a **B+ Tree**, so for searching on a _Clustered Index_, we can use the _Height of the B+ tree._ **Cost = H + B(R)**

Sampling: Overview

* Can be used to save time on large Databases * Sample from tables, join based on the sample instead * Less accurate, but much faster * Two Types: * Reservoir Sampling * Reservoir Join Sampling

Sampling: Reservoir Sampling

**W** \<- 0 start with 0 weight Initialize reservoir array **A[**1..**r]** with dummy values. While tuple streams in: * Get next tuple **t** with weight **w(t)** * **W** \<- **W** + **w(t)** //increase the weight * for **j**=1 to **r** * set **A[j]** to **t** with probability **w(t)/W**

Sampling: Reservoir Join Sampling

**W** \<- 0 start with 0 weight Initialize reservoir array **A**[1..**r**] with dummy values. For each **Candidate Network(CN)** and every tuple **t** in **CN**: If **A** has dummy values, Insert **k** copies of **t** into **A** Else **W** \<- **W** + **w(t)** for i = 1 to **k**: Insert **t** into **A[i]** with probability **w(t)/W** Terminate if copy of **t** is inserted

Hash Join Overview

* Scan R, build buckets in Main Memory * Scan S, then Join Notes: * Build the hash table on whichever table is smaller: **min**( **B(R)**, **B(S)** ) * Only works if **min**( **B(R)**, **B(S)** ) \<= M-2 * It has to fit inside of main memory **Cost = B(R) + B(S)**

Partitioned Hash Join

Used when a relation cannot fit into memory. * Using the first Hash Function, **H₁** : * Hash **R** into **M**-1 buckets, store on disk * Hash **S** into **M**-1 buckets, store on disk * Read one partition(bucket) of R into Main Memory * Join like normal hash join, using a second, separate hash function, **H₂** Cost: **3B(R) + 3B(S)** Memory: Smaller bucket must fit in memory Bucket size = **B(R) / (M-1)**

Query Processing (2019) Flashcards

(31 cards)