QueryProcessing Flashcards
What is query processing?
The range of activities involved in extracting data from a database, including translation of high-level database language queries into expressions usable at the physical file system level
What are the three main steps in query processing?
- Parsing and translation
- Optimization
- Evaluation
What is a query-execution plan?
A sequence of primitive operations that can be used to evaluate a query. Each operation is annotated with instructions specifying how to evaluate it
What is typically the dominant cost factor in query evaluation for large databases?
I/O cost (disk access) dominates other costs like CPU time and communication
How is the cost of an operation that transfers b blocks and performs S random I/O accesses calculated?
b * tT + S * tS seconds where tT is time to transfer a block and tS is block-access time
Why are block writes typically more expensive than block reads on magnetic disks?
Block writes are about twice as expensive because disk systems read sectors back after writing to verify the write was successful
What percentage of B+-tree nodes are typically non-leaf nodes?
About 1% of nodes are non-leaf nodes
What is the difference between materialized evaluation and pipelined evaluation?
Materialized evaluation creates and stores temporary results between operations while pipelined evaluation passes results directly from one operation to the next without storing intermediate results
What are the two types of pipelining?
- Demand-driven pipeline: operations wait for requests to produce tuples
- Producer-driven pipeline: operations eagerly generate tuples without waiting for requests
What is a blocking operation in query processing?
An operation that cannot output any results until all input tuples have been examined (e.g. sorting)
What are the three main algorithms for computing joins?
- Nested-Loop Join
- Merge Join
- Hash Join
What is the main advantage of Block Nested-Loop Join over basic Nested-Loop Join?
In Block Nested-Loop Join each block of the inner relation is read only once for each block of the outer relation instead of once for each tuple
When can Indexed Nested-Loop Join be used?
When computing equi-join or natural join and there exists an index on the join attribute
What are the two main phases of external sort-merge?
- Creation of sorted runs
- Merging of runs
What is a bitmap index scan in PostgreSQL?
An algorithm that first creates a bitmap marking blocks containing matching records then only fetches and scans those specific blocks avoiding reading unnecessary blocks
What are the two main metrics used to estimate query costs in disk-based systems?
- Number of blocks transferred from storage
- Number of random I/O accesses
How does a hybrid merge-join algorithm work?
It merges a sorted relation with the leaf entries of a secondary B+-tree index producing a result file containing tuples from the sorted relation and addresses for tuples of the unsorted relation
What is the basic idea behind hash join?
Partition tuples of both relations using a hash function on the join attributes so that matching tuples will be in the same partition and only tuples in corresponding partitions need to be compared
What are the steps in implementing duplicate elimination using sorting?
- Sort the relation
- Scan the sorted result
- Remove all but one copy of identical tuples that appear adjacent to each other
What is partial aggregation optimization?
Combining tuples in the same group during run generation and intermediate merges by computing partial aggregate values. Only works for commutative and associative operations (count min max sum)
What are the three functions provided by an iterator in demand-driven pipelining?
- open()
- next()
- close()
What happens in the parsing and translation step of query processing?
The system translates the query into its internal form and performs syntax checking and verification of relations
What is query optimization?
The task of constructing a query-evaluation plan that minimizes the cost of query evaluation
What special consideration is needed for joins with spatial data?
Spatial joins require special comparison operators for containment and overlap and usually use indexed nested loops join with spatial indices since no simple sort order exists
What is the purpose of double buffering in materialized evaluation?
It allows the algorithm to execute more quickly by performing CPU activity in parallel with I/O activity
In hash join what is the build input and probe input?
The relation s is the build input (used to build in-memory hash index) and r is the probe input
How is the hash join fudge factor typically calculated?
f is typically around 1.2 used to account for some skewness of the join values
What is the key advantage of demand-driven pipelining over producer-driven pipelining?
It is easier to implement though producer-driven pipelining can be more efficient on modern CPUs
What happens in the build-probe phase of hash join?
The algorithm loads si into memory builds an in-memory hash index then reads ri tuples and locates matching tuples using the index
What is an evaluation primitive?
A relational-algebra operation annotated with instructions specifying how to evaluate it
When is producer-driven pipelining particularly useful?
In parallel processing systems
What are the two main components of disk access cost?
- Block transfer time (tT)
- Block access/seek time (tS)
How can joins with conjunctive conditions be implemented?
Either using hash join with multiple hash functions or by computing one simple join first and then filtering results with remaining conditions
How can set difference be implemented efficiently?
By first sorting both relations then scanning through each of the sorted relations once to produce the result
What is the main advantage of implementing aggregation with partial aggregation?
It reduces the amount of data that needs to be stored and processed in subsequent steps
What determines the number of runs in external sort-merge?
The size of the relation (br) divided by the memory available (M): ⌈br/M⌉
What is the main limitation of implementing full outer join using nested-loop join?
It is difficult to extend the nested-loop join algorithm to compute full outer join
What happens if B+-tree internal nodes don’t fit in memory?
A traversal of the index will require multiple I/O operations significantly increasing access cost
How can disjunctive join conditions be implemented?
By computing the union of the records from individual joins for each condition
What is the best case for nested-loop join cost?
When one relation fits entirely in memory requiring only br + bs block transfers and two seeks
How does sorted merge join work?
Once relations are sorted tuples with same join attribute values are consecutive allowing a single pass through both files