QueryProcessing Flashcards

1
Q

What is query processing?

A

The range of activities involved in extracting data from a database, including translation of high-level database language queries into expressions usable at the physical file system level

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

What are the three main steps in query processing?

A
  1. Parsing and translation
  2. Optimization
  3. Evaluation
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

What is a query-execution plan?

A

A sequence of primitive operations that can be used to evaluate a query. Each operation is annotated with instructions specifying how to evaluate it

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

What is typically the dominant cost factor in query evaluation for large databases?

A

I/O cost (disk access) dominates other costs like CPU time and communication

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

How is the cost of an operation that transfers b blocks and performs S random I/O accesses calculated?

A

b * tT + S * tS seconds where tT is time to transfer a block and tS is block-access time

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

Why are block writes typically more expensive than block reads on magnetic disks?

A

Block writes are about twice as expensive because disk systems read sectors back after writing to verify the write was successful

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

What percentage of B+-tree nodes are typically non-leaf nodes?

A

About 1% of nodes are non-leaf nodes

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

What is the difference between materialized evaluation and pipelined evaluation?

A

Materialized evaluation creates and stores temporary results between operations while pipelined evaluation passes results directly from one operation to the next without storing intermediate results

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

What are the two types of pipelining?

A
  1. Demand-driven pipeline: operations wait for requests to produce tuples
  2. Producer-driven pipeline: operations eagerly generate tuples without waiting for requests
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

What is a blocking operation in query processing?

A

An operation that cannot output any results until all input tuples have been examined (e.g. sorting)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

What are the three main algorithms for computing joins?

A
  1. Nested-Loop Join
  2. Merge Join
  3. Hash Join
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

What is the main advantage of Block Nested-Loop Join over basic Nested-Loop Join?

A

In Block Nested-Loop Join each block of the inner relation is read only once for each block of the outer relation instead of once for each tuple

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

When can Indexed Nested-Loop Join be used?

A

When computing equi-join or natural join and there exists an index on the join attribute

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

What are the two main phases of external sort-merge?

A
  1. Creation of sorted runs
  2. Merging of runs
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

What is a bitmap index scan in PostgreSQL?

A

An algorithm that first creates a bitmap marking blocks containing matching records then only fetches and scans those specific blocks avoiding reading unnecessary blocks

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

What are the two main metrics used to estimate query costs in disk-based systems?

A
  1. Number of blocks transferred from storage
  2. Number of random I/O accesses
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
17
Q

How does a hybrid merge-join algorithm work?

A

It merges a sorted relation with the leaf entries of a secondary B+-tree index producing a result file containing tuples from the sorted relation and addresses for tuples of the unsorted relation

18
Q

What is the basic idea behind hash join?

A

Partition tuples of both relations using a hash function on the join attributes so that matching tuples will be in the same partition and only tuples in corresponding partitions need to be compared

19
Q

What are the steps in implementing duplicate elimination using sorting?

A
  1. Sort the relation
  2. Scan the sorted result
  3. Remove all but one copy of identical tuples that appear adjacent to each other
20
Q

What is partial aggregation optimization?

A

Combining tuples in the same group during run generation and intermediate merges by computing partial aggregate values. Only works for commutative and associative operations (count min max sum)

21
Q

What are the three functions provided by an iterator in demand-driven pipelining?

A
  1. open()
  2. next()
  3. close()
22
Q

What happens in the parsing and translation step of query processing?

A

The system translates the query into its internal form and performs syntax checking and verification of relations

23
Q

What is query optimization?

A

The task of constructing a query-evaluation plan that minimizes the cost of query evaluation

24
Q

What special consideration is needed for joins with spatial data?

A

Spatial joins require special comparison operators for containment and overlap and usually use indexed nested loops join with spatial indices since no simple sort order exists

25
Q

What is the purpose of double buffering in materialized evaluation?

A

It allows the algorithm to execute more quickly by performing CPU activity in parallel with I/O activity

26
Q

In hash join what is the build input and probe input?

A

The relation s is the build input (used to build in-memory hash index) and r is the probe input

27
Q

How is the hash join fudge factor typically calculated?

A

f is typically around 1.2 used to account for some skewness of the join values

28
Q

What is the key advantage of demand-driven pipelining over producer-driven pipelining?

A

It is easier to implement though producer-driven pipelining can be more efficient on modern CPUs

29
Q

What happens in the build-probe phase of hash join?

A

The algorithm loads si into memory builds an in-memory hash index then reads ri tuples and locates matching tuples using the index

30
Q

What is an evaluation primitive?

A

A relational-algebra operation annotated with instructions specifying how to evaluate it

31
Q

When is producer-driven pipelining particularly useful?

A

In parallel processing systems

32
Q

What are the two main components of disk access cost?

A
  1. Block transfer time (tT)
  2. Block access/seek time (tS)
33
Q

How can joins with conjunctive conditions be implemented?

A

Either using hash join with multiple hash functions or by computing one simple join first and then filtering results with remaining conditions

34
Q

How can set difference be implemented efficiently?

A

By first sorting both relations then scanning through each of the sorted relations once to produce the result

35
Q

What is the main advantage of implementing aggregation with partial aggregation?

A

It reduces the amount of data that needs to be stored and processed in subsequent steps

36
Q

What determines the number of runs in external sort-merge?

A

The size of the relation (br) divided by the memory available (M): ⌈br/M⌉

37
Q

What is the main limitation of implementing full outer join using nested-loop join?

A

It is difficult to extend the nested-loop join algorithm to compute full outer join

38
Q

What happens if B+-tree internal nodes don’t fit in memory?

A

A traversal of the index will require multiple I/O operations significantly increasing access cost

39
Q

How can disjunctive join conditions be implemented?

A

By computing the union of the records from individual joins for each condition

40
Q

What is the best case for nested-loop join cost?

A

When one relation fits entirely in memory requiring only br + bs block transfers and two seeks

41
Q

How does sorted merge join work?

A

Once relations are sorted tuples with same join attribute values are consecutive allowing a single pass through both files