6. Sorting Flashcards
Why do databases need special sorting algorithms?
All of the traditional sorting algorithms (e.g. quicksort, insertion sort) rely on us being able to store all of the data in memory. This is a luxury we do not have when developing a database. In fact, most of the time our data will be an order of magnitude larger than the memory available to us.
What do we measure when analyzing database-specific sorting algorithms?
Because of how time-consuming it is to go to disk, we only look at the number of I/Os an algorithm incurs when analyzing its performance. We can pretty much ignore traditional in-memory measures of algorithmic complexity, like big-O of CPU work.
Two Way External Merge Sort - explain the algorithm
- “Conquer” phase - sort every individual page in place, producing N sorted runs of one page each
- “Merging” phase - repeatedly merge 2 sorted runs at a time into longer sorted runs, until a single sorted run remains (see the sketch after this list)
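Below is a minimal sketch of the two-way algorithm in Python, simulating disk pages as small in-memory lists. The page contents, helper names, and the representation of a run as a flat sorted list are illustrative assumptions; a real database would stream pages to and from disk through a tiny buffer.

```python
import heapq

def two_way_external_sort(pages):
    # "Conquer" pass: sort each page individually -> N sorted runs of 1 page each.
    runs = [sorted(page) for page in pages]

    # Merging passes: merge 2 sorted runs at a time until one run remains.
    # With N initial runs this loop executes ceil(log2(N)) times.
    while len(runs) > 1:
        next_runs = []
        for i in range(0, len(runs), 2):
            pair = runs[i:i + 2]
            next_runs.append(list(heapq.merge(*pair)))  # 2-way merge of sorted runs
        runs = next_runs

    return runs[0] if runs else []

if __name__ == "__main__":
    data_pages = [[9, 3, 7, 1], [8, 2, 6, 4], [5, 0, 11, 10]]
    print(two_way_external_sort(data_pages))
    # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
```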
Two Way Merge complexity - analysis sketch
- N data pages
- each pass over the data takes 2∗N I/Os (every page is read once and written once)
- the “conquer” pass is always needed, so we start with 2∗N
- after it we need ⌈log2(N)⌉ merging passes
- so the total is 2N ∗ (1 + ⌈log2(N)⌉) I/Os (a worked example follows below)
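A small worked instance of this formula, with the page count chosen purely for illustration:

```python
import math

# With N = 8 data pages: 2 * 8 * (1 + log2(8)) = 2 * 8 * 4 = 64 I/Os.
def two_way_sort_cost(n_pages):
    passes = 1 + math.ceil(math.log2(n_pages))  # conquer pass + merging passes
    return 2 * n_pages * passes                 # each pass reads and writes every page

print(two_way_sort_cost(8))    # 64
print(two_way_sort_cost(100))  # 2 * 100 * (1 + 7) = 1600
```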
Full External Sort - a main improvement over Two Way Merge
Use the whole in-memory buffer pool of B pages, instead of sorting each page separately and merging runs only two at a time.
Full External Sort - explain
Let’s assume we have B buffer pages.
- Conquer phase - load B pages at a time, sort them all at once, and write them back as one sorted run. Repeat until every page has been processed.
- Merge phase - merge B−1 sorted runs at a time (instead of 2 in two-way merge), keeping one buffer page for the output (see the sketch after this list)
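A minimal sketch of the full external sort, again with pages simulated as small in-memory lists. B (the number of buffer pages) and the helper names are illustrative assumptions; a real system streams actual disk pages through the buffer pool.

```python
import heapq

def conquer(pages, B):
    """Load B pages at a time and sort them together -> ceil(N / B) sorted runs."""
    runs = []
    for i in range(0, len(pages), B):
        chunk = [x for page in pages[i:i + B] for x in page]
        runs.append(sorted(chunk))
    return runs

def merge_pass(runs, B):
    """Merge up to B - 1 sorted runs at a time (one buffer page is reserved
    for the output run)."""
    fan_in = max(B - 1, 2)
    return [list(heapq.merge(*runs[i:i + fan_in]))
            for i in range(0, len(runs), fan_in)]

def full_external_sort(pages, B):
    runs = conquer(pages, B)
    while len(runs) > 1:
        runs = merge_pass(runs, B)
    return runs[0] if runs else []

if __name__ == "__main__":
    data_pages = [[9, 3, 7, 1], [8, 2, 6, 4], [5, 0, 11, 10], [15, 13, 12, 14]]
    print(full_external_sort(data_pages, B=3))
    # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
```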
Full External Sort complexity - analysis sketch
- the conquer pass produces only ⌈N/B⌉ sorted runs
- since we merge B−1 runs at a time on every step, we need ⌈log_{B−1}(N/B)⌉ merging passes (the log base is B−1 instead of 2)
- so the total is 2N ∗ (1 + ⌈log_{B−1}(N/B)⌉) I/Os (worked example below)
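A worked instance of this formula; the page and buffer counts are chosen purely for illustration.

```python
import math

# N = 1000 data pages, B = 10 buffer pages.
def full_external_sort_cost(n_pages, b_pages):
    runs = math.ceil(n_pages / b_pages)  # sorted runs left after the conquer pass
    merge_passes = math.ceil(math.log(runs, b_pages - 1)) if runs > 1 else 0
    return 2 * n_pages * (1 + merge_passes)

# 100 runs after conquering, ceil(log_9(100)) = 3 merging passes,
# so 2 * 1000 * (1 + 3) = 8000 I/Os -- versus 22,000 for two-way merge sort.
print(full_external_sort_cost(1000, 10))  # 8000
```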