6. Sorting Flashcards
Why do databases need special sorting algorithms?
All of the traditional sorting algorithms (e.g. quicksort, insertion sort) rely on us being able to store all of the data in memory. This is a luxury we do not have when developing a database. In fact, most of the time our data will be an order of magnitude larger than the memory available to us.
What do we measure when analyzing database-specific sorting algorithms?
Because of how time-consuming it is to go to disk, we only look at the number of I/Os an algorithm incurs when analyzing its performance. We can pretty much ignore traditional in-memory measures of algorithmic complexity, like big-O of CPU work.
Two Way External Merge Sort - explain the algorithm
- “Conquer” phase - sort every individual page in place, producing N sorted runs of one page each
- “Merging” phase - repeatedly merge 2 sorted runs at a time into longer sorted runs, until a single sorted run remains (see the sketch after this list)
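Below is a minimal sketch of the two-way algorithm in Python, simulating disk pages as small in-memory lists. The page contents, helper names, and the representation of a run as a flat sorted list are illustrative assumptions; a real database would stream pages to and from disk through a tiny buffer.

```python
import heapq

def two_way_external_sort(pages):
    # "Conquer" pass: sort each page individually -> N sorted runs of 1 page each.
    runs = [sorted(page) for page in pages]

    # Merging passes: merge 2 sorted runs at a time until one run remains.
    # With N initial runs this loop executes ceil(log2(N)) times.
    while len(runs) > 1:
        next_runs = []
        for i in range(0, len(runs), 2):
            pair = runs[i:i + 2]
            next_runs.append(list(heapq.merge(*pair)))  # 2-way merge of sorted runs
        runs = next_runs

    return runs[0] if runs else []

if __name__ == "__main__":
    data_pages = [[9, 3, 7, 1], [8, 2, 6, 4], [5, 0, 11, 10]]
    print(two_way_external_sort(data_pages))
    # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
```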
Two Way Merge complexity - analysis sketch
- N data pages
- each pass over the data takes 2∗N I/Os (every page is read once and written once)
- the “conquer” pass is always needed, so we start with 2∗N
- after it we need ⌈log2(N)⌉ merging passes
- so the total is 2N ∗ (1 + ⌈log2(N)⌉) I/Os (a worked example follows below)
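A small worked instance of this formula, with the page count chosen purely for illustration:

```python
import math

# With N = 8 data pages: 2 * 8 * (1 + log2(8)) = 2 * 8 * 4 = 64 I/Os.
def two_way_sort_cost(n_pages):
    passes = 1 + math.ceil(math.log2(n_pages))  # conquer pass + merging passes
    return 2 * n_pages * passes                 # each pass reads and writes every page

print(two_way_sort_cost(8))    # 64
print(two_way_sort_cost(100))  # 2 * 100 * (1 + 7) = 1600
```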
Full External Sort - a main improvement over Two Way Merge
Use the whole in-memory buffer pool of B pages, instead of sorting each page separately and merging runs only two at a time.
Full External Sort - explain
Let’s assume we have B buffer pages.
- Conquer phase - load B pages at a time, sort them all at once, and write them back as one sorted run. Repeat until every page has been processed.
- Merge phase - merge B−1 sorted runs at a time (instead of 2 in two-way merge), keeping one buffer page for the output (see the sketch after this list)
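A minimal sketch of the full external sort, again with pages simulated as small in-memory lists. B (the number of buffer pages) and the helper names are illustrative assumptions; a real system streams actual disk pages through the buffer pool.

```python
import heapq

def conquer(pages, B):
    """Load B pages at a time and sort them together -> ceil(N / B) sorted runs."""
    runs = []
    for i in range(0, len(pages), B):
        chunk = [x for page in pages[i:i + B] for x in page]
        runs.append(sorted(chunk))
    return runs

def merge_pass(runs, B):
    """Merge up to B - 1 sorted runs at a time (one buffer page is reserved
    for the output run)."""
    fan_in = max(B - 1, 2)
    return [list(heapq.merge(*runs[i:i + fan_in]))
            for i in range(0, len(runs), fan_in)]

def full_external_sort(pages, B):
    runs = conquer(pages, B)
    while len(runs) > 1:
        runs = merge_pass(runs, B)
    return runs[0] if runs else []

if __name__ == "__main__":
    data_pages = [[9, 3, 7, 1], [8, 2, 6, 4], [5, 0, 11, 10], [15, 13, 12, 14]]
    print(full_external_sort(data_pages, B=3))
    # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
```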
Full External Sort complexity - analysis sketch
- the conquer pass produces only ⌈N/B⌉ sorted runs
- since we merge B−1 runs at a time on every step, we need ⌈log_{B−1}(N/B)⌉ merging passes (the log base is B−1 instead of 2)
- so the total is 2N ∗ (1 + ⌈log_{B−1}(N/B)⌉) I/Os (worked example below)
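A worked instance of this formula; the page and buffer counts are chosen purely for illustration.

```python
import math

# N = 1000 data pages, B = 10 buffer pages.
def full_external_sort_cost(n_pages, b_pages):
    runs = math.ceil(n_pages / b_pages)  # sorted runs left after the conquer pass
    merge_passes = math.ceil(math.log(runs, b_pages - 1)) if runs > 1 else 0
    return 2 * n_pages * (1 + merge_passes)

# 100 runs after conquering, ceil(log_9(100)) = 3 merging passes,
# so 2 * 1000 * (1 + 3) = 8000 I/Os -- versus 22,000 for two-way merge sort.
print(full_external_sort_cost(1000, 10))  # 8000
```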