DC2: Linear-time Median Flashcards
Explain how we get from the median problem to the more general “find the kth smallest item problem”?
The median of a list A is the ceiling(n/2)th smallest item. If for odd n = 2L + 1, the median is therefore the L + 1 smallest.
So solving for the kth smallest item is a generalization of the median problem.
What is the easy algorithm for finding the kth smallest item?
If the list is sorted, we can just iterate over it until we get to the kth item. So the easy algorithm would be sort, then iterate. Many sorting algorithms (like MergeSort) are O(nlogn), so we need to do better than that to beat the easy algorithm.
Describe the basics of the QuickSelect algorithm
QuickSelect finds the kth smallest item in list A, so it’s input are A and k.
Select(A, k):
1) Select a pivot p (this is the hard part)
2) Partition A in A<p, A=p, and A>p
3) If k < |A<p|, return Select(A<p, k)
If |A<p| < k <= |A<p| + |A=p|, return (p) (because the kth item must be in A=p)
If k > |A<p| + |A=p|, return Select(A>p, k - |A<p| - |A=p|)
Give the pseudocode for FastSelect(), the O(n) algorithm for finding the kth smallest item of A.
FastSelect(A, k):
1) Divide A into n/5 chunks, G_1…G_n/5
2) For i = 1 to n/5, set m_i = median(G_i)
3) Let S = {m_1, …, m_n/5}
4) Let p = FastSelect(S, n/5 * 1/2)
5) Partition A into A<p, A=p, and A>p
6) If k < |A<p|, return FastSelect(A<p, k)
else if k > |A<p| + |A=p|, return FastSelect(A>p, k - |A<p| - |A=p|)
else return (p) (because the kth item must be in A=p)
How do we find a good pivot p in O(n) time?
Instead of median, we look for “approximate median”, i.e., the value within some band around the median. Like the n/4th smallest to the 3n/4th smallest elements.
This gives a recurrence like T(n) = T(3n/4) + O(n) = O(n). This is true for any fraction q between 0 and 1: T(n) = T(qn) + O(n).
So a good pivot is any pivot p for which |A<p| <= 3n/4 and |A>p| <= 3n/4. Then we can find it in O(n) time.
Note T(n) = T(3n/4) + T(n/5) + O(n) is still O(n), because 3/4 + 1/5 < 1. So we have T(n/5) + O(n) time to find a good pivot (T(3n/4) is the time to do the rest of the algorithm).
That means we need to choose a subset S of size n/5, then set p = median(S).
S needs to be a representative sample. For each x in S, a few elements of A should be smaller and a few should be larger.
To accomplish this we take each n/5 subsets of A, and sort each of them (because they are fixed size 5, this takes O(1) time). then we take the median from each group of 5 and add it to S.