Clustering Flashcards
Give the definition of metric space.
A metric space is an ordered pair (M, d) where M is a set and d is a metric on M, i.e., a function d : M × M → ℝ such that for every x, y, z ∈ M the following hold:
- d(x, y) ≥ 0;
- d(x, y) = 0 if and only if x = y;
- d(x, y) = d(y, x); (symmetry)
- d(x, z) ≤ d(x, y) + d(y, z); (triangle inequality)
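The axioms are easy to sanity-check numerically on a finite sample of points. A minimal Python sketch (the helper names are mine, not from the course material):

import itertools, math

def check_metric_axioms(d, points, tol=1e-9):
    # Verify the four axioms for d over all pairs/triples of the sample points.
    for x, y in itertools.product(points, repeat=2):
        assert d(x, y) >= -tol                      # non-negativity
        assert (d(x, y) <= tol) == (x == y)         # d(x, y) = 0 iff x = y
        assert abs(d(x, y) - d(y, x)) <= tol        # symmetry
    for x, y, z in itertools.product(points, repeat=3):
        assert d(x, z) <= d(x, y) + d(y, z) + tol   # triangle inequality
    return True

euclidean = lambda x, y: math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
print(check_metric_axioms(euclidean, [(0, 0), (1, 0), (0, 2), (3, 4)]))  # True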
What are the Minkowski distances?
Given two points x and y of dimension n we have:
d_r(x, y) = (Σ_{i=1}^{n} |x_i − y_i|^r)^{1/r}
- if r = 2 we have the Euclidean distance
- if r = 1 we have the Manhattan distance
- if r → +∞ we have the Chebyshev distance: d_∞(x, y) = max_{1≤i≤n} |x_i − y_i|
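A quick Python sketch of the three special cases (the function name is mine):

import math

def minkowski(x, y, r):
    # Minkowski distance d_r; r = math.inf yields the Chebyshev limit.
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if math.isinf(r):
        return max(diffs)                  # Chebyshev: max_i |x_i - y_i|
    return sum(t ** r for t in diffs) ** (1 / r)

x, y = (1.0, 2.0, 3.0), (4.0, 0.0, 3.0)
print(minkowski(x, y, 2))                  # Euclidean: sqrt(13) ≈ 3.606
print(minkowski(x, y, 1))                  # Manhattan: 5
print(minkowski(x, y, math.inf))           # Chebyshev: 3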
Give the definition of center-based clustering.
Let P be a set of N points in metric space (M, d), and let k be the target number of clusters, 1 ≤ k ≤ N. We define a k-clustering of P as a tuple C = (C1, C2, . . . , Ck ; c1, c2, . . . , ck) where
- (C1, C2, . . . , Ck) defines a partition of P, i.e., P = C1 ∪ C2 ∪ · · · ∪ Ck and Ci ∩ Cj = ∅ for every i ≠ j;
- c1, c2, . . . , ck are suitably selected centers for the clusters, where ci ∈ Ci for every 1 ≤ i ≤ k
Define the three main types of center-based clustering problems and specify their objective functions.
Let k > 0 and let (M, d) be a metric space. The k-center/k-means/k-median clustering problems are optimization problems that, given a finite pointset P ⊆ M, require determining a subset S ⊆ P of k centers which minimizes the following respective objective functions (where d(x, S) = min_{c∈S} d(x, c)):
- Φkcenter(P, S) = max_{x∈P} d(x, S) (k-center clustering);
- Φkmeans(P, S) = Σ_{x∈P} (d(x, S))² (k-means clustering);
- Φkmedian(P, S) = Σ_{x∈P} d(x, S) (k-median clustering).
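A sketch of the three objectives for a given center set S (helper names are mine; Euclidean distance serves as the running example):

import math

def dist_to_set(x, S, d):
    # d(x, S) = min over c in S of d(x, c)
    return min(d(x, c) for c in S)

def phi_kcenter(P, S, d): return max(dist_to_set(x, S, d) for x in P)
def phi_kmeans(P, S, d):  return sum(dist_to_set(x, S, d) ** 2 for x in P)
def phi_kmedian(P, S, d): return sum(dist_to_set(x, S, d) for x in P)

euclid = lambda x, y: math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
P = [(0, 0), (1, 0), (5, 5), (6, 5)]
S = [(0, 0), (5, 5)]
print(phi_kcenter(P, S, euclid), phi_kmeans(P, S, euclid), phi_kmedian(P, S, euclid))  # 1.0 2.0 2.0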
Define and give the pseudocode of the Farthest-First Traversal algorithm for the k-center clustering problem.
Input: set P of N points from a metric space (M, d), integer k > 1;
Output: set S of k centers which is a good solution to the k-center problem on P (i.e., Φkcenter(P, S) “close” to Φopt_kcenter(P, k)).
S ← {c1} // c1 ∈ P arbitrary point
for i ← 2 to k do
    find the point ci ∈ P − S that maximizes d(ci, S)
    S ← S ∪ {ci}
return S
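A minimal Python sketch of the algorithm (quadratic in N; names are mine):

import math

def farthest_first_traversal(P, k, d):
    # Greedy 2-approximation for k-center: repeatedly add the point farthest from S.
    S = [P[0]]                                        # c1: arbitrary point of P
    dist = [d(p, S[0]) for p in P]                    # dist[j] maintains d(P[j], S)
    for _ in range(2, k + 1):
        i = max(range(len(P)), key=lambda j: dist[j])  # farthest point from S
        S.append(P[i])
        dist = [min(dist[j], d(P[j], P[i])) for j in range(len(P))]
    return S

euclid = lambda x, y: math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
P = [(0, 0), (1, 1), (9, 9), (10, 10), (0, 10)]
print(farthest_first_traversal(P, 3, euclid))          # [(0, 0), (10, 10), (0, 10)]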
What is the approximation of the Farthest-First Traversal algorithm?
Let S be the set of centers returned by running Farthest-First Traversal on P. Then:
Φkcenter(P, S) ≤ 2 · Φopt_kcenter(P, k).
That is, Farthest-First Traversal is a 2-approximation algorithm.
Study the proof on the slides!
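A compact version of the standard argument (the slides' proof may be organized differently): let r = Φkcenter(P, S) and let x* ∈ P be a point realizing it. Since FFT always picks the point farthest from the current centers, the k+1 points c1, . . . , ck, x* are pairwise at distance ≥ r. By the pigeonhole principle, two of them, say u and v, fall in the same cluster of an optimal solution, with center c; hence r ≤ d(u, v) ≤ d(u, c) + d(c, v) ≤ 2 · Φopt_kcenter(P, k).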
Explain the coreset technique for solving a problem given a massive input P.
Suppose that we want to solve a problem Π for a massive input P.
1) Extract a "small" subset T from P (dubbed coreset), making sure that it represents P well;
2) Run the best known (possibly slow) sequential algorithm for Π on the small coreset T, rather than on the entire input P.
The technique is effective if:
• T can be extracted efficiently by processing P either on a distributed platform or as a stream;
• The solution computed on T is a good solution for Π w.r.t. the entire input P.
How can we implement the coreset technique for the k-center clustering problem?
Consider a large pointset P and a target granularity k.
1) Select a coreset T ⊂ P making sure that each x ∈ P − T is close to some point in T (approx. within Φopt_kcenter(P, k));
2) Search the set S of k final centers in T, so that each point of T is at distance approx. within Φopt_kcenter(P, k) from the closest center.
Combining the two points via the triangle inequality, each x ∈ P with close coreset point t ∈ T satisfies d(x, S) ≤ d(x, t) + d(t, S) = O(Φopt_kcenter(P, k)), i.e., every point of P is at distance O(Φopt_kcenter(P, k)) from its closest center.
Describe the MapReduce-Farthest-First Traversal algorithm. What is the approximation?
1) Split P into subsets Pi;
2) Extract coresets from each Pi obtaining Ti (use the FFT!);
3) Merge all Ti obtaining T;
4) Run FFT on T obtaining S.
Theorems:
For every x ∈ P we have
d(x,T) ≤ 2 · Φopt_kcenter(P, k).
Let S be the set of k centers returned by running MR-Farthest-First Traversal on P. Then:
Φkcenter(P, S) ≤ 4 · Φopt_kcenter(P, k).
That is, MR-Farthest-First Traversal is a 4-approximation algorithm.
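A sequential simulation of the four steps, reusing farthest_first_traversal and euclid from the sketch above (the equal-size split is illustrative; each subset is assumed to hold at least k points):

def mr_fft(P, k, l, d):
    # Steps 1-3: split P into l subsets, extract a k-point FFT coreset from each, merge
    size = (len(P) + l - 1) // l
    T = []
    for i in range(0, len(P), size):
        T += farthest_first_traversal(P[i:i + size], k, d)
    # Step 4: run FFT on the merged coreset T of l * k points
    return farthest_first_traversal(T, k, d)

P = [(float(i), float(i % 7)) for i in range(100)]
print(mr_fft(P, 3, 4, euclid))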
Does a random sample provide a good coreset?
No. Outliers are few by definition, so for large datasets a uniform random sample includes them with probability close to 0; but the k-center objective is a max, so even one missed outlier can make the solution computed on the sample arbitrarily bad for P.
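A back-of-the-envelope check (the numbers are made up): a single outlier among a million points is almost surely missed by a uniform sample of a thousand points, yet that outlier alone can determine Φkcenter.

# Probability that a uniform sample of size s (with replacement) misses
# one specific outlier among N points: (1 - 1/N)^s.
N, s = 10**6, 1000
print((1 - 1 / N) ** s)   # ≈ 0.999: the outlier is almost surely excluded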
Consider the diameter dmax = max_{x,y∈P} d(x, y) of a pointset P of N points: we want to approximate it.
For any 0 ≤ i < N, letting dmax(i) = max_{x∈P} d(pi, x) be the largest distance from the i-th point, we have dmax ∈ [dmax(i), 2·dmax(i)] (easily proved with the triangle inequality). How can we obtain a better approximation?
Coreset-based approximation:
• Fix a suitable k ≥ 2.
• Extract a coreset S ⊂ P of size k by running a k-center clustering algorithm on P and taking the k cluster centers as set S.
• Return d_S = max_{x,y∈S} d(x, y) as an approximation of dmax.
Study the theorem on the slides!
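A sketch of the coreset-based estimate, reusing farthest_first_traversal and euclid from the FFT sketch above (k trades accuracy for extra work):

from itertools import combinations

def approx_diameter(P, k, d):
    # Coreset S: the k centers returned by farthest-first traversal on P
    S = farthest_first_traversal(P, k, d)
    # d_S: largest pairwise distance within the coreset
    return max(d(x, y) for x, y in combinations(S, 2))

P = [(float(i), 0.0) for i in range(50)]
print(approx_diameter(P, 4, euclid))   # 49.0 (the true diameter)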
Provide the pseudocode of the Lloyd’s algorithm for the k-means clustering problem.
Input: Set P of N points from ℝ^D, integer k > 1
Output: Set S ⊂ ℝ^D of k centers with small Φkmeans(P, S).
S = {c1, c2, . . . , ck} ← arbitrary set of k points in ℝ^D
Φ ← Φkmeans(P, S) // initial cost
while (true) do
    for 1 ≤ i ≤ k do Ci ← points of P closest to ci // partition step: uses the current centers
    for 1 ≤ i ≤ k do ci ← centroid of Ci // centroid step
    S ← {c1, c2, . . . , ck}
    if Φkmeans(P, S) < Φ then Φ ← Φkmeans(P, S)
    else return S
- Slow for large datasets: each iteration requires a full pass over P, and many iterations may be needed
- Can get trapped in local minima if the initial k centers are chosen poorly
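A self-contained Python sketch (squared Euclidean distance in ℝ^D; centroid = coordinate-wise mean; assumes no cluster ever becomes empty, which a real implementation must handle):

def lloyd(P, S):
    # Alternate assignment and centroid steps until the cost stops decreasing.
    def cost(S):
        return sum(min(sum((a - b) ** 2 for a, b in zip(x, c)) for c in S) for x in P)
    phi = cost(S)
    while True:
        clusters = [[] for _ in S]   # assignment step: Ci <- points closest to ci
        for x in P:
            i = min(range(len(S)), key=lambda j: sum((a - b) ** 2 for a, b in zip(x, S[j])))
            clusters[i].append(x)
        # centroid step: ci <- coordinate-wise mean of Ci (assumed non-empty)
        S = [tuple(sum(col) / len(C) for col in zip(*C)) for C in clusters]
        new_phi = cost(S)
        if new_phi < phi:
            phi = new_phi
        else:
            return S

P = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
print(lloyd(P, [P[0], P[1]]))   # [(0.0, 0.5), (10.0, 10.5)]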
Provide the pseudocode of the k-means++ algorithm.
Algorithm k-means++ computes an (initial) set S of k centers for P using the following procedure (note that S ⊆ P):
c1 ← random point chosen from P with uniform probability;
S ← {c1};
for 2 ≤ i ≤ k do
    foreach x ∈ P − S do π(x) ← (d(x, S))² / Σ_{y∈P−S} (d(y, S))²
    ci ← random point in P − S chosen according to distribution π(·)
    S ← S ∪ {ci}
return S
Guarantee: E[Φkmeans(P, S)] ≤ 8(ln k + 2) · Φopt_kmeans(P, k).
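A Python sketch of the seeding, reusing euclid from the FFT sketch above (random.choices draws from the distribution π(·)):

import random

def kmeans_plus_plus(P, k, d):
    S = [random.choice(P)]                     # c1: uniform over P
    dist2 = [d(x, S[0]) ** 2 for x in P]       # (d(x, S))^2, maintained incrementally
    for _ in range(2, k + 1):
        # points already in S have dist2 = 0, so they are never drawn again
        c = random.choices(P, weights=dist2, k=1)[0]
        S.append(c)
        dist2 = [min(dist2[i], d(P[i], c) ** 2) for i in range(len(P))]
    return S

P = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0), (20.0, 0.0)]
print(kmeans_plus_plus(P, 3, euclid))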
What is the objective function of the weighted k-means clustering problem?
Φ^w_kmeans(P, S) = Σ_{x∈P} w(x) · (d(x, S))²
Describe the MR-kmeans(A) algorithm where A is a sequential algorithm for computing coresets.
Round 1:
• Map Phase: Partition P arbitrarily into ℓ subsets of equal size P1, P2, . . . , Pℓ.
• Reduce Phase: for every i ∈ [1, ℓ] separately, run A on Pi (with unit weights) to determine a set Ti ⊆ Pi of k centers, and define:
- For each x ∈ Pi, the proxy π(x) as x’s closest center in Ti.
- For each y ∈ Ti, the weight w(y) as the number of points of Pi whose proxy is y.
Round 2:
• Map Phase: empty.
• Reduce Phase: gather the coreset T = ⋃_{i=1,…,ℓ} Ti of ℓ · k points, together with their weights, and, using a single reducer, run A on T (with the given weights) to determine a set S = {c1, c2, . . . , ck} of k centers, which is returned as output.
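A sequential simulation of the two rounds. Purely as an illustrative assumption, A is instantiated by a weighted variant of the k-means++ seeding above (the course leaves A generic); euclid is again reused from the FFT sketch.

import random

def weighted_seeding(P, w, k, d):
    # Weighted k-means++-style seeding: pi(x) proportional to w(x) * (d(x, S))^2.
    S = [random.choices(P, weights=w, k=1)[0]]
    for _ in range(2, k + 1):
        wd2 = [w[i] * min(d(P[i], c) for c in S) ** 2 for i in range(len(P))]
        S.append(random.choices(P, weights=wd2, k=1)[0])
    return S

def mr_kmeans(P, k, l, d):
    size = (len(P) + l - 1) // l
    parts = [P[i:i + size] for i in range(0, len(P), size)]   # Round 1, map phase
    T, weights = [], []
    for Pi in parts:                                          # Round 1, reduce phase
        Ti = weighted_seeding(Pi, [1] * len(Pi), k, d)        # run A with unit weights
        counts = {y: 0 for y in Ti}
        for x in Pi:                                          # proxy of x: closest center in Ti
            counts[min(Ti, key=lambda y: d(x, y))] += 1
        T += Ti
        weights += [counts[y] for y in Ti]                    # w(y): #points whose proxy is y
    return weighted_seeding(T, weights, k, d)                 # Round 2: A on the weighted coreset

P = [(float(i % 13), float(i % 7)) for i in range(200)]
print(mr_kmeans(P, 3, 4, euclid))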