Clustering Flashcards
Give the definition of metric space.
A metric space is an ordered pair (M, d) where M is a set and d is a metric on M, i.e., a function d : M × M → ℝ such that for every x, y, z ∈ M the following hold:
- d(x, y) ≥ 0;
- d(x, y) = 0 if and only if x = y;
- d(x, y) = d(y, x); (symmetry)
- d(x, z) ≤ d(x, y) + d(y, z); (triangle inequality)
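The axioms are easy to sanity-check numerically on a finite sample of points. A minimal Python sketch (the helper names are mine, not from the course material):

import itertools, math

def check_metric_axioms(d, points, tol=1e-9):
    # Verify the four axioms for d over all pairs/triples of the sample points.
    for x, y in itertools.product(points, repeat=2):
        assert d(x, y) >= -tol                      # non-negativity
        assert (d(x, y) <= tol) == (x == y)         # d(x, y) = 0 iff x = y
        assert abs(d(x, y) - d(y, x)) <= tol        # symmetry
    for x, y, z in itertools.product(points, repeat=3):
        assert d(x, z) <= d(x, y) + d(y, z) + tol   # triangle inequality
    return True

euclidean = lambda x, y: math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
print(check_metric_axioms(euclidean, [(0, 0), (1, 0), (0, 2), (3, 4)]))  # True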
What are the Minkowski distances?
Given two points x and y of dimension n we have:
d_r(x, y) = (Σ_{i=1}^{n} |x_i − y_i|^r)^{1/r}
- if r = 2 we have the Euclidean distance
- if r = 1 we have the Manhattan distance
- if r → +∞ we have the Chebyshev distance: d_∞(x, y) = max_{1≤i≤n} |x_i − y_i|
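A quick Python sketch of the three special cases (the function name is mine):

import math

def minkowski(x, y, r):
    # Minkowski distance d_r; r = math.inf yields the Chebyshev limit.
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if math.isinf(r):
        return max(diffs)                  # Chebyshev: max_i |x_i - y_i|
    return sum(t ** r for t in diffs) ** (1 / r)

x, y = (1.0, 2.0, 3.0), (4.0, 0.0, 3.0)
print(minkowski(x, y, 2))                  # Euclidean: sqrt(13) ≈ 3.606
print(minkowski(x, y, 1))                  # Manhattan: 5
print(minkowski(x, y, math.inf))           # Chebyshev: 3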
Give the definition of center-based clustering.
Let P be a set of N points in metric space (M, d), and let k be the target number of clusters, 1 ≤ k ≤ N. We define a k-clustering of P as a tuple C = (C1, C2, . . . , Ck ; c1, c2, . . . , ck) where
- (C1, C2, . . . , Ck) defines a partition of P, i.e., P = C1 ∪ C2 ∪ · · · ∪ Ck and Ci ∩ Cj = ∅ for every i ≠ j;
- c1, c2, . . . , ck are suitably selected centers for the clusters, where ci ∈ Ci for every 1 ≤ i ≤ k
Define the three main types of center-based clustering problems and specify their objective functions.
Let k > 0 and let (M, d) be a metric space. The k-center/k-means/k-median clustering problems are optimization problems that, given a finite pointset P ⊆ M, require determining a subset S ⊆ P of k centers which minimizes the following respective objective functions (where d(x, S) = min_{c∈S} d(x, c)):
- Φkcenter(P, S) = max_{x∈P} d(x, S) (k-center clustering);
- Φkmeans(P, S) = Σ_{x∈P} (d(x, S))² (k-means clustering);
- Φkmedian(P, S) = Σ_{x∈P} d(x, S) (k-median clustering).
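A sketch of the three objectives for a given center set S (helper names are mine; Euclidean distance serves as the running example):

import math

def dist_to_set(x, S, d):
    # d(x, S) = min over c in S of d(x, c)
    return min(d(x, c) for c in S)

def phi_kcenter(P, S, d): return max(dist_to_set(x, S, d) for x in P)
def phi_kmeans(P, S, d):  return sum(dist_to_set(x, S, d) ** 2 for x in P)
def phi_kmedian(P, S, d): return sum(dist_to_set(x, S, d) for x in P)

euclid = lambda x, y: math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
P = [(0, 0), (1, 0), (5, 5), (6, 5)]
S = [(0, 0), (5, 5)]
print(phi_kcenter(P, S, euclid), phi_kmeans(P, S, euclid), phi_kmedian(P, S, euclid))  # 1.0 2.0 2.0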
Define and give the pseudocode of the Farthest-First Traversal algorithm for the k-center clustering problem.
Input: set P of N points from a metric space (M, d), integer k > 1;
Output: set S of k centers which is a good solution to the k-center problem on P (i.e., Φkcenter(P, S) “close” to Φopt_kcenter(P, k)).
S ← {c1} // c1 ∈ P arbitrary point
for i ← 2 to k do
    find the point ci ∈ P − S that maximizes d(ci, S)
    S ← S ∪ {ci}
return S
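A minimal Python sketch of the algorithm (quadratic in N; names are mine):

import math

def farthest_first_traversal(P, k, d):
    # Greedy 2-approximation for k-center: repeatedly add the point farthest from S.
    S = [P[0]]                                        # c1: arbitrary point of P
    dist = [d(p, S[0]) for p in P]                    # dist[j] maintains d(P[j], S)
    for _ in range(2, k + 1):
        i = max(range(len(P)), key=lambda j: dist[j])  # farthest point from S
        S.append(P[i])
        dist = [min(dist[j], d(P[j], P[i])) for j in range(len(P))]
    return S

euclid = lambda x, y: math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
P = [(0, 0), (1, 1), (9, 9), (10, 10), (0, 10)]
print(farthest_first_traversal(P, 3, euclid))          # [(0, 0), (10, 10), (0, 10)]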
What is the approximation of the Farthest-First Traversal algorithm?
Let S be the set of centers returned by running Farthest-First Traversal on P. Then:
Φkcenter(P, S) ≤ 2 · Φopt_kcenter(P, k).
That is, Farthest-First Traversal is a 2-approximation algorithm.
Study the proof on the slides!
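A compact version of the standard argument (the slides' proof may be organized differently): let r = Φkcenter(P, S) and let x* ∈ P be a point realizing it. Since FFT always picks the point farthest from the current centers, the k+1 points c1, . . . , ck, x* are pairwise at distance ≥ r. By the pigeonhole principle, two of them, say u and v, fall in the same cluster of an optimal solution, with center c; hence r ≤ d(u, v) ≤ d(u, c) + d(c, v) ≤ 2 · Φopt_kcenter(P, k).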
Explain the coreset technique for solving a problem given a massive input P.
Suppose that we want to solve a problem Π for a massive input P.
1) Extract a "small" subset T from P (dubbed coreset), making sure that it represents P well;
2) Run the best known (possibly slow) sequential algorithm for Π on the small coreset T, rather than on the entire input P.
The technique is effective if:
• T can be extracted efficiently by processing P either on a distributed platform or as a stream;
• The solution computed on T is a good solution for Π w.r.t. the entire input P.
How can we implement the coreset technique for the k-center clustering problem?
Consider a large pointset P and a target granularity k.
1) Select a coreset T ⊂ P making sure that each x ∈ P − T is close to some point in T (approx. within Φopt_kcenter(P, k));
2) Search the set S of k final centers in T, so that each point of T is at distance approx. within Φopt_kcenter(P, k) from the closest center.
Combining the two points via the triangle inequality, each x ∈ P with close coreset point t ∈ T satisfies d(x, S) ≤ d(x, t) + d(t, S) = O(Φopt_kcenter(P, k)), i.e., every point of P is at distance O(Φopt_kcenter(P, k)) from its closest center.
Describe the MapReduce-Farthest-First Traversal algorithm. What is the approximation?
1) Split P into subsets Pi;
2) Extract coresets from each Pi obtaining Ti (use the FFT!);
3) Merge all Ti obtaining T;
4) Run FFT on T obtaining S.
Theorems:
For every x ∈ P we have
d(x,T) ≤ 2 · Φopt_kcenter(P, k).
Let S be the set of k centers returned by running MR-Farthest-First Traversal on P. Then:
Φkcenter(P, S) ≤ 4 · Φopt_kcenter(P, k).
That is, MR-Farthest-First Traversal is a 4-approximation algorithm.
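A sequential simulation of the four steps, reusing farthest_first_traversal and euclid from the sketch above (the equal-size split is illustrative; each subset is assumed to hold at least k points):

def mr_fft(P, k, l, d):
    # Steps 1-3: split P into l subsets, extract a k-point FFT coreset from each, merge
    size = (len(P) + l - 1) // l
    T = []
    for i in range(0, len(P), size):
        T += farthest_first_traversal(P[i:i + size], k, d)
    # Step 4: run FFT on the merged coreset T of l * k points
    return farthest_first_traversal(T, k, d)

P = [(float(i), float(i % 7)) for i in range(100)]
print(mr_fft(P, 3, 4, euclid))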
Does a random sample provide a good coreset?
No. Outliers are few by definition, so for large datasets a uniform random sample includes them with probability close to 0; but the k-center objective is a max, so even one missed outlier can make the solution computed on the sample arbitrarily bad for P.
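A back-of-the-envelope check (the numbers are made up): a single outlier among a million points is almost surely missed by a uniform sample of a thousand points, yet that outlier alone can determine Φkcenter.

# Probability that a uniform sample of size s (with replacement) misses
# one specific outlier among N points: (1 - 1/N)^s.
N, s = 10**6, 1000
print((1 - 1 / N) ** s)   # ≈ 0.999: the outlier is almost surely excluded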
Consider the diameter dmax = max_{x,y∈P} d(x, y) of a pointset P of N points: we want to approximate it.
For any 0 ≤ i < N, letting dmax(i) = max_{x∈P} d(pi, x) be the largest distance from the i-th point, we have dmax ∈ [dmax(i), 2·dmax(i)] (easily proved with the triangle inequality). How can we obtain a better approximation?
Coreset-based approximation:
• Fix a suitable k ≥ 2.
• Extract a coreset S ⊂ P of size k by running a k-center clustering algorithm on P and taking the k cluster centers as set S.
• Return d_S = max_{x,y∈S} d(x, y) as an approximation of dmax.
Study the theorem on the slides!
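A sketch of the coreset-based estimate, reusing farthest_first_traversal and euclid from the FFT sketch above (k trades accuracy for extra work):

from itertools import combinations

def approx_diameter(P, k, d):
    # Coreset S: the k centers returned by farthest-first traversal on P
    S = farthest_first_traversal(P, k, d)
    # d_S: largest pairwise distance within the coreset
    return max(d(x, y) for x, y in combinations(S, 2))

P = [(float(i), 0.0) for i in range(50)]
print(approx_diameter(P, 4, euclid))   # 49.0 (the true diameter)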
Provide the pseudocode of the Lloyd’s algorithm for the k-means clustering problem.
Input: Set P of N points from ℝ^D, integer k > 1
Output: Set S ⊂ ℝ^D of k centers with small Φkmeans(P, S).
S = {c1, c2, . . . , ck} ← arbitrary set of k points in ℝ^D
Φ ← Φkmeans(P, S) // initial cost
while (true) do
    for 1 ≤ i ≤ k do Ci ← points of P closest to ci // partition step: uses the current centers
    for 1 ≤ i ≤ k do ci ← centroid of Ci // centroid step
    S ← {c1, c2, . . . , ck}
    if Φkmeans(P, S) < Φ then Φ ← Φkmeans(P, S)
    else return S
- Slow for large datasets: each iteration requires a full pass over P, and many iterations may be needed
- Can get trapped in local minima if the initial k centers are chosen poorly
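A self-contained Python sketch (squared Euclidean distance in ℝ^D; centroid = coordinate-wise mean; assumes no cluster ever becomes empty, which a real implementation must handle):

def lloyd(P, S):
    # Alternate assignment and centroid steps until the cost stops decreasing.
    def cost(S):
        return sum(min(sum((a - b) ** 2 for a, b in zip(x, c)) for c in S) for x in P)
    phi = cost(S)
    while True:
        clusters = [[] for _ in S]   # assignment step: Ci <- points closest to ci
        for x in P:
            i = min(range(len(S)), key=lambda j: sum((a - b) ** 2 for a, b in zip(x, S[j])))
            clusters[i].append(x)
        # centroid step: ci <- coordinate-wise mean of Ci (assumed non-empty)
        S = [tuple(sum(col) / len(C) for col in zip(*C)) for C in clusters]
        new_phi = cost(S)
        if new_phi < phi:
            phi = new_phi
        else:
            return S

P = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
print(lloyd(P, [P[0], P[1]]))   # [(0.0, 0.5), (10.0, 10.5)]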
Provide the pseudocode of the k-means++ algorithm.
Algorithm k-means++ computes an (initial) set S of k centers for P using the following procedure (note that S ⊆ P):
c1 ← random point chosen from P with uniform probability;
S ← {c1};
for 2 ≤ i ≤ k do
    foreach x ∈ P − S do π(x) ← (d(x, S))² / Σ_{y∈P−S} (d(y, S))²
    ci ← random point in P − S chosen according to distribution π(·)
    S ← S ∪ {ci}
return S
Guarantee: E[Φkmeans(P, S)] ≤ 8(ln k + 2) · Φopt_kmeans(P, k).
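A Python sketch of the seeding, reusing euclid from the FFT sketch above (random.choices draws from the distribution π(·)):

import random

def kmeans_plus_plus(P, k, d):
    S = [random.choice(P)]                     # c1: uniform over P
    dist2 = [d(x, S[0]) ** 2 for x in P]       # (d(x, S))^2, maintained incrementally
    for _ in range(2, k + 1):
        # points already in S have dist2 = 0, so they are never drawn again
        c = random.choices(P, weights=dist2, k=1)[0]
        S.append(c)
        dist2 = [min(dist2[i], d(P[i], c) ** 2) for i in range(len(P))]
    return S

P = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0), (20.0, 0.0)]
print(kmeans_plus_plus(P, 3, euclid))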
What is the objective function of the weighted k-means clustering problem?
Φ^w_kmeans(P, S) = Σ_{x∈P} w(x) · (d(x, S))²
Describe the MR-kmeans(A) algorithm where A is a sequential algorithm for computing coresets.
Round 1:
• Map Phase: Partition P arbitrarily into ℓ subsets of equal size P1, P2, . . . , Pℓ.
• Reduce Phase: for every i ∈ [1, ℓ] separately, run A on Pi (with unit weights) to determine a set Ti ⊆ Pi of k centers, and define:
- For each x ∈ Pi, the proxy π(x) as x’s closest center in Ti.
- For each y ∈ Ti, the weight w(y) as the number of points of Pi whose proxy is y.
Round 2:
• Map Phase: empty.
• Reduce Phase: gather the coreset T = ⋃_{i=1,…,ℓ} Ti of ℓ · k points, together with their weights, and, using a single reducer, run A on T (with the given weights) to determine a set S = {c1, c2, . . . , ck} of k centers, which is returned as output.
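A sequential simulation of the two rounds. Purely as an illustrative assumption, A is instantiated by a weighted variant of the k-means++ seeding above (the course leaves A generic); euclid is again reused from the FFT sketch.

import random

def weighted_seeding(P, w, k, d):
    # Weighted k-means++-style seeding: pi(x) proportional to w(x) * (d(x, S))^2.
    S = [random.choices(P, weights=w, k=1)[0]]
    for _ in range(2, k + 1):
        wd2 = [w[i] * min(d(P[i], c) for c in S) ** 2 for i in range(len(P))]
        S.append(random.choices(P, weights=wd2, k=1)[0])
    return S

def mr_kmeans(P, k, l, d):
    size = (len(P) + l - 1) // l
    parts = [P[i:i + size] for i in range(0, len(P), size)]   # Round 1, map phase
    T, weights = [], []
    for Pi in parts:                                          # Round 1, reduce phase
        Ti = weighted_seeding(Pi, [1] * len(Pi), k, d)        # run A with unit weights
        counts = {y: 0 for y in Ti}
        for x in Pi:                                          # proxy of x: closest center in Ti
            counts[min(Ti, key=lambda y: d(x, y))] += 1
        T += Ti
        weights += [counts[y] for y in Ti]                    # w(y): #points whose proxy is y
    return weighted_seeding(T, weights, k, d)                 # Round 2: A on the weighted coreset

P = [(float(i % 13), float(i % 7)) for i in range(200)]
print(mr_kmeans(P, 3, 4, euclid))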