Clustering Flashcards

1
Q

Give the definition of metric space.

A

A metric space is an ordered pair (M, d) where M is a set and d(·,·) is a
metric on M, i.e., a function d : M × M → ℜ such that for every x, y, z ∈ M
the following hold:

  • d(x, y) ≥ 0; (non-negativity)
  • d(x, y) = 0 if and only if x = y;
  • d(x, y) = d(y, x); (symmetry)
  • d(x, z) ≤ d(x, y) + d(y, z). (triangle inequality)
2
Q

What are the Minkowski distances?

A

Given two points x and y of dimension n, the Minkowski distance of order r is:

dr(x, y) = ( Σ_{i=1..n} |xi − yi|^r )^(1/r)

  • if r = 2 we have the Euclidean distance;
  • if r = 1 we have the Manhattan distance;
  • for r → +∞ we obtain the Chebyshev distance, d∞(x, y) = max_{1≤i≤n} |xi − yi|.
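
A minimal Python sketch of the formula and its three special cases (my own illustration, not from the deck):

```python
import math

def minkowski(x, y, r):
    """Minkowski distance of order r between two equal-dimension points."""
    if math.isinf(r):
        # Chebyshev distance: the r -> +inf limit of the general formula
        return max(abs(a - b) for a, b in zip(x, y))
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1.0 / r)

x, y = (0.0, 3.0), (4.0, 0.0)
print(minkowski(x, y, 1))         # 7.0 (Manhattan)
print(minkowski(x, y, 2))         # 5.0 (Euclidean)
print(minkowski(x, y, math.inf))  # 4.0 (Chebyshev)
```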
3
Q

Give the definition of center based clustering.

A

Let P be a set of N points in a metric space (M, d), and let k be the target number of clusters, 1 ≤ k ≤ N. We define a k-clustering of P as a tuple C = (C1, C2, . . . , Ck; c1, c2, . . . , ck) where

  • (C1, C2, . . . , Ck) defines a partition of P, i.e., P = C1 ∪ C2 ∪ · · · ∪ Ck and Ci ∩ Cj = ∅ for i ≠ j;
  • c1, c2, . . . , ck are suitably selected centers for the clusters, where ci ∈ Ci for every 1 ≤ i ≤ k.
4
Q

Define the three main types of center-based clustering problems and specify their objective functions.

A

Let k > 0 and let (M, d) be a metric space. The k-center/k-means/k-median clustering problems are optimization problems that, given a finite pointset P ⊆ M, require determining a subset S ⊆ P of k centers which minimizes the respective objective function below, where d(x, S) = min_{c∈S} d(x, c):

  • Φkcenter(P, S) = max_{x∈P} d(x, S) (k-center clustering);
  • Φkmeans(P, S) = Σ_{x∈P} (d(x, S))^2 (k-means clustering);
  • Φkmedian(P, S) = Σ_{x∈P} d(x, S) (k-median clustering).
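
A small Python rendering of the three objectives (my own sketch; d(x, S) is the point-to-set distance from the definition, with Euclidean d assumed):

```python
import math

def dist_to_set(x, S):
    """d(x, S) = min_{c in S} d(x, c), with Euclidean d."""
    return min(math.dist(x, c) for c in S)

def phi_kcenter(P, S):
    return max(dist_to_set(x, S) for x in P)       # k-center objective

def phi_kmeans(P, S):
    return sum(dist_to_set(x, S) ** 2 for x in P)  # k-means objective

def phi_kmedian(P, S):
    return sum(dist_to_set(x, S) for x in P)       # k-median objective
```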
5
Q

Define and give the pseudocode of the Farthest-First Traversal algorithm for the k-center clustering problem.

A

Input: set P of N points from a metric space (M, d), integer k > 1;
Output: set S of k centers which is a good solution to the k-center problem on P (i.e., Φkcenter(P, S) “close” to Φopt_kcenter(P, k)).

S ← {c1} // c1 ∈ P arbitrary point
for i ← 2 to k do
    Find the point ci ∈ P − S that maximizes d(ci, S)
    S ← S ∪ {ci}
return S
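
A runnable Python version of the pseudocode (my rendering, assuming Euclidean points; the incremental update of d(x, S) is an implementation choice, not part of the pseudocode):

```python
import math

def farthest_first_traversal(P, k):
    """Farthest-First Traversal: P is a list of points, k > 1.
    Maintains d(x, S) incrementally, so the whole run costs
    O(N * k) distance computations."""
    S = [P[0]]                                # c1: an arbitrary point of P
    d_S = [math.dist(x, S[0]) for x in P]     # d(x, S) for every x in P
    for _ in range(k - 1):
        i = max(range(len(P)), key=d_S.__getitem__)   # farthest from S
        S.append(P[i])
        d_S = [min(d_S[j], math.dist(P[j], P[i])) for j in range(len(P))]
    return S
```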
6
Q

What is the approximation of the Farthest-First Traversal algorithm?

A

Let S be the set of centers returned by running Farthest-First Traversal on P. Then:
Φkcenter(P, S) ≤ 2 · Φopt_kcenter(P, k).
That is, Farthest-First Traversal is a 2-approximation algorithm.

Study the proof on the slides!

7
Q

Explain the coreset technique for solving a problem given a massive input P.

A

Suppose that we want to solve a problem Π on a massive input P.

1) Extract a “small” subset T from P (dubbed coreset), making sure that it represents P well;
2) Run the best known (possibly slow) sequential algorithm for Π on the small coreset T, rather than on the entire input P.

The technique is effective if:
  • T can be extracted efficiently by processing P either on a distributed platform or as a stream;
  • the solution computed on T is a good solution for Π w.r.t. the entire input P.

8
Q

How can we implement the coreset technique for the k-center clustering problem?

A

Consider a large pointset P and a target granularity k.

1) Select a coreset T ⊂ P, making sure that each x ∈ P − T is close to some point of T (at distance approximately Φopt_kcenter(P, k));
2) Search for the set S of k final centers within T, so that each point of T is at distance approximately Φopt_kcenter(P, k) from its closest center.

Combining the two points, it follows that each point of P is at distance O(Φopt_kcenter(P, k)) from its closest center.

9
Q

Describe the MapReduce-Farthest-First Traversal algorithm. What is the approximation?

A

1) Split P into ℓ subsets Pi;
2) Extract a coreset Ti of size k from each Pi by running Farthest-First Traversal (FFT) on it;
3) Merge all the Ti, obtaining T (of ℓ · k points);
4) Run FFT on T, obtaining the final set S of k centers.
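
A sequential Python simulation of the two rounds (my sketch; it reuses the `farthest_first_traversal` function from card 5's snippet):

```python
def mr_fft(P, k, ell):
    """Sequential simulation of MR-Farthest-First Traversal."""
    # Round 1 (map): split P arbitrarily into ell subsets
    parts = [P[i::ell] for i in range(ell)]
    # Round 1 (reduce): extract a size-k coreset from each subset via FFT
    coresets = [farthest_first_traversal(Pi, k) for Pi in parts]
    # Round 2: merge into T (ell * k points) and run FFT on T
    T = [x for Ti in coresets for x in Ti]
    return farthest_first_traversal(T, k)
```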

Theorems:
For every x ∈ P we have
d(x,T) ≤ 2 · Φopt_kcenter(P, k).

Let S be the set of k centers returned by running MR-Farthest-First Traversal on P. Then:
Φkcenter(P, S) ≤ 4 · Φopt_kcenter(P, k).
That is, MR-Farthest-First Traversal is a 4-approximation algorithm.

10
Q

Does a random sample provide a good coreset?

A

No. A random sample is likely to miss outliers: for large datasets each outlier is included with probability close to 0, yet the k-center objective is determined by the farthest (possibly outlier) points, so the sample does not represent P well.

11
Q

Consider the diameter dmax of a pointset P of N points: we want to approximate it.

For any 0 ≤ i < N, letting dmax(i) = max_{x∈P} d(xi, x), we have dmax ∈ [dmax(i), 2·dmax(i)] (easily proved with the triangle inequality). How can we obtain a better approximation?

A

Coreset-based approximation:
• Fix a suitable k ≥ 2.
• Extract a coreset S ⊂ P of size k by running a k-center clustering algorithm on P and taking the k cluster centers as set S.
• Return dS = max x,y∈S d(x, y) as an approximation of dmax.

Study the theorem on the slides!
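
A sketch of the procedure in Python (my own, reusing the `farthest_first_traversal` function from card 5's snippet as the k-center algorithm):

```python
import math

def approx_diameter(P, k):
    """Coreset-based diameter approximation: use the k centers returned by
    a k-center algorithm (here FFT) as coreset S, then compute the exact
    diameter of S with O(k^2) distance computations."""
    S = farthest_first_traversal(P, k)
    return max(math.dist(x, y) for x in S for y in S)
```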

12
Q

Provide the pseudocode of Lloyd’s algorithm for the k-means clustering problem.

A

Input: Set P of N points from ℜ^D, integer k > 1
Output: Set S ⊂ ℜ^D of k centers with small Φkmeans(P, S).

S = {c1, c2, . . . , ck} ← arbitrary set of k points in ℜ^D;
Φ ← Φkmeans(P, S); // Initial cost

while (true) do
    for 1 ≤ i ≤ k do Ci ← points of P closest to ci (ties broken arbitrarily);
    for 1 ≤ i ≤ k do ci ← centroid of Ci;
    Let S = {c1, c2, . . . , ck};
    if (Φkmeans(P, S) < Φ) then Φ ← Φkmeans(P, S);
    else return S;

  • Bad for large datasets: each iteration requires a full pass over P;
  • Can be trapped into local minima (if the initial k centers are chosen badly).
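
A runnable Python version of the loop (my own rendering; `phi_kmeans` is the objective from card 4's snippet, and keeping the old center when a cluster becomes empty is an assumed convention the pseudocode leaves open):

```python
import math

def lloyd(P, S):
    """Lloyd's algorithm. P: list of points (tuples of floats), S: initial
    list of k centers. Stops when the objective stops improving."""
    cost = phi_kmeans(P, S)
    while True:
        # Assignment step: partition P according to the closest center
        clusters = [[] for _ in S]
        for x in P:
            i = min(range(len(S)), key=lambda j: math.dist(x, S[j]))
            clusters[i].append(x)
        # Update step: move each center to the centroid of its cluster
        new_S = [tuple(sum(c) / len(C) for c in zip(*C)) if C else S[i]
                 for i, C in enumerate(clusters)]
        new_cost = phi_kmeans(P, new_S)
        if new_cost < cost:
            S, cost = new_S, new_cost
        else:
            return S
```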
13
Q

Provide the pseudocode of the k-means++ algorithm.

A

Algorithm k-means++ computes an (initial) set S of k centers for P using the following procedure (note that S ⊆ P):

c1 ← random point chosen from P with uniform probability;
S ← {c1};

for 2 ≤ i ≤ k do
    foreach x ∈ P − S do π(x) ← (d(x, S))^2 / Σ_{y∈P−S} (d(y, S))^2;
    ci ← random point in P − S chosen according to distribution π(·);
    S ← S ∪ {ci};
return S

The returned set satisfies E[Φkmeans(P, S)] ≤ 8(ln k + 2) · Φopt_kmeans(P, k), where the expectation is over the random choices of the algorithm.
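
A Python sketch of the seeding procedure (my own; giving weight 0 to points already in S is equivalent to drawing from P − S):

```python
import math, random

def kmeans_pp(P, k):
    """k-means++ seeding: c1 uniform at random; each subsequent center is
    drawn with probability proportional to (d(x, S))^2."""
    S = [random.choice(P)]
    d2 = [math.dist(x, S[0]) ** 2 for x in P]   # (d(x, S))^2 for every x
    for _ in range(k - 1):
        idx = random.choices(range(len(P)), weights=d2)[0]
        S.append(P[idx])
        d2 = [min(d2[j], math.dist(P[j], P[idx]) ** 2) for j in range(len(P))]
    return S
```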

14
Q

What is the objective function of the weighted k-means clustering problem?

A

Φw_kmeans(P, S) = Σ_{x∈P} w(x) · (d(x, S))^2
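
In code, the only change with respect to the unweighted objective is the per-point factor w(x) (a sketch reusing the `dist_to_set` helper from card 4's snippet):

```python
def phi_w_kmeans(P, w, S):
    """Weighted k-means objective: P is a list of points, w a parallel
    list of non-negative weights."""
    return sum(wx * dist_to_set(x, S) ** 2 for x, wx in zip(P, w))
```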

15
Q

Describe the MR-kmeans(A) algorithm, where A is a sequential algorithm for weighted k-means clustering.

A

Round 1:
  • Map Phase: partition P arbitrarily into ℓ subsets of equal size P1, P2, . . . , Pℓ.
  • Reduce Phase: for every i ∈ [1, ℓ] separately, run A on Pi (with unit weights) to determine a set Ti ⊆ Pi of k centers, and define:
    - for each x ∈ Pi, the proxy π(x) as x’s closest center in Ti;
    - for each y ∈ Ti, the weight w(y) as the number of points of Pi whose proxy is y.

Round 2:
  • Map Phase: empty.
  • Reduce Phase: gather the coreset T = ∪_{i=1,…,ℓ} Ti of ℓ · k points, together with their weights, and run, using a single reducer, A on T (with the given weights) to determine the set S = {c1, c2, . . . , ck} of k centers, which is then returned as output.
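
A sequential Python simulation of the two rounds (my sketch; `A(points, weights, k)` is an assumed interface for the weighted k-means subroutine):

```python
import math

def mr_kmeans(P, k, ell, A):
    """Sequential simulation of MR-kmeans(A)."""
    parts = [P[i::ell] for i in range(ell)]     # Round 1 map: ell-way split
    T, w = [], []
    for Pi in parts:                            # Round 1 reduce, per subset
        Ti = A(Pi, [1] * len(Pi), k)            # run A with unit weights
        counts = [0] * len(Ti)                  # w(y) = #points with proxy y
        for x in Pi:
            j = min(range(len(Ti)), key=lambda h: math.dist(x, Ti[h]))
            counts[j] += 1
        T += Ti
        w += counts
    return A(T, w, k)                           # Round 2: single reducer on T
```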

16
Q

What is the approximation of the MR-kmeans(A) algorithm?

A

Suppose that A is an α-approximation algorithm for weighted k-means clustering, for some α > 1, and let S be the set of k centers returned by MR-kmeans(A) on input P. Then:

Φkmeans(P, S) = O(α^2 · Φopt_kmeans(P, k)).

That is, MR-kmeans(A) is an O(α^2)-approximation algorithm.

17
Q

Consider clustering evaluation: describe the silhouette technique and how to approximate it.

A

Let C = (C1, C2, . . . , Ck) be a partition of P into k clusters. For any x ∈ P, belonging to cluster Ci, the silhouette coefficient of x is:

  • sx = (bx − ax) / max{ax, bx}

where, letting dsum(x, C) = Σ_{y∈C} d(x, y):

  ax = dsum(x, Ci)/|Ci|   (average distance from x to its own cluster)
  bx = min_{j≠i} dsum(x, Cj)/|Cj|   (average distance from x to the closest other cluster)

To measure the quality of C we define the average silhouette coefficient as:

  • sC = (1/|P|) · Σ_{x∈P} sx   (average of all the coefficients)

You can approximate the silhouette coefficients via sampling. For a fixed t and each cluster C of size n:

  • extract a sample St ⊂ C, where each y ∈ C is independently included in St with probability min{1, t/n} (Poisson sampling);
  • estimate dsum(x, C) by d̃sum(x, C, t) = (n/t) · Σ_{y∈St} d(x, y);

then recompute the average silhouette coefficient using these estimates.
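
A Python sketch of the sampling-based approximation (my own, assuming Euclidean points and at least 2 clusters):

```python
import math, random

def approx_avg_silhouette(clusters, t):
    """Approximate average silhouette coefficient. clusters: list of point
    lists (the partition C1..Ck). Each cluster of size n is Poisson-sampled
    with probability min(1, t/n), and dsum(x, C) is estimated as
    (n/t) * sum of d(x, y) over the sample."""
    samples = [[y for y in C if random.random() < min(1.0, t / len(C))]
               for C in clusters]

    def dsum_est(x, j):
        n = len(clusters[j])
        scale = n / t if n > t else 1.0   # exact when the cluster is small
        return scale * sum(math.dist(x, y) for y in samples[j])

    total, N = 0.0, sum(len(C) for C in clusters)
    for i, Ci in enumerate(clusters):
        for x in Ci:
            a = dsum_est(x, i) / len(Ci)                  # estimate of ax
            b = min(dsum_est(x, j) / len(clusters[j])     # estimate of bx
                    for j in range(len(clusters)) if j != i)
            total += (b - a) / max(a, b) if max(a, b) > 0 else 0.0
    return total / N
```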