Coreset Flashcards
(24 cards)
Coreset technique
Suppose that we want to solve a problem Π on instances P which are too
large to be processed by known algorithms.
Coreset technique
1 Extract a "small" subset T from P (dubbed coreset), making sure
that it represents P well w.r.t. solutions to Π.
2 Run the best known (possibly slow) sequential algorithm for Π on the
small coreset T, rather than on the entire input P.
The technique is effective if
* T can be extracted efficiently by processing P, possibly in a
distributed or streaming fashion.
* The solution computed on T is a good solution for Π w.r.t. the
entire input P.
Composable Coreset technique
- Partition P into ℓ subsets P1, P2, . . . , Pℓ, and extract a "small"
coreset Ti from each Pi, making sure that it represents Pi well.
- Run the best known (possibly slow) sequential algorithm for Π on
T = ∪i Ti, rather than on P.
The technique is effective if
* Each Ti can be extracted efficiently from Pi
in parallel for all i’s.
* The final coreset T is still small and the solution computed on T is a good solution for Π w.r.t. the entire input P.
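A minimal Python sketch of the composable coreset workflow just described. Here extract_coreset and solve_sequential are hypothetical problem-specific routines (assumptions, not part of these cards), and the partition is a simple round-robin split.

def composable_coreset_solve(P, ell, extract_coreset, solve_sequential):
    parts = [P[i::ell] for i in range(ell)]                  # P1, ..., P_ell (round-robin split)
    local_coresets = [extract_coreset(Pi) for Pi in parts]   # in practice, computed in parallel
    T = [x for Ti in local_coresets for x in Ti]             # T = union of the Ti's
    return solve_sequential(T)                               # best sequential algorithm, run on T only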
Metric Space
Typically, the input of a clustering problem consists of a set of points
from a metric space.
Definition
A metric space is an ordered pair (M, d) where M is a set and d(·) is a metric on M, i.e., a function
d : M × M → R
such that for every x, y, z ∈ M the following holds
* d(x, y) ≥ 0;
* d(x, y) = 0 if and only if x = y;
* d(x, y) = d(y, x); (symmetry)
* d(x, z) ≤ d(x, y) + d(y, z); (triangle inequality)
Distance functions
Minkowski distances: d_r(X, Y) = (Σ_{i=1}^{D} |x_i − y_i|^r)^{1/r}
- r = 2: Euclidean distance (square root of the sum of squared differences).
- r = 1: Manhattan distance (only axis-parallel moves, i.e., 90-degree turns).
- r = ∞: Chebyshev distance, the maximum difference over all dimensions.
Cosine (or angular) distance. It is used when points are vectors:
d_cosine(X, Y) = arccos( X · Y / (||X|| · ||Y||) )
Hamming distance: for binary vectors, the number of positions in which they differ.
Example: X = (0, 1, 1, 0, 1) and Y = (1, 1, 1, 0, 0). Distance is 2.
Jaccard distance: for sets; 1 − |intersection| / |union|.
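Minimal Python sketches of the distance functions above (not from the cards), assuming points/vectors are given as tuples or lists of numbers and sets as Python sets.

import math

def minkowski(X, Y, r):
    # r = 1: Manhattan distance; r = 2: Euclidean distance
    return sum(abs(x - y) ** r for x, y in zip(X, Y)) ** (1 / r)

def chebyshev(X, Y):
    # r = infinity: maximum gap over all dimensions
    return max(abs(x - y) for x, y in zip(X, Y))

def cosine_distance(X, Y):
    dot = sum(x * y for x, y in zip(X, Y))
    norm = math.sqrt(sum(x * x for x in X)) * math.sqrt(sum(y * y for y in Y))
    return math.acos(max(-1.0, min(1.0, dot / norm)))   # arccos(X·Y / (||X||·||Y||)), clamped for float error

def hamming(X, Y):
    # number of positions in which the two binary vectors differ
    return sum(1 for x, y in zip(X, Y) if x != y)        # hamming((0,1,1,0,1), (1,1,1,0,0)) == 2

def jaccard_distance(A, B):
    # 1 - |intersection| / |union|, for two sets A and B
    return 1 - len(A & B) / len(A | B)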
Distance functions and applications
- Minkowski: aggregates the gap in each dimension between two
objects. Example: GPS coordinates.
- Cosine: measures the ratio among feature values. Example: distance
between documents of different lengths, where each document
is characterized by the relative frequency of words from a
given vocabulary.
- Hamming distance: it measures the total number of
differences between two objects. Examples: fixed-length
binary words, genetic distance.
- Jaccard distance: it measures the ratio between the number of
elements on which two sets differ and the total number of elements
in the two sets. Example: the topics describing a document.
Approximation algorithms play a crucial role:
- For several important optimization problems, finding optimal
solutions for large inputs is very time consuming. E.g.,
  - NP-hard problems, for which it is likely that exponential time is required.
  - Problems for which polynomial-time algorithms exist but
    require a superlinear number of operations (e.g., quadratic, cubic).
- Allowing for approximations widens the spectrum of possible
solutions and makes it easier to find one.
- In real-world instances, often affected by noise or uncertainty, the
true optimum may not be well defined.
Center-based clustering
- k-center clustering: aims at minimizing the maximum distance of
any point from the center of its cluster.
- k-means clustering: aims at minimizing the sum of the squared
distances of the points from the centers of their respective clusters
(a.k.a. Sum of Squared Errors (SSE)).
- k-median clustering: aims at minimizing the sum of the distances
of the points from the centers of their respective clusters.
Let (M, d) be a metric space. The k-center/k-means/k-median clustering
problems are optimization problems that, given a finite pointset P ⊆ M
and an integer k ≤ |P|, require to return the subset S ⊆ P of k centers
which minimizes the following respective objective functions:
- Φkcenter(P, S) = max_{x∈P} d(x, S) (k-center clustering)
- Φkmeans(P, S) = Σ_{x∈P} (d(x, S))² (k-means clustering)
- Φkmedian(P, S) = Σ_{x∈P} d(x, S) (k-median clustering)
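A small Python sketch of the three objective functions, assuming the Euclidean metric and points given as tuples; the helper names dist and d_point_set are mine and are reused in the later sketches.

import math

def dist(x, y):
    # Euclidean (L2) distance between two points given as tuples
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def d_point_set(x, S):
    # d(x, S): distance of x from its closest center in S
    return min(dist(x, c) for c in S)

def phi_kcenter(P, S):
    return max(d_point_set(x, S) for x in P)        # maximum distance from the closest center

def phi_kmeans(P, S):
    return sum(d_point_set(x, S) ** 2 for x in P)   # sum of squared distances (SSE)

def phi_kmedian(P, S):
    return sum(d_point_set(x, S) for x in P)        # sum of distances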
Assignment primitive: Assign(P,S)
Consider a pointset P of N points represented as key-value pairs
(IDx, x), where IDx is a distinct integer in [0, N), and x is a point.
Let also S be a set of k centers. Considering S as global data,
show that Assign(P, S) can be implemented in MapReduce in 1
round using local space ML = O(k).
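A possible plain-Python sketch of Assign(P, S) viewed as one MapReduce round, assuming points are tuples, S is replicated to every map task as global data (hence O(k) extra local space per task), and dist is the Euclidean helper defined above.

def assign_map(pair, S):
    # Map phase: (IDx, x) -> (index of x's closest center in S, (IDx, x));
    # S is global data, so each map task only keeps O(k) extra items.
    IDx, x = pair
    best = min(range(len(S)), key=lambda i: dist(x, S[i]))
    return (best, (IDx, x))

def Assign(P, S):
    # P is a list of (IDx, x) pairs; returns {center index: list of assigned (IDx, x) pairs}
    clusters = {i: [] for i in range(len(S))}
    for pair in P:
        i, kv = assign_map(pair, S)
        clusters[i].append(kv)
    return clusters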
Observations: which objective to choose
- k-center is useful when we need to guarantee that every point
is close to a center (e.g., if centers represent locations of
crucial facilities or if they are needed as "representatives" for
the data). As we will see later, it is sensitive to noise.
- k-means/median provide guarantees on the average (squared)
distances. k-means is more sensitive to noise, but there exist
faster algorithms to solve it.
k-center: Farthest-First Traversal algorithm
Popular 2-approximation sequential algorithm; simple and somewhat fast.
Farthest-First Traversal requires k − 1 scans of the pointset P: impractical
for massive P when k is not small. -> Remedy: the composable coreset technique.
Input: Set P of N points from a metric space (M, d), integer k > 1
Output: A set S of k centers which is a good solution to the k-center
problem on P (i.e., Φkcenter(P, S) "close" to Φ^opt_kcenter(P, k))
S ← {c1} // c1 ∈ P arbitrary point
for i ← 2 to k do
Find the point ci ∈ P − S that maximizes d(ci, S)
S ← S ∪ {ci}
return S
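A minimal Python implementation of the pseudocode above, reusing dist from the earlier sketch; the first center is simply the first point of P (any arbitrary choice works). Each of the k − 1 iterations is a single scan of P, so the cost is O(N · k).

def farthest_first_traversal(P, k):
    S = [P[0]]                                    # c1: arbitrary point of P
    d_to_S = [dist(x, S[0]) for x in P]           # d_to_S[j] = d(P[j], S)
    for _ in range(k - 1):
        # pick the point that maximizes its distance from the current set of centers
        j = max(range(len(P)), key=lambda j: d_to_S[j])
        c = P[j]
        S.append(c)
        # one scan of P to update distances w.r.t. the newly added center
        d_to_S = [min(d_to_S[i], dist(P[i], c)) for i in range(len(P))]
    return S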
Lemma
For every x ∈ P we have d(x, T) ≤ 2 · Φ^opt_kcenter(P, k).
Farthest-First Traversal: analysis
Theorem
Let S be the set of centers returned by running Farthest-First Traversal
on P. Then:
Φkcenter(P, S) ≤ 2 · Φ^opt_kcenter(P, k).
That is, Farthest-First Traversal is a 2-approximation algorithm.
…
Observations on k-center clustering
- The k-center objective focuses on the worst-case distance of
points from their closest centers.
- Farthest-First Traversal's approximation guarantee is almost the
best one can obtain in polynomial time: it was proved that
computing a c-approximate solution to k-center is NP-hard
for any fixed c < 2.
- The k-center objective is very sensitive to noise. For noisy
datasets (e.g., with outliers), the clustering which optimizes
the k-center objective may obfuscate some "natural"
clustering inherent in the data. See example in the next slide.
Exact diameter computation
2-approximation to the diameter
Let dmax = max{d(x, y) : x, y ∈ P} denote the exact diameter. For an arbitrary xi ∈ P define
dmax(i) = max{d(xi, xj) : 0 ≤ j < N}.
Lemma
For any 0 ≤ i < N we have dmax ∈ [dmax(i), 2 · dmax(i)].
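A tiny Python sketch of this 2-approximation (using dist from above): a single scan of P from an arbitrary point xi.

def diameter_2approx(P, i=0):
    # maximum distance of the arbitrary point x_i from the rest of P;
    # by the lemma, the true diameter dmax lies in [dmax(i), 2 * dmax(i)]
    xi = P[i]
    return max(dist(xi, xj) for xj in P)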
Better, coreset-based diameter approximation
Coreset-based approximation:
* Fix a suitable k ≥ 2.
* Extract a coreset T ⊂ P of size k by running a k-center clustering
algorithm on P and taking the k cluster centers as set T.
* Return dT = max_{x,y∈T} d(x, y) as an approximation of dmax.
Can the approximation be computed efficiently?
If k = O(1), dT can be computed
* Sequentially: in O(N) time (using Farthest-First Traversal)
* In MapReduce: in 2 rounds, with local space ML = O(√N) and
aggregate space MA = O(N) (through MR-Farthest-First Traversal)
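A short Python sketch of the coreset-based approximation, reusing farthest_first_traversal and dist from the earlier sketches; k is assumed to be a small constant.

def diameter_coreset_approx(P, k):
    T = farthest_first_traversal(P, k)              # coreset of k cluster centers
    return max(dist(x, y) for x in T for y in T)    # d_T: maximum pairwise distance within T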
Diversity maximization
Determine the most diverse subset of size k (for a given small k)
Diversity maximization problem: remote clique variant
Given a set P of points from a metric space (M, d) and a positive integer
k < |P|, return a subset S ⊂ P of k points, which maximizes the
diversity function
div(S) = Σ_{x,y∈S} d(x, y).
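A small Python sketch of div(S), interpreting the sum as ranging over unordered pairs of distinct points of S and reusing dist from above.

from itertools import combinations

def diversity(S):
    # sum of d(x, y) over all unordered pairs {x, y} of points in S
    return sum(dist(x, y) for x, y in combinations(S, 2))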
Coreset-based approach to diversity maximization
Definition: (1 + ε)-coreset
Let ε ∈ (0, 1) be an accuracy parameter. A subset T ⊂ P is a
(1 + ε)-coreset for the diversity maximization problem on P if
div^opt(T, k) ≥ (1 / (1 + ε)) · div^opt(P, k).
Given P and k:
* Run Farthest-First Traversal to extract a set T′ of h centers
from P, for a suitable value h > k.
* Consider the h-clustering induced by T′ on P and select k arbitrary
points from each cluster (or all points in the cluster if there are
fewer than k).
* Gather the at most h · k points selected from the h clusters into a
coreset T, and extract the final solution S by running the best
sequential algorithm for diversity maximization on T.
What is a good choice for h?
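A hedged Python sketch of this coreset-based procedure, reusing farthest_first_traversal and Assign from the earlier sketches; seq_diversity_algorithm is a hypothetical sequential solver for diversity maximization, and the choice of h is left to the caller.

def diversity_coreset(P, k, h, seq_diversity_algorithm):
    T_prime = farthest_first_traversal(P, h)           # h > k centers extracted from P
    clusters = Assign(list(enumerate(P)), T_prime)     # h-clustering induced by T_prime
    T = []
    for pts in clusters.values():
        T.extend(x for _, x in pts[:k])                # up to k arbitrary points per cluster
    return seq_diversity_algorithm(T, k)               # final solution computed on the coreset T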
Lloyd’s algorithm
- APPLICABILITY. Used only for k-means when M = R^D, d(·, ·) is
the standard Euclidean distance (L2-distance), and centers can be
selected outside P.
- ACCURACY. If initial centers are selected well (e.g., through
k-means++) it usually provides good solutions, but, if not, it may
be trapped into local optima.
- EFFICIENCY. Efficient implementations are provided by most
common software packages, but a limit on the number of iterations
is needed in case of very slow convergence.
- SUITABILITY FOR MASSIVE INPUTS. Yes, if data can be
processed by a distributed platform and only few iterations are
executed. However, it is not suitable to process data streams.
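The card above discusses Lloyd's properties but not its steps; the following is a minimal sketch of the standard Lloyd iteration (assignment step, then recomputation of each center as the mean of its cluster), not taken from the cards. It reuses Assign from above and caps the number of iterations, as recommended for slow convergence.

def lloyd(P, initial_centers, max_iter=50):
    centers = list(initial_centers)
    for _ in range(max_iter):
        clusters = Assign(list(enumerate(P)), centers)          # assignment step
        new_centers = []
        for i, c in enumerate(centers):
            pts = [x for _, x in clusters[i]]
            if pts:
                D = len(pts[0])
                # move the center to the mean (centroid) of its cluster
                new_centers.append(tuple(sum(p[j] for p in pts) / len(pts) for j in range(D)))
            else:
                new_centers.append(c)                           # keep the old center for empty clusters
        if new_centers == centers:                              # stop if no center moved
            break
        centers = new_centers
    return centers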
k-means++ algorithm:
c1 ← random point chosen from P with uniform probability;
S ← {c1};
for 2 ≤ i ≤ k do
foreach x ∈ P − S do π(x) ← (d(x, S))² / Σ_{y∈P−S} (d(y, S))² ;
ci ← random point in P − S according to distribution π(·);
S ← S ∪ {ci}
return S
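A minimal Python version of the k-means++ pseudocode above, reusing d_point_set from the earlier sketch and assuming P contains at least k distinct points given as tuples.

import random

def kmeans_plus_plus(P, k):
    S = [random.choice(P)]                                  # c1: uniformly random point of P
    for _ in range(2, k + 1):
        rest = [x for x in P if x not in S]
        weights = [d_point_set(x, S) ** 2 for x in rest]    # π(x) proportional to (d(x, S))²
        S.append(random.choices(rest, weights=weights, k=1)[0])
    return S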
Observation: k-means++ is a randomized algorithm and it is known that in
expectation, the returned solution is an α-approximation, with α = Θ (ln k).
- APPLICABILITY. It can be used for both k-means and k-median,
and enforces S ⊂ P.
- ACCURACY. It provides decent solutions, but it is often useful to
refine them to get better accuracy.
- EFFICIENCY. Very easy and efficient implementation.
- SUITABILITY FOR MASSIVE INPUTS. Yes, if data can be
processed by a distributed platform and k is small. However, it is
not suitable to process data streams.
Partitioning Around Medoids (PAM) algorithm (a.k.a. k-medoids)
- APPLICABILITY. Mainly used for k-median, and enforces S ⊂ P.
- ACCURACY. It provides good solutions.
- EFFICIENCY. Very slow, since each iteration requires checking
about N · k possible swaps, and convergence can be very slow.
- SUITABILITY FOR MASSIVE INPUTS. Not at all!
Coreset-based approach for k-means/median
The composable coreset technique can be employed to devise k-means/median
algorithms which are able: (i) to handle massive data (in a distributed setting);
and (ii) to provide good accuracy.
- Each point x ∈ T is given a weight w(x) = the number of points
in P for which x is the closest representative in T.
- The local coresets Ti's are computed using a sequential algorithm
for k-means/median.
- The final solution S is also computed using a sequential algorithm
for k-means/median, adapted to handle weights.
Coreset-based MapReduce algorithm for k-means
MR-kmeans(A): MapReduce algorithm for k-means (described in the
next slide) which uses a sequential algorithm A for k-means, as an
argument (functional approach!), such that:
* A solves the more general weighted variant.
* A requires space proportional to the input size.
Input: Set P of N points in R^D, integer k > 1, sequential k-means
algorithm A.
Output: Set S of k centers in R^D which is a good solution to the
k-means problem on P.
Round 1:
* Map Phase: Partition P arbitrarily into ℓ subsets of equal size
P1, P2, . . . , Pℓ.
* Reduce Phase: for every i ∈ [1, ℓ] separately, run A on Pi (with unit
weights) to determine a set Ti ⊆ Pi of k centers, and define
- For each x ∈ Pi, the proxy τ (x) as x’s closest center in Ti.
- For each y ∈ Ti, the weight w(y) as the number of points of Pi
whose proxy is y.
Round 2:
* Map Phase: empty.
* Reduce Phase: gather the coreset T = ∪_{i=1}^{ℓ} Ti of ℓ · k points,
together with their weights, and run, using a single reducer, A on T
(with the given weights) to determine a set S = {c1, c2, . . . , ck } of
k centers, which is then returned as output.
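A hedged plain-Python simulation of the two rounds of MR-kmeans(A); seq_weighted_kmeans stands for the sequential weighted k-means algorithm A and is an assumption (e.g., a weighted k-means++ plus Lloyd's), and points are assumed to be hashable tuples.

def mr_kmeans(P, k, ell, seq_weighted_kmeans):
    # Round 1: partition P into ell subsets and build the weighted local coresets
    chunks = [P[i::ell] for i in range(ell)]
    coreset, weights = [], []
    for Pi in chunks:
        Ti = seq_weighted_kmeans(Pi, [1] * len(Pi), k)      # run A on Pi with unit weights
        w = {y: 0 for y in Ti}
        for x in Pi:
            proxy = min(Ti, key=lambda y: dist(x, y))       # τ(x): x's closest center in Ti
            w[proxy] += 1
        coreset.extend(Ti)
        weights.extend(w[y] for y in Ti)
    # Round 2: a single "reducer" runs A on the weighted coreset T = union of the Ti's
    return seq_weighted_kmeans(coreset, weights, k)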
Assume k ≤ √N. By setting ℓ = √(N/k), it is easy to see that
MR-kmeans(A) requires
* Local space ML = O(max{N/ℓ, ℓ · k}) = O(√(N · k)) = o(N)
* Aggregate space MA = O(N)
Lemma 1: coreset quality
Let T = ∪_{i=1}^{ℓ} Ti ⊆ P be the coreset computed by MR-kmeans(A) and let
τ : P → T be the associated proxy function.
Then, T is an α-coreset for P, k and the k-means objective, that is:
Σ_{p∈P} (d(p, τ(p)))² ≤ α · Φ^opt_kmeans(P, k).
Lemma 2: final solution quality
Let T ⊆ P, with associated proxy function τ : P → T, be an α-coreset
for P, k and the k-means objective. Suppose that the points of T are
weighted according to the proxy function τ .
Then, the solution S computed by A on the weighted coreset T is an
O(α²)-approximate solution to k-means for P, that is
Φkmeans(P, S) = O(α²) · Φ^opt_kmeans(P, k).
Observations on MR-kmeans(A)
* Two different k-means sequential algorithms can be used in R1 and
R2. For example, one could use k-means++ in R1, and k-means++
plus Lloyd's in R2.
* In practice, the algorithm is fast and accurate.
* If k′ > k centers are selected from each Pi in R1 (e.g., through
k-means++), the quality of T, hence the quality of the final
clustering, improves.
Clustering evaluation
Quality measures:
* Specific objective function, if any, which was used to select C
among all feasible clusterings. For example
k-center/k-means/k-median objective functions which, however,
capture only intra-cluster similarity.
* Silhouette coefficient, which captures both intra-cluster similarity
and inter-cluster dissimilarity.
Silhouette coefficient
Let C = (C1, C2, . . . , Ck) be a partition of P into k clusters. For any
x ∈ P, let ax be the average distance between x and the other points of
its cluster, and let bx be the minimum, over the clusters C' not containing x,
of the average distance between x and the points of C'. The silhouette
coefficient for x is
sx = (bx − ax) / max{ax, bx}.
To measure the quality of C we define the average silhouette
coefficient as
sC = (1/|P|) · Σ_{x∈P} sx.
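A minimal Python sketch of the exact silhouette computation, assuming the clustering is given as a list of at least two clusters (each a list of point tuples) and reusing dist from above.

def silhouette(clusters):
    s_values = []
    for ci, C in enumerate(clusters):
        for idx, x in enumerate(C):
            # a_x: average distance from the other points of x's own cluster
            a = (sum(dist(x, p) for j, p in enumerate(C) if j != idx) / (len(C) - 1)
                 if len(C) > 1 else 0.0)
            # b_x: minimum, over the other clusters, of the average distance from their points
            b = min(sum(dist(x, p) for p in D) / len(D)
                    for cj, D in enumerate(clusters) if cj != ci)
            s_values.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(s_values) / len(s_values)                    # average silhouette coefficient s_C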