Coreset Flashcards

1
Q

Coreset technique

A

Suppose that we want to solve a problem Π on instances P which are too
large to be processed by known algorithms.

Coreset technique
1. Extract a "small" subset T from P (dubbed coreset), making sure
that it represents P well, w.r.t. solutions to Π.
2. Run the best known (possibly slow) sequential algorithm for Π on the
small coreset T, rather than on the entire input P.

The technique is effective if
* T can be extracted efficiently by processing P, possibly in a
distributed or streaming fashion.
* The solution computed on T is a good solution for Π w.r.t. the
entire input P.

2
Q

Composable Coreset technique

A
  1. Partition P into ℓ subsets P1, P2, . . . , Pℓ, and extract a "small"
    coreset Ti from each Pi, making sure that it represents Pi well.
  2. Run the best known (possibly slow) sequential algorithm for Π on
    T = T1 ∪ T2 ∪ . . . ∪ Tℓ, rather than on P (see the sketch below).

The technique is effective if
* Each Ti can be extracted efficiently from Pi, in parallel for all i's.
* The final coreset T is still small and the solution computed on T is a good solution for Π w.r.t. the entire input P.
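A minimal Python sketch of the composable coreset pattern, with extract_coreset and sequential_alg as placeholders for the problem-specific routines (both names are illustrative, not from the slides):

def composable_coreset_solve(P, ell, extract_coreset, sequential_alg):
    """Composable coreset pattern:
    1) partition P into ell subsets;
    2) extract a small coreset T_i from each subset (these calls can run in parallel);
    3) run the (possibly slow) sequential algorithm on the union of the T_i's."""
    parts = [P[i::ell] for i in range(ell)]        # arbitrary partition into ell subsets
    coresets = [extract_coreset(Pi) for Pi in parts]
    T = [x for Ti in coresets for x in Ti]         # T = T_1 ∪ ... ∪ T_ell
    return sequential_alg(T)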

3
Q

Metric Space

A

Typically, the input of a clustering problem consists of a set of points
from a metric space.

Definition
A metric space is an ordered pair (M, d) where M is a set and d(·) is a metric on M, i.e., a function
d : M × M → R
such that for every x, y, z ∈ M the following holds
* d(x, y) ≥ 0;
* d(x, y) = 0 if and only if x = y;
* d(x, y) = d(y, x); (symmetry)
* d(x, z) ≤ d(x, y) + d(y, z); (triangle inequality)

4
Q

Distance functions

A

Minkowski distance of order r: d_r(X, Y) = (Σ_i |x_i − y_i|^r)^(1/r)

* r = 2: Euclidean distance (square root of the sum of squared coordinate differences).
* r = 1: Manhattan distance (movement only along the coordinate axes, i.e., at 90-degree angles).
* r = ∞: Chebyshev distance (maximum difference over all dimensions).

Cosine (or angular) distance: used when points are vectors.
d_cosine(X, Y) = arccos((X · Y) / (||X|| · ||Y||))

Hamming distance: for binary vectors, the number of positions in which they differ.
Example: X = (0, 1, 1, 0, 1) and Y = (1, 1, 1, 0, 0) have Hamming distance 2.

Jaccard distance: for sets; 1 − |intersection| / |union|.
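A minimal Python sketch of these distance functions (for illustration; the function names are mine, not from the slides):

import math

def minkowski(x, y, r):
    """Minkowski distance of order r between two equal-length vectors."""
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1.0 / r)

def chebyshev(x, y):
    """Limit case r = infinity: maximum coordinate-wise difference."""
    return max(abs(a - b) for a, b in zip(x, y))

def cosine_distance(x, y):
    """Angular distance: arccos of the cosine similarity."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return math.acos(dot / (norm_x * norm_y))

def hamming(x, y):
    """Number of positions in which two equal-length vectors differ."""
    return sum(1 for a, b in zip(x, y) if a != b)

def jaccard(A, B):
    """1 - |intersection| / |union| for two sets."""
    return 1 - len(A & B) / len(A | B)

# Example from the card: Hamming distance between (0,1,1,0,1) and (1,1,1,0,0) is 2.
assert hamming((0, 1, 1, 0, 1), (1, 1, 1, 0, 0)) == 2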

5
Q

Distance functions and applications

A

Minkowski: aggregates the gap in each dimension between two objects. Example: GPS coordinates.

Cosine: measures the ratio among feature values. Example: distance
between documents of different lengths, where each document
is characterized by the relative frequency of words from a
given vocabulary.

Hamming distance: it measures the total number of
differences between two objects. Examples: fixed-length
binary words, genetic distance.

  • Jaccard distance: it measures the ratio between the number
    of differences between two sets and the total number of points
    in the sets. Example: the topics describing a document.
6
Q

Approximation algorithms play a crucial role in practice

A
  • For several important optimization problems, finding optimal solutions for large inputs is very time consuming. E.g.:
    - NP-hard problems, for which it is likely that exponential time is required.
    - Problems for which polynomial-time algorithms exist but
      require a superlinear number of operations (e.g., quadratic,
      cubic).
  • Allowing for approximations widens the spectrum of possible
    solutions and makes it easier to find one.
  • In real-world instances, often affected by noise or uncertainty, the
    true optimum may not be well defined.
7
Q

Center-based clustering

A
  • k-center clustering: aims at minimizing the maximum distance of
    any point from the center of its cluster
  • k-means clustering: aims at minimizing the sum of the squared
    distances of the points from the centers of their respective clusters
    (a.k.a. Sum of Squared Errors (SSE))
  • k-median clustering: aims at minimizing the sum of the distances
    of the points from the centers of their respective clusters

Let (M, d) be a metric space. The k-center/k-means/k-median clustering
problems are optimization problems that, given a finite pointset P ⊆ M
and an integer k ≤ |P|, require returning the subset S ⊆ P of k centers
which minimizes the following respective objective functions:

  • Φ_kcenter(P, S) = max_{x∈P} d(x, S) (k-center clustering)
  • Φ_kmeans(P, S) = Σ_{x∈P} (d(x, S))² (k-means clustering)
  • Φ_kmedian(P, S) = Σ_{x∈P} d(x, S) (k-median clustering)

where d(x, S) = min_{c∈S} d(x, c) is the distance of x from its closest center in S.
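A minimal Python sketch of the three objective functions, assuming the Euclidean distance as the metric (helper names are illustrative):

def dist(x, y):
    """Example metric: Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

def d_point_set(x, S):
    """d(x, S): distance from x to its closest center in S."""
    return min(dist(x, c) for c in S)

def phi_kcenter(P, S):
    """Maximum distance of any point of P from its closest center."""
    return max(d_point_set(x, S) for x in P)

def phi_kmeans(P, S):
    """Sum of squared distances (SSE)."""
    return sum(d_point_set(x, S) ** 2 for x in P)

def phi_kmedian(P, S):
    """Sum of distances."""
    return sum(d_point_set(x, S) for x in P)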
8
Q

Assignment primitive: Assign(P,S)

A

Consider a pointset P of N points represented as key-value pairs
(IDx , x), where IDx is a distinct integer in [0, N), and x is a point.
Let also S be a set of k centers. Considering S as global data,
show that Assign(P, S) can be implemented in MapReduce in 1
round using ML = O (k).
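One possible 1-round implementation (a sketch of the idea, not the official solution): since S is global data of k points, each mapper keeps S locally, using M_L = O(k) space, and maps every pair (IDx, x) to (i, x), where i is the index of x's closest center in S; the reduce phase is empty (or simply groups the points sharing the same center). Function names below are illustrative:

def assign_map(pair, S, dist):
    """Map phase of Assign(P, S): S is global data of size k (M_L = O(k)).
    The pair (ID_x, x) is mapped to (i, x), where i is the index of the
    center of S closest to x."""
    _id_x, x = pair
    i = min(range(len(S)), key=lambda j: dist(x, S[j]))
    return (i, x)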

9
Q

Observations: which clustering objective to choose?

A
  • k-center is useful when we need to guarantee that every point
    is close to a center (e.g., if centers represent locations of
    crucial facilities or if they are needed as ”representatives” for
    the data). As we will see later, it is sensitive to noise.
  • k-means/median provide guarantees on the average (square)
    distances. k-means is more sensitive to noise but there exist
    faster algorithms to solve it.
10
Q

k-center Farthest-First Traversal: algorithm

A

Popular 2-approximation sequential algorithm

Farthest-First Traversal requires k − 1 scans of
the pointset P: impractical for massive P when k is not small. To address this: composable coreset technique.

Simple and somewhat fast.

Input Set P of N points from a metric space (M, d), integer k > 1

Output A set S of k centers which is a good solution to the k-center
problem on P (i.e., Φ_kcenter(P, S) "close" to Φ^opt_kcenter(P, k))

S ← {c1} // c1 ∈ P arbitrary point
for i ← 2 to k do
    find the point ci ∈ P − S that maximizes d(ci, S)
    S ← S ∪ {ci}
return S

Lemma
For every x ∈ P we have d(x, T) ≤ 2 · Φ^opt_kcenter(P, k).
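A minimal Python sketch of the pseudocode above, assuming a metric dist is supplied by the caller (not the course's reference implementation):

def farthest_first_traversal(P, k, dist):
    """Farthest-First Traversal: returns k centers from the list of points P
    (a 2-approximation for k-center)."""
    S = [P[0]]                                   # c1: an arbitrary point of P
    d_to_S = [dist(x, S[0]) for x in P]          # current distance of each point from S
    for _ in range(2, k + 1):
        i_max = max(range(len(P)), key=lambda i: d_to_S[i])   # farthest point from S
        c = P[i_max]
        S.append(c)
        d_to_S = [min(d_to_S[i], dist(P[i], c)) for i in range(len(P))]
    return S

Each of the k − 1 iterations scans P once, so the running time is O(N · k).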

11
Q

Farthest-First Traversal: analysis
Theorem
Let S be the set of centers returned by running Farthest-First Traversal
on P. Then:
Φ_kcenter(P, S) ≤ 2 · Φ^opt_kcenter(P, k).
That is, Farthest-First Traversal is a 2-approximation algorithm

A

Proof idea: let x ∈ P be the point attaining Φ_kcenter(P, S) = d(x, S). By construction of Farthest-First Traversal, the k + 1 points c1, . . . , ck, x are pairwise at distance ≥ d(x, S). By the pigeonhole principle, two of them fall in the same cluster of an optimal k-center solution, so, by the triangle inequality, their distance is at most 2 · Φ^opt_kcenter(P, k). Hence Φ_kcenter(P, S) = d(x, S) ≤ 2 · Φ^opt_kcenter(P, k).

12
Q

Observations on k-center clustering

A

The k-center objective focuses on worst-case distance of
points from their closest centers.

  • Farthest-First Traversal’s approximation guarantees are almost
    the best one can obtain in practice. It was proved that
    computing a c-approximate solution to k-center is NP-hard
    for any fixed c < 2.

The k-center objective is very sensitive to noise. For noisy
datasets (e.g., with outliers), the clustering which optimizes
the k-center objective may obfuscate some "natural"
clustering inherent in the data.

13
Q

Exact diameter computation

A

Computing the diameter dmax = max{d(x, y) : x, y ∈ P} exactly requires all Θ(N²) pairwise distances: infeasible for massive P.

2-approximation to the diameter

For an arbitrary xi ∈ P define
dmax(i) = max{d(xi, xj) : 0 ≤ j < N}.
Lemma
For any 0 ≤ i < N we have dmax ∈ [dmax(i), 2 · dmax(i)].
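A minimal Python sketch of this single-scan approximation, assuming a metric dist (illustrative names):

def dmax_i(P, i, dist):
    """dmax(i): maximum distance from the (arbitrary) point P[i] to any point of P.
    By the lemma above, the true diameter lies in [dmax(i), 2 * dmax(i)]."""
    return max(dist(P[i], x) for x in P)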

14
Q

Better, coreset-based diameter approximation

A

Coreset-based approximation:
* Fix a suitable k ≥ 2.
* Extract a coreset T ⊂ P of size k by running a k-center clustering
algorithm on P and taking the k cluster centers as set T.
* Return dT = max_{x,y∈T} d(x, y) as an approximation of dmax.

Can the approximation be computed efficiently?
If k = O(1), dT can be computed
* Sequentially: in O(N) time (using Farthest-First Traversal)
* In MapReduce: in 2 rounds, with local space M_L = O(√N) and
aggregate space M_A = O(N) (through MR-Farthest-First Traversal)
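A minimal Python sketch of the sequential version, reusing the farthest_first_traversal sketch given earlier (illustrative, assuming a metric dist):

def coreset_diameter(P, k, dist):
    """Coreset-based diameter approximation: take the k centers returned by
    Farthest-First Traversal as coreset T, then return the exact diameter of T."""
    T = farthest_first_traversal(P, k, dist)      # coreset of size k, O(N * k) time
    return max(dist(x, y) for x in T for y in T)  # O(k^2) pairs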

15
Q

Diversity maximization

A

Determine the most diverse subset of size k (for a given small k)

Diversity maximization problem: remote clique variant

Given a set P of points from a metric space (M, d) and a positive integer
k < |P|, return a subset S ⊂ P of k points, which maximizes the
diversity function
div(S) = Σ_{x,y∈S} d(x, y).
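A minimal Python sketch of the diversity function, interpreting the sum as ranging over unordered pairs of distinct points (assuming a metric dist):

from itertools import combinations

def diversity(S, dist):
    """div(S): sum of the pairwise distances among the points of S."""
    return sum(dist(x, y) for x, y in combinations(S, 2))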

16
Q

Coreset-based approach to diversity maximization

A

Definition: (1 + ε)-coreset
Let ε ∈ (0, 1) be an accuracy parameter. A subset T ⊂ P is a
(1 + ε)-coreset for the diversity maximization problem on P if
div_opt(T, k) ≥ (1 / (1 + ε)) · div_opt(P, k).

Given P and k:
* Run Farthest-First Traversal to extract a set T′ of h > k centers
from P, for a suitable value h.
* Consider the h-clustering induced by T′ on P and select k arbitrary
points from each cluster (or all points in the cluster if it contains fewer
than k).
* Gather the at most h · k points selected from the h clusters into a
coreset T, and extract the final solution S by running the best
sequential algorithm for diversity maximization on T (see the sketch below).

What is a good choice for h?
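A minimal Python sketch of the coreset construction above, reusing the earlier farthest_first_traversal sketch; sequential_div_max stands for the best sequential diversity-maximization algorithm and is a placeholder, not a real routine:

def diversity_coreset_solve(P, k, h, dist, sequential_div_max):
    """Coreset-based diversity maximization:
    1) extract h > k centers T' with Farthest-First Traversal;
    2) from each of the h induced clusters keep up to k arbitrary points;
    3) run the sequential algorithm on the resulting coreset T (|T| <= h * k)."""
    T_prime = farthest_first_traversal(P, h, dist)
    clusters = [[] for _ in range(len(T_prime))]
    for x in P:                                   # h-clustering induced by T'
        i = min(range(len(T_prime)), key=lambda j: dist(x, T_prime[j]))
        clusters[i].append(x)
    T = [x for pts in clusters for x in pts[:k]]  # at most k points per cluster
    return sequential_div_max(T, k)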

17
Q

Lloyd’s algorithm

A
  • APPLICABILITY. Used only for k-means when M = R^D, d(·, ·) is
    the standard Euclidean distance (L2-distance), and centers can be
    selected outside P.
  • ACCURACY. If initial centers are selected well (e.g., through
    k-means++) it usually provides good solutions, but, if not, it may
    be trapped into local optima.
  • EFFICIENCY. Efficient implementations are provided by most
    common software packages, but a limit on the number of iterations
    is needed in case of very slow convergence.
  • SUITABILITY FOR MASSIVE INPUTS. Yes, if data can be
    processed by a distributed platform and only a few iterations are
    executed. However, it is not suitable to process data streams.
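The card lists only the properties of Lloyd's algorithm; for reference, a minimal sketch of one standard Lloyd's iteration in R^D (assign each point to its closest center, then move each center to the centroid of its cluster), with points given as tuples:

def lloyd_iteration(P, centers):
    """One iteration of Lloyd's algorithm on points of R^D given as tuples."""
    k = len(centers)
    clusters = [[] for _ in range(k)]
    for x in P:
        i = min(range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(x, centers[j])))
        clusters[i].append(x)
    new_centers = []
    for i, pts in enumerate(clusters):
        if not pts:                               # keep the center of an empty cluster
            new_centers.append(centers[i])
        else:                                     # centroid of the cluster
            new_centers.append(tuple(sum(coord) / len(pts) for coord in zip(*pts)))
    return new_centers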
18
Q

k-means++ algorithm:

A

c1 ← random point chosen from P with uniform probability;
S ← {c1};
for 2 ≤ i ≤ k do
    foreach x ∈ P − S do π(x) ← (d(x, S))² / Σ_{y∈P−S} (d(y, S))²;
    ci ← random point in P − S chosen according to distribution π(·);
    S ← S ∪ {ci}
return S

Observation: k-means++ is a randomized algorithm and it is known that, in
expectation, the returned solution is an α-approximation, with α = Θ(ln k).
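A minimal Python sketch of the seeding procedure above (assuming a metric dist; names are illustrative):

import random

def kmeans_pp(P, k, dist):
    """k-means++ seeding: the first center is chosen uniformly at random; each
    subsequent center is drawn from P - S with probability proportional to
    (d(x, S))^2."""
    S = [random.choice(P)]
    while len(S) < k:
        candidates = [x for x in P if x not in S]
        weights = [min(dist(x, c) for c in S) ** 2 for x in candidates]
        S.append(random.choices(candidates, weights=weights, k=1)[0])
    return S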

  • APPLICABILITY. It can be used for both k-means and k-median,
    and enforces S ⊂ P.
  • ACCURACY. It provides decent solutions, but it is often useful to
    refine them to get better accuracy.
  • EFFICIENCY. Very easy and efficient implementation.
  • SUITABILITY FOR MASSIVE INPUTS. Yes if data can be
    processed by a distributed platform and k is small. However, it is
    not suitable to process data streams.
19
Q

Partitioning Around Medoids (PAM) algorithm (a.k.a. k-medoids)

A
  • APPLICABILITY. Mainly used for k-median, and enforces S ⊂ P.
  • ACCURACY. It provides good solutions.
  • EFFICIENCY. Very slow, since each iteration requires checking
    about N · k possible swaps and convergence can be very slow.
  • SUITABILITY FOR MASSIVE INPUTS. Not at all!
20
Q

Coreset-based approach for k-means/median

A

The composable coreset technique can be employed to devise k-means/median
algorithms which are able: (i) to handle massive data (in a distributed setting);
and (ii) to provide good accuracy.

  • Each point x ∈ T is given a weight
    w(x) = the number of points in P for
    which x is the closest representative in T.
  • The local coresets Ti's are computed
    using a sequential algorithm for
    k-means/median.
  • The final solution S is also computed
    using a sequential algorithm for
    k-means/median, adapted to handle
    weights.
21
Q

Coreset-based MapReduce algorithm for k-means

A

MR-kmeans(A): MapReduce algorithm for k-means (described below) which
uses a sequential algorithm A for k-means as an argument (functional
approach!), such that:
* A solves the more general weighted variant.
* A requires space proportional to the input size.

Input Set P of N points in R^D, integer k > 1, sequential k-means
algorithm A.
Output Set S of k centers in R^D which is a good solution to the
k-means problem on P.

Round 1:
* Map Phase: Partition P arbitrarily into ℓ subsets of equal size
P1, P2, . . . , Pℓ.
* Reduce Phase: for every i ∈ [1, ℓ] separately, run A on Pi (with unit
weights) to determine a set Ti ⊆ Pi of k centers, and define
- For each x ∈ Pi, the proxy τ(x) as x's closest center in Ti.
- For each y ∈ Ti, the weight w(y) as the number of points of Pi
whose proxy is y.
Round 2:
* Map Phase: empty.
* Reduce Phase: gather the coreset T = ∪_{i=1}^{ℓ} Ti of ℓ · k points,
together with their weights, and run, using a single reducer, A on T
(with the given weights) to determine a set S = {c1, c2, . . . , ck} of
k centers, which is then returned as output.
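A minimal Python sketch of MR-kmeans(A) as a sequential simulation of the two rounds (in the real algorithm the Round 1 loop runs in parallel on ℓ reducers). Here A(points, weights, k) stands for any sequential weighted k-means algorithm supplied by the caller; names are illustrative:

def mr_kmeans(P, k, ell, A, dist):
    """Round 1: partition P, build one weighted local coreset per subset.
    Round 2: run A on the union of the weighted coresets (single reducer)."""
    parts = [P[i::ell] for i in range(ell)]       # arbitrary partition into ell subsets
    coreset, weights = [], []
    for Pi in parts:
        Ti = A(Pi, [1] * len(Pi), k)              # k centers of Pi, unit weights
        w = [0] * len(Ti)
        for x in Pi:                              # proxy tau(x): closest center in Ti
            j = min(range(len(Ti)), key=lambda t: dist(x, Ti[t]))
            w[j] += 1
        coreset.extend(Ti)
        weights.extend(w)
    return A(coreset, weights, k)                 # final k centers from the coreset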

Assume k ≤ √N. By setting ℓ = √(N/k), it is easy to see that
MR-kmeans(A) requires
* Local space M_L = O(max{N/ℓ, ℓ · k}) = O(√(N · k)) = o(N)
* Aggregate space M_A = O(N)

Lemma 1: coreset quality
Let T = ∪_{i=1}^{ℓ} Ti ⊆ P be the coreset computed by MR-kmeans(A) and let
τ : P → T be the associated proxy function.
Then, T is an α-coreset for P, k and the k-means objective, that is:
Σ_{p∈P} (d(p, τ(p)))² ≤ α · Φ^opt_kmeans(P, k).

Lemma 2: final solution quality
Let T ⊆ P, with associated proxy function τ : P → T, be an α-coreset
for P, k and the k-means objective. Suppose that the points of T are
weighted according to the proxy function τ.
Then, the solution S computed by A on the weighted coreset T is an
O(α²)-approximate solution to k-means for P, that is:
Φ_kmeans(P, S) = O(α²) · Φ^opt_kmeans(P, k).

22
Q

Observations on MR-kmeans(A)

A

Two different k-means sequential algorithms can be used in R1 and
R2. For example, one could use k-means++ in R1, and k-means++
plus Lloyd's in R2.
* In practice, the algorithm is fast and accurate.
* If k′ > k centers are selected from each Pi in R1 (e.g., through
k-means++), the quality of T, hence the quality of the final
clustering, improves.

23
Q

Clustering evaluation

A

Quality measures:
* Specific objective function, if any, which was used to select C
among all feasible clusterings. For example
k-center/k-means/k-median objective functions which, however,
capture only intra-cluster similarity.
* Silhouette coefficient, which captures both intra-cluster similarity
and inter-cluster dissimilarity.

24
Q

Silhouette coefficient

A

Let C = (C1, C2, . . . , Ck) be a partition of P into k clusters. For any
x ∈ P, let ax be the average distance between x and the other points of
its cluster, and bx the minimum, over the other clusters Cj, of the average
distance between x and the points of Cj. The silhouette coefficient of x is
sx = (bx − ax) / max{ax, bx}.
To measure the quality of C we define the average silhouette
coefficient as
sC = (1/|P|) · Σ_{x∈P} sx.
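A minimal Python sketch of the silhouette computation, taking the clustering as a list of lists of points and assuming a metric dist (singleton clusters get coefficient 0 here, a common convention):

from statistics import mean

def average_silhouette(C, dist):
    """Average silhouette coefficient of a clustering C = [C1, ..., Ck].
    For each point x: a_x = mean distance to the other points of its cluster,
    b_x = smallest mean distance to the points of another cluster."""
    total, n = 0.0, 0
    for i, Ci in enumerate(C):
        for x in Ci:
            others = [p for p in Ci if p is not x]
            a = mean(dist(x, p) for p in others) if others else 0.0
            b = min(mean(dist(x, p) for p in Cj)
                    for j, Cj in enumerate(C) if j != i)
            total += (b - a) / max(a, b) if max(a, b) > 0 else 0.0
            n += 1
    return total / n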