Coreset Flashcards
(24 cards)
Coreset technique
Suppose that we want to solve a problem Π on instances P which are too
large to be processed by known algorithms.
Coreset technique
1 Extract a "small" subset T from P (dubbed coreset), making sure
that it represents P well w.r.t. solutions to Π.
2 Run the best known (possibly slow) sequential algorithm for Π on the
small coreset T, rather than on the entire input P.
The technique is effective if
* T can be extracted efficiently by processing P, possibly in a
distributed or streaming fashion.
* The solution computed on T is a good solution for Π w.r.t. the
entire input P.
Composable Coreset technique
- Partition P into ℓ subsets P1, P2, . . . , Pℓ, and extract a "small"
coreset Ti from each Pi, making sure that it represents Pi well.
- Run the best known (possibly slow) sequential algorithm for Π on
T = ∪i Ti, rather than on P.
The technique is effective if
* Each Ti can be extracted efficiently from Pi
in parallel for all i’s.
* The final coreset T is still small and the solution computed on T is a good solution for Π w.r.t. the entire input P.
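A minimal Python sketch of the composable coreset workflow just described. Here extract_coreset and solve_sequential are hypothetical problem-specific routines (assumptions, not part of these cards), and the partition is a simple round-robin split.

def composable_coreset_solve(P, ell, extract_coreset, solve_sequential):
    parts = [P[i::ell] for i in range(ell)]                  # P1, ..., P_ell (round-robin split)
    local_coresets = [extract_coreset(Pi) for Pi in parts]   # in practice, computed in parallel
    T = [x for Ti in local_coresets for x in Ti]             # T = union of the Ti's
    return solve_sequential(T)                               # best sequential algorithm, run on T only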
Metric Space
Typically, the input of a clustering problem consists of a set of points
from a metric space.
Definition
A metric space is an ordered pair (M, d) where M is a set and d(·) is a metric on M, i.e., a function
d : M × M → R
such that for every x, y, z ∈ M the following holds
* d(x, y) ≥ 0;
* d(x, y) = 0 if and only if x = y;
* d(x, y) = d(y, x); (symmetry)
* d(x, z) ≤ d(x, y) + d(y, z); (triangle inequality)
Distance functions
Minkowski distances: d_r(X, Y) = (Σ_{i=1}^{D} |x_i − y_i|^r)^{1/r}
- r = 2: Euclidean distance (square root of the sum of squared differences).
- r = 1: Manhattan distance (only axis-parallel moves, i.e., 90-degree turns).
- r = ∞: Chebyshev distance, the maximum difference over all dimensions.
Cosine (or angular) distance. It is used when points are vectors:
d_cosine(X, Y) = arccos( X · Y / (||X|| · ||Y||) )
Hamming distance: for binary vectors, the number of positions in which they differ.
Example: X = (0, 1, 1, 0, 1) and Y = (1, 1, 1, 0, 0). Distance is 2.
Jaccard distance: for sets; 1 − |intersection| / |union|.
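Minimal Python sketches of the distance functions above (not from the cards), assuming points/vectors are given as tuples or lists of numbers and sets as Python sets.

import math

def minkowski(X, Y, r):
    # r = 1: Manhattan distance; r = 2: Euclidean distance
    return sum(abs(x - y) ** r for x, y in zip(X, Y)) ** (1 / r)

def chebyshev(X, Y):
    # r = infinity: maximum gap over all dimensions
    return max(abs(x - y) for x, y in zip(X, Y))

def cosine_distance(X, Y):
    dot = sum(x * y for x, y in zip(X, Y))
    norm = math.sqrt(sum(x * x for x in X)) * math.sqrt(sum(y * y for y in Y))
    return math.acos(max(-1.0, min(1.0, dot / norm)))   # arccos(X·Y / (||X||·||Y||)), clamped for float error

def hamming(X, Y):
    # number of positions in which the two binary vectors differ
    return sum(1 for x, y in zip(X, Y) if x != y)        # hamming((0,1,1,0,1), (1,1,1,0,0)) == 2

def jaccard_distance(A, B):
    # 1 - |intersection| / |union|, for two sets A and B
    return 1 - len(A & B) / len(A | B)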
Distance functions and applications
- Minkowski: aggregates the gap in each dimension between two
objects. Example: GPS coordinates.
- Cosine: measures the ratio among feature values. Example: distance
between documents of different lengths, where each document
is characterized by the relative frequency of words from a
given vocabulary.
- Hamming distance: it measures the total number of
differences between two objects. Examples: fixed-length
binary words, genetic distance.
- Jaccard distance: it measures the ratio between the number of
elements on which two sets differ and the total number of elements
in the two sets. Example: the topics describing a document.
Approximation algorithms play a crucial role:
- For several important optimization problems, finding optimal
solutions for large inputs is very time consuming. E.g.,
  - NP-hard problems, for which it is likely that exponential time is required.
  - Problems for which polynomial-time algorithms exist but
    require a superlinear number of operations (e.g., quadratic, cubic).
- Allowing for approximations widens the spectrum of possible
solutions and makes it easier to find one.
- In real-world instances, often affected by noise or uncertainty, the
true optimum may not be well defined.
Center-based clustering
- k-center clustering: aims at minimizing the maximum distance of
any point from the center of its cluster.
- k-means clustering: aims at minimizing the sum of the squared
distances of the points from the centers of their respective clusters
(a.k.a. Sum of Squared Errors (SSE)).
- k-median clustering: aims at minimizing the sum of the distances
of the points from the centers of their respective clusters.
Let (M, d) be a metric space. The k-center/k-means/k-median clustering
problems are optimization problems that, given a finite pointset P ⊆ M
and an integer k ≤ |P|, require to return the subset S ⊆ P of k centers
which minimizes the following respective objective functions:
- Φkcenter(P, S) = max_{x∈P} d(x, S) (k-center clustering)
- Φkmeans(P, S) = Σ_{x∈P} (d(x, S))² (k-means clustering)
- Φkmedian(P, S) = Σ_{x∈P} d(x, S) (k-median clustering)
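A small Python sketch of the three objective functions, assuming the Euclidean metric and points given as tuples; the helper names dist and d_point_set are mine and are reused in the later sketches.

import math

def dist(x, y):
    # Euclidean (L2) distance between two points given as tuples
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def d_point_set(x, S):
    # d(x, S): distance of x from its closest center in S
    return min(dist(x, c) for c in S)

def phi_kcenter(P, S):
    return max(d_point_set(x, S) for x in P)        # maximum distance from the closest center

def phi_kmeans(P, S):
    return sum(d_point_set(x, S) ** 2 for x in P)   # sum of squared distances (SSE)

def phi_kmedian(P, S):
    return sum(d_point_set(x, S) for x in P)        # sum of distances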
Assignment primitive: Assign(P,S)
Consider a pointset P of N points represented as key-value pairs
(IDx, x), where IDx is a distinct integer in [0, N), and x is a point.
Let also S be a set of k centers. Considering S as global data,
show that Assign(P, S) can be implemented in MapReduce in 1
round using local space ML = O(k).
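A possible plain-Python sketch of Assign(P, S) viewed as one MapReduce round, assuming points are tuples, S is replicated to every map task as global data (hence O(k) extra local space per task), and dist is the Euclidean helper defined above.

def assign_map(pair, S):
    # Map phase: (IDx, x) -> (index of x's closest center in S, (IDx, x));
    # S is global data, so each map task only keeps O(k) extra items.
    IDx, x = pair
    best = min(range(len(S)), key=lambda i: dist(x, S[i]))
    return (best, (IDx, x))

def Assign(P, S):
    # P is a list of (IDx, x) pairs; returns {center index: list of assigned (IDx, x) pairs}
    clusters = {i: [] for i in range(len(S))}
    for pair in P:
        i, kv = assign_map(pair, S)
        clusters[i].append(kv)
    return clusters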
Observations: which objective to choose
- k-center is useful when we need to guarantee that every point
is close to a center (e.g., if centers represent locations of
crucial facilities or if they are needed as "representatives" for
the data). As we will see later, it is sensitive to noise.
- k-means/median provide guarantees on the average (squared)
distances. k-means is more sensitive to noise, but there exist
faster algorithms to solve it.
k-center: Farthest-First Traversal algorithm
Popular 2-approximation sequential algorithm; simple and somewhat fast.
Farthest-First Traversal requires k − 1 scans of the pointset P: impractical
for massive P when k is not small. -> Remedy: the composable coreset technique.
Input: Set P of N points from a metric space (M, d), integer k > 1
Output: A set S of k centers which is a good solution to the k-center
problem on P (i.e., Φkcenter(P, S) "close" to Φ^opt_kcenter(P, k))
S ← {c1} // c1 ∈ P arbitrary point
for i ← 2 to k do
Find the point ci ∈ P − S that maximizes d(ci, S)
S ← S ∪ {ci}
return S
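A minimal Python implementation of the pseudocode above, reusing dist from the earlier sketch; the first center is simply the first point of P (any arbitrary choice works). Each of the k − 1 iterations is a single scan of P, so the cost is O(N · k).

def farthest_first_traversal(P, k):
    S = [P[0]]                                    # c1: arbitrary point of P
    d_to_S = [dist(x, S[0]) for x in P]           # d_to_S[j] = d(P[j], S)
    for _ in range(k - 1):
        # pick the point that maximizes its distance from the current set of centers
        j = max(range(len(P)), key=lambda j: d_to_S[j])
        c = P[j]
        S.append(c)
        # one scan of P to update distances w.r.t. the newly added center
        d_to_S = [min(d_to_S[i], dist(P[i], c)) for i in range(len(P))]
    return S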
Lemma
For every x ∈ P we have d(x, T) ≤ 2 · Φ^opt_kcenter(P, k).
Farthest-First Traversal: analysis
Theorem
Let S be the set of centers returned by running Farthest-First Traversal
on P. Then:
Φkcenter(P, S) ≤ 2 · Φ^opt_kcenter(P, k).
That is, Farthest-First Traversal is a 2-approximation algorithm.
…
Observations on k-center clustering
- The k-center objective focuses on the worst-case distance of
points from their closest centers.
- Farthest-First Traversal's approximation guarantee is almost the
best one can obtain in polynomial time: it was proved that
computing a c-approximate solution to k-center is NP-hard
for any fixed c < 2.
- The k-center objective is very sensitive to noise. For noisy
datasets (e.g., with outliers), the clustering which optimizes
the k-center objective may obfuscate some "natural"
clustering inherent in the data. See example in the next slide.
Exact diameter computation
2-approximation to the diameter
Let dmax = max{d(x, y) : x, y ∈ P} denote the exact diameter. For an arbitrary xi ∈ P define
dmax(i) = max{d(xi, xj) : 0 ≤ j < N}.
Lemma
For any 0 ≤ i < N we have dmax ∈ [dmax(i), 2 · dmax(i)].
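A tiny Python sketch of this 2-approximation (using dist from above): a single scan of P from an arbitrary point xi.

def diameter_2approx(P, i=0):
    # maximum distance of the arbitrary point x_i from the rest of P;
    # by the lemma, the true diameter dmax lies in [dmax(i), 2 * dmax(i)]
    xi = P[i]
    return max(dist(xi, xj) for xj in P)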
Better, coreset-based diameter approximation
Coreset-based approximation:
* Fix a suitable k ≥ 2.
* Extract a coreset T ⊂ P of size k by running a k-center clustering
algorithm on P and taking the k cluster centers as set T.
* Return dT = max_{x,y∈T} d(x, y) as an approximation of dmax.
Can the approximation be computed efficiently?
If k = O(1), dT can be computed
* Sequentially: in O(N) time (using Farthest-First Traversal)
* In MapReduce: in 2 rounds, with local space ML = O(√N) and
aggregate space MA = O(N) (through MR-Farthest-First Traversal)
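A short Python sketch of the coreset-based approximation, reusing farthest_first_traversal and dist from the earlier sketches; k is assumed to be a small constant.

def diameter_coreset_approx(P, k):
    T = farthest_first_traversal(P, k)              # coreset of k cluster centers
    return max(dist(x, y) for x in T for y in T)    # d_T: maximum pairwise distance within T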
Diversity maximization
Determine the most diverse subset of size k (for a given small k)
Diversity maximization problem: remote clique variant
Given a set P of points from a metric space (M, d) and a positive integer
k < |P|, return a subset S ⊂ P of k points, which maximizes the
diversity function
div(S) = Σ_{x,y∈S} d(x, y).
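A small Python sketch of div(S), interpreting the sum as ranging over unordered pairs of distinct points of S and reusing dist from above.

from itertools import combinations

def diversity(S):
    # sum of d(x, y) over all unordered pairs {x, y} of points in S
    return sum(dist(x, y) for x, y in combinations(S, 2))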
Coreset-based approach to diversity maximization
Definition: (1 + ε)-coreset
Let ε ∈ (0, 1) be an accuracy parameter. A subset T ⊂ P is a
(1 + ε)-coreset for the diversity maximization problem on P if
div^opt(T, k) ≥ (1 / (1 + ε)) · div^opt(P, k).
Given P and k:
* Run Farthest-First Traversal to extract a set T′ of h centers
from P, for a suitable value h > k.
* Consider the h-clustering induced by T′ on P and select k arbitrary
points from each cluster (or all points in the cluster if there are
fewer than k).
* Gather the at most h · k points selected from the h clusters into a
coreset T, and extract the final solution S by running the best
sequential algorithm for diversity maximization on T.
What is a good choice for h?
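A hedged Python sketch of this coreset-based procedure, reusing farthest_first_traversal and Assign from the earlier sketches; seq_diversity_algorithm is a hypothetical sequential solver for diversity maximization, and the choice of h is left to the caller.

def diversity_coreset(P, k, h, seq_diversity_algorithm):
    T_prime = farthest_first_traversal(P, h)           # h > k centers extracted from P
    clusters = Assign(list(enumerate(P)), T_prime)     # h-clustering induced by T_prime
    T = []
    for pts in clusters.values():
        T.extend(x for _, x in pts[:k])                # up to k arbitrary points per cluster
    return seq_diversity_algorithm(T, k)               # final solution computed on the coreset T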
Lloyd’s algorithm
- APPLICABILITY. Used only for k-means when M = R^D, d(·, ·) is
the standard Euclidean distance (L2-distance), and centers can be
selected outside P.
- ACCURACY. If initial centers are selected well (e.g., through
k-means++) it usually provides good solutions, but, if not, it may
be trapped into local optima.
- EFFICIENCY. Efficient implementations are provided by most
common software packages, but a limit on the number of iterations
is needed in case of very slow convergence.
- SUITABILITY FOR MASSIVE INPUTS. Yes, if data can be
processed by a distributed platform and only few iterations are
executed. However, it is not suitable to process data streams.
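The card above discusses Lloyd's properties but not its steps; the following is a minimal sketch of the standard Lloyd iteration (assignment step, then recomputation of each center as the mean of its cluster), not taken from the cards. It reuses Assign from above and caps the number of iterations, as recommended for slow convergence.

def lloyd(P, initial_centers, max_iter=50):
    centers = list(initial_centers)
    for _ in range(max_iter):
        clusters = Assign(list(enumerate(P)), centers)          # assignment step
        new_centers = []
        for i, c in enumerate(centers):
            pts = [x for _, x in clusters[i]]
            if pts:
                D = len(pts[0])
                # move the center to the mean (centroid) of its cluster
                new_centers.append(tuple(sum(p[j] for p in pts) / len(pts) for j in range(D)))
            else:
                new_centers.append(c)                           # keep the old center for empty clusters
        if new_centers == centers:                              # stop if no center moved
            break
        centers = new_centers
    return centers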
k-means++ algorithm:
c1 ← random point chosen from P with uniform probability;
S ← {c1};
for 2 ≤ i ≤ k do
foreach x ∈ P − S do π(x) ← (d(x, S))² / Σ_{y∈P−S} (d(y, S))² ;
ci ← random point in P − S according to distribution π(·);
S ← S ∪ {ci}
return S
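A minimal Python version of the k-means++ pseudocode above, reusing d_point_set from the earlier sketch and assuming P contains at least k distinct points given as tuples.

import random

def kmeans_plus_plus(P, k):
    S = [random.choice(P)]                                  # c1: uniformly random point of P
    for _ in range(2, k + 1):
        rest = [x for x in P if x not in S]
        weights = [d_point_set(x, S) ** 2 for x in rest]    # π(x) proportional to (d(x, S))²
        S.append(random.choices(rest, weights=weights, k=1)[0])
    return S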
Observation: k-means++ is a randomized algorithm and it is known that in
expectation, the returned solution is an α-approximation, with α = Θ (ln k).
- APPLICABILITY. It can be used for both k-means and k-median,
and enforces S ⊂ P.
- ACCURACY. It provides decent solutions, but it is often useful to
refine them to get better accuracy.
- EFFICIENCY. Very easy and efficient implementation.
- SUITABILITY FOR MASSIVE INPUTS. Yes, if data can be
processed by a distributed platform and k is small. However, it is
not suitable to process data streams.
Partitioning Around Medoids (PAM) algorithm (a.k.a. k-medoids)
- APPLICABILITY. Mainly used for k-median, and enforces S ⊂ P.
- ACCURACY. It provides good solutions.
- EFFICIENCY. Very slow, since each iteration requires checking
about N · k possible swaps, and convergence can be very slow.
- SUITABILITY FOR MASSIVE INPUTS. Not at all!
Coreset-based approach for k-means/median
The composable coreset technique can be employed to devise k-means/median
algorithms which are able: (i) to handle massive data (in a distributed setting);
and (ii) to provide good accuracy.
- Each point x ∈ T is given a weight w(x) = the number of points
in P for which x is the closest representative in T.
- The local coresets Ti's are computed using a sequential algorithm
for k-means/median.
- The final solution S is also computed using a sequential algorithm
for k-means/median, adapted to handle weights.
Coreset-based MapReduce algorithm for k-means
MR-kmeans(A): MapReduce algorithm for k-means (described in the
next slide) which uses a sequential algorithm A for k-means, as an
argument (functional approach!), such that:
* A solves the more general weighted variant.
* A requires space proportional to the input size.
Input: Set P of N points in R^D, integer k > 1, sequential k-means
algorithm A.
Output: Set S of k centers in R^D which is a good solution to the
k-means problem on P.
Round 1:
* Map Phase: Partition P arbitrarily into ℓ subsets of equal size
P1, P2, . . . , Pℓ.
* Reduce Phase: for every i ∈ [1, ℓ] separately, run A on Pi (with unit
weights) to determine a set Ti ⊆ Pi of k centers, and define
- For each x ∈ Pi, the proxy τ (x) as x’s closest center in Ti.
- For each y ∈ Ti, the weight w(y) as the number of points of Pi
whose proxy is y.
Round 2:
* Map Phase: empty.
* Reduce Phase: gather the coreset T = ∪_{i=1}^{ℓ} Ti of ℓ · k points,
together with their weights, and run, using a single reducer, A on T
(with the given weights) to determine a set S = {c1, c2, . . . , ck } of
k centers, which is then returned as output.
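A hedged plain-Python simulation of the two rounds of MR-kmeans(A); seq_weighted_kmeans stands for the sequential weighted k-means algorithm A and is an assumption (e.g., a weighted k-means++ plus Lloyd's), and points are assumed to be hashable tuples.

def mr_kmeans(P, k, ell, seq_weighted_kmeans):
    # Round 1: partition P into ell subsets and build the weighted local coresets
    chunks = [P[i::ell] for i in range(ell)]
    coreset, weights = [], []
    for Pi in chunks:
        Ti = seq_weighted_kmeans(Pi, [1] * len(Pi), k)      # run A on Pi with unit weights
        w = {y: 0 for y in Ti}
        for x in Pi:
            proxy = min(Ti, key=lambda y: dist(x, y))       # τ(x): x's closest center in Ti
            w[proxy] += 1
        coreset.extend(Ti)
        weights.extend(w[y] for y in Ti)
    # Round 2: a single "reducer" runs A on the weighted coreset T = union of the Ti's
    return seq_weighted_kmeans(coreset, weights, k)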
Assume k ≤ √N. By setting ℓ = √(N/k), it is easy to see that
MR-kmeans(A) requires
* Local space ML = O(max{N/ℓ, ℓ · k}) = O(√(N · k)) = o(N)
* Aggregate space MA = O(N)
Lemma 1: coreset quality
Let T = ∪_{i=1}^{ℓ} Ti ⊆ P be the coreset computed by MR-kmeans(A) and let
τ : P → T be the associated proxy function.
Then, T is an α-coreset for P, k and the k-means objective, that is:
Σ_{p∈P} (d(p, τ(p)))² ≤ α · Φ^opt_kmeans(P, k).
Lemma 2: final solution quality
Let T ⊆ P, with associated proxy function τ : P → T, be an α-coreset
for P, k and the k-means objective. Suppose that the points of T are
weighted according to the proxy function τ .
Then, the solution S computed by A on the weighted coreset T is an
O(α²)-approximate solution to k-means for P, that is
Φkmeans(P, S) = O(α²) · Φ^opt_kmeans(P, k).
Observations on MR-kmeans(A)
* Two different k-means sequential algorithms can be used in R1 and
R2. For example, one could use k-means++ in R1, and k-means++
plus Lloyd's in R2.
* In practice, the algorithm is fast and accurate.
* If k′ > k centers are selected from each Pi in R1 (e.g., through
k-means++), the quality of T, hence the quality of the final
clustering, improves.
Clustering evaluation
Quality measures:
* Specific objective function, if any, which was used to select C
among all feasible clusterings. For example
k-center/k-means/k-median objective functions which, however,
capture only intra-cluster similarity.
* Silhouette coefficient, which captures both intra-cluster similarity
and inter-cluster dissimilarity.
Silhouette coefficient
Let C = (C1, C2, . . . , Ck) be a partition of P into k clusters. For any
x ∈ P, let ax be the average distance between x and the other points of
its cluster, and let bx be the minimum, over the clusters C' not containing x,
of the average distance between x and the points of C'. The silhouette
coefficient for x is
sx = (bx − ax) / max{ax, bx}.
To measure the quality of C we define the average silhouette
coefficient as
sC = (1/|P|) · Σ_{x∈P} sx.
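A minimal Python sketch of the exact silhouette computation, assuming the clustering is given as a list of at least two clusters (each a list of point tuples) and reusing dist from above.

def silhouette(clusters):
    s_values = []
    for ci, C in enumerate(clusters):
        for idx, x in enumerate(C):
            # a_x: average distance from the other points of x's own cluster
            a = (sum(dist(x, p) for j, p in enumerate(C) if j != idx) / (len(C) - 1)
                 if len(C) > 1 else 0.0)
            # b_x: minimum, over the other clusters, of the average distance from their points
            b = min(sum(dist(x, p) for p in D) / len(D)
                    for cj, D in enumerate(clusters) if cj != ci)
            s_values.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(s_values) / len(s_values)                    # average silhouette coefficient s_C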