Coreset Flashcards
Coreset technique
Suppose that we want to solve a problem Π on instances P which are too
large to be processed by known algorithms.
1 Extract a ”small” subset T from P (dubbed coreset), making sure
that it represents P well, w.r.t. solutions to Π.
2 Run best known (possibly slow) sequential algorithm for Π on the
small coreset T, rather than on the entire input P.
The technique is effective if
* T can be extracted efficiently by processing P, possibly in a
distributed or streaming fashion.
* The solution computed on T is a good solution for Π w.r.t. the
entire input P.
Composable Coreset technique
- Partition P into l subsets P1, P2, . . . , Pl, and extract a ”small”
coreset Ti from each Pi, making sure that it represents Pi well.
- Run best known (possibly slow) sequential algorithm for Π on
T = ∪i Ti, rather than on P.
The technique is effective if
* Each Ti can be extracted efficiently from Pi
in parallel for all i’s.
* The final coreset T is still small and the solution computed on T is a good solution for Π w.r.t. the entire input P.
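A minimal Python sketch of this pattern, assuming two hypothetical helpers supplied by the caller: coreset_of(Pi), which extracts a small representative subset from a partition, and solve_seq(T), the (possibly slow) sequential algorithm for Π.

# Sketch of the composable coreset pattern (coreset_of and solve_seq are
# hypothetical placeholders supplied by the caller).
def composable_coreset_solve(P, ell, coreset_of, solve_seq):
    # 1. Partition P into ell subsets P_1, ..., P_ell
    partitions = [P[i::ell] for i in range(ell)]
    # 2. Extract a small coreset T_i from each P_i (in practice, in parallel)
    coresets = [coreset_of(P_i) for P_i in partitions]
    # 3. Run the sequential algorithm on T = union of the T_i's
    T = [x for T_i in coresets for x in T_i]
    return solve_seq(T)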
Metric Space
Typically, the input of a clustering problem consists of a set of points
from a metric space.
Definition
A metric space is an ordered pair (M, d) where M is a set and d(·) is a metric on M, i.e., a function
d : M × M → R
such that for every x, y, z ∈ M the following holds
* d(x, y) ≥ 0;
* d(x, y) = 0 if and only if x = y;
* d(x, y) = d(y, x); (symmetry)
* d(x, z) ≤ d(x, y) + d(y, z); (triangle inequality)
Distance functions
Minkowski distances: dr(X, Y) = (Σi |xi − yi|^r)^(1/r)
* r = 2: Euclidean distance (square root of the sum of squared
coordinate differences).
* r = 1: Manhattan distance (only axis-aligned, 90-degree moves).
* r = ∞: Chebyshev distance (maximum gap over all dimensions).
Cosine (or angular) distance. It is used when points are vectors:
dcosine(X, Y) = arccos((X · Y) / (||X|| · ||Y||))
Hamming distance: binary vectors; number of positions in which they
differ. E.g., X = (0, 1, 1, 0, 1) and Y = (1, 1, 1, 0, 0): distance is 2.
Jaccard distance: sets; 1 − |intersection| / |union|.
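A plain Python sketch of these distance functions (for illustration only; inputs are tuples of numbers, or Python sets for Jaccard):

import math

def minkowski(x, y, r):
    # r = 1: Manhattan, r = 2: Euclidean
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1 / r)

def chebyshev(x, y):
    # Minkowski distance for r = infinity: max gap over all dimensions
    return max(abs(a - b) for a, b in zip(x, y))

def cosine_dist(x, y):
    # angular distance: arccos of the cosine similarity
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    # clamp to [-1, 1] to guard against floating-point round-off
    return math.acos(max(-1.0, min(1.0, dot / (norm_x * norm_y))))

def hamming(x, y):
    # number of positions where the two binary vectors differ
    return sum(1 for a, b in zip(x, y) if a != b)

def jaccard(A, B):
    # 1 - |A intersection B| / |A union B|, for Python sets A and B
    return 1 - len(A & B) / len(A | B)

# e.g., hamming((0, 1, 1, 0, 1), (1, 1, 1, 0, 0)) == 2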
Distance functions and applications
- Minkowski: aggregates the gap in each dimension between two
objects. Example: GPS/spatial coordinates.
- Cosine: measures the ratio among feature values. Example: distance
between documents of different lengths, where each document
is characterized by the relative frequency of words from a
given vocabulary.
- Hamming distance: measures the total number of
differences between two objects. Examples: fixed-length
binary words, genetic distance.
- Jaccard distance: measures the ratio between the number
of differences between two sets and the total number of points
in the sets. Example: the topics describing a document.
Approximation algorithms play a crucial role:
- For several important optimization problems, finding optimal
solutions for large inputs is very time consuming. E.g.,
  - NP-hard problems, for which exponential time is likely required.
  - Problems for which polynomial-time algorithms exist but
    require a superlinear number of operations (e.g., quadratic,
    cubic).
- Allowing for approximations widens the spectrum of possible
solutions and makes it easier to find one.
- In real-world instances, often affected by noise or uncertainty, the
true optimum may not be well defined.
Center-based clustering
- k-center clustering: aims at minimizing the maximum distance of
any point from the center of its cluster.
- k-means clustering: aims at minimizing the sum of the squared
distances of the points from the centers of their respective clusters
(a.k.a. Sum of Squared Errors (SSE)).
- k-median clustering: aims at minimizing the sum of the distances
of the points from the centers of their respective clusters.
Let (M, d) be a metric space. The k-center/k-means/k-median clustering
problems are optimization problems that, given a finite pointset P ⊆ M
and an integer k ≤ |P|, require to return the subset S ⊆ P of k centers
which minimizes the following respective objective functions:
- Φkcenter(P, S) = max{d(x, S) : x ∈ P} (k-center clustering)
- Φkmeans(P, S) = Σx∈P (d(x, S))^2 (k-means clustering)
- Φkmedian(P, S) = Σx∈P d(x, S) (k-median clustering)
where d(x, S) = min{d(x, c) : c ∈ S} is the distance of x from its
closest center in S.
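These objectives translate directly into code; a small sketch, where d is any metric (e.g., one of the distance functions sketched earlier):

def dist_to_set(x, S, d):
    # d(x, S) = min over the centers c in S of d(x, c)
    return min(d(x, c) for c in S)

def phi_kcenter(P, S, d):
    return max(dist_to_set(x, S, d) for x in P)

def phi_kmeans(P, S, d):
    return sum(dist_to_set(x, S, d) ** 2 for x in P)

def phi_kmedian(P, S, d):
    return sum(dist_to_set(x, S, d) for x in P)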
Assignment primitive: Assign(P, S)
Consider a pointset P of N points represented as key-value pairs
(IDx, x), where IDx is a distinct integer in [0, N), and x is a point.
Let also S be a set of k centers. Considering S as global data,
show that Assign(P, S), which assigns each point of P to its closest
center in S, can be implemented in MapReduce in 1 round using local
space ML = O(k).
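A possible one-round sketch (an illustration, not the official solution): since S is global data, each pair (IDx, x) can be mapped independently to its closest center, so a map task only stores the k centers plus one point, i.e., ML = O(k).

# One-round sketch with S as global data (assumed available to every worker).
def assign_map(pair, S, d):
    ID_x, x = pair
    # only the k centers are needed locally: M_L = O(k)
    c_x = min(S, key=lambda c: d(x, c))
    return (ID_x, (x, c_x))

def Assign(P, S, d):
    # map phase only; no shuffle/reduce is required
    return [assign_map(pair, S, d) for pair in P]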
Observations: which objective to choose
- k-center is useful when we need to guarantee that every point
is close to a center (e.g., if centers represent locations of
crucial facilities or if they are needed as ”representatives” for
the data). As we will see later, it is sensitive to noise.
- k-means/k-median provide guarantees on the average (squared)
distances. k-means is more sensitive to noise, but there exist
faster algorithms to solve it.
k-center: Farthest-First Traversal algorithm
Popular 2-approximation sequential algorithm; simple and reasonably fast.
Farthest-First Traversal requires k − 1 scans of
the pointset P: impractical for massive P when k is not small.
-> Remedy: the composable coreset technique.
Input: Set P of N points from a metric space (M, d), integer k > 1
Output: A set S of k centers which is a good solution to the k-center
problem on P (i.e., Φkcenter(P, S) “close” to Φ^opt_kcenter(P, k))

S ← {c1} // c1 ∈ P arbitrary point
for i ← 2 to k do
    Find the point ci ∈ P − S that maximizes d(ci, S)
    S ← S ∪ {ci}
return S
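A direct Python translation of the pseudocode above, keeping d(x, S) incrementally updated so the whole run costs O(N · k) distance computations:

def farthest_first_traversal(P, k, d):
    # c1: arbitrary point of P (here, the first one)
    S = [P[0]]
    # dist[i] = current value of d(P[i], S)
    dist = [d(x, S[0]) for x in P]
    for _ in range(2, k + 1):
        # point of P - S that maximizes d(ci, S)
        i_max = max(range(len(P)), key=lambda i: dist[i])
        c_i = P[i_max]
        S.append(c_i)
        # adding c_i to S can only shrink the distances
        dist = [min(dist[j], d(P[j], c_i)) for j in range(len(P))]
    return S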
Lemma
For every x ∈ P we have
d(x, T) ≤ 2 · Φ^opt_kcenter(P, k).
Farthest-First Traversal: analysis
Theorem
Let S be the set of centers returned by running Farthest-First Traversal
on P. Then:
Φkcenter(P, S) ≤ 2 · Φ^opt_kcenter(P, k).
That is, Farthest-First Traversal is a 2-approximation algorithm
…
Observations on k-center clustering
The k-center objective focuses on worst-case distance of
points from their closest centers.
- Farthest-First Traversal’s approximation guarantee is essentially
the best obtainable in polynomial time: it was proved that
computing a c-approximate solution to k-center is NP-hard
for any fixed c < 2.
- The k-center objective is very sensitive to noise. For noisy
datasets (e.g., with outliers), the clustering which optimizes
the k-center objective may obfuscate some “natural”
clustering inherent in the data.
Exact diameter computation
The diameter dmax = max{d(x, y) : x, y ∈ P} can be computed exactly
with Θ(N^2) distance computations: infeasible for massive P.
2-approximation to the diameter
For an arbitrary xi ∈ P define
dmax(i) = max{d(xi, xj) : 0 ≤ j < N}.
Lemma
For any 0 ≤ i < N we have dmax ∈ [dmax(i), 2dmax(i)].
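By the lemma, one scan from an arbitrary anchor point already gives a 2-approximation; a minimal sketch:

def diameter_2approx(P, d, i=0):
    # dmax(i): one O(N) scan from the arbitrary anchor point P[i]
    dmax_i = max(d(P[i], x) for x in P)
    # by the lemma, the true diameter dmax lies in [dmax_i, 2 * dmax_i]
    return dmax_i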
Better, coreset-based diameter approximation
Coreset-based approximation:
* Fix a suitable k ≥ 2.
* Extract a coreset T ⊂ P of size k by running a k-center clustering
algorithm on P and taking the k cluster centers as set T.
* Return dT = max{d(x, y) : x, y ∈ T} as an approximation of dmax.
Can the approximation be computed efficiently?
If k = O (1), dT can be computed
* Sequentially: in O(N) time (using Farthest-First Traversal)
* In MapReduce: in 2 rounds, with local space ML = O(√N) and
aggregate space MA = O(N) (through MR-Farthest-First Traversal)
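A sequential sketch of the coreset-based approximation, reusing the farthest_first_traversal sketch from above (a MapReduce version would extract T with MR-Farthest-First Traversal instead):

from itertools import combinations

def diameter_coreset_approx(P, k, d):
    # coreset T: the k centers returned by k-center clustering on P
    T = farthest_first_traversal(P, k, d)
    # d_T: exact diameter of T, computed by brute force on O(k^2) pairs
    return max(d(x, y) for x, y in combinations(T, 2))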
Diversity maximization
Determine the most diverse subset of size k (for a given small k)
Diversity maximization problem: remote clique variant
Given a set P of points from a metric space (M, d) and a positive integer
k < |P|, return a subset S ⊂ P of k points, which maximizes the
diversity function
div(S) = Σx,y∈S d(x, y).
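Evaluating the diversity of a candidate subset S is immediate; a sketch assuming the sum ranges over unordered pairs of distinct points:

from itertools import combinations

def div(S, d):
    # sum of d(x, y) over all unordered pairs {x, y} of S
    return sum(d(x, y) for x, y in combinations(S, 2))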