Clustering Flashcards
Unsupervised learning introduction
We have the input features X, but we do not have the labels y.
We are interested in finding some interesting structure in the data or, equivalently, in organizing it in some meaningful way.
Clustering definition and difficulties
Clustering: task of grouping objects such that similar objects end up in the same group and dissimilar objects are separated into different groups.
Difficulties:
- Similarity (or proximity) is not a transitive relation. Example: a sequence of objects x1, …, xm such that each xi is very similar to its two neighbors, xi-1 and xi+1, but x1 and xm are very dissimilar
- Lack of ground truth: a given set of objects can be clustered in various different meaningful ways
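The non-transitivity difficulty can be made concrete with a tiny numeric example (the points and spacing below are hypothetical, chosen only for illustration): points spaced evenly on a line, where every point is close to its neighbors but the endpoints are far apart.

```python
import numpy as np

# Hypothetical chain x1, ..., x10 of evenly spaced points on the real line.
points = np.arange(10, dtype=float)

neighbor_dists = np.abs(np.diff(points))     # distances between consecutive points
endpoint_dist = abs(points[-1] - points[0])  # distance between x1 and xm

print(neighbor_dists.max())  # 1.0 -> each xi is "similar" to its neighbors
print(endpoint_dist)         # 9.0 -> yet x1 and xm are very dissimilar
```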
Model for Clustering
Let’s formulate the clustering problem more formally:
- Input: a set of elements X and a distance function d: X x X -> R+ that
  - is symmetric: d(x, x') = d(x', x)
  - satisfies d(x, x) = 0
  - satisfies the triangle inequality
- Output: a partition of X into clusters C = (C1, …, Ck) such that
  - the clusters cover X: C1 ∪ … ∪ Ck = X
  - for all i ≠ j: Ci ∩ Cj = ∅
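The three distance-function properties above can be spot-checked in code. A minimal sketch using the Euclidean distance as the running example (the sample points are hypothetical):

```python
import itertools
import numpy as np

def d(x, y):
    """Euclidean distance, the standard example of a valid distance function."""
    return float(np.linalg.norm(np.asarray(x) - np.asarray(y)))

# Hypothetical sample points in R^2 used to spot-check the three properties.
pts = [(0.0, 0.0), (3.0, 4.0), (1.0, -2.0)]

for x, y in itertools.product(pts, repeat=2):
    assert d(x, y) == d(y, x)                     # symmetry
for x in pts:
    assert d(x, x) == 0.0                         # d(x, x) = 0
for x, y, z in itertools.product(pts, repeat=3):
    assert d(x, z) <= d(x, y) + d(y, z) + 1e-12   # triangle inequality
```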
k-means clustering and Lloyd’s algorithm
Cost minimization in Clustering
- define a cost function over possible partitions of the objects
- find the partition (=clustering) of minimal cost
Assumptions:
- data points x in X come from a larger space X', that is, X is a proper subset of X'
- a distance function d(x, x') is defined for x, x' in X'
- for simplicity, X' = R^d and d(x, x') = ||x - x'||
k-means clustering
input: data points x1, … , xm; k in N+
Goal: find
- partition C = (C1, … , Ck) of x1, … , xm
- centers mu1, …, muk in X', where mui is the center of Ci
that minimize the k-means objective (cost)
sum_{i=1}^{k} sum_{x in Ci} d(x, mui)^2
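The objective can be evaluated directly given a clustering and its centers. A minimal sketch (function name and the tiny example data are hypothetical):

```python
import numpy as np

def kmeans_cost(X, labels, centers):
    """k-means objective: sum over clusters i of sum over x in Ci of ||x - mui||^2."""
    X = np.asarray(X, dtype=float)
    centers = np.asarray(centers, dtype=float)
    diffs = X - centers[labels]   # each point minus its own cluster's center
    return float(np.sum(diffs ** 2))

# Tiny hypothetical example: two clusters on the real line,
# each point exactly 1 away from its center.
X = np.array([[0.0], [2.0], [10.0], [12.0]])
labels = np.array([0, 0, 1, 1])
centers = np.array([[1.0], [11.0]])
print(kmeans_cost(X, labels, centers))  # 4.0
```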
To solve k-means:
- Lloyd's algorithm, O(tkmd), where t = # of iterations, k = # of clusters, m = # of data points, d = dimension
Input: data points X = {x1,…xm}
Output: clustering C of X; centers
Pseudocode
k-means++
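k-means++ changes only the seeding of the initial centers: the first center is drawn uniformly from the data, and each subsequent center is drawn with probability proportional to its squared distance from the nearest center chosen so far. A minimal sketch of the seeding step (function name hypothetical):

```python
import numpy as np

def kmeans_pp_seeds(X, k, seed=0):
    """k-means++ seeding sketch: first center uniform over the data; each
    later center sampled proportionally to squared distance to the nearest
    already-chosen center."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # squared distance of every point to its nearest chosen center
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        probs = d2 / d2.sum()
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)

# usage: seed k=2 centers from three hypothetical points
seeds = kmeans_pp_seeds(np.array([[0.0, 0.0], [1.0, 0.0], [50.0, 50.0]]), k=2)
```

The resulting seeds can then be passed to Lloyd's algorithm in place of uniform initialization, which in expectation yields a solution within an O(log k) factor of the optimal cost.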
Linkage-based clustering
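Linkage-based (agglomerative) clustering starts with every point as its own cluster and repeatedly merges the closest pair of clusters until the desired number remains; with single linkage, the distance between two clusters is the minimum pairwise distance between their members. A minimal, unoptimized sketch (function name hypothetical; real implementations are far more efficient):

```python
import numpy as np

def single_linkage(X, k):
    """Single-linkage sketch: merge the two clusters whose closest pair of
    points is nearest, until only k clusters remain."""
    X = np.asarray(X, dtype=float)
    clusters = [[i] for i in range(len(X))]
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    while len(clusters) > k:
        best, best_d = (0, 1), np.inf
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # single-linkage distance between clusters a and b
                dist = min(D[i, j] for i in clusters[a] for j in clusters[b])
                if dist < best_d:
                    best_d, best = dist, (a, b)
        a, b = best
        clusters[a].extend(clusters[b])  # merge the closest pair of clusters
        del clusters[b]
    return clusters

# usage: two tight groups on the line should merge into two clusters
result = single_linkage(np.array([[0.0], [1.0], [10.0], [11.0]]), k=2)
```

Other linkage rules (complete, average) differ only in how the inter-cluster distance is computed.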