K-means clustering Flashcards
1
Q
Goal
A
It's an un-supervised learning problem. Goal is to partition the input points x1, x2,..., xN into K sets S1, S2, … , Sk and select centers (centroids) μ1, μ2, …, μk for each cluster.
2
Q
Performance index
A
- Quality of cluster j:
Ej = sum(xn app Sj) || xn - μj ||^2 - Error function
Ein(S1,S2,…,Sk,μ1,…,μk) = sum(j=1,k) Ej = sum(n=1,N) || xn - μ(xn) ||^2
Minimizing the error function Ein in an NP-hard problem, so Lloyd’s algorithm is used.
3
Q
Lloyd’s algorithm
A
- if the partition is fixed, the optimal centroids are easy to obtain
- if the centroids are fixed, the optimal partition is easy to obtain
Algorithm
Given k
1) Initialize μj
2) Construct Sj as all points closest to μj
3) Update each μj to equal the centroid of Sj
4) Repeat 2 and 3 until Ein stops decreasing
Ein cannot increase with 2 and 3, so the alorithm eventually stops.
4
Q
How to initialize μj
A
- start taking an arbitrary point as center
- as a second center, take the point furthest from the first center
- to get each new center, keep adding the point furthest from the current set of centers
- stop when k centers are selected