Clustering Flashcards
What is clustering?
Unsupervised learning
A form of data reduction (many points summarised by a few clusters)
Finds hidden structure in unlabelled data
Why cluster?
Detect outliers
Simplify data
Visualise data
Goal of clusters in clustering?
Maximise intra-cluster similarity
Minimise inter-cluster similarity
Clustering vs Classification
Classification: assign points to known groups (labels) based on their attributes
Clustering: discover the groups, and the attributes that distinguish them, without labels
What are the 2 types of clustering?
1. Partitional
2. Hierarchical
Partitional?
Division of data into non-overlapping clusters
K-means clustering
Hierarchical?
Division of data into nested clusters (smaller clusters sit inside larger ones), organised as a tree
Dendrogram
Agglomerative - bottom up
Divisive - top down
What is K-means clustering?
Partitional clustering
Select K clusters in advance - disadvantage
Easy to implement and quick - advantage
What's the algorithm?
Choose K initial centroids (e.g. K random points)
Repeat until convergence:
  For each point x:
    Find nearest centroid c (Euclidean distance)
    Assign x to c
  For each centroid c:
    Recalculate c as the average of all points assigned to it
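A minimal Python sketch of the assign/update loop, for revision only. Assumptions not on the cards: the function name `kmeans`, random initialisation from the data points, the `max_iter` cap, and keeping the old centroid when a cluster ends up empty.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Plain K-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # Assumed initialisation: pick k distinct data points as starting centroids.
    centroids = rng.sample(points, k)
    assignments = [None] * len(points)

    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid (Euclidean).
        new_assignments = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Convergence check: stop when no point changes cluster.
        if new_assignments == assignments:
            break
        assignments = new_assignments

        # Update step: each centroid becomes the mean of its assigned points.
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # assumed policy: keep the old centroid if the cluster is empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))

    return centroids, assignments


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, assignments = kmeans(points, k=2)
```

On this toy data the two well-separated groups should end up in different clusters; which group gets label 0 depends on the random start.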
What are the convergence criteria?
No (or minimal) point reassignments
No (or minimal) change in centroids
No (or minimal) change in SSE
What is SSE?
Sum of squared errors
The sum of squared distances between each point and the centroid of its cluster, totalled over all clusters
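A small Python sketch of that calculation. The function name `sse` and the argument layout (a list of points, a list of centroids, and one cluster index per point) are illustrative assumptions.

```python
import math


def sse(points, centroids, assignments):
    """Sum of squared Euclidean distances from each point to its own centroid."""
    return sum(
        math.dist(p, centroids[a]) ** 2
        for p, a in zip(points, assignments)
    )


points = [(0, 0), (2, 0), (10, 0)]
centroids = [(1, 0), (10, 0)]
assignments = [0, 0, 1]  # cluster index of each point
total = sse(points, centroids, assignments)
print(total)  # 2.0  (1 + 1 + 0)
```

A lower SSE means points sit closer to their centroids, which is why "no change in SSE" works as a stopping rule.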
Benefits of K-means?
Simple
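Fast - O(TKN) for T iterations, K clusters, N points
Always converges (but possibly only to a local optimum, depending on the initial centroids)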
Disadvantages of K-means?
Need to specify k in advance
Only applicable if mean is defined
(If data is categorical, centroids can be represented by the mode instead)
Sensitive to outliers
Not suitable for clusters that are not hyper-ellipsoids/hyper-spheres (e.g. non-convex shapes)
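A quick Python illustration of the outlier point: because a centroid is a mean, one extreme point can drag it far from the rest of the cluster. The helper name `centroid` and the toy numbers are made up for this example.

```python
def centroid(points):
    """Component-wise mean of a list of equal-length numeric tuples."""
    return tuple(sum(c) / len(points) for c in zip(*points))


cluster = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]
outlier = (100.0, 100.0)

clean = centroid(cluster)
shifted = centroid(cluster + [outlier])
print(clean)    # (2.0, 2.0)
print(shifted)  # (26.5, 26.5)
```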