L14 - K Means, GMM, Clustering Measures Flashcards
Hierarchical clustering is a deterministic method. What does this mean?
Repeating the method on the same data always gives the same results, i.e. it generates the same dendrogram.
What is the issue with Hierarchical clustering?
Slow -> the time complexity can reach O(N^3)
What are ‘partition clustering methods’?
- Partition the data space itself, as opposed to grouping individual points
- Break the space into K non-overlapping partitions
- Each of the N data points then only needs to be compared against the K clusters
What is K-means clustering?
A partition clustering method
What is the process of K-means clustering?
1. Input the N data points and K, the number of clusters
2. Choose K random data points to be the initial cluster centroids
3. Assign every point to its nearest centroid
4. Recompute each cluster's centroid as the mean of its assigned points
5. Check the convergence or termination criteria
6. Repeat steps 3 to 5 until a criterion is met
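The steps above can be sketched in plain NumPy. This is a minimal illustration rather than a production implementation; the function name `kmeans` and the toy data are assumptions, and the empty-cluster handling is one common convention:

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal K-means sketch: random data points as initial centroids,
    then alternate assignment and update steps until convergence."""
    rng = np.random.default_rng(seed)
    # Step 2: choose K random data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 3: assign every point to its nearest centroid
        dists = np.linalg.norm(X[:, None] - centroids[None], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of its assigned points
        new_centroids = centroids.copy()
        for i in range(k):
            members = X[labels == i]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                new_centroids[i] = members.mean(axis=0)
        # Step 5: terminate once the centroids stop moving
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated toy blobs: the algorithm should recover them
X = np.array([[0., 0.], [0., 1.], [1., 0.],
              [10., 10.], [10., 11.], [11., 10.]])
labels, centroids = kmeans(X, k=2)
```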
What are some options for the termination criteria?
- Few or no re-assignment of data points to different clusters
- Few or no change to centroids
- Minimal change in sum of square errors (SSE)
What are the pros and cons of using K-means?
- Pros:
1. Easy to understand and implement
2. Efficient -> O(ndkt) (n points, d dimensions, k clusters, t iterations) -> effectively linear in n
- Cons:
1. Requires a centroid to be calculable -> Not possible with categorical data
2. Effectiveness depends on the selection of K
3. Sensitive to outliers
What are the limitations of K-means?
- Struggles with clusters of different sizes
- Struggles with clusters of different shapes
- Struggles with clusters of different density
- Results depend on starting centroids -> Same data can be clustered completely differently with different initial centroids.
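The last limitation is easy to demonstrate with scikit-learn: fixing the initial centroids by hand (via the `init` parameter, with `n_init=1` so only that one initialisation is tried) gives different final partitions and SSE for the same data. The toy three-blob dataset is an assumption:

```python
import numpy as np
from sklearn.cluster import KMeans

# Three well-separated toy blobs
X = np.array([[0., 0.], [0., 1.], [1., 0.],
              [10., 10.], [10., 11.], [11., 10.],
              [20., 0.], [20., 1.], [21., 0.]])

# Good start: one initial centroid inside each true blob
good = KMeans(n_clusters=3, init=X[[0, 3, 6]], n_init=1).fit(X)
# Bad start: two initial centroids inside the same blob
bad = KMeans(n_clusters=3, init=X[[0, 1, 3]], n_init=1).fit(X)

# Same data, different starting centroids -> different clustering and SSE
print(good.inertia_, bad.inertia_)
```

The bad initialisation converges to a local optimum in which one cluster swallows two blobs, so its SSE (`inertia_`) is much larger.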
How do we overcome the limitations of K-means?
Pre-processing -> Eliminate outliers, or standardise/normalise the data.
Post-processing -> Merge clusters with low SSE, split loose clusters, eliminate outlier clusters.
What method can we use to choose K?
Elbow method -> Plot SSE for multiple values of K and find the elbow.
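The elbow method can be sketched with scikit-learn, where SSE is exposed as `inertia_`. The small three-blob dataset is an assumption; with it, the SSE drop levels off after K = 3 (the true number of blobs):

```python
import numpy as np
from sklearn.cluster import KMeans

# Three well-separated toy blobs
X = np.array([[0., 0.], [0., 1.], [1., 0.],
              [10., 10.], [10., 11.], [11., 10.],
              [20., 0.], [20., 1.], [21., 0.]])

# SSE for a range of K values; plot K against sse to spot the elbow
sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
       for k in range(1, 6)]
```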
Since K-means is sensitive to outliers, what clustering method can be used as an alternative?
GMM -> Gaussian Mixture Models
What is GMM? What issues of K-means does it solve?
- A clustering method that clusters based on probabilities
- E.g. point A belongs 30% to cluster blue and 70% to cluster green
- Solves K-means' sensitivity to outliers
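scikit-learn's `GaussianMixture` shows the soft assignment directly: `predict_proba` returns, for every point, one probability per cluster. The two sampled blobs here are an assumption:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two well-separated Gaussian blobs
X = np.vstack([rng.normal(0, 1, size=(50, 2)),
               rng.normal(8, 1, size=(50, 2))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
# Soft assignment: each row is a probability distribution over clusters,
# e.g. [0.3, 0.7] for a point that sits between the two components
probs = gmm.predict_proba(X)
```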
What are the pros and cons of GMM compared to K-means?
- Pros:
1. Quicker convergence
2. Deals well with outliers
- Cons:
1. More complex
2. More computationally expensive
Give main differences between K-means and GMM
- K-means:
1. Minimises sum of squared distances
2. Hard assignment
3. Assumes spherical clusters
- GMM:
1. Maximises log-likelihood
2. Soft assignment of points to clusters
3. Assumes elliptical clusters
How can we tell if our clustering is correct?
- External validation -> Measure against externally supplied labels -> Use distance or incidence matrix
- Internal validation -> Measure whether points that should be close/far from each other are actually so -> Use silhouette coefficient
- Relative validation -> Compare one clustering method to another and see if they agree
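For internal validation, the silhouette coefficient is available as `silhouette_score` in scikit-learn: near +1 means points sit close to their own cluster and far from the others, near 0 (or negative) means the labels ignore the structure. The sampled blobs and the deliberately bad labelling are assumptions for illustration:

```python
import numpy as np
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Two well-separated Gaussian blobs
X = np.vstack([rng.normal(0, 1, size=(50, 2)),
               rng.normal(8, 1, size=(50, 2))])

good = np.array([0] * 50 + [1] * 50)  # labels matching the true blobs
bad = np.tile([0, 1], 50)             # alternating labels, ignore structure

good_score = silhouette_score(X, good)  # high: clusters are compact/separated
bad_score = silhouette_score(X, bad)    # near zero: clusters are mixed
```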