K-Means, GMM, Cluster Validation Flashcards
What is an issue with K-Means that GMM solves?
K-Means is hard clustering: each point is assigned to exactly one cluster, which makes it inflexible when clusters overlap.
GMM is soft clustering: it assigns each point a probability of belonging to each cluster.
Give the steps for K-Means
- Choose K centroids in the data space
- Classify data points to the nearest centroid
- Recompute the centroids to the centre of the clusters
- Calculate the variance of each cluster
- Repeat until a termination criterion is met.
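The steps above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the names `k_means`, `max_iter` and `seed` are made up for the example, the initial centroids are assumed to be K randomly chosen data points, and the squared error is computed once at the end rather than on every pass.

```python
import math
import random

def k_means(points, k, max_iter=100, seed=0):
    """Basic K-Means on a list of equal-length tuples (illustrative sketch)."""
    rng = random.Random(seed)
    # Step 1: choose K initial centroids (assumption: K distinct data points).
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):  # termination: iteration limit reached
        # Step 2: assign every point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Termination: no more data reallocations.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3: move each centroid to the mean of its cluster.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    # Step 4: total squared error of the final assignment.
    sse = sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))
    return centroids, labels, sse

# Two well-separated groups of three points each.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
centroids, labels, sse = k_means(points, k=2)
```

Which cluster gets label 0 depends on the random initialisation, so only the grouping of the points is meaningful, not the label numbers.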
What are the termination criteria for K-Means?
- No more data reallocations
- No change in centroid position
- No change in squared error
- A preset maximum number of iterations has been run
What is the time complexity of K-Means?
O(NK) per iteration for N data points and K clusters, which is O(N) when K is treated as a fixed constant.
What are some cons of using K-Means?
- Doesn’t handle outliers well.
- Affected by differences in the shape, size and density of clusters.
- Hard clustering -> Points can only be in one cluster.
Explain how GMM works…
- A soft clustering algorithm that gives probabilities of each point belonging to a cluster.
- Each cluster has its own Gaussian function.
- A data point X is evaluated under each cluster's Gaussian function to establish the probability that it belongs to that cluster.
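A rough one-dimensional sketch of the soft assignment described above. It assumes the Gaussians are already fitted (fitting is not covered here), and it weights each Gaussian by a mixing weight, which is standard for GMMs but not stated in these cards. The function names and the example parameters are hypothetical.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian with the given mean and standard deviation at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))

def responsibilities(x, weights, means, stds):
    """Probability that x belongs to each cluster (the soft assignment)."""
    weighted = [w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, means, stds)]
    total = sum(weighted)
    # Normalise so the probabilities over all clusters sum to 1.
    return [v / total for v in weighted]

# Hypothetical two-cluster model: mixing weights, means, standard deviations.
weights, means, stds = [0.5, 0.5], [0.0, 4.0], [1.0, 1.0]

probs_mid = responsibilities(2.0, weights, means, stds)   # halfway between the means
probs_near = responsibilities(0.5, weights, means, stds)  # close to the first mean
```

A point halfway between two otherwise identical clusters is split evenly between them, while a point near one mean is assigned almost entirely to that cluster. K-Means would force a single label in both cases.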
Explain what the distance matrix is in cluster validation…
A matrix that holds the Euclidean distance between every pair of data points.
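As a small illustration, a distance matrix can be built with a nested list comprehension. The helper name `distance_matrix` and the sample points are made up for this example.

```python
import math

def distance_matrix(points):
    """Entry [i][j] is the Euclidean distance between points i and j."""
    return [[math.dist(p, q) for q in points] for p in points]

points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
D = distance_matrix(points)
# D[0][1] is 5.0 (a 3-4-5 triangle) and D[0][2] is 10.0.
```

The matrix is symmetric and has zeros on its diagonal, because every point is at distance 0 from itself.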
Explain what the incidence matrix is in cluster validation…
A matrix that marks which data points share a cluster. Entry (i, j) is 1 if points i and j are in the same cluster, and 0 otherwise.
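A minimal sketch of an incidence matrix built from a list of cluster labels, one label per data point. The helper name `incidence_matrix` and the labels are illustrative.

```python
def incidence_matrix(labels):
    """Entry [i][j] is 1 if points i and j share a cluster label, else 0."""
    return [[1 if a == b else 0 for b in labels] for a in labels]

labels = [0, 0, 1]  # points 0 and 1 are in one cluster, point 2 in another
M = incidence_matrix(labels)
# M == [[1, 1, 0], [1, 1, 0], [0, 0, 1]]
```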
What are the 3 types of validation? Define each…
External: Compares the clustering against externally supplied, labeled clusters. Uses the distance and incidence matrices.
Internal: Assesses the clustering using only the data itself, via cluster cohesion and separation metrics.
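Relative: Compares two or more clusterings with each other (for example, runs with different values of K), using an external or internal measure.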
Define the 2 measures of internal validation…
Cohesion: How closely related the data points within a cluster are.
Separation: How far apart the different clusters are.
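One common way to put numbers on these two ideas, shown here as an assumption rather than the only definition: cohesion as the within-cluster sum of squared distances to each centroid (lower is tighter), and separation as the size-weighted squared distance from each cluster centroid to the overall mean (higher is further apart). The function and variable names are hypothetical.

```python
import math
from collections import defaultdict

def centroid(pts):
    """Coordinate-wise mean of a list of equal-length tuples."""
    return tuple(sum(c) / len(pts) for c in zip(*pts))

def cohesion_and_separation(points, labels):
    clusters = defaultdict(list)
    for p, lab in zip(points, labels):
        clusters[lab].append(p)
    overall = centroid(points)
    wss = 0.0  # cohesion: within-cluster sum of squares
    bss = 0.0  # separation: between-cluster sum of squares
    for members in clusters.values():
        c = centroid(members)
        wss += sum(math.dist(p, c) ** 2 for p in members)
        bss += len(members) * math.dist(c, overall) ** 2
    return wss, bss

points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
labels = [0, 0, 1, 1]
wss, bss = cohesion_and_separation(points, labels)
# wss is 4.0 and bss is 100.0 for this data
```

Under these definitions the two values always add up to the total sum of squares around the overall mean (104 here), so tightening the clusters necessarily increases the separation.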
What does the silhouette score measure? What is the best score?
How similar an object is to its own cluster compared with other clusters. Measure it for every object and plot the values on a bar chart to assess cluster cohesion and separation.
Best = 1, worst = -1
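A sketch of the usual silhouette formula s = (b - a) / max(a, b), where a is the mean distance from a point to the other points in its own cluster and b is the lowest mean distance to the points of any other cluster. The helper name is made up, and the code assumes at least two clusters. Scoring a point that is alone in its cluster as 0 is a convention adopted here.

```python
import math

def silhouette_scores(points, labels):
    """Silhouette value for every point; requires at least two clusters."""
    scores = []
    for i, (p, own) in enumerate(zip(points, labels)):
        # Collect distances from p to every other point, grouped by cluster.
        dists = {}
        for j, (q, lab) in enumerate(zip(points, labels)):
            if i != j:
                dists.setdefault(lab, []).append(math.dist(p, q))
        if own not in dists:  # p is alone in its cluster
            scores.append(0.0)
            continue
        a = sum(dists[own]) / len(dists[own])
        b = min(sum(d) / len(d) for lab, d in dists.items() if lab != own)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_scores(points, [0, 0, 1, 1])  # sensible grouping
bad = silhouette_scores(points, [0, 1, 0, 1])   # neighbours split apart
```

With the sensible grouping every score is close to 1. With the poor grouping every score is negative, which signals that each point sits closer to the other cluster than to its own.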