K-Means, GMM, Cluster Validation Flashcards

Question 1

Q

What is an issue with K-Means that GMM solves?

Answer

A

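K-Means is hard clustering: every point is assigned to exactly one cluster, so it cannot express uncertainty for points that lie between clusters.

GMM is soft clustering: it gives each point a probability of belonging to each cluster.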

Question 2

Q

Give the steps for K-Means

Answer

A

Choose K initial centroids in the data space.
Assign each data point to its nearest centroid.
Recompute each centroid as the centre (mean) of its cluster.
Calculate the variance (squared error) of each cluster.
Repeat the assignment and update steps until a termination criterion is met.
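
The steps above can be sketched in plain Python. This is a minimal illustration rather than a reference implementation: the function name `kmeans`, the `max_iter` and `seed` parameters, the choice of initial centroids (K randomly picked data points), and the toy `points` are all assumptions made for the example.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Minimal K-Means sketch for points given as tuples of floats."""
    rng = random.Random(seed)
    # Step 1: choose K initial centroids (assumption: K distinct data points).
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Termination: stop when no point changes cluster.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3: move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    # Step 4: total within-cluster squared error for the final assignment.
    sse = sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))
    return centroids, labels, sse


# Hypothetical toy data: two well-separated groups of three points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels, sse = kmeans(points, k=2)
```

The loop also stops after `max_iter` passes, so it terminates even if the assignments keep changing.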

Question 3

Q

What are the termination criteria for K-Means?

Answer

A

No more data reallocations
No change in centroid position
No change in squared error
A maximum of N iterations has been reached

Question 4

Q

What is the time complexity of K-Means?

Answer

A

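O(NK) per iteration for N points and K clusters (ignoring dimensionality), which is effectively O(N) when K is a small constant.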

Question 5

Q

What are some cons of using K-Means?

Answer

A

Doesn’t handle outliers well.
Affected by clusters that differ in shape, size and density.
Hard clustering -> Points can only be in one cluster.

Question 6

Q

Explain how GMM works…

Answer

A

A soft clustering algorithm that gives the probability of each point belonging to each cluster.
Each cluster is modelled by its own Gaussian function.
A data point X is run through each cluster's Gaussian function to establish the probability that X belongs to that cluster.
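
A rough 1-D sketch of that probability calculation is shown below. The component parameters are made up for illustration and held fixed; a full GMM would also estimate them from the data (typically with expectation-maximisation), which is not shown. The names `gaussian_pdf`, `responsibilities` and `components` are hypothetical.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian with the given mean and standard deviation."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def responsibilities(x, components):
    """Probability that x belongs to each component.

    `components` is a list of (weight, mean, std) tuples; the weights are
    the mixing proportions and are assumed to sum to 1.
    """
    weighted = [w * gaussian_pdf(x, m, s) for w, m, s in components]
    total = sum(weighted)
    # Normalise so the probabilities across clusters sum to 1.
    return [v / total for v in weighted]


# Two illustrative (made-up) clusters centred at 0 and 5.
components = [(0.5, 0.0, 1.0), (0.5, 5.0, 1.0)]
probs = responsibilities(2.5, components)  # a point halfway between the two
```

Unlike a hard assignment, a point midway between the two means is shared between both clusters rather than forced into one.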

Question 7

Q

Explain what the distance matrix is in cluster validation…

Answer

A

A matrix that holds the Euclidean distance between every pair of data points.
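
As a small illustration, the matrix can be built with the standard library alone. The helper name `distance_matrix` and the sample points are assumptions for the example.

```python
import math


def distance_matrix(points):
    """Return an N x N list of lists of pairwise Euclidean distances."""
    return [[math.dist(p, q) for q in points] for p in points]


# Hypothetical sample points.
points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
D = distance_matrix(points)
# D is symmetric with zeros on the diagonal; D[i][j] is the distance
# between points[i] and points[j].
```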

Question 8

Q

Explain what the incidence matrix is in cluster validation…

Answer

A

A matrix that marks which points are clustered together. Entry (i, j) is 1 if points i and j belong to the same cluster and 0 otherwise.
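
A short sketch, assuming the common definition in which entry (i, j) is 1 when points i and j share a cluster label and 0 otherwise. The helper name `incidence_matrix` and the labels are made up for the example.

```python
def incidence_matrix(labels):
    """Return an N x N matrix of 0/1 flags for 'same cluster' pairs."""
    return [[1 if a == b else 0 for b in labels] for a in labels]


# Hypothetical cluster labels for three points.
labels = [0, 0, 1]
M = incidence_matrix(labels)
# Rows and columns follow the order of `labels`; the diagonal is all 1s
# because every point shares a cluster with itself.
```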

Question 9

Q

What are the 3 types of validation? Define each…

Answer

A

External: Compares the clustering against externally supplied, labeled clusters. Uses the distance and incidence matrices.

Internal: Conducts assessment internally via cluster cohesion and separation metrics.

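Relative: Compares two or more clusterings (for example, runs with different values of K) against each other.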

Question 10

Q

Define the 2 measures of internal validation…

Answer

A

Cohesion: How closely related all the data points within a cluster are.

Separation: The distance between separate clusters.
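
One common way to put numbers on these two ideas, assumed here for illustration, is the within-cluster sum of squared errors (SSE) for cohesion and the size-weighted between-cluster sum of squares (SSB) for separation. The function names and the toy clusters are hypothetical.

```python
import math


def centroid(points):
    """Mean of a list of equal-length tuples."""
    return tuple(sum(x) / len(points) for x in zip(*points))


def cohesion_sse(clusters):
    """Sum of squared distances from each point to its own cluster centroid."""
    total = 0.0
    for members in clusters:
        c = centroid(members)
        total += sum(math.dist(p, c) ** 2 for p in members)
    return total


def separation_ssb(clusters):
    """Size-weighted squared distance from each centroid to the overall mean."""
    all_points = [p for members in clusters for p in members]
    overall = centroid(all_points)
    return sum(len(m) * math.dist(centroid(m), overall) ** 2 for m in clusters)


# Hypothetical clustering: two tight, well-separated clusters.
clusters = [[(1.0, 1.0), (2.0, 1.0)], [(8.0, 9.0), (9.0, 9.0)]]
sse = cohesion_sse(clusters)    # lower means tighter clusters
ssb = separation_ssb(clusters)  # higher means better-separated clusters
```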

Question 11

Q

What does the silhouette score measure? What is the best score?

Answer

A

How similar an object is to its own cluster compared with other clusters. Compute it for every object in a cluster and plot the values on a bar chart to assess cluster cohesion and separation.

Best = 1, worst = -1
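
A minimal sketch of the per-object score, using the usual formula s = (b - a) / max(a, b), where a is the mean distance to the other points in the object's own cluster and b is the mean distance to the points of the nearest other cluster. The function name and toy data are assumptions, and the code assumes at least two clusters.

```python
import math


def silhouette(i, points, labels):
    """Silhouette coefficient of points[i]; assumes at least two clusters."""
    own = labels[i]
    dists = {}  # cluster label -> distances from points[i] to its members
    for j, (p, lab) in enumerate(zip(points, labels)):
        if j != i:
            dists.setdefault(lab, []).append(math.dist(points[i], p))
    if own not in dists:
        return 0.0  # convention for a point that is alone in its cluster
    # a: mean distance to the rest of the point's own cluster (cohesion).
    a = sum(dists[own]) / len(dists[own])
    # b: mean distance to the nearest other cluster (separation).
    b = min(sum(d) / len(d) for lab, d in dists.items() if lab != own)
    return (b - a) / max(a, b)


# Hypothetical toy data: two tight clusters far apart.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
labels = [0, 0, 1, 1]
scores = [silhouette(i, points, labels) for i in range(len(points))]
```

Scores near 1 indicate a point sits well inside its own cluster, and scores below 0 suggest it may fit a neighbouring cluster better.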

(11 cards)