Week 6 Flashcards
What is the aim of cluster analysis?
To create a grouping of objects such that objects within a group are similar and objects in different groups are not similar.
How is a cluster defined in K-means clustering?
It is a representative point, the mean of the objects that are assigned to the cluster.
What is muk in K-means clustering?
The mean pont for the k-th cluster.
What is znk in K-means clustering?
A binary indictaor variable that is 1 if object n is assigned to cluster k and 0 otherwise.
What requirement leads to SIGMAk z nk = 1 in K-means clustering?
Each object has to be assigned to one and only one cluster.
What is the equation for muk in K-means clustering?
muk = (SIGMAk of znk xn) / (SIGMA n of znk)
What are the four steps performed in K-means clustering?
You start with initial values for the cluster means, mu1, …, muk.
- For each data object xn, find the closes cluster mean and set znk =1 and znj = 0 for all j != k.
- Stop if all znk are unchanged compared to the previous iteration.
- Update each muk.
- Return to step 1.
What is the equation for D, the total distance between objects and their cluster centers, in K-means clustering?
D = SIGMA n=1 to n SIGMA k=1 to K znk (xn - muk)T (xn - muk)
What solution do we use to prevent K-means clustering from only reaching a local minimum?
We run the algorithm from several random starting points and use the solution that gives the lowest value of D, the total distance.
What is the problem with using D, the total distance, as a model selection criteria in K-means clustering?
It decreases as K increases, so more clusters lead to a better result. This shows that, just like maximum likelihood, it kinda measures complexity.
What is clustered in feature selection?
Features are clustered based on their values across objects rather than clustering the objects.
Parametric density estimation
When the distribution of x given a class is a single model. So when p(x|Ci) has one model.
Semiparametric density estimation
When you assume a mixture of different models, so when p(x|Ci) is a mixture of densities.
Explain the elbow plot:
It’s the plot with the amount of clusters (K) on the x-axis and the reduction in variation within clusters on the y-axis. There is a decrease in reduction at some point, called the elbow.
How do you calculate Euclidian distance between two points?
root(x2 + y2)