Week 4: Clustering 1 Flashcards
What is the formula for Euclidean distance?
The square root of the sum of squared differences across all dimensions i: d(A, B) = sqrt(sum over i of (Ai - Bi)^2)
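A minimal sketch of the formula in Python. The function name `euclidean_distance` and the use of plain lists as points are illustrative choices, not part of the course material.

```python
import math


def euclidean_distance(a, b):
    """Square root of the sum of squared coordinate differences."""
    if len(a) != len(b):
        raise ValueError("points must have the same number of dimensions")
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


# A 3-4-5 right triangle: differences are 3 and 4, so the distance is 5.
print(euclidean_distance([0, 0], [3, 4]))  # 5.0
```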
What are the 2 main types of clustering?
1) Hierarchical: repeatedly merge the 2 closest clusters (single points, on the first pass) into 1 new cluster. Merging continues until only a few large clusters (or ultimately a single cluster) remain
2) Partitioning: fix the number of clusters in advance, and assign each point to the cluster closest to it
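A rough sketch of the hierarchical merge loop. Two things here are assumptions for illustration: the distance between two clusters is measured between their centroids (other linkage rules exist), and merging stops at a requested number of clusters. The names `agglomerate` and `centroid` are hypothetical.

```python
import math


def centroid(cluster):
    """Mean position of the points in a cluster."""
    return [sum(coords) / len(cluster) for coords in zip(*cluster)]


def agglomerate(points, n_clusters):
    """Merge the two closest clusters until n_clusters remain."""
    if n_clusters < 1:
        raise ValueError("n_clusters must be at least 1")
    # First pass: every point is its own cluster.
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        best = None  # (distance, index_i, index_j) of the closest pair
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Assumption: closeness is centroid-to-centroid distance.
                d = math.dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters


points = [(0, 0), (0, 1), (10, 10), (10, 11)]
print(agglomerate(points, 2))
```

A partitioning method works the other way round: the number of clusters is fixed first, and points are then assigned to the nearest one.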
What is the main purpose of setting K unobserved clusters?
The points are grouped without any identifying labels, so preconceived categories cannot bias the grouping, and the clusters might reveal patterns you previously could not see
When do you stop the K-means algorithm?
When the change in SSE (sum of squared errors) from one iteration to the next is small, i.e. below a chosen tolerance
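A minimal sketch of that stopping rule inside a basic K-means loop. The tolerance `tol`, the iteration cap `max_iter`, and starting from randomly sampled data points are assumptions for illustration, not values given in the course.

```python
import math
import random


def kmeans(points, k, seed=0, tol=1e-6, max_iter=100):
    """Basic K-means that stops once SSE improves by less than tol."""
    rng = random.Random(seed)
    # Assumption: initial centroids are k randomly chosen data points.
    centroids = [list(p) for p in rng.sample(points, k)]
    prev_sse = math.inf
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        sse = sum(
            math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels)
        )
        # Stopping rule: the change in SSE is small.
        if prev_sse - sse < tol:
            break
        prev_sse = sse
    return centroids, labels, sse


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels, sse = kmeans(points, k=2)
print(labels, sse)
```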
What are some challenges of the K-means algorithm?
- Minimising SSE is computationally very difficult.
- The K-means algorithm is not guaranteed to converge to the globally optimal solution (it can settle on a local optimum), and different starting seeds might produce different solutions
Because of these problems, analysts might want to run K-means multiple times with different starting seeds and select the solution with the smallest value of SSE.
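A hedged sketch of the multiple-seeds idea, using a deliberately small one-dimensional K-means so the restart logic stays visible. The helper names `kmeans_1d` and `best_of_seeds`, the sample data, and the choice of 10 seeds are all illustrative assumptions.

```python
import random


def kmeans_1d(values, k, seed, max_iter=100):
    """Tiny K-means on a list of numbers; returns (centers, labels, sse)."""
    rng = random.Random(seed)
    # Assumption: initial centers are k randomly chosen data values.
    centers = rng.sample(values, k)
    for _ in range(max_iter):
        labels = [min(range(k), key=lambda c: abs(v - centers[c])) for v in values]
        new_centers = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            # An empty cluster keeps its previous center.
            new_centers.append(sum(members) / len(members) if members else centers[c])
        if new_centers == centers:
            break
        centers = new_centers
    sse = sum((v - centers[lab]) ** 2 for v, lab in zip(values, labels))
    return centers, labels, sse


def best_of_seeds(values, k, seeds):
    """Run K-means once per seed and keep the run with the smallest SSE."""
    runs = [kmeans_1d(values, k, seed) for seed in seeds]
    return min(runs, key=lambda run: run[2])


values = [0, 1, 2, 10, 11, 12, 20, 21, 22]
for seed in range(10):
    print(seed, kmeans_1d(values, 3, seed)[2])  # SSE may differ by seed
print("best:", best_of_seeds(values, 3, range(10)))
```

Picking the smallest SSE across seeds does not guarantee the optimal solution either; it only makes a poor local solution less likely.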
Since distance measures can only be used for numerical variables, how do we handle categorical variables?
- A two-value variable becomes a single binary (0-1) variable
- A variable with k categories becomes (k-1) dummy variables
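A small sketch of (k-1) dummy coding in plain Python. The function name `dummy_encode` and the choice of the alphabetically first category as the baseline (the one coded as all zeros) are assumptions for illustration.

```python
def dummy_encode(values, baseline=None):
    """Turn a categorical column into (k-1) columns of 0-1 dummies."""
    categories = sorted(set(values))
    if baseline is None:
        # Assumption: the alphabetically first category is the baseline.
        baseline = categories[0]
    kept = [c for c in categories if c != baseline]
    rows = [[1 if v == c else 0 for c in kept] for v in values]
    return kept, rows


columns, rows = dummy_encode(["red", "green", "blue", "green"])
print(columns)  # ['green', 'red']
print(rows)     # [[0, 1], [1, 0], [0, 0], [1, 0]]
```

With 3 categories there are 2 dummy columns, and the baseline ("blue" here) is the row of all zeros. A two-value variable reduces to a single 0-1 column under the same scheme.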