Cluster Analysis P2 Flashcards
What is the cosine similarity distance measure?
Measures similarity as the cosine of the angle between two vectors; used for high-dimensional text data or sparse vectors
What is the manhattan distance measure used for?
Measures distance as the sum of absolute differences along each dimension, like moving in horizontal/vertical steps on a grid
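A minimal NumPy sketch of both distance measures above (function names are illustrative, not from the cards):

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between a and b: 1.0 means same direction,
    # 0.0 means orthogonal (common for sparse text vectors with no shared terms).
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def manhattan_distance(a, b):
    # Sum of absolute differences along each dimension ("city block" steps).
    return np.sum(np.abs(a - b))

a = np.array([1.0, 0.0, 2.0])
b = np.array([0.0, 1.0, 2.0])
print(cosine_similarity(a, b))   # ~0.8
print(manhattan_distance(a, b))  # 2.0
```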
How is clustering subjective?
Different types of clustering methods can lead to different conclusions on the same data
How is clustering ambiguous?
The same data can produce different results depending on the settings/algorithm used
What are the 2 methods in the partitioning approach?
- Global optimal
- Heuristic methods
What is the global optimal method?
Exhaustively enumerate all possible partitions
Name the two heuristic methods…
- K means
- K medoids
What is k means?
Each cluster is represented by the cluster center (mean)
What is k medoids?
Each cluster is represented by one of the objects in the cluster
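A small sketch contrasting the two representatives, assuming a single cluster of 2-D points (the variable names are illustrative):

```python
import numpy as np

cluster = np.array([[1.0, 1.0], [2.0, 1.0], [10.0, 1.0]])  # one outlier

# K-means representative: the mean (may not be an actual data point).
centroid = cluster.mean(axis=0)

# K-medoids representative: the member object with the smallest total
# Euclidean distance to all other members.
pairwise = np.linalg.norm(cluster[:, None, :] - cluster[None, :, :], axis=2)
medoid = cluster[pairwise.sum(axis=1).argmin()]

print(centroid)  # [4.33..., 1.0]  -- pulled toward the outlier
print(medoid)    # [2.0, 1.0]      -- an actual object, less affected
```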
What are the four steps of k means clustering?
- Partition the objects into k clusters randomly
- Assign each object to the cluster with the closest centroid (Euclidean distance)
- Recompute the centroids
- Stop when cluster assignments no longer change
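A minimal NumPy sketch of these four steps (function and parameter names are illustrative, and empty clusters are not handled):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    # Step 1: pick k initial centroids at random from the data.
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # Step 2: assign each point to the cluster with the closest centroid
        # (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 4: stop when cluster assignments no longer change.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 3: recompute each centroid as the mean of its assigned points.
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centroids
```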
What are the two common options for normalization?
- Convert to z scores
- Rescale to 0 - 1
How would you convert to a z-score?
Subtracting the mean and dividing by the standard deviation
Z = (x - mean) / std
How would you rescale to 0-1?
Subtracting the min value and then dividing by the range (max-min)
Xnorm = (x - min) / (max - min)
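A quick NumPy sketch of both normalization options, applied to a single feature column (the sample values are made up for illustration):

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 10.0])

# Z-score: subtract the mean, divide by the standard deviation.
z = (x - x.mean()) / x.std()

# Min-max rescaling to 0-1: subtract the min, divide by the range (max - min).
x_norm = (x - x.min()) / (x.max() - x.min())

print(z)       # mean ~0, std ~1
print(x_norm)  # [0.   0.25 0.5  1.  ]
```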
What are the 4 strengths of k means?
- Simplicity & efficiency
- Scalable
- Flexible
- Interpretability
How is k means simple and efficient?
Straightforward to implement and computationally efficient
How is k means scalable?
Handles large datasets well and is faster than many other clustering methods
How is k means flexible?
Works on clusters that are roughly spherical and evenly sized
How is k means interpretable?
Easy to understand & explain
What are the 6 weaknesses of using k means?
- Input number of clusters
- Initialization sensitivity
- Outlier sensitivity
- Magnitude sensitivity
- Irregular clusters
- Distance metric dependency
What is the most common measure when evaluating k means clusters?
SSE (sum of squared error)
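SSE adds up the squared distances between each object and its cluster centroid; lower SSE means tighter clusters for a fixed k. A short sketch, assuming the labels/centroids returned by a k-means run like the one above:

```python
import numpy as np

def sse(X, labels, centroids):
    # Sum of squared Euclidean distances from each point to its cluster centroid.
    return sum(
        np.sum((X[labels == j] - c) ** 2)
        for j, c in enumerate(centroids)
    )
```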