E6 Flashcards
What is clustering?
• Finding groups in data.
• Organizing data into groups such that there is:
(1) high similarity within each group,
(2) low similarity across the groups.
Is clustering the same as classification?
No.
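- In classification, the class labels are already given in the data (e.g., blood type); in clustering, the groups are not known in advance and must be discovered.
- Different goals: clustering aims to “understand” the data better (explore) and to organize the information we have.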
Distance measures
• Euclidean distance
-> straight-line distance between two data points
• Manhattan distance
-> taxicab distance -> sum of the absolute differences along each coordinate
• Jaccard distance
-> treats two objects as sets of characteristics and compares their overlap (text mining: shared words)
• Cosine distance
-> one minus the cosine of the angle between two vectors (often used in text mining/recommender systems)
• Edit distance
-> Levenshtein metric -> minimum number of single-character edits needed to turn one string into another (autocorrect, spelling mistakes)
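The five measures above can be sketched in a few lines of plain Python. This is only an illustrative sketch: the function names are made up for this card, vectors are assumed to be equal-length tuples of numbers, and the cosine version assumes neither vector is all zeros.

```python
import math

def euclidean(a, b):
    # Straight-line distance: square root of the summed squared differences.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # Taxicab distance: sum of absolute differences per coordinate.
    return sum(abs(x - y) for x, y in zip(a, b))

def jaccard_distance(a, b):
    # 1 - |intersection| / |union| of the two sets of characteristics.
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # two empty sets are treated as identical
    return 1 - len(a & b) / len(union)

def cosine_distance(a, b):
    # 1 - cos(angle); assumes both vectors are non-zero.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1 - dot / (norm_a * norm_b)

def edit_distance(s, t):
    # Levenshtein distance via dynamic programming, one row at a time.
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]
```

For example, euclidean((0, 0), (3, 4)) gives 5.0, manhattan((0, 0), (3, 4)) gives 7, and edit_distance("kitten", "sitting") gives 3.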
K-means clustering - how to?
(1) Select a proximity measure and specify the number of clusters (k).
(2) Initiate the process by selecting k initial centroids.
(3) Assign each data point to the “nearest” centroid to form clusters.
(4) Recalculate the centroid of each cluster as the mean of its assigned points.
(5) Repeat steps 3 and 4 until the stopping criteria are fulfilled (e.g., the assignments no longer change).
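The procedure above could look roughly like this in plain Python. It is a minimal sketch under several assumptions that the card does not specify: Euclidean distance as the proximity measure, initial centroids drawn at random from the data points, and "assignments stop changing" (or a hypothetical max_iter cap) as the stopping criterion.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Cluster a list of equal-length tuples; returns (centroids, assignments)."""
    rng = random.Random(seed)
    # Step 2: pick k data points as the initial centroids (assumed strategy).
    centroids = rng.sample(points, k)
    assignments = None
    for _ in range(max_iter):
        # Step 3: assign every point to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Stopping criterion: nothing moved to a different cluster.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Step 4: move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its old centroid
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, assignments
```

Calling kmeans(points, 2) on two well-separated groups of points should place each group in its own cluster. A different seed can give a different result on harder data, because the outcome depends on the starting centroids.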
Strengths k-means
- Simple
- Efficient
Weaknesses k-means
- k must be specified in advance (how to determine it?)
- Converges to a locally optimal solution, which depends on the initial centroids
- Works best with globular/spherical clusters
Hierarchical clustering
Creates a nested collection of ways to group the points, from every point in its own cluster to all points in one cluster
Output of hierarchical clustering
Dendrograms
Strengths hierarchical clustering
- Clusters can be of any size and shape.
- Does not require prespecifying the number of clusters.
Weaknesses hierarchical clustering
- Still need to decide where to split.
- Computationally inefficient.
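To make the hierarchical cards concrete, here is a small sketch of one bottom-up (agglomerative) variant. The choice of single linkage with Euclidean distance is an assumption made for illustration, as are the function names. The merge history it returns is the information a dendrogram draws, and the cut function shows the "where to split" decision as a distance threshold.

```python
import math

def single_linkage(points):
    """Return the merge history as (cluster_a, cluster_b, distance) tuples."""
    clusters = [[i] for i in range(len(points))]  # start: one cluster per point
    merges = []
    while len(clusters) > 1:
        best = None
        # Naive search over all cluster pairs; this is what makes it slow.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(math.dist(points[a], points[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return merges

def cut(n_points, merges, threshold):
    """Replay only the merges whose distance is at most the threshold."""
    clusters = [{i} for i in range(n_points)]
    for a, b, d in merges:
        if d > threshold:
            break  # single-linkage merge distances never decrease
        clusters = [c for c in clusters if c != set(a) and c != set(b)]
        clusters.append(set(a) | set(b))
    return sorted(sorted(c) for c in clusters)
```

With the one-dimensional points (0,), (1,), (5,), (6,), (20,), cutting at a threshold of 2 yields three clusters and cutting at 5 yields two, so the same hierarchy supports several groupings depending on where it is split.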