Clustering Flashcards
K-means clustering algorithm?
-Randomly assign cluster number to each observation
-Until clusters stop changing:
-for each cluster, compute centroid
-assign each observation to cluster with center closest to observation (using Euclidean distance)
What kind of distance does KMeans Clustering require and why?
-euclidean distance
- minimizes euclidean error
-without euclidean space, the centroid is meaningless
How to use kmeans with Manhattan distance?
Use PAM (partition around medoids)
- medoids - median data point
How to use kmeans with Cosine distance?
First normalize every data row, then apply kmeans (data points end up on unit circle)
What are the different specifications for linkage in hierarchical clustering?
- Single linkage - dist between closest points
- Complete linkage - dist between farthest points
- Avg linkage - mean dist between all points in the 2 clusters
- Ward distance - diff between total within cluster sum of squares for the two clusters separately and the within cluster sum of squared resulting from merging the two clusters together
Which linking method needs the distance to be metric?
- ward, centroid based need euclidean distance (they min squared error)
- single, complete, avg need symmetry and non-negative
DBScan core?
When nb of points in circle >= t
DBScan border?
When nb of points in circle < t, but has core point in circle
DBScan noise?
When nb of points < t, no core points
DBScan algortihm?
Determine core, border, noise points
Connect core points within distance e
Find all connected components
Assign border points to closest connected component
Return all components as clusters
Does DBScan need Euclidean?
No
Should we normalize observations/ features first?
No! Influences the distances.
Centroid-based clustering issues?
-stuck in local minima
-initialization might be problematic
-how to handle streaming data?
-cannot handle non-spherical clusters?
Hierarchical, density, graph based clustering issues?
-Requires computing distance matrix -> expensive
-streaming data?
Describe kmeans++
centroid c assigned randomly
for each datapoint x calculate min dist to centroid
assign next c to random point using proportional probability