Clustering Flashcards
K-means clustering algorithm?
- Randomly assign a cluster number to each observation
- Until the clusters stop changing:
  - for each cluster, compute its centroid
  - assign each observation to the cluster whose centroid is closest (Euclidean distance)
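The steps above can be sketched in NumPy (a minimal Lloyd's-algorithm sketch; empty clusters are re-seeded with a random data point, a detail the card omits):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Lloyd's algorithm sketch: alternate centroid update and reassignment."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, size=len(X))  # random initial cluster numbers
    centroids = np.empty((k, X.shape[1]))
    for _ in range(n_iter):
        # for each cluster, compute its centroid (re-seed empty clusters)
        for j in range(k):
            pts = X[labels == j]
            centroids[j] = pts.mean(axis=0) if len(pts) else X[rng.integers(len(X))]
        # assign each observation to the nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):  # clusters stopped changing
            break
        labels = new_labels
    return labels, centroids
```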
What kind of distance does KMeans Clustering require and why?
- Euclidean distance
- k-means minimizes squared Euclidean error
- without a Euclidean space, the centroid (mean) is meaningless
How to use kmeans with Manhattan distance?
Use PAM (Partitioning Around Medoids)
- medoid: an actual data point of the cluster that minimizes the total distance to the other points (works with any distance, including Manhattan)
How to use kmeans with Cosine distance?
First L2-normalize every data row, then apply k-means (data points end up on the unit sphere, where Euclidean distance is monotone in cosine distance)
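A quick numerical check of why this works: after L2-normalizing, ||u - v||^2 = 2(1 - cos(u, v)) for any two rows, so minimizing Euclidean error on the normalized data minimizes cosine error (data values below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 3))                        # hypothetical data rows
Xn = X / np.linalg.norm(X, axis=1, keepdims=True)  # L2-normalize each row

# all rows are now unit vectors, so for any pair u, v:
# ||u - v||^2 = 2 * (1 - cos(u, v))
cos_uv = Xn[0] @ Xn[1]
sq_euclid = np.sum((Xn[0] - Xn[1]) ** 2)
```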
What are the different specifications for linkage in hierarchical clustering?
- Single linkage: distance between the closest pair of points across the two clusters
- Complete linkage: distance between the farthest pair of points
- Average linkage: mean distance over all cross-cluster pairs of points
- Ward distance: the within-cluster sum of squares of the merged cluster minus the total within-cluster sum of squares of the two clusters kept separate
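A small NumPy sketch of the four linkage definitions, on two toy clusters (data values are illustrative):

```python
import numpy as np

A = np.array([[0.0, 0.0], [0.0, 1.0]])  # cluster A (toy data)
B = np.array([[3.0, 0.0], [4.0, 0.0]])  # cluster B (toy data)

# all pairwise distances between the two clusters
D = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)

single = D.min()    # distance between the closest pair
complete = D.max()  # distance between the farthest pair
average = D.mean()  # mean distance over all cross-cluster pairs

def wcss(C):
    """Within-cluster sum of squares around the centroid."""
    return float(((C - C.mean(axis=0)) ** 2).sum())

# Ward distance: increase in total WCSS caused by merging A and B
ward = wcss(np.vstack([A, B])) - wcss(A) - wcss(B)
```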
Which linkage methods need the distance to be metric?
- Ward and centroid-based linkage need Euclidean distance (they minimize squared error)
- single, complete, and average linkage only need symmetry and non-negativity
DBSCAN core point?
A point with >= t points (including itself) within distance e
DBSCAN border point?
A point with < t points within distance e, but with at least one core point within distance e
DBSCAN noise point?
A point with < t points within distance e and no core point within distance e
DBSCAN algorithm?
Determine core, border, noise points
Connect core points within distance e
Find all connected components
Assign border points to closest connected component
Return all components as clusters
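The steps above can be sketched in NumPy; `eps` and `min_pts` play the roles of e and t, and border points are assigned to the first component that reaches them rather than strictly the closest one (a simplification of the card's last step):

```python
import numpy as np
from collections import deque

def dbscan(X, eps, min_pts):
    """Label points as core/border/noise, then grow clusters from connected cores."""
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = [np.flatnonzero(D[i] <= eps) for i in range(n)]  # includes i itself
    core = np.array([len(nb) >= min_pts for nb in neighbors])

    labels = np.full(n, -1)  # -1 = noise (never reached by any core point)
    cluster = 0
    for i in range(n):
        if not core[i] or labels[i] != -1:
            continue
        # BFS over core points within eps: one connected component = one cluster
        queue = deque([i])
        labels[i] = cluster
        while queue:
            j = queue.popleft()
            for nb in neighbors[j]:
                if labels[nb] == -1:
                    labels[nb] = cluster   # border points join the cluster too
                    if core[nb]:
                        queue.append(nb)   # only core points extend the component
        cluster += 1
    return labels, core
```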
Does DBScan need Euclidean?
No
Should we normalize observations/features first?
No, not by default! Normalization changes the distances between points.
Centroid-based clustering issues?
- can get stuck in local minima
- initialization can be problematic
- streaming data is hard to handle
- cannot handle non-spherical clusters
Hierarchical, density, graph based clustering issues?
- requires computing the full pairwise distance matrix -> expensive (O(n^2))
- streaming data is hard to handle
Describe kmeans++
- first centroid c chosen uniformly at random from the data points
- for each data point x, compute D(x), the distance to the nearest already-chosen centroid
- choose the next centroid at random with probability proportional to D(x)^2; repeat until k centroids are chosen
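A sketch of this seeding procedure (the function name is mine, not from the source):

```python
import numpy as np

def kmeanspp_init(X, k, seed=0):
    """k-means++ seeding: spread initial centroids out proportionally to D(x)^2."""
    rng = np.random.default_rng(seed)
    # first centroid: a uniformly random data point
    centroids = [X[rng.integers(len(X))]]
    while len(centroids) < k:
        # D(x)^2: squared distance from each point to its nearest chosen centroid
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centroids], axis=0)
        # pick the next centroid with probability proportional to D(x)^2
        idx = rng.choice(len(X), p=d2 / d2.sum())
        centroids.append(X[idx])
    return np.array(centroids)
```

Already-chosen points have D(x)^2 = 0, so they can never be picked again.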
Describe CURE
- sample a few data points to represent the dataset
- run an expensive clustering algorithm (e.g., DBSCAN) on the sample to find an initial clustering
- select representative points for each cluster that capture the cluster's shape
- maintain mini-clusters that are later merged into larger ones
Potential problems of CURE?
Representatives may lie too far apart and represent outliers; fix: move (shrink) the representatives towards the cluster centroid
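The shrinking fix can be sketched as moving each representative a fraction alpha of the way toward the centroid (all values below are illustrative):

```python
import numpy as np

cluster = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 3.0], [1.0, 1.0]])  # toy cluster
centroid = cluster.mean(axis=0)              # here: (1.0, 1.0)
reps = cluster[[0, 1, 2]]                    # extreme points chosen as representatives
alpha = 0.2                                  # shrink factor (hypothetical value)
shrunk = reps + alpha * (centroid - reps)    # move each rep 20% toward the centroid
```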
Does CURE need euclidean distance?
Yes, because of the shrinking modification: moving representatives towards the centroid assumes a Euclidean space
What are the three sets used by BFR?
- Discard set: points close enough to an existing cluster; they update its statistics and are then discarded
- Compressed set: mini-clusters formed from the remaining points
- Retained set: outliers
Only the retained set is stored in memory as actual points; the other two are kept only as summary statistics.
What sufficient statistics do we need for a cluster?
- N: the number of data points
- SUM: vector with the per-feature sums of all values
- SUMSQ: vector with the per-feature sums of squared values
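From these three statistics alone the centroid and per-feature variance of a cluster can be recovered, which is why BFR never needs to keep the raw points (toy data below):

```python
import numpy as np

X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])  # points in one cluster (toy data)

# sufficient statistics: constant size no matter how many points stream in
n = len(X)
SUM = X.sum(axis=0)
SUMSQ = (X ** 2).sum(axis=0)

centroid = SUM / n
variance = SUMSQ / n - centroid ** 2  # per-feature variance from the stats alone
```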
BFR clustering algorithm
For each batch of points:
- if a point is close enough to an existing cluster: update that cluster's statistics, discard the point
- else:
  - use a standard in-memory clustering to form mini-clusters
  - compute statistics for the mini-clusters, discard their points
  - keep unclustered points in the retained set
- merge new mini-clusters and retained points with older mini-clusters, update statistics
Does BFR need Euclidean?
Theoretically yes (uses centroids), but has been done without, if careful
What does Silhouette coefficient tell us?
- value between -1 and 1
- large positive values: good clustering
- values near 0 or negative: the point lies between clusters or is likely misassigned
How to evaluate kmeans clusters?
Using WCV (within-cluster variation): the total sum of squared distances from each point to its cluster centroid; lower is better for a fixed k
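A minimal sketch of computing WCV for a labelled dataset (the function name is mine):

```python
import numpy as np

def wcv(X, labels):
    """Total within-cluster variation: sum of squared distances
    from each point to its cluster centroid."""
    total = 0.0
    for j in np.unique(labels):
        pts = X[labels == j]
        total += np.sum((pts - pts.mean(axis=0)) ** 2)
    return total
```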
Can BFR represent non spherical clusters?
No
can CURE represent non spherical clusters?
Yes
Does BFR discard outliers (points that do not belong to pre-existing clusters)?
No, they are kept in the retained set.
t-SNE advantages?
-can capture nonlinear relationships
-preserves local structure well for visualization (global structure less reliably)
-effective in practice
t-SNE disadvantages?
-only used for visualization
-parameters need to be set by hand (sensitive to initial conditions)
-slow for large datasets
what does the k in UMAP do?
k = number of nearest neighbours considered
- large k preserves global structure
- small k preserves local structure
UMAP advantages?
-control over local vs global preservation
-fast
-well separated clusters
UMAP disadvantages?
-k should be optimized
-distance info still lost (only visualization)
-sensitive to optim. parameters and initial conditions