Clustering Flashcards
What is a cluster?
A group of data points that are more similar to each other than to points in other groups.
What is the goal of clustering?
To separate data points into groups, even though the groups we want to separate them into are unknown beforehand.
What are some limitations of clustering?
- There is no clear consensus on the best clustering method.
- Clustering algorithms work best on linearly separable, spherical, and convex data.
- Clustering assumes the measured variables are equally important for separating groups, so variables are usually standardized (see the sketch below).
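A minimal sketch of that standardization step with scikit-learn's StandardScaler (the array X is made-up placeholder data):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales
X = np.array([[1.0, 1000.0], [2.0, 3000.0], [3.0, 2000.0]])

# Rescale each column to mean 0 and standard deviation 1 so no single
# variable dominates the distance calculations
X_scaled = StandardScaler().fit_transform(X)
```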
What are factors that affect clustering outcomes?
- Choice of distance function (see the sketch after this list)
- Choice of the number of groups
- Choice of initial groups
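For example, Euclidean and Manhattan distances can rank point pairs differently; a quick sketch with scipy (made-up points):

```python
import numpy as np
from scipy.spatial.distance import cdist

A = np.array([[0.0, 0.0]])
B = np.array([[3.0, 4.0]])

print(cdist(A, B, metric="euclidean"))  # [[5.0]]
print(cdist(A, B, metric="cityblock"))  # [[7.0]], the Manhattan distance
```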
What are the two types of hierarchical clustering?
Agglomerative, where you start with each point in its own group and merge groups according to a linkage rule until every point is in the same group.
Divisive, which goes the other way: start with every point in one group and split until each point is in its own group.
What are the steps for agglomerative clustering?
1. Choose a distance measure.
2. Choose a linkage rule to determine which points we compare distances between when deciding group membership.
3. Start by requiring that points within a group have distance 0.
4. Increase the distance threshold so that points at higher distances can join a group, and repeat until every point is in the same group.
5. Create a dendrogram from the merge history (a scipy sketch follows below).
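A minimal sketch of these steps with scipy's hierarchical clustering (placeholder data; the threshold 1.5 is arbitrary):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).normal(size=(10, 2))  # placeholder data

# Steps 1-2: Euclidean distance with the group-average linkage rule
Z = linkage(X, method="average", metric="euclidean")

# Steps 3-4: cut the merge tree at a distance threshold; raising t merges
# more points, until a large enough t puts everything in one group
labels = fcluster(Z, t=1.5, criterion="distance")

# Step 5: scipy.cluster.hierarchy.dendrogram(Z) draws the merge history
```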
What are the linkage rules?
Closest neighbor – Two groups are merged if the closest two points (one from each group) are within the distance threshold.
Farthest neighbor – Two groups are merged if all pairs of points across the two groups are within the distance threshold.
Group average – Two groups are merged if the average of the pairwise distances between points in the two groups is within the distance threshold.
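In scipy these three rules correspond to the method argument of linkage (placeholder data):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

X = np.random.default_rng(1).normal(size=(8, 2))  # placeholder data

Z_single   = linkage(X, method="single")    # closest neighbor
Z_complete = linkage(X, method="complete")  # farthest neighbor
Z_average  = linkage(X, method="average")   # group average
```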
When would you use closest neighbor vs. farthest neighbor?
Closest neighbor is not sensitive to outliers, since only one near pair of points is needed to merge groups, but it struggles when groups are close or overlap.
Farthest neighbor does better when groups are close/overlap, but is very sensitive to outliers.
What is k-means clustering?
It groups data into k distinct clusters based on how close each point is to the center (centroid) of each group.
What are the steps for k-means clustering?
- Choose k, the number of groups.
- Make initial choices for the k group centroids, e.g. with the Forgy method (pick k data points as centroids) or Random Partition.
- Assign each point to the group with the nearest centroid.
- Recompute centroids and repeat until group assignments stop changing (a numpy sketch follows below).
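A minimal numpy sketch of this loop, using Forgy initialization and made-up placeholder data (the function name kmeans and all values are illustrative):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Forgy initialization: pick k actual data points as starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point joins the group with the nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points
        # (assumes no cluster empties out; a real implementation must handle that)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # assignments have stabilized
        centroids = new_centroids
    return labels, centroids

X = np.random.default_rng(1).normal(size=(50, 2))  # placeholder data
labels, centroids = kmeans(X, k=3)
```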
Why is choosing the right value of k important in K-means clustering?
Because k directly affects the results of clustering — some datasets have an obvious number of clusters, but others do not.
What is one common method for choosing k?
Use a scree plot and look for an elbow in the within-cluster variation — the point where adding more clusters stops improving the model significantly.
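One way to produce that scree plot, sketched with scikit-learn's KMeans (placeholder data; inertia_ is scikit-learn's name for the within-cluster variation):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.default_rng(2).normal(size=(100, 2))  # placeholder data

# inertia_ is the total within-cluster sum of squared distances
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in range(1, 10)]

plt.plot(range(1, 10), inertias, marker="o")
plt.xlabel("k")
plt.ylabel("within-cluster variation")
plt.show()  # look for the elbow where the curve flattens
```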
What does the silhouette score measure?
It compares a point’s average distance to its own cluster vs. the next closest cluster — higher scores mean better clustering.
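A quick sketch comparing silhouette scores across k with scikit-learn (placeholder data):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.default_rng(3).normal(size=(100, 2))  # placeholder data

for k in range(2, 6):  # the silhouette needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, silhouette_score(X, labels))  # scores near 1 = better clustering
```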
What is the gap statistic?
It compares the observed within-cluster variation to the variation expected if the data were randomly distributed.
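There is no standard scikit-learn function for this, so here is a rough sketch of the idea behind Tibshirani et al.'s gap statistic, with placeholder data (the function gap_statistic is illustrative; the full method also uses a standard-error rule to pick k):

```python
import numpy as np
from sklearn.cluster import KMeans

def gap_statistic(X, k, n_refs=10, seed=0):
    rng = np.random.default_rng(seed)
    # Log within-cluster variation on the observed data
    log_w = np.log(KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X).inertia_)
    # Reference datasets: uniform noise over the data's bounding box,
    # i.e. what the variation looks like with no cluster structure
    lo, hi = X.min(axis=0), X.max(axis=0)
    ref_log_ws = [
        np.log(KMeans(n_clusters=k, n_init=10, random_state=seed)
               .fit(rng.uniform(lo, hi, size=X.shape)).inertia_)
        for _ in range(n_refs)
    ]
    return np.mean(ref_log_ws) - log_w  # larger gap = stronger clustering

X = np.random.default_rng(4).normal(size=(100, 2))  # placeholder data
print(gap_statistic(X, k=3))
```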
What does K-means minimize, and what’s a limitation of this?
It minimizes within-cluster distance, but it only guarantees a local minimum — results depend on initialization.
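In practice this is handled by running k-means from several different starts and keeping the best run; scikit-learn's n_init parameter does exactly that (placeholder data):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(5).normal(size=(100, 2))  # placeholder data

# n_init=10 runs the algorithm from 10 different initializations and
# keeps the solution with the lowest within-cluster distance (inertia)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.inertia_)
```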
What is the Forgy method for initializing K-means?
It randomly selects k actual data points as initial centroids, helping to spread them out — but may choose outliers.
What is the Random Partition method for initializing K-means?
It randomly assigns points to clusters, then computes centroids from those — more centered but can fail with imbalanced clusters.
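A side-by-side sketch of the two initializations in numpy (placeholder data; a real Random Partition implementation should also guard against empty clusters):

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(50, 2))  # placeholder data
k = 3

# Forgy: pick k actual data points as the initial centroids
forgy_centroids = X[rng.choice(len(X), size=k, replace=False)]

# Random Partition: randomly assign every point to one of k clusters,
# then use each cluster's mean as its initial centroid
labels = rng.integers(0, k, size=len(X))
partition_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
```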
What are the pros and cons of the Forgy method?
Pros: spreads out centroids.
Cons: vulnerable to picking outliers as initial centers.
What are the pros and cons of the Random Partition method?
Pros: Resistant to outliers.
Cons: centroids may be too close together, especially in uneven data.
What is K-means++ and why is it preferred?
A smarter initialization method that picks each new centroid with probability proportional to its squared distance from the centroids already chosen, spreading them out and reducing the chance of poor clustering.
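In scikit-learn, k-means++ is the default initialization; a minimal sketch (placeholder data):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(7).normal(size=(100, 2))  # placeholder data

# init="k-means++" picks each new centroid with probability proportional
# to its squared distance from the centroids chosen so far
km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)
```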