Clustering Flashcards
What is a cluster?
A group of data points that are more similar to each other than to points in other groups.
What is the goal of clustering?
To separate data points into groups, even though the groups we want to separate them into are unknown beforehand.
What are some limitations of clustering?
- There is no clear consensus on the best clustering method.
- Clustering algorithms work best on linearly separable, spherical, and convex data.
- Clustering assumes the measured variables are equally important for separating groups, so variables are usually standardized (see the sketch below).
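A minimal sketch of that standardization step with scikit-learn's StandardScaler (the array X is made-up placeholder data):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales
X = np.array([[1.0, 1000.0], [2.0, 3000.0], [3.0, 2000.0]])

# Rescale each column to mean 0 and standard deviation 1 so no single
# variable dominates the distance calculations
X_scaled = StandardScaler().fit_transform(X)
```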
What are factors that affect clustering outcomes?
- Choice of distance function (see the sketch after this list)
- Choice of the number of groups
- Choice of initial groups
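For example, Euclidean and Manhattan distances can rank point pairs differently; a quick sketch with scipy (made-up points):

```python
import numpy as np
from scipy.spatial.distance import cdist

A = np.array([[0.0, 0.0]])
B = np.array([[3.0, 4.0]])

print(cdist(A, B, metric="euclidean"))  # [[5.0]]
print(cdist(A, B, metric="cityblock"))  # [[7.0]], the Manhattan distance
```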
What are the two types of hierarchical clustering?
Agglomerative, where you start with each point in its own group and merge groups according to a linkage rule until every point is in the same group.
Divisive, which goes the other way: start with every point in one group and split until each point is in its own group.
What are the steps for agglomerative clustering?
1. Choose a distance measure.
2. Choose a linkage rule to determine which points we compare distances between when deciding group membership.
3. Start by requiring that points within a group have distance 0.
4. Increase the distance threshold so that points at higher distances can join a group, and repeat until every point is in the same group.
5. Create a dendrogram from the merge history (a scipy sketch follows below).
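A minimal sketch of these steps with scipy's hierarchical clustering (placeholder data; the threshold 1.5 is arbitrary):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).normal(size=(10, 2))  # placeholder data

# Steps 1-2: Euclidean distance with the group-average linkage rule
Z = linkage(X, method="average", metric="euclidean")

# Steps 3-4: cut the merge tree at a distance threshold; raising t merges
# more points, until a large enough t puts everything in one group
labels = fcluster(Z, t=1.5, criterion="distance")

# Step 5: scipy.cluster.hierarchy.dendrogram(Z) draws the merge history
```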
What are the linkage rules?
Closest neighbor – Two groups are merged if the closest two points (one from each group) are within the distance threshold.
Farthest neighbor – Two groups are merged if all pairs of points across the two groups are within the distance threshold.
Group average – Two groups are merged if the average of the pairwise distances between points in the two groups is within the distance threshold.
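In scipy these three rules correspond to the method argument of linkage (placeholder data):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

X = np.random.default_rng(1).normal(size=(8, 2))  # placeholder data

Z_single   = linkage(X, method="single")    # closest neighbor
Z_complete = linkage(X, method="complete")  # farthest neighbor
Z_average  = linkage(X, method="average")   # group average
```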
When would you use closest neighbor vs. farthest neighbor?
Closest neighbor is not sensitive to outliers, since only one near pair of points is needed to merge groups, but it struggles when groups are close or overlap.
Farthest neighbor does better when groups are close/overlap, but is very sensitive to outliers.
What is k-means clustering?
It groups data into k distinct clusters based on how close each point is to the center (centroid) of each group.
What are the steps for k-means clustering?
- Choose k, the number of groups.
- Make initial choices for the k group centroids, e.g. with the Forgy method (pick k data points as centroids) or Random Partition.
- Assign each point to the group with the nearest centroid.
- Recompute centroids and repeat until group assignments stop changing (a numpy sketch follows below).
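A minimal numpy sketch of this loop, using Forgy initialization and made-up placeholder data (the function name kmeans and all values are illustrative):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Forgy initialization: pick k actual data points as starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point joins the group with the nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points
        # (assumes no cluster empties out; a real implementation must handle that)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # assignments have stabilized
        centroids = new_centroids
    return labels, centroids

X = np.random.default_rng(1).normal(size=(50, 2))  # placeholder data
labels, centroids = kmeans(X, k=3)
```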
Why is choosing the right value of k important in K-means clustering?
Because k directly affects the results of clustering — some datasets have an obvious number of clusters, but others do not.
What is one common method for choosing k?
Use a scree plot and look for an elbow in the within-cluster variation — the point where adding more clusters stops improving the model significantly.
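One way to produce that scree plot, sketched with scikit-learn's KMeans (placeholder data; inertia_ is scikit-learn's name for the within-cluster variation):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.default_rng(2).normal(size=(100, 2))  # placeholder data

# inertia_ is the total within-cluster sum of squared distances
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in range(1, 10)]

plt.plot(range(1, 10), inertias, marker="o")
plt.xlabel("k")
plt.ylabel("within-cluster variation")
plt.show()  # look for the elbow where the curve flattens
```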
What does the silhouette score measure?
It compares a point’s average distance to its own cluster vs. the next closest cluster — higher scores mean better clustering.
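A quick sketch comparing silhouette scores across k with scikit-learn (placeholder data):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.default_rng(3).normal(size=(100, 2))  # placeholder data

for k in range(2, 6):  # the silhouette needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, silhouette_score(X, labels))  # scores near 1 = better clustering
```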
What is the gap statistic?
It compares the observed within-cluster variation to the variation expected if the data were randomly distributed.
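There is no standard scikit-learn function for this, so here is a rough sketch of the idea behind Tibshirani et al.'s gap statistic, with placeholder data (the function gap_statistic is illustrative; the full method also uses a standard-error rule to pick k):

```python
import numpy as np
from sklearn.cluster import KMeans

def gap_statistic(X, k, n_refs=10, seed=0):
    rng = np.random.default_rng(seed)
    # Log within-cluster variation on the observed data
    log_w = np.log(KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X).inertia_)
    # Reference datasets: uniform noise over the data's bounding box,
    # i.e. what the variation looks like with no cluster structure
    lo, hi = X.min(axis=0), X.max(axis=0)
    ref_log_ws = [
        np.log(KMeans(n_clusters=k, n_init=10, random_state=seed)
               .fit(rng.uniform(lo, hi, size=X.shape)).inertia_)
        for _ in range(n_refs)
    ]
    return np.mean(ref_log_ws) - log_w  # larger gap = stronger clustering

X = np.random.default_rng(4).normal(size=(100, 2))  # placeholder data
print(gap_statistic(X, k=3))
```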
What does K-means minimize, and what’s a limitation of this?
It minimizes within-cluster distance, but it only guarantees a local minimum — results depend on initialization.
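In practice this is handled by running k-means from several different starts and keeping the best run; scikit-learn's n_init parameter does exactly that (placeholder data):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(5).normal(size=(100, 2))  # placeholder data

# n_init=10 runs the algorithm from 10 different initializations and
# keeps the solution with the lowest within-cluster distance (inertia)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.inertia_)
```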
What is the Forgy method for initializing K-means?
It randomly selects k actual data points as initial centroids, helping to spread them out — but may choose outliers.
What is the Random Partition method for initializing K-means?
It randomly assigns points to clusters, then computes centroids from those — more centered but can fail with imbalanced clusters.
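A side-by-side sketch of the two initializations in numpy (placeholder data; a real Random Partition implementation should also guard against empty clusters):

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(50, 2))  # placeholder data
k = 3

# Forgy: pick k actual data points as the initial centroids
forgy_centroids = X[rng.choice(len(X), size=k, replace=False)]

# Random Partition: randomly assign every point to one of k clusters,
# then use each cluster's mean as its initial centroid
labels = rng.integers(0, k, size=len(X))
partition_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
```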
What are the pros and cons of the Forgy method?
Pros: spreads out centroids.
Cons: vulnerable to picking outliers as initial centers.
What are the pros and cons of the Random Partition method?
Pros: Resistant to outliers.
Cons: centroids may be too close together, especially in uneven data.
What is K-means++ and why is it preferred?
A smarter initialization method that picks each new centroid with probability proportional to its squared distance from the centroids already chosen, spreading them out and reducing the chance of poor clustering.
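In scikit-learn, k-means++ is the default initialization; a minimal sketch (placeholder data):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(7).normal(size=(100, 2))  # placeholder data

# init="k-means++" picks each new centroid with probability proportional
# to its squared distance from the centroids chosen so far
km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)
```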