Clustering Flashcards

1
Q

what is a cluster?

A

a group of data points that are more similar to each other than to data points in other groups

2
Q

what is the goal of clustering?

A

to separate data points into groups; however, the groups we want to separate them into are unknown beforehand

3
Q

what are some limitations to clustering?

A
  1. there is no clear consensus on the best clustering method
  2. clustering algorithms work best on linearly separable, spherical, and convex data
  3. clustering assumes the variables being measured are equally important for separating groups, so variables are usually standardized first
4
Q

what are factors that affect clustering outcomes?

A

choice of distance function
choice of the number of groups
choice of initial groups

5
Q

what are two types of hierarchical clustering?

A

agglomerative, where you start with each point in its own group and merge groups according to different linkage rules until every point is in the same group
then there’s divisive, which goes the other way around: start with every point in one group and split until each point is in its own group

6
Q

what are the steps for agglomerative clustering?

A
  1. choose a distance measure
  2. choose a linkage rule to determine which points we compare distances between when deciding group membership
  3. start by requiring that points have distance 0 within a group
  4. increase the distance threshold so that points at higher distances are considered a group, and repeat until every point is in the same group
  5. create a dendrogram
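The steps above can be sketched with scipy's hierarchical clustering tools (a minimal sketch assuming numpy and scipy are available; the two-blob data is made up for illustration):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Two well-separated toy blobs (made-up data).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (5, 2)),   # group near (0, 0)
               rng.normal(5, 0.1, (5, 2))])  # group near (5, 5)

# Steps 1-2: Euclidean distance with the "closest neighbor" (single) linkage.
Z = linkage(X, method="single", metric="euclidean")

# Steps 3-4: cut the merge tree at a distance threshold; Z is also what
# a dendrogram (step 5) would draw.
labels = fcluster(Z, t=1.0, criterion="distance")
```

Cutting at threshold 1.0 recovers the two blobs; raising the threshold past the gap between them would merge everything into one group.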
7
Q

what are the linkage rules?

A

Closest neighbor – Two groups are merged if their closest two points are within the distance threshold.
Farthest neighbor – Two groups are merged if ALL points in the two groups are within the distance threshold.
Group average – Two groups are merged if the average of the pairwise distances between points in the two groups is within the distance threshold.
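The three rules correspond to the min, max, and mean of the pairwise distances between two groups. A small numpy sketch with made-up coordinates:

```python
import numpy as np

# Two tiny made-up groups on a line, just to illustrate the three rules.
g1 = np.array([[0.0, 0.0], [1.0, 0.0]])
g2 = np.array([[4.0, 0.0], [6.0, 0.0]])

# All pairwise distances between the two groups.
d = np.linalg.norm(g1[:, None, :] - g2[None, :, :], axis=-1)

closest = d.min()    # closest neighbor (single linkage): 3.0
farthest = d.max()   # farthest neighbor (complete linkage): 6.0
average = d.mean()   # group average linkage: 4.5
```

Each rule merges the two groups once its statistic drops below the current distance threshold.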

8
Q

when would you use closest neighbor vs farthest neighbor?

A

Nearest neighbor is not sensitive to outliers, since only one near point is needed to merge groups, but it struggles when groups are close or overlap.
Farthest neighbor does better when groups are close/overlap, but is very sensitive to outliers.

9
Q

what is k means clustering?

A

groups data into k distinct clusters based on how close each point is to the center of each group

10
Q

what are the steps for k means clustering?

A
  1. choose k, the number of groups
  2. make initial choices for the k group centroids, either through the Forgy method, which picks k data points as centroids, or randomly through random partition
  3. assign each point to the group with the nearest centroid
  4. recompute centroids and repeat until groups don’t change or stabilize
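A minimal from-scratch sketch of this loop in numpy (the kmeans helper and the two-blob data are assumptions for illustration; it uses the Forgy start):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal k-means loop following the steps above (Forgy start)."""
    rng = np.random.default_rng(seed)
    # Step 2 (Forgy): pick k actual data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 3: assign each point to the group with the nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # Step 4: recompute centroids; stop once they no longer move.
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members):
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two obvious blobs: k = 2 should recover them.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.2, (20, 2)), rng.normal(5, 0.2, (20, 2))])
labels, centroids = kmeans(X, k=2)
```

On well-separated data like this the loop stabilizes after a few iterations.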
11
Q

Why is choosing the right value of k important in K-means clustering?

A

Because k directly affects the results of clustering — some datasets have an obvious number of clusters, but others do not.

12
Q

What is one common method for choosing k?

A

Use a scree plot and look for an elbow in the within-cluster variation — the point where adding more clusters stops improving the model significantly.
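A rough sketch of the values behind such a scree plot, using scipy's kmeans distortion (mean distance to the nearest centroid) on made-up three-blob data:

```python
import numpy as np
from scipy.cluster.vq import kmeans

# Three clear toy blobs (made-up data).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.2, (30, 2)) for c in (0.0, 5.0, 10.0)])

# Within-cluster variation (distortion) for each candidate k.
distortions = [kmeans(X, k, seed=1)[1] for k in range(1, 7)]
# Plotted against k, the curve drops steeply up to k = 3 and then
# flattens out: that bend is the "elbow".
```

Here the elbow sits at k = 3 because the data really has three groups; adding a fourth centroid only splits an existing blob.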

13
Q

What does the silhouette score measure?

A

It compares a point’s average distance to its own cluster vs. the next closest cluster — higher scores mean better clustering.
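A minimal sketch of the computation on a made-up four-point example (the silhouette helper here is hand-rolled, not a library call):

```python
import numpy as np

def silhouette(X, labels):
    """Mean silhouette: (b - a) / max(a, b) per point, averaged, where
    a = mean distance to own cluster, b = mean distance to nearest other."""
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    scores = []
    for i, li in enumerate(labels):
        own = (labels == li) & (np.arange(len(X)) != i)
        a = D[i, own].mean()
        b = min(D[i, labels == lj].mean() for lj in set(labels) if lj != li)
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Tight, well-separated groups score near +1; mixed-up labels score poorly.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 0.0], [5.1, 0.0]])
good = silhouette(X, np.array([0, 0, 1, 1]))
bad = silhouette(X, np.array([0, 1, 0, 1]))
```

Scores near +1 mean points sit much closer to their own cluster than to the next one; negative scores mean points would rather belong to the neighboring cluster.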

14
Q

What is the gap statistic?

A

It compares the observed within-cluster variation to the variation expected if the data were randomly distributed.
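A rough sketch of that comparison, assuming a uniform bounding-box reference; the within_dispersion helper is a quick hand-rolled k-means, not a library routine:

```python
import numpy as np

def within_dispersion(X, k, rng):
    """W_k: total within-cluster sum of squares after a quick k-means."""
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(50):
        labels = np.linalg.norm(X[:, None] - centroids[None], axis=-1).argmin(1)
        for j in range(k):
            if (labels == j).any():
                centroids[j] = X[labels == j].mean(0)
    return sum(((X[labels == j] - centroids[j]) ** 2).sum() for j in range(k))

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.2, (30, 2)) for c in (0.0, 5.0)])  # made-up blobs

k = 2
log_W = np.log(within_dispersion(X, k, rng))
# Reference: the same k-means on data drawn uniformly over the bounding box.
refs = []
for _ in range(10):
    U = rng.uniform(X.min(0), X.max(0), X.shape)
    refs.append(np.log(within_dispersion(U, k, rng)))
gap = np.mean(refs) - log_W
```

A large positive gap at some k says the real data clusters far better at that k than random data would; one picks the k where the gap peaks.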

15
Q

What does K-means minimize, and what’s a limitation of this?

A

It minimizes within-cluster distance, but it only guarantees a local minimum — results depend on initialization.
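A quick demonstration of the local-minimum issue using scipy's kmeans2 with different random starts on poorly separated, made-up data:

```python
import numpy as np
from scipy.cluster.vq import kmeans2

# One diffuse Gaussian cloud with k = 3: there is no "right" split, so
# different random starts settle into different local minima.
rng = np.random.default_rng(0)
X = rng.normal(0, 1, (200, 2))

inertias = []
for seed in range(5):
    centroids, labels = kmeans2(X, 3, seed=seed, minit="random")
    inertias.append(float(((X - centroids[labels]) ** 2).sum()))
# The runs do not all agree: each found its own local minimum.
```

This is why implementations typically run k-means from several initializations and keep the run with the lowest within-cluster distance.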

16
Q

What is the Forgy method for initializing K-means?

A

It randomly selects k actual data points as initial centroids, helping to spread them out — but may choose outliers.

17
Q

What is the Random Partition method for initializing K-means?

A

It randomly assigns points to clusters, then computes centroids from those — more centered but can fail with imbalanced clusters.
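A quick numpy sketch contrasting the two initializations on imbalanced, made-up data (blob sizes 50 vs 5 are chosen to trigger the failure mode):

```python
import numpy as np

rng = np.random.default_rng(0)
# Imbalanced made-up data: 50 points near (0, 0), only 5 near (5, 5).
X = np.vstack([rng.normal(0, 0.2, (50, 2)), rng.normal(5, 0.2, (5, 2))])
k = 2

# Forgy: k actual data points become the initial centroids (spread out).
forgy = X[rng.choice(len(X), k, replace=False)]

# Random partition: randomly label every point, then average each group.
labels = rng.integers(0, k, len(X))
partition = np.array([X[labels == j].mean(axis=0) for j in range(k)])

# Both partition centroids land near the overall mean of the data, i.e.
# close to the big blob and close to each other, as the cons warn.
```

Because each random partition mixes points from both blobs, its group means nearly coincide, whereas Forgy centroids can start in different blobs.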

18
Q

What are the pros and cons of the Forgy method?

A

Pros: spreads out centroids.

Cons: vulnerable to picking outliers as initial centers.

19
Q

What are the pros and cons of the Random Partition method?

A

Pros: Resistant to outliers.

Cons: centroids may be too close together, especially in uneven data.

20
Q

What is K-means++ and why is it preferred?

A

A smarter initialization method that spreads out initial centroids systematically, reducing the chance of poor clustering.
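A minimal sketch of the k-means++ seeding rule in numpy (the kmeanspp_init name is ours; libraries such as scikit-learn expose this as init='k-means++'):

```python
import numpy as np

def kmeanspp_init(X, k, rng):
    """k-means++ seeding sketch: pick the first centroid uniformly at random,
    then draw each new centroid with probability proportional to its squared
    distance from the nearest centroid chosen so far."""
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d2 = np.min([((X - c) ** 2).sum(1) for c in centroids], axis=0)
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)

rng = np.random.default_rng(0)
# Two made-up blobs: the squared-distance weighting makes it very likely
# that the two initial centroids come from different blobs.
X = np.vstack([rng.normal(0, 0.2, (30, 2)), rng.normal(5, 0.2, (30, 2))])
init = kmeanspp_init(X, k=2, rng=rng)
```

The weighting keeps the spread-out benefit of Forgy while making it improbable that two centroids start in the same dense region.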