Week 4: Clustering 1 Flashcards

1
Q

What is the formula for Euclidean distance?

A

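d(A, B) = sqrt( sum over i of (Ai - Bi)^2 ), i.e. the square root of the sum of squared differences between the corresponding coordinates of points A and B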
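
As a quick illustration of the formula, here is a minimal Python sketch. The function name `euclidean_distance` and the sample points are assumptions made for this example, not part of the original card.

```python
import math

def euclidean_distance(a, b):
    """Square root of the sum of squared coordinate differences."""
    if len(a) != len(b):
        raise ValueError("points must have the same number of coordinates")
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

# Classic 3-4-5 right triangle: the distance from (0, 0) to (3, 4) is 5.
print(euclidean_distance([0, 0], [3, 4]))  # 5.0
```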

2
Q

What are the 2 main types of clustering?

A

1) Hierarchical - repeatedly merge the 2 closest clusters (or points, during the first run) into 1 new cluster. Merging continues until a single cluster remains or a chosen number of clusters is reached, so the later steps produce a few big clusters of points
2) Partitioning - fix the number of clusters in advance, and assign each point to the cluster closest to it
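
The hierarchical merging step can be sketched as below. This is only one possible reading: it assumes "closest clusters" means the smallest distance between any pair of their points (single linkage), and the names `single_linkage` and `agglomerate` are hypothetical.

```python
import math

def single_linkage(c1, c2):
    """Distance between two clusters = distance of their closest pair of points (an assumption)."""
    return min(math.dist(p, q) for p in c1 for q in c2)

def agglomerate(points, target_clusters):
    """Start with one cluster per point and merge the 2 closest clusters until target_clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > target_clusters:
        best = None  # (distance, i, j) of the closest pair found so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = single_linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # the merged pair becomes 1 new cluster
        del clusters[j]
    return clusters

print(agglomerate([(0, 0), (0, 1), (10, 10), (10, 11)], 2))
```

A partitioning method would instead start from a fixed number of clusters and only reassign points between them.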

3
Q

What is the main purpose of assuming K unobserved clusters?

A

The points are grouped without using any identifying labels, so the grouping is not biased by prior assumptions and may reveal patterns you previously could not see

4
Q

When do you stop the K-means algorithm?

A

When the change in SSE (sum of squared errors) from one iteration to the next becomes small
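
A minimal sketch of this stopping rule, assuming the standard assign-then-update K-means loop. The function name `kmeans_until_sse_stable` and the parameters `tol` (what counts as a "small" change) and `max_iter` are hypothetical choices for illustration.

```python
import math

def kmeans_until_sse_stable(points, centroids, tol=1e-6, max_iter=100):
    """Iterate K-means and stop once SSE improves by less than tol."""
    centroids = list(centroids)
    labels, prev_sse, current_sse = [], math.inf, math.inf
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # an empty cluster keeps its old centroid
                centroids[k] = tuple(sum(c) / len(members) for c in zip(*members))
        # SSE = sum of squared distances from each point to its centroid.
        current_sse = sum(
            math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels)
        )
        if prev_sse - current_sse < tol:  # the change in SSE is small: stop
            break
        prev_sse = current_sse
    return centroids, labels, current_sse

points = [(0.0,), (1.0,), (10.0,), (11.0,)]
print(kmeans_until_sse_stable(points, [(0.0,), (1.0,)]))
```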

5
Q

What are some challenges of the K-means algorithm?

A
  • Minimising SSE is computationally very difficult.
  • K-means algorithm is not guaranteed to converge to the optimal solution, and different starting seeds might produce different solutions

Because of these problems, analysts might want to run K-means multiple times with different starting seeds and select the solution with the smallest value of SSE.
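
The multiple-seeds idea can be sketched as follows. This assumes starting centroids are drawn at random from the data and that each run uses a fixed number of iterations. The names `run_kmeans` and `best_of_restarts` and the sample data are hypothetical.

```python
import math
import random

def nearest(p, centroids):
    """Index of the centroid closest to point p."""
    return min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))

def run_kmeans(points, k, seed, n_iter=50):
    """One K-means run from a seeded random start; returns (sse, centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # starting centroids depend on the seed
    for _ in range(n_iter):
        labels = [nearest(p, centroids) for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    labels = [nearest(p, centroids) for p in points]
    sse = sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))
    return sse, centroids, labels

def best_of_restarts(points, k, seeds):
    """Run K-means once per seed and keep the run with the smallest SSE."""
    return min((run_kmeans(points, k, s) for s in seeds), key=lambda run: run[0])

data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (20, 0), (20, 1), (21, 0)]
best_sse, best_centroids, best_labels = best_of_restarts(data, 3, seeds=range(10))
print(best_sse, best_centroids)
```

Selecting by SSE only compares the runs that were tried, so it does not guarantee the optimal solution either.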

6
Q

Since distance measures can only be computed for numerical variables, how do we handle categorical variables?

A
  • Binary or two-value variable: code it as a single 0-1 variable
  • Variable with k categories: code it as (k-1) dummy (0-1) variables
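
A minimal sketch of the (k-1) dummy coding, assuming the first category in sorted order is treated as the baseline (the row of all zeros). The function name `dummy_encode` and the colour data are made up for illustration.

```python
def dummy_encode(values):
    """Encode a categorical variable with k categories as (k-1) 0-1 dummy variables."""
    categories = sorted(set(values))
    kept = categories[1:]  # the first category is the baseline and gets no column
    return [[1 if v == c else 0 for c in kept] for v in values]

colors = ["red", "green", "blue", "green"]
# Categories sorted: blue (baseline), green, red -> columns are [green, red]
print(dummy_encode(colors))  # [[0, 1], [1, 0], [0, 0], [1, 0]]
```

A two-value variable is the special case k = 2, which yields a single 0-1 column.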