Cluster Analysis P2 Flashcards

1
Q

What is the cosine similarity distance measure used for?

A

Comparing high-dimensional text data or sparse vectors; it measures the angle between two vectors rather than their magnitude
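
A minimal sketch of cosine similarity in plain Python, added for illustration. The function name and the small word-count vectors are hypothetical, not part of the original card.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Hypothetical word-count vectors for three documents (mostly zeros, i.e. sparse).
doc_a = [2, 0, 0, 1, 0]
doc_b = [4, 0, 0, 2, 0]  # same direction as doc_a, twice the length
doc_c = [0, 3, 0, 0, 1]  # shares no words with doc_a

sim_ab = cosine_similarity(doc_a, doc_b)  # close to 1: same direction
sim_ac = cosine_similarity(doc_a, doc_c)  # 0: no overlap
```

Because only the direction matters, a long document and a short one with the same word proportions score as nearly identical.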

2
Q

What is the Manhattan distance measure used for?

A

Distances where movement is limited to horizontal/vertical steps (a grid); it sums the absolute differences along each dimension
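
A short sketch comparing Manhattan distance with straight-line (Euclidean) distance. The function names and the two example points are assumptions chosen for illustration.

```python
import math

def manhattan_distance(a, b):
    """Sum of absolute coordinate differences (grid-style steps)."""
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean_distance(a, b):
    """Straight-line distance, shown only for comparison."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

p, q = (1, 2), (4, 6)
d_manhattan = manhattan_distance(p, q)  # 3 steps across + 4 steps up = 7
d_euclidean = euclidean_distance(p, q)  # sqrt(3**2 + 4**2) = 5.0
```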

3
Q

How is clustering subjective?

A

Different types of clustering methods can reach different conclusions

4
Q

How is clustering ambiguous?

A

The same data can give different results depending on the settings/algorithm

5
Q

What are the 2 methods in the partitioning approach?

A
  1. Global optimal
  2. Heuristic methods
6
Q

What is global optimal?

A

Exhaustively enumerate all partitions
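
To see why exhaustive enumeration is rarely practical, the sketch below counts how many partitions there are. It assumes "all partitions" means every way to split n distinct objects into k non-empty clusters, which is the Stirling number of the second kind. The function name is hypothetical.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def count_partitions(n, k):
    """Number of ways to split n distinct objects into k non-empty clusters."""
    if n == k:
        return 1  # every object in its own cluster (also covers n == k == 0)
    if k == 0 or k > n:
        return 0
    # The n-th object either joins one of the k existing clusters
    # or forms a new cluster on its own.
    return k * count_partitions(n - 1, k) + count_partitions(n - 1, k - 1)

small = count_partitions(4, 2)    # 7
medium = count_partitions(10, 3)  # 9330
```

The count grows very quickly with n, which is the motivation for heuristic methods.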

7
Q

Name the two heuristic methods…

A
  1. K means
  2. K medoids
8
Q

What is k means?

A

Each cluster is represented by the cluster center (mean)

9
Q

What is k medoids?

A

Each cluster is represented by one of the objects in the cluster
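
A small sketch contrasting the two kinds of cluster representative. It assumes the medoid is the object with the smallest total Euclidean distance to the other objects in its cluster. The function names and the example points are hypothetical.

```python
import math

def mean_point(points):
    """Coordinate-wise mean; usually not one of the original objects."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def medoid(points):
    """The object with the smallest total distance to all objects in the cluster."""
    return min(points, key=lambda c: sum(math.dist(c, p) for p in points))

# Hypothetical cluster with one outlier at (10, 10).
cluster = [(1, 1), (2, 1), (1, 2), (2, 2), (10, 10)]

center = mean_point(cluster)   # pulled toward the outlier
representative = medoid(cluster)  # always an actual member of the cluster
```

Here the mean lands at a location where no object sits, while the medoid stays on a real object near the bulk of the cluster.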

10
Q

What are the four steps of k means clustering?

A
  1. Partition the data into k clusters randomly
  2. Assign each point to the cluster with the closest centroid, using Euclidean distance
  3. Recompute the centroids
  4. Repeat steps 2 and 3; stop when the cluster assignments are unchanged
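
The steps above can be sketched in plain Python as follows. This is a minimal illustration, not a reference implementation. The function names, the `seed` and `max_iter` parameters, and the way an empty cluster is re-seeded are all assumptions that the card does not specify.

```python
import math
import random

def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def k_means(points, k, seed=0, max_iter=100):
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    rng = random.Random(seed)
    # Step 1: random initial partition. Labels 0..k-1 each appear at least
    # once so that no cluster starts out empty.
    labels = list(range(k)) + [rng.randrange(k) for _ in points[k:]]
    rng.shuffle(labels)
    for _ in range(max_iter):  # max_iter is a safety cap (assumption)
        # Step 3 (also run once up front): compute the centroid of each cluster.
        centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # Assumption: an empty cluster is re-seeded with a random point.
            centroids.append(centroid(members) if members else rng.choice(points))
        # Step 2: assign each point to the closest centroid (Euclidean distance).
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Step 4: stop when the assignment no longer changes.
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids

# Hypothetical data: two well-separated groups of three points.
points = [(1, 1), (1, 2), (2, 1), (10, 10), (10, 11), (11, 10)]
labels, centroids = k_means(points, k=2)
```

Which group receives label 0 and which receives label 1 depends on the random start, so only the grouping is meaningful, not the label values.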
11
Q

What are the two common options for normalization?

A
  1. Convert to z scores
  2. Rescale to 0 - 1
12
Q

How would you convert to a z score?

A

Subtracting the mean and dividing by the standard deviation
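
A minimal sketch of z-score conversion using Python's standard `statistics` module. The card does not say whether to use the population or the sample standard deviation, so the population version is assumed by default here. The function name, the `sample` flag, and the data are hypothetical.

```python
import statistics

def z_scores(values, sample=False):
    """Subtract the mean, then divide by the standard deviation."""
    mean = statistics.fmean(values)
    # Assumption: population standard deviation unless sample=True.
    sd = statistics.stdev(values) if sample else statistics.pstdev(values)
    if sd == 0:
        raise ValueError("all values are identical, so z scores are undefined")
    return [(v - mean) / sd for v in values]

data = [2, 4, 4, 4, 5, 5, 7, 9]  # mean 5, population standard deviation 2
scaled = z_scores(data)          # first value: (2 - 5) / 2 = -1.5
```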

13
Q

How would you rescale to 0-1?

A

Subtracting the min value and then dividing by the range (max-min)

Xnorm = (x-min) / (max - min)
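
The formula above can be sketched in Python like this. The function name and the sample values are assumptions for illustration.

```python
def min_max_scale(values):
    """Rescale values to the 0-1 range: (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("range is zero, so the values cannot be rescaled")
    return [(v - lo) / (hi - lo) for v in values]

rescaled = min_max_scale([10, 20, 15, 30])  # [0.0, 0.5, 0.25, 1.0]
```

The smallest value always maps to 0 and the largest to 1.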

14
Q

What are the 4 strengths of k means?

A
  1. Simplicity & efficiency
  2. Scalable
  3. Flexible
  4. Interpretability
15
Q

How is k means simple and efficient?

A

Straightforward to implement and computationally efficient

16
Q

How is k means scalable?

A

Works well with large datasets and runs fast

17
Q

How is k means flexible?

A

Works well when clusters are roughly spherical and evenly sized

18
Q

How is k means interpretable?

A

Easy to understand & explain

19
Q

What are the 6 weaknesses of using k means?

A
  1. Input number of clusters
  2. Initialization sensitivity
  3. Outlier sensitivity
  4. Magnitude sensitivity
  5. Irregular clusters
  6. Distance metric dependency
20
Q

What is the most common measure when evaluating k means clusters?

A

SSE (sum of squared errors)
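
A short sketch of how SSE could be computed for a finished clustering, assuming the "error" of a point is its Euclidean distance to the centroid of its assigned cluster. The function name and the example data are hypothetical.

```python
def sse(points, labels, centroids):
    """Sum of squared distances from each point to its cluster's centroid."""
    total = 0.0
    for point, label in zip(points, labels):
        center = centroids[label]
        total += sum((x - c) ** 2 for x, c in zip(point, center))
    return total

# Hypothetical clustering: two clusters of two points each.
points = [(1, 1), (3, 1), (10, 10), (12, 10)]
labels = [0, 0, 1, 1]
centroids = [(2.0, 1.0), (11.0, 10.0)]

error = sse(points, labels, centroids)  # every point is 1 unit from its centroid, so 4.0
```

A lower SSE means the points sit closer to their centroids, that is, the clusters are tighter.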