Cluster Analysis P2 Flashcards
What is the cosine similarity distance measure?
Measures similarity as the cosine of the angle between two vectors; used for high-dimensional text data or sparse vectors
What is the manhattan distance measure used for?
Measures distance as the sum of absolute differences along each dimension, like moving in horizontal/vertical steps on a grid
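A minimal NumPy sketch of both distance measures above (function names are illustrative, not from the cards):

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between a and b: 1.0 means same direction,
    # 0.0 means orthogonal (common for sparse text vectors with no shared terms).
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def manhattan_distance(a, b):
    # Sum of absolute differences along each dimension ("city block" steps).
    return np.sum(np.abs(a - b))

a = np.array([1.0, 0.0, 2.0])
b = np.array([0.0, 1.0, 2.0])
print(cosine_similarity(a, b))   # ~0.8
print(manhattan_distance(a, b))  # 2.0
```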
How is clustering subjective?
Different types of clustering methods can lead to different conclusions on the same data
How is clustering ambiguous?
The same data can produce different results depending on the settings/algorithm used
What are the 2 methods in the partitioning approach?
- Global optimal
- Heuristic methods
What is the global optimal method?
Exhaustively enumerate all possible partitions
Name the two heuristic methods…
- K means
- K medoids
What is k means?
Each cluster is represented by the cluster center (mean)
What is k medoids?
Each cluster is represented by one of the objects in the cluster
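A small sketch contrasting the two representatives, assuming a single cluster of 2-D points (the variable names are illustrative):

```python
import numpy as np

cluster = np.array([[1.0, 1.0], [2.0, 1.0], [10.0, 1.0]])  # one outlier

# K-means representative: the mean (may not be an actual data point).
centroid = cluster.mean(axis=0)

# K-medoids representative: the member object with the smallest total
# Euclidean distance to all other members.
pairwise = np.linalg.norm(cluster[:, None, :] - cluster[None, :, :], axis=2)
medoid = cluster[pairwise.sum(axis=1).argmin()]

print(centroid)  # [4.33..., 1.0]  -- pulled toward the outlier
print(medoid)    # [2.0, 1.0]      -- an actual object, less affected
```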
What are the four steps of k means clustering?
- Partition the objects into k clusters randomly
- Assign each object to the cluster with the closest centroid (Euclidean distance)
- Recompute the centroids
- Stop when cluster assignments no longer change
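A minimal NumPy sketch of these four steps (function and parameter names are illustrative, and empty clusters are not handled):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    # Step 1: pick k initial centroids at random from the data.
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # Step 2: assign each point to the cluster with the closest centroid
        # (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 4: stop when cluster assignments no longer change.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 3: recompute each centroid as the mean of its assigned points.
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centroids
```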
What are the two common options for normalization?
- Convert to z scores
- Rescale to 0 - 1
How would you convert to a z-score?
Subtracting the mean and dividing by the standard deviation
Z = (x - mean) / std
How would you rescale to 0-1?
Subtracting the min value and then dividing by the range (max-min)
Xnorm = (x - min) / (max - min)
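A quick NumPy sketch of both normalization options, applied to a single feature column (the sample values are made up for illustration):

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 10.0])

# Z-score: subtract the mean, divide by the standard deviation.
z = (x - x.mean()) / x.std()

# Min-max rescaling to 0-1: subtract the min, divide by the range (max - min).
x_norm = (x - x.min()) / (x.max() - x.min())

print(z)       # mean ~0, std ~1
print(x_norm)  # [0.   0.25 0.5  1.  ]
```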
What are the 4 strengths of k means?
- Simplicity & efficiency
- Scalable
- Flexible
- Interpretability
How is k means simple and efficient?
Straightforward to implement and computationally efficient
How is k means scalable?
Handles large datasets well and is faster than many other clustering methods
How is k means flexible?
Works on clusters that are roughly spherical and evenly sized
How is k means interpretable?
Easy to understand & explain
What are the 6 weaknesses of using k means?
- Input number of clusters
- Initialization sensitivity
- Outlier sensitivity
- Magnitude sensitivity
- Irregular clusters
- Distance metric dependency
What is the most common measure when evaluating k means clusters?
SSE (sum of squared error)
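SSE adds up the squared distances between each object and its cluster centroid; lower SSE means tighter clusters for a fixed k. A short sketch, assuming the labels/centroids returned by a k-means run like the one above:

```python
import numpy as np

def sse(X, labels, centroids):
    # Sum of squared Euclidean distances from each point to its cluster centroid.
    return sum(
        np.sum((X[labels == j] - c) ** 2)
        for j, c in enumerate(centroids)
    )
```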