E6 Flashcards

Question 1

Q

What is clustering?

Answer

A

• Finding groups in data.
• Organizing data into groups such that there is:
(1) high similarity within each group,
(2) low similarity across the groups.

Question 2

Q

Is clustering the same as classification?

Answer

A

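No.

In classification, the class labels are already present in the data (e.g., blood type). In clustering there are no predefined labels, so the groups have to be discovered.
Different goals: to “understand” the data better (explore) and to organize the information we have.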

Question 3

Q

Distance measures

Answer

A

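• Euclidean distance
-> straight-line ("physical") distance between two data points

• Manhattan distance
-> taxicab distance: sum of the absolute differences along each coordinate

• Jaccard distance
-> treats two objects as sets of characteristics and measures how little they overlap (e.g., in text mining, whether two documents contain the same words)

• Cosine distance
-> 1 minus the cosine of the angle between two vectors (often used in text mining and recommender systems)

• Edit distance
-> Levenshtein metric: number of single-character edits needed to turn one string into another (e.g., autocorrect for spelling mistakes)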
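
A minimal sketch of the five measures above, using only the Python standard library. The function names are illustrative choices, not part of the card. Jaccard and cosine are written as distances (1 minus the similarity), and the cosine version assumes neither vector is all zeros.

```python
import math

def euclidean(a, b):
    # Straight-line distance between two equal-length points.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # Sum of absolute coordinate differences (taxicab distance).
    return sum(abs(x - y) for x, y in zip(a, b))

def jaccard_distance(s, t):
    # 1 - |intersection| / |union| for two collections treated as sets.
    s, t = set(s), set(t)
    if not s and not t:
        return 0.0  # convention chosen here: two empty sets are identical
    return 1 - len(s & t) / len(s | t)

def cosine_distance(a, b):
    # 1 - cos(angle between a and b); assumes both vectors are non-zero.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1 - dot / (norm_a * norm_b)

def edit_distance(s, t):
    # Levenshtein distance via dynamic programming, one row at a time.
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]
```

For example, euclidean((0, 0), (3, 4)) gives 5.0, manhattan((0, 0), (3, 4)) gives 7, and edit_distance("kitten", "sitting") gives 3.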

Question 4

Q

K-means clustering - how to?

Answer

A

(1) Select a proximity measure and specify the number of clusters (k).
(2) Initiate the process by selecting k initial centroids.
(3) Assign each data point to the “nearest” centroid to form the clusters.
(4) Calculate the new centroid of each cluster.
(5) Repeat steps 3 and 4 until the stopping criteria are fulfilled.
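
The procedure above can be sketched as follows. This is one possible reading, not the only one: it assumes Euclidean distance as the proximity measure, random data points as initial centroids, and "assignments stopped changing" (or max_iter reached) as the stopping criterion. The names kmeans, max_iter and seed are illustrative.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """points: list of equal-length tuples. Returns (centroids, assignments)."""
    rng = random.Random(seed)
    # Select k initial centroids (here: k distinct random data points).
    centroids = rng.sample(points, k)
    assignments = None
    for _ in range(max_iter):
        # Assign each point to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Stopping criterion assumed here: no assignment changed.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(
                    sum(coord) / len(members) for coord in zip(*members)
                )
    return centroids, assignments
```

Calling kmeans(points, 2) on two well-separated groups of points should return one centroid near each group, although which group gets label 0 depends on the random initialization.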

Question 5

Q

Strengths k-means

Answer

A

• Simple
• Efficient

Question 6

Q

Weaknesses k-means

Answer

A

• The value of k must be chosen in advance: how to determine it?
• Converges to a locally optimal solution, so the result depends on the initial centroids.
• Favors globular/spherical clusters.
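
A small illustration of the local-optimum weakness, under stated assumptions: Euclidean distance, hand-picked starting centroids, and the sum of squared distances to the assigned centroid (SSE) as the quality measure. SSE is a common choice but is not named on the card, and the function name lloyd is illustrative.

```python
import math

def lloyd(points, centroids, max_iter=100):
    """Run k-means from the given starting centroids. Returns (labels, sse)."""
    centroids = list(centroids)
    for _ in range(max_iter):
        # Assign each point to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Recompute centroids; an empty cluster keeps its old centroid.
        new = []
        for c in range(len(centroids)):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                new.append(tuple(sum(v) / len(members) for v in zip(*members)))
            else:
                new.append(centroids[c])
        if new == centroids:  # centroids stopped moving
            break
        centroids = new
    sse = sum(math.dist(p, centroids[l]) ** 2 for p, l in zip(points, labels))
    return labels, sse

# Four corners of a wide rectangle, k = 2.
corners = [(0, 0), (0, 1), (4, 0), (4, 1)]
good_labels, good_sse = lloyd(corners, [(0, 0.5), (4, 0.5)])  # left vs right
bad_labels, bad_sse = lloyd(corners, [(2, 0), (2, 1)])        # bottom vs top
```

Both runs stop immediately because their starting centroids are already stable, yet the left/right split has SSE 1.0 while the bottom/top split has SSE 16.0. The second run is stuck in a worse local optimum.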

Question 7

Q

Hierarchical clustering

Answer

A

Creates a nested collection of ways to group the points: from every point in its own cluster up to all points in a single cluster.

Question 8

Q

Output of hierarchical clustering

Answer

A

Dendrograms
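
A dendrogram is a drawing of a merge history: which clusters were joined, and at what distance. The sketch below produces that history under assumptions the cards do not state: the agglomerative (bottom-up) variant, single linkage, and Euclidean distance. The function name and the id scheme are illustrative.

```python
import math

def single_linkage_merges(points):
    """Return a list of (cluster_a, cluster_b, distance) merges.

    Original points have ids 0..n-1; each merge creates a new id n, n+1, ...
    """
    clusters = {i: [i] for i in range(len(points))}  # id -> member indices
    merges = []
    next_id = len(points)
    while len(clusters) > 1:
        best = None
        ids = sorted(clusters)
        # Find the closest pair of clusters (single linkage: the distance
        # between clusters is the distance between their two closest members).
        for x in range(len(ids)):
            for y in range(x + 1, len(ids)):
                a, b = ids[x], ids[y]
                d = min(
                    math.dist(points[i], points[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        # Merge the pair into a new cluster and record the step.
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, d))
        next_id += 1
    return merges
```

Each tuple corresponds to one junction in the dendrogram, and its distance is the height at which the two branches join.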

Question 9

Q

Strengths hierarchical clustering

Answer

A

• Clusters can be of any size and shape.
• Does not require prespecifying the number of clusters.

Question 10

Q

Weaknesses hierarchical clustering

Answer

A

• Still need to decide where to split (cut) the hierarchy to obtain the final clusters.
• Computationally inefficient.
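
A sketch of both weaknesses, assuming single linkage and Euclidean distance (neither is specified on the card). For single linkage, cutting the hierarchy at a height is equivalent to joining every pair of points no farther apart than that height, so the cut can be computed directly. The function name and threshold parameter are illustrative.

```python
import math

def cut_single_linkage(points, threshold):
    """Label points by cluster after cutting a single-linkage hierarchy."""
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Every pair of points is compared, so the work grows quadratically
    # with the number of points.
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) <= threshold:
                parent[find(i)] = find(j)

    # Relabel roots as 0, 1, 2, ... in order of first appearance.
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(len(points))]
```

With points [(0, 0), (0, 1), (5, 5), (5, 7)], a threshold of 1.5 gives three clusters, 3 gives two, and 10 gives one. The same hierarchy yields different answers depending on where it is cut.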

(10 cards)