Session 6.2 Flashcards

Question 1

Q

Clustering - what is it?

Answer

A

• Finding groups in data.
• Organising data into groups such that there is:
(1) high similarity within each group,
(2) low similarity across the groups.

Question 2

Q

Is clustering the same as classification?

Answer

A

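No.

• In classification, class labels are given directly in the data (e.g., blood type); in clustering, there are no predefined labels and the groups must be discovered.

• Different goals: clustering aims to “understand” the data better (explore) and to organise the information we have.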

Question 3

Q

Distance measures

Answer

A

Euclidean distance
Manhattan distance
Jaccard distance
Cosine distance
Edit distance (Levenshtein metric)

Question 4

Q

Jaccard distance used when

Answer

A

The possession of a common characteristic between two items is important, but the common absence of a characteristic is not.

• Especially useful when dealing with problems that involve (large) sets of
characteristics that may not be ‘symmetrically’ important.

• Text mining: compare whether two documents contain the same word.
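As a rough illustration of the point above, the sketch below computes Jaccard distance as 1 minus the ratio of shared items to all items present in either set. The function name `jaccard_distance` and the two toy "documents" are assumptions made for this example, not part of the original card. Characteristics absent from both sets never enter the calculation, which is the asymmetry the card describes.

```python
def jaccard_distance(a: set, b: set) -> float:
    """Return 1 - |a ∩ b| / |a ∪ b| for two sets of characteristics."""
    union = a | b
    if not union:
        # Assumption for this sketch: two empty sets are treated as identical.
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Hypothetical documents, reduced to the set of words each contains.
doc1 = {"the", "cat", "sat"}
doc2 = {"the", "cat", "ran"}

# 2 shared words out of 4 distinct words overall.
print(jaccard_distance(doc1, doc2))
```

Words that appear in neither document have no effect on the result, however large the vocabulary is.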

Question 5

Q

Cosine distance often encountered in

Answer

A

text mining or recommendation engines

Question 6

Q

Edit distance (Levenshtein metric)

Answer

A

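• The minimum number of single-character edits (insertions, deletions, substitutions) needed to turn one string into another.
• Used in text mining applications.
• Applications: autocorrect (spelling mistakes).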
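To make the measure concrete, here is a minimal sketch of the usual dynamic-programming approach to Levenshtein distance, assuming every insertion, deletion, and substitution costs 1. The function name `edit_distance` and the example words are illustrative choices, not taken from the card.

```python
def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance between s and t, assuming unit cost per edit."""
    # prev[j] holds the distance between the processed prefix of s and t[:j].
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]  # distance from s[:i] to the empty string
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(
                prev[j] + 1,         # delete cs
                curr[j - 1] + 1,     # insert ct
                prev[j - 1] + cost,  # substitute (or keep if equal)
            ))
        prev = curr
    return prev[-1]


# A misspelling is one edit away from the intended word.
print(edit_distance("recieve", "receive"))
print(edit_distance("kitten", "sitting"))
```

An autocorrect feature could, under this view, suggest dictionary words with a small edit distance from what was typed.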

Question 7

Q

Euclidean distance

Answer

A

The most common geometric distance measure.
Suited to numeric datasets whose attributes are similar in measurement type (similar scale) and units.
Can be understood as physical distance between two data points.
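A small sketch of the calculation, assuming each data point is a sequence of numeric attributes of equal length. The function name `euclidean_distance` and the sample points are made up for illustration.

```python
import math


def euclidean_distance(p, q) -> float:
    """Straight-line distance between two points with numeric attributes."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of attributes")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# Hypothetical 2-D points forming a 3-4-5 right triangle.
print(euclidean_distance((0, 0), (3, 4)))
```

Because the differences are squared, an attribute measured on a much larger scale would dominate the result, which is why similar scales and units matter.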

Question 8

Q

Manhattan distance

Answer

A

The sum of the absolute differences between pairwise attributes.
The “taxicab” distance.
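The definition translates almost directly into code. This is a minimal sketch, assuming two points with the same number of numeric attributes; the name `manhattan_distance` and the sample points are illustrative.

```python
def manhattan_distance(p, q):
    """Sum of absolute differences between corresponding attributes."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of attributes")
    return sum(abs(a - b) for a, b in zip(p, q))


# Hypothetical grid positions: 3 blocks across plus 4 blocks up.
print(manhattan_distance((0, 0), (3, 4)))
```

For the same pair of points the straight-line (Euclidean) distance would be 5, while a taxi restricted to a grid travels 7.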

Question 9

Q

Cosine distance

Answer

A

Named after how it is measured: it is based on the cosine of the angle between two vectors (commonly defined as 1 minus that cosine), so it compares direction rather than magnitude.
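One common convention, assumed here, defines cosine distance as 1 minus the cosine of the angle between the vectors. The sketch below follows that convention; the function name `cosine_distance` and the toy vectors are hypothetical.

```python
import math


def cosine_distance(u, v) -> float:
    """1 - cos(angle between u and v), under the convention assumed above."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # The angle is undefined for a zero vector.
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_u * norm_v)


# Hypothetical word-count vectors: same direction, different lengths.
short_doc = [1, 2, 0]
long_doc = [2, 4, 0]
unrelated = [0, 0, 5]

print(cosine_distance(short_doc, long_doc))   # close to 0
print(cosine_distance(short_doc, unrelated))  # orthogonal vectors
```

Scaling a vector does not change its direction, so a short and a long document with the same word proportions come out as near-identical.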

(9 cards)