Session 6.2 Flashcards
Clustering - what is it?
• Finding groups in data.
• Organising data into groups such that there is:
(1) high similarity within each group,
(2) low similarity across the groups.
Is clustering the same as classification?
No
• In classification, class labels are already present in the data (e.g., blood type); in clustering, the groups are not known in advance.
• Different goals: clustering aims to “understand” the data better (explore) and to organise the information we have.
Distance measures
- Euclidean distance
- Manhattan distance
- Jaccard distance
- Cosine distance
- Edit distance (Levenshtein metric)
Jaccard distance is used when
The possession of a common characteristic between two items is important, but the common absence of a characteristic is not.
• Especially useful when dealing with problems that involve (large) sets of
characteristics that may not be ‘symmetrically’ important.
• Text mining: compare whether two documents contain the same words.
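A minimal sketch of the Jaccard distance on two word sets, to make the "shared presence counts, shared absence does not" idea concrete. The function name `jaccard_distance` and the example documents are illustrative assumptions, not part of the session material.

```python
def jaccard_distance(a: set, b: set) -> float:
    """Return 1 - |A ∩ B| / |A ∪ B| for two sets of characteristics."""
    union = a | b
    if not union:
        # Assumed convention: two empty sets are treated as identical.
        return 0.0
    # Only items present in at least one set are counted;
    # characteristics absent from both never enter the calculation.
    return 1.0 - len(a & b) / len(union)


# Hypothetical documents represented as sets of words.
doc1 = {"data", "mining", "cluster"}
doc2 = {"data", "cluster", "distance", "metric"}

# 2 shared words out of 5 distinct words overall.
print(jaccard_distance(doc1, doc2))
```

Words that appear in neither document have no effect on the result, which is why this measure suits large, sparse sets of characteristics.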
Cosine distance is often encountered in
text mining or recommendation engines
Edit distance (Levenshtein metric)
- Text mining applications.
- Example application: autocorrect (spelling mistakes).
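The Levenshtein distance counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. The sketch below uses the standard dynamic-programming approach; the function name `levenshtein` is an illustrative choice, and a real autocorrect system would likely rely on an optimised library instead.

```python
def levenshtein(s: str, t: str) -> int:
    """Minimum number of single-character edits turning s into t."""
    # prev[j] holds the distance between the processed prefix of s
    # and the first j characters of t.
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]  # distance from s[:i] to the empty string
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(
                prev[j] + 1,         # delete cs from s
                curr[j - 1] + 1,     # insert ct into s
                prev[j - 1] + cost,  # substitute cs with ct (or keep it)
            ))
        prev = curr
    return prev[-1]


# Classic example: kitten -> sitten -> sittin -> sitting (3 edits).
print(levenshtein("kitten", "sitting"))
```

For spelling correction, a candidate word with a small distance to the typed word is a plausible suggestion.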
Euclidean distance
- The most common geometric distance measure.
- Suited to numeric datasets whose attributes are similar in measurement type (similar scale) and units.
- Can be understood as physical distance between two data points.
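As a rough illustration, the Euclidean distance is the square root of the sum of squared attribute differences. The function name `euclidean_distance` and the sample points are assumptions made for this sketch.

```python
import math


def euclidean_distance(p, q):
    """Straight-line distance between two equal-length numeric points."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of attributes")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# The 3-4-5 right triangle: the distance from (0, 0) to (3, 4) is 5.
print(euclidean_distance((0, 0), (3, 4)))
```

Because differences are squared, an attribute measured on a much larger scale dominates the result, which is why attributes should share similar scale and units.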
Manhattan distance
- The sum of the absolute differences between the corresponding attributes of two data points.
- The “taxicab” distance.
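A short sketch of the same idea in code, assuming two points given as equal-length numeric sequences; the function name `manhattan_distance` is illustrative.

```python
def manhattan_distance(p, q):
    """Sum of absolute attribute-wise differences between two points."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of attributes")
    return sum(abs(a - b) for a, b in zip(p, q))


# A taxi on a street grid travels 3 blocks east and 4 blocks north.
print(manhattan_distance((0, 0), (3, 4)))
```

For the same pair of points the straight-line (Euclidean) distance is 5, while the grid route sketched here covers 7 blocks.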
Cosine distance
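The name refers to the method of measurement: it is based on the cosine of the angle between two vectors (commonly defined as 1 minus that cosine), so it compares direction rather than magnitude.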
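A minimal sketch, assuming the common definition of cosine distance as 1 minus the cosine similarity of two vectors. The function name `cosine_distance` and the example vectors are illustrative.

```python
import math


def cosine_distance(u, v):
    """1 - cos(angle between u and v), for equal-length numeric vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # The angle is undefined when either vector has zero length.
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


# Perpendicular vectors: the cosine of the angle is 0, so the distance is 1.
print(cosine_distance((1, 0), (0, 1)))

# Same direction, different length: the distance is approximately 0.
print(cosine_distance((1, 2), (2, 4)))
```

Since only the angle matters, a long document and a short one with the same word proportions end up close together, which helps explain the measure's use in text mining and recommendation engines.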