Chapter 2.4: Measures of Similarity and Dissimilarity Flashcards
similarity
The similarity between two objects is a numerical measure of the degree to which the two objects are alike.
Similarities are higher for pairs of objects that are more alike.
Similarities are usually non-negative and often range from 0 (no similarity) to 1 (complete similarity).
dissimilarity
The dissimilarity between two objects is a numerical measure of the degree to which the two objects are different.
Dissimilarities are lower for more similar pairs of objects.
3 Properties of distance metrics
- Positivity
- Symmetry
- Triangle inequality
3 Properties of distance metrics
Positivity
- d(x, y) ≥ 0 for all x and y
- d(x, y) = 0 if and only if x = y
3 Properties of distance metrics
Symmetry
d(x, y) = d(y, x) for all x and y
3 Properties of distance metrics
Triangle inequality
d(x, z) ≤ d(x, y) + d(y, z)
for all points x, y and z
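The three properties can be spot-checked numerically on a small set of points. The sketch below is illustrative only: the helper name satisfies_metric_properties, the sample points, and the tolerance are assumptions, and passing the check on a few points is evidence, not a proof, that a function is a metric.

```python
import math
from itertools import product

def euclidean(x, y):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def squared_euclidean(x, y):
    """Squared Euclidean distance (included as a contrast case)."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def satisfies_metric_properties(d, pts, tol=1e-12):
    """Check positivity, symmetry, and the triangle inequality on pts.

    Hypothetical helper: it only tests the given sample points.
    """
    for x, y in product(pts, repeat=2):
        if d(x, y) < 0:                      # positivity: d(x, y) >= 0
            return False
        if (d(x, y) == 0) != (x == y):       # d(x, y) = 0 exactly when x = y
            return False
        if abs(d(x, y) - d(y, x)) > tol:     # symmetry
            return False
    for x, y, z in product(pts, repeat=3):
        if d(x, z) > d(x, y) + d(y, z) + tol:  # triangle inequality
            return False
    return True

# Sample points chosen for illustration only.
points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0), (1.0, -2.0)]
```

On these points, Euclidean distance passes all three checks, while squared Euclidean distance fails the triangle inequality (100 > 25 + 25 for the three collinear points).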
2 typical properties of similarities
If s(x, y) is the similarity between points x and y:
- s(x, y) = 1 only if x = y (where 0 ≤ s(x, y) ≤ 1)
- s(x, y) = s(y, x) for all x and y (symmetry)
Similarity coefficients
Similarity measures between objects that contain only binary attributes are called similarity coefficients.
Simple Matching Coefficient
A similarity coefficient defined as:
SMC = number of matching attribute values / number of attributes
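A minimal sketch of the SMC for two equal-length binary vectors. The function name and the sample vectors are assumptions made for illustration.

```python
def simple_matching_coefficient(x, y):
    """SMC = (number of positions where x and y agree) / (number of attributes)."""
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("x and y must be non-empty and of equal length")
    matches = sum(1 for a, b in zip(x, y) if a == b)  # counts both 1-1 and 0-0 matches
    return matches / len(x)

# Illustrative binary vectors: they agree in 7 of 10 positions (all 0-0 matches).
x = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
smc = simple_matching_coefficient(x, y)
```

Here the SMC is 0.7 even though the two objects share no presences, because 0-0 matches count as agreement.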
Jaccard Coefficient
Used when we have a binary dataset (all attributes are either 0 or 1).
J = number of matching presences / number of attributes not involved in 00 matches
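= f11 / (f11 + f10 + f01)
where f11 = number of attributes that are 1 in both objects, f10 = number that are 1 in x and 0 in y, f01 = number that are 0 in x and 1 in y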
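A minimal sketch of the Jaccard coefficient for binary vectors. The function name is an assumption, and returning 0.0 when both vectors are all zeros is a convention chosen here for illustration, since the formula is undefined in that case.

```python
def jaccard_coefficient(x, y):
    """J = f11 / (f11 + f10 + f01) for equal-length 0/1 vectors."""
    if len(x) != len(y):
        raise ValueError("x and y must have equal length")
    f11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)  # 1 in both
    f10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)  # 1 in x only
    f01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)  # 1 in y only
    denom = f11 + f10 + f01
    if denom == 0:
        return 0.0  # assumed convention: no presences in either vector
    return f11 / denom

# Illustrative vectors: f11 = 2, f10 = 1, f01 = 1.
jx = [1, 1, 0, 0, 1]
jy = [1, 0, 0, 1, 1]
j = jaccard_coefficient(jx, jy)
```

Unlike the SMC, 0-0 matches are ignored, which suits sparse data where shared absences carry little information.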
Cosine Similarity
cos(x, y) = (x’y) / (||x|| ||y||)
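A minimal sketch of cosine similarity in plain Python. The function name is an assumption, and raising an error for a zero-length vector is one possible choice, since the formula is undefined there.

```python
import math

def cosine_similarity(x, y):
    """cos(x, y) = (x . y) / (||x|| * ||y||) for equal-length numeric vectors."""
    if len(x) != len(y):
        raise ValueError("x and y must have equal length")
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_x * norm_y)

# Orthogonal vectors have cosine 0; vectors pointing the same way have cosine 1.
cos_orthogonal = cosine_similarity([1, 0], [0, 1])
cos_parallel = cosine_similarity([1, 2, 3], [2, 4, 6])
```

Because both vectors are divided by their lengths, the measure depends only on direction, not magnitude.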
Extended Jaccard Coefficient
EJ = (x’y) / (x’x + y’y - x’y)
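A minimal sketch of the extended Jaccard coefficient. The function name and the sample vectors are assumptions. For 0/1 vectors the dot products reduce to counts, so the result matches the ordinary Jaccard coefficient.

```python
def extended_jaccard(x, y):
    """EJ = (x . y) / (x . x + y . y - x . y) for equal-length numeric vectors."""
    if len(x) != len(y):
        raise ValueError("x and y must have equal length")
    xy = sum(a * b for a, b in zip(x, y))
    xx = sum(a * a for a in x)
    yy = sum(b * b for b in y)
    denom = xx + yy - xy
    if denom == 0:
        raise ValueError("extended Jaccard is undefined when both vectors are zero")
    return xy / denom

# On binary vectors: x . y = 2, x . x = 3, y . y = 3, so EJ = 2 / 4.
ej_binary = extended_jaccard([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```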
Correlation
corr(x, y) = covariance(x, y) / [standard_dev(x) × standard_dev(y)]
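A minimal sketch of the correlation formula in plain Python. The function name is an assumption, and the sample (n - 1) form of covariance and standard deviation is used here as one common choice; the normalizing factor cancels in the ratio either way.

```python
import math

def pearson_correlation(x, y):
    """corr(x, y) = covariance(x, y) / (standard_dev(x) * standard_dev(y))."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have equal length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / (n - 1))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / (n - 1))
    if std_x == 0 or std_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / (std_x * std_y)

# Illustrative data with an exact linear relation x = -3 * y.
corr_negative = pearson_correlation([-3, 6, 0, 3, -6], [1, -2, 0, -1, 2])
```

A perfect negative linear relationship gives a correlation of -1 (up to floating-point rounding), and a perfect positive one gives +1.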