Lecture 6: Clustering Flashcards
What is clustering?
Clustering groups similar observations together, where the similarities within the groups are greater than those between the groups.
Name five types of dissimilarity measures
Euclidean distance: Most common, one we’ll use
Cosine similarity: Measures the angle between two vectors and can be seen as a correlation measure
Hamming distance: Counts the number of positions at which two codes differ; useful for binary coding
Manhattan (city block) distance: Can’t take diagonal steps, only vertical and horizontal
Minkowski distance: Generalisation that includes the Manhattan and Euclidean distance measures as special cases
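A minimal R sketch of the five measures, assuming two made-up numeric vectors u and v and two made-up binary codes a and b (none of these values come from the lecture). The vectors are named u and v rather than p and q only because dist() uses p as the name of its Minkowski power argument.

```r
# Two illustrative numeric vectors (hypothetical values)
u <- c(1, 0, 2, 3)
v <- c(2, 1, 0, 3)
m <- rbind(u, v)  # dist() compares the rows of a matrix

euclidean <- dist(m, method = "euclidean")         # sqrt(sum((u - v)^2))
manhattan <- dist(m, method = "manhattan")         # sum(abs(u - v))
minkowski <- dist(m, method = "minkowski", p = 3)  # sum(abs(u - v)^3)^(1/3)

# Cosine similarity: dot product divided by the product of the vector lengths
cosine_sim <- sum(u * v) / (sqrt(sum(u^2)) * sqrt(sum(v^2)))

# Hamming distance: number of positions at which two binary codes differ
a <- c(1, 0, 1, 1, 0)
b <- c(1, 1, 0, 1, 0)
hamming <- sum(a != b)
```

Setting p = 1 in the Minkowski call reproduces the Manhattan distance and p = 2 reproduces the Euclidean distance.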
What classic mathematical formula is used to calculate Euclidean distance?
Pythagoras’ theorem
How is Pythagoras’ theorem used in Euclidean distance?
Vertical and horizontal lines can be drawn through the points p and q whose distance we want to measure. These give us the x and y values of each point and also form a right-angled triangle whose hypotenuse is the distance between p and q. The x values are denoted p1 and q1, while the y values are denoted p2 and q2. The formula is therefore the following:
d(p,q)^2 = (q1 - p1)^2 + (q2 - p2)^2
This is fine for two points in two dimensions. However, what do we do when working with a space of n dimensions or variables?
Use the generalised formula:
d(p, q) = sqrt( (p1 - q1)^2 + (p2 - q2)^2 + … + (pn - qn)^2 )
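The generalised formula can be written as a short R function. This is only a sketch; the name euclidean_distance is made up for illustration and is not a base R function.

```r
# Euclidean distance between two numeric vectors of equal length n
euclidean_distance <- function(p, q) {
  stopifnot(length(p) == length(q))
  sqrt(sum((p - q)^2))
}

# Two dimensions: a 3-4-5 right-angled triangle, so the distance is 5
d2 <- euclidean_distance(c(0, 0), c(3, 4))

# The same function works unchanged in n = 4 dimensions
d4 <- euclidean_distance(c(1, 2, 3, 4), c(2, 4, 6, 8))
```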
In a practical example, how would this work if we’re working with data such as flower characteristics:
sepal length, sepal width, petal length and petal width
Say the data for two flowers are as follows:
Petal width: 0.2; 1.4
Petal length: 1.4; 4.7
Sepal width: 3.5; 3.2
Sepal length: 5.1; 7
The first value of each pair belongs to p and the second to q, which gives the following calculation:
d(p, q) = sqrt( (0.2 - 1.4)^2 + (1.4 - 4.7)^2 + (3.5 - 3.2)^2 + (5.1 - 7)^2 ) = sqrt(16.03) ≈ 4.00
How would you calculate the distance between two data vectors in R?
Using the dist() function.
dist() requires a matrix or data frame as input, so we use row bind, rbind(), to create one:
rbind(p, q)
dist(rbind(p, q))
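A runnable version of this card, assuming p and q hold the two flowers from the worked example in the order sepal length, sepal width, petal length, petal width:

```r
# Two flowers; each vector is one observation
p <- c(5.1, 3.5, 1.4, 0.2)
q <- c(7.0, 3.2, 4.7, 1.4)

pq <- rbind(p, q)  # 2 x 4 matrix, one row per flower
d <- dist(pq)      # Euclidean distance is the default method

# Cross-check against the formula written out by hand
d_manual <- sqrt(sum((p - q)^2))
```

Both d and d_manual should agree with the hand calculation, sqrt(16.03).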
How would you compute the distance between all the datapoints in R?
dist(iris4D) to get a matrix of the distances between every pair of rows
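A sketch of the all-pairs case. It assumes iris4D is the four numeric columns of R's built-in iris data; the notes do not show how iris4D was created, so that definition is a guess.

```r
# Assumption: iris4D = the four measurement columns of the built-in iris data
iris4D <- iris[, 1:4]

irisDist <- dist(iris4D)            # class "dist": stores the lower triangle only
irisDistMat <- as.matrix(irisDist)  # full symmetric matrix, one row/column per flower

dim(irisDistMat)       # number of flowers by number of flowers
irisDistMat[1:3, 1:3]  # distances among the first three flowers
```

The diagonal is zero because every flower is at distance 0 from itself.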
What are the two main types of clustering methods?
K-means clustering
Hierarchical clustering
What steps are involved in k-means clustering? (4)
- Initialize k-means clustering
- Assign each point to the cluster of the nearest centroid
- Recompute new cluster centroids
- Repeat steps 2 and 3 until the cluster assignment is stable
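The four steps can be sketched by hand in R. This is an illustration, not how R's own kmeans() is implemented; the function name simple_kmeans and the max_iter argument are made up for this sketch, and it assumes there are at least k rows.

```r
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # Step 1: initialise by randomly assigning each point to one of k clusters
  # (rep_len() guarantees every cluster starts with at least one point)
  cluster <- sample(rep_len(seq_len(k), nrow(x)))
  centers <- matrix(NA_real_, nrow = k, ncol = ncol(x))
  for (iter in seq_len(max_iter)) {
    # Steps 1 and 3: centroid = column means of the points in each cluster
    # (a cluster that has become empty keeps its previous centroid)
    for (j in seq_len(k)) {
      members <- cluster == j
      if (any(members)) centers[j, ] <- colMeans(x[members, , drop = FALSE])
    }
    # Step 2: squared Euclidean distance from every point to every centroid,
    # then assign each point to the nearest centroid
    d2 <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    new_cluster <- max.col(-d2, ties.method = "first")
    # Step 4: stop once the assignment no longer changes
    if (all(new_cluster == cluster)) break
    cluster <- new_cluster
  }
  list(cluster = cluster, centers = centers)
}

set.seed(1)  # the initial assignment is random, so fix the seed
fit <- simple_kmeans(iris[, 1:4], k = 3)
table(fit$cluster)  # number of flowers in each cluster
```

Different seeds can give different final clusters, because the result depends on the random starting assignment.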
What is involved in initialising k-means clustering?
- Choose a k
- Randomly assign each point to one of the k clusters
- Compute the cluster centers (means), aka centroids
What is involved in calculating the centroids?
For each cluster, you calculate the mean of each column (variable) over the points assigned to that cluster
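A small sketch of the centroid calculation, assuming iris4D is the four numeric iris columns and using a random assignment to k = 3 clusters as a stand-in for the initialisation step (both are assumptions for illustration):

```r
iris4D <- iris[, 1:4]  # assumed definition of iris4D
k <- 3                 # assumed number of clusters

set.seed(42)
# Random initial assignment of each flower to one of the k clusters
cluster <- sample(rep_len(seq_len(k), nrow(iris4D)))

# One row per cluster, one column mean per variable
centroids <- aggregate(iris4D, by = list(cluster = cluster), FUN = mean)

# The same calculation for cluster 1 only
centroid1 <- colMeans(iris4D[cluster == 1, ])
```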
How do we calculate the k-means in R?
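Using the kmeans() function, e.g. irisKmeansFit <- kmeans(iris4D, centers = k), where k is the chosen number of clusters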
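A minimal runnable sketch of the call. The definition of iris4D, the choice of 3 clusters, and the seed are assumptions, since the notes do not state them.

```r
iris4D <- iris[, 1:4]  # assumed: the four numeric iris columns

set.seed(123)  # kmeans() uses a random start, so fix the seed for repeatability
irisKmeansFit <- kmeans(iris4D, centers = 3)  # 3 clusters is an assumed choice

irisKmeansFit  # prints cluster sizes, cluster means, clustering vector, sums of squares
```

When centers is a number, R's kmeans() picks that many random rows as the starting centroids, which differs slightly from the random-assignment initialisation described in these cards.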
Describe how to interpret the output of the cluster function
The cluster means give the centroids: the mean of each variable for each cluster
The clustering vector gives the cluster number for each datapoint
The within-cluster sum of squares gives the squared distance from each point to its centroid, summed over all the points in each cluster. Beneath this we can see the between-cluster sum of squares divided by the total sum of squares, which gives the percentage of explained variance (like an ANOVA)
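Each part of the printed output can also be pulled out of the fitted object by name. A sketch, again assuming the four numeric iris columns and 3 clusters:

```r
set.seed(123)
fit <- kmeans(iris[, 1:4], centers = 3)  # assumed data and number of clusters

fit$centers   # cluster means: one row per cluster, one column per variable
fit$cluster   # clustering vector: cluster number for each datapoint
fit$withinss  # within-cluster sum of squares, one value per cluster

# Proportion of variance explained, as reported at the bottom of the printout
explained <- fit$betweenss / fit$totss
```

The total sum of squares splits into the within and between parts, so fit$totss equals fit$tot.withinss + fit$betweenss.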