Lecture 6: Clustering Flashcards

Question 1

Q

What is clustering?

Answer

A

Clustering groups similar observations together, so that observations within a group are more similar to each other than to observations in other groups.

Question 2

Q

Name five types of dissimilarity measures

Answer

A

Euclidean distance: Most common, one we’ll use

Cosine similarity: Based on the angle between two vectors; can be seen as a correlation measure

Hamming distance: Counts the number of positions at which two codes differ; useful for binary coding

Manhattan (city block) distance: Can't take diagonal steps, only vertical and horizontal ones

Minkowski distance: Generalisation of the Manhattan (order 1) and Euclidean (order 2) distance measures
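
As a rough illustration, these measures can be computed in base R. The vectors below are made-up values, and the cosine and Hamming lines are hand-written one-liners rather than built-in functions.

```r
# Two made-up numeric vectors (illustrative values only)
a <- c(1, 2, 3)
b <- c(4, 6, 3)
m <- rbind(a, b)

# dist() supports these three measures through its method argument
euclid    <- as.numeric(dist(m, method = "euclidean"))         # sqrt(3^2 + 4^2 + 0^2)
manhattan <- as.numeric(dist(m, method = "manhattan"))         # |3| + |4| + |0|
minkowski <- as.numeric(dist(m, method = "minkowski", p = 3))  # (3^3 + 4^3 + 0^3)^(1/3)

# Cosine similarity, written out by hand
cosine <- sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# Hamming distance between two made-up binary codes:
# the number of positions where they differ
x <- c(1, 0, 1, 1, 0)
y <- c(1, 1, 1, 0, 0)
hamming <- sum(x != y)
```

Setting p = 1 or p = 2 in the Minkowski call reproduces the Manhattan and Euclidean results respectively.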

Question 3

Q

What classic mathematical formula is used to calculate Euclidean distance?

Answer

A

Pythagoras' theorem

Question 4

Q

How is Pythagoras' theorem used in Euclidean distance?

Answer

A

Vertical and horizontal lines can be drawn through the two points p and q whose distance we want to measure. These give us the x and y values of each point and form a right-angled triangle whose hypotenuse is the distance between p and q. The x values are denoted p1 and q1, and the y values p2 and q2. The formula is therefore the following:

d(p,q)^2 = (q1 - p1)^2 + (q2 - p2)^2
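
A small sketch of the two-dimensional case in R, using made-up coordinates chosen so that the triangle is the familiar 3-4-5 one:

```r
# Two made-up points in the plane: p = (p1, p2), q = (q1, q2)
p <- c(1, 2)
q <- c(4, 6)

horizontal <- q[1] - p[1]   # difference in x values: 3
vertical   <- q[2] - p[2]   # difference in y values: 4

d_squared <- horizontal^2 + vertical^2   # 9 + 16 = 25
d <- sqrt(d_squared)                     # length of the hypotenuse: 5
```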

Question 5

Q

This is fine for two points in two dimensions. However, what do we do when working with a space of n dimensions or variables?

Answer

A

Use the generalised formula:

d(p, q) = sqrt( (p1 - q1)^2 + (p2 - q2)^2 + … + (pn - qn)^2 )
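
One possible way to write the general formula as an R function. The name euclidean_distance is made up for this sketch, and the example vectors are illustrative values.

```r
# Euclidean distance between two numeric vectors of equal length n
# (euclidean_distance is a hypothetical helper name, not a built-in)
euclidean_distance <- function(p, q) {
  stopifnot(length(p) == length(q))
  sqrt(sum((p - q)^2))   # sum the squared differences, then take the root
}

# Three-dimensional example: sqrt(1 + 4 + 4) = 3
d3 <- euclidean_distance(c(0, 0, 0), c(1, 2, 2))
```

Because R subtracts and squares vectors element by element, the same function works for any number of dimensions.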

Question 6

Q

In a practical example, how would this work if we’re working with data such as flower characteristics:

sepal length, sepal width, petal length and petal width

Answer

A

Suppose the data for two flowers are as follows:
Petal width: 0.2; 1.4
Petal length: 1.4; 4.7
Sepal width: 3.5; 3.2
Sepal length: 5.1; 7

The first column is p and the second is q, which gives the following calculation:
d(p, q) = sqrt( (0.2 - 1.4)^2 + (1.4 - 4.7)^2 + (3.5 - 3.2)^2 + (5.1 - 7)^2 ) = sqrt(16.03) ≈ 4.00
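
The same calculation, done by hand in R as a quick check. The values are the ones on this card; the variable names are just for illustration.

```r
# Values from the card, in the order:
# petal width, petal length, sepal width, sepal length
p <- c(0.2, 1.4, 3.5, 5.1)   # first flower
q <- c(1.4, 4.7, 3.2, 7.0)   # second flower

squared_diffs <- (p - q)^2        # 1.44, 10.89, 0.09, 3.61
d <- sqrt(sum(squared_diffs))     # sqrt(16.03), roughly 4.00
```

The order inside each difference does not matter, because every difference is squared.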

Question 7

Q

How would you calculate the distance between two data vectors in R?

Answer

A

Use the dist() function.
dist() requires a matrix or data frame as input, so we use row bind, rbind(), to create one with p and q as its rows:

rbind(p, q)
dist(rbind(p, q))
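
A runnable version of the above, assuming p and q are the two flower vectors from the previous card:

```r
# Assumption: p and q hold the flower measurements from the previous card
p <- c(0.2, 1.4, 3.5, 5.1)
q <- c(1.4, 4.7, 3.2, 7.0)

pq <- rbind(p, q)   # 2 x 4 matrix: one row per flower
d <- dist(pq)       # Euclidean distance by default

# dist() returns a "dist" object; as.numeric() extracts the plain number
d_value <- as.numeric(d)
```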

Question 8

Q

How would you compute the distance between all the datapoints in R?

Answer

A

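Use dist(iris4D) to get the distances between every pair of rows (returned as a lower-triangular "dist" object; as.matrix() converts it to a full matrix)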
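
A minimal sketch, assuming iris4D is the four numeric columns of R's built-in iris data (the card does not show how iris4D was created):

```r
# Assumption: iris4D is the built-in iris data without the Species column
iris4D <- iris[, 1:4]

irisDist <- dist(iris4D)             # pairwise Euclidean distances between rows
distMatrix <- as.matrix(irisDist)    # full symmetric 150 x 150 matrix

distMatrix[1:3, 1:3]                 # distances among the first three flowers
```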

Question 9

Q

What are the two main types of clustering methods?

Answer

A

K-means clustering

Hierarchical clustering

Question 10

Q

What steps are involved in k-means clustering? (4)

Answer

A

1. Initialise the clusters
2. Assign each point to the cluster of the nearest centroid
3. Recompute the cluster centroids
4. Repeat steps 2 and 3 until the cluster assignment is stable
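
The steps can be sketched as a small hand-rolled function. This is an illustration only, not the built-in kmeans(); the name simple_kmeans and the max_iter safety limit are assumptions made for this sketch.

```r
# Hand-rolled k-means sketch (simple_kmeans is a hypothetical name)
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  n <- nrow(x)
  stopifnot(n > 1, k >= 1, k <= n)

  # Initialise: randomly assign each point to one of the k clusters
  # (shuffling repeated labels guarantees no cluster starts empty)
  cluster <- sample(rep_len(seq_len(k), n))
  centers <- matrix(NA_real_, nrow = k, ncol = ncol(x))

  for (iter in seq_len(max_iter)) {
    # Compute centroids: column means of the points in each cluster
    # (a cluster that has become empty keeps its previous centroid)
    for (j in seq_len(k)) {
      members <- cluster == j
      if (any(members)) centers[j, ] <- colMeans(x[members, , drop = FALSE])
    }

    # Assign each point to the nearest centroid (squared Euclidean distance)
    d2 <- vapply(seq_len(k),
                 function(j) rowSums(sweep(x, 2, centers[j, ])^2),
                 numeric(n))
    new_cluster <- max.col(-d2, ties.method = "first")

    # Stop once the assignment no longer changes
    if (all(new_cluster == cluster)) break
    cluster <- new_cluster
  }
  list(cluster = cluster, centers = centers, iterations = iter)
}

set.seed(1)
fit <- simple_kmeans(iris[, 1:4], k = 3)
table(fit$cluster)
```

Because the initial assignment is random, different seeds can give different final clusters.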

Question 11

Q

What is involved in initialising k-means clustering?

Answer

A

Choose a value for k
Randomly assign each point to one of the k clusters
Compute the cluster centres (means), also known as centroids

Question 12

Q

What is involved in calculating the centroids?

Answer

A

For each cluster, you calculate the mean of every column (variable) over the points assigned to that cluster
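
For example, with a made-up data set of four points and a made-up cluster label for each point, aggregate() gives the column means per cluster:

```r
# Made-up data: four points, two variables, and a cluster label per point
x <- data.frame(length = c(1, 3, 10, 12),
                width  = c(2, 4, 20, 22))
cluster <- c(1, 1, 2, 2)

# Mean of every column, computed separately within each cluster
centroids <- aggregate(x, by = list(cluster = cluster), FUN = mean)
centroids
# cluster 1: length 2, width 3; cluster 2: length 11, width 21
```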

Question 13

Q

How do we calculate the k-means in R?

Answer

A

Call the kmeans() function on the data and store the result, here in an object named irisKmeansFit
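
A minimal sketch of what such a call might look like. The card does not show the arguments, so the data (iris4D as the four numeric iris columns), the choice of 3 clusters, the seed and nstart = 10 are all assumptions.

```r
# Assumption: iris4D is the built-in iris data without the Species column
iris4D <- iris[, 1:4]

set.seed(42)   # k-means starts from a random initialisation
# Assumed settings: 3 clusters, best result of 10 random starts
irisKmeansFit <- kmeans(iris4D, centers = 3, nstart = 10)

irisKmeansFit  # printing the fit shows the summary described on the next card
```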

Question 14

Q

Describe how to interpret the output of the k-means fit

Answer

A

The cluster means give the centroid of each cluster, i.e. the mean of each variable within that cluster

The clustering vector gives the cluster number for each data point

The within-cluster sum of squares gives the squared distance to the centroid, summed over all the points in each cluster. Beneath this we can see the between-cluster sum of squares divided by the total sum of squares, which gives the percentage of explained variance (like an ANOVA)
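
Each part of the printed output can also be pulled out of the fitted object. A sketch, assuming a 3-cluster fit on the four numeric iris columns (the data, seed and settings are assumptions):

```r
# Assumed fit: 3 clusters on the numeric columns of the built-in iris data
set.seed(42)
fit <- kmeans(iris[, 1:4], centers = 3, nstart = 10)

fit$centers          # cluster means: one row per cluster, one column per variable
head(fit$cluster)    # clustering vector: cluster number of each data point
fit$withinss         # within-cluster sum of squares, one value per cluster

# Share of the total sum of squares explained by the clustering
explained <- fit$betweenss / fit$totss
round(100 * explained, 1)
```

The total sum of squares splits into the between-cluster part and the sum of the within-cluster parts, which is why the ratio reads like explained variance.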

(14 cards)