Lecture 6: Clustering Flashcards

1
Q

What is clustering?

A

Clustering groups similar observations together, where the similarities within the groups are greater than those between the groups.

2
Q

Name five types of dissimilarity measures

A

Euclidean distance: Most common, the one we’ll use

Cosine similarity: Angle between two vectors; can be seen as a correlation measure

Hamming distance: Useful for binary codes; counts the number of positions that differ

Manhattan (city block) distance: Can’t take diagonal steps, only vertical and horizontal

Minkowski distance: Generalisation of the Manhattan and Euclidean distance measures
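These measures are simple enough to sketch directly. A minimal Python illustration (using the two flower vectors from a later card for the numeric measures, and two made-up binary codes for Hamming):

```python
import math

p = [0.2, 1.4, 3.5, 5.1]
q = [1.4, 4.7, 3.2, 7.0]

# Euclidean: straight-line distance between the two vectors
euclidean = math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Manhattan (city block): only horizontal/vertical steps, so absolute differences
manhattan = sum(abs(a - b) for a, b in zip(p, q))

# Minkowski of order m: generalises both (m = 1 Manhattan, m = 2 Euclidean)
m = 3
minkowski = sum(abs(a - b) ** m for a, b in zip(p, q)) ** (1 / m)

# Cosine similarity: cosine of the angle between the vectors
cosine = sum(a * b for a, b in zip(p, q)) / (
    math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
)

# Hamming: number of positions at which two binary codes differ
hamming = sum(a != b for a, b in zip([1, 0, 1, 1], [1, 1, 1, 0]))
```

Note that the Minkowski distance shrinks as the order m grows, so here minkowski < euclidean < manhattan.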

3
Q

What classic mathematical formula is utilised to calculate Euclidean distance?

A

Pythagoras’ theorem

4
Q

How is Pythagoras’ theorem used in Euclidean distance?

A

Vertical and horizontal lines can be drawn through the two points p and q whose distance we want to measure. These give us the x and y values of each point and form a right-angled triangle. The x values are denoted p1 and q1, and the y values p2 and q2. The formula is therefore the following:

d(p,q)^2 = (q1 - p1)^2 + (q2 - p2)^2

5
Q

This is fine for two points we want to measure the distance between; however, what do we do when working with a space of n dimensions or variables?

A

Use the generalised formula of:

d(p, q) = sqrt( (p1 - q1)^2 + (p2 - q2)^2 + … + (pn - qn)^2 )

6
Q

In a practical example, how would this work if we’re working with data such as flower characteristics:

sepal length, sepal width, petal length and petal width

A
Say, for two flowers, the data is as follows:
petal width: 0.2; 1.4
petal length: 1.4; 4.7
sepal width: 3.5; 3.2
sepal length: 5.1; 7

The first column constitutes p and the second q, giving the following calculation:
d(p, q) = sqrt( (1.4 - 0.2)^2 + (4.7 - 1.4)^2 + (3.2 - 3.5)^2 + (7 - 5.1)^2 )
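The calculation can be checked numerically; a quick Python sketch using the same four values per flower:

```python
import math

p = [0.2, 1.4, 3.5, 5.1]  # flower 1: petal width, petal length, sepal width, sepal length
q = [1.4, 4.7, 3.2, 7.0]  # flower 2

d = math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))
print(round(d, 2))  # → 4.0
```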

7
Q

How would you calculate the distance between two data vectors in R?

A

Using the dist() function. dist() requires a matrix or data frame as input, so we use row bind, rbind(), to create one:

p <- c(0.2, 1.4, 3.5, 5.1)
q <- c(1.4, 4.7, 3.2, 7)
dist(rbind(p, q))  # Euclidean distance by default

8
Q

How would you compute the distance between all the datapoints in R?

A

dist(iris4D) returns the distances between all pairs of data points (printed as a lower-triangular distance matrix)

9
Q

What are the two main types of clustering methods?

A

K-means clustering

Hierarchical clustering

10
Q

What steps are involved in k-means clustering? (4)

A
  1. Initialize k-means clustering
  2. Assign each point to the cluster of the nearest centroid
  3. Recompute new cluster centroids
  4. Repeat 2 & 3 until cluster assignment is stable
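The four steps above can be sketched from scratch. A minimal Python illustration, not the R kmeans() function the course uses; the handling of an empty cluster (reseeding it with a random point) is my own assumption:

```python
import random

def kmeans(points, k, seed=0):
    rng = random.Random(seed)
    # 1. Initialise: randomly assign each point to one of the k clusters
    labels = [rng.randrange(k) for _ in points]
    while True:
        # 3. (Re)compute centroids: the column means of each cluster's points
        centroids = []
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if not members:  # empty cluster: reseed with a random point (assumption)
                members = [rng.choice(points)]
            centroids.append([sum(col) / len(members) for col in zip(*members)])
        # 2. Assign each point to the cluster of the nearest centroid
        new_labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # 4. Repeat 2 & 3 until the cluster assignment is stable
        if new_labels == labels:
            return labels, centroids
        labels = new_labels
```

For example, kmeans([[0, 0], [0, 1], [10, 10], [10, 11]], 2) separates the two obvious groups.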
11
Q

What is involved in initialising k-means clustering?

A
  • Choose a value of k
  • Randomly assign each point to one of the k clusters
  • Compute cluster centres (means), aka centroids
12
Q

What is involved in calculating the centroids?

A

You calculate the mean of each column over the points assigned to cluster n
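As a sketch in Python, with made-up rows for a hypothetical cluster n:

```python
# rows of a hypothetical cluster n, one observation per row
cluster_n = [
    [0.2, 1.4, 3.5, 5.1],
    [0.4, 1.6, 3.1, 4.9],
    [0.3, 1.5, 3.3, 5.0],
]

# the centroid is the mean of each column over the cluster's rows
centroid = [sum(col) / len(cluster_n) for col in zip(*cluster_n)]
print([round(x, 2) for x in centroid])  # → [0.3, 1.5, 3.3, 5.0]
```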

13
Q

How do we calculate the k-means in R?

A

Using the kmeans() function, e.g.:

irisKmeansFit <- kmeans(iris4D, centers = k)

14
Q

Describe how to interpret the output of the cluster function

A

The cluster means give the centroid of each variable for each cluster.

The clustering vector gives the cluster number for each data point.

The within-cluster sum of squares gives the squared distances summed over all the points in each cluster. Beneath this we can see the between-cluster sum of squares divided by the total sum of squares, giving the percentage of explained variance (like an ANOVA).
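The explained-variance ratio can be reproduced by hand. A Python sketch with made-up points and an assumed cluster assignment (between-SS computed as total-SS minus within-SS):

```python
points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
labels = [0, 0, 1, 1]  # assumed cluster assignment

def col_means(rows):
    return [sum(c) / len(rows) for c in zip(*rows)]

def ss(rows, centre):
    # sum of squared distances from each row to `centre`
    return sum(sum((a - b) ** 2 for a, b in zip(r, centre)) for r in rows)

total_ss = ss(points, col_means(points))
within_ss = sum(
    ss([p for p, l in zip(points, labels) if l == c],
       col_means([p for p, l in zip(points, labels) if l == c]))
    for c in set(labels)
)
between_ss = total_ss - within_ss
print(round(100 * between_ss / total_ss, 1))  # → 99.5 (% of explained variance)
```

A high percentage means the clusters are tight relative to the overall spread of the data.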
