Chapter 4 Code Flashcards

1
Q

What do you use to inspect a dataset?

A

head(): shows the first rows of the data
skim(): per-variable summary statistics (from the skimr package)
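A minimal sketch on R's built-in iris data (illustrative dataset; skim() needs the skimr package installed, so a base-R alternative is shown):

```r
head(iris)      # first six rows of the dataset
summary(iris)   # base-R per-variable summary
# library(skimr); skim(iris)   # richer summary: mean, sd, quantiles, histogram
```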

2
Q

How do you select specific columns?

e.g. to drop the target variable

A

e.g. df <- data[, 1:4]  # keep columns 1 to 4, dropping the target variable

The target variable will be used to interpret clustering results

3
Q

What do we need to do before performing clustering?

A

Ensure that all variables are on the same scale.

Perform data normalisation (centring and scaling), i.e. scaling the data frame.

4
Q

How do you scale the dataframe?

A

scale(data)

Then investigate using skim(): this confirms the data have been standardised to approximately N(0, 1). The mean column produced by skim() shows values that are essentially zero, and the sd column shows values equal to 1.
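A quick check of this on the numeric columns of iris (illustrative data; any numeric data frame works the same way):

```r
df_scaled <- scale(iris[, 1:4])   # centre and scale each numeric column; returns a matrix
round(colMeans(df_scaled), 10)    # column means are essentially 0
apply(df_scaled, 2, sd)           # column standard deviations are exactly 1
```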

5
Q

How do you compute the distance matrix for a dataframe?

A

dist(data, method = "euclidean")

  • method: "euclidean", "manhattan", "minkowski", etc.

The printed output looks odd: dist() stores only the lower triangle of the (symmetric) distance matrix, so the rest appears blank.
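A tiny worked example (made-up points) showing what dist() returns:

```r
# Two points: (0, 0) and (3, 4); their Euclidean distance is 5 (3-4-5 triangle)
m <- matrix(c(0, 0,
              3, 4), nrow = 2, byrow = TRUE)
d <- dist(m, method = "euclidean")
d             # prints only the lower triangle: 5
as.matrix(d)  # full symmetric matrix, zeros on the diagonal
```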

6
Q

How do you carry out hierarchical clustering?

A

Using the hclust() function

hc_ward <- hclust(d = dist_matrix, method = "ward.D2")

  • method: "single", "complete", "average", "centroid", "ward.D2"
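Putting the previous cards together, a runnable sketch on the scaled iris data (illustrative choice of dataset):

```r
dist_matrix <- dist(scale(iris[, 1:4]), method = "euclidean")
hc_ward <- hclust(d = dist_matrix, method = "ward.D2")
class(hc_ward)        # an "hclust" object
head(hc_ward$height)  # merge heights, in increasing order
```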
7
Q

How do you create a dendrogram?

A

fviz_dend() (from the factoextra package)

e.g. fviz_dend(hc_ward, cex = 0.5)

cex = size of the labels

8
Q

What do you need to do when you produce a dendrogram in the exam?

A

Describe it

e.g.
The large gap in height between the top branches and others suggests the presence of two major clusters.
Sub-branches indicate further divisions within these clusters.

9
Q

What function determines the best number of clusters to use if we don’t know which to choose?

A

NbClust() and fviz_nbclust()

res_hc_automatic <- data %>%
  NbClust(distance = "euclidean",
          min.nc = 2,
          max.nc = 10,
          method = "ward.D2",
          index = "all")

fviz_nbclust(res_hc_automatic, ggtheme = theme_minimal())

NB: the distance measure and the clustering method must match those used earlier.
NB: min.nc = 2, since one cluster doesn't really cluster anything.
NB: NbClust may print errors or warnings for some indices; these can be ignored.

10
Q

What does the NbClust package do?

A

It provides 30 indices for determining the number of clusters and proposes to the user the best clustering scheme from the different results obtained by varying all combinations of number of clusters, distance measures, and clustering methods.

11
Q

What should you do when you obtain the results from NbClust?

A

Compare the most strongly supported suggestions; use fviz_dend() again to visualise each cut.

fviz_dend(res_hc,
          k = 2,                    # Cut in two groups
          cex = 0.5,                # Label size
          palette = "jco",
          color_labels_by_k = TRUE, # Colour labels by groups
          rect = TRUE,              # Add rectangle around groups
          show_labels = FALSE
)

12
Q

How do you cut the tree into X groups to extract the cluster membership of each observation?

A

Using cutree()

cluster_id_2 <- cutree(res_hc, k = 2)
cluster_id_2

This returns a vector of group numbers, one per observation of the dataset.
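A sketch on the hierarchical clustering of the scaled iris data (illustrative dataset):

```r
hc <- hclust(dist(scale(iris[, 1:4])), method = "ward.D2")
cluster_id_2 <- cutree(hc, k = 2)
length(cluster_id_2)  # one cluster number per observation (150 for iris)
table(cluster_id_2)   # how many observations fall in each group
```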

13
Q

How do you append the cluster IDs obtained from cutree() to the original dataset and add back the original factor column?

A

iris_cluster_df <- iris_cluster_df %>%
  as.data.frame() %>%
  mutate(
    Species = iris$Species,
    cluster_id_2 = cluster_id_2
  )

iris_cluster_df

14
Q

How do you inspect the quality of the clustering (once the generated data has been incorporated with the original data)?

A

Inspect the quality of clustering for 2 clusters
table(iris_cluster_df$cluster_id_2, iris_cluster_df$Species)

15
Q

Interpret the results

A

K = 2
Cluster 1: Contains almost all setosa samples (49 out of 50), indicating that the setosa species is highly distinct from the other species in the feature space.
Cluster 2: Contains all of the versicolor and virginica samples (50 each), plus the remaining setosa, mixed into the same cluster. This suggests that these two species are less separable based on the features used.

K = 3
Cluster 1: Similar to the 2-cluster case, this cluster perfectly captures the setosa samples, confirming its clear separability.
Cluster 2: Contains just over half of the versicolor samples (27 out of 50) and a small number of virginica samples (2 out of 50). This indicates partial separation but some confusion between the two species.
Cluster 3: Contains a mix of versicolor (23 samples) and the majority of virginica (48 samples). This suggests that virginica dominates this cluster but still overlaps with versicolor.

16
Q

What do you need to do before clustering?

A

Ensure all variables are on the same scale.

scale()

17
Q

How do you find the optimal number of clusters using the elbow method?

A

The elbow method (Total Within Sum of Squares, method = "wss")

fviz_nbclust(df, kmeans, method = "wss")

18
Q

How do you find the optimal number of clusters using the gap statistic?

A

fviz_nbclust(df, kmeans, method = "gap_stat")

19
Q

What does the gap statistic measure?

A

The gap statistic measures how much better the clustering result is compared to random clustering. A higher gap statistic indicates better-defined clusters.
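Under the hood, the gap statistic can be computed directly with cluster::clusGap() (the cluster package ships with R). A sketch on the scaled iris data; B is kept small here for speed, and set.seed() is needed because the reference datasets are random:

```r
library(cluster)
set.seed(123)
gap <- clusGap(scale(iris[, 1:4]), FUN = kmeans, nstart = 25,
               K.max = 5, B = 20)
gap$Tab[, "gap"]  # gap value for each k; larger = better defined than random
```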

20
Q

How do you compare the quality of clustering of two suggested numbers of clusters (based on various algorithms)?

A

res_kmeans_2 <- kmeans(df, centers = 2, nstart = 25)
res_kmeans_3 <- kmeans(df, centers = 3, nstart = 25)

Look at the centres of the clusters:
res_kmeans_2$centers
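A runnable sketch (scaled iris data is illustrative; set.seed() because kmeans uses random starts). Besides the centres, tot.withinss gives a single compactness figure to compare:

```r
set.seed(123)
df <- scale(iris[, 1:4])
res_kmeans_2 <- kmeans(df, centers = 2, nstart = 25)
res_kmeans_3 <- kmeans(df, centers = 3, nstart = 25)

res_kmeans_2$centers       # one row per cluster centre
res_kmeans_2$tot.withinss  # total within-cluster sum of squares
res_kmeans_3$tot.withinss  # lower = tighter clusters (always drops as k grows)
```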

21
Q

How do you append the cluster IDs (and original categorical variable) to the dataset?

Why do you want to do this?

A

df <- df %>%
  as.data.frame() %>%
  mutate(
    Species = data$Species,
    cluster_id_2 = res_kmeans_2$cluster,
    cluster_id_3 = res_kmeans_3$cluster
  )

df

Allows for performing meaningful interpretation of discovered subgroups

22
Q

How do you inspect the quality of clustering from this appended dataset?

A

Do it individually for each cluster_id added

table(df$cluster_id_2, df$Species)