Chapter 4 Code Flashcards

Question 1

Q

What do you use to inspect a dataset?

Answer

A

head()
skim()

Question 2

Q

How do you select specific columns?

Eg to drop the target variable

Answer

A

eg df <- data[, 1:4]

The target variable will be used to interpret clustering results
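A minimal sketch of the column selection, assuming the built-in iris data (used on later cards), where column 5, Species, is the target:

```r
# Keep the four numeric measurements; drop the target (column 5, Species)
data <- iris
df <- data[, 1:4]

# Equivalent selection by name rather than by position
df_by_name <- data[, setdiff(names(data), "Species")]

names(df)
```

Selecting by name is a little safer if the column order ever changes.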

Question 3

Q

What do we need to do before performing clustering?

Answer

A

Ensure that all variables are on the same scale.

Perform data normalisation (centring and scaling) - ie scaling the data frame.

Question 4

Q

How do you scale the dataframe?

Answer

A

scale(data)

Then investigate using skim(): scale() has standardised each column to mean 0 and standard deviation 1 (it changes location and scale, not the shape of the distribution, so the data are not made normal). The mean column produced by skim() shows minuscule values and the SD column values equal 1.
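The same check can be sketched in base R without skim(), assuming the four numeric iris columns as the data:

```r
df <- iris[, 1:4]
df_scaled <- scale(df)          # returns a matrix, centred and scaled per column

round(colMeans(df_scaled), 10)  # column means are zero up to rounding error
apply(df_scaled, 2, sd)         # column standard deviations are 1
```

Note that scale() returns a matrix; wrap it in as.data.frame() if a data frame is needed.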

Question 5

Q

How do you compute the distance matrix for a dataframe?

Answer

A

dist(data, method = "euclidean")

method: "euclidean", "manhattan", "minkowski", etc.

The printed output looks odd: the matrix is symmetric, so only the lower triangle is shown and the rest is blank.
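To see why the printout has blanks, a short sketch (assuming the scaled iris measurements as input) converts the dist object to a full matrix:

```r
df_scaled <- scale(iris[, 1:4])
dist_matrix <- dist(df_scaled, method = "euclidean")

# A dist object stores only the lower triangle; as.matrix() fills in the rest
full <- as.matrix(dist_matrix)
full[1:3, 1:3]

length(dist_matrix)   # n * (n - 1) / 2 pairwise distances
```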

Question 6

Q

How do you carry out hierarchical clustering?

Answer

A

Using the hclust() function

hc_ward <- hclust(d = dist_matrix, method = "ward.D2")

method: "ward.D2", "single", "complete", "average", "centroid", etc.
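Putting the previous cards together, a minimal end-to-end sketch (assuming the iris measurements; dist_matrix and hc_ward follow the names used above):

```r
df_scaled <- scale(iris[, 1:4])                        # same scale for all variables
dist_matrix <- dist(df_scaled, method = "euclidean")   # pairwise distances
hc_ward <- hclust(d = dist_matrix, method = "ward.D2") # Ward linkage

hc_ward                   # prints the call, linkage method, distance and object count
tail(hc_ward$height, 3)   # heights of the last three merges
```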

Question 7

Q

How do you create a dendrogram?

Answer

A

fviz_dend()

eg fviz_dend(hc_ward, cex = 0.5)

cex = size of the labels

Question 8

Q

What do you need to do when you produce a dendrogram in the exam?

Answer

A

Describe it

eg
The large gap in height between the top branches and others suggests the presence of two major clusters.
Sub-branches indicate further divisions within these clusters.
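The "large gap in height" can also be checked numerically. This is only a sketch, assuming Ward clustering of the scaled iris data; the variable names are illustrative:

```r
hc <- hclust(dist(scale(iris[, 1:4])), method = "ward.D2")

heights <- rev(hc$height)   # merge heights, tallest first
gaps <- -diff(heights)      # drop in height between successive merges

# gaps[k - 1] is the vertical room available for cutting the tree into k clusters
data.frame(k = 2:6, gap = gaps[1:5])
```

A large gap for a given k supports describing the tree as having k major clusters.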

Question 9

Q

What function determines the best number of clusters to use if we don’t know which to choose?

Answer

A

NbClust() and fviz_nbclust()

res_hc_automatic <- data %>%
  NbClust(distance = "euclidean",
          min.nc = 2,
          max.nc = 10,
          method = "ward.D2",
          index = "all")

fviz_nbclust(res_hc_automatic, ggtheme = theme_minimal())

NB: the distance function and the method specified must be the same as before
NB: min.nc = 2, because a single cluster doesn’t really cluster anything

IGNORE ANY ERRORS

Question 10

Q

What does the NbClust package do?

Answer

A

It provides 30 indices for determining the number of clusters and proposes to the user the best clustering scheme from the different results obtained by varying all combinations of number of clusters, distance measures, and clustering methods.

Question 11

Q

What should you do when you obtain the results from NbClust?

Answer

A

Compare the highest suggestions - use fviz_dend() again

fviz_dend(res_hc,
  k = 2,                    # Cut in two groups
  cex = 0.5,                # Label size
  palette = "jco",
  color_labels_by_k = TRUE, # Colour labels by groups
  rect = TRUE,              # Add rectangle around groups
  show_labels = FALSE
)

Question 12

Q

How do you cut the tree into X groups to extract the cluster membership of each observation?

Answer

A

Using cutree()

cluster_id_2 <- cutree(res_hc, k = 2)
cluster_id_2

Get a list of group numbers (corresponding to elements of the dataset)
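A short sketch of what cutree() returns, assuming res_hc is a Ward clustering of the scaled iris data:

```r
res_hc <- hclust(dist(scale(iris[, 1:4])), method = "ward.D2")

cluster_id_2 <- cutree(res_hc, k = 2)   # one integer label per observation
head(cluster_id_2)
table(cluster_id_2)                     # cluster sizes
```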

Question 13

Q

How do you append the cluster IDs obtained from cutree to the original dataset and add back original factor column?

Answer

A

iris_cluster_df <- iris_cluster_df %>%
  as.data.frame() %>%
  mutate(
    Species = iris$Species,
    cluster_id_2 = cluster_id_2
  )

iris_cluster_df

Question 14

Q

How do you inspect the quality of the clustering (once the generated data has been incorporated with the original data)?

Answer

A

Inspect the quality of clustering for 2 clusters
table(iris_cluster_df$cluster_id_2, iris_cluster_df$Species)

Question 15

Q

Interpret the results

Answer

A

K = 2
Cluster 1: Contains almost all setosa samples (49 out of 50), indicating that the setosa species is highly distinct from the other species in the feature space.
Cluster 2: Contains all versicolor and virginica samples (50 each) mixed into the same cluster. This suggests that these two species are less separable based on the features used.

K = 3
Cluster 1: Similar to the 2-cluster case, this cluster perfectly captures the setosa samples, confirming its clear separability.
Cluster 2: Contains most of the versicolor samples (27 out of 50) and a small number of virginica samples (2 out of 50). This indicates partial separation but some confusion between the two species.
Cluster 3: Contains a mix of versicolor (23 samples) and the majority of virginica (48 samples). This suggests that virginica dominates this cluster but still overlaps with versicolor.

Question 16

Q

What do you need to do before clustering?

Answer

A

Ensure all variables are on the same scale.

scale()

Question 17

Q

How do you find the optimal number of clusters using the elbow method?

Answer

A

The elbow method (total within-cluster sum of squares, method = "wss")

fviz_nbclust(df, kmeans, method = "wss")
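To show what the elbow plot is built from, here is a sketch that computes the total within-cluster sum of squares by hand. It assumes the scaled iris data and an arbitrary seed:

```r
set.seed(123)              # k-means starts are random; the seed is arbitrary
df <- scale(iris[, 1:4])

# Total within-cluster sum of squares for k = 1..10
wss <- sapply(1:10, function(k) {
  kmeans(df, centers = k, nstart = 25)$tot.withinss
})

data.frame(k = 1:10, wss = wss)
```

Plotting wss against k (for example plot(1:10, wss, type = "b")) gives the elbow curve; the "elbow" is where adding clusters stops reducing wss by much.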

Question 18

Q

How do you find the optimal number of clusters using the gap statistic?

Answer

A

fviz_nbclust(df, kmeans, method = "gap_stat")

Question 19

Q

What does the gap statistic measure?

Answer

A

The gap statistic compares the total within-cluster variation for each number of clusters with the value expected under a reference distribution with no cluster structure (i.e. random data). A higher gap statistic indicates better-defined clusters.
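The gap values can also be computed directly. This sketch uses clusGap() from the cluster package on the scaled iris data; K.max, B and the seed are arbitrary choices for illustration:

```r
library(cluster)           # recommended package shipped with standard R installs

set.seed(123)
df <- scale(iris[, 1:4])

# B reference data sets are simulated for each k = 1..K.max
gap <- clusGap(df, FUNcluster = kmeans, K.max = 6, B = 50,
               verbose = FALSE, nstart = 25)

gap$Tab[, "gap"]           # one gap value per k
```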

Question 20

Q

How do you compare the quality of clustering of two suggested numbers of clusters (based on various algorithms)?

Answer

A

res_kmeans_2 <- kmeans(df, centers = 2, nstart = 25)
res_kmeans_3 <- kmeans(df, centers = 3, nstart = 25)

Look at the centres of the clusters:
res_kmeans_2$centers
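A runnable version of the comparison, assuming df is the scaled iris data; the betweenss / totss ratio is an extra summary added here for illustration, not something the card requires:

```r
set.seed(123)
df <- scale(iris[, 1:4])

res_kmeans_2 <- kmeans(df, centers = 2, nstart = 25)
res_kmeans_3 <- kmeans(df, centers = 3, nstart = 25)

res_kmeans_2$centers   # one row per cluster, one column per variable
res_kmeans_3$centers

# Share of total variation explained by between-cluster separation
c(k2 = res_kmeans_2$betweenss / res_kmeans_2$totss,
  k3 = res_kmeans_3$betweenss / res_kmeans_3$totss)
```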

Question 21

Q

How do you append the cluster IDs (and original categorical variable) to the dataset?

Why do you want to do this?

Answer

A

df <- df %>%
as.data.frame() %>%
mutate(
Species = data$Species,
cluster_id_2 = res_kmeans_2$cluster,
cluster_id_3 = res_kmeans_3$cluster
)

df

Allows for performing meaningful interpretation of discovered subgroups

Question 22

Q

How do you inspect the quality of clustering from this appended dataset?

Answer

A

Do it individually for each cluster_id added

table(df$cluster_id_2, df$Species)
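
A self-contained sketch of the same cross-tabulation for three clusters, assuming k-means on the scaled iris data with an arbitrary seed:

```r
set.seed(123)
df <- scale(iris[, 1:4])
res_kmeans_3 <- kmeans(df, centers = 3, nstart = 25)

# Rows are clusters, columns are the known species
tab <- table(cluster = res_kmeans_3$cluster, species = iris$Species)
addmargins(tab)   # adds row and column totals
```

Cluster labels are arbitrary, so read the table by which species dominates each row rather than by the cluster number.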

(22 cards)