Chapter 4 Code Flashcards
What do you use to inspect a dataset?
head()
skim()
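A minimal sketch of both calls, assuming the built-in iris data (used later in these cards) and assuming skim() comes from the skimr package. The skimr call is guarded so the sketch still runs if the package is not installed.

```r
# Inspect the first rows of the built-in iris data
first_rows <- head(iris)   # first 6 rows by default
print(first_rows)

# Base R overview of column types, as a fallback to skim()
str(iris)

# skim() is assumed to come from the skimr package; run it only if installed
if (requireNamespace("skimr", quietly = TRUE)) {
  print(skimr::skim(iris))
}
```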
How do you select specific columns?
Eg to drop the target variable
eg df <- data[, 1:4]
The target variable will be used to interpret clustering results
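As an illustration, assuming the built-in iris data where the target (Species) is the fifth column, the selection can be written by position or by name. The object names below are hypothetical.

```r
# Keep the four numeric measurements and drop the target by position
features <- iris[, 1:4]

# Equivalent selection by name, which does not depend on column order
features_by_name <- iris[, setdiff(names(iris), "Species")]

# Keep the target separately so it can be used to interpret the clusters later
target <- iris$Species

print(names(features))
```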
What do we need to do before performing clustering?
Ensure that all variables are on the same scale.
Perform data normalisation (centring and scaling) - ie scaling the data frame.
How do you scale the dataframe?
scale(data)
Then inspect the result using skim(): the data have been standardised so that each variable has mean 0 and standard deviation 1 (this rescales the data but does not make them normally distributed). The mean column produced by skim() shows values very close to zero and the SD column shows values equal to 1.
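The same check can be sketched in base R, assuming the four numeric iris columns as the data. Note that scale() returns a matrix rather than a data frame.

```r
features <- iris[, 1:4]

# Centre each column to mean 0 and scale it to standard deviation 1
scaled <- scale(features)

col_means <- colMeans(scaled)      # numerically zero (tiny floating-point values)
col_sds <- apply(scaled, 2, sd)    # equal to 1

print(signif(col_means, 3))
print(col_sds)
```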
How do you compute the distance matrix for a dataframe?
dist(data, method = "euclidean")
- method: "euclidean", "manhattan", "minkowski", etc.
The printed output will look odd: because the distance matrix is symmetric, only the lower triangle is shown and the rest is left blank.
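A short sketch of why the output looks the way it does, assuming the scaled iris features as input. Converting the dist object with as.matrix() shows the full symmetric matrix.

```r
scaled <- scale(iris[, 1:4])
dist_matrix <- dist(scaled, method = "euclidean")

# A dist object stores only the lower triangle: n * (n - 1) / 2 values
n_pairs <- length(dist_matrix)
print(n_pairs)

# Convert to a full matrix to see the symmetry and the zero diagonal
full <- as.matrix(dist_matrix)
print(round(full[1:4, 1:4], 2))
```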
How do you carry out hierarchical clustering?
Using the hclust() function
hc_ward <- hclust(d = dist_matrix, method = "ward.D2")
- method: "ward.D", "ward.D2", "single", "complete", "average", "centroid", etc.
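A self-contained sketch of the call, assuming Euclidean distances on the scaled iris features. The second model is only there to show that the linkage method is swapped by changing one argument.

```r
scaled <- scale(iris[, 1:4])
dist_matrix <- dist(scaled, method = "euclidean")

# Ward linkage, as in the flashcard
hc_ward <- hclust(d = dist_matrix, method = "ward.D2")

# Same distances, different linkage
hc_complete <- hclust(d = dist_matrix, method = "complete")

print(hc_ward)

# n observations are joined in n - 1 merge steps
n_merges <- nrow(hc_ward$merge)
print(n_merges)
```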
How do you create a dendrogram?
fviz_dend()
eg fviz_dend(hc_ward, cex = 0.5)
cex = size of the labels
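A minimal sketch of drawing the dendrogram, assuming fviz_dend() comes from the factoextra package. If the package is not installed, the sketch falls back to the base R plot method for hclust objects, which is a different function with a plainer output.

```r
hc_ward <- hclust(dist(scale(iris[, 1:4]), method = "euclidean"),
                  method = "ward.D2")

if (requireNamespace("factoextra", quietly = TRUE)) {
  # factoextra version, as in the flashcard (cex controls label size)
  p <- factoextra::fviz_dend(hc_ward, cex = 0.5)
  print(p)
} else {
  # Base R fallback: hang = -1 aligns all labels at the bottom
  plot(hc_ward, cex = 0.5, hang = -1)
}
```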
What do you need to do when you produce a dendrogram in the exam?
Describe it
eg
The large gap in height between the top branches and others suggests the presence of two major clusters.
Sub-branches indicate further divisions within these clusters.
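A description like the one above can be backed up with numbers by looking at the merge heights stored in the hclust object. This is a sketch assuming the scaled iris features with Ward linkage; the object names are hypothetical.

```r
hc_ward <- hclust(dist(scale(iris[, 1:4]), method = "euclidean"),
                  method = "ward.D2")

# Heights of the last five merges, listed from the top of the tree downwards
top_heights <- rev(tail(hc_ward$height, 5))
print(round(top_heights, 2))

# Drop in height between consecutive merges;
# a first value much larger than the rest supports a two-cluster reading
gaps <- -diff(top_heights)
print(round(gaps, 2))
```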
What function determines the best number of clusters to use if we don’t know which to choose?
NbClust() and fviz_nbclust()
res_hc_automatic <- data %>%
  NbClust(distance = "euclidean",
          min.nc = 2,
          max.nc = 10,
          method = "ward.D2",
          index = "all")
fviz_nbclust(res_hc_automatic, ggtheme = theme_minimal())
NB: the distance measure and the linkage method specified must be the same as those used before
NB: min.nc = 2, because a single cluster doesn't really cluster anything
IGNORE ANY ERRORS
What does the NbClust package do?
It provides 30 indices for determining the number of clusters and proposes to the user the best clustering scheme from the different results obtained by varying all combinations of number of clusters, distance measures, and clustering methods.
What should you do when you obtain the results from NbClust?
Compare the numbers of clusters proposed by the most indices - use fviz_dend() again to visualise each one
fviz_dend(res_hc,
          k = 2,                     # Cut in two groups
          cex = 0.5,                 # Label size
          palette = "jco",
          color_labels_by_k = TRUE,  # Colour labels by groups
          rect = TRUE,               # Add rectangle around groups
          show_labels = FALSE
)
How do you cut the tree into X groups to extract the cluster membership of each observation?
Using cutree()
cluster_id_2 <- cutree(res_hc, k = 2)
cluster_id_2
Returns a vector of cluster numbers, one per observation, in the same order as the rows of the dataset
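A self-contained sketch of cutree(), assuming a Ward tree built on the scaled iris features. It also shows that k can be a vector, in which case a matrix with one column per value of k is returned.

```r
hc_ward <- hclust(dist(scale(iris[, 1:4]), method = "euclidean"),
                  method = "ward.D2")

# One cluster label per observation
cluster_id_2 <- cutree(hc_ward, k = 2)
print(head(cluster_id_2))

# Cluster sizes
print(table(cluster_id_2))

# Several cuts at once: one column per value of k
ids <- cutree(hc_ward, k = 2:3)
print(head(ids))
```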
How do you append the cluster IDs obtained from cutree to the original dataset and add back original factor column?
iris_cluster_df <- iris_cluster_df %>%
  as.data.frame() %>%
  mutate(
    Species = iris$Species,
    cluster_id_2 = cluster_id_2
  )
iris_cluster_df
How do you inspect the quality of the clustering (once the generated data has been incorporated with the original data)?
Inspect the quality of clustering for 2 clusters
table(iris_cluster_df$cluster_id_2, iris_cluster_df$Species)
Interpret the results
K = 2
Cluster 1: Contains almost all setosa samples (49 out of 50), indicating that the setosa species is highly distinct from the other species in the feature space.
Cluster 2: Contains all versicolor and virginica samples (50 each) mixed into the same cluster. This suggests that these two species are less separable based on the features used.
K = 3
Cluster 1: As in the 2-cluster case, this cluster captures the setosa samples, confirming their clear separability.
Cluster 2: Contains just over half of the versicolor samples (27 out of 50) and a small number of virginica samples (2 out of 50). This indicates partial separation but some confusion between the two species.
Cluster 3: Contains a mix of versicolor (23 samples) and the majority of virginica (48 samples). This suggests that virginica dominates this cluster but still overlaps with versicolor.
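The cross-tabulations interpreted above can be produced with a base R sketch like the following. It assumes scaled iris features, Euclidean distances and Ward linkage (ward.D2); whether the counts match those quoted above depends on the same settings having been used, so no expected output is shown. The column name cluster_id_3 is hypothetical.

```r
features <- scale(iris[, 1:4])
hc_ward <- hclust(dist(features, method = "euclidean"), method = "ward.D2")

# Combine scaled features, the original species and the cluster labels
iris_cluster_df <- data.frame(
  features,
  Species = iris$Species,
  cluster_id_2 = cutree(hc_ward, k = 2),
  cluster_id_3 = cutree(hc_ward, k = 3)
)

# Rows are clusters, columns are species
tab_2 <- table(iris_cluster_df$cluster_id_2, iris_cluster_df$Species)
tab_3 <- table(iris_cluster_df$cluster_id_3, iris_cluster_df$Species)

print(tab_2)
print(tab_3)
```

A cluster whose row is dominated by a single species indicates good agreement with the known labels; rows that spread across species indicate overlap.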