Chapter 6: PCA & Cluster Analysis Flashcards
Why are PCA and cluster analyses good for data exploration?
they are well suited to high-dimensional datasets (a large number of variables relative to the number of observations)
to make sense of these datasets, we need to consider large groups of variables jointly rather than pairs of variables; bivariate data exploration tools (correlation matrices and scatterplots) are ineffective in this setting
how can unsupervised learning help supervised learning?
unsupervised learning has the potential to generate useful features as by-products of the data exploration process.
what is the definition of PCA?
PCA is an advanced data analytic technique that transforms a high-dimensional dataset into a smaller, much more manageable set of representative variables that capture most of the information in the original dataset.
what are PCs in PCA?
they are composite variables, each formed as a linear combination of the existing variables.
- they are mutually uncorrelated and collectively simplify the dataset, reducing its dimension and making it more amenable for data exploration and visualization
T/F: typically, the observations of the features for PCA have been centered to have a zero mean.
TRUE
what are loadings in PCA?
they are the coefficients of the mth PC corresponding to the p features
(notation: i = 1, …, n indexes the observations and j = 1, …, p indexes the features)
T/F: the PCs are a linear combination of the n observations, so the sum of the PCs is taken over i, not j.
FALSE: the PCs are constructed from the features, so the sum is taken over j
how many loadings does the mth PC have?
p, one for each feature
how would we find the loadings for the first PC, Z1? any constraints?
we find the p loadings such that they maximize the sample variance of Z1 (see the formulation sketched below)
constraints:
- the sum of squares of the p loadings must equal 1 (otherwise the variance could be made arbitrarily large just by inflating the loadings)
- for the subsequent PCs (not Z1 itself), each loading vector must also be orthogonal to (uncorrelated with) those of the previous PCs
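in symbols (the usual formulation, with centered feature values x_ij and first-PC loadings φ_j1; sample variance taken with the 1/n convention):

$$ z_{i1} = \phi_{11}x_{i1} + \phi_{21}x_{i2} + \cdots + \phi_{p1}x_{ip}, \qquad \max_{\phi_{11},\dots,\phi_{p1}} \ \frac{1}{n}\sum_{i=1}^{n} z_{i1}^{2} \quad \text{subject to} \quad \sum_{j=1}^{p}\phi_{j1}^{2} = 1 $$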
geometrically, how do the p loadings of Z1 look with respect to the data?
the p loadings define a direction (a line) in the p-dimensional feature space along which the data vary the most
given the first PC, how are the subsequent PCs defined?
defined in the same way as the first PC, with the added condition that each new PC must be uncorrelated with all of the preceding PCs
how does PCA reduce the dimension of a dataset?
it takes the p original variables and outputs M < p PCs that together retain most of the information in the data, as measured by variance.
with the dimension reduction, the dataset becomes much easier to explore and visualize.
how does PCA generate features?
this is the most important application of PCA.
once we have settled on the number of PCs to use, the original variables are replaced by the PCs, which capture most of the information in the dataset and serve as predictors for the target variable.
these predictors are mutually uncorrelated, so collinearity is no longer an issue.
by reducing the dimension of the data and the complexity of the model, we hope to optimize the bias-variance trade-off and improve the prediction accuracy of the model.
How do we choose M, the number of PCs to use?
we assess the proportion of variance explained by each PC in comparison to the total variance present in the data
how to find the total variance of the dataset in PCA?
the sum of the sample variances of the p variables
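in symbols, the PVE of the mth PC compares its variance to the total variance:

$$ \text{PVE}_m = \frac{\operatorname{Var}(Z_m)}{\sum_{j=1}^{p}\operatorname{Var}(X_j)}, \qquad \sum_{m=1}^{p}\text{PVE}_m = 1 $$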
why is it important to have scaled variables for PCA?
because it makes the PVEs meaningful to compare: without scaling, a variable with an unusually large variance dominates the total variance and distorts the PVEs
T/F: PVEs are monotonically increasing
false
by the definition of the PCs, the PVEs are monotonically decreasing in m.
this is because each subsequent PC must satisfy more orthogonality constraints and therefore has less flexibility in the choice of its loadings.
so the first PC explains the greatest amount of variance, and the amount explained decreases from there.
what graphical tool could you use to find the number of PCs to use? how can you justify your choice
a scree plot!
look for the elbow in the plot.
justification: the PVE of the next PC is low enough that it can be dropped without losing much information
Is mean-centering the variables in PCA a great concern? why or why not?
not really; it does not affect the PC loadings, since they are defined to maximize the sample variance of the PC scores.
- the sample variance of a variable is unchanged when the same constant is added to or subtracted from it (see the quick check below)
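a quick check in R:

x <- c(2, 4, 6, 8)
var(x)            # 6.666667
var(x - mean(x))  # same value: centering does not change the variance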
what is the difference between PC loadings, PCs and PC scores?
- PC loadings: the coefficients φ_jm that define each PC as a linear combination of the p features
- PCs: the composite variables Z_m themselves (the linear combinations)
- PC scores: the values z_im of a PC evaluated at each of the n observations (stored in PCA$x in R)
why should we scale variables in PCA?
if we conduct PCA using the variables on their original scale, the PC loadings are determined based on the sample COVariance matrix of the variables
if we conduct PCA using the standardized variables, the PC loadings are determined based on the sample CORRelation matrix.
if no scaling is done and the variables are on different orders of magnitude, the variables with an unusually large variance receive large PC loadings and dominate the corresponding PCs, even though there is no guarantee that those variables explain much of the underlying pattern in the data. see the illustration below.
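for instance, a sketch using the built-in USArrests dataset, where Assault has a much larger variance than the other variables:

apply(USArrests, 2, var)                         # Assault has by far the largest variance

PCA.raw    <- prcomp(USArrests, scale. = FALSE)  # loadings based on the covariance matrix
PCA.scaled <- prcomp(USArrests, scale. = TRUE)   # loadings based on the correlation matrix

PCA.raw$rotation[, 1]     # Assault dominates the first PC when no scaling is done
PCA.scaled$rotation[, 1]  # the loadings are much more balanced after scaling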
can PCA be applied to categorical predictors?
no, standard PCA works only with numeric variables
what are 2 drawbacks of PCA?
Interpretability: not an easy task to make sense of the PCs in terms of effect on the target variable.
Linearity: the PCs are linear combinations of the features, so PCA does not capture non-linear relationships in the data well
how can we see the std dev of the variables in a dataframe in R?
use the apply() function
apply(dataset, 2, sd)
the second argument is MARGIN: MARGIN = 2 applies the function to the columns of the data frame; MARGIN = 1 applies it to the rows
what function to use when running a PCA in R?
prcomp()
PCA <- prcomp(dataset, center = TRUE, scale. = TRUE)
- center = TRUE and scale. = TRUE centre and scale the variables before the PCA is run
how to extract the PC loadings from the PCA object?
PCA.object.name$rotation
how to extract the PC scores from a PCA object
PCA.object.name$x
what is a biplot?
plot the scores of the PCs against one another to produce a low-dimensional view of the data
how do you create a biplot in r?
biplot() function
biplot(PCA, scale = 0, cex = 0.6)  # experiment with the scale and cex arguments
what do the axes on a biplot represent?
each direction carries two axes that refer to the same PC but on different scales:
- top and right axes = PC loadings (check the scale; a loading cannot exceed 1 in absolute value)
- bottom and left axes = PC scores
how to interpret the many many points of a biplot? how to interpret the lines?
points: the PC scores of the observations (read their values off the bottom and left axes)
lines: the PC loading vectors (read their values off the top and right axes)
how can you use a biplot to tell which variables are correlated to one another?
use the location of the loading values for each of the variables.
ex. variables that have similar loading vector values (endpoints of the lines) will be strongly correlated
what happens when you use the summary() function on a PCA object?
it outputs the std.dev, PVE and cumulative PVE of each PC
How do we add a PC as a new feature to the original dataset? (feature generation) 2 methods; when should each be used?
method 1:
dataset.new <- dataset
dataset.new$PC1 <- PCA$x[, 1]  # take whichever PC score column you want (here, the first PC)
method 2: (suggested when there are many variables in the dataset)
dataset.new <- dataset
dataset.scaled <- as.data.frame(scale(dataset.new))  # scale() returns a numeric matrix, so convert it back to a data frame
- we scale the dataset because the PCA was run on centred and scaled variables, so the loadings must be applied to the standardized values
dataset.new$newvar <- PCA$rotation[1, 1] * dataset.scaled$var1 + PCA$rotation[2, 1] * dataset.scaled$var2 + …  # var1, var2, … are placeholders for the scaled variable names
then delete the original variables that the new feature replaces (see the worked sketch below)
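as a concrete illustration (a sketch using the built-in USArrests dataset rather than the module's data; the object names are illustrative), the two methods produce the same new feature:

PCA <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# method 1: take the stored PC scores directly
USArrests.new <- USArrests
USArrests.new$PC1 <- PCA$x[, 1]

# method 2: recompute the first PC score from the loadings and the scaled data
USArrests.scaled <- as.data.frame(scale(USArrests))
USArrests.new$PC1.manual <- drop(as.matrix(USArrests.scaled) %*% PCA$rotation[, 1])

all.equal(USArrests.new$PC1, USArrests.new$PC1.manual)  # TRUE (up to floating-point error)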
how does cluster analysis work?
it works by partitioning the observations in a dataset into a set of distinct homogeneous clusters, with the goal of revealing hidden patterns in the data.
observations within each cluster share similar feature values, while observations in different clusters are quite different from one another
how could clustering help with supervised learning?
the group assignments created by the clustering form a factor variable which may serve as a useful feature for supervised learning
what is the goal of k-means clustering?
assign each observation in a dataset into one and only one of k clusters.
k is prespecified
how are the k clusters chosen?
such that the variation of the observations inside each cluster is relatively small while the variation between clusters is large.
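in symbols (the standard formulation, with within-cluster variation measured by squared Euclidean distance):

$$ \min_{C_1,\dots,C_K}\ \sum_{k=1}^{K} W(C_k), \qquad W(C_k) = \frac{1}{|C_k|}\sum_{i,i' \in C_k}\sum_{j=1}^{p}\left(x_{ij}-x_{i'j}\right)^{2} $$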
explain the algorithm for k-means clustering.
initialization:
1. randomly select k points in the feature space; these serve as the initial cluster centres
iteration:
step 1: assign each observation to the cluster with the closest centre in terms of euc distance
step 2: recalculate the centre of the k clusters
step 3: repeat step 1 and 2 until the cluster assignments no longer change
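to make the iteration concrete, here is a bare-bones sketch of the algorithm written from scratch in R (for illustration only; in practice we use kmeans(), and the function name simple_kmeans and its arguments are made up):

simple_kmeans <- function(X, k, max.iter = 100) {
  X <- as.matrix(X)
  # initialization: randomly pick k observations as the initial cluster centres
  centres <- X[sample(nrow(X), k), , drop = FALSE]
  assignment <- rep(0, nrow(X))
  for (iter in 1:max.iter) {
    # step 1: assign each observation to the cluster with the closest centre (squared Euclidean distance)
    dists <- sapply(1:k, function(j) colSums((t(X) - centres[j, ])^2))
    new.assignment <- max.col(-dists)
    # step 3: stop once the cluster assignments no longer change
    if (identical(new.assignment, assignment)) break
    assignment <- new.assignment
    # step 2: recalculate each cluster centre as the mean of its observations
    # (no handling of empty clusters here; this is a sketch, not a robust implementation)
    for (j in 1:k) centres[j, ] <- colMeans(X[assignment == j, , drop = FALSE])
  }
  list(cluster = assignment, centers = centres)
}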
T/F: in each iteration of k-means clustering, the within-cluster variation is automatically reduced
true.
at the completion of the algorithm we are guaranteed to arrive at a local optimum. why not necessarily global?
because the algorithm relies on the initial assignments, which are made randomly.
a different set of initial assignments may end up with a different final set of clusters and a different local optimum
what happens if one of the initial cluster centres is too close to an outlier
only that outlier will get assigned to that centre and it will form its own cluster (separated from the rest of the data)
how could we increase the chance of identifying a global optimum for k-means clustering?
run the clustering algorithm 20-50 times with different initial cluster assignments and keep the solution with the smallest total within-cluster variation
how do we choose k in k-means clustering?
the elbow method.
we can make a plot of the ratio of the between-cluster variation to the total variation in the data against the value of k.
when this ratio has plateaued, we have reached the elbow, and the corresponding value of k provides an appropriate number of clusters to segment the data
what are the two main differences between hierarchical clustering and k-means?
1. hierarchical clustering does not require the number of clusters to be chosen in advance
2. the cluster groupings can be displayed visually (dendrogram)
does hierarchical clustering start from bottom or top?
bottom
explain the hierarchical process
starts with the individual observations, each treated as a separate cluster and successively fuses the closest pair of clusters, one pair at a time.
this process goes on iteratively until all clusters are eventually fused into a single cluster containing all observations
what are the 4 linkages used in hierarchical clustering? describe them
complete: the maximum pairwise distance between observations in one cluster and observations in the other cluster
single: the minimum pairwise distance
average: the average of all pairwise distances
centroid: the distance between the centroids (mean vectors) of the two clusters
in practice, which linkages are used most often and why?
average and complete, because they tend to result in more balanced clusters (clusters with similar numbers of observations) and more visually appealing dendrograms
what type of dendrogram does single linkage lead to?
skewed: extended, trailing clusters in which single observations are fused one at a time
when is linkage used in hierarchical clustering?
it is not needed when fusing two single-observation clusters; there the inter-cluster distance is simply the Euclidean distance between the two observations
it is needed to define the distance between two clusters when at least one of them contains more than one observation
at each step, the pair of clusters with the smallest inter-cluster distance (as measured by the linkage) is fused
do you really understand hierarchical clustering?
if not go to page 522 example 6.2.1
how can we tell which clusters are most similar when looking at the dendrogram?
clusters that are most similar = fused at the bottom
is randomization needed for k-means/hierarchical?
k: yes (for initial cluster centers)
h: no
is the number of clusters pre-specified in k/h?
k: yes (k needs to be specified)
h: no (can cut the dendrogram at any height)
are the clusters nested in k/h?
k: no
h: yes (it is a hierarchy of clusters)
does scaling the variables matter in k/h? what happens with/without scaling?
yes it matters for both, because the euclidean distance calculations depend very much on the scale on which the feature values are measured.
without scaling:
- if the variables are not of the same unit, one may have a larger order of magnitude and that variable will dominate the distance calculations and exert a disproportionate impact on the cluster arrangements
with scaling:
- we attach equal importance to each feature when performing distance calculations (which is more desirable for most applications)
how do you run a k-means clustering? do we need to set the seed?
using the kmeans() function.
need to set.seed() because of the random initial assignment step (for reproducibility)
takes a data matrix X
centers = # of clusters
nstart = how many random selections of initial cluster assignments (usually 20-50)
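a minimal usage sketch (assuming the clustering variables have already been scaled into a numeric matrix X.scaled; the object names are illustrative):

set.seed(42)                       # reproducibility of the random initializations
km <- kmeans(X.scaled, centers = 3, nstart = 25)
km$cluster                         # cluster assignment of each observation
km$centers                         # the k cluster centres
km$betweenss / km$totss            # proportion of total variation explained by the clustering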
how to choose the number of clusters?
there is a long code chunk for this, but the idea is: take the original variables from the dataset, scale them, and run k-means clustering for k = 1 to 10 clusters.
then compute the ratio of the between-cluster SS to the total SS (bss/tss) for each k and store the ratios in a data frame.
then create an elbow plot using ggplot2 (a fuller sketch follows the snippet below).
elbow plot:
k = 1:10
ggplot(ratio.df, aes(x = k, y = bss_tss)) + geom_point() + geom_line() + labs(title = "elbow plot")  # ratio.df = data frame of the bss/tss ratios
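a fuller sketch of the whole procedure (assuming the clustering variables sit in a numeric data frame called cluster.vars; the object names are illustrative):

library(ggplot2)

X.scaled <- scale(cluster.vars)     # scale the clustering variables

set.seed(42)
bss_tss <- sapply(1:10, function(k) {
  km <- kmeans(X.scaled, centers = k, nstart = 25)
  km$betweenss / km$totss           # proportion of total variation explained by k clusters
})

ratio.df <- data.frame(k = 1:10, bss_tss = bss_tss)

ggplot(ratio.df, aes(x = k, y = bss_tss)) +
  geom_point() +
  geom_line() +
  labs(title = "elbow plot", y = "between-cluster SS / total SS")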
how could we visualize the results of a k-means clustering?
- extract the numeric vector of the group assignments and convert it to a factor
dataset$group <- as.factor(km$cluster)  # km = the fitted kmeans object
- create a scatterplot of the first two PC scores, coloured by cluster:
ggplot(dataset, aes(x = PC1, y = PC2, col = group, label = row.names(dataset))) + geom_point() + geom_text(vjust = 1)
how do you implement a hierarchical clustering in R?
hclust() function
it doesn't take a data frame; it takes the pairwise Euclidean distances between the n observations (a distance object) as its input
- this matrix can be found by using the dist() function
ex.
hclust(dist(cluster.variables), method = "complete")
how to plot a dendrogram in r?
plot( name of your hclust object, cex = 0.5)
how to cut the dendrogram in r?
using the cutree() function
ex. cutree( hclust object, # of clusters we want)
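putting the pieces together (a sketch assuming the clustering variables sit in a numeric data frame called cluster.vars; the object and variable names are illustrative):

hc <- hclust(dist(scale(cluster.vars)), method = "complete")  # complete linkage on scaled variables
plot(hc, cex = 0.5)                                           # dendrogram
groups <- cutree(hc, 4)                                       # cut the dendrogram into 4 clusters
table(groups)                                                 # cluster sizes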