Chapter 6: PCA & Cluster Analysis Flashcards

1
Q

Why are PCA and cluster analyses good for data exploration?

A

they are good for high-dimensional datasets (a large number of variables relative to the number of observations)

to make sense of these datasets, it is necessary to consider a large group of variables jointly, as opposed to pairs of variables via bivariate data exploration (correlation matrices and scatterplots are ineffective in this setting).

2
Q

how can unsupervised learning help supervised learning?

A

it has the potential to generate useful features as by-products of the data exploration process.

3
Q

what is the definition of PCA?

A

PCA is an advanced data analytic technique that transforms a high-dimensional dataset into a smaller, much more manageable set of representative variables that capture most of the information in the original dataset.

4
Q

what are PCs in PCA?

A

they are composite variables, each a linear combination of the existing variables.
- they are mutually uncorrelated and collectively simplify the dataset, reducing its dimension and making it more amenable for data exploration and visualization

5
Q

T/F: typically, the observations of the features for PCA have been centered to have a zero mean.

A

TRUE

6
Q

what are loadings in PCA?

A

they are the coefficients of the mth PC corresponding to the p features
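As a sketch in symbols (standard PCA notation, assumed here rather than stated on the card), the mth PC is

Z_m = \phi_{1m} X_1 + \phi_{2m} X_2 + \cdots + \phi_{pm} X_p

where \phi_{1m}, …, \phi_{pm} are its p loadings.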

7
Q

if i = 1, …, n and j = 1, …, p

T/F: the PCs are a linear combination of the n observations, so the sum of the PCs is taken over i, not j.

A

FALSE: the PCs are constructed from the features, so the sum is taken over j

8
Q

how many loadings does the mth PC have?

A

p, one for each feature

9
Q

how would we find the loadings for the first PC, Z1? any constraints?

A

we find the p loadings such that they maximize the sample variance of Z1

constraints:
- the sum of squares of the p loadings must equal 1 (normalization)
- for the subsequent PCs, each loading vector must additionally be orthogonal to (uncorrelated with) the previous PCs
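As a sketch, the standard formulation of this optimization (with the x_{ij} assumed centered) is

\max_{\phi_{11}, \ldots, \phi_{p1}} \; \frac{1}{n} \sum_{i=1}^{n} \Big( \sum_{j=1}^{p} \phi_{j1} x_{ij} \Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} \phi_{j1}^2 = 1

i.e., maximize the sample variance of the scores z_{i1} subject to the normalization constraint.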

10
Q

geometrically, how do the p loadings of Z1 look with respect to the data?

A

the p loadings define a direction (line) in the p-dimensional feature space along which the data vary the most

11
Q

given the first PC, how are the subsequent PCs defined?

A

the same way as the first PC (maximizing the sample variance subject to the normalization constraint), but with the added condition that each must be uncorrelated with (orthogonal to) the preceding PCs

12
Q

how does PCA reduce the dimension of a dataset?

A

it takes the p original variables and outputs M PCs (with M much smaller than p) that together retain most of the information in the data, as measured by variance.

with the dimension reduction, the dataset becomes much easier to explore and visualize.

13
Q

how does PCA generate features?

A

this is the most important application of PCA.

once we have settled on the number of PCs to use, the original variables are replaced by the PCs, which capture most of the information in the dataset and serve as predictors for the target variable.

these predictors are mutually uncorrelated, so collinearity is no longer an issue.

by reducing the dimension of the data and the complexity of the model, we hope to optimize the bias-variance trade-off and improve the prediction accuracy of the model.

14
Q

How do we choose M, the number of PCs to use?

A

we assess the proportion of variance explained by each PC in comparison to the total variance present in the data

15
Q

how to find the total variance of the dataset in PCA?

A

the sum of the sample variances of the p variables
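Combining this card with the previous one, a sketch of the PVE of the mth PC (assuming centered variables) is

\text{PVE}_m = \frac{\frac{1}{n}\sum_{i=1}^{n} z_{im}^2}{\frac{1}{n}\sum_{j=1}^{p} \sum_{i=1}^{n} x_{ij}^2} = \frac{\text{variance explained by the mth PC}}{\text{total variance}}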

16
Q

why is it important to have scaled variables for PCA?

A

so that the PVEs are meaningful when we compare them across PCs; otherwise variables with unusually large variances dominate the total variance (see card 21)

17
Q

T/F: PVEs are monotonically increasing

A

false

by the definition of PCs, the PVEs are monotonically decreasing in m.
this is because subsequent PCs have more and more orthogonality constraints to comply with and therefore less flexibility with the choice of the PC loadings.

So the first PC explains the greatest amount of variance, and the PVE decreases from there on.

18
Q

what graphical tool could you use to find the number of PCs to use? how can you justify your choice

A

a scree plot!

look for the elbow in the plot.

justification: the PVE of the next PC is sufficiently low that it can be dropped without losing much information

19
Q

Is mean-centering the variables in PCA a great concern? why or why not?

A

not really, it does not affect the PC loadings since they are defined to maximize the sample variance of the PC scores.
- the sample variance remains unchanged when we add the same constant to, or subtract it from, a variable.
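A quick check in R (a minimal sketch; USArrests is just a built-in numeric dataset used for illustration):

PCA.a <- prcomp(USArrests, center = TRUE)
PCA.b <- prcomp(USArrests + 100, center = TRUE)   # add the same constant to every variable
all.equal(PCA.a$rotation, PCA.b$rotation)         # TRUE: the loadings are unaffected by the shift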

20
Q

what is the difference between PC loadings, PCs and PC scores?

A

PC loadings: the coefficients (weights) on the p features that define each PC.
PCs: the composite variables themselves, i.e., the linear combinations of the original features.
PC scores: the values each PC takes for each of the n observations (the z_im values).

21
Q

why should we scale variables in PCA?

A

if we conduct PCA using the variables on their original scale, the PC loadings are determined based on the sample COVariance matrix of the variables

if we conduct PCA using the standardized variables, the PC loadings are determined based on the sample CORRelation matrix.

if no scaling is done and the variables are on different orders of magnitude, then a variable with an unusually large variance will receive a large PC loading and dominate the corresponding PC, even though it is not guaranteed that this variable explains much of the underlying pattern in the data to begin with.
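A minimal sketch of the difference in R (again using the built-in USArrests data, where Assault happens to have a far larger variance than the other variables):

apply(USArrests, 2, var)                      # Assault has by far the largest variance
prcomp(USArrests, scale. = FALSE)$rotation    # unscaled: loadings based on the covariance matrix; Assault dominates PC1
prcomp(USArrests, scale. = TRUE)$rotation     # scaled: loadings based on the correlation matrix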

22
Q

can PCA be applied to categorical predictors?

A

no

23
Q

what are 2 drawbacks of PCA?

A

Interpretability: it is not an easy task to make sense of the PCs in terms of their effect on the target variable.
Linearity: PCA is based on linear combinations of the variables, so it doesn't capture non-linear relationships well.

24
Q

how can we see the std dev of the variables in a dataframe in R?

A

use the apply() function

apply(dataset, MARGIN = 2, FUN = sd)

MARGIN = 2 : applies the function to the columns of the data frame
MARGIN = 1 : applies it to the rows
25
Q

what function to use when running a PCA in R?

A

prcomp()

PCA <- prcomp(dataset, center = TRUE, scale. = TRUE)
- center = TRUE and scale. = TRUE center and scale the variables

26
Q

how to extract the PC loadings from the PCA object?

A

PCA.object.name $ rotation

27
Q

how to extract the PC scores from a PCA object

A

PCA.object.name $ x

28
Q

what is a biplot?

A

a plot of the scores of the first two PCs against each other, with the PC loading vectors overlaid, producing a low-dimensional view of the data

29
Q

how do you create a biplot in r?

A

biplot() function

biplot(PCA, scale = 0, cex = 0.6)
- scale = 0 ensures the arrows are scaled to represent the PC loadings; cex controls the size of the plotted text (play around with these two arguments)

30
Q

what do the axes on a biplot represent?

A

each pair of opposite axes represents the same PC: the bottom and top axes correspond to the first PC, and the left and right axes to the second.

top, right = PC loadings (look at the scale; a loading cannot be greater than 1 in absolute value)

bottom, left = PC scores

31
Q

how to interpret the many many points of a biplot? how to interpret the lines?

A

points: they are the PC scores! (get their value with the bottom, left axes)
lines: PC loading vectors (get their value with the top, right axes)

32
Q

how can you use a biplot to tell which variables are correlated to one another?

A

use the location of the loading values for each of the variables.

ex. variables that have similar loading vector values (endpoints of the lines) will be strongly correlated

33
Q

what happens when you use the summary() function on a PCA object?

A

it outputs the std.dev, PVE and cumulative PVE of each PC
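For example (PCA being the prcomp object from the earlier prcomp() card):

summary(PCA)   # prints, for each PC: Standard deviation, Proportion of Variance (the PVE), Cumulative Proportion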

34
Q

How do we introduce a PC into the original dataset as a new feature? (feature generation) 2 methods; when should you use each method?

A

method 1:

dataset.new <- dataset
dataset.new$PC1 <- PCA$x[, 1]   # take whichever column of the score matrix you want (column 1 inserts the first PC)

method 2: (suggested when there are many variables in the dataset)

dataset.new <- dataset
dataset.scaled <- as.data.frame(scale(dataset.new))   # scale() returns a numeric matrix, so we convert it back to a data frame
- we scale the dataset because the PCA was run on centered and scaled variables (card 25), so the loadings must be applied to the standardized values

dataset.new$newvar <- PCA$rotation[1, 1] * dataset.scaled$var1 + ….   # var1 stands for the first variable; continue with the remaining loadings
and then delete the old variables

35
Q

how does cluster analysis work?

A

it works by partitioning the observations in a dataset into a set of distinct homogeneous clusters, with the goal of revealing hidden patterns in the data.

observations in the same cluster share similar feature values, while observations in different clusters are rather different from one another

36
Q

how could clustering help with supervised learning?

A

the group assignments created as a result of the clustering form a factor variable which may serve as a useful feature for supervised learning

37
Q

what is the goal of k-means clustering?

A

assign each observation in the dataset to one and only one of k clusters.

k is prespecified

38
Q

how are the k clusters chosen?

A

such that the variation of the observations inside each cluster is relatively small while the variation between clusters is large.

39
Q

explain the algorithm for k-means clustering.

A

initialization:
1. randomly select k points in the feature space; these are the initial cluster centres

iteration:
step 1: assign each observation to the cluster whose centre is closest in terms of Euclidean distance
step 2: recalculate the centres of the k clusters
step 3: repeat steps 1 and 2 until the cluster assignments no longer change (sketched in R below)
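A bare-bones R sketch of these steps, for intuition only (the kmeans() function on a later card is what you would actually use; USArrests is just an example dataset, and the sketch assumes no cluster ever becomes empty):

X <- scale(USArrests)                              # any numeric matrix of cluster variables
k <- 3                                             # prespecified number of clusters
set.seed(1)                                        # the initialization is random, so fix a seed
centres <- X[sample(nrow(X), k), , drop = FALSE]   # initialization: k randomly chosen observations as centres
assign.old <- rep(0L, nrow(X))
repeat {
  # step 1: assign each observation to the cluster with the closest centre (squared Euclidean distance)
  d <- sapply(1:k, function(m) colSums((t(X) - centres[m, ])^2))
  assign.new <- max.col(-d, ties.method = "first")
  # step 2: recalculate each cluster centre as the mean of the observations assigned to it
  centres <- apply(X, 2, function(col) tapply(col, factor(assign.new, levels = 1:k), mean))
  # step 3: stop once the cluster assignments no longer change
  if (identical(assign.new, assign.old)) break
  assign.old <- assign.new
}
table(assign.new)                                  # cluster sizes at convergence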

40
Q

T/F: in each iteration of k-means clustering, the within-cluster variation is automatically reduced

A

true.

41
Q

at the completion of the algorithm we are guaranteed to arrive at a local optimum. why not necessarily global?

A

because the algorithm relies on the initial assignments, which are made randomly.

a different set of initial assignments may end up with a different final set of clusters and a different local optimum

42
Q

what happens if one of the initial cluster centres is too close to an outlier

A

only that outlier will get assigned to that centre and it will form its own cluster (separated from the rest of the data)

43
Q

how could we increase the chance of identifying a global optimum for k-means clustering?

A

run the clustering algorithm 20-50 times with different initial cluster assignments and keep the run with the smallest total within-cluster variation

44
Q

how do we choose k in k-means clustering?

A

the elbow method.

we can plot the ratio of the between-cluster variation to the total variation in the data against the value of k.

when this proportion of variation explained has plateaued, we have reached the elbow, and the corresponding value of k provides an appropriate number of clusters to segment the data

45
Q

what are the two main differences between hierarchical clustering and k-means?

A
1. hierarchical clustering does not require the choice of k in advance

2. the cluster groupings can be displayed visually (in a dendrogram)

46
Q

does hierarchical clustering start from bottom or top?

A

bottom

47
Q

explain the hierarchical process

A

starts with the individual observations, each treated as a separate cluster and successively fuses the closest pair of clusters, one pair at a time.

this process goes on iteratively until all clusters are eventually fused into a single cluster containing all observations

48
Q

what are the 4 linkages used in hierarchical clustering? describe them

A

complete: the maximum pairwise distance between observations in the two clusters
single: the minimum pairwise distance between observations in the two clusters
average: the average of all pairwise distances between observations in one cluster and observations in the other cluster
centroid: the distance between the centroids of the two clusters

49
Q

in practice, which linkages are used most often and why?

A

average and complete because they tend to result in more balanced (clusters formed have a similar number of observations) and visually appealing clusters

50
Q

what type of dendrogram does single linkage lead to?

A

a skewed one: extended, trailing clusters in which single observations are fused one at a time

51
Q

when is linkage used in hierarchical clustering?

A

it is not needed for the first fusion, where all clusters are single observations and the inter-cluster distance is just the minimum pairwise Euclidean distance.
it is only needed when calculating the distance between two clusters of which at least one contains more than one observation.

the two clusters with the minimum distance in terms of the linkage are the ones that get fused

52
Q

do you really understand hierarchical clustering?

A

if not go to page 522 example 6.2.1

53
Q

how can we tell which clusters are most similar when looking at the dendrogram?

A

clusters that are most similar = fused at the bottom

54
Q

is randomization needed for k-means/hierarchical?

A

k: yes (for initial cluster centers)
h: no

55
Q

is the number of clusters pre-specified in k/h?

A

k: yes (k needs to be specified)
h: no (can cut the dendrogram at any height)

56
Q

are the clusters nested in k/h?

A

k: no
h: yes (it is a hierarchy of clusters)

57
Q

does scaling the variables matter in k/h? what happens with/without scaling?

A

yes it matters for both, because the euclidean distance calculations depend very much on the scale on which the feature values are measured.

without scaling:
- if the variables are not of the same unit, one may have a larger order of magnitude and that variable will dominate the distance calculations and exert a disproportionate impact on the cluster arrangements

with scaling:
- we attach equal importance to each feature when performing distance calculations (which is more desirable for most applications)

58
Q

how do you run a k-means clustering? do we need to set the seed?

A

using the kmeans() function.
we need to set.seed() because of the random initial cluster assignments

arguments:
- a data matrix X (the variables to cluster on)
- centers = the number of clusters
- nstart = the number of random selections of initial cluster assignments to try (usually 20-50); see the usage sketch below
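A minimal usage sketch (names are placeholders; cluster.variables is assumed to be a numeric data frame of the features to cluster on):

set.seed(100)                                           # because of the random initial assignments
km <- kmeans(scale(cluster.variables), centers = 3, nstart = 25)
km$cluster                                              # the cluster assignment of each observation
km$betweenss / km$totss                                 # between-cluster SS as a proportion of total SS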

59
Q

how to choose the number of clusters?

A

there is a long piece of code for this, but the idea is that we take the original variables in the dataset, scale them, and then run k-means clustering for each number of clusters from 1 to 10.

then compute the ratio of between-cluster SS to total SS (bss/tss) for each round and collect the ratios in a data frame

then finally create an elbow plot using ggplot2 (full sketch below):
k = 1:10
ggplot(dataframe.of.ratios, aes(x = k, y = bss_tss)) + geom_point() + geom_line() + labs(title = "elbow plot")
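A sketch of that 'long code' (assuming the cluster variables sit in a numeric data frame called cluster.variables; all names are placeholders):

library(ggplot2)
vars.scaled <- scale(cluster.variables)          # scale the variables first
bss_tss <- numeric(10)
for (k in 1:10) {
  set.seed(100)
  km <- kmeans(vars.scaled, centers = k, nstart = 25)
  bss_tss[k] <- km$betweenss / km$totss          # ratio of between-cluster SS to total SS
}
ratios <- data.frame(k = 1:10, bss_tss = bss_tss)
ggplot(ratios, aes(x = k, y = bss_tss)) +
  geom_point() +
  geom_line() +
  labs(title = "Elbow plot")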

60
Q

how could we visualize the results of a k-means clustering?

A
  1. extract the numeric vector of the group assignments from the kmeans object and convert it to a factor
    dataset$group <- as.factor(kmeans.object$cluster)
  2. create a scatterplot (e.g., of the first two PC scores, coloured by group)
    ggplot(dataset, aes(x = PC1, y = PC2, col = group, label = row.names(dataset))) + geom_point() + geom_text(vjust = 1)
61
Q

how do you implement a hierarchical clustering in R?

A

hclust() function

it doesn't take a data frame; it takes an object carrying the pairwise Euclidean distances between the n observations as input
- this distance object can be produced with the dist() function

ex.
hclust( dist( cluster.variables ), method = “complete” )

62
Q

how to plot a dendrogram in r?

A

plot( name of your hclust object, cex = 0.5)

63
Q

how to cut the dendrogram in r?

A

using the cutree() function

ex. cutree( hclust object, # of clusters we want)
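A short usage sketch tying the last few cards together (names are placeholders; cluster.variables as before):

hc <- hclust(dist(cluster.variables), method = "complete")
plot(hc, cex = 0.5)                  # dendrogram
groups <- cutree(hc, k = 3)          # cut the tree into 3 clusters
dataset$group <- as.factor(groups)   # attach the assignments as a factor feature for supervised learning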