Chapter 6: PCA & Cluster Analysis Flashcards
Why are PCA and cluster analyses good for data exploration?
they are well suited to high-dimensional datasets (a large number of variables relative to the number of observations)
to make sense of these datasets, we need to consider large groups of variables jointly rather than pairs of variables; bivariate data exploration tools (correlation matrices and scatterplots) are ineffective in this setting
how can unsupervised learning help supervised learning?
unsupervised learning has the potential to generate useful features as by-products of the data exploration process.
what is the definition of PCA?
PCA is an advanced data analytic technique that transforms a high-dimensional dataset into a smaller, much more manageable set of representative variables that capture most of the information in the original dataset.
what are PCs in PCA?
they are composite variables, each formed as a linear combination of the existing variables.
- they are mutually uncorrelated and collectively simplify the dataset, reducing its dimension and making it more amenable for data exploration and visualization
T/F: typically, the observations of the features for PCA have been centered to have a zero mean.
TRUE
what are loadings in PCA?
they are the coefficients of the mth PC corresponding to the p features
(notation: i = 1, …, n indexes the observations and j = 1, …, p indexes the features)
T/F: the PCs are a linear combination of the n observations, so the sum of the PCs is taken over i, not j.
FALSE: the PCs are constructed from the features, so the sum is taken over j
how many loadings does the mth PC have?
p, one for each feature
how would we find the loadings for the first PC, Z1? any constraints?
we find the p loadings such that they maximize the sample variance of Z1 (see the formulation sketched below)
constraints:
- the sum of squares of the p loadings must equal 1 (otherwise the variance could be made arbitrarily large just by inflating the loadings)
- for the subsequent PCs (not Z1 itself), each loading vector must also be orthogonal to (uncorrelated with) those of the previous PCs
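in symbols (the usual formulation, with centered feature values x_ij and first-PC loadings φ_j1; sample variance taken with the 1/n convention):

$$ z_{i1} = \phi_{11}x_{i1} + \phi_{21}x_{i2} + \cdots + \phi_{p1}x_{ip}, \qquad \max_{\phi_{11},\dots,\phi_{p1}} \ \frac{1}{n}\sum_{i=1}^{n} z_{i1}^{2} \quad \text{subject to} \quad \sum_{j=1}^{p}\phi_{j1}^{2} = 1 $$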
geometrically, how do the p loadings of Z1 look with respect to the data?
the p loadings define a direction (a line) in the p-dimensional feature space along which the data vary the most
given the first PC, how are the subsequent PCs defined?
defined in the same way as the first PC, with the added condition that each new PC must be uncorrelated with all of the preceding PCs
how does PCA reduce the dimension of a dataset?
it takes the p original variables and outputs M < p PCs that together retain most of the information in the data, as measured by variance.
with the dimension reduction, the dataset becomes much easier to explore and visualize.
how does PCA generate features?
this is the most important application of PCA.
once we have settled on the number of PCs to use, the original variables are replaced by the PCs, which capture most of the information in the dataset and serve as predictors for the target variable.
these predictors are mutually uncorrelated, so collinearity is no longer an issue.
by reducing the dimension of the data and the complexity of the model, we hope to optimize the bias-variance trade-off and improve the prediction accuracy of the model.
How do we choose M, the number of PCs to use?
we assess the proportion of variance explained by each PC in comparison to the total variance present in the data
how to find the total variance of the dataset in PCA?
the sum of the sample variances of the p variables
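in symbols, the PVE of the mth PC compares its variance to the total variance:

$$ \text{PVE}_m = \frac{\operatorname{Var}(Z_m)}{\sum_{j=1}^{p}\operatorname{Var}(X_j)}, \qquad \sum_{m=1}^{p}\text{PVE}_m = 1 $$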
why is it important to have scaled variables for PCA?
because it makes the PVEs meaningful to compare: without scaling, a variable with an unusually large variance dominates the total variance and distorts the PVEs
T/F: PVEs are monotonically increasing
false
by the definition of the PCs, the PVEs are monotonically decreasing in m.
this is because each subsequent PC must satisfy more orthogonality constraints and therefore has less flexibility in the choice of its loadings.
so the first PC explains the greatest amount of variance, and the amount explained decreases from there.
what graphical tool could you use to find the number of PCs to use? how can you justify your choice
a scree plot!
look for the elbow in the plot.
justification: the PVE of the next PC is low enough that it can be dropped without losing much information
Is mean-centering the variables in PCA a great concern? why or why not?
not really; it does not affect the PC loadings, since they are defined to maximize the sample variance of the PC scores.
- the sample variance of a variable is unchanged when the same constant is added to or subtracted from it (see the quick check below)
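a quick check in R:

x <- c(2, 4, 6, 8)
var(x)            # 6.666667
var(x - mean(x))  # same value: centering does not change the variance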
what is the difference between PC loadings, PCs and PC scores?
- PC loadings: the coefficients φ_jm that define each PC as a linear combination of the p features
- PCs: the composite variables Z_m themselves (the linear combinations)
- PC scores: the values z_im of a PC evaluated at each of the n observations (stored in PCA$x in R)
why should we scale variables in PCA?
if we conduct PCA using the variables on their original scale, the PC loadings are determined based on the sample COVariance matrix of the variables
if we conduct PCA using the standardized variables, the PC loadings are determined based on the sample CORRelation matrix.
if no scaling is done and the variables are on different orders of magnitude, the variables with an unusually large variance receive large PC loadings and dominate the corresponding PCs, even though there is no guarantee that those variables explain much of the underlying pattern in the data. see the illustration below.
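for instance, a sketch using the built-in USArrests dataset, where Assault has a much larger variance than the other variables:

apply(USArrests, 2, var)                         # Assault has by far the largest variance

PCA.raw    <- prcomp(USArrests, scale. = FALSE)  # loadings based on the covariance matrix
PCA.scaled <- prcomp(USArrests, scale. = TRUE)   # loadings based on the correlation matrix

PCA.raw$rotation[, 1]     # Assault dominates the first PC when no scaling is done
PCA.scaled$rotation[, 1]  # the loadings are much more balanced after scaling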
can PCA be applied to categorical predictors?
no, standard PCA works only with numeric variables
what are 2 drawbacks of PCA?
Interpretability: not an easy task to make sense of the PCs in terms of effect on the target variable.
Linearity: the PCs are linear combinations of the features, so PCA does not capture non-linear relationships in the data well
how can we see the std dev of the variables in a dataframe in R?
use the apply() function
apply(dataset, 2, sd)
the second argument is MARGIN: MARGIN = 2 applies the function to the columns of the data frame; MARGIN = 1 applies it to the rows
what function to use when running a PCA in R?
prcomp()
PCA <- prcomp(dataset, center = TRUE, scale. = TRUE)
- center = TRUE and scale. = TRUE centre and scale the variables before the PCA is run
how to extract the PC loadings from the PCA object?
PCA.object.name$rotation
how to extract the PC scores from a PCA object
PCA.object.name$x
what is a biplot?
plot the scores of the PCs against one another to produce a low-dimensional view of the data
how do you create a biplot in r?
biplot() function
biplot(PCA, scale = 0, cex = 0.6)  # experiment with the scale and cex arguments
what do the axes on a biplot represent?
each direction carries two axes that refer to the same PC but on different scales:
- top and right axes = PC loadings (check the scale; a loading cannot exceed 1 in absolute value)
- bottom and left axes = PC scores
how to interpret the many many points of a biplot? how to interpret the lines?
points: the PC scores of the observations (read their values off the bottom and left axes)
lines: the PC loading vectors (read their values off the top and right axes)
how can you use a biplot to tell which variables are correlated to one another?
use the location of the loading values for each of the variables.
ex. variables that have similar loading vector values (endpoints of the lines) will be strongly correlated
what happens when you use the summary() function on a PCA object?
it outputs the std.dev, PVE and cumulative PVE of each PC
How do we add a PC as a new feature to the original dataset? (feature generation) 2 methods; when should each be used?
method 1:
dataset.new <- dataset
dataset.new$PC1 <- PCA$x[, 1]  # take whichever PC score column you want (here, the first PC)
method 2: (suggested when there are many variables in the dataset)
dataset.new <- dataset
dataset.scaled <- as.data.frame(scale(dataset.new))  # scale() returns a numeric matrix, so convert it back to a data frame
- we scale the dataset because the PCA was run on centred and scaled variables, so the loadings must be applied to the standardized values
dataset.new$newvar <- PCA$rotation[1, 1] * dataset.scaled$var1 + PCA$rotation[2, 1] * dataset.scaled$var2 + …  # var1, var2, … are placeholders for the scaled variable names
then delete the original variables that the new feature replaces (see the worked sketch below)
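as a concrete illustration (a sketch using the built-in USArrests dataset rather than the module's data; the object names are illustrative), the two methods produce the same new feature:

PCA <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# method 1: take the stored PC scores directly
USArrests.new <- USArrests
USArrests.new$PC1 <- PCA$x[, 1]

# method 2: recompute the first PC score from the loadings and the scaled data
USArrests.scaled <- as.data.frame(scale(USArrests))
USArrests.new$PC1.manual <- drop(as.matrix(USArrests.scaled) %*% PCA$rotation[, 1])

all.equal(USArrests.new$PC1, USArrests.new$PC1.manual)  # TRUE (up to floating-point error)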
how does cluster analysis work?
it works by partitioning the observations in a dataset into a set of distinct homogeneous clusters, with the goal of revealing hidden patterns in the data.
observations within each cluster share similar feature values, while observations in different clusters are quite different from one another
how could clustering help with supervised learning?
the group assignments created by the clustering form a factor variable which may serve as a useful feature for supervised learning
what is the goal of k-means clustering?
assign each observation in a dataset into one and only one of k clusters.
k is prespecified
how are the k clusters chosen?
such that the variation of the observations inside each cluster is relatively small while the variation between clusters is large.
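in symbols (the standard formulation, with within-cluster variation measured by squared Euclidean distance):

$$ \min_{C_1,\dots,C_K}\ \sum_{k=1}^{K} W(C_k), \qquad W(C_k) = \frac{1}{|C_k|}\sum_{i,i' \in C_k}\sum_{j=1}^{p}\left(x_{ij}-x_{i'j}\right)^{2} $$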
explain the algorithm for k-means clustering.
initialization:
1. randomly select k points in the feature space; these serve as the initial cluster centres
iteration:
step 1: assign each observation to the cluster with the closest centre in terms of euc distance
step 2: recalculate the centre of the k clusters
step 3: repeat step 1 and 2 until the cluster assignments no longer change
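to make the iteration concrete, here is a bare-bones sketch of the algorithm written from scratch in R (for illustration only; in practice we use kmeans(), and the function name simple_kmeans and its arguments are made up):

simple_kmeans <- function(X, k, max.iter = 100) {
  X <- as.matrix(X)
  # initialization: randomly pick k observations as the initial cluster centres
  centres <- X[sample(nrow(X), k), , drop = FALSE]
  assignment <- rep(0, nrow(X))
  for (iter in 1:max.iter) {
    # step 1: assign each observation to the cluster with the closest centre (squared Euclidean distance)
    dists <- sapply(1:k, function(j) colSums((t(X) - centres[j, ])^2))
    new.assignment <- max.col(-dists)
    # step 3: stop once the cluster assignments no longer change
    if (identical(new.assignment, assignment)) break
    assignment <- new.assignment
    # step 2: recalculate each cluster centre as the mean of its observations
    # (no handling of empty clusters here; this is a sketch, not a robust implementation)
    for (j in 1:k) centres[j, ] <- colMeans(X[assignment == j, , drop = FALSE])
  }
  list(cluster = assignment, centers = centres)
}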
T/F: in each iteration of k-means clustering, the within-cluster variation is automatically reduced
true.
at the completion of the algorithm we are guaranteed to arrive at a local optimum. why not necessarily global?
because the algorithm relies on the initial assignments, which are made randomly.
a different set of initial assignments may end up with a different final set of clusters and a different local optimum
what happens if one of the initial cluster centres is too close to an outlier
only that outlier will get assigned to that centre and it will form its own cluster (separated from the rest of the data)
how could we increase the chance of identifying a global optimum for k-means clustering?
run the clustering algorithm 20-50 times with different initial cluster assignments and keep the solution with the smallest total within-cluster variation
how do we choose k in k-means clustering?
the elbow method.
we can make a plot of the ratio of the between-cluster variation to the total variation in the data against the value of k.
when this ratio has plateaued, we have reached the elbow, and the corresponding value of k provides an appropriate number of clusters to segment the data
what are the two main differences between hierarchical clustering and k-means?
1. hierarchical clustering does not require the number of clusters to be chosen in advance
2. the cluster groupings can be displayed visually (dendrogram)
does hierarchical clustering start from bottom or top?
bottom
explain the hierarchical process
starts with the individual observations, each treated as a separate cluster and successively fuses the closest pair of clusters, one pair at a time.
this process goes on iteratively until all clusters are eventually fused into a single cluster containing all observations
what are the 4 linkages used in hierarchical clustering? describe them
complete: the maximum pairwise distance between observations in one cluster and observations in the other cluster
single: the minimum pairwise distance
average: the average of all pairwise distances
centroid: the distance between the centroids (mean vectors) of the two clusters
in practice, which linkages are used most often and why?
average and complete, because they tend to result in more balanced clusters (clusters with similar numbers of observations) and more visually appealing dendrograms
what type of dendrogram does single linkage lead to?
skewed: extended, trailing clusters in which single observations are fused one at a time
when is linkage used in hierarchical clustering?
it is not needed when fusing two single-observation clusters; there the inter-cluster distance is simply the Euclidean distance between the two observations
it is needed to define the distance between two clusters when at least one of them contains more than one observation
at each step, the pair of clusters with the smallest inter-cluster distance (as measured by the linkage) is fused
do you really understand hierarchical clustering?
if not go to page 522 example 6.2.1
how can we tell which clusters are most similar when looking at the dendrogram?
clusters that are most similar = fused at the bottom
is randomization needed for k-means/hierarchical?
k: yes (for initial cluster centers)
h: no
is the number of clusters pre-specified in k/h?
k: yes (k needs to be specified)
h: no (can cut the dendrogram at any height)
are the clusters nested in k/h?
k: no
h: yes (it is a hierarchy of clusters)
does scaling the variables matter in k/h? what happens with/without scaling?
yes it matters for both, because the euclidean distance calculations depend very much on the scale on which the feature values are measured.
without scaling:
- if the variables are not of the same unit, one may have a larger order of magnitude and that variable will dominate the distance calculations and exert a disproportionate impact on the cluster arrangements
with scaling:
- we attach equal importance to each feature when performing distance calculations (which is more desirable for most applications)
how do you run a k-means clustering? do we need to set the seed?
using the kmeans() function.
need to set.seed() because of the random initial assignment step (for reproducibility)
takes a data matrix X
centers = # of clusters
nstart = how many random selections of initial cluster assignments (usually 20-50)
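a minimal usage sketch (assuming the clustering variables have already been scaled into a numeric matrix X.scaled; the object names are illustrative):

set.seed(42)                       # reproducibility of the random initializations
km <- kmeans(X.scaled, centers = 3, nstart = 25)
km$cluster                         # cluster assignment of each observation
km$centers                         # the k cluster centres
km$betweenss / km$totss            # proportion of total variation explained by the clustering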
how to choose the number of clusters?
there is a long code chunk for this, but the idea is: take the original variables from the dataset, scale them, and run k-means clustering for k = 1 to 10 clusters.
then compute the ratio of the between-cluster SS to the total SS (bss/tss) for each k and store the ratios in a data frame.
then create an elbow plot using ggplot2 (a fuller sketch follows the snippet below).
elbow plot:
k = 1:10
ggplot(ratio.df, aes(x = k, y = bss_tss)) + geom_point() + geom_line() + labs(title = "elbow plot")  # ratio.df = data frame of the bss/tss ratios
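a fuller sketch of the whole procedure (assuming the clustering variables sit in a numeric data frame called cluster.vars; the object names are illustrative):

library(ggplot2)

X.scaled <- scale(cluster.vars)     # scale the clustering variables

set.seed(42)
bss_tss <- sapply(1:10, function(k) {
  km <- kmeans(X.scaled, centers = k, nstart = 25)
  km$betweenss / km$totss           # proportion of total variation explained by k clusters
})

ratio.df <- data.frame(k = 1:10, bss_tss = bss_tss)

ggplot(ratio.df, aes(x = k, y = bss_tss)) +
  geom_point() +
  geom_line() +
  labs(title = "elbow plot", y = "between-cluster SS / total SS")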
how could we visualize the results of a k-means clustering?
- extract the numeric vector of the group assignments and convert it to a factor
dataset$group <- as.factor(km$cluster)  # km = the fitted kmeans object
- create a scatterplot of the first two PC scores, coloured by cluster:
ggplot(dataset, aes(x = PC1, y = PC2, col = group, label = row.names(dataset))) + geom_point() + geom_text(vjust = 1)
how do you implement a hierarchical clustering in R?
hclust() function
it doesn't take a data frame; it takes the pairwise Euclidean distances between the n observations (a distance object) as its input
- this matrix can be found by using the dist() function
ex.
hclust(dist(cluster.variables), method = "complete")
how to plot a dendrogram in r?
plot( name of your hclust object, cex = 0.5)
how to cut the dendrogram in r?
using the cutree() function
ex. cutree( hclust object, # of clusters we want)
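putting the pieces together (a sketch assuming the clustering variables sit in a numeric data frame called cluster.vars; the object and variable names are illustrative):

hc <- hclust(dist(scale(cluster.vars)), method = "complete")  # complete linkage on scaled variables
plot(hc, cex = 0.5)                                           # dendrogram
groups <- cutree(hc, 4)                                       # cut the dendrogram into 4 clusters
table(groups)                                                 # cluster sizes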