Chapter 6: PCA & Cluster Analysis Flashcards
Why are PCA and cluster analyses good for data exploration?
they work well for high-dimensional datasets (a large number of variables relative to the number of observations).
to make sense of these datasets, it is necessary to consider a large group of variables at once, as opposed to pairs of variables via bivariate data exploration (correlation matrices and scatterplots are ineffective in this setting).
how can unsupervised learning help supervised learning?
it has the potential to generate useful features as by-products of our data exploration process.
what is the definition of PCA?
PCA is an advanced data analytic technique that transforms a high-dimensional dataset into a smaller, much more manageable set of representative variables that capture most of the information in the original dataset.
what are PCs in PCA?
they are composite variables, each a linear combination of the existing variables.
- they are mutually uncorrelated and collectively simplify the dataset, reducing its dimension and making it more amenable for data exploration and visualization
T/F: typically, the observations of the features for PCA have been centered to have a zero mean.
TRUE
what are loadings in PCA?
they are the coefficients of the mth PC corresponding to the p features
notation: i = 1, …, n indexes the observations and j = 1, …, p indexes the features
T/F: the PCs are a linear combination of the n observations, so the sum of the PCs is taken over i, not j.
FALSE: the PCs are constructed from the features, so the sum is taken over j
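for reference, the standard notation (a minimal sketch in LaTeX; Z_m is the mth PC, phi_jm its loadings, z_im its scores):
Z_m = \phi_{1m} X_1 + \phi_{2m} X_2 + \cdots + \phi_{pm} X_p
z_{im} = \sum_{j=1}^{p} \phi_{jm} x_{ij}, \quad i = 1, \dots, n
\text{subject to } \sum_{j=1}^{p} \phi_{jm}^2 = 1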
how many loadings does the mth PC have?
p, one for each feature
how would we find the loadings for the first PC, Z1? any constraints?
we find the p loadings such that they maximize the sample variance of Z1
constraints:
- the sum of squares of the p loadings must equal 1 (without this normalization, the variance could be made arbitrarily large)
- (orthogonality only kicks in from the second PC onward: each subsequent loading vector must be orthogonal to, i.e. uncorrelated with, those of the previous PCs)
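a quick R sketch (using the built-in USArrests data as an illustrative example, not from the original cards) to check the normalization constraint and see that PC1 has the largest sample variance:
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)
phi1 <- pca$rotation[, 1]   # loadings of the first PC
sum(phi1^2)                 # normalization constraint: equals 1
apply(pca$x, 2, var)        # sample variances of the PC scores; PC1 is the largest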
geometrically, how do the p loadings of Z1 look with respect to the data?
the loading vector of Z1 defines a direction (a line) in the p-dimensional feature space along which the data vary the most
given the first PC, how are the subsequent PCs defined?
defined the same way as the first PC, but with the added constraint that each must be uncorrelated with (orthogonal to) all of the preceding PCs
how does PCA reduce the dimension of a dataset?
it takes the p original variables and outputs M < p PCs that together retain most of the information, as measured by variance.
with the dimension reduction, the dataset becomes much easier to explore and visualize.
how does PCA generate features?
this is the most important application of PCA.
once we have settled on the number of PCs to use, the original variables are replaced by the PCs, which capture most of the information in the dataset and serve as predictors for the target variable.
these predictors are mutually uncorrelated, so collinearity is no longer an issue.
by reducing the dimension of the data and the complexity of the model, we hope to optimize the bias-variance trade-off and improve the prediction accuracy of the model.
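a hedged sketch of what feature generation might look like in R (predictors, target, and M = 2 are hypothetical stand-ins, not from the original cards):
pca <- prcomp(predictors, center = TRUE, scale. = TRUE)  # predictors: numeric dataframe (hypothetical)
M <- 2                                                   # chosen number of PCs (hypothetical)
df <- data.frame(target = target, pca$x[, 1:M])          # replace original variables with PC scores
model <- lm(target ~ ., data = df)                       # mutually uncorrelated predictors, no collinearity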
How do we choose M, the number of PCs to use?
we assess the proportion of variance explained (PVE) by each PC relative to the total variance present in the data
how to find the total variance of the dataset in PCA?
the sum of the sample variances of the p variables
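in R, a minimal sketch of the PVE computation from a prcomp fit (the variance of the mth PC is pca$sdev[m]^2):
pca <- prcomp(dataset, center = TRUE, scale. = TRUE)  # dataset: numeric dataframe (assumed)
pve <- pca$sdev^2 / sum(pca$sdev^2)                   # proportion of variance explained by each PC
cumsum(pve)                                           # cumulative PVE; choose M where this is high enough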
why is it important to have scaled variables for PCA?
because PVE comparisons are only meaningful when the variables are on comparable scales; after standardization, each variable contributes one unit of variance, so the total variance is simply p.
T/F: PVEs are monotonically increasing
false
by the definition of PCs, the PVEs are monotonically decreasing in m.
this is because subsequent PCs have more and more orthogonality constraints to comply with, and therefore less flexibility in the choice of PC loadings.
so the first PC explains the greatest amount of variance, and the PVE decreases from there.
what graphical tool could you use to find the number of PCs to use? how can you justify your choice
a scree plot!
look for the elbow in the plot.
justification: the PVE of the next PC is low enough that it can be dropped without losing much information
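a minimal scree plot sketch in R, assuming pve was computed as in the PVE card above:
plot(pve, type = "b", xlab = "Principal Component", ylab = "PVE")  # look for the elbow
screeplot(pca, type = "lines")                                     # base R alternative on the prcomp fit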
Is mean-centering the variables in PCA a great concern? why or why not?
not really; it does not affect the PC loadings, since they are defined to maximize the sample variance of the PC scores.
- shifting a variable by a constant (adding or subtracting the same value from every observation) leaves its variance unchanged.
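a two-line R check of this fact (x is an arbitrary made-up vector):
x <- c(3, 1, 4, 1, 5)
all.equal(var(x), var(x + 100))  # TRUE: shifting by a constant leaves the variance unchanged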
what is the difference between PC loadings, PCs and PC scores?
the PC loadings are the coefficients attached to each of the p features; the PCs are the composite variables built from those loadings; and the PC scores are the values of the PCs evaluated at each of the n observations.
why should we scale variables in PCA?
if we conduct PCA using the variables on their original scale, the PC loadings are determined based on the sample COVariance matrix of the variables
if we conduct PCA using the standardized variables, the PC loadings are determined based on the sample CORRelation matrix.
if no scaling is done and the variables are on different orders of magnitude, the variables with unusually large variances will receive large PC loadings and dominate the corresponding PCs, even though there is no guarantee that those variables explain much of the underlying pattern in the data.
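a small R comparison on the built-in USArrests data (where Assault has a far larger variance than the other variables), illustrating how the unscaled analysis is dominated by one variable:
apply(USArrests, 2, var)                      # Assault's variance dwarfs the others
unscaled <- prcomp(USArrests, scale. = FALSE)
scaled <- prcomp(USArrests, scale. = TRUE)
unscaled$rotation[, 1]                        # PC1 loadings: dominated by Assault
scaled$rotation[, 1]                          # PC1 loadings: much more balanced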
can PCA be applied to categorical predictors?
not directly; PCA operates on numeric variables, so categorical predictors would first have to be binarized into dummy variables.
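if categorical predictors must be fed to PCA anyway, one common workaround (a sketch, not from the original cards) is to binarize them first, e.g. with model.matrix():
X <- model.matrix(~ . - 1, data = df)           # df: dataframe with factor columns (hypothetical)
pca <- prcomp(X, center = TRUE, scale. = TRUE)  # watch for zero-variance dummy columns when scaling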
what are 2 drawbacks of PCA?
Interpretability: it is not easy to make sense of the PCs in terms of their effect on the target variable.
Linearity: PCs are linear combinations of the original features, so PCA does not capture non-linear relationships in the data well.
how can we see the std dev of the variables in a dataframe in R?
use the apply() function
apply(dataset, MARGIN = 2, FUN = sd)
MARGIN = 2 applies the function to the columns of the dataframe; MARGIN = 1 applies it to the rows
what function to use when running a PCA in R?
prcomp()
pca <- prcomp(dataset, center = TRUE, scale. = TRUE)
- center = TRUE and scale. = TRUE center and scale the variables
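useful components of the fitted prcomp object (these are standard parts of prcomp output):
summary(pca)   # std dev, PVE, and cumulative PVE of each PC
pca$rotation   # the PC loadings (one column per PC)
head(pca$x)    # the PC scores (one row per observation)
biplot(pca)    # scores and loadings on the first two PCs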