Principal Component Analysis Flashcards
What is PCA?
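An unsupervised method that summarizes a data set with a small number of new variables, the principal components, which capture most of its variance. It is often used for visualizing data.
Principal components are linear combinations of the original variables.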
In PCA, we subtract the mean from each xij so that the mean of each variable is 0. Does this affect the variance of each Xj?
No.
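As a quick check of this card, the sketch below centers one variable and compares its variance before and after. The data values are made up for illustration, and only the Python standard library is used.

```python
from statistics import mean, pvariance

# Made-up values for a single variable (illustrative only).
x = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0]

# Centering: subtract the variable's mean from every observation.
x_bar = mean(x)
centered = [v - x_bar for v in x]

mean_after = mean(centered)      # approximately 0
var_before = pvariance(x)        # variance of the raw variable
var_after = pvariance(centered)  # same value up to rounding error
```

Shifting every value by the same constant moves the data without changing its spread, which is why the two variances agree.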
For the first principal component (Z1), are the loadings selected to maximize or minimize the variance of Z1?
Maximize.
How are the loadings for the first principal component selected?
Maximize the variance of Z1 subject to the constraint that the sum of the squares of the loadings is 1.
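The constrained maximization can be illustrated for the special case of two variables, where the direction of largest variance has a closed-form angle. This is a minimal sketch under that assumption, not a general PCA routine; the data and the helper names (`center`, `first_loading_2d`, `variance_along`) are hypothetical.

```python
import math

def center(col):
    """Subtract the column mean so the variable has mean 0."""
    m = sum(col) / len(col)
    return [v - m for v in col]

def first_loading_2d(x1, x2):
    """Unit-norm loading vector (phi11, phi21) for two centered variables."""
    n = len(x1)
    s11 = sum(a * a for a in x1) / n
    s22 = sum(b * b for b in x2) / n
    s12 = sum(a * b for a, b in zip(x1, x2)) / n
    # Angle of the direction of largest variance for a 2x2 covariance matrix.
    theta = 0.5 * math.atan2(2.0 * s12, s11 - s22)
    return (math.cos(theta), math.sin(theta))

def variance_along(x1, x2, phi):
    """Variance of the scores z_i = phi[0]*x_i1 + phi[1]*x_i2 (centered data)."""
    z = [phi[0] * a + phi[1] * b for a, b in zip(x1, x2)]
    return sum(v * v for v in z) / len(z)

# Made-up data for illustration.
x1 = center([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = center([2.1, 3.9, 6.2, 7.8, 10.1])

phi = first_loading_2d(x1, x2)
norm_sq = phi[0] ** 2 + phi[1] ** 2   # the constraint: sum of squares is 1
var_z1 = variance_along(x1, x2, phi)

# Variance along other unit-norm directions, one per degree, for comparison.
other = [
    variance_along(x1, x2, (math.cos(math.radians(d)), math.sin(math.radians(d))))
    for d in range(180)
]
```

None of the other unit-norm directions gives a larger variance than `var_z1`. Without the unit-norm constraint the variance could be made arbitrarily large just by inflating the loadings.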
What is the score of an observation?
- Project the observation perpendicularly onto the principal component line.
- The score of the observation is the signed distance from that projection point to the origin (0,0) of the centered data.
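The projection can be worked through numerically. The sketch below assumes a hypothetical unit-norm loading vector and one hypothetical, already centered observation on two variables.

```python
import math

# Hypothetical unit-norm loading vector (0.6^2 + 0.8^2 = 1) and one
# hypothetical observation that has already been centered.
phi = (0.6, 0.8)
obs = (2.0, 1.0)

# Score: signed length of the perpendicular projection onto the line.
score = phi[0] * obs[0] + phi[1] * obs[1]

# Coordinates of the projection point on the principal component line.
proj = (score * phi[0], score * phi[1])

# The residual is perpendicular to the line, so its dot product with phi is 0.
resid = (obs[0] - proj[0], obs[1] - proj[1])
perp_check = resid[0] * phi[0] + resid[1] * phi[1]

# Distance from the projection point to the origin equals |score|.
dist = math.hypot(proj[0], proj[1])
```

The score is negative when the projection falls on the opposite side of the origin from the direction of the loading vector.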
How are all other principal components defined?
They are defined sequentially: each one maximizes the variance of the component subject to the constraints that the sum of the squares of its loadings is 1 and that it is uncorrelated with all previous components.
(Equivalently, each loading vector is orthogonal to the loading vectors of all previous principal components.)
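With only two variables, the second loading vector is simply the first one rotated by 90 degrees, which makes both properties easy to check. A minimal sketch under that two-variable assumption, with made-up data and hypothetical helper names:

```python
import math

def center(col):
    """Subtract the column mean so the variable has mean 0."""
    m = sum(col) / len(col)
    return [v - m for v in col]

def cov(u, v):
    """Covariance of two mean-zero lists (divisor n)."""
    return sum(a * b for a, b in zip(u, v)) / len(u)

# Made-up data for illustration.
x1 = center([2.5, 0.5, 2.2, 1.9, 3.1, 2.3])
x2 = center([2.4, 0.7, 2.9, 2.2, 3.0, 2.7])

# First loading vector from the closed-form angle for two variables.
theta = 0.5 * math.atan2(2.0 * cov(x1, x2), cov(x1, x1) - cov(x2, x2))
phi1 = (math.cos(theta), math.sin(theta))
phi2 = (-math.sin(theta), math.cos(theta))   # phi1 rotated by 90 degrees

# Scores on each component.
z1 = [phi1[0] * a + phi1[1] * b for a, b in zip(x1, x2)]
z2 = [phi2[0] * a + phi2[1] * b for a, b in zip(x1, x2)]

loading_dot = phi1[0] * phi2[0] + phi1[1] * phi2[1]  # orthogonal loadings: 0
score_cov = cov(z1, z2)                              # uncorrelated scores: about 0
```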
What is a biplot and what is it used for?
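A biplot displays two things at once: the principal component scores of the observations and the loading vectors of the variables. The scores are read against the axes on the bottom and the left, and the loadings against the axes on the top and the right.
It can be used to visualize the first two principal components.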
True or false: principal components are the best linear approximation of the observations.
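True. The first M principal component score vectors and loading vectors give the best M-dimensional linear approximation to the observations, in terms of squared Euclidean distance.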
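One way to see this for M = 1 is a brute-force search: try many lines through the origin and measure how far the observations are from each. The sketch below does that on made-up, already centered data; the one-degree grid and the helper names are assumptions of the example.

```python
import math

# Made-up observations on two variables; each column already has mean 0.
X = [(2.0, 1.0), (1.0, 2.0), (-2.0, -1.0), (-1.0, -2.0)]

def unit(deg):
    """Unit-norm direction at the given angle in degrees."""
    r = math.radians(deg)
    return (math.cos(r), math.sin(r))

def reconstruction_error(X, u):
    """Sum of squared distances between each point and its projection on u."""
    total = 0.0
    for a, b in X:
        z = u[0] * a + u[1] * b   # score along direction u
        total += (a - z * u[0]) ** 2 + (b - z * u[1]) ** 2
    return total

def score_variance(X, u):
    """Variance of the scores along direction u (centered data)."""
    return sum((u[0] * a + u[1] * b) ** 2 for a, b in X) / len(X)

errors = {d: reconstruction_error(X, unit(d)) for d in range(180)}
variances = {d: score_variance(X, unit(d)) for d in range(180)}

best_fit_deg = min(errors, key=errors.get)        # closest line to the points
max_var_deg = max(variances, key=variances.get)   # direction of largest variance
```

On this data both searches pick the 45 degree line: the direction that maximizes the variance of the scores is also the one that minimizes the squared distance to the observations.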
Does the scale of variables matter in linear regression, principal components or both?
Principal components only. If a variable is multiplied by a constant greater than one, its variance increases and PCA puts a higher loading on that variable in order to maximize variance. In linear regression, rescaling a variable simply rescales its coefficient.
Why are variables usually scaled?
To avoid giving some variables spurious importance.
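The effect of scale can be seen by changing the units of one variable. The sketch below is restricted to two variables so that the first loading vector has a closed form; the data, the factor of 100, and the helper name `first_loading_2d` are all made up for illustration.

```python
import math
from statistics import mean, pstdev

def first_loading_2d(x1, x2):
    """First loading vector for two variables (centered internally)."""
    m1, m2 = mean(x1), mean(x2)
    a = [v - m1 for v in x1]
    b = [v - m2 for v in x2]
    n = len(a)
    s11 = sum(v * v for v in a) / n
    s22 = sum(v * v for v in b) / n
    s12 = sum(p * q for p, q in zip(a, b)) / n
    theta = 0.5 * math.atan2(2.0 * s12, s11 - s22)
    return (math.cos(theta), math.sin(theta))

# Made-up data for illustration.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.1, 3.9, 6.2, 7.8, 10.1]

raw = first_loading_2d(x1, x2)

# Same information, but x1 expressed in units 100 times smaller.
inflated = first_loading_2d([100.0 * v for v in x1], x2)

# Scale each variable to standard deviation 1 before PCA.
sd1, sd2 = pstdev(x1), pstdev(x2)
standardized = first_loading_2d([v / sd1 for v in x1], [v / sd2 for v in x2])
```

After the change of units, almost all of the first loading sits on `x1` even though the underlying relationship is unchanged. After standardizing, the two loadings have equal magnitude.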
How does the variance of one variable affect its approximation using PCA?
The higher the variance of the variable is, the better the approximation.
The vector of loadings is unique up to sign. What does this mean?
One may obtain an equivalent solution by flipping the sign on all of the loadings. Flipping the sign of the loading vector results in flipping the sign on all of the scores.
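A small numerical check of the sign flip, assuming a hypothetical unit-norm loading vector and made-up centered observations:

```python
# Hypothetical unit-norm loading vector and made-up centered observations
# (each column of X sums to 0).
phi = (0.6, 0.8)
X = [(2.0, 1.0), (-1.0, 0.5), (-1.0, -1.5)]

def scores(X, phi):
    """Scores z_i = phi[0]*x_i1 + phi[1]*x_i2."""
    return [phi[0] * a + phi[1] * b for a, b in X]

def variance(z):
    """Variance of mean-zero scores (divisor n)."""
    return sum(v * v for v in z) / len(z)

z = scores(X, phi)
z_flipped = scores(X, (-phi[0], -phi[1]))   # flip the sign of every loading
```

Every flipped score is the negative of the original score, and the variance of the scores is unchanged, so both solutions satisfy the same maximization.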
What are the loadings?
The loadings indicate the direction of the principal component - direction is not affected by reversing the sign.
Why can’t cross-validation be used to determine the number of principal components that should be used?
PCA is an unsupervised method: there is no response variable, so there is no prediction error for cross-validation to estimate.
What methods can be used to determine the number of principal components that should be used?
- Cumulative proportion of variance explained: compute the cumulative proportion of the variance explained by the first M principal components and select M so that a specified proportion of the variance is explained.
- Scree plot: plot the proportion of variance explained by each principal component, i.e. the points (m, PVEm). Look for the “elbow” point.
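Both methods start from the proportion of variance explained (PVE) by each component. The sketch below computes the PVE values and the cumulative PVE from a list of component variances; the variances, the 90% target, and the helper name `smallest_m` are hypothetical.

```python
from itertools import accumulate

# Hypothetical variances of four principal components, largest first.
component_variances = [2.48, 0.99, 0.36, 0.17]

total = sum(component_variances)

# PVE of each component: these are the heights (m, PVE_m) of a scree plot.
pve = [v / total for v in component_variances]

# Cumulative PVE of the first M components, for M = 1, 2, ...
cumulative = list(accumulate(pve))

def smallest_m(cumulative, target):
    """Smallest M whose cumulative PVE reaches the target proportion."""
    for m, c in enumerate(cumulative, start=1):
        if c >= target:
            return m
    return len(cumulative)

m_90 = smallest_m(cumulative, 0.90)
```

With these made-up variances, three components are needed to reach the 90% target. The scree plot criterion is more subjective: the elbow is judged by eye from the `pve` values.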