Chapter 10 - PCA Flashcards
PCA
Principal component analysis. Provides a way to visualize high-dimensional data by summarizing the most important information in a small number of dimensions.
The first principal component
The unit vector whose line passes closest to the cloud of samples: it minimizes the average squared perpendicular distance from the points to the line. Equivalently, the first principal component is the direction with the highest variance (it maximizes the variance of the projection onto that vector).
PCA Example
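Let X be a data matrix with n samples and p variables. 1) Center the data (subtract the mean of each variable). 2) Find the unit vector $\phi$ that maximizes $\frac{1}{n}\sum_{i=1}^{n}\left(\sum_{j=1}^{p}\phi_j x_{ij}\right)^2$ subject to $\sum_{j=1}^{p}\phi_j^2 = 1$, where $z_i = \sum_{j=1}^{p}\phi_j x_{ij}$ is the projection of the ith sample onto $\phi$, known as the score, and the full expression is the variance of the n samples projected onto $\phi$. The second principal component follows the same rules and must also be orthogonal to the first principal component (the same as saying the two sets of scores are uncorrelated).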
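The two steps above can be checked numerically. The following is a minimal sketch, assuming NumPy and a small synthetic data set; the names `Xc`, `projected_variance`, `phi1`, and `phi2` are illustrative and not part of the original card. It obtains the loading vectors from an SVD, then confirms that no random unit vector gives a larger projected variance than the first one and that the first two score vectors are uncorrelated.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed; the data are synthetic
n, p = 200, 3
mixing = np.array([[3.0, 0.0, 0.0],
                   [1.0, 1.0, 0.0],
                   [0.5, 0.2, 0.3]])
X = rng.normal(size=(n, p)) @ mixing

# 1) Center each variable (column) by subtracting its mean.
Xc = X - X.mean(axis=0)

def projected_variance(Xc, phi):
    """Variance (1/n) * sum_i z_i^2 of the scores z = Xc @ phi."""
    z = Xc @ phi
    return np.mean(z ** 2)

# 2) Loading vectors: the rows of Vt from the SVD of the centered data.
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
phi1, phi2 = Vt[0], Vt[1]

# Compare phi1 with many random unit vectors.
candidates = rng.normal(size=(1000, p))
candidates /= np.linalg.norm(candidates, axis=1, keepdims=True)
best_random = max(projected_variance(Xc, c) for c in candidates)

# Scores for the first two principal components.
z1, z2 = Xc @ phi1, Xc @ phi2

print("phi1 beats every random direction:",
      projected_variance(Xc, phi1) >= best_random)
print("phi1 orthogonal to phi2:", abs(phi1 @ phi2) < 1e-8)
print("scores uncorrelated:", abs(np.mean(z1 * z2)) < 1e-8)
```

The random search is only a sanity check on the definition, not a way to solve the optimization.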
How do we solve the PCA optimization?
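Linear algebra, in either of two equivalent ways. 1) The singular value decomposition of the centered X, $X = U\Sigma\Phi^T$, where the columns of $\Phi$ are the principal component loading vectors. 2) The eigenvalue decomposition of $X^TX$, whose eigenvectors are the same loading vectors.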
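A short sketch of why the two routes agree, assuming NumPy and synthetic data (the variable names are illustrative). The eigenvalues of $X^TX$ should equal the squared singular values of X, and the eigenvectors should match the right singular vectors up to sign.

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed; the data are synthetic
X = rng.normal(size=(50, 4)) * np.array([4.0, 2.0, 1.0, 0.5])
Xc = X - X.mean(axis=0)  # center before either decomposition

# Route 1: singular value decomposition, Xc = U diag(s) Phi^T.
U, s, PhiT = np.linalg.svd(Xc, full_matrices=False)

# Route 2: eigendecomposition of the symmetric matrix Xc^T Xc.
eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc)
order = np.argsort(eigvals)[::-1]  # eigh returns ascending order
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Eigenvalues equal the squared singular values.
same_values = np.allclose(eigvals, s ** 2)

# Directions agree up to sign: |cosine| between matching vectors is 1.
alignment = np.abs(np.sum(PhiT.T * eigvecs, axis=0))
same_directions = np.allclose(alignment, 1.0)

# The scores can be read off the SVD directly: U * s equals Xc @ Phi.
scores = U * s

print(same_values, same_directions)
```

The sign ambiguity is expected: if $\phi$ is a loading vector, so is $-\phi$, and the variance of the scores is unchanged.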
The Biplot (axes, trends)
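The x axis is the first principal component and the y axis is the second principal component. Samples are plotted at their scores, and variables are drawn as arrows given by their loadings. Arrows that are clustered together indicate highly positively correlated variables; arrows that point in directly opposite directions indicate negatively correlated variables.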
Scaling Variables
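In addition to centering, we usually multiply each variable by a constant (one over its standard deviation) so that its variance equals 1. This is called standardizing or scaling the variables. Without it, variables measured on larger scales dominate the principal components. In some cases we don’t want to standardize (e.g., if all the variables are measured in the same units, their relative magnitudes are themselves informative).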
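The effect of scaling can be seen on two correlated variables measured in very different units. This is a minimal sketch assuming NumPy; the data and the helper `first_loading` are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)  # fixed seed; the data are synthetic
n = 300
a = rng.normal(size=n)
b = 0.8 * a + 0.6 * rng.normal(size=n)  # correlated with a
X = np.column_stack([a, 1000.0 * b])    # second variable in much larger units

def first_loading(X):
    """First principal component loading vector of X (centers X itself)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[0]

# Standardize: center, then divide each column by its standard deviation.
Xs = (X - X.mean(axis=0)) / X.std(axis=0)

unscaled = first_loading(X)   # dominated by the large-unit variable
scaled = first_loading(Xs)    # both variables contribute equally

print("unscaled loadings:", np.abs(unscaled))
print("scaled loadings:  ", np.abs(scaled))
```

With exactly two standardized variables the first loading vector always has entries of equal magnitude, which makes the contrast with the unscaled case easy to see.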
How do we know how many principal components are enough?
By the proportion of variance explained. The ith score vector can be interpreted as a new variable, and the variance of this variable decreases as i goes from 1 to p. The total variance of the p score variables equals the total variance of the original variables; it is just redistributed among them. We can therefore quantify how much of the total variance is captured by the first m principal components/score variables and plot it (scree plot, cumulative scree plot).
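The quantities behind a scree plot can be computed directly from the singular values. This is a sketch assuming NumPy and synthetic data; the 90% cutoff used to pick `m` is an arbitrary illustrative threshold, not a rule from the original card.

```python
import numpy as np

rng = np.random.default_rng(3)  # fixed seed; the data are synthetic
X = rng.normal(size=(100, 5)) * np.array([5.0, 3.0, 1.0, 0.5, 0.2])
Xc = X - X.mean(axis=0)
n = Xc.shape[0]

# Variance of each score variable: squared singular value divided by n.
s = np.linalg.svd(Xc, compute_uv=False)
score_var = s ** 2 / n

# Total variance is preserved: sum over scores equals sum over variables.
total_var = Xc.var(axis=0).sum()

# Proportion of variance explained (scree) and its running total.
pve = score_var / score_var.sum()
cum_pve = np.cumsum(pve)

# Smallest m whose cumulative proportion reaches the assumed 90% cutoff.
m = int(np.searchsorted(cum_pve, 0.90) + 1)

print("PVE:", np.round(pve, 3))
print("cumulative PVE:", np.round(cum_pve, 3))
print("components needed for 90%:", m)
```

Plotting `pve` against the component index gives the scree plot, and plotting `cum_pve` gives the cumulative version.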