PCA Flashcards
What is the meaning of PCA?
Principal components analysis (PCA) is concerned with explaining the covariance structure of a set of variables through a few linear combinations of the original variables. PCA re-expresses large amounts of data so that a few components account for most of the information in the data.
What is the use of PCA?
It is a dimension reduction technique, and it is also used as a method for identifying associations among variables.
Explain the construction and structure of the new principal components
The aim of PCA is to describe the variation in a set of correlated variables xi in terms of a new set of uncorrelated principal components yi, where the number of yi's is substantially less than the number of xi's. Each yi is a linear combination of the xi variables.
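Written out as a formula, the construction might look like the sketch below. The symbols a_{jk} for the coefficients, p for the number of original variables and m for the number of components kept are notation assumed here for illustration; they do not appear on the cards.

```latex
y_j = a_{j1} x_1 + a_{j2} x_2 + \dots + a_{jp} x_p,
\qquad j = 1, \dots, m, \quad m < p,
\qquad \operatorname{Cov}(y_j, y_k) = 0 \ \text{for } j \neq k
```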
What is meant by the principal components being in decreasing order of importance?
The first principal component y1 accounts for the most variation in the original data out of all linear combinations of the xi's. The second, y2, accounts for the most of the remaining variation while being uncorrelated with y1, and so on. Usually we would aim to explain 80% to 90% of the variation in the data using the principal components, and PC1 will explain a large part of that.
How do you find the eigenvalues and eigenvectors of a matrix?
Solve det(A - lambda I) = 0 for the eigenvalues lambda
Solve (A - lambda I)v = 0 for the eigenvector v belonging to each eigenvalue
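As a worked illustration of the two steps above, here is a minimal Python sketch for the 2x2 symmetric case, where det(A - lambda I) = 0 is just a quadratic. The function name eig_sym_2x2 and the example matrix are made up for illustration; larger matrices would normally be handled by a numerical library.

```python
import math

def eig_sym_2x2(a, b, d):
    """Eigen-decomposition of the symmetric matrix [[a, b], [b, d]].

    Returns [(eigenvalue, unit eigenvector), ...], largest eigenvalue first.
    """
    if b == 0:
        # Already diagonal: eigenvalues are a and d, eigenvectors are the axes
        pairs = [(a, (1.0, 0.0)), (d, (0.0, 1.0))]
        return sorted(pairs, key=lambda p: p[0], reverse=True)

    # Step 1: det(A - lambda*I) = lambda^2 - (a + d)*lambda + (a*d - b*b) = 0
    # The discriminant simplifies to (a - d)^2 + 4*b^2, which is never negative.
    root = math.hypot(a - d, 2.0 * b)
    pairs = []
    for lam in ((a + d + root) / 2.0, (a + d - root) / 2.0):
        # Step 2: (A - lambda*I) v = 0. The first row reads
        # (a - lam)*v1 + b*v2 = 0, which v = (b, lam - a) satisfies.
        v1, v2 = b, lam - a
        norm = math.hypot(v1, v2)
        pairs.append((lam, (v1 / norm, v2 / norm)))
    return pairs

# Hypothetical example matrix A = [[2, 1], [1, 2]]
pairs = eig_sym_2x2(2.0, 1.0, 2.0)
for lam, v in pairs:
    print(lam, v)   # eigenvalues are 3.0 and 1.0
```

The eigenvectors are scaled to unit length, which is the usual convention when they are used as PC coefficients.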
What is the meaning of the eigenvalues and eigenvectors of the covariance matrix S of a data set in terms of the PCs?
Eigenvalue j quantifies how much of the variance is accounted for by PCj. The eigenvalue is the variance of that PC.
Eigenvector j gives the coefficients of the linear combination of the xi's which forms PCj
How much of the total variance does PC1 explain?
lambda1 / (sum of all lambdas) = proportion of total variation explained by PC1 (multiply by 100 for a percentage)
Define the first principal component
The first PC of a data set is the linear combination of the variables, with a unit-length coefficient vector, which has the greatest variance
What is the total variance of the data?
Sum of the lambda i's, which equals the sum of the variances of the original variables
What is a key property of the set of PCs?
They are uncorrelated with each other
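The facts on the last few cards can be checked numerically. The sketch below uses a small made-up data set of two variables (x1 and x2 are hypothetical values, and the helper names are illustrative) and the closed-form eigenvalues of a 2x2 matrix. It suggests that each eigenvalue of S equals the variance of its PC, that the eigenvalues add up to the total variance, and that the two PCs are uncorrelated.

```python
import math
from statistics import mean, variance

# Hypothetical data: two correlated variables measured on five cases
x1 = [2.0, 4.0, 6.0, 8.0, 10.0]
x2 = [1.0, 3.0, 2.0, 5.0, 4.0]

def cov(u, v):
    """Sample covariance with divisor n - 1."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (len(u) - 1)

# Covariance matrix S = [[s11, s12], [s12, s22]]
s11, s12, s22 = variance(x1), cov(x1, x2), variance(x2)

# Eigenvalues of S from det(S - lambda*I) = 0 (closed form for a 2x2 matrix)
root = math.hypot(s11 - s22, 2.0 * s12)
lam1 = (s11 + s22 + root) / 2.0
lam2 = (s11 + s22 - root) / 2.0

def unit_eigenvector(lam):
    """Unit vector solving (S - lambda*I) v = 0, assuming s12 != 0."""
    v1, v2 = s12, lam - s11
    norm = math.hypot(v1, v2)
    return v1 / norm, v2 / norm

def scores(vec):
    """PC scores: the linear combination of centred x1 and x2 given by vec."""
    m1, m2 = mean(x1), mean(x2)
    return [vec[0] * (a - m1) + vec[1] * (b - m2) for a, b in zip(x1, x2)]

pc1 = scores(unit_eigenvector(lam1))
pc2 = scores(unit_eigenvector(lam2))

print(variance(pc1), lam1)     # variance of PC1 agrees with eigenvalue 1
print(variance(pc2), lam2)     # variance of PC2 agrees with eigenvalue 2
print(lam1 + lam2, s11 + s22)  # eigenvalues sum to the total variance
print(cov(pc1, pc2))           # approximately 0: the PCs are uncorrelated
print(lam1 / (lam1 + lam2))    # proportion of total variance explained by PC1
```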
In words: what is the proportion of variation explained by each PC, and how do we use this to decide how many PCs to use to describe the data?
Each eigenvalue divided by the sum of all eigenvalues gives the proportion of the variation explained by the associated principal component. Adding these proportions up, starting from PC1, gives the cumulative proportion of variation, which helps to decide how many PCs to use.
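A short Python sketch of that calculation follows. The eigenvalues are invented for illustration, and the 0.80 threshold is an assumption taken from the lower end of the 80% to 90% rule of thumb mentioned earlier.

```python
from itertools import accumulate

# Hypothetical eigenvalues of a covariance matrix, largest first
eigenvalues = [4.2, 2.1, 0.9, 0.5, 0.3]

total = sum(eigenvalues)                            # total variance
proportions = [lam / total for lam in eigenvalues]  # share explained by each PC
cumulative = list(accumulate(proportions))          # running total of the shares

target = 0.80  # assumed threshold for "enough" variation explained

# Smallest number of PCs whose cumulative proportion reaches the target;
# falls back to keeping every PC if rounding stops the target being reached
k = next((i + 1 for i, c in enumerate(cumulative) if c >= target),
         len(eigenvalues))

for i, (p, c) in enumerate(zip(proportions, cumulative), start=1):
    print(f"PC{i}: proportion {p:.3f}, cumulative {c:.3f}")
print("PCs to keep:", k)
```

With these made-up eigenvalues the first two PCs explain just under 80% of the variation, so three PCs are kept.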
What is a disadvantage of PCA?
Interpretation of the new PCs can be difficult
It gives large weight to variables that have a large range of values
In the coefficients of the linear combinations of the variables that construct the PCs, what matters in terms of signs, comparison and magnitude?
The overall signs of the coefficients are arbitrary (flipping every sign in a PC gives an equivalent component), but it matters if a coefficient has the opposite sign to another element of the same PC. The magnitudes also matter.
Why might standardisation be needed in PCA?
To prevent variables with bigger variances, perhaps only because they are measured in smaller units, from being weighted more heavily than other, more important variables
What does standardisation mean and how does one do it?
Standardisation means ensuring the data are expressed in comparable units. We divide each variable by the sample standard deviation for that variable, which forces all variances to be 1. Hence we are now working with the correlation matrix instead of the covariance matrix.
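The effect of standardising can be seen in a small Python sketch. The data values and variable names are hypothetical. Subtracting the mean is an extra, customary step that the card does not mention; it does not change any variance or covariance.

```python
from statistics import mean, stdev

# Hypothetical data: two variables recorded on very different scales
height_mm = [1520.0, 1600.0, 1680.0, 1750.0, 1830.0]
weight_kg = [50.0, 62.0, 58.0, 75.0, 80.0]

def cov(u, v):
    """Sample covariance with divisor n - 1 (cov(u, u) is the variance)."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (len(u) - 1)

def standardise(x):
    """Divide by the sample standard deviation, after centring (assumed step)."""
    m, s = mean(x), stdev(x)
    return [(xi - m) / s for xi in x]

z1, z2 = standardise(height_mm), standardise(weight_kg)

# Correlation of the original variables
r = cov(height_mm, weight_kg) / (stdev(height_mm) * stdev(weight_kg))

print(cov(height_mm, height_mm), cov(weight_kg, weight_kg))  # very unequal variances
print(cov(z1, z1), cov(z2, z2))  # both approximately 1 after standardising
print(cov(z1, z2), r)            # covariance of standardised data equals r
```

Because the covariance matrix of the standardised variables is the correlation matrix of the originals, running PCA on z1 and z2 amounts to running it on the correlation matrix.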