Data Reduction: Principal Component Analysis (PCA) Flashcards
what is data/dimension reduction?
a statistical procedure used to reduce a large set of variables to a smaller set
to reduce sets of variables we typically use:
- PCA
- FA
to reduce sets of observations we typically use:
- k-means clustering
- latent class analysis
Uses of data reduction techniques
Theory testing
= what are the number and nature of dimensions that best describe a theoretical construct?
e.g. data based on personality → reduced to form the five-factor model of personality
Uses of data reduction techniques
Test construction
= how should I group my items into subscales?
= which items best measure my constructs?
e.g. anywhere we construct a test
Uses of data reduction techniques
Pragmatic
= I have too many variables / multicollinearity issues, how can I defensibly combine my variables?
e.g. taking the mean = reduces many numbers to one number
e.g. genetics → hundreds of variables can be tested, so we might want to group them
this is where PCA comes in - if all of our variables correlate, we can ask: ‘can we explain the variance in these items using one variable?’
Purpose of PCA
the goal of PCA is to explain as much of the total variance in the dataset as possible
NOTE: PCA is exploratory, and each component is a weighted composite of the original variables
Procedure of PCA
1) starts with the original data
2) calculates covariances (correlations) between the variables
3) applies eigen decomposition to calculate a set of linear composites of the original variables
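The three steps above can be sketched in Python with NumPy (the document's own examples use R, but the algebra is identical; the data here is made up for illustration):

```python
import numpy as np

# Hypothetical data: 100 observations of 4 correlated variables.
rng = np.random.default_rng(0)
shared = rng.normal(size=(100, 1))
data = shared + 0.5 * rng.normal(size=(100, 4))  # columns share variance

# 1) original data -> 2) correlation matrix
R = np.corrcoef(data, rowvar=False)              # 4 x 4

# 3) eigendecomposition gives the linear composites (components)
eigvals, eigvecs = np.linalg.eigh(R)             # eigh returns ascending order
order = np.argsort(eigvals)[::-1]                # largest component first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Component scores are weighted composites of the standardised variables.
z = (data - data.mean(axis=0)) / data.std(axis=0)
scores = z @ eigvecs

print(eigvals)        # variance packaged into each component
print(eigvals.sum())  # ≈ 4.0: the total variance equals the number of variables
```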
PCA in R
1) starts with a correlation matrix
2) we can convert this to a PCA output which represents the degree to which each item contributes to a composite
e.g. R fits as much variance as possible into the first component; once it can’t fit any more, it moves on to a second component
What does PCA do?
it repackages the variance from the correlation matrix into a set of components
What are components?
they are orthogonal (and therefore uncorrelated) linear combinations of the original variables
Each component accounts for as much of the remaining variance as possible (the 1st accounts for the most, the 2nd for the next most, and so on)
If variables are closely related (large correlations) we can represent them with fewer components
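A quick numerical illustration of the last point, in Python with made-up data: when variables correlate strongly, the first component captures almost all the variance; when they are independent, no reduction is possible.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000

# Five variables that all share one underlying signal (large correlations).
signal = rng.normal(size=(n, 1))
related = signal + 0.2 * rng.normal(size=(n, 5))

# Five independent variables (near-zero correlations).
unrelated = rng.normal(size=(n, 5))

def first_component_share(data):
    """Proportion of total variance captured by the first component."""
    eigvals = np.linalg.eigvalsh(np.corrcoef(data, rowvar=False))
    return eigvals.max() / eigvals.sum()

print(first_component_share(related))    # close to 1: one component suffices
print(first_component_share(unrelated))  # close to 1/5: no reduction possible
```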
Eigendecomposition
what is eigendecomposition?
it is a transformation of the correlation matrix that re-expresses it in terms of eigenvalues and eigenvectors
in R:
eigen(cor(data))
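The R call above can be mirrored in Python with NumPy (hypothetical data standing in for R's `data` object; note that `numpy.linalg.eigh` returns eigenvalues in ascending order, whereas R's `eigen` returns them descending):

```python
import numpy as np

# Made-up data standing in for R's `data` object.
rng = np.random.default_rng(2)
data = rng.normal(size=(50, 3))
data[:, 2] += data[:, 0]                 # make two variables correlate

R = np.corrcoef(data, rowvar=False)      # R: cor(data)
eigvals, eigvecs = np.linalg.eigh(R)     # R: eigen(cor(data))

# eigh returns ascending order; R's eigen() returns descending, so flip.
eigvals = eigvals[::-1]
eigvecs = eigvecs[:, ::-1]

print(eigvals)   # R: $values
print(eigvecs)   # R: $vectors (one column of weights per component)
```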
Eigendecomposition
What are eigenvalues?
= a measure of the amount of variance packaged into a component
larger eigenvalues mean that the component accounts for a larger proportion of the variance
Eigendecomposition
what are eigenvectors?
= they provide information about the relationship of each variable to each component
Eigenvectors are a set of weights - one weight per variable in the original correlation matrix
Larger weights mean a variable makes a bigger contribution to that component
Eigendecomposition
Eigenvalues and variance
the sum of the eigenvalues equals the number of variables in the dataset (when PCA is run on a correlation matrix)
A full eigendecomposition accounts for all the variance, distributed across the eigenvalues
If we want to know the proportion of variance accounted for by a given component:
= eigenvalue ÷ total variance
= eigenvalue ÷ number of items
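A tiny worked example of this formula in Python (the eigenvalues are made up for illustration):

```python
import numpy as np

# Hypothetical eigenvalues from a 5-variable correlation matrix.
eigvals = np.array([2.5, 1.2, 0.6, 0.4, 0.3])

total_variance = eigvals.sum()          # = number of variables (5)
proportion = eigvals / total_variance   # variance accounted for per component

print(proportion)           # 0.50, 0.24, 0.12, 0.08, 0.06
print(proportion.cumsum())  # cumulative: 0.50, 0.74, 0.86, 0.94, 1.00
```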
Eigendecomposition
How many components to keep?
Eigendecomposition repackages our variance, but it does not reduce our dimensions. Dimension reduction comes from keeping only the largest components.
There are various methods for choosing how many - it is usually best to apply a few and then make a practical decision on how many to keep
Dimension reduction methods
Variance Accounted For
Simply select the minimum proportion of variance you want the retained components to account for, and keep enough components to reach it
a related rule is the Kaiser criterion: “components with eigenvalues >1 should be selected”, so we get rid of all those with eigenvalues <1 (each retained component then accounts for more variance than a single original variable)
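Both selection rules can be sketched in Python (the eigenvalues and the 80% threshold are arbitrary illustrations, not values from the document):

```python
import numpy as np

eigvals = np.array([2.5, 1.2, 0.6, 0.4, 0.3])  # hypothetical eigenvalues

# Rule 1: variance accounted for -- keep enough components to reach a
# chosen threshold, here an arbitrary 80% of the total variance.
cumulative = eigvals.cumsum() / eigvals.sum()
n_by_variance = int(np.searchsorted(cumulative, 0.80) + 1)

# Rule 2: Kaiser criterion -- keep components with eigenvalues > 1,
# i.e. components that account for more variance than a single variable.
n_by_kaiser = int((eigvals > 1).sum())

print(n_by_variance)  # 3 (cumulative shares: 0.50, 0.74, 0.86, ...)
print(n_by_kaiser)    # 2
```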