Data Reduction: Principal Component Analysis (PCA) Flashcards
what is data/dimension reduction?
a statistical procedure used to reduce a large set of variables to a smaller set
techniques that reduce sets of variables:
- PCA
- FA
techniques that reduce sets of observations:
- k-means clustering
- latent class analysis
Uses of data reduction techniques
Theory testing
= how many dimensions, and of what nature, best describe a theoretical construct?
e.g. personality questionnaire data → reduced to the five-factor model of personality
Uses of data reduction techniques
Test construction
= how should I group my items into subscales?
= which items best measure my constructs?
e.g. any situation where we construct a test and must decide which items form which subscale
Uses of data reduction techniques
Pragmatic
= I have too many variables / multicollinearity issues, how can I defensibly combine my variables?
e.g. taking the mean = reduces many numbers to one number
e.g. genetics → hundreds can be tested so we might want to group them
this is where PCA comes in: if all of our variables correlate, we can ask 'can we explain the variance in these items using a single composite variable?'
Purpose of PCA
the goal of PCA is to explain as much of the total variance in the dataset as possible
NOTE: PCA is exploratory, and each component is a weighted composite of the original variables
Procedure of PCA
1) starts with the original data
2) calculates covariances (correlations) between the variables
3) applies eigen decomposition to calculate a set of linear composites of the original variables
PCA in R
1) starts with a correlation matrix
2) we can convert this to a PCA output which represents the degree to which each item contributes to each composite
e.g. R packs as much variance as possible into the first component; once it can fit no more, it moves on to a second component, and so on
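A minimal sketch of this in base R, assuming a data frame called data whose columns are numeric items (the name is illustrative); prcomp() with scale. = TRUE performs PCA on the correlation matrix:
pc <- prcomp(data, scale. = TRUE)  # PCA on standardized variables
summary(pc)  # proportion of variance each successive component absorbs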
What does PCA do?
it repackages the variance from the correlation matrix into a set of components
What are components?
they are orthogonal (and therefore uncorrelated) linear combinations of the original variables
Each component accounts for as much variance as possible (1st accounting for most possible variance, 2nd accounting for 2nd most variance etc)
If variables are closely related (large correlations) we can represent them with fewer components
Eigen decomposition
what is Eigen decomposition?
it is a transformation of the correlation matrix to re-express it in terms of eigen values and eigen vectors
in R:
eigen(cor(data))
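A slightly fuller sketch, again assuming an illustrative data frame called data:
eig <- eigen(cor(data))  # eigen decomposition of the correlation matrix
eig$values               # eigen values: one per component, largest first
eig$vectors              # eigen vectors: one column of weights per component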
Eigen decomposition
What are eigen values?
= a measure of the size of the variance packaged into a component
larger eigen values mean that the component accounts for a larger proportion of the variance
Eigen decomposition
what are eigen vectors?
= provide information about the relationship of each variable to each component
Eigen vectors are a set of weights - one weight per variable in the original correlation matrix
Larger weights mean a variable makes a bigger contribution to that component
Eigen decomposition
Eigen values and variance
the sum of the eigen values will equal the number of variables in the dataset
A full eigen decomposition accounts for all variance distributed across eigen values
If we want to know the proportion of variance accounted for by a given component:
= eigen value ÷ total variance
= eigen value ÷ number of items (each standardized item contributes a variance of 1, so the total variance equals the number of items)
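As a sketch (data is an assumed data frame of items):
eig <- eigen(cor(data))
eig$values / length(eig$values)           # proportion of variance per component
cumsum(eig$values) / length(eig$values)   # cumulative proportion of variance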
Eigen decomposition
How many components to keep?
Eigen decomposition repackages our variance but it does not reduce our dimensions. Dimension reduction comes from keeping only the largest components.
There are various methods for this - it is usually best to use a few and then make a practical decision on how many to keep
Dimension reduction methods
Variance Accounted for …
Simply select the minimum variance you wish to be accounted for
e.g. keep components with eigen values >1 (the Kaiser criterion: each such component accounts for more variance than a single original item) and discard those with eigen values <1
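A one-line check of this rule (data is an assumed data frame of items):
eig <- eigen(cor(data))
sum(eig$values > 1)  # how many components account for more variance than one item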
Dimension reduction methods
Scree plots
a plot of the eigen values in decreasing order
- to determine how many components to keep, we look for the kink in the graph (assumed to represent the point at which components become substantively unimportant)
- Keep the number of components to the left of the kink
ISSUE = sometimes the kink is not obvious, so the cut-off point can be subjective
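A minimal base-R sketch of a scree plot (data is an assumed data frame of items):
eig <- eigen(cor(data))
plot(eig$values, type = "b",  # eigen values in decreasing order; look for the kink
     xlab = "Component", ylab = "Eigenvalue")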
Dimension reduction methods
Minimum average partial test (MAP)
MAP extracts components iteratively from the correlation matrix and computes the average squared partial correlation after each extraction - called the MAP value
at first the MAP value goes down with each extraction, but then it starts to increase again
- MAP keeps the number of components for which the average squared partial correlation is smallest
in R we use vss() from the psych package (see the sketch after this card)
This will produce a graph
- The STRAIGHTEST line on the graph denotes how many components to keep
ISSUE = sometimes suggests too few components (under extraction)
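A minimal sketch of the call, assuming a raw data frame called data; the printed output also reports the Velicer MAP values directly:
library(psych)
vss(data)  # prints fit indices, including the Velicer MAP values,
           # and draws the VSS plot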
Dimension reduction methods
Parallel Analysis
1) simulates datasets with the same number of variables and observations but NO correlations between the variables
2) computes an eigen decomposition for each simulated dataset
3) takes the average eigen value across the simulated datasets for each component
4) if a real eigen value exceeds the corresponding average simulated eigen value - we KEEP that component
in R we use fa.parallel() from the psych package (see the sketch after this card)
this produces a graph
- the cut-off point is where the real eigen values drop below the line of the simulated eigen values
ISSUE = sometimes suggests too many components (over extraction)
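A minimal sketch, again assuming a raw data frame called data:
library(psych)
fa.parallel(data, fa = "pc")  # parallel analysis for principal components;
                              # plots real against simulated eigen values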
Interpreting our chosen components
Eigen vectors and PCA loadings
we use eigen vectors when thinking about variance, but they are hard to interpret on their own, so we convert them into PCA loadings
the equation boils down to this:
essentially we scale the eigen vectors by the square roots of the eigen values, so that the components with the largest eigen values have the largest loadings
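A sketch of this conversion (data is an assumed data frame of items):
eig <- eigen(cor(data))
# scale each column of eigen vectors by the square root of its eigen value
loadings <- eig$vectors %*% diag(sqrt(eig$values))
round(loadings, 2)  # item-by-component loadings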
Interpreting our chosen components
PCA loadings
= they give the strength of the relationship between each item and the component
- range from -1 to 1 (higher absolute value = stronger relationship)
the sum of squared loadings for any variable across all components will be 1 (in a full decomposition, all the variance in each item is explained)
Interpreting our chosen components
Running a PCA with reduced number of components
In R, we use the principal() function in the psych package
- we supply the data frame or the correlation matrix as the first argument
- we specify the number of components to keep with the "nfactors = " argument (see the sketch after this card)
It can be useful to compare solutions with different numbers of components to check which solution makes the most sense
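A minimal sketch, keeping two components (the number, and the data frame name data, are illustrative):
library(psych)
pc2 <- principal(data, nfactors = 2)  # PCA retaining 2 components
pc2$loadings                          # component loadings for interpretation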
Interpreting our chosen components
Interpreting the components
Once we have decided how many components to keep we examine the PCA solution based on the component loadings
- component loadings are calculated from the values in the eigen vectors
- they can be interpreted as the correlations between variables and components
A good PCA solution explains the variance of the original correlation matrix in as few components as possible
Interpreting our chosen components
PCA scores
after conducting a PCA you may want to create scores for the new dimensions
simplest method = sum the scores on all items that 'belong' to a component (loading > 0.3)
best method = take the weights into account (i.e. the eigen values and vectors)
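A sketch of the weighted approach: principal() returns weighted component scores when given raw data (data is again an assumed data frame):
library(psych)
pc2 <- principal(data, nfactors = 2, scores = TRUE)
head(pc2$scores)  # one column of weighted scores per retained component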
Reporting PCA
Main principles = transparency and reproducibility
Methods
- method used to decide the number of components
- rotation method (if any)
Results
- considerations in the choice of number of components
- how many components were retained
- loading matrix of the chosen solution
- variance explained by components
- labelling and interpretation of components