Data Reduction: Principal Component Analysis (PCA) Flashcards

1
Q

What is data/dimension reduction?

A

a statistical procedure used to reduce a large set of variables to a smaller set

techniques that reduce sets of variables:
- PCA (principal component analysis)
- FA (factor analysis)

techniques that reduce sets of observations:
- k-means clustering
- latent class analysis

2
Q

Uses of data reduction techniques

Theory testing

A

= what are the number and nature of the dimensions that best describe a theoretical construct?

e.g. data based on personality items → reduced to form the five-factor model of personality

3
Q

Uses of data reduction techniques

Test construction

A

= how should I group my items into subscales?
= which items best measure my constructs?

e.g. any situation in which we construct a test

4
Q

Uses of data reduction techniques

Pragmatic

A

= I have too many variables (or multicollinearity issues) - how can I defensibly combine my variables?

e.g. taking the mean = reduces many numbers to one number
e.g. genetics → hundreds of variants can be tested, so we might want to group them

this is where PCA comes in - if all of our variables correlate, we want to see: ‘can we explain all the variance in these items using one variable?’

5
Q

Purpose of PCA

A

the goal of PCA is to explain as much of the total variance in the dataset as possible, using as few components as possible

NOTE: PCA is exploratory, and each component is a weighted composite of the original variables

6
Q

Procedure of PCA

A

1) starts with the original data

2) calculates the covariances (or correlations) between the variables

3) applies an eigen decomposition to calculate a set of linear composites of the original variables
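
a minimal sketch of these steps in R, assuming a hypothetical data frame called data:

R <- cor(data)   # step 2: the correlation matrix
e <- eigen(R)    # step 3: eigen decomposition
e$values         # eigenvalues - variance packaged into each component
e$vectors        # eigenvectors - weights of each variable on each component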

7
Q

PCA in R

A

1) starts with a correlation matrix

2) we can convert this to a PCA output, which represents the degree to which each item contributes to each composite
e.g. R fits as much variance as possible into the first component; once it cannot fit any more, it moves on to a second component
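
a minimal sketch, assuming the psych package and a hypothetical data frame called data:

library(psych)
pc <- principal(cor(data), nfactors = ncol(data), rotate = "none")  # full decomposition: as many components as items
pc$loadings  # degree to which each item contributes to each composite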

8
Q

What does PCA do?

A

it repackages the variance from the correlation matrix into a set of components

9
Q

What are components?

A

they are orthogonal (and therefore uncorrelated) linear combinations of the original variables

Each component accounts for as much of the remaining variance as possible (the 1st accounting for the most variance, the 2nd for the 2nd most, etc.)
If variables are closely related (large correlations), we can represent them with fewer components
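
a quick sketch of the ‘uncorrelated’ property in R, assuming a hypothetical data frame called data:

scores <- scale(data) %*% eigen(cor(data))$vectors  # component scores
round(cor(scores), 10)  # off-diagonals are 0: the components are uncorrelated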

10
Q

Eigen decomposition

What is eigen decomposition?

A

it is a transformation of the correlation matrix that re-expresses it in terms of eigenvalues and eigenvectors

in R:
eigen(cor(data))

11
Q

Eigen decomposition

What are eigenvalues?

A

= a measure of the amount of variance packaged into a component

a larger eigenvalue means that the component accounts for a larger proportion of the variance

12
Q

Eigen decomposition

What are eigenvectors?

A

= they provide information about the relationship of each variable to each component

Eigenvectors are a set of weights - one weight per variable in the original correlation matrix
A larger weight means a variable makes a bigger contribution to that component

13
Q

Eigen decomposition

Eigenvalues and variance

A

the sum of the eigenvalues will equal the number of variables in the dataset

A full eigen decomposition accounts for all of the variance, distributed across the eigenvalues

If we want to know the proportion of variance accounted for by a given component:
= eigenvalue ÷ total variance
= eigenvalue ÷ number of items
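
e.g. with 10 items, a component with an eigenvalue of 2.5 accounts for 2.5 ÷ 10 = 25% of the total variance

a minimal sketch in R, assuming a hypothetical data frame called data:

eigs <- eigen(cor(data))$values
eigs / length(eigs)          # proportion of variance accounted for by each component
cumsum(eigs) / length(eigs)  # cumulative proportion across components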

14
Q

Eigen decomposition

How many components to keep?

A

Eigen decomposition repackages our variance but it does not reduce our dimensions. Dimension reduction comes from keeping only the largest components.
There are various methods for this - it is usually best to use a few and then make a practical decision on how many to keep

15
Q

Dimension reduction methods

Variance Accounted for …

A

Simply select the minimum amount of variance you wish to have accounted for

e.g. Kaiser's criterion: “components with eigenvalues > 1 should be selected” - so we get rid of all components with eigenvalues < 1
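
a minimal sketch of the eigenvalue > 1 rule in R, assuming a hypothetical data frame called data:

eigs <- eigen(cor(data))$values
sum(eigs > 1)  # how many components have an eigenvalue above 1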

16
Q

Dimension reduction methods

Scree plots

A

a plot of the eigenvalues against the component number

  • to determine how many components to keep, we look for the kink in the plot (assumed to represent the point at which components become substantively unimportant)
  • Keep the number of components to the left of the kink

ISSUE = sometimes the kink is not obvious, so the cut-off point can be subjective
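
a minimal sketch in base R, assuming a hypothetical data frame called data:

plot(eigen(cor(data))$values, type = "b",
     xlab = "Component number", ylab = "Eigenvalue")  # look for the kink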

17
Q

Dimension reduction methods

Minimum average partial test (MAP)

A

MAP extracts components iteratively from the correlation matrix and computes the average squared partial correlation after each extraction - this is called the MAP value

at first the MAP value goes down with each extraction, but then it starts to increase again

  • MAP keeps the number of components for which the average squared partial correlation is smallest

in R we use vss()
This will produce a graph
- The STRAIGHTEST line on the graph denotes how many components to keep

ISSUE = sometimes MAP suggests too few components (under-extraction)
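
a minimal sketch, assuming the psych package and a hypothetical data frame called data:

library(psych)
vss(data)  # prints the MAP value for each number of components (the minimum suggests how many to keep)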

18
Q

Dimension reduction methods

Parallel Analysis

A

1) creates simulated datasets with the same number of variables (and observations) but NO correlations

2) computes an eigen decomposition for each simulated dataset

3) compares the average eigenvalue across the simulated datasets with the real eigenvalue for each component

4) if a real eigenvalue exceeds the corresponding simulated eigenvalue, we KEEP that component

in R we use fa.parallel()
this produces a graph
- the point at which the real eigenvalues drop below the line of the simulated eigenvalues marks the cut-off

ISSUE = sometimes parallel analysis suggests too many components (over-extraction)
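
a minimal sketch, assuming the psych package and a hypothetical data frame called data:

library(psych)
fa.parallel(data, fa = "pc")  # plots real vs simulated eigenvalues for the components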

19
Q

Interpreting our chosen components

Eigenvectors and PCA loadings

A

we use eigenvectors to think about variance, but they are hard to interpret, so we convert them to PCA loadings

there is an equation, but the basic idea is:
essentially, we scale each eigenvector by the square root of its eigenvalue, so that the components with the largest eigenvalues have the largest loadings
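
a minimal sketch of this scaling in R, assuming a hypothetical data frame called data:

e <- eigen(cor(data))
loadings <- e$vectors %*% diag(sqrt(e$values))  # scale each eigenvector by the square root of its eigenvalue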

20
Q

Interpreting our chosen components

PCA loadings

A

= they give the strength of the relationship between each item and each component

  • loadings range from -1 to 1 (a higher absolute value = a stronger relationship)

the sum of squared loadings for any variable across all components will be 1 (all the variance in the item is explained in a full decomposition)
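
a quick check of this property in R, assuming a hypothetical data frame called data:

e <- eigen(cor(data))
loadings <- e$vectors %*% diag(sqrt(e$values))
rowSums(loadings^2)  # = 1 (up to rounding) for every variable in a full decomposition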

21
Q

Interpreting our chosen components

Running a PCA with reduced number of components

A

In R, we use the principal() function from the psych package

  • we supply the data frame or the correlation matrix as the first argument
  • we specify the number of components to keep with the “nfactors =” argument

It can be useful to compare solutions with different numbers of components to check which solution makes the most sense
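
a minimal sketch, assuming the psych package and a hypothetical data frame called data:

library(psych)
pc2 <- principal(data, nfactors = 2, rotate = "none")  # keep 2 components
pc2$loadings  # the loading matrix for the 2-component solution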

22
Q

Interpreting our chosen components

Interpreting the components

A

Once we have decided how many components to keep, we examine the PCA solution based on the component loadings

  • component loadings are calculated from the values in the eigenvectors
  • they can be interpreted as the correlations between variables and components

A good PCA solution explains the variance of the original correlation matrix in as few components as possible

23
Q

Interpreting our chosen components

PCA scores

A

after conducting a PCA, you may want to create scores on the new dimensions

simplest method = sum all the items that ‘belong’ to a component (loading > 0.3)

best method = take the weights into account (i.e. use the eigenvalues and eigenvectors)
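
a minimal sketch of the weighted method, assuming the psych package and a hypothetical data frame called data:

library(psych)
pc2 <- principal(data, nfactors = 2)  # scores are computed when raw data are supplied
head(pc2$scores)                      # weighted component scores, one column per retained component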

24
Q

Reporting PCA

A

Main principles = transparency and reproducibility

Methods
- method(s) used to decide the number of components
- rotation method

Results
- considerations in the choice of the number of components
- how many components were retained
- the loading matrix of the chosen solution
- the variance explained by the components
- labelling and interpretation of the components