Data Reduction: Principal Component Analysis (PCA) Flashcards
what is data/dimension reduction?
a statistical procedure used to reduce a large set of variables to a smaller set
to reduce sets of variables we typically use:
- PCA
- FA
to reduce sets of observations we typically use:
- k-means clustering
- latent class analysis
Uses of data reduction techniques
Theory testing
= what are the number and nature of dimensions that best describe a theoretical construct?
e.g. data based on personality → reduced to form the five-factor model of personality
Uses of data reduction techniques
Test construction
= how should I group my items into subscales?
= which items best measure my constructs?
e.g. anywhere we construct a test
Uses of data reduction techniques
Pragmatic
= I have too many variables / multicollinearity issues, how can I defensibly combine my variables?
e.g. taking the mean = reduces many numbers to one number
e.g. genetics → hundreds of variables can be tested, so we might want to group them
this is where PCA comes in - if all of our variables correlate, we can ask: ‘can we explain the variance in these items using one variable?’
Purpose of PCA
the goal of PCA is to explain as much of the total variance in the dataset as possible
NOTE: PCA is exploratory, and each component is a weighted composite of the original variables
Procedure of PCA
1) starts with the original data
2) calculates covariances (correlations) between the variables
3) applies eigen decomposition to calculate a set of linear composites of the original variables
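The three steps above can be sketched in Python with NumPy (the document's own examples use R, but the algebra is identical; the data here is made up for illustration):

```python
import numpy as np

# Hypothetical data: 100 observations of 4 correlated variables.
rng = np.random.default_rng(0)
shared = rng.normal(size=(100, 1))
data = shared + 0.5 * rng.normal(size=(100, 4))  # columns share variance

# 1) original data -> 2) correlation matrix
R = np.corrcoef(data, rowvar=False)              # 4 x 4

# 3) eigendecomposition gives the linear composites (components)
eigvals, eigvecs = np.linalg.eigh(R)             # eigh returns ascending order
order = np.argsort(eigvals)[::-1]                # largest component first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Component scores are weighted composites of the standardised variables.
z = (data - data.mean(axis=0)) / data.std(axis=0)
scores = z @ eigvecs

print(eigvals)        # variance packaged into each component
print(eigvals.sum())  # ≈ 4.0: the total variance equals the number of variables
```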
PCA in R
1) starts with a correlation matrix
2) we can convert this to a PCA output which represents the degree to which each item contributes to a composite
e.g. R fits as much variance as possible into the first component; once it can’t fit any more, it moves on to a second component
What does PCA do?
it repackages the variance from the correlation matrix into a set of components
What are components?
they are orthogonal (and therefore uncorrelated) linear combinations of the original variables
Each component accounts for as much of the remaining variance as possible (the 1st accounts for the most, the 2nd for the next most, and so on)
If variables are closely related (large correlations) we can represent them with fewer components
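A quick numerical illustration of the last point, in Python with made-up data: when variables correlate strongly, the first component captures almost all the variance; when they are independent, no reduction is possible.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000

# Five variables that all share one underlying signal (large correlations).
signal = rng.normal(size=(n, 1))
related = signal + 0.2 * rng.normal(size=(n, 5))

# Five independent variables (near-zero correlations).
unrelated = rng.normal(size=(n, 5))

def first_component_share(data):
    """Proportion of total variance captured by the first component."""
    eigvals = np.linalg.eigvalsh(np.corrcoef(data, rowvar=False))
    return eigvals.max() / eigvals.sum()

print(first_component_share(related))    # close to 1: one component suffices
print(first_component_share(unrelated))  # close to 1/5: no reduction possible
```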
Eigendecomposition
what is eigendecomposition?
it is a transformation of the correlation matrix that re-expresses it in terms of eigenvalues and eigenvectors
in R:
eigen(cor(data))
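The R call above can be mirrored in Python with NumPy (hypothetical data standing in for R's `data` object; note that `numpy.linalg.eigh` returns eigenvalues in ascending order, whereas R's `eigen` returns them descending):

```python
import numpy as np

# Made-up data standing in for R's `data` object.
rng = np.random.default_rng(2)
data = rng.normal(size=(50, 3))
data[:, 2] += data[:, 0]                 # make two variables correlate

R = np.corrcoef(data, rowvar=False)      # R: cor(data)
eigvals, eigvecs = np.linalg.eigh(R)     # R: eigen(cor(data))

# eigh returns ascending order; R's eigen() returns descending, so flip.
eigvals = eigvals[::-1]
eigvecs = eigvecs[:, ::-1]

print(eigvals)   # R: $values
print(eigvecs)   # R: $vectors (one column of weights per component)
```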
Eigendecomposition
What are eigenvalues?
= a measure of the amount of variance packaged into a component
larger eigenvalues mean that the component accounts for a larger proportion of the variance
Eigendecomposition
what are eigenvectors?
= they provide information about the relationship of each variable to each component
Eigenvectors are a set of weights - one weight per variable in the original correlation matrix
Larger weights mean a variable makes a bigger contribution to that component
Eigendecomposition
Eigenvalues and variance
the sum of the eigenvalues equals the number of variables in the dataset (when PCA is run on a correlation matrix)
A full eigendecomposition accounts for all the variance, distributed across the eigenvalues
If we want to know the proportion of variance accounted for by a given component:
= eigenvalue ÷ total variance
= eigenvalue ÷ number of items
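A tiny worked example of this formula in Python (the eigenvalues are made up for illustration):

```python
import numpy as np

# Hypothetical eigenvalues from a 5-variable correlation matrix.
eigvals = np.array([2.5, 1.2, 0.6, 0.4, 0.3])

total_variance = eigvals.sum()          # = number of variables (5)
proportion = eigvals / total_variance   # variance accounted for per component

print(proportion)           # 0.50, 0.24, 0.12, 0.08, 0.06
print(proportion.cumsum())  # cumulative: 0.50, 0.74, 0.86, 0.94, 1.00
```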
Eigendecomposition
How many components to keep?
Eigendecomposition repackages our variance, but it does not reduce our dimensions. Dimension reduction comes from keeping only the largest components.
There are various methods for choosing how many - it is usually best to apply a few and then make a practical decision on how many to keep
Dimension reduction methods
Variance Accounted For
Simply select the minimum proportion of variance you want the retained components to account for, and keep enough components to reach it
a related rule is the Kaiser criterion: “components with eigenvalues >1 should be selected”, so we get rid of all those with eigenvalues <1 (each retained component then accounts for more variance than a single original variable)
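Both selection rules can be sketched in Python (the eigenvalues and the 80% threshold are arbitrary illustrations, not values from the document):

```python
import numpy as np

eigvals = np.array([2.5, 1.2, 0.6, 0.4, 0.3])  # hypothetical eigenvalues

# Rule 1: variance accounted for -- keep enough components to reach a
# chosen threshold, here an arbitrary 80% of the total variance.
cumulative = eigvals.cumsum() / eigvals.sum()
n_by_variance = int(np.searchsorted(cumulative, 0.80) + 1)

# Rule 2: Kaiser criterion -- keep components with eigenvalues > 1,
# i.e. components that account for more variance than a single variable.
n_by_kaiser = int((eigvals > 1).sum())

print(n_by_variance)  # 3 (cumulative shares: 0.50, 0.74, 0.86, ...)
print(n_by_kaiser)    # 2
```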