Principal component analysis Flashcards

1
Q

Define multicollinearity

A

Several independent (predictor) variables are correlated with one another.
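
One simple way to spot this kind of correlation is to compute the Pearson correlation between pairs of predictors. The sketch below is only an illustration: the variable names and values are made up, and the helper `pearson_r` is hypothetical, written with the standard library so it runs without extra packages.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares of the centred values
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Made-up predictors: height in cm, nearly the same height in inches,
# and an unrelated-looking third variable
height_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
height_in = [59.2, 62.9, 67.0, 70.8, 74.9]
shoe_size = [38.0, 44.0, 39.0, 45.0, 41.0]

r_redundant = pearson_r(height_cm, height_in)  # close to 1: redundant pair
r_other = pearson_r(height_cm, shoe_size)      # much weaker association

print(f"height_cm vs height_in: r = {r_redundant:.3f}")
print(f"height_cm vs shoe_size: r = {r_other:.3f}")
```

A correlation close to +1 or -1 between two predictors suggests they carry largely the same information.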

2
Q

List some reasons why data reduction techniques are needed

A

Large data sets with many variables per observation have high dimensionality,
even when only a few important features contribute most of the information.

Data reduction allows the necessary information to be extracted from a huge array of data.
Removing redundant data removes noise.
It makes it easier to see the valid data and identify potential patterns in it.

3
Q

What are the types of data reduction techniques?

A

Dimensionality reduction: reducing the number of dimensions (variables) the data is spread across, e.g.
Principal component analysis
Wavelet transform

Numerosity reduction: uses alternative, smaller forms of data representation to reduce data volume, e.g.
Histograms (for data with multiple attributes)

4
Q

Outline Principal Component Analysis (PCA) (aims)

A

PCA is a linear variable reduction technique

PCA aims to reduce a larger set of variables to a smaller set of ‘artificial’ variables (called principal components) that account for most of the variance in the original variables.

It models the interrelationships in the data set, focusing on variance and covariance. The higher the variance, the wider the spread of the data.

5
Q

What are the 3 major problems PCA can solve?

A

Remove unrelated variables
Remove redundancy in a set of variables
Remove multicollinearity

6
Q

Describe how Principal Component Analysis works.

A

To perform a PCA, the data is first centred and a line of best fit through the origin needs to be found.
Each data point is projected onto the line.
The fit is measured by the sum of the squared distances from the projected points to the origin.
The line is rotated until the largest sum of squared distances is found.
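
The rotation can be imitated with a brute-force search, as in the rough sketch below. It assumes a small made-up 2-D data set, centres it, tries a line through the origin at every 0.1 degree, and keeps the angle with the largest sum of squared projected distances. The names `points` and `projected_ss` are illustrative only; real PCA software finds the direction from an eigendecomposition or SVD rather than by searching.

```python
import math

# Made-up 2-D data set (two correlated variables)
points = [(2.5, 2.4), (0.5, 0.7), (2.2, 2.9), (1.9, 2.2), (3.1, 3.0),
          (2.3, 2.7), (2.0, 1.6), (1.0, 1.1), (1.5, 1.6), (1.1, 0.9)]

# Step 1: centre the data so the candidate line passes through the origin
n = len(points)
mean_x = sum(p[0] for p in points) / n
mean_y = sum(p[1] for p in points) / n
centred = [(x - mean_x, y - mean_y) for x, y in points]

def projected_ss(angle_deg):
    """Sum of squared distances from the origin to each point's projection
    onto a line through the origin at the given angle."""
    t = math.radians(angle_deg)
    ux, uy = math.cos(t), math.sin(t)  # unit vector along the line
    return sum((x * ux + y * uy) ** 2 for x, y in centred)

# Step 2: "rotate" the line in 0.1 degree steps and keep the best angle
angles = [a / 10 for a in range(0, 1800)]
best_angle = max(angles, key=projected_ss)
best_ss = projected_ss(best_angle)

# Cross-check against the closed-form principal-axis angle for 2-D data
sxx = sum(x * x for x, _ in centred)
syy = sum(y * y for _, y in centred)
sxy = sum(x * y for x, y in centred)
closed_form = math.degrees(0.5 * math.atan2(2 * sxy, sxx - syy)) % 180

print(f"best angle by search: {best_angle:.1f} degrees")
print(f"closed-form angle:    {closed_form:.1f} degrees")
```

The search and the closed-form angle should agree to within the step size, which shows that "rotating until the sum of squares is largest" picks out the direction of greatest variance.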

7
Q

What is the goal of PCA?

A

To model the interrelationships in the data.

8
Q

First principal component and second principal component

What is the condition…

A

The line of best fit, which describes the greatest variance, is the first principal component.

The second principal component is perpendicular to the first and accounts for the next highest variance.

The condition is that it is uncorrelated with the first principal component.
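
For two variables this can be checked numerically. The sketch below assumes a made-up 2-D data set and uses the closed-form principal-axis angle for 2-D data; `pc1`, `pc2` and the score lists are illustrative names. It verifies that the two directions are perpendicular, that the component scores are uncorrelated, and that the first component has the larger variance.

```python
import math

# Made-up 2-D data set (two correlated variables)
points = [(2.5, 2.4), (0.5, 0.7), (2.2, 2.9), (1.9, 2.2), (3.1, 3.0),
          (2.3, 2.7), (2.0, 1.6), (1.0, 1.1), (1.5, 1.6), (1.1, 0.9)]

n = len(points)
mean_x = sum(p[0] for p in points) / n
mean_y = sum(p[1] for p in points) / n
centred = [(x - mean_x, y - mean_y) for x, y in points]

# Sums of squares and cross-products of the centred data
sxx = sum(x * x for x, _ in centred)
syy = sum(y * y for _, y in centred)
sxy = sum(x * y for x, y in centred)

# Direction of greatest variance (first principal component)
theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
pc1 = (math.cos(theta), math.sin(theta))
pc2 = (-math.sin(theta), math.cos(theta))  # pc1 rotated by 90 degrees

# Scores: coordinates of each point along each component
scores1 = [x * pc1[0] + y * pc1[1] for x, y in centred]
scores2 = [x * pc2[0] + y * pc2[1] for x, y in centred]

dot = pc1[0] * pc2[0] + pc1[1] * pc2[1]                     # 0 if perpendicular
cov12 = sum(a * b for a, b in zip(scores1, scores2)) / (n - 1)  # 0 if uncorrelated
var1 = sum(a * a for a in scores1) / (n - 1)
var2 = sum(b * b for b in scores2) / (n - 1)

print(f"dot product of directions: {dot:.2e}")
print(f"covariance of scores:      {cov12:.2e}")
print(f"variance along PC1, PC2:   {var1:.3f}, {var2:.3f}")
```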

9
Q

Define Eigenvalue

What does the Eigenvalue indicate and represent

A

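An eigenvalue is a measure of distance: the sum of squared distances between the projected data points and the origin along the line of best fit (software usually reports it divided by n - 1, i.e. as a variance). The bigger, the better.

The eigenvalue indicates how much variance there is in the data, or how spread out the data is along the line.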

It represents the total amount of variance that can be explained by a given principal component.
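
The link between the eigenvalue and the spread along the line can be checked on a small example. The sketch below assumes a made-up 2-D data set and uses the closed-form eigenvalues of a symmetric 2x2 covariance matrix; all names are illustrative. It compares the largest eigenvalue with the sum of squared projected distances divided by n - 1.

```python
import math

# Made-up 2-D data set (two correlated variables)
points = [(2.5, 2.4), (0.5, 0.7), (2.2, 2.9), (1.9, 2.2), (3.1, 3.0),
          (2.3, 2.7), (2.0, 1.6), (1.0, 1.1), (1.5, 1.6), (1.1, 0.9)]

n = len(points)
mean_x = sum(p[0] for p in points) / n
mean_y = sum(p[1] for p in points) / n
centred = [(x - mean_x, y - mean_y) for x, y in points]

# Entries of the 2x2 sample covariance matrix [[a, b], [b, d]]
a = sum(x * x for x, _ in centred) / (n - 1)
d = sum(y * y for _, y in centred) / (n - 1)
b = sum(x * y for x, y in centred) / (n - 1)

# Closed-form eigenvalues of a symmetric 2x2 matrix
half_trace = (a + d) / 2
radius = math.sqrt(((a - d) / 2) ** 2 + b ** 2)
lambda1 = half_trace + radius  # variance along the first component
lambda2 = half_trace - radius  # variance along the second component

# Unit vector along the first principal component
theta = 0.5 * math.atan2(2 * b, a - d)
ux, uy = math.cos(theta), math.sin(theta)

# Sum of squared distances from the origin to the projected points
ss_projected = sum((x * ux + y * uy) ** 2 for x, y in centred)

share_pc1 = lambda1 / (lambda1 + lambda2)
print(f"largest eigenvalue:          {lambda1:.4f}")
print(f"projected SS / (n - 1):      {ss_projected / (n - 1):.4f}")
print(f"share of variance along PC1: {share_pc1:.1%}")
```

The eigenvalues also add up to the total variance of the original variables, which is why each one can be read as the amount of variance explained by its component.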

10
Q

Define Eigenvector

A

An eigenvector is a direction (the direction along which the data varies).
The eigenvector with the highest eigenvalue is the first principal component.

11
Q

By ranking eigenvectors in order of their eigenvalues, highest to lowest…

A

You get the principal components in order of significance
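
As a minimal illustration of the ranking step, the sketch below sorts (eigenvalue, eigenvector) pairs from highest to lowest eigenvalue. The pairs are illustrative values for a hypothetical three-variable data set with a diagonal covariance matrix, where the eigenvectors are simply the coordinate axes.

```python
# Illustrative (eigenvalue, eigenvector) pairs in arbitrary order.
# For a diagonal covariance matrix the eigenvectors are the unit axes
# and the eigenvalues are the variances of the individual variables.
eigenpairs = [
    (0.4, (1.0, 0.0, 0.0)),
    (2.1, (0.0, 1.0, 0.0)),
    (0.5, (0.0, 0.0, 1.0)),
]

# Rank by eigenvalue, highest first
ranked = sorted(eigenpairs, key=lambda pair: pair[0], reverse=True)

for rank, (value, vector) in enumerate(ranked, start=1):
    print(f"PC{rank}: eigenvalue={value}, direction={vector}")

# The first entry after ranking is the first principal component
first_pc_direction = ranked[0][1]
```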

12
Q

The eigenvector with the highest eigenvalue is…

A

The first principal component

13
Q

What is a scree plot?

A

A plot of the variance explained by each component (its eigenvalue) against the component number.

14
Q

What does the inflection point on a scree plot indicate?

A

The inflection point represents the point where the graph begins to level out, as subsequent components account for little of the total variance.

15
Q

State the 4 assumptions that must be met in order to run a PCA

A

Have multiple continuous variables
There should be a linear relationship between all the variables
There should be no outliers
There should be a large sample size for PCA to produce reliable results

16
Q

What is the goal of post-PCA rotation?

A

The goal of rotation is to improve the interpretability of the output.

Component rotations help us interpret component loadings.

17
Q

What are the two types of rotation?

A

Orthogonal rotation: assumes that components are independent or uncorrelated with each other.
Example: Varimax rotation

Oblique rotation: components are not independent and are correlated.
Example: Direct oblimin

18
Q

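When, and why, is varimax rotation not recommended?

A

It shouldn’t be used for clinical datasets because of biological cascades (inter-relationships), e.g. between organ systems.
The components are then likely to be correlated, which conflicts with the independence assumed by an orthogonal rotation.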

19
Q

List the steps of PCA

A

  1. Identify linear associations in the data (e.g. to help identify disease systems).
  2. Identify the principal components based on high eigenvalues.
  3. Choose a rotation (varimax or direct oblimin).
  4. Run a linear regression to predict a continuous variable of choice.

20
Q

How do you decide how many principal components to retain?
Which techniques are used?

A

The eigenvalue-one criterion - An eigenvalue less than one indicates that the component explains less variance than a single (standardised) variable would, and hence it shouldn’t be retained.
Percentage of variance explained - As the component number increases, each subsequent component explains less of the total variance. It has been suggested that a component should only be retained if it explains at least 5% to 10% of the total variance.
Scree plot - A scree plot is a plot of the variance (its “eigenvalue”) explained by each component against its component number. The inflection point is meant to represent the point where the graph begins to level out, as subsequent components account for little of the total variance.
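
The first two criteria are easy to apply once the eigenvalues are known. The sketch below uses made-up eigenvalues for a hypothetical PCA on six standardised variables (so they sum to six), and an assumed 10% cut-off taken from the upper end of the 5% to 10% range; the printed table is a crude numeric stand-in for a scree plot.

```python
# Made-up eigenvalues from a hypothetical PCA on six standardised variables
eigenvalues = [2.7, 1.5, 0.9, 0.5, 0.3, 0.1]
total = sum(eigenvalues)

# Eigenvalue-one criterion: drop components with an eigenvalue below one
kaiser_keep = [i for i, ev in enumerate(eigenvalues, start=1) if ev >= 1.0]

# Percentage of variance explained by each component
percent = [100 * ev / total for ev in eigenvalues]

# Assumed cut-off: keep components explaining at least 10% of the variance
min_percent = 10.0
percent_keep = [i for i, p in enumerate(percent, start=1) if p >= min_percent]

# Numeric scree table: component, eigenvalue, % variance, cumulative %
cumulative = 0.0
for i, (ev, p) in enumerate(zip(eigenvalues, percent), start=1):
    cumulative += p
    print(f"PC{i}: eigenvalue={ev:.1f}  variance={p:5.1f}%  cumulative={cumulative:5.1f}%")

print("retained by eigenvalue-one criterion:", kaiser_keep)
print("retained by percentage criterion:    ", percent_keep)
```

The criteria need not agree: with these made-up numbers the percentage rule keeps one more component than the eigenvalue-one rule, so the final choice remains a judgment call.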

21
Q

What is the inflection point on a Scree plot?

A

The inflection point represents the point where the graph begins to level out, as subsequent components account for little of the total variance.

22
Q

What is orthogonal rotation?
Give an example.

A

This assumes that components are independent or uncorrelated with each other.

Varimax rotation

23
Q

What is oblique rotation?
Give an example.

A

Components are not independent and are correlated.

Direct Oblimin