Principal component analysis Flashcards

Question 1

Q

Define multicollinearity

Answer

A

Multicollinearity occurs when several independent (predictor) variables are correlated with each other.
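One way to spot multicollinearity is to inspect the pairwise correlations between the independent variables. The sketch below is illustrative only: the synthetic variables `x1`, `x2`, `x3` and the 0.9 cutoff are assumptions made for the example, not part of the original card.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = 2 * x1 + 0.1 * rng.normal(size=n)   # nearly a linear function of x1
x3 = rng.normal(size=n)                  # generated independently of the others

X = np.column_stack([x1, x2, x3])
corr = np.corrcoef(X, rowvar=False)      # 3x3 correlation matrix of the columns

# Flag pairs of variables whose absolute correlation exceeds the assumed cutoff.
cutoff = 0.9
flagged = [(i, j) for i in range(3) for j in range(i + 1, 3)
           if abs(corr[i, j]) > cutoff]
print(flagged)
```

Only the pair (`x1`, `x2`) should be flagged here, because `x2` was built almost entirely from `x1`.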

Question 2

Q

List some reasons why data reduction techniques are needed

Answer

A

Big data sets with many variables have high dimensionality, yet often only a few important features contribute to the dataset.

Data reduction allows the necessary information to be extracted from a huge array of data.
Removing redundant data removes noise.
It makes it easier to see valid data and identify potential patterns in the data.

Question 3

Q

What are the types of data reduction techniques?

Answer

A

Dimensionality reduction: Reducing the number of dimensions the data is spread across e.g.
Principal component analysis
Wavelet transform

Numerosity reduction: uses alternative, smaller forms of data representation to reduce data volume, e.g.
Histograms (for data with multiple attributes)

Question 4

Q

Outline Principal Component Analysis (PCA) (aims)

Answer

A

PCA is a linear variable reduction technique.

PCA aims to reduce a larger set of variables into a smaller set of ‘artificial’ variables (called principal components) that account for most of the variance in the original variables.

It models the interrelationships in the data set, focusing on variance and covariance. The higher the variance, the wider the spread of the data.

Question 5

Q

What are the 3 major problems PCA can solve?

Answer

A

Remove unrelated variables
Remove redundancy in a set of variables
Remove multicollinearity

Question 6

Q

Describe how Principal Component Analysis works.

Answer

A

To perform a PCA, a line of best fit through the centre of the data needs to be found.
The data points are projected onto the line.
The fit is measured by the sum of the squared distances from the projected points to the origin.
The line is rotated until the largest sum of squared distances is found.
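The rotate-and-measure idea can be sketched in a few lines. This is a minimal illustration on synthetic two-variable data (the data, the 0.1 degree step, and all variable names are assumptions); it also cross-checks the result against the top eigenvector of the covariance matrix, which is how the direction is normally computed in practice.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=300)
y = 0.8 * x + 0.3 * rng.normal(size=300)
X = np.column_stack([x, y])
Xc = X - X.mean(axis=0)                       # centre so the line passes through the origin

def sum_sq_projections(angle):
    """Sum of squared distances from the origin of the points projected onto the line."""
    direction = np.array([np.cos(angle), np.sin(angle)])
    return np.sum((Xc @ direction) ** 2)

angles = np.linspace(0.0, np.pi, 1801)        # rotate the line in 0.1 degree steps
scores = np.array([sum_sq_projections(a) for a in angles])
best = angles[np.argmax(scores)]
line_direction = np.array([np.cos(best), np.sin(best)])

# Cross-check: the top eigenvector of the covariance matrix gives the same direction.
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
pc1 = eigvecs[:, -1]                          # eigh returns eigenvalues in ascending order
print(abs(line_direction @ pc1))              # close to 1 when the two directions agree
```

The absolute value is used in the comparison because an eigenvector's sign is arbitrary: a line has no preferred direction.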

Question 7

Q

What is the goal of PCA?

Answer

A

To model the interrelationships in the data.

Question 8

Q

First principal component and second principal component

What is the condition…

Answer

A

The first principal component is the line of best fit: the direction that describes the greatest variance in the data.

The second principal component is perpendicular to the first and accounts for the next highest variance.

The condition is that it is uncorrelated with the first principal component.
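Both properties can be checked numerically. The following sketch uses synthetic two-variable data (an assumption for illustration) and verifies that the two component directions are perpendicular and that the data, once expressed on those components, has uncorrelated scores.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=500)
y = 0.6 * x + 0.5 * rng.normal(size=500)
Xc = np.column_stack([x, y])
Xc = Xc - Xc.mean(axis=0)                     # centre the data

eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
pc1, pc2 = eigvecs[:, 1], eigvecs[:, 0]       # eigh puts the largest eigenvalue last

scores = Xc @ np.column_stack([pc1, pc2])     # data expressed on the two components
score_corr = np.corrcoef(scores, rowvar=False)[0, 1]

print(np.isclose(pc1 @ pc2, 0.0))             # directions are perpendicular
print(np.isclose(score_corr, 0.0))            # component scores are uncorrelated
```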

Question 9

Q

Define Eigenvalue

What does the Eigenvalue indicate and represent

Answer

A

An eigenvalue is a measure of distance. It’s the sum of squared distances between the projected points and the origin along the line of best fit (divided by n - 1 when computed from the covariance matrix). The bigger, the better.

The eigenvalue indicates how much variance there is in the data, or how spread out the data is along the line.

It represents the total amount of variance that can be explained by a given principal component.
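The link between the sum of squared distances and the eigenvalue can be verified directly. This is a sketch on synthetic three-variable data (the data and mixing matrix are assumptions); it relies on NumPy's `np.cov`, which divides by n - 1 by default.

```python
import numpy as np

rng = np.random.default_rng(3)
mixing = np.array([[2.0, 0.5, 0.0],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 0.5]])
X = rng.normal(size=(400, 3)) @ mixing        # correlated synthetic variables
Xc = X - X.mean(axis=0)
n = Xc.shape[0]

eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
top_value, top_vector = eigvals[-1], eigvecs[:, -1]

projected = Xc @ top_vector                   # signed distance of each point along PC1
ss_distances = np.sum(projected ** 2)         # sum of squared distances from the origin

print(np.isclose(ss_distances / (n - 1), top_value))   # eigenvalue = SS / (n - 1)
print(top_value / eigvals.sum())              # share of total variance explained by PC1
```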

Question 10

Q

Define Eigenvector

Answer

A

An eigenvector is a direction (the direction along which the data varies).
The eigenvector with the highest eigenvalue is the first principal component.

Question 11

Q

By ranking eigenvectors in order of their eigenvalues highest to lowest….

Answer

A

You get the principal components in order of significance
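A short sketch of the ranking step, assuming synthetic data with four variables of different spread. `np.linalg.eigh` returns eigenvalues in ascending order, so the order is reversed to list the components from most to least significant.

```python
import numpy as np

rng = np.random.default_rng(4)
spreads = np.array([3.0, 2.0, 1.0, 0.5])
X = rng.normal(size=(300, 4)) * spreads       # columns with different spreads
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)

eigvals, eigvecs = np.linalg.eigh(cov)

order = np.argsort(eigvals)[::-1]             # indices from largest to smallest eigenvalue
ranked_values = eigvals[order]
ranked_vectors = eigvecs[:, order]            # column k is the (k + 1)-th principal component

for k, value in enumerate(ranked_values, start=1):
    print(f"PC{k}: eigenvalue = {value:.3f}")
```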

Question 12

Q

The eigenvector with the highest eigenvalue is…..

Answer

A

The first principal component

Question 13

Q

What is a scree plot?

Answer

A

A plot of the total variance explained by each component (its eigenvalue) against its respective component.

Question 14

Q

What does the inflection point on a scree plot indicate?

Answer

A

The inflection point represents the point where the graph begins to level out, as subsequent components account for little of the total variance.
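To make the levelling-out visible without a plotting library, the sketch below draws a text-only scree plot and lists the drop between consecutive eigenvalues. The eigenvalues are hypothetical, chosen only for illustration, and `scree_lines` is a helper invented for this example.

```python
def scree_lines(eigenvalues, width=40):
    """Return one text bar per component, scaled to the largest eigenvalue."""
    largest = max(eigenvalues)
    lines = []
    for k, value in enumerate(eigenvalues, start=1):
        bar = "#" * round(width * value / largest)
        lines.append(f"PC{k} {value:5.2f} {bar}")
    return lines

# Hypothetical eigenvalues, already sorted from largest to smallest.
eigenvalues = [2.8, 1.5, 0.8, 0.5, 0.25, 0.15]

for line in scree_lines(eigenvalues):
    print(line)

# Drop in eigenvalue between consecutive components; the curve "levels out"
# where these drops become small.
drops = [a - b for a, b in zip(eigenvalues, eigenvalues[1:])]
print([round(d, 2) for d in drops])
```

Where exactly the curve levels out remains a judgment call; the list of drops only makes that judgment easier to discuss.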

Question 15

Q

State the 4 assumptions that must be met in order to run a PCA

Answer

A

Have multiple continuous variables
There should be a linear relationship between all the variables
There should be no outliers
There should be a large sample size for PCA to produce reliable results

Question 16

Q

What is the goal of post-PCA rotation?

Answer

A

The goal of rotation is to improve the interpretability of the output.

Component rotations help us interpret component loadings.

Question 17

Q

What are the two types of rotation?

Answer

A

Orthogonal rotation: assumes that components are independent or uncorrelated with each other.
Example: Varimax rotation

Oblique rotation: components are not independent and are correlated.
Example: Direct oblimin
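As an illustration of an orthogonal rotation, here is a minimal sketch of one commonly used iterative scheme for varimax, based on repeated singular value decompositions. The loading matrix is hypothetical, and this is not presented as the implementation used by any particular statistics package; oblique rotations such as direct oblimin are not covered.

```python
import numpy as np

def varimax(loadings, max_iter=100, tol=1e-8):
    """Orthogonally rotate a (variables x components) loading matrix."""
    p, k = loadings.shape
    rotation = np.eye(k)
    previous = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        # Gradient of the varimax criterion with respect to the rotation.
        gradient = loadings.T @ (
            rotated ** 3 - rotated @ np.diag(np.sum(rotated ** 2, axis=0)) / p
        )
        u, s, vt = np.linalg.svd(gradient)
        rotation = u @ vt                      # orthogonal matrix closest to the gradient
        current = np.sum(s)
        if previous != 0.0 and current / previous < 1 + tol:
            break                              # criterion has stopped improving
        previous = current
    return loadings @ rotation, rotation

# Hypothetical loadings of four variables on two components.
loadings = np.array([[0.7, 0.5],
                     [0.8, 0.4],
                     [0.6, -0.6],
                     [0.5, -0.7]])
rotated, rotation = varimax(loadings)
print(np.round(rotated, 2))
print(np.allclose(rotation.T @ rotation, np.eye(2)))   # rotation is orthogonal
```

Because the rotation is orthogonal, each variable's communality (its row sum of squared loadings) is unchanged; only how the variance is shared between the components changes.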

Question 18

Q

When is varimax rotation not recommended, and why?

Answer

A

It shouldn’t be used for clinical datasets, because biological cascades (inter-relationships) mean the components are likely to be correlated,
e.g. organ systems.

Question 19

Q

List the steps of PCA

Answer

A

Identify linear associations in the data (e.g. to help identify disease systems)
Identify the Principal components based on high eigenvalues
Choose rotation (Varimax or direct oblimin)
Linear regression to predict a continuous variable of choice.
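The steps above can be sketched end to end on synthetic data. Everything here is an assumption for illustration: the two hidden factors, the four observed variables, the outcome `y`, and the choice to retain components with an eigenvalue above 1. The rotation step is skipped, since it changes the interpretation of the components and not the fit of this regression.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 300
f1, f2 = rng.normal(size=n), rng.normal(size=n)        # two hidden factors
X = np.column_stack([
    f1 + 0.3 * rng.normal(size=n),
    f1 + 0.3 * rng.normal(size=n),
    f2 + 0.3 * rng.normal(size=n),
    f2 + 0.3 * rng.normal(size=n),
])
y = 2.0 * f1 - 1.0 * f2 + 0.5 * rng.normal(size=n)     # continuous outcome

# Steps 1-2: standardise, then take eigenvalues of the correlation matrix.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
eigvals, eigvecs = np.linalg.eigh(np.corrcoef(Z, rowvar=False))
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
keep = eigvals > 1.0                                    # retain high-eigenvalue components
scores = Z @ eigvecs[:, keep]

# Step 3 (rotation) is skipped in this sketch.
# Step 4: ordinary least squares of y on the retained component scores.
design = np.column_stack([np.ones(n), scores])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
residuals = y - design @ coef
r_squared = 1 - residuals.var() / y.var()
print(int(keep.sum()), round(float(r_squared), 3))
```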

Question 20

Q

How do you decide how many principal components to retain?
Which techniques are used?

Answer

A

The eigenvalue-one criterion - An eigenvalue less than one indicates that the component explains less variance than a single original variable would, and hence it shouldn’t be retained.
Percentage of variance explained - As the component number increases, each subsequent component explains less of the total variance. It has been suggested that a component should only be retained if it explains at least 5% to 10% of the total variance.
Scree plot - A scree plot is a plot of the total variance explained by each component (its eigenvalue) against its respective component. The inflection point represents the point where the graph begins to level out, as subsequent components account for little of the total variance.
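The first two criteria are simple enough to apply mechanically. The sketch below uses hypothetical eigenvalues from a PCA on six standardised variables (so they sum to 6); the helper name `by_share` is invented for this example.

```python
# Hypothetical eigenvalues from a PCA on six standardised variables (they sum to 6).
eigenvalues = [2.8, 1.5, 0.8, 0.5, 0.25, 0.15]
total = sum(eigenvalues)

# Eigenvalue-one criterion: keep components with an eigenvalue above 1.
kaiser = [k for k, value in enumerate(eigenvalues, start=1) if value > 1.0]

# Percentage of variance: keep components explaining at least the chosen share.
def by_share(threshold):
    return [k for k, value in enumerate(eigenvalues, start=1)
            if value / total >= threshold]

print(kaiser)          # components kept by the eigenvalue-one criterion
print(by_share(0.10))  # components kept at a 10% threshold
print(by_share(0.05))  # components kept at a 5% threshold
```

With these numbers the criteria disagree (two, three, and four components respectively), which is why they are usually considered together with the scree plot.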

Question 21

Q

What is the inflection point on a Scree plot?

Answer

A

The inflection point represents the point where the graph begins to level out, as subsequent components account for little of the total variance

Question 22

Q

What is orthogonal rotation?
Give an example

Answer

A

This assumes that components are independent or uncorrelated with each other.

Varimax rotation

Question 23

Q

What is oblique rotation?
Give an example

Answer

A

Components are not independent and are correlated.

Direct Oblimin

(23 cards)