Principal component analysis Flashcards
Define multicollinearity
Multicollinearity occurs when several independent variables are correlated with one another.
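A quick way to see multicollinearity is to compute the correlation between pairs of independent variables. The sketch below uses made-up data (x1, x2 and x3 are hypothetical, not from the course material) and a hand-written Pearson correlation helper.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    ss_a = sum((x - mean_a) ** 2 for x in a)
    ss_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(ss_a * ss_b)

# Hypothetical independent variables: x2 is roughly 2 * x1, x3 is unrelated to x1.
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2.1, 3.9, 6.2, 8.1, 9.8, 12.0]
x3 = [5, 1, 4, 2, 6, 3]

r12 = pearson(x1, x2)  # close to 1: x1 and x2 carry almost the same information
r13 = pearson(x1, x3)  # close to 0: x1 and x3 are not collinear
print(f"r(x1, x2) = {r12:.3f}")
print(f"r(x1, x3) = {r13:.3f}")
```

A correlation near +1 or -1 between two independent variables suggests they are largely redundant, which is the situation PCA is often used to deal with.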
List some reasons why data reduction techniques are needed
Big data sets with many variables have high dimensionality
Often only a few important features account for most of the information in the dataset.
Data reduction allows for the extraction of necessary information from a huge array of data.
Removing redundant data reduces noise
To see valid data and identify potential patterns in the data
What are the types of data reduction techniques?
Dimensionality reduction: Reducing the number of dimensions the data is spread across e.g.
Principal component analysis
Wavelet transform
Numerosity reduction: uses alternative, smaller forms of data representation to reduce data volume e.g.
Histograms (for data with multiple attributes)
Outline Principal Component Analysis (PCA) (aims)
PCA is a linear variable reduction technique
PCA aims to reduce a larger set of variables into a smaller set of ‘artificial’ variables (called principal components) that account for most of the variance in the original variables.
It models the interrelationships in the data set, focusing on variance and covariance: the higher the variance, the wider the spread of the data.
What are the 3 major problems PCA can solve?
Remove unrelated variables
Remove redundancy in a set of variables
Remove multicollinearity
Describe how Principal Component Analysis works.
To perform a PCA, a line of best fit through the centre of the data needs to be found.
The data are projected onto a candidate line.
The fit is measured by the sum of the squared distances from the projected points to the centre of the data.
The line is rotated until the largest sum of squared distances is found.
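The rotation idea can be sketched in a few lines. This is only an illustration under stated assumptions: the data points are made up, the search is a brute-force scan in 0.1 degree steps, and real PCA software finds the direction with an eigendecomposition rather than by trying angles.

```python
import math

# Hypothetical toy data: two correlated variables (x, y).
points = [(2.0, 1.0), (3.0, 3.0), (4.0, 3.0), (5.0, 5.0), (6.0, 6.0), (7.0, 6.0)]

# Centre the data so every candidate line passes through the origin.
n = len(points)
mean_x = sum(x for x, _ in points) / n
mean_y = sum(y for _, y in points) / n
centred = [(x - mean_x, y - mean_y) for x, y in points]

def sum_squared_projections(angle_deg):
    """Sum of squared distances from the origin to each point's projection
    onto a line through the origin at the given angle."""
    ux = math.cos(math.radians(angle_deg))
    uy = math.sin(math.radians(angle_deg))
    return sum((x * ux + y * uy) ** 2 for x, y in centred)

# Rotate the line through half a turn and keep the angle with the largest sum.
angles = [i / 10 for i in range(1800)]
best_angle = max(angles, key=sum_squared_projections)
best_ss = sum_squared_projections(best_angle)
print(f"best angle: {best_angle:.1f} degrees, sum of squares: {best_ss:.3f}")
```

The angle that wins the scan gives the direction of the line of best fit for this toy data set.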
What is the goal of PCA?
To model the interrelationships in the data.
First principal component and second principal component
What is the condition…
The line of best fit describes the greatest variance; this is the first principal component
The second principal component is perpendicular to the first and accounts for the next highest variance
The condition is that it is uncorrelated with the first principal component
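The sketch below checks both properties on made-up two-variable data: the two component directions are perpendicular, and the scores on them are uncorrelated. The data and variable names are hypothetical, and the closed-form angle used here only applies to the two-variable case.

```python
import math

# Hypothetical toy data: two correlated variables (x, y).
points = [(2.0, 1.0), (3.0, 3.0), (4.0, 3.0), (5.0, 5.0), (6.0, 6.0), (7.0, 6.0)]
n = len(points)
mean_x = sum(x for x, _ in points) / n
mean_y = sum(y for _, y in points) / n
centred = [(x - mean_x, y - mean_y) for x, y in points]

# Entries of the sample covariance matrix [[sxx, sxy], [sxy, syy]].
sxx = sum(x * x for x, _ in centred) / (n - 1)
syy = sum(y * y for _, y in centred) / (n - 1)
sxy = sum(x * y for x, y in centred) / (n - 1)

# Direction of greatest variance for two variables (closed form).
theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
pc1 = (math.cos(theta), math.sin(theta))
pc2 = (-math.sin(theta), math.cos(theta))  # pc1 rotated by 90 degrees

# Scores: position of each centred point along each component.
scores1 = [x * pc1[0] + y * pc1[1] for x, y in centred]
scores2 = [x * pc2[0] + y * pc2[1] for x, y in centred]

dot = pc1[0] * pc2[0] + pc1[1] * pc2[1]
var1 = sum(s * s for s in scores1) / (n - 1)
var2 = sum(s * s for s in scores2) / (n - 1)
cov12 = sum(a * b for a, b in zip(scores1, scores2)) / (n - 1)

print(f"dot product of directions: {dot:.6f}")
print(f"variance on PC1: {var1:.4f}, variance on PC2: {var2:.4f}")
print(f"covariance between PC1 and PC2 scores: {cov12:.6f}")
```

The dot product and the covariance both come out at zero (up to rounding), and the first component carries more variance than the second.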
Define Eigenvalue
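What does the eigenvalue indicate and represent?
An eigenvalue measures spread rather than a single distance. It is based on the sum of squared distances from the projected points to the centre of the data, measured along the line of best fit (for a covariance matrix, this sum divided by n - 1). The bigger the better.
The eigenvalue indicates how much variance there is in the data along the line, i.e. how spread out the data are on it.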
It represents the total amount of variance that can be explained by a given principal component.
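As a rough check of this idea, the sketch below compares the largest eigenvalue of a covariance matrix with the sum of squared projected distances along the first principal component, divided by n - 1. The data are made up, and the closed-form eigenvalue formula is specific to two variables.

```python
import math

# Hypothetical toy data: two correlated variables (x, y).
points = [(2.0, 1.0), (3.0, 3.0), (4.0, 3.0), (5.0, 5.0), (6.0, 6.0), (7.0, 6.0)]
n = len(points)
mean_x = sum(x for x, _ in points) / n
mean_y = sum(y for _, y in points) / n
centred = [(x - mean_x, y - mean_y) for x, y in points]

# Entries of the sample covariance matrix [[sxx, sxy], [sxy, syy]].
sxx = sum(x * x for x, _ in centred) / (n - 1)
syy = sum(y * y for _, y in centred) / (n - 1)
sxy = sum(x * y for x, y in centred) / (n - 1)

# Largest eigenvalue of a symmetric 2x2 matrix (closed form).
trace = sxx + syy
det = sxx * syy - sxy ** 2
largest_eigenvalue = (trace + math.sqrt(trace ** 2 - 4 * det)) / 2

# Project each point onto the first principal component and sum the squares.
theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
projections = [x * math.cos(theta) + y * math.sin(theta) for x, y in centred]
ss_along_line = sum(p ** 2 for p in projections)

# Share of the total variance explained by this component.
share = largest_eigenvalue / trace

print(f"largest eigenvalue:        {largest_eigenvalue:.4f}")
print(f"sum of squares / (n - 1):  {ss_along_line / (n - 1):.4f}")
print(f"share of total variance:   {share:.1%}")
```

The first two printed numbers agree, which is why the eigenvalue can be read as the variance explained by its component.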
Define Eigenvector
An eigenvector is a direction: the direction along which the data vary.
The eigenvector with the highest eigenvalue is the first principal component
By ranking eigenvectors in order of their eigenvalues highest to lowest….
You get the principal components in order of significance
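The ranking step can be sketched as follows. The covariance matrix values are hypothetical, and the eigenvalue and eigenvector formulas are the closed forms for a symmetric 2x2 matrix (valid here because the off-diagonal entry is not zero); for more variables a numerical library would normally be used instead.

```python
import math

# Hypothetical 2x2 covariance matrix [[sxx, sxy], [sxy, syy]] for two variables.
sxx, sxy, syy = 3.5, 3.6, 4.0

trace = sxx + syy
det = sxx * syy - sxy ** 2
root = math.sqrt(trace ** 2 - 4 * det)
eigenvalues = [(trace - root) / 2, (trace + root) / 2]  # deliberately unsorted

def unit_eigenvector(eigenvalue):
    """Unit eigenvector of the matrix above for the given eigenvalue.
    Assumes sxy != 0."""
    vx, vy = sxy, eigenvalue - sxx
    length = math.hypot(vx, vy)
    return (vx / length, vy / length)

pairs = [(value, unit_eigenvector(value)) for value in eigenvalues]

# Rank from highest to lowest eigenvalue: position 0 is PC1, position 1 is PC2.
ranked = sorted(pairs, key=lambda pair: pair[0], reverse=True)
for rank, (value, vector) in enumerate(ranked, start=1):
    print(f"PC{rank}: eigenvalue={value:.4f}, "
          f"direction=({vector[0]:.4f}, {vector[1]:.4f})")
```

After sorting, the first entry is the direction with the most variance and each later entry explains less.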
The highest eigenvalue belongs to…..
The first principal component
What is a Scree plot?
A plot of the variance explained by each component (its eigenvalue) against the component number.
What does the inflection point on a scree plot indicate?
The inflection point (the elbow) is where the graph begins to level out, because subsequent components account for little of the total variance.
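A text-only version of a scree plot can make the levelling-off easier to picture. The eigenvalues below are hypothetical and assumed to be already ranked from highest to lowest; the sketch only shows how the proportions behind such a plot are computed.

```python
# Hypothetical eigenvalues for five components, ranked highest to lowest.
eigenvalues = [4.2, 1.8, 0.5, 0.3, 0.2]

# Proportion of the total variance explained by each component.
total = sum(eigenvalues)
proportions = [value / total for value in eigenvalues]

# Running total of the proportions.
cumulative = []
running = 0.0
for share in proportions:
    running += share
    cumulative.append(running)

# One bar per component; bar length is proportional to the variance explained.
for number, (share, cum) in enumerate(zip(proportions, cumulative), start=1):
    bar = "#" * round(share * 50)
    print(f"PC{number} {bar:<30} {share:6.1%} (cumulative {cum:6.1%})")
```

In this made-up example the bars drop sharply after the second component and then flatten, so the first two components would be the natural ones to keep.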
State the 4 assumptions that must be met in order to run a PCA
Have multiple continuous variables
There should be a linear relationship between all the variables
There should be no significant outliers
There should be a large sample size for PCA to produce reliable results