Lecture 5 - Dimensionality Reduction - Principal Component Analysis, Linear Discriminant Analysis, Singular Value Decomposition Flashcards
What is meant by “Degrees of freedom”?
Degrees of Freedom refers to the maximum number of logically independent values, which are values that have the freedom to vary, in the data sample
What is dimensionality reduction?
Dimensionality reduction is the process of deriving a set of degrees of freedom which can be used to reproduce most of the variability of a data set
What is the goal of Dimensionality Reduction? And in broad terms how does it work?
Goal: To reduce dimensions by removing redundant and dependent features
How: By transforming features from higher dimensional space to a lower dimensional space
What are the different methods that can help us reduce dimensions?
Unsupervised, where there is no need for labelling classes of data:
- Independent Component Analysis (ICA)
- Non-negative Matrix Factorization (NMF)
- Principal Component Analysis (PCA)
- Ideal for visualization and noise removal
Supervised, where class labels are considered:
- Mixture Discriminant Analysis (MDA)
- Linear Discriminant Analysis (LDA)
- Ideal for biometrics, bioinformatics, and chemistry
What is Principal Component Analysis (PCA)?
PCA is
- A popular technique for dimensionality reduction.
- A “classical” approach that only characterizes linear subspaces in data
- Involves a dataset with observations on numerical variables
- An exploratory data analysis tool
- A simple, non-parametric method of extracting relevant information from data sets
How does PCA reduce dimensions?
PCA reduces dimensions by exposing underlying information in data sets
- An unsupervised approach
- Aims to explain most of the variability in data with a smaller number of variables
- Identifies the axis that accounts for the largest amount of variance in the training set
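As a rough illustration (not from the lecture), this is what the unsupervised projection looks like with scikit-learn's PCA; the toy dataset and the choice of 2 components are assumptions:

```python
# Minimal sketch: project made-up data onto its 2 highest-variance axes.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))          # 100 observations, 5 numerical features (made-up data)

pca = PCA(n_components=2)              # keep the 2 axes with the largest variance
X_reduced = pca.fit_transform(X)       # project onto the first 2 principal components

print(X_reduced.shape)                 # (100, 2)
print(pca.explained_variance_ratio_)   # fraction of variance captured by each component
```

Here explained_variance_ratio_ reports how much of the total variance each retained component accounts for.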
You should not use PCA if the data is…
showing significant non-linearity, since PCA can only characterize linear subspaces
What are the three different types of PCA?
- Randomized PCA quickly finds an approximation of the first d principal components.
- Issue: the whole training set needs to fit in memory
- Incremental PCA (IPCA) splits the training set into mini-batches and feeds an IPCA algorithm one mini-batch at a time
- Kernel PCA helps perform complex nonlinear projections for dimensionality reduction
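A minimal sketch of the three variants, assuming scikit-learn and a made-up dataset (the batch count, component counts, and gamma value are arbitrary):

```python
import numpy as np
from sklearn.decomposition import PCA, IncrementalPCA, KernelPCA

X = np.random.default_rng(1).normal(size=(1000, 20))

# Randomized PCA: fast approximation of the first d principal components.
rand_pca = PCA(n_components=5, svd_solver="randomized", random_state=42)
X_rand = rand_pca.fit_transform(X)

# Incremental PCA: feed the training set in mini-batches
# (useful when the whole set does not fit in memory).
inc_pca = IncrementalPCA(n_components=5)
for batch in np.array_split(X, 10):
    inc_pca.partial_fit(batch)
X_inc = inc_pca.transform(X)

# Kernel PCA: nonlinear projection via the kernel trick (here an RBF kernel).
kernel_pca = KernelPCA(n_components=5, kernel="rbf", gamma=0.04)
X_kernel = kernel_pca.fit_transform(X)
```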
How can we calculate the PCA?
Primary PCA calculation steps:
- Calculate covariance matrix
- Calculate ordered eigenvalues and eigenvectors of the matrix
- Compute principal components
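A minimal NumPy sketch of these three steps on a small made-up dataset (the sizes and the choice of k are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))                      # 200 samples, 4 features
X_centered = X - X.mean(axis=0)                    # PCA assumes zero-centered data

# 1) Covariance matrix of the features
cov = np.cov(X_centered, rowvar=False)             # shape (4, 4)

# 2) Eigenvalues/eigenvectors, ordered from largest to smallest eigenvalue
eigvals, eigvecs = np.linalg.eigh(cov)             # eigh: the covariance matrix is symmetric
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 3) Principal components: project the data onto the top-k eigenvectors
k = 2
X_pca = X_centered @ eigvecs[:, :k]                # shape (200, 2)
```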
How do we calculate the Principal Components?
Overall PC calculation process:
- For each PC:
- PCA finds a zero-centered unit vector pointing in the direction of the PC.
- The direction of the unit vectors returned by PCA is not stable
- If you perturb the training set slightly and run PCA again, the unit vectors may point in the opposite direction from the original vectors
- Still, they will lie on the same axes
(Don’t know how important it is to remember this)
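A small sketch (my own, not from the slides) that illustrates the sign instability by perturbing the data slightly and comparing the first component:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))

pc1_a = PCA(n_components=1).fit(X).components_[0]
pc1_b = PCA(n_components=1).fit(X + rng.normal(scale=1e-3, size=X.shape)).components_[0]

# The dot product is close to +1 or -1: the axis is the same, but the sign may flip.
print(np.dot(pc1_a, pc1_b))
```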
What are the key characteristics of Linear Discriminant Analysis(LDA)?
Linear Discriminant analysis
- Works as a pre-processing step
- Is a supervised technique
What are the different types of LDA?
Types to deal with classes: Class-dependent and class-independent
Class-dependent LDA: A separate lower dimensional space is calculated for each class, and that class's data is projected onto it
Class-independent LDA: Each class is considered as one class against all the other classes; there is just one lower dimensional space onto which all classes project their data
What are the steps of calculating LDA?
Goal: Project original data matrix onto a lower dimensional space.
Step 1: Between-class variance/matrix: Calculate the separability between different classes (i.e. the distance between the means of the different classes)
Step 2: Within-class variance/matrix: Calculate the distance between the mean and the samples of each class
Step 3: Construct the lower dimensional space by maximizing the between-class variance and minimizing the within-class variance
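A rough NumPy sketch of these three steps for class-independent LDA on made-up two-class data (all names and sizes are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(50, 3)), rng.normal(2, 1, size=(50, 3))])
y = np.array([0] * 50 + [1] * 50)

overall_mean = X.mean(axis=0)
S_B = np.zeros((3, 3))   # between-class scatter (separability between class means)
S_W = np.zeros((3, 3))   # within-class scatter (spread of samples around their class mean)

for c in np.unique(y):
    Xc = X[y == c]
    mean_c = Xc.mean(axis=0)
    diff = (mean_c - overall_mean).reshape(-1, 1)
    S_B += len(Xc) * diff @ diff.T
    S_W += (Xc - mean_c).T @ (Xc - mean_c)

# Maximize between-class variance relative to within-class variance:
# take the leading eigenvectors of S_W^-1 S_B and project onto them.
eigvals, eigvecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)
order = np.argsort(eigvals.real)[::-1]
W = eigvecs[:, order[:1]].real                     # at most (number of classes - 1) discriminant axes
X_lda = X @ W
```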
What are the issues with LDA?
Issues:
Small Sample Problem (SSP): LDA fails to find the lower dimensional space
- If the number of dimensions > the number of samples
- Here the within-class matrix becomes singular
Linearity problem: Cannot discriminate between classes
- If the different classes are not linearly separable
What are the differences in how LDA works vs PCA?
PCA detects the directions of maximal variance
LDA finds subspace that maximizes class separability
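A side-by-side sketch, assuming scikit-learn and the Iris dataset as stand-in data: PCA ignores the labels, while LDA uses them.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# PCA: directions of maximal variance (unsupervised, y is never used)
X_pca = PCA(n_components=2).fit_transform(X)

# LDA: subspace that maximizes class separability (supervised, y is required)
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)
```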
What is Singular Value Decomposition (SVD)?
SVD is a method for transforming correlated variables into a set of uncorrelated variables
- To better expose various relationships among original data items
SVD is a method for identifying and ordering dimensions along which data points exhibit most variations
SVD can also be seen as a method for data reduction
What are the basic steps of SVD?
Consider a high dimensional, highly variable set of data points
Reduce it to a lower dimensional space that exposes the substructure of the original data more clearly
Order the dimensions from most variation to least
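A minimal sketch of these steps with NumPy's SVD; the toy data and the rank k = 2 are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))

# Singular values in s come back already ordered from largest to smallest
U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2
X_reduced = U[:, :k] * s[:k]                       # coordinates in the top-k singular directions
X_approx = X_reduced @ Vt[:k]                      # low-rank reconstruction of the original data
```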
True or False: SVD is fast even when number of features grows
False. The SVD approach can get very slow when the number of features grows
True or False: SVD is fast even when number of samples grows
True. SVD can handle large training sets efficiently, provided they can fit in memory
True or False: Training a Linear Regression model with a large number of features is faster using Gradient Descent than using SVD
True
What is variability in data?
Variability (or dispersion) is the extent to which a distribution is stretched or squeezed. Common measures include the variance, standard deviation, interquartile range, and (I’ve also seen) range
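A quick NumPy sketch of these dispersion measures on a made-up sample:

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

print(np.var(x))                                    # variance
print(np.std(x))                                    # standard deviation
print(np.percentile(x, 75) - np.percentile(x, 25))  # interquartile range
print(x.max() - x.min())                            # range
```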
How does a PCA plot get plotted?
A PCA plot converts the correlations (or lack thereof) among all of the features into a 2D graph (or more dimensions, it depends)
Observations that are highly correlated cluster together
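A sketch of such a plot, assuming matplotlib and the Iris data as a stand-in (not from the lecture):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)
X_2d = PCA(n_components=2).fit_transform(X)

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y)           # similar observations tend to cluster together
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.show()
```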
How does PCA plot lines?
PCA finds the best fitting line by maximizing the sum of the squared distances from the projected points to the origin.
How many PC should you use?
In a general n(observations) x p(variables) data matrix X, there are up to min(n-1, p) PCs
But there is no fixed method you should use
I think you should use the number of PCs that you consider adequate to capture most of the variability of the data
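One common heuristic (an assumption on my part, not a fixed rule) is to keep enough PCs to explain, say, 95% of the variance; a sketch with scikit-learn:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)

pca = PCA().fit(X)                                 # fit with all min(n-1, p) components
cumulative = np.cumsum(pca.explained_variance_ratio_)
d = np.argmax(cumulative >= 0.95) + 1              # smallest number of PCs reaching 95% variance
print(d)

# Equivalently, scikit-learn accepts a target variance ratio directly:
X_reduced = PCA(n_components=0.95).fit_transform(X)
```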
How do you define PC 1 and PC 2
PC 1 is the linear combination of features that has the highest variance
PC 2 is the linear combination of features that has the second-highest variance; it is not correlated with PC 1 and is orthogonal (perpendicular) to PC 1
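A small sketch (assuming scikit-learn and the Iris data) that checks both properties:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
pca = PCA(n_components=2).fit(X)

pc1, pc2 = pca.components_                         # each PC is a unit-length linear combination of features
print(pca.explained_variance_)                     # variance along PC 1 >= variance along PC 2
print(np.dot(pc1, pc2))                            # ~0: the two directions are orthogonal
```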