3 ML Models of Dispersion (PCA, Clustering, Anomalies) Flashcards
Do visualizations give quantifiable insights?
No. Visualizations give a qualitative impression; quantifiable insight requires computed measures such as dispersion.
What is dispersion?
Dispersion indicates how much variation there is in a dataset.
Various possible measures of dispersion
- Number of distinct data points
- Radius of the minimum enclosing sphere
- Average squared Euclidean distance from the dataset mean m: (1/n) Σᵢ ∥xᵢ − m∥²
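A minimal NumPy sketch (toy data and names are my own, not from the deck) of how two of these measures could be computed; the minimum enclosing sphere needs a dedicated algorithm and is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                    # toy dataset: 100 points in 2 dimensions

# Number of distinct data points
n_distinct = len(np.unique(X, axis=0))

# Average squared Euclidean distance from the dataset mean m
m = X.mean(axis=0)
avg_sq_dist = np.mean(np.sum((X - m) ** 2, axis=1))

print(n_distinct, avg_sq_dist)
```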
Principal Component Analysis
Principal Component Analysis (PCA) is a statistical technique that transforms data to identify the directions (principal components) that maximize variance while minimizing reconstruction error, thereby reducing dimensionality while retaining essential information.
Why is it important to maximize dispersion in PCA?
Dispersion = variance = information; we want to retain as much information as possible.
The Lagrange Multiplier Method
The Lagrange multiplier method is used in PCA to maximize the variance of the data projections onto a direction u, subject to the constraint that u is a unit vector, by incorporating the constraint into the variance objective through the Lagrange function.
In PCA, the constraint is that the direction vector u must be a unit vector, which can be written as ∥u∥^2 = 1 (or equivalently, u⊤u = 1). This ensures that the projections of the data keep their relative scale and that u only captures the direction of the data’s variance.
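As a sketch of the underlying optimization in standard notation (C denotes the covariance matrix of the centered data; the symbols are assumed, not quoted from the card):

```latex
\max_{u}\; u^{\top} C u
\quad \text{s.t.} \quad u^{\top} u = 1,
\qquad
\mathcal{L}(u,\lambda) = u^{\top} C u - \lambda\,(u^{\top} u - 1),
\qquad
\nabla_{u}\mathcal{L} = 0 \;\Rightarrow\; C u = \lambda u .
```

So the optimal direction u is an eigenvector of C, and the maximized variance is the corresponding eigenvalue λ.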
How do I rank principal components?
Based on λ (the eigenvalue) from the Lagrangian formulation: components are ranked in decreasing order of eigenvalue, since a larger eigenvalue means more variance explained.
Is PCA able to work with outliers?
No. Outliers must be cleaned before doing PCA; because PCA is driven by variance, a few outliers can dominate the principal directions.
Data must be centered to capture true PCA direction, how is this done?
Set the mean to zero by subtracting each feature’s mean from the data points, so that every feature has a mean of 0. (This differs from standardization because the standard deviation after centering isn’t necessarily 1.)
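A minimal sketch of centering with NumPy (toy data assumed):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))   # toy data with a nonzero mean

# Centering: subtract each feature's mean so that every column has mean 0
X_centered = X - X.mean(axis=0)

print(X_centered.mean(axis=0))   # approximately [0, 0, 0]
print(X_centered.std(axis=0))    # unchanged, not forced to 1 (unlike standardization)
```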
Does PCA work for strongly non-Gaussian data?
No. Even after centering, PCA assumes the data does not vary locally in different directions; strongly non-Gaussian data (e.g. several clusters or a curved manifold) does, so a single set of global directions describes it poorly.
PCA process for multiple components
You take the data vectors x and identify the direction in which the data varies most (the first principal component). Each data vector can then be broken down into:
- What PCA can capture
- Residual (what PCA can’t capture)
From the residual, a second principal component can be found. Each additional component captures information from the original data that the previous components missed.
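An illustrative sketch of this component-by-component idea (not necessarily the exact algorithm from the course): extract the leading direction, subtract what it captures, and repeat on the residual.

```python
import numpy as np

def pca_by_deflation(X, n_components):
    """Find principal components one at a time, working on the residual."""
    Xc = X - X.mean(axis=0)           # center the data first
    residual = Xc.copy()
    components = []
    for _ in range(n_components):
        C = np.cov(residual, rowvar=False)
        eigvals, eigvecs = np.linalg.eigh(C)
        u = eigvecs[:, -1]            # direction of largest remaining variance
        components.append(u)
        # Residual: subtract the part of the data that u captures
        residual = residual - np.outer(residual @ u, u)
    return np.array(components)
```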
Code for directly computing all eigenvectors/eigenvalues and sorting them by decreasing eigenvalue for PCA in Python (the classical way, via the covariance matrix)
numpy.linalg.eigh
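A minimal sketch of the classical covariance-matrix route; only numpy.linalg.eigh is from the card, the function name and structure are assumptions:

```python
import numpy as np

def pca_eigh(X):
    """Classical PCA: eigendecomposition of the d x d covariance matrix."""
    Xc = X - X.mean(axis=0)                    # center the data
    C = np.cov(Xc, rowvar=False)               # d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)       # eigh: for symmetric matrices, ascending order
    order = np.argsort(eigvals)[::-1]          # re-sort by decreasing eigenvalue
    return eigvals[order], eigvecs[:, order]   # columns of the second array are the PCs
```

numpy.linalg.eigh is used rather than numpy.linalg.eig because the covariance matrix is symmetric; it returns eigenvalues in ascending order, hence the explicit re-sort.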
PCA Biplot
A visual way to look at data using PCA results.
Uses the 2 principal components (directions) that explain the most variation in the data; these two directions form the x- and y-axes. Each point on the graph represents an instance from the original data.
For each feature (column) in the original data, an arrow is drawn in the direction along which that feature most strongly influences the components. This arrow is called a “loading vector”. One feature can instead be shown by coloring the points rather than drawing its loading vector.
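A rough matplotlib sketch of a biplot under these conventions (all names are illustrative assumptions):

```python
import numpy as np
import matplotlib.pyplot as plt

def biplot(X, feature_names):
    """Scatter the data on the first two PCs and draw one loading arrow per feature."""
    Xc = X - X.mean(axis=0)
    C = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(C)
    order = np.argsort(eigvals)[::-1]
    V2 = eigvecs[:, order[:2]]                 # the two directions explaining the most variance
    scores = Xc @ V2                           # each data point as (PC1, PC2) coordinates
    plt.scatter(scores[:, 0], scores[:, 1], s=10, alpha=0.5)
    for j, name in enumerate(feature_names):   # loading vector of each original feature
        plt.arrow(0, 0, V2[j, 0], V2[j, 1], color="red", head_width=0.02)
        plt.annotate(name, (V2[j, 0], V2[j, 1]))
    plt.xlabel("PC1")
    plt.ylabel("PC2")
    plt.show()
```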
Generalized Variance in PCA
The total dispersion of the data; it is also called the trace.
It is the sum of the diagonal elements of the covariance matrix (the per-feature variances, where the covariance matrix captures the covariances between features), and it equals the sum of all the eigenvalues.
? = Sum of all eigenvalues of the covariance matrix
Trace (the total variance of the dataset)
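A quick numerical check of this identity on toy data (assumed, not from the deck):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                 # toy data
C = np.cov(X, rowvar=False)                   # covariance matrix
eigvals = np.linalg.eigvalsh(C)               # its eigenvalues

# The trace (sum of the diagonal, per-feature variances) equals the sum of the eigenvalues
print(np.trace(C), eigvals.sum())             # the two numbers agree up to rounding error
```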
How would a biplot with two PC’s with eigenvalues that equate to 80% and 15% of the total dispersion respectively look?
The points would lie close to a straight line (an elongated, nearly one-dimensional cloud).
This is because one direction is responsible for most of the information (variance), so the other coordinate is largely determined by it.
How would a biplot with two PC’s with eigenvalues that equate to 45% and 45% of the total dispersion respectively look?
It would look like a roughly circular, scattered cloud with no visible correlation.
This is because no single direction is responsible for most of the information (variance), so the two components are largely independent of each other.
PCA Scree Plot
Bar plot with one bar per principal component. The y-axis is the explained variance (eigenvalue) and the x-axis is the component name. (It can also be drawn as a cumulative plot.)
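A small sketch of a scree plot; the eigenvalues here are made-up illustrative numbers:

```python
import numpy as np
import matplotlib.pyplot as plt

eigvals = np.array([4.0, 1.5, 0.4, 0.1])      # illustrative eigenvalues, sorted decreasing
explained = eigvals / eigvals.sum()           # explained-variance ratio per component

plt.bar(range(1, len(explained) + 1), explained)
plt.xlabel("Principal component")
plt.ylabel("Explained variance ratio")
# Cumulative variant: plt.plot(range(1, len(explained) + 1), np.cumsum(explained))
plt.show()
```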
What is artifact removal?
Sometimes the principal component carrying the most information is not relevant to the problem, e.g. eye blinks in EEG recordings when the goal is to extract neuronal activity.
In this case we discard that principal component. Such an artifact typically shows up as a component with a high eigenvalue relative to the total dispersion.
What is denoising?
Ignoring principal components with small eigenvalues in order to focus on the main components that describe the data’s variation (i.e. removing noise).
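A minimal sketch covering both ideas, reconstructing the data from a chosen subset of components (keep only the large-eigenvalue components for denoising, or drop a specific component for artifact removal); all names are assumptions:

```python
import numpy as np

def pca_reconstruct(X, keep):
    """Reconstruct X from a chosen subset of principal components.

    keep=[0, 1]    -> denoising: keep only the two strongest components
    keep=[1, 2, 3] -> artifact removal: drop component 0 and keep the rest
    """
    mean = X.mean(axis=0)
    Xc = X - mean
    C = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(C)
    order = np.argsort(eigvals)[::-1]          # sort components by decreasing eigenvalue
    V = eigvecs[:, order][:, keep]             # selected principal directions
    return Xc @ V @ V.T + mean                 # project onto them and map back
```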
Problem with Covariance Matrix
The covariance matrix has dimensions d × d, where d is the number of features (columns) in the dataset. For many data types (such as images or biological samples), d can exceed 10,000, which makes the matrix very costly to store in memory.
SVD is an efficient alternative that avoids forming the covariance matrix, in particular when the number of rows (samples) is smaller than the number of columns (features).
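A sketch of the SVD route, which never materializes the d × d covariance matrix (function name assumed):

```python
import numpy as np

def pca_svd(X):
    """PCA via the SVD of the centered data matrix, without building the d x d covariance."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)   # s comes sorted in decreasing order
    eigvals = s ** 2 / (X.shape[0] - 1)                  # eigenvalues of the covariance matrix
    components = Vt.T                                    # principal directions as columns
    return eigvals, components
```

Because full_matrices=False only produces min(n, d) singular vectors, nothing of size d × d is ever stored when the number of samples is smaller than the number of features.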