Dimensionality Reduction Flashcards
Missing Value Ratio
Removing variables that have more missing values than some threshold (e.g., more than 50% missing) can be better than dropping samples or imputing, especially when the reason for the missing values isn't clear
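A minimal sketch with pandas, assuming the data already sits in a DataFrame; the 0.5 cutoff is illustrative only:

```python
import pandas as pd

# Sketch: drop columns whose fraction of missing values exceeds a threshold.
# df is assumed to be an existing DataFrame; 0.5 is an illustrative cutoff.
def drop_high_missing(df: pd.DataFrame, threshold: float = 0.5) -> pd.DataFrame:
    missing_ratio = df.isnull().mean()                    # fraction of NaNs per column
    keep = missing_ratio[missing_ratio <= threshold].index
    return df[keep]
```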
Low variance filter
Features with low variance give little information. Consider a feature with a value of 10 for all samples: it is not informative for the target
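A minimal sketch using scikit-learn's VarianceThreshold on made-up data; the threshold is illustrative, and since variance is scale-dependent the features would normally be scaled first:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([[10, 1.2, 0.3],
              [10, 0.8, 0.5],
              [10, 1.1, 0.1]])                  # first column is constant

selector = VarianceThreshold(threshold=0.01)    # illustrative cutoff
X_reduced = selector.fit_transform(X)           # constant column is dropped
print(selector.get_support())                   # [False  True  True]
```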
High correlation filter
High correlation between two non-target variables means they follow similar trends and likely carry similar information. This multicollinearity can noticeably degrade the performance of some models, particularly linear ones.
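A minimal sketch with pandas, assuming a numeric DataFrame; the 0.9 cutoff is an assumption, and which member of a correlated pair gets dropped is arbitrary here:

```python
import numpy as np
import pandas as pd

# Sketch: drop one feature out of every pair whose absolute correlation
# exceeds a chosen cutoff.
def drop_correlated(df: pd.DataFrame, cutoff: float = 0.9) -> pd.DataFrame:
    corr = df.corr().abs()
    # keep only the upper triangle so each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > cutoff).any()]
    return df.drop(columns=to_drop)
```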
Random forest
Comes with built-in feature importances, though permutation importance is safer (impurity-based importance is biased toward high-cardinality features). Also requires a target variable
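A minimal sketch on synthetic classification data, contrasting the two importance measures; all hyperparameters are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=500, n_features=10, n_informative=4, random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

print(rf.feature_importances_)                 # impurity-based importances
perm = permutation_importance(rf, X, y, n_repeats=10, random_state=0)
print(perm.importances_mean)                   # permutation importances (generally safer)
```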
Backward feature elimination and forward feature elimination
Backward elimination: train the model with all n features
Calculate model performance
Drop one variable at a time and retrain the model on the remaining n-1 features
Find the variable whose removal produces the smallest change in performance and drop it permanently
Repeat this process until no variable can be dropped
Forward selection: train a single-feature model for each feature separately
The variable with the best performance is chosen as the starting variable
Repeat this process and add one variable at a time. The variable that produces the highest increase in performance is retained.
Repeat until no significant improvement is seen in model
Both are time consuming and computationally expensive; a sketch using scikit-learn's greedy SequentialFeatureSelector is given below.
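A minimal sketch of both directions, assuming scikit-learn's SequentialFeatureSelector and the built-in diabetes dataset; its greedy, cross-validated selection approximates (but is not identical to) the manual procedure above, and 5 features is an arbitrary stopping point:

```python
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True)
model = LinearRegression()

# Forward: add the feature that improves the CV score the most, stop at 5 features.
forward = SequentialFeatureSelector(model, n_features_to_select=5, direction="forward").fit(X, y)
# Backward: drop the feature whose removal hurts the CV score the least.
backward = SequentialFeatureSelector(model, n_features_to_select=5, direction="backward").fit(X, y)

print(forward.get_support())    # boolean mask of selected features
print(backward.get_support())
```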
Factor analysis
Variables are grouped by their correlations, i.e., variables within a group have high correlation amongst themselves but low correlation with variables in other groups. Each group is a factor. There are far fewer factors than original dimensions in the data, but these factors are hard to observe individually.
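A minimal sketch using scikit-learn's FactorAnalysis on the iris data; reducing to 2 factors is an illustrative choice:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import FactorAnalysis

X, _ = load_iris(return_X_y=True)
fa = FactorAnalysis(n_components=2, random_state=0)
X_factors = fa.fit_transform(X)    # factor scores, shape (150, 2)
print(fa.components_)              # loadings: how each original variable maps onto each factor
```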
PCA
Extracts a new set of variables (principal components) from the original set as linear combinations of the original variables, where the weights are the eigenvectors of the data's covariance matrix (the corresponding eigenvalues give the variance each component explains). The first PC captures the largest possible share of the variance in the data; the second PC captures as much of the remaining variance as possible while being uncorrelated with the first PC, and so on.
Principal components are eigenvectors of the data's covariance matrix. Thus, the principal components are often computed by eigendecomposition of the data covariance matrix or singular value decomposition of the data matrix. PCA is the simplest of the true eigenvector-based multivariate analyses and is closely related to factor analysis.
A covariance matrix (also known as auto-covariance matrix, dispersion matrix, variance matrix, or variance–covariance matrix) is a square matrix giving the covariance between each pair of elements of a given random vector. Any covariance matrix is symmetric and positive semi-definite, and its main diagonal contains variances (i.e., the covariance of each element with itself).
$\operatorname{Cov}[X, X] = E[(X - \mu)(X - \mu)^\top] = E[XX^\top] - \mu\mu^\top$
PCA can be thought of as fitting a p-dimensional ellipsoid to the data, where each axis of the ellipsoid represents a principal component. If some axis of the ellipsoid is small, then the variance along that axis is also small. We compute the covariance matrix of the data and calculate the eigenvalues and corresponding eigenvectors of this covariance matrix. Then we normalize each of the orthogonal eigenvectors to turn them into unit vectors. Once this is done, each of the mutually orthogonal, unit eigenvectors can be interpreted as an axis of the ellipsoid fitted to the data. This choice of basis transforms the covariance matrix into a diagonalised form, with the diagonal elements representing the variance of each axis. The proportion of the variance that each eigenvector represents can be calculated by dividing the eigenvalue corresponding to that eigenvector by the sum of all eigenvalues.
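A minimal sketch of exactly this recipe (covariance, eigendecomposition, variance ratios, projection) with NumPy on random data; projecting onto 2 components is an illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X_centered = X - X.mean(axis=0)                     # centre the data first

cov = np.cov(X_centered, rowvar=False)              # covariance matrix of the features
eigvals, eigvecs = np.linalg.eigh(cov)              # eigh: symmetric matrix, orthonormal eigenvectors
order = np.argsort(eigvals)[::-1]                   # sort components by decreasing variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

explained_variance_ratio = eigvals / eigvals.sum()  # eigenvalue / sum of all eigenvalues
X_pca = X_centered @ eigvecs[:, :2]                 # project onto the first two principal components
print(explained_variance_ratio)
```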
Independent component analysis
PCA looks for uncorrelated factors, ICA looks for independent factors
If two variables are uncorrelated, there is no linear relation between them. If they are independent, the value of one gives no information about the other at all (a stronger condition). The algorithm assumes that the observed variables are linear mixtures of some unknown latent variables, and that these latent variables are mutually independent, i.e., they do not depend on one another.
For ICA:
X = WA
where X is the observations, W is the mixing matrix, and A is the source matrix (the independent components).
The most common measure of independence of the components is non-Gaussianity
Central limit theorem: the distribution of a sum of independent random variables tends toward a Gaussian, so a mixture of independent sources is "more Gaussian" than the sources themselves.
So we look for the transformation that makes each extracted component as non-Gaussian as possible, e.g., by maximizing its kurtosis. Kurtosis is the fourth-order moment of the distribution (a Gaussian has excess kurtosis of zero), so maximizing the absolute kurtosis pushes the distribution away from Gaussian.
See Stack Exchange for why maximizing non-Gaussianity recovers independent components
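A minimal sketch of blind source separation with scikit-learn's FastICA on two synthetic, non-Gaussian signals mixed by a made-up matrix:

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)
sources = np.c_[np.sin(2 * t), np.sign(np.sin(3 * t))]   # two independent, non-Gaussian signals
mixing = np.array([[1.0, 0.5], [0.4, 1.0]])               # arbitrary mixing matrix
X = sources @ mixing.T                                     # observed mixtures

ica = FastICA(n_components=2, random_state=0)
recovered = ica.fit_transform(X)    # estimated independent components
print(ica.mixing_)                  # estimated mixing matrix
```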
Projection based techniques
TSNE
Tries to map nearby points on the manifold to nearby points in the low-dimensional embedding, attempting to preserve local geometry. It computes pairwise similarities as probabilities in both the high-dimensional and the low-dimensional space.
High-dimensional Euclidean distances between data points are converted into conditional probabilities that represent similarities. In the low-dimensional space, similarities are computed in the same spirit but using a heavy-tailed Student-t distribution. t-SNE then minimizes the KL divergence between the two probability distributions.
Downsides: high loss of global information and slow computation time
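A minimal sketch using scikit-learn's TSNE on the built-in digits data; the perplexity value is illustrative:

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)                          # 64-dimensional inputs
tsne = TSNE(n_components=2, perplexity=30, random_state=0)   # perplexity chosen for illustration
X_2d = tsne.fit_transform(X)
print(X_2d.shape)                                            # (1797, 2)
```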
UMAP
Similar to t-SNE but builds a k-nearest-neighbour graph and optimizes the embedding with stochastic gradient descent.
It calculates the distances between points in high-dimensional space, projects them onto a low dimension, and calculates the distances in low-dimensional space, then uses SGD to minimize the difference between the two.
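A minimal sketch, assuming the third-party umap-learn package is installed; n_neighbors and min_dist values are illustrative:

```python
from sklearn.datasets import load_digits
import umap  # assumes the umap-learn package

X, y = load_digits(return_X_y=True)
# n_neighbors controls the size of the k-NN graph; min_dist how tightly points pack in low-d.
reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, n_components=2, random_state=0)
X_2d = reducer.fit_transform(X)
print(X_2d.shape)    # (1797, 2)
```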
ISOMAP
Tries to recover a full low-dimensional representation of a smooth non-linear manifold: think of unrolling a Swiss roll. It assumes that for nearby points the geodesic distance along the manifold is well approximated by the Euclidean distance, approximates longer geodesic distances with shortest paths through a nearest-neighbour graph, and then finds a low-dimensional embedding whose Euclidean distances match those geodesic distances.
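A minimal sketch using scikit-learn's Isomap on its synthetic Swiss roll; n_neighbors is illustrative and controls the graph used to approximate geodesic distances:

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

X, _ = make_swiss_roll(n_samples=1000, random_state=0)   # 3-D Swiss roll
iso = Isomap(n_neighbors=10, n_components=2)
X_2d = iso.fit_transform(X)                              # "unrolled" 2-D embedding
print(X_2d.shape)                                        # (1000, 2)
```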