Dimensionality Reduction Flashcards
Missing Value Ratio
Removing variables that have more missing values than some threshold (e.g., more than 50% missing) can be better than dropping samples or imputing, especially when the reason for the missing values isn't clear
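A minimal sketch with pandas, assuming the data already sits in a DataFrame; the 0.5 cutoff is illustrative only:

```python
import pandas as pd

# Sketch: drop columns whose fraction of missing values exceeds a threshold.
# df is assumed to be an existing DataFrame; 0.5 is an illustrative cutoff.
def drop_high_missing(df: pd.DataFrame, threshold: float = 0.5) -> pd.DataFrame:
    missing_ratio = df.isnull().mean()                    # fraction of NaNs per column
    keep = missing_ratio[missing_ratio <= threshold].index
    return df[keep]
```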
Low variance filter
Features with low variance give little information. Consider a feature with a value of 10 for all samples: it is not informative for the target
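A minimal sketch using scikit-learn's VarianceThreshold on made-up data; the threshold is illustrative, and since variance is scale-dependent the features would normally be scaled first:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([[10, 1.2, 0.3],
              [10, 0.8, 0.5],
              [10, 1.1, 0.1]])                  # first column is constant

selector = VarianceThreshold(threshold=0.01)    # illustrative cutoff
X_reduced = selector.fit_transform(X)           # constant column is dropped
print(selector.get_support())                   # [False  True  True]
```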
High correlation filter
High correlation between two non-target variables means they follow similar trends and likely carry similar information. This multicollinearity can noticeably degrade the performance of some models, particularly linear ones.
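A minimal sketch with pandas, assuming a numeric DataFrame; the 0.9 cutoff is an assumption, and which member of a correlated pair gets dropped is arbitrary here:

```python
import numpy as np
import pandas as pd

# Sketch: drop one feature out of every pair whose absolute correlation
# exceeds a chosen cutoff.
def drop_correlated(df: pd.DataFrame, cutoff: float = 0.9) -> pd.DataFrame:
    corr = df.corr().abs()
    # keep only the upper triangle so each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > cutoff).any()]
    return df.drop(columns=to_drop)
```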
Random forest
Comes with built-in feature importances, though permutation importance is safer (impurity-based importance is biased toward high-cardinality features). Also requires a target variable
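A minimal sketch on synthetic classification data, contrasting the two importance measures; all hyperparameters are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=500, n_features=10, n_informative=4, random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

print(rf.feature_importances_)                 # impurity-based importances
perm = permutation_importance(rf, X, y, n_repeats=10, random_state=0)
print(perm.importances_mean)                   # permutation importances (generally safer)
```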
Backward feature elimination and forward feature elimination
Backward elimination: train the model with all n features
Calculate model performance
Drop one variable at a time and retrain the model on the remaining n-1 features
Find the variable whose removal produces the smallest change in performance and drop it permanently
Repeat this process until no variable can be dropped
Forward selection: train a single-feature model for each feature separately
The variable with the best performance is chosen as the starting variable
Repeat this process and add one variable at a time. The variable that produces the highest increase in performance is retained.
Repeat until no significant improvement is seen in model
Both are time consuming and computationally expensive; a sketch using scikit-learn's greedy SequentialFeatureSelector is given below.
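A minimal sketch of both directions, assuming scikit-learn's SequentialFeatureSelector and the built-in diabetes dataset; its greedy, cross-validated selection approximates (but is not identical to) the manual procedure above, and 5 features is an arbitrary stopping point:

```python
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True)
model = LinearRegression()

# Forward: add the feature that improves the CV score the most, stop at 5 features.
forward = SequentialFeatureSelector(model, n_features_to_select=5, direction="forward").fit(X, y)
# Backward: drop the feature whose removal hurts the CV score the least.
backward = SequentialFeatureSelector(model, n_features_to_select=5, direction="backward").fit(X, y)

print(forward.get_support())    # boolean mask of selected features
print(backward.get_support())
```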
Factor analysis
Variables are grouped by their correlations, i.e., variables within a group have high correlation amongst themselves but low correlation with variables in other groups. Each group is a factor. There are far fewer factors than original dimensions in the data, but these factors are hard to observe individually.
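A minimal sketch using scikit-learn's FactorAnalysis on the iris data; reducing to 2 factors is an illustrative choice:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import FactorAnalysis

X, _ = load_iris(return_X_y=True)
fa = FactorAnalysis(n_components=2, random_state=0)
X_factors = fa.fit_transform(X)    # factor scores, shape (150, 2)
print(fa.components_)              # loadings: how each original variable maps onto each factor
```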
PCA
Extracts a new set of variables (principal components) from the original set as linear combinations of the original variables, where the weights are the eigenvectors of the data's covariance matrix (the corresponding eigenvalues give the variance each component explains). The first PC captures the largest possible share of the variance in the data; the second PC captures as much of the remaining variance as possible while being uncorrelated with the first PC, and so on.
Principal components are eigenvectors of the data's covariance matrix. Thus, the principal components are often computed by eigendecomposition of the data covariance matrix or singular value decomposition of the data matrix. PCA is the simplest of the true eigenvector-based multivariate analyses and is closely related to factor analysis.
A covariance matrix (also known as auto-covariance matrix, dispersion matrix, variance matrix, or variance–covariance matrix) is a square matrix giving the covariance between each pair of elements of a given random vector. Any covariance matrix is symmetric and positive semi-definite, and its main diagonal contains variances (i.e., the covariance of each element with itself).
$\operatorname{Cov}[X, X] = E[(X - \mu)(X - \mu)^\top] = E[XX^\top] - \mu\mu^\top$
PCA can be thought of as fitting a p-dimensional ellipsoid to the data, where each axis of the ellipsoid represents a principal component. If some axis of the ellipsoid is small, then the variance along that axis is also small. We compute the covariance matrix of the data and calculate the eigenvalues and corresponding eigenvectors of this covariance matrix. Then we normalize each of the orthogonal eigenvectors to turn them into unit vectors. Once this is done, each of the mutually orthogonal, unit eigenvectors can be interpreted as an axis of the ellipsoid fitted to the data. This choice of basis transforms the covariance matrix into a diagonalised form, with the diagonal elements representing the variance of each axis. The proportion of the variance that each eigenvector represents can be calculated by dividing the eigenvalue corresponding to that eigenvector by the sum of all eigenvalues.
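A minimal sketch of exactly this recipe (covariance, eigendecomposition, variance ratios, projection) with NumPy on random data; projecting onto 2 components is an illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X_centered = X - X.mean(axis=0)                     # centre the data first

cov = np.cov(X_centered, rowvar=False)              # covariance matrix of the features
eigvals, eigvecs = np.linalg.eigh(cov)              # eigh: symmetric matrix, orthonormal eigenvectors
order = np.argsort(eigvals)[::-1]                   # sort components by decreasing variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

explained_variance_ratio = eigvals / eigvals.sum()  # eigenvalue / sum of all eigenvalues
X_pca = X_centered @ eigvecs[:, :2]                 # project onto the first two principal components
print(explained_variance_ratio)
```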
Independent component analysis
PCA looks for uncorrelated factors, ICA looks for independent factors
If two variables are uncorrelated, there is no linear relation between them. If they are independent, the value of one gives no information about the other at all (a stronger condition). The algorithm assumes that the observed variables are linear mixtures of some unknown latent variables, and that these latent variables are mutually independent, i.e., they do not depend on one another.
For ICA:
X = WA
where X is the observations, W is the mixing matrix, and A is the source matrix (the independent components).
The most common measure of independence of the components is non-Gaussianity
Central limit theorem: the distribution of a sum of independent random variables tends toward a Gaussian, so a mixture of independent sources is "more Gaussian" than the sources themselves.
So we look for the transformation that makes each extracted component as non-Gaussian as possible, e.g., by maximizing its kurtosis. Kurtosis is the fourth-order moment of the distribution (a Gaussian has excess kurtosis of zero), so maximizing the absolute kurtosis pushes the distribution away from Gaussian.
See Stack Exchange for why maximizing non-Gaussianity recovers independent components
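A minimal sketch of blind source separation with scikit-learn's FastICA on two synthetic, non-Gaussian signals mixed by a made-up matrix:

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)
sources = np.c_[np.sin(2 * t), np.sign(np.sin(3 * t))]   # two independent, non-Gaussian signals
mixing = np.array([[1.0, 0.5], [0.4, 1.0]])               # arbitrary mixing matrix
X = sources @ mixing.T                                     # observed mixtures

ica = FastICA(n_components=2, random_state=0)
recovered = ica.fit_transform(X)    # estimated independent components
print(ica.mixing_)                  # estimated mixing matrix
```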
Projection based techniques
TSNE
Tries to map nearby points on the manifold to nearby points in the low-dimensional embedding, attempting to preserve local geometry. It computes pairwise similarities as probabilities in both the high-dimensional and the low-dimensional space.
High-dimensional Euclidean distances between data points are converted into conditional probabilities that represent similarities. In the low-dimensional space, similarities are computed in the same spirit but using a heavy-tailed Student-t distribution. t-SNE then minimizes the KL divergence between the two probability distributions.
Downsides: high loss of global information and slow computation time
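A minimal sketch using scikit-learn's TSNE on the built-in digits data; the perplexity value is illustrative:

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)                          # 64-dimensional inputs
tsne = TSNE(n_components=2, perplexity=30, random_state=0)   # perplexity chosen for illustration
X_2d = tsne.fit_transform(X)
print(X_2d.shape)                                            # (1797, 2)
```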
UMAP
Similar to t-SNE but builds a k-nearest-neighbour graph and optimizes the embedding with stochastic gradient descent.
It calculates the distances between points in high-dimensional space, projects them onto a low dimension, and calculates the distances in low-dimensional space, then uses SGD to minimize the difference between the two.
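A minimal sketch, assuming the third-party umap-learn package is installed; n_neighbors and min_dist values are illustrative:

```python
from sklearn.datasets import load_digits
import umap  # assumes the umap-learn package

X, y = load_digits(return_X_y=True)
# n_neighbors controls the size of the k-NN graph; min_dist how tightly points pack in low-d.
reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, n_components=2, random_state=0)
X_2d = reducer.fit_transform(X)
print(X_2d.shape)    # (1797, 2)
```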
ISOMAP
Tries to recover a full low-dimensional representation of a smooth non-linear manifold: think of unrolling a Swiss roll. It assumes that for nearby points the geodesic distance along the manifold is well approximated by the Euclidean distance, approximates longer geodesic distances with shortest paths through a nearest-neighbour graph, and then finds a low-dimensional embedding whose Euclidean distances match those geodesic distances.
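A minimal sketch using scikit-learn's Isomap on its synthetic Swiss roll; n_neighbors is illustrative and controls the graph used to approximate geodesic distances:

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

X, _ = make_swiss_roll(n_samples=1000, random_state=0)   # 3-D Swiss roll
iso = Isomap(n_neighbors=10, n_components=2)
X_2d = iso.fit_transform(X)                              # "unrolled" 2-D embedding
print(X_2d.shape)                                        # (1000, 2)
```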