Compressing Data via Dimensionality Reduction Flashcards

1
Q

What is PCA?

A

Principal Component Analysis - for unsupervised data compression.

2
Q

What is LDA?

A

Linear Discriminant Analysis - a supervised dimensionality reduction technique for maximizing class separability.

3
Q

Describe PCA in a nutshell.

A

PCA aims to find the directions of maximum variance in high-dimensional data and projects it onto a new subspace with equal or fewer dimensions than the original one. The orthogonal axes (principal components) of the new subspace can be interpreted as the directions of maximum variance, given the constraint that the new feature axes are orthogonal to each other.

4
Q

What is PCA sensitive to?

A

Data scaling. We need to standardize the features prior to PCA if they were measured on different scales and we want to assign equal importance to all of them.

5
Q

Give a summary of the PCA steps.

A

1) Standardize the d-dimensional dataset.
2) Construct the covariance matrix.
3) Decompose the covariance matrix into its eigenvectors and eigenvalues.
4) Select the k eigenvectors that correspond to the k largest eigenvalues, where k is the dimensionality of the new feature subspace (k ≤ d).
5) Construct a projection matrix W from the top k eigenvectors.
6) Transform the d-dimensional input dataset using the projection matrix W to obtain the new k-dimensional feature subspace.
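A minimal NumPy sketch of these steps (illustrative only; the toy data, the variable names, and k = 2 are my assumptions, not from the source):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))   # toy dataset: 100 samples, d = 5 features
k = 2                           # dimensionality of the new feature subspace

X_std = (X - X.mean(axis=0)) / X.std(axis=0)      # 1) standardize
cov_mat = np.cov(X_std.T)                         # 2) covariance matrix (d x d)
eigen_vals, eigen_vecs = np.linalg.eigh(cov_mat)  # 3) eigenvectors and eigenvalues
order = np.argsort(eigen_vals)[::-1]              # sort eigenvalues, largest first
W = eigen_vecs[:, order[:k]]                      # 4-5) top-k eigenvectors form the d x k projection matrix
X_pca = X_std @ W                                 # 6) project onto the new k-dimensional subspace

In practice one would usually reach for sklearn.decomposition.PCA, which wraps the same steps.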

6
Q

Describe the function for covariance between two features xj and xk

A

cov(xj, xk) = (1/n) * SUM(i) ( xj^i - mean(xj) ) * ( xk^i - mean(xk) ), where mean(xj) and mean(xk) are the sample means of features j and k. Note that the sample means are zero if we standardize the dataset. A positive covariance between two features indicates that the features increase or decrease together, whereas a negative covariance indicates that the features vary in opposite directions.
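A quick numeric check of this formula (a toy sketch; the two feature vectors are made up):

import numpy as np

xj = np.array([1.0, 2.0, 3.0, 4.0])
xk = np.array([2.0, 4.0, 6.0, 8.0])
cov_manual = np.mean((xj - xj.mean()) * (xk - xk.mean()))  # the (1/n)-normalized formula above
cov_numpy = np.cov(xj, xk, bias=True)[0, 1]                # bias=True selects 1/n instead of 1/(n-1)
print(cov_manual, cov_numpy)                               # both 2.5 (positive: features move together)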

7
Q

What role do eigenvectors and eigenvalues play in PCA?

A

The eigenvectors of the covariance matrix represent the principal components (the directions of maximum variance), whereas the corresponding eigenvalues will define their magnitude.

8
Q

What is the variance explained ratio of an eigenvalue?

A

The ratio of an eigenvalue to the total sum of all eigenvalues; it indicates how much of the total variance is captured by the corresponding principal component.
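For example (toy eigenvalues, assumed for illustration):

import numpy as np

eigen_vals = np.array([4.0, 2.5, 1.0, 0.5])     # toy eigenvalues of a covariance matrix
var_exp = eigen_vals / eigen_vals.sum()         # variance explained ratios
print(var_exp)                                  # [0.5 0.3125 0.125 0.0625]
print(np.cumsum(var_exp))                       # cumulative explained variance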

9
Q

What is the general concept behind LDA?

A

The goal in LDA is to find the feature subspace that optimizes class separability.

10
Q

What are the assumptions in LDA?

A

That the data is normally distributed, that the classes have identical covariance matrices, and that the features are statistically independent of each other.

11
Q

Summarize the key steps of the LDA approach.

A

1) Standardize the d-dimensional dataset (d is the number of features).
2) For each class, compute the d-dimensional mean vector.
3) Construct the between-class scatter matrix Sb and the within-class scatter matrix Sw.
4) Compute the eigenvectors and corresponding eigenvalues of the matrix Sw^-1 * Sb.
5) Choose the k eigenvectors that correspond to the k largest eigenvalues to construct a d x k-dimensional transformation matrix W; the eigenvectors are the columns of this matrix.
6) Project the samples onto the new feature subspace using the transformation matrix W.
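In practice these steps are wrapped by scikit-learn; a minimal usage sketch (the toy data, three classes, and n_components=2 are assumptions):

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
y = np.repeat(np.arange(3), 30)                    # 3 classes, 30 samples each
X = rng.normal(size=(90, 4)) + y[:, None]          # toy data: classes shifted apart

X_std = StandardScaler().fit_transform(X)          # 1) standardize
lda = LinearDiscriminantAnalysis(n_components=2)   # steps 2-5 happen inside fit()
X_lda = lda.fit_transform(X_std, y)                # 6) project onto the new 2-dimensional subspace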

12
Q

What is a mean vector?

A

Each mean vector Mi stores the mean value of each feature with respect to the samples of class i: Mi = (1/Ni) * SUM(x) over the samples x of class i, where Ni is the number of samples in class i.

13
Q

Describe maximum likelihood

A

In general, for a fixed set of data and underlying statistical model, the method of maximum likelihood selects the set of values of the model parameters that maximizes the likelihood function. Intuitively, this maximizes the “agreement” of the selected model with the observed data, and for discrete random variables it indeed maximizes the probability of the observed data under the resulting distribution.
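A tiny illustration (my own toy example, not from the source): for a coin that lands heads k times in n flips, the Bernoulli likelihood L(p) = p^k * (1-p)^(n-k) is maximized at p = k/n, which a crude grid search confirms:

import numpy as np

n, k = 10, 7                                     # toy data: 7 heads in 10 flips
p_grid = np.linspace(0.001, 0.999, 999)          # candidate parameter values
likelihood = p_grid**k * (1 - p_grid)**(n - k)   # likelihood of the observed data under each p
print(p_grid[np.argmax(likelihood)])             # ~0.7, i.e. k/n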

14
Q

What is the individual scatter matrix Si of each class i?

A

Si = SUM( (x - Mi) * transpose(x - Mi) ) over all samples x in class i.

15
Q

How do you compute the within-class scatter matrix Sw?

A

Sw = SUM(i=1 -> c) Si, where Si is the scatter matrix of each individual class i and c is the number of classes.

16
Q

Why would you scale the individual scatter matrices Si before summing them up as scatter matrix Sw and what does this mean?

A

Summing the raw scatter matrices assumes that the class labels in the training set are uniformly distributed. Since this is not guaranteed, we divide each scatter matrix Si by the number of class samples Ni before summing. Computing the scaled scatter matrix is then the same as computing the covariance matrix of that class; the covariance matrix is a normalized version of the scatter matrix: CVi = (1/Ni) * Si.

17
Q

How do you calculate the between-class scatter matrix?

A

Sb = SUM(i=1 -> c) Ni * (Mi - m) * transpose(Mi - m), where m is the overall mean computed over the samples from all classes and Ni is the number of samples in class i.
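A NumPy sketch tying the last few cards together (toy data assumed; note the scaled per-class scatter matrix is just that class's covariance matrix):

import numpy as np

rng = np.random.default_rng(0)
y = np.repeat(np.arange(3), 30)                    # 3 classes, 30 samples each
X = rng.normal(size=(90, 4)) + y[:, None]          # toy data: classes shifted apart
d = X.shape[1]
m = X.mean(axis=0)                                 # overall mean vector

S_w = np.zeros((d, d))
S_b = np.zeros((d, d))
for label in np.unique(y):
    X_c = X[y == label]
    N_i = X_c.shape[0]
    m_i = X_c.mean(axis=0)                         # class mean vector
    S_w += np.cov(X_c.T, bias=True)                # scaled scatter matrix = (1/Ni) * Si
    diff = (m_i - m).reshape(d, 1)
    S_b += N_i * (diff @ diff.T)                   # between-class scatter contribution

eigen_vals, eigen_vecs = np.linalg.eig(np.linalg.inv(S_w) @ S_b)   # step 4 of the LDA summary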

18
Q

Why would we use kernel PCA?

A

If we are dealing with nonlinear problems, linear transformation techniques for dimensionality reduction, such as PCA and LDA, may not be the best choice.

19
Q

What happens in kernel PCA?

A

We perform a nonlinear mapping that transforms the data onto a higher-dimensional space and use standard PCA in this higher-dimensional space to project the data back onto a lower-dimensional space where the samples can be separated by a linear classifier (under the condition that the samples can be separated by density in the input space). However, one downside of this approach is that it is computationally very expensive, and this is where we use the kernel trick. Using the kernel trick, we can compute the similarity between two high-dimensional feature vectors in the original feature space.
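A minimal scikit-learn sketch (the half-moon toy data and gamma=15 are assumptions, chosen only to show the API):

from sklearn.datasets import make_moons
from sklearn.decomposition import KernelPCA

X, y = make_moons(n_samples=100, random_state=123)        # nonlinearly separable toy data
kpca = KernelPCA(n_components=2, kernel='rbf', gamma=15)  # RBF similarities via the kernel trick
X_kpca = kpca.fit_transform(X)                            # classes become (roughly) linearly separable here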

20
Q

What are Kernels?

A

A kernel is a similarity function. It is a function that you, as the domain expert, provide to a machine learning algorithm. It takes two inputs and spits out how similar they are.

Suppose your task is to learn to classify images. You have (image, label) pairs as training data. Consider the typical machine learning pipeline: you take your images, you compute features, you string the features for each image into a vector, and you feed these “feature vectors” and labels into a learning algorithm.

Data –> Features –> Learning algorithm

Kernels offer an alternative. Instead of defining a slew of features, you define a single kernel function to compute similarity between images. You provide this kernel, together with the images and labels to the learning algorithm, and out comes a classifier.

Of course, the standard SVM / logistic regression / perceptron formulation doesn't work with kernels: it works with feature vectors. How on earth do we use kernels then? Two beautiful mathematical facts come to our rescue:
1) Under some conditions, every kernel function can be expressed as a dot product in a (possibly infinite-dimensional) feature space (Mercer's theorem).
2) Many machine learning algorithms can be expressed entirely in terms of dot products.
These two facts mean that I can take my favorite machine learning algorithm, express it in terms of dot products, and then, since my kernel is also a dot product in some space, replace the dot product with my favorite kernel. Voila!
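A small numeric illustration of Mercer's point (my own toy example): the kernel k(x, y) = (x . y)^2 is exactly an ordinary dot product after the explicit feature map phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2):

import numpy as np

def phi(v):                        # explicit feature map for the quadratic kernel
    return np.array([v[0]**2, np.sqrt(2) * v[0] * v[1], v[1]**2])

x = np.array([1.0, 2.0])
y = np.array([3.0, 4.0])
print(np.dot(x, y)**2)             # kernel value: (1*3 + 2*4)^2 = 121
print(np.dot(phi(x), phi(y)))      # same 121, computed in the explicit feature space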

21
Q

Why kernels, as opposed to feature vectors?

A

One big reason is that in many cases, computing the kernel is easy, but computing the feature vector corresponding to the kernel is really, really hard. The feature vector for even simple kernels can blow up in size, and for kernels like the RBF kernel (k(x, y) = exp(-||x - y||^2); see Radial basis function kernel) the corresponding feature vector is infinite-dimensional. Yet, computing the kernel is almost trivial.

Many machine learning algorithms can be written to only use dot products, and then we can replace the dot products with kernels. By doing so, we don't have to use the feature vector at all. This means that we can work with highly complex, efficient-to-compute, and yet high-performing kernels without ever having to write down the huge and potentially infinite-dimensional feature vector. Thus, if not for the ability to use kernel functions directly, we would be stuck with relatively low-dimensional, low-performance feature vectors. This "trick" is called the kernel trick.
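For instance, the RBF kernel between two vectors is one line to compute, even though its implicit feature space is infinite-dimensional (a sketch; gamma = 1 is an arbitrary choice):

import numpy as np

def rbf_kernel(x, y, gamma=1.0):
    return np.exp(-gamma * np.sum((x - y)**2))   # k(x, y) = exp(-gamma * ||x - y||^2)

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 2.0, 2.0])
print(rbf_kernel(x, y))                          # exp(-2) ≈ 0.135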