Dimensionality Reduction Flashcards

1
Q

What is Principal Component Analysis or PCA?

A

PCA is used to reduce the dimensionality of large datasets, increasing interpretability while minimizing information loss. It does this by transforming the data into a new set of variables called principal components.

The original data matrix is factored into two matrices, the loadings and the components (scores), which, when multiplied together, reconstruct the data (exactly if all components are kept, approximately otherwise).
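
A minimal sketch of this two-matrix view with scikit-learn (the toy data, and keeping every component so the reconstruction is exact, are illustrative choices, not from the card):

```python
# Hedged sketch: PCA as a product of a scores matrix and a loadings matrix.
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 10)           # toy data: 100 samples, 10 features

pca = PCA()                           # keep all 10 components
scores = pca.fit_transform(X)         # the component scores (100 x 10)
loadings = pca.components_            # the loadings (10 x 10)

# Recombining the two matrices (plus the mean PCA removed) rebuilds X.
X_hat = scores @ loadings + pca.mean_
print(np.allclose(X, X_hat))          # True
```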

2
Q

What are the properties of PCA?

A

1. Finds directions that maximize variance.
2. These directions are mutually orthogonal (linearly independent).
3. The first component has the highest variance.
4. The variance explained by the PCs decreases as we move from the first PC to the last one (checked numerically in the sketch below).
5. The PCs are linear combinations of the original variables/basis.
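
A hedged check of properties 2 and 4 with scikit-learn on toy data (not from the card):

```python
# Hedged sketch: checking properties 2 and 4 numerically (toy data).
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(200, 5)
pca = PCA().fit(X)

# Property 4: the explained variance decreases from the first PC to the last.
print(pca.explained_variance_ratio_)             # sorted in decreasing order

# Property 2: the principal directions are mutually orthogonal.
W = pca.components_                              # each row is one PC direction
print(np.allclose(W @ W.T, np.eye(W.shape[0])))  # True: orthonormal rows
```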

3
Q

What is singular value decomposition?

A

SVD is a matrix factorization that decomposes a matrix A into three separate matrices, A = U S V^T, where U and V contain the left and right singular vectors (orthonormal columns) and S is a diagonal matrix of non-negative singular values.
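
A small NumPy sketch of the three-matrix decomposition (the toy matrix is illustrative):

```python
# Hedged sketch: SVD of a random matrix and its exact reconstruction.
import numpy as np

A = np.random.rand(6, 4)
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# A is rebuilt from the three factors: left singular vectors, a diagonal
# matrix of singular values, and the transposed right singular vectors.
print(np.allclose(A, U @ np.diag(s) @ Vt))     # True
```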

4
Q

What is a scree plot?

A

A scree plot shows the percentage of explained variance on the y-axis and the component number on the x-axis. We expect that the first component explains the most variance.
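
A hedged sketch of how such a plot might be drawn with scikit-learn and matplotlib (the Iris data is only an example):

```python
# Hedged sketch: a scree plot of explained variance per component.
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data
pca = PCA().fit(X)

components = range(1, pca.n_components_ + 1)
plt.plot(components, pca.explained_variance_ratio_ * 100, marker="o")
plt.xlabel("Component number")
plt.ylabel("Explained variance (%)")
plt.title("Scree plot")
plt.show()
```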

5
Q

Why does the first principal component show the most variance?

A

Principal Components are the directions in the data that capture the most variance. They are orthogonal to each other and ordered by the amount of variance they capture.

First Principal Component:

Maximizing Variance: The first principal component is the direction in the data that captures the maximum variance. It is the line that best fits the data in the least squares sense.

Eigenvector: Mathematically, the first principal component is the eigenvector corresponding to the largest eigenvalue of the covariance matrix. This eigenvalue represents the amount of variance captured by the principal component.

Why the First Component Shows the Most Variance:

Optimization: PCA optimizes the direction of the first principal component to capture the maximum variance. This is done by finding the eigenvector of the covariance matrix with the largest eigenvalue.

Projection: When the data is projected onto the first principal component, the spread (variance) of the data along this direction is maximized. This means that the first principal component captures the most significant patterns in the data.

Sequential Process: Subsequent principal components capture the remaining variance in the data, but each is constrained to be orthogonal to the previous components. Therefore, the first component always captures the most variance.
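
A short NumPy sketch of this eigenvector view (toy data; variable names are illustrative):

```python
# Hedged sketch: the first PC is the eigenvector of the covariance matrix
# with the largest eigenvalue, and that eigenvalue is the variance captured
# along that direction.
import numpy as np

X = np.random.rand(100, 4)
Xc = X - X.mean(axis=0)                  # centre the data
cov = np.cov(Xc, rowvar=False)           # 4 x 4 covariance matrix

eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues
first_pc = eigvecs[:, -1]                # eigenvector of the largest eigenvalue

# Projecting onto this direction gives the maximum possible variance,
# and that variance equals the largest eigenvalue.
print(eigvals[-1], np.var(Xc @ first_pc, ddof=1))   # the two values match
```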

6
Q

What is Sparse PCA?

A

Sparse PCA aims to find principal components that are sparse, meaning that many of the coefficients (loadings) are zero. This sparsity helps in identifying the most important variables and makes the components more interpretable.
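
A hedged sketch of the scikit-learn SparsePCA API; the toy data and the alpha value (the L1 penalty that controls how many loadings become exactly zero) are illustrative:

```python
# Hedged sketch: Sparse PCA with an L1 penalty on the loadings.
import numpy as np
from sklearn.decomposition import SparsePCA

rng = np.random.RandomState(0)
X = rng.randn(200, 10)

spca = SparsePCA(n_components=3, alpha=1.0, random_state=0)
scores = spca.fit_transform(X)

# Inspect the loadings: entries that are exactly zero mark variables the
# component ignores; raising alpha produces more zeros.
print(spca.components_)
```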

7
Q

What is the Tf-Idf transform?

A

The Tf-Idf transform is used to turn text into numerical features, for example before topic modelling.

TF stands for Term Frequency, i.e. how frequently the term appears in the document.

IDF stands for Inverse Document Frequency: the inverse of the fraction of documents that contain the term, usually taken as log(N / n_t), where N is the number of documents and n_t is the number of documents containing the term. The Tf-Idf weight of a term in a document is TF × IDF, so terms that appear in many documents are down-weighted.
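
A minimal sketch with scikit-learn's TfidfVectorizer on made-up documents:

```python
# Minimal sketch: Tf-Idf features from a few made-up documents.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]
vec = TfidfVectorizer()
X = vec.fit_transform(docs)              # sparse matrix: documents x terms

# The IDF weight is small for terms found in many documents ("the", "sat")
# and larger for rarer, more distinctive terms.
print(dict(zip(vec.get_feature_names_out(), vec.idf_.round(2))))
```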

8
Q

What is NMF?

A

Non-Negative Matrix Factorization (NMF) decomposes the data into factors with no negative entries, so the weights we get in topic modelling are never negative.

The NMF method automatically clusters the data, as it minimizes the Frobenius norm of the reconstruction error.
The solution (resulting matrices) is NOT unique.
◦ In general, the following property holds: D = QP = (QB)(B⁻¹P). If QB and B⁻¹P are both non-negative, then this is an equivalent decomposition.
◦ To make the problem better behaved, an L1 regularization is applied to the norm, leading to the non-negative sparse coding method.
When applied to text, NMF plays a role similar to Latent Semantic Analysis (which is based on truncated SVD), but with non-negative, more interpretable factors.
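
A hedged sketch of NMF-based topic modelling on a Tf-Idf matrix with scikit-learn (the documents and the number of topics are made up):

```python
# Hedged sketch: NMF topic modelling on a Tf-Idf matrix D ≈ Q @ P.
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["space rocket launch", "rocket engine design",
        "movie review cinema", "new cinema release"]
D = TfidfVectorizer().fit_transform(docs)        # non-negative matrix D

nmf = NMF(n_components=2, init="nndsvd", random_state=0)
Q = nmf.fit_transform(D)     # documents x topics, all entries >= 0
P = nmf.components_          # topics x terms,    all entries >= 0

# D is approximated by the non-negative product Q @ P; each row of P
# can be read as a topic, each row of Q as a document's topic weights.
```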

9
Q

What are manifold methods in machine learning?

A

Manifold: A manifold is a mathematical space that locally resembles Euclidean space. In the context of machine learning, it refers to the underlying low-dimensional structure within high-dimensional data.

Dimensionality Reduction: Manifold learning aims to reduce the number of dimensions in a dataset while retaining its essential features. This helps in simplifying data visualization and analysis.

10
Q

Why are manifold methods useful?

A

When doing non-linear dimensionality reduction, exploiting this low-dimensional manifold structure is exactly what we are doing.
◦ Compare this to PCA, which creates a rotation of the space but leaves the distances alone.
A non-linear method will modify the distances locally to accomplish compression.
◦ Some distances will be preserved.
◦ Some will be distorted.
The idea is to preserve the topology of the space when reducing dimensions.
◦ Topology "studies properties of spaces that are invariant under any continuous deformation. It is sometimes called 'rubber-sheet geometry'."

11
Q

Explain how t-SNE works

A

t-Distributed Stochastic Neighbor Embedding (t-SNE) is a popular algorithm for visualizing high-dimensional data in a lower-dimensional space, typically 2D or 3D. Here’s a step-by-step explanation of how it works:

How t-SNE Works:

  1. High-Dimensional Probability Distribution: t-SNE starts by calculating the probability that pairs of high-dimensional data points are similar. This is done by measuring the distance between points and converting these distances into probabilities. Similar points have higher probabilities, while dissimilar points have lower probabilities.
  2. Low-Dimensional Probability Distribution: Next, t-SNE defines a similar probability distribution for the points in the lower-dimensional space. The goal is to ensure that the low-dimensional representation preserves the pairwise similarities of the high-dimensional data.
  3. Minimizing Divergence: The algorithm then minimizes the Kullback-Leibler (KL) divergence between the two probability distributions (high-dimensional and low-dimensional). KL divergence measures how one probability distribution diverges from a second, expected probability distribution. By minimizing it, t-SNE ensures that similar points in high-dimensional space remain close together in the low-dimensional space, and dissimilar points stay far apart. The KL divergence is our LOSS FUNCTION.
  4. Optimization: t-SNE uses gradient descent to iteratively adjust the positions of the points in the low-dimensional space to minimize the KL divergence. This optimization continues until the algorithm converges to a stable solution.

Key Characteristics:
Non-Linear: Unlike linear dimensionality reduction techniques like PCA, t-SNE is non-linear, making it better suited for capturing complex structures in the data.
Local Structure: t-SNE focuses on preserving the local structure of the data, meaning it maintains the relationships between nearby points more accurately than those between distant points.
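
A minimal usage sketch with scikit-learn's TSNE (the digits data and the explicit perplexity value are illustrative):

```python
# Hedged sketch: embedding 64-dimensional digits into 2D with t-SNE.
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)          # 64-dimensional inputs
emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(emb.shape)                             # (n_samples, 2), ready to plot
```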

12
Q

What are some drawbacks of t-SNE?

A
  1. It does not scale well. Calculations are costly.
  2. It cannot work on sparse high-dimensional data directly; normally the data is first compressed with PCA.
  3. It is very expensive in memory, as it works with large dense matrices.
  4. It only preserves local structure, and you have to be very careful with the perplexity parameter.
13
Q

How does UMAP work?

A

UMAP tries to handle these problems by dealing with the more expensive parts of the t-SNE equations, while also adding a few intelligent tricks. Its principles are the same, though.

In UMAP's case:
◦ The probability p_{i|j} is now modelled as an exponential, p_{i|j} = exp(-(d(x_i, x_j) - ρ_i) / σ_i), and any distance d(x_i, x_j) is allowed, not just the Euclidean distance.

Here ρ_i is the minimum distance parameter, i.e. the closest distance at which a point is allowed to look for neighbours (closer points are ignored, or "clumped" together), and σ_i is a per-point normalization factor. Note that the probabilities are not normalized, which makes UMAP significantly more efficient than t-SNE.

UMAP also does not use perplexity; it uses the number of nearest neighbours to determine the distributions.

UMAP uses a different loss function: binary cross-entropy. This follows likelihood methods, trees, etc.

Finally, UMAP initializes the mapping using a spectral embedding of the neighbourhood graph.

As with t-SNE, there are two very important parameters:
◦ The number of nearest neighbours.
◦ The minimum distance.
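
A hedged usage sketch, assuming the third-party umap-learn package is installed; the parameter values are illustrative:

```python
# Hedged sketch: UMAP with its two key parameters set explicitly
# (requires the third-party umap-learn package).
import umap
from sklearn.datasets import load_digits

X, _ = load_digits(return_X_y=True)
emb = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=42).fit_transform(X)
print(emb.shape)                             # (n_samples, 2) by default
```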

14
Q

What are the main takeaways comparing t-SNE and UMAP?

A

Manifold methods work by compressing the space in a non-linear manner.
As such, they can be arbitrary and require careful selection of parameters.
Two methods: t-SNE and UMAP.

UMAP is better grounded in theory and more efficient, but less widely accepted than t-SNE.

t-SNE is only good for plotting in two or three dimensions; use UMAP when you need more.
