Chapter 3. Dimensionality Reduction Flashcards
Two major branches of dimensionality reduction? P 120
Linear projection
Manifold learning, which is also referred to as nonlinear dimensionality reduction
What techniques does linear projection include? P 120
Principal component analysis
Singular value decomposition
Random projection
Which techniques does manifold learning include? P 120
Isomap
Multidimensional scaling (MDS)
Locally linear embedding (LLE)
T-distributed stochastic neighbor embedding (t-SNE)
Dictionary learning
Random trees embedding
Independent component analysis
What kind of distance measure does Isomap learn? P 120
It learns the curved distance (also called the geodesic distance) between points rather than the Euclidean distance.
What are the four versions of PCA? P 120
Standard PCA
Incremental PCA
Sparse PCA
Kernel PCA
Is the matrix regenerated from standard PCA features exactly the same as the original matrix? P 121
No. With the principal components, it is possible to reconstruct the original features, not exactly, but generally close enough.
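For illustration, a minimal sketch of this round trip with scikit-learn's PCA (the digits dataset and the choice of 20 components are stand-ins, not the book's MNIST setup):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)            # 1,797 samples, 64 features

pca = PCA(n_components=20, random_state=42)
X_reduced = pca.fit_transform(X)               # project onto 20 principal components
X_restored = pca.inverse_transform(X_reduced)  # map back to the original 64 features

# Mean squared reconstruction error: small, but not exactly zero
print(np.mean((X - X_restored) ** 2))
```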
What is one essential thing to do before running PCA? P 121
It is essential to perform feature scaling before running PCA.
What is the sklearn PCA attribute for finding the explained variance percentage? P 123
explained_variance_ratio_
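A short sketch, assuming the digits data rather than the book's MNIST example, that combines the scaling step from the previous card with the explained_variance_ratio_ attribute:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_digits(return_X_y=True)

# Feature scaling before running PCA, as recommended above
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=10, random_state=42)
pca.fit(X_scaled)

# Fraction of the total variance captured by each principal component
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())  # cumulative share captured by all 10
```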
What is the trade-off of using PCA? P 128
A model trained on the PCA-reduced feature set may not perform quite as well in terms of accuracy as a model trained on the full feature set, but both training and prediction times will be much faster. This is one of the important trade-offs you must consider when choosing whether to use dimensionality reduction in your machine learning product.
When do we use incremental PCA? P 128
For datasets that are very large and cannot fit in memory, we can perform PCA incrementally in small batches, where each batch is able to fit in memory.
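A minimal sketch with scikit-learn's IncrementalPCA; the batch_size and component count are illustrative choices:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import IncrementalPCA

X, _ = load_digits(return_X_y=True)

# fit() streams the data through partial_fit() in batches, so only one
# batch has to sit in memory at a time; for truly out-of-core data you
# would call partial_fit() yourself on each chunk read from disk
inc_pca = IncrementalPCA(n_components=20, batch_size=200)
X_reduced = inc_pca.fit_transform(X)

print(X_reduced.shape)  # (1797, 20)
```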
What is sparse PCA? P 130
For some machine learning problems, some degree of sparsity may be preferred. A version of PCA that retains some degree of sparsity—controlled by a hyperparameter called alpha—is known as sparse PCA.
What is the difference between standard PCA and sparse PCA? P 130
The normal PCA algorithm searches for linear combinations in all the input variables, reducing the original feature space as densely as possible. The sparse PCA algorithm searches for linear combinations in just some of the input variables, reducing the original feature space to some degree but not as compactly as normal PCA.
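A short sketch with scikit-learn's SparsePCA; alpha=1 and the component count are illustrative values, not the book's settings:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import SparsePCA

X, _ = load_digits(return_X_y=True)

# alpha controls the degree of sparsity of the component loadings
sparse_pca = SparsePCA(n_components=10, alpha=1, random_state=42)
X_reduced = sparse_pca.fit_transform(X)

# Fraction of loadings that are exactly zero: each component uses only
# some of the input variables, unlike normal PCA
print(np.mean(sparse_pca.components_ == 0))
```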
What is kernel PCA? P 132
Normal PCA, incremental PCA, and sparse PCA linearly project the original data onto a lower dimensional space, but there is also a nonlinear form of PCA known as kernel PCA, which runs a similarity function over pairs of original data points in order to perform nonlinear dimensionality reduction.
When is kernel PCA especially effective? P 132
This method is especially effective when the original feature set is not linearly separable.
What is the gamma hyperparameter in kernel PCA? P 133
kernel coefficient
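A minimal sketch with scikit-learn's KernelPCA on a toy dataset that is not linearly separable; make_circles and gamma=15 are illustrative stand-ins for the book's example:

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

# Two concentric circles: not linearly separable in the original space
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=42)

# gamma is the kernel coefficient of the RBF similarity function that is
# evaluated over pairs of original data points
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=15)
X_kpca = kpca.fit_transform(X)

print(X_kpca.shape)  # (400, 2)
```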
What is alpha in sparse PCA? P 130
degree of sparsity
What is Singular Value Decomposition? P 134
Another approach to learning the underlying structure of the data is to reduce the rank of the original matrix of features to a smaller rank, such that the original matrix can be recreated using a linear combination of some of the vectors in the smaller rank matrix. This is known as singular value decomposition (SVD).
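A minimal sketch with scikit-learn's TruncatedSVD; the rank of 20 is an illustrative choice:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import TruncatedSVD

X, _ = load_digits(return_X_y=True)

# Keep a rank-20 approximation of the original 64-feature matrix
svd = TruncatedSVD(n_components=20, algorithm="randomized", random_state=42)
X_reduced = svd.fit_transform(X)

# The original matrix can be approximately recreated from the smaller-rank one
X_restored = svd.inverse_transform(X_reduced)
print(X.shape, X_reduced.shape, X_restored.shape)
```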
What is the rank of a matrix? External
The maximum number of linearly independent columns (or rows) of a matrix is called its rank. The rank of a matrix cannot exceed the number of its rows or columns.
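A quick NumPy illustration of the definition:

```python
import numpy as np

A = np.array([[1, 2, 3],
              [2, 4, 6],   # twice the first row, so not independent
              [1, 0, 1]])

# Rank = number of linearly independent rows/columns,
# at most min(number of rows, number of columns)
print(np.linalg.matrix_rank(A))  # 2
```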
Does the relevant structure of the original feature set remain preserved after random projection? P 136
Yes
What are the two versions of Random Projection? P 136
There are two versions of random projection—the standard version known as Gaussian random projection and a sparse version known as sparse random projection.
What does eps hyperparameter control in Gaussian Random Projection? What do lower values of eps mean? P 137
The eps hyperparameter controls the quality of the embedding according to the Johnson–Lindenstrauss lemma; smaller values of eps generate a higher number of dimensions.
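A short sketch of how eps drives the output dimensionality; the random data and sample sizes are illustrative:

```python
import numpy as np
from sklearn.random_projection import (GaussianRandomProjection,
                                        johnson_lindenstrauss_min_dim)

# Minimum dimensions required by the Johnson-Lindenstrauss lemma:
# a smaller eps (tighter distortion bound) demands more dimensions
print(johnson_lindenstrauss_min_dim(n_samples=1000, eps=0.5))  # a few hundred
print(johnson_lindenstrauss_min_dim(n_samples=1000, eps=0.1))  # several thousand

X = np.random.RandomState(42).rand(1000, 5000)

# With the default n_components='auto', eps picks the number of dimensions
grp = GaussianRandomProjection(eps=0.5, random_state=42)
X_reduced = grp.fit_transform(X)
print(X_reduced.shape)
```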
Why is the scatter plot from standard PCA different from the one from sparse PCA? P 132
Normal and sparse PCA generate principal components differently, and the separation of points is somewhat different, too.
Why does the random projection scatter plot look very different from the PCA family's scatter plots? P 138
Although it is a form of linear projection like PCA, random projection is an entirely different family of dimensionality reduction. Thus the random projection scatter plot looks very different from the scatter plots of normal PCA, incremental PCA, sparse PCA, and kernel PCA.
What are the advantages of using Sparse Random Projection instead of Gaussian Random Projection? P 138
It is generally much more efficient and faster than normal Gaussian random projection.
To which member of the PCA family is Isomap similar? How does it reduce dimensionality? P 140
Like kernel PCA, Isomap learns a new, low-dimensional embedding of the original feature set by calculating the pairwise distances of all the points, where distance is curved or geodesic distance rather than Euclidean distance.
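A minimal sketch with scikit-learn's Isomap; the S-curve dataset and n_neighbors value are illustrative choices:

```python
from sklearn.datasets import make_s_curve
from sklearn.manifold import Isomap

# Points lying on a curved two-dimensional surface embedded in 3-D space
X, _ = make_s_curve(n_samples=1000, random_state=42)

# Pairwise distances are measured along the manifold (geodesic),
# not as straight-line Euclidean distances
isomap = Isomap(n_neighbors=10, n_components=2)
X_iso = isomap.fit_transform(X)

print(X_iso.shape)  # (1000, 2)
```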
What is Multidimensional scaling? What is it based on? P 141
Multidimensional scaling (MDS) is a form of nonlinear dimensionality reduction that learns the similarity of points in the original dataset and, using this learned similarity, models the points in a lower dimensional space.
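A short sketch with scikit-learn's MDS; the digits subset and n_init value are illustrative, chosen to keep the pairwise-similarity computation quick:

```python
from sklearn.datasets import load_digits
from sklearn.manifold import MDS

X, _ = load_digits(return_X_y=True)
X_small = X[:500]  # MDS works on pairwise similarities, so keep the sample small

# Learn similarities between points and model them in two dimensions
mds = MDS(n_components=2, n_init=2, random_state=42)
X_mds = mds.fit_transform(X_small)

print(X_mds.shape)  # (500, 2)
```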
Does Locally Linear Embedding preserve distances within local neighborhoods as it projects the data from the original feature space to a reduced space? P 142
Yes
Is Locally Linear Embedding a linear method of dimensionality reduction or a non-linear one? P 142
Nonlinear
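A minimal sketch with scikit-learn's LocallyLinearEmbedding; the dataset and n_neighbors are illustrative choices:

```python
from sklearn.datasets import make_s_curve
from sklearn.manifold import LocallyLinearEmbedding

X, _ = make_s_curve(n_samples=1000, random_state=42)

# Each point is reconstructed from its local neighborhood, and those local
# relationships are preserved in the lower-dimensional projection
lle = LocallyLinearEmbedding(n_neighbors=10, n_components=2, random_state=42)
X_lle = lle.fit_transform(X)

print(X_lle.shape)  # (1000, 2)
```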
What is the main use of t-SNE dimensionality reduction? P 144
T-distributed stochastic neighbor embedding (t-SNE) is a nonlinear dimensionality reduction technique for visualizing high-dimensional data.
In real-world applications of t-SNE, why is it best to use another dimensionality reduction technique (such as PCA, as the book does) to reduce the number of dimensions before applying t-SNE? P 144
By applying another form of dimensionality reduction first, we reduce the noise in the features that are fed into t-SNE and speed up the computation of the algorithm.
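A short sketch of the PCA-then-t-SNE pattern described above; the component counts and perplexity are illustrative values:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X, _ = load_digits(return_X_y=True)

# First reduce noise and dimensionality with PCA ...
X_pca = PCA(n_components=30, random_state=42).fit_transform(X)

# ... then run t-SNE on the PCA output for a 2-D visualization;
# fixing random_state makes the (nonconvex) result reproducible
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_tsne = tsne.fit_transform(X_pca)

print(X_tsne.shape)  # (1797, 2)
```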
Why aren’t the results of t-SNE stable? P 145
t-SNE has a nonconvex cost function, which means that different initializations of the algorithm will generate different results. There is no stable solution.
What dimensionality reduction methods don’t rely on geometry or distance metrics? P 146
Dictionary learning
Independent component analysis
Dictionary learning learns the sparse representation of the original data. True or False? P 146
True
What are dictionaries and atoms in dictionary learning? P 146
The resulting matrix is known as the dictionary
The vectors in the dictionary are known as atoms.
Atoms are simple, binary vectors. True or False? P 146
True
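A minimal sketch with scikit-learn's MiniBatchDictionaryLearning; the hyperparameter values are illustrative, not the book's exact settings:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import MiniBatchDictionaryLearning

X, _ = load_digits(return_X_y=True)

dict_learner = MiniBatchDictionaryLearning(n_components=50, alpha=1,
                                           batch_size=200, random_state=42)
X_codes = dict_learner.fit_transform(X)   # sparse representation of the data

# The learned dictionary; each of its vectors is an atom
print(dict_learner.components_.shape)     # (50, 64)
# Many entries of the sparse codes are exactly zero
print(np.mean(X_codes == 0))
```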
What problem does Independent Component Analysis address? P 148
One common problem with unlabeled data is that there are many independent signals embedded together into the features we are given. Using independent component analysis (ICA), we can separate these blended signals into their individual components.
When is ICA used? P 149
ICA is commonly used in signal processing tasks (for example, to identify the individual voices in an audio clip of a busy coffeehouse).
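A minimal sketch of blind source separation with scikit-learn's FastICA; the synthetic sine and sawtooth signals are illustrative stand-ins for the coffeehouse voices:

```python
import numpy as np
from scipy import signal
from sklearn.decomposition import FastICA

# Two independent source signals ...
t = np.linspace(0, 8, 2000)
s1 = np.sin(2 * t)                       # sinusoidal source
s2 = signal.sawtooth(2 * np.pi * t)      # sawtooth source
S = np.c_[s1, s2]

# ... observed only as linear mixtures of one another
A = np.array([[1.0, 0.5],
              [0.5, 1.0]])               # mixing matrix
X = S @ A.T

# ICA recovers the independent components (up to sign and scale)
ica = FastICA(n_components=2, random_state=42)
S_estimated = ica.fit_transform(X)

print(S_estimated.shape)  # (2000, 2)
```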
Can PCA work with categorical data? External
While it is technically possible to use PCA on discrete variables, or on categorical variables that have been one-hot encoded, you should not. Simply put, if your variables don't belong on a coordinate plane, then do not apply PCA to them.