Dimensionality reduction Flashcards
What are the main motivations for reducing a dataset's dimensionality (3)? What are the main drawbacks (4)?
1) To speed up a subsequent training algorithm
2) To visualize the data and gain insights into the most important features
3) To save space (data compression)
The main drawbacks are:
1) Some information is lost
2) Can be computationally intensive
3) It adds some complexity to your machine learning pipelines.
4) Transformed features are often hard to interpret.
What is the curse of dimensionality?
The curse of dimensionality refers to various phenomena that arise when analyzing and organizing data in high-dimensional spaces (often with hundreds or thousands of dimensions) that do not occur in low-dimensional settings such as the three-dimensional physical space of everyday experience.
For example, randomly sampled high-dimensional vectors are generally very sparse (far from one another), which increases the risk of overfitting.
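A quick illustrative sketch of this effect: the average distance between randomly sampled points keeps growing with the number of dimensions, so a fixed-size training set covers an ever smaller fraction of the space.

```python
import numpy as np

rng = np.random.default_rng(42)

# Average Euclidean distance between pairs of points sampled
# uniformly in the unit hypercube, for increasing dimensionality.
for d in (2, 10, 100, 1000):
    a = rng.random((1000, d))
    b = rng.random((1000, d))
    print(d, np.linalg.norm(a - b, axis=1).mean())
```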
Once a dataset’s dimensionality has been reduced, is it possible to reverse the operation? If so, how? if not, why?
It is almost always impossible to perfectly reverse the operation, because some information is lost during dimensionality reduction. However, it is often possible to reconstruct a reasonably good approximation of the original dataset (e.g., with PCA's inverse transformation).
Can PCA be used to reduce the dimensionality of a highly nonlinear dataset?
PCA can be used to significantly reduce the dimensionality of most datasets, even if they are highly nonlinear, because it can at least get rid of useless dimensions. However, if there are no useless dimensions (as in a Swiss roll dataset), then reducing dimensionality with PCA will lose too much information: you want to unroll the Swiss roll, not squash it.
Suppose you perform PCA on a 1000 dimensional dataset, setting the explained variance ratio to 95%. How many dimensions will the resulting dataset have?
Let's look at the two extremes:
First, suppose the dataset is composed of points that are almost perfectly aligned. In this case, PCA can reduce the dataset down to just 1 dimension while still preserving 95% of the variance.
Second, suppose the dataset is composed of perfectly random points scattered in all directions. In this case, roughly 95% of the 1,000 dimensions are needed, i.e., about 950. So the answer depends on the dataset: anywhere between 1 and about 950 dimensions.
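A minimal sketch with scikit-learn: passing a float between 0 and 1 as n_components keeps just enough components to preserve that fraction of the variance, so the resulting dimensionality depends entirely on the data (the toy dataset here is an assumption for illustration).

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# 1,000 nominal dimensions, but only ~20 underlying degrees of freedom.
X = rng.normal(size=(500, 20)) @ rng.normal(size=(20, 1000))

pca = PCA(n_components=0.95)   # keep 95% of the variance
X_reduced = pca.fit_transform(X)
print(X_reduced.shape[1])      # far fewer than 950 for this structured data
```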
In what cases would you use vanilla PCA, Incremental PCA, or kernel PCA?
1) Regular PCA is the default, but it works only if the dataset fits in memory.
2) Incremental PCA is useful for large datasets that do not fit in memory, but it is slower.
3) Kernel PCA is useful for nonlinear datasets.
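A sketch of how each variant is typically invoked in scikit-learn (the component counts, batch size, and kernel parameters are illustrative assumptions):

```python
import numpy as np
from sklearn.decomposition import PCA, IncrementalPCA, KernelPCA

X = np.random.rand(1000, 60)

# 1) Regular PCA: the whole dataset must fit in memory.
X_pca = PCA(n_components=10).fit_transform(X)

# 2) Incremental PCA: processes the data in mini-batches,
#    useful when the dataset does not fit in memory.
inc_pca = IncrementalPCA(n_components=10, batch_size=100)
X_inc = inc_pca.fit_transform(X)

# 3) Kernel PCA: applies the kernel trick for nonlinear datasets.
X_kpca = KernelPCA(n_components=10, kernel="rbf", gamma=0.04).fit_transform(X)
```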
How can you evaluate the performance of a dimensionality reduction algorithm on your dataset?
Intuitively, a dimensionality reduction algorithm performs well if it eliminates a lot of dimensions from the dataset without losing too much information. One way to measure this is to apply the reverse transformation and measure the reconstruction error. However, not all dimensionality reduction algorithms provide a reverse transformation; if you use dimensionality reduction as a preprocessing step, you can instead measure the performance of the downstream model on the reduced data.
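A sketch of the reconstruction-error approach with PCA (the digits dataset and component count are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X = load_digits().data               # 64-dimensional digit images

pca = PCA(n_components=30)
X_reduced = pca.fit_transform(X)
X_recovered = pca.inverse_transform(X_reduced)

# Mean squared reconstruction error: low values mean little information lost.
reconstruction_error = np.mean((X - X_recovered) ** 2)
print(reconstruction_error)
```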
Does it make any sense to chain two different dimensionality reduction algorithms?
It can absolutely make sense. A common example is using PCA to quickly get rid of a large number of useless dimensions, then applying another, much slower dimensionality reduction algorithm such as LLE.
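A sketch of such a chain using a scikit-learn Pipeline (the dataset and component counts are illustrative assumptions):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import LocallyLinearEmbedding
from sklearn.pipeline import make_pipeline

X = load_digits().data  # 64 features

# PCA quickly removes mostly-useless dimensions, then the slower
# LLE works on the already-reduced data.
chain = make_pipeline(
    PCA(n_components=0.95),
    LocallyLinearEmbedding(n_components=2, n_neighbors=10),
)
X_reduced = chain.fit_transform(X)
print(X_reduced.shape)  # (n_samples, 2)
```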
What is the main problem of dimensionality reduction?
You lose some information.
What is the main idea of manifold learning?
Manifold learning is an approach to non-linear dimensionality reduction. Algorithms for this task are based on the idea that the dimensionality of many data sets is only artificially high.
What is the main idea of principal component analysis (PCA)?
Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. This transformation is defined in such a way that the first principal component has the largest possible variance (that is, accounts for as much of the variability in the data as possible), and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to the preceding components.
When using PCA what are the principal components?
PCA identifies the axis that accounts for the largest amount of variance in the training set. This axis is the 1st principal component (PC). The algorithm then finds a second axis, orthogonal to the first, that accounts for the largest remaining variance; that is the 2nd PC. It then finds a third axis, orthogonal to the first two, and so on.
How do you find the principal components (PC) when using PCA?
We need to apply a standard matrix factorization technique called singular value decomposition (SVD) that can decompose the training set matrix X into the matrix multiplication of three matrices:
X = U Σ Vᵀ
Where V contains the unit vectors that define all the principal components that we are looking for.
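A minimal sketch with NumPy (the toy dataset is an assumption; note the data must be centered first, since PCA assumes zero-mean features):

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3)) @ rng.normal(size=(3, 3))  # toy 3D dataset

X_centered = X - X.mean(axis=0)          # PCA assumes zero-mean data
U, s, Vt = np.linalg.svd(X_centered)     # X_centered = U @ diag(s) @ Vt

# The rows of Vt (i.e., the columns of V) are the unit vectors
# that define the principal components.
c1 = Vt[0]   # 1st principal component
c2 = Vt[1]   # 2nd principal component
```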
When using PCA, once you used the singular value decomposition:
X = U Σ Vᵀ
How do you project down the data to d dimensions?
X_d-proj = X W_d
Where X_d-proj is the reduced dataset of dimensionality d, X is the original dataset, and W_d is the matrix containing the first d columns of V. In other words, W_d contains the principal components that preserve the most variance (i.e., lose the least).
Note: in Scikit-Learn, pca.explained_variance_ratio_ tells you the proportion of the dataset's variance that lies along each principal component.
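A sketch of the projection, continuing the NumPy SVD example above, plus the equivalent with scikit-learn (the toy dataset is again an assumption):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3)) @ rng.normal(size=(3, 3))
X_centered = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(X_centered)

# Project onto the first d = 2 principal components: X_d-proj = X W_d
W2 = Vt[:2].T                 # W_d = first d columns of V
X2D = X_centered @ W2

# Same thing with scikit-learn (it handles the centering itself).
pca = PCA(n_components=2)
X2D_sklearn = pca.fit_transform(X)
print(pca.explained_variance_ratio_)  # variance carried by each axis
```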
What is locally linear embedding (LLE), and how does it work?
It is a manifold learning technique that does not rely on projections. LLE works in three steps:
1) Find a set of the nearest neighbors of each point.
2) Compute a set of weights for each point that best reconstructs the point as a linear combination of its neighbors.
3) Use an eigenvector-based optimization technique to find the low-dimensional embedding of the points, such that each point is still described by the same linear combination of its neighbors.
Note: LLE tends to handle non-uniform sample densities poorly because there is no fixed unit to prevent the weights from drifting as various regions differ in sample densities.
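A sketch of LLE on a Swiss roll with scikit-learn (the hyperparameters are illustrative assumptions):

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

X, t = make_swiss_roll(n_samples=1000, noise=0.2, random_state=42)

# n_neighbors controls step 1 (who counts as a neighbor);
# the reconstruction weights and the embedding are steps 2 and 3.
lle = LocallyLinearEmbedding(n_components=2, n_neighbors=10, random_state=42)
X_unrolled = lle.fit_transform(X)
print(X_unrolled.shape)  # (1000, 2)
```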