PCA Flashcards
What is the basic idea behind dimensionality reduction or PCA?
Replace a large number of predictors with a smaller number that still maintains a good representation of the data.
When is reducing dimensionality important?
When an ML model runs into memory or long processing time issues due to big data with too many predictors.
Another use is simply being able to visualize the data in 2 or 3 dimensions. E.g., a scatterplot with labels lets you see which data points (like customers) are close to each other in space, even if the new/reduced axes of that space aren’t meaningful.
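A minimal sketch of that visualization use (the iris dataset stands in for your own X and labels y):

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)            # stand-in data and labels for illustration
X_2d = PCA(n_components=2).fit_transform(X)  # project onto 2 axes so it can be plotted
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y)     # color points by their label
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.show()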
Dimension reduction / PCA: After reducing the data to, say, 4 dimensions, what do the axes represent?
The axes are rotated to lie along the directions of highest variance in the data.
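A quick check of that idea (sketch, using a stand-in dataset): after the transform, the column variances decrease from the first axis to the last, and the new columns are uncorrelated with each other.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)                  # 4 original features
X_rot = PCA(n_components=4).fit_transform(X)       # 4 rotated axes, no reduction
print(X_rot.var(axis=0))                           # variances sorted high to low
print(np.corrcoef(X_rot, rowvar=False).round(2))   # off-diagonal entries ~ 0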
PCA: main syntax for creating it, fitting, and transforming? (3 lines)
- What are the arguments for the model?
from sklearn.decomposition import PCA
my_pca = PCA(n_components=k) # k: int = number of axes to keep, or float in (0, 1) = min fraction of variance to keep
my_pca.fit(X) # fitted attributes (names ending in _) become available after this step
X_trans = my_pca.transform(X)
How to run a PCA transform with however many dimensions capture at least 90% of the original features’ variance?
PCA(n_components=0.9) # a float in (0, 1) means “keep however many components explain at least this fraction of the variance”
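Sketch of what that looks like in practice (the digits dataset is a stand-in for X):

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)
my_pca = PCA(n_components=0.9)                   # keep enough components for >= 90% of variance
X_trans = my_pca.fit_transform(X)
print(my_pca.n_components_)                      # how many axes were actually kept
print(my_pca.explained_variance_ratio_.sum())    # >= 0.90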
After running my_pca.fit(X), what attributes become available?
.components_ # principal axes’ vectors
.explained_variance_ratio_ # each axis’ relative contribution to explaining the original variance
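For example (sketch, with the iris dataset standing in for X):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
my_pca = PCA(n_components=2).fit(X)
print(my_pca.components_.shape)            # (2, 4): one row (unit vector in the original feature space) per new axis
print(my_pca.explained_variance_ratio_)    # roughly [0.92, 0.05] on this dataset; fractions sorted descending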
What is not a good reason to use dimensionality reduction / PCA?
When you have features that are redundant or collinear (correlated with each other). Including all of them in a model as-is might cause overfitting, but simply running PCA on them first won’t fix that issue.
Is PCA a predictor?
No, it is a transformer.
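In sklearn terms, that means it has fit/transform (and fit_transform) but no predict, so it typically shows up as a preprocessing step ahead of a predictor, e.g. in a Pipeline (sketch; dataset and classifier chosen only for illustration):

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

X, y = load_digits(return_X_y=True)
pipe = Pipeline([
    ("pca", PCA(n_components=0.9)),     # transformer: reduces the features
    ("knn", KNeighborsClassifier()),    # predictor: does the actual prediction
])
pipe.fit(X, y)
print(pipe.score(X, y))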
What happens if you run PCA without specifying n_components?
It keeps n_components equal to the number of features in X (more precisely, min(n_samples, n_features)), but still rotates the data so the axes lie along the directions of maximum variance. So X doesn’t get reduced, but it does get transformed.
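Sketch (iris as a stand-in for X):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)   # shape (150, 4)
X_rot = PCA().fit_transform(X)      # no n_components: all 4 axes are kept
print(X.shape, X_rot.shape)         # (150, 4) (150, 4) -- same size, just rotated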
You did PCA then KMeans clustering and found the cluster centers in the reduced/transformed space. How do you get the centroids’ coordinates in the original feature space? In other words, how do you “reverse” PCA?
my_pca.inverse_transform(centroids)
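Sketch of the whole round trip (stand-in data; 3 clusters assumed for illustration):

from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
my_pca = PCA(n_components=2)
X_trans = my_pca.fit_transform(X)                      # reduced/transformed space
km = KMeans(n_clusters=3, n_init=10).fit(X_trans)      # cluster in the reduced space
centroids = km.cluster_centers_                        # coordinates in PCA space
centroids_orig = my_pca.inverse_transform(centroids)   # back in the 4 original features
print(centroids_orig.shape)                            # (3, 4)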
PCA: How do you decide how many dimensions?
Usually you want to capture a minimum of X% (e.g., 90%) of the variance. However many components it takes to reach that threshold is how many you keep.
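One common way to check that (sketch; digits as a stand-in for X): fit a full PCA and look at the cumulative explained variance.

import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)
pca = PCA().fit(X)                                    # keep all components
cum_var = np.cumsum(pca.explained_variance_ratio_)    # running total of variance explained
n_needed = np.argmax(cum_var >= 0.90) + 1             # first component count reaching 90%
print(n_needed)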