12 - Dimension Reduction Flashcards
What is high dimensionality in data science?
High dimensionality refers to a data set with a large number of predictors; for example, 100 predictors describe a 100-dimensional space.
What is multicollinearity?
Multicollinearity occurs when there is substantial correlation among predictor variables, leading to unstable regression models.
What is double-counting in the context of regression models?
Double-counting occurs when highly correlated predictors carry overlapping information, so the aspect of the data they share is overemphasized in the model.
What is the curse of dimensionality?
As dimensionality increases, the volume of the predictor space grows exponentially, so a fixed number of observations fills the high-dimensional space ever more sparsely.
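To make the sparsity concrete, here is a minimal Python sketch (not from the source): the fraction of a unit hypercube's volume contained in a slightly smaller inner cube collapses as the dimension grows, so nearly all of a high-dimensional cube sits near its boundary.

```python
# Volume fraction of a unit hypercube lying inside an inner cube of
# side 0.9: it equals 0.9**d, which decays exponentially with dimension d.
for d in (1, 2, 10, 100):
    print(f"d = {d:>3}: inner-cube volume fraction = {0.9 ** d:.6f}")
# d =   1: 0.900000
# d =   2: 0.810000
# d =  10: 0.348678
# d = 100: 0.000027
```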
What does the principle of parsimony state?
The principle of parsimony suggests that models should be simple and interpretable, keeping the number of predictors manageable.
What is overfitting in regression models?
Overfitting occurs when too many predictors are included in the model, degrading its ability to generalize to new data.
What is the risk of missing the bigger picture in data analysis?
Focusing solely on individual predictors may overlook the fundamental relationships among them; correlated predictors can instead be grouped into components.
What are the three main objectives of dimension reduction methods?
- Reduce the number of predictor items
- Ensure that these predictor items are uncorrelated
- Provide a framework for interpreting the results
What does multicollinearity lead to in regression analysis?
Multicollinearity leads to instability in the solution space, causing unreliable regression coefficients.
What happens to regression coefficients when predictors are correlated?
The coefficients can vary widely across different samples, making them unreliable for interpretation.
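As an illustration, here is a small simulation (synthetic data, not the source's example) in which two nearly duplicate predictors produce coefficients that swing from sample to sample, even though their sum stays close to the true combined effect:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two highly correlated predictors (r ~ 0.99) and a response that
# depends only on their sum; refitting OLS on fresh samples shows the
# individual coefficients swinging while b1 + b2 stays near 2.
for trial in range(5):
    x1 = rng.normal(size=200)
    x2 = x1 + 0.1 * rng.normal(size=200)          # near-duplicate of x1
    y = x1 + x2 + rng.normal(size=200)
    X = np.column_stack([np.ones(200), x1, x2])   # intercept column first
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(f"trial {trial}: b1 = {beta[1]:+.2f}, b2 = {beta[2]:+.2f}")
```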
How can variance inflation factors (VIFs) indicate multicollinearity?
A large VIF indicates that a predictor is highly correlated with other predictors, with VIF ≥ 5 indicating moderate and VIF ≥ 10 indicating severe multicollinearity.
What is the formula for calculating VIF?
VIF_i = 1 / (1 - R_i^2), where R_i^2 is the R-squared value obtained by regressing predictor i on the other predictors.
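The formula can be computed directly by running each auxiliary regression. The sketch below (not from the source) uses scikit-learn for those regressions, on synthetic data chosen so that one pair of predictors is correlated:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def vif(X):
    """VIF_i = 1 / (1 - R_i^2), regressing column i on the other columns."""
    vifs = []
    for i in range(X.shape[1]):
        others = np.delete(X, i, axis=1)
        r2 = LinearRegression().fit(others, X[:, i]).score(others, X[:, i])
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

rng = np.random.default_rng(1)
x1 = rng.normal(size=500)
x2 = x1 + 0.5 * rng.normal(size=500)   # correlated with x1
x3 = rng.normal(size=500)              # independent predictor
X = np.column_stack([x1, x2, x3])
print(vif(X))  # large VIFs for x1 and x2, roughly 1 for x3
```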
What does a VIF of 6.85 indicate?
A VIF of 6.85 falls between the moderate (VIF ≥ 5) and severe (VIF ≥ 10) thresholds, indicating moderate-to-strong multicollinearity; equivalently, regressing that predictor on the others yields R_i^2 = 1 - 1/6.85 ≈ 0.854.
What is principal components analysis (PCA)?
PCA seeks to account for the correlation structure of a set of predictor variables using a smaller set of uncorrelated linear combinations, called components.
What is the significance of the first principal component?
The first principal component accounts for the greatest variability among the predictors.
True or False: PCA considers the target variable during analysis.
False. PCA acts solely on the predictor variables and ignores the target variable.
What should be done to predictors before applying PCA?
The predictors should be either standardized or normalized.
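Here is a minimal scikit-learn sketch (synthetic data, not the book's example) of the standardize-then-PCA workflow. It also shows that the resulting component scores are uncorrelated and that the cumulative variance explained can be read off directly:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
# Six correlated predictors built from two latent factors plus noise.
X = rng.normal(size=(300, 2)) @ rng.normal(size=(2, 6))
X += 0.1 * rng.normal(size=X.shape)

Z = StandardScaler().fit_transform(X)   # standardize before PCA
pca = PCA().fit(Z)

print(pca.explained_variance_ratio_)             # variance per component
print(np.cumsum(pca.explained_variance_ratio_))  # cumulative variance
scores = pca.transform(Z)                        # component scores
print(np.round(np.corrcoef(scores, rowvar=False), 2))  # ~identity matrix
```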
Fill in the blank: The total variability produced by the complete set of m predictors can often be mostly accounted for by a smaller set of k < m __________.
[components]
How does the second principal component relate to the first?
The second principal component accounts for the second-most variability and is uncorrelated with the first.
What is the purpose of varimax rotation in PCA?
Varimax rotation rotates the components so that each has a few large loadings and many near-zero loadings, making the components easier to interpret.
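One common way to perform varimax is to rotate the loading matrix directly. The numpy sketch below implements the classic iterative SVD algorithm; it assumes `loadings` holds a predictors-by-components matrix (for a fitted scikit-learn PCA, one standard choice is `pca.components_.T * np.sqrt(pca.explained_variance_)`):

```python
import numpy as np

def varimax(loadings, gamma=1.0, max_iter=100, tol=1e-6):
    """Rotate a (predictors x components) loading matrix so that each
    component ends up with a few large and many near-zero loadings."""
    p, k = loadings.shape
    R = np.eye(k)          # accumulated rotation matrix
    var = 0.0
    for _ in range(max_iter):
        L = loadings @ R
        u, s, vt = np.linalg.svd(
            loadings.T @ (L ** 3 - (gamma / p) * L @ np.diag((L ** 2).sum(axis=0)))
        )
        R = u @ vt
        if s.sum() - var < tol:   # stop once the criterion stabilizes
            break
        var = s.sum()
    return loadings @ R
```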
What is the cumulative variance explained by the first two components in the example?
The first two components account for about 52.2% of the variance.