Data Science using Python and R - 12 Flashcards
What is high dimensionality in data science?
High dimensionality refers to a data set with a large number of predictors, for example, 100 predictors describe a 100-dimensional space.
What is multicollinearity?
Multicollinearity occurs when there is substantial correlation among predictor variables, leading to unstable regression models.
What is double-counting in the context of regression models?
Double-counting occurs when highly correlated predictors carry overlapping information, so the model overemphasizes the aspect of the data they share.
What is the curse of dimensionality?
As dimensionality increases, the volume of the predictor space grows exponentially, making the high-dimensional space sparse.
What does the principle of parsimony state?
The principle of parsimony suggests that models should be simple and interpretable, keeping the number of predictors manageable.
What is overfitting in regression models?
Overfitting occurs when too many predictors are included in the model, hindering its ability to generalize to new data.
What is the risk of missing the bigger picture in data analysis?
Focusing solely on individual predictors may overlook the fundamental relationships among them, which can be grouped into components.
What are the three main objectives of dimension reduction methods?
- Reduce the number of predictor items
- Ensure that these predictor items are uncorrelated
- Provide a framework for interpretability of the results
What does multicollinearity lead to in regression analysis?
Multicollinearity leads to instability in the solution space, causing unreliable regression coefficients.
What happens to regression coefficients when predictors are correlated?
The coefficients can vary widely across different samples, making them unreliable for interpretation.
How can variance inflation factors (VIFs) indicate multicollinearity?
A large VIF indicates that a predictor is highly correlated with other predictors, with VIF ≥ 5 indicating moderate and VIF ≥ 10 indicating severe multicollinearity.
What is the formula for calculating VIF?
VIF_i = 1 / (1 - R_i^2), where R_i^2 is the R-squared value obtained by regressing predictor i on the other predictors.
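As a quick illustration of this formula, the sketch below computes VIFs with statsmodels on a made-up predictor matrix (all variable names are placeholders, not from the text):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Made-up predictors: x2 is deliberately built from x1 to induce correlation
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
X = pd.DataFrame({
    "x1": x1,
    "x2": x1 + rng.normal(scale=0.3, size=200),
    "x3": rng.normal(size=200),
})

# variance_inflation_factor expects a design matrix that includes the constant
design = sm.add_constant(X)
vifs = {col: variance_inflation_factor(design.values, i)
        for i, col in enumerate(design.columns) if col != "const"}
print(vifs)  # x1 and x2 share variance, so their VIFs exceed 5
```

Each value is 1 / (1 - R_i^2) from regressing that predictor on the others, matching the formula above.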
What does a VIF of 6.85 indicate?
A VIF of 6.85 falls between the thresholds of 5 and 10, indicating moderate (though not yet severe) multicollinearity for that predictor.
What is principal components analysis (PCA)?
PCA seeks to account for the correlation structure of a set of predictor variables using a smaller set of uncorrelated linear combinations, called components.
What is the significance of the first principal component?
The first principal component accounts for the greatest variability among the predictors.
True or False: PCA considers the target variable during analysis.
False. PCA acts solely on the predictor variables and ignores the target variable.
What should be done to predictors before applying PCA?
The predictors should be either standardized or normalized.
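For instance, z-score standardization with scikit-learn (a toy two-predictor matrix, not data from the text):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[50.0, 3.0], [60.0, 5.0], [70.0, 4.0]])  # toy predictors

# After scaling, every column has mean 0 and standard deviation 1
X_train_z = StandardScaler().fit_transform(X_train)
print(X_train_z.mean(axis=0).round(6), X_train_z.std(axis=0).round(6))
```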
Fill in the blank: The total variability produced by the complete set of m predictors can often be mostly accounted for by a smaller set of k < m __________.
[components]
What is PCA?
Principal Component Analysis (PCA) is a technique for dimension reduction.
What does PCA act on?
PCA acts solely on the predictor variables and ignores the target variable.
What is the characteristic of the first principal component?
The first principal component accounts for greater variability among the predictors than any other component.
How does the second principal component relate to the first?
The second principal component accounts for the second-most variability and is uncorrelated with the first.
What is the purpose of varimax rotation in PCA?
Varimax rotation helps in the interpretability of the components.
What is the cumulative variance explained by the first two components in the example?
The first two components account for about 52.2% of the variance.
What does the eigenvalue criterion suggest for extracting components?
Only components with eigenvalues greater than one should be retained.
What is the eigenvalue of 1.0 indicative of?
An eigenvalue of 1.0 indicates that the component explains about ‘one predictor’s worth’ of variability.
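One way to apply this criterion, sketched with NumPy on stand-in data: compute the eigenvalues of the predictors' correlation matrix and count how many exceed 1.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 6))                        # stand-in predictors
X[:, 1] = X[:, 0] + rng.normal(scale=0.4, size=300)  # induce correlation

# Eigenvalues of the correlation matrix, sorted largest first
eigvals = np.linalg.eigvalsh(np.corrcoef(X, rowvar=False))[::-1]
k = int((eigvals > 1.0).sum())  # keep components worth more than one predictor
print(eigvals.round(3), "-> retain k =", k)
```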
What is the proportion of variance explained criterion?
The criterion specifies the proportion of total variability that the principal components should account for.
How many components should be extracted based on the consensus in the example?
k = 4 components should be extracted.
What does the principal component matrix indicate after rotation?
The rotated component matrix provides clearer interpretations of the principal components.
What should be done to validate PCA results?
The PCA results should be validated using the test data set.
What is indicated by VIFs equal to 1 in the regression of Sales per Visit on principal components?
VIFs of 1 for every component indicate that the multicollinearity has been eliminated, because the principal components are mutually uncorrelated.
What is the purpose of standardizing predictor variables before PCA?
Standardizing ensures that all predictors contribute equally to the analysis.
What command is used to perform PCA in Python?
The PCA() class from the sklearn.decomposition module is used.
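A minimal sketch of that call, assuming a numeric training matrix (names are placeholders):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
X_train = rng.normal(size=(100, 5))            # stand-in training predictors

X_z = StandardScaler().fit_transform(X_train)  # standardize first (see above)
pca = PCA(n_components=4)
scores = pca.fit_transform(X_z)                # one column of scores per component
print(pca.explained_variance_ratio_)           # variance proportion per component
```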
What is the significance of the cutoff in PCA loadings in R?
The cutoff suppresses small PCA weights to enhance interpretability.
What is the first step in performing PCA using R?
Import the required data sets and separate the predictor variables.
True or False: PCA can increase multicollinearity in the data.
False. The components PCA produces are mutually uncorrelated, so PCA removes rather than increases multicollinearity.
What does the command ‘scale()’ do in R?
It standardizes the predictor variables.
Fill in the blank: The first principal component is a combination of _______.
Different Items Purchased and Purchase Visits
What is the minimum number of predictors needed for eigenvalue criterion to be valid?
At least 20 predictors. The eigenvalue criterion is most reliable with roughly 20 to 50 predictors; with fewer than 20, it tends to extract too few components.
What is the purpose of the varimax rotation in PCA?
To simplify the interpretation of the components by maximizing the variance of the squared loadings, driving each loading toward 0 or ±1.
Varimax rotation is also used in factor analysis to achieve a simpler, more interpretable structure.
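scikit-learn's PCA has no built-in rotation, so as a rough sketch, the classic varimax iteration can be applied to a loadings matrix (shape: predictors × components) with plain NumPy; libraries such as factor_analyzer offer the same rotation ready-made.

```python
import numpy as np

def varimax(loadings, max_iter=100, tol=1e-6):
    """Rotate a (predictors x components) loading matrix toward simple structure."""
    p, k = loadings.shape
    rotation = np.eye(k)
    total = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        # Gradient of the varimax criterion: variance of the squared loadings
        grad = rotated ** 3 - rotated @ np.diag((rotated ** 2).sum(axis=0)) / p
        u, s, vt = np.linalg.svd(loadings.T @ grad)
        rotation = u @ vt
        if s.sum() < total * (1 + tol):  # converged: criterion stopped improving
            break
        total = s.sum()
    return loadings @ rotation
```

Applied to pca.components_.T, this pushes each loading toward 0 or ±1 so that every rotated component is dominated by a few predictors.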
What command is used to obtain the correlation of the components in the training data set?
cor(), typically wrapped in round() for readability, e.g. round(cor(scores), 4), where scores is whatever object holds the component scores.
This computes the correlation matrix of the PCA scores and rounds it for display.
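The Python analogue, assuming scores holds the component scores, is np.corrcoef; for genuine PCA scores the result is an identity matrix up to rounding:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 4))                      # stand-in predictors
scores = PCA(n_components=3).fit_transform(X)      # component scores

# rowvar=False treats each column (component) as one variable
print(np.corrcoef(scores, rowvar=False).round(4))  # ~ identity matrix
```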
What does the VIF stand for in regression analysis?
Variance Inflation Factor
VIF measures how much the variance of an estimated regression coefficient is inflated by correlation among the predictors.
True or False: Multicollinearity adversely affects the ability of the sample regression equation to predict the response variable.
False
Multicollinearity does not significantly impact the predictive ability of the regression model.
What should be strictly limited when using a multicollinear model?
Use of the model should be strictly limited to estimation and prediction of the target variable.
Interpretation of such a model is inappropriate, because the individual coefficients are unstable and can be nonsensical.
What is the first step in running a regression model after obtaining PCA scores?
Save each component as its own variable
This allows for the use of PCA components as predictors in the regression model.
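A sketch of that workflow with statsmodels and scikit-learn (X and y are stand-ins; y plays the role of Sales per Visit):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 5))                      # stand-in predictors
y = X @ rng.normal(size=5) + rng.normal(size=200)  # stand-in target

scores = PCA(n_components=3).fit_transform(StandardScaler().fit_transform(X))
comps = pd.DataFrame(scores, columns=["PC1", "PC2", "PC3"])  # one variable each

model = sm.OLS(y, sm.add_constant(comps)).fit()    # regress target on components
design = sm.add_constant(comps)
print([round(variance_inflation_factor(design.values, i), 2)
       for i in range(1, design.shape[1])])        # all 1.0: no multicollinearity
```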
Fill in the blank: PCA replaces the original set of m predictors with _______.
principal components
Principal components are linear combinations of the original variables that capture the maximum variance.
What does the eigenvalue criterion help determine in PCA?
The number of components to retain based on eigenvalues greater than 1
Components with eigenvalues greater than 1 explain more variance than a single original variable.
What is the proportion of variance explained criterion in PCA?
It helps determine the number of components to retain by explaining a specified percentage of total variance.
Common thresholds are 70%, 80%, or 90% of total variance.
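As a sketch, with scikit-learn the cumulative proportions fall out of explained_variance_ratio_ (85% is an arbitrary threshold chosen for the example):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(5)
X = rng.normal(size=(150, 8))                # stand-in predictors

pca = PCA().fit(StandardScaler().fit_transform(X))
cum_var = np.cumsum(pca.explained_variance_ratio_)
k = int(np.searchsorted(cum_var, 0.85)) + 1  # smallest k reaching the threshold
print(cum_var.round(3), "-> retain k =", k)
```

scikit-learn can also do this in one step: PCA(n_components=0.85) keeps just enough components to reach the stated proportion.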
What does a high VIF indicate in regression analysis?
That multicollinearity is a problem
A VIF value above 5 or 10 is often considered indicative of multicollinearity.
What is the relationship between the first and other principal components?
The other principal components are uncorrelated with the first and with one another; each successive component accounts for less of the remaining variability.
This mutual uncorrelatedness is what allows the components to replace correlated predictors without multicollinearity.
How is multicollinearity defined in regression analysis?
The presence of high correlations among predictor variables
Multicollinearity can inflate the variance of coefficient estimates, making them unstable.
What does PCA stand for?
Principal Component Analysis
PCA is a technique used for dimensionality reduction while preserving as much variance as possible.
Which command is used to run a linear regression model in R?
lm()
The lm() function is used to fit linear models in R.
What is the significance of the output from the vif() command?
It shows the variance inflation factors for the regression model.
VIF output helps identify potential multicollinearity issues among predictors.
What should be done to predictors before running PCA?
Standardize or normalize them
Standardization ensures that each predictor contributes equally to the analysis.
What does the correlation matrix show in PCA analysis?
The relationships between predictor variables
High correlations among variables indicate redundancy and potential multicollinearity.
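For example, with pandas on made-up variables (column names are hypothetical, loosely echoing the chapter's purchase-behavior predictors):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(6)
visits = rng.poisson(10, size=100).astype(float)
df = pd.DataFrame({
    "purchase_visits": visits,
    "items_purchased": visits * 2 + rng.normal(size=100),  # deliberately related
    "days_since_purchase": rng.normal(30, 5, size=100),
})

print(df.corr().round(2))  # large off-diagonal values flag redundant predictors
```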
What is the target variable in the analysis of the red wine dataset?
Wine quality
This dataset typically includes various chemical properties of red wine as predictors.
How can the number of components to retain be determined?
By combining recommendations from the eigenvalue criterion and the proportion of variance explained criterion.
This approach ensures a comprehensive evaluation of component significance.
What does the term ‘high dimensionality’ refer to in data science?
Having a large number of features or predictors in a dataset
High dimensionality can lead to overfitting and increased computational cost.
What is the main purpose of dimension reduction methods?
To reduce the number of predictors while retaining essential information
This helps improve model performance and interpretability.
What does a plot of eigenvalues help determine?
How many components to retain based on their explained variance
The eigenvalue plot visually represents the variance accounted for by each component.
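A minimal matplotlib sketch of such a scree plot on stand-in data, with the eigenvalue-criterion cutoff drawn at 1:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
X = rng.normal(size=(200, 6))                      # stand-in predictors

pca = PCA().fit(StandardScaler().fit_transform(X))
ks = np.arange(1, len(pca.explained_variance_) + 1)
plt.plot(ks, pca.explained_variance_, marker="o")  # component variances (eigenvalues)
plt.axhline(1.0, linestyle="--")                   # eigenvalue criterion cutoff
plt.xlabel("Component")
plt.ylabel("Eigenvalue")
plt.show()
```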
What is the correlation matrix used for in the context of predictors?
To identify highly correlated variables
Understanding correlations helps in diagnosing multicollinearity issues.