12 - Dimension Reduction Flashcards

1
Q

What is high dimensionality in data science?

A

High dimensionality refers to a data set with a large number of predictors, for example, 100 predictors describe a 100-dimensional space.

2
Q

What is multicollinearity?

A

Multicollinearity occurs when there is substantial correlation among predictor variables, leading to unstable regression models.

3
Q

What is double-counting in the context of regression models?

A

Double-counting occurs when highly correlated predictors lead the model to overemphasize a particular aspect of the data, effectively counting it more than once.

4
Q

What is the curse of dimensionality?

A

As dimensionality increases, the volume of the predictor space grows exponentially, making the high-dimensional space sparse.

5
Q

What does the principle of parsimony state?

A

The principle of parsimony suggests that models should be simple and interpretable, keeping the number of predictors manageable.

6
Q

What is overfitting in regression models?

A

Overfitting occurs when too many predictors are included in the model, hindering its ability to generalize to new data.

7
Q

What is the risk of missing the bigger picture in data analysis?

A

Focusing solely on individual predictors may overlook the fundamental relationships among them, which can be grouped into components.

8
Q

What are the three main objectives of dimension reduction methods?

A
  • Reduce the number of predictor items
  • Ensure that these predictor items are uncorrelated
  • Provide a framework for interpretability of the results.
9
Q

What does multicollinearity lead to in regression analysis?

A

Multicollinearity leads to instability in the solution space, causing unreliable regression coefficients.

10
Q

What happens to regression coefficients when predictors are correlated?

A

The coefficients can vary widely across different samples, making them unreliable for interpretation.

11
Q

How can variance inflation factors (VIFs) indicate multicollinearity?

A

A large VIF indicates that a predictor is highly correlated with other predictors, with VIF ≥ 5 indicating moderate and VIF ≥ 10 indicating severe multicollinearity.

12
Q

What is the formula for calculating VIF?

A

VIF_i = 1 / (1 - R_i^2), where R_i^2 is the R-squared value obtained by regressing predictor i on the other predictors.

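A minimal sketch of computing this by hand, assuming hypothetical data (not the deck's dataset): each column of a predictor matrix X is regressed on the others with scikit-learn, and its R² is plugged into the formula above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def vif(X):
    """Return VIF_i = 1 / (1 - R_i^2) for each column of X."""
    vifs = []
    for i in range(X.shape[1]):
        others = np.delete(X, i, axis=1)   # every predictor except i
        r2 = LinearRegression().fit(others, X[:, i]).score(others, X[:, i])
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

# Hypothetical data: x1 and x2 are strongly correlated, x3 is independent
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.3, size=200)
x3 = rng.normal(size=200)
print(vif(np.column_stack([x1, x2, x3])))  # large VIFs for x1, x2; near 1 for x3
```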
13
Q

What does a VIF of 6.85 indicate?

A

A VIF of 6.85 indicates moderate-to-strong multicollinearity for the predictor variable.

14
Q

What is principal components analysis (PCA)?

A

PCA seeks to account for the correlation structure of a set of predictor variables using a smaller set of uncorrelated linear combinations, called components.

15
Q

What is the significance of the first principal component?

A

The first principal component accounts for the greatest variability among the predictors.

16
Q

True or False: PCA considers the target variable during analysis.

A

False. PCA acts solely on the predictor variables and ignores the target variable.

17
Q

What should be done to predictors before applying PCA?

A

The predictors should be either standardized or normalized.

18
Q

Fill in the blank: The total variability produced by the complete set of m predictors can often be mostly accounted for by a smaller set of k < m __________.

A

[components]

19
Q

What is PCA?

A

Principal Component Analysis (PCA) is a technique for dimension reduction.

20
Q

What does PCA act on?

A

PCA acts solely on the predictor variables and ignores the target variable.

21
Q

What is the characteristic of the first principal component?

A

The first principal component accounts for greater variability among the predictors than any other component.

22
Q

How does the second principal component relate to the first?

A

The second principal component accounts for the second-most variability and is uncorrelated with the first.

23
Q

What is the purpose of varimax rotation in PCA?

A

Varimax rotation improves the interpretability of the components.

24
Q

What is the cumulative variance explained by the first two components in the example?

A

The first two components account for about 52.2% of the variance.

25
Q

What does the eigenvalue criterion suggest for extracting components?

A

Only components with eigenvalues greater than one should be retained.
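A minimal sketch of this criterion on a hypothetical predictor matrix X: the eigenvalues of the predictors' correlation matrix are computed with NumPy, and those above 1 are counted.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 6))              # hypothetical predictor matrix

corr = np.corrcoef(X, rowvar=False)        # 6 x 6 correlation matrix
eigvals = np.linalg.eigvalsh(corr)[::-1]   # eigenvalues, largest first
k = int(np.sum(eigvals > 1))               # eigenvalue criterion
print(eigvals, "-> retain", k, "components")
```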
26
Q

What is an eigenvalue of 1.0 indicative of?

A

An eigenvalue of 1.0 indicates that the component explains about "one predictor's worth" of variability.

27
Q

What is the proportion of variance explained criterion?

A

The criterion specifies the proportion of total variability that the retained principal components should account for.
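A companion sketch with hypothetical eigenvalues: retain the smallest k whose cumulative share of the total variance clears a chosen threshold, here 80%.

```python
import numpy as np

eigvals = np.array([2.9, 1.4, 0.8, 0.5, 0.4])  # hypothetical eigenvalues
share = np.cumsum(eigvals) / eigvals.sum()     # cumulative proportion explained
k = int(np.argmax(share >= 0.80)) + 1          # smallest k reaching the threshold
print(np.round(share, 3), "-> retain", k, "components")
```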
28
Q

How many components should be extracted based on the consensus in the example?

A

k = 4 components should be extracted.

29
Q

What does the principal component matrix indicate after rotation?

A

The rotated component matrix provides clearer interpretations of the principal components.

30
Q

What should be done to validate PCA results?

A

The PCA results should be validated using the test data set.
31
Q

What is indicated by VIFs equal to 1 in the regression of Sales per Visit on the principal components?

A

VIFs of 1 for all components indicate that multicollinearity has been eliminated.
32
Q

What is the purpose of standardizing predictor variables before PCA?

A

Standardizing ensures that all predictors contribute equally to the analysis.

33
Q

What command is used to perform PCA in Python?

A

The PCA() command from the sklearn.decomposition module is used.
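A minimal sketch of that workflow on hypothetical data: standardize the predictors, then fit PCA() and inspect the variance each component explains.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 5))                  # hypothetical predictors

Z = StandardScaler().fit_transform(X)          # standardize first
pca = PCA().fit(Z)

print(pca.explained_variance_)                 # component variances (approx. the eigenvalues)
print(pca.explained_variance_ratio_.cumsum())  # cumulative proportion explained
scores = pca.transform(Z)                      # component scores for later modeling
```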
34
Q

What is the significance of the cutoff in PCA loadings in R?

A

The cutoff suppresses small PCA weights to enhance interpretability.

35
Q

What is the first step in performing PCA using R?

A

Import the required data sets and separate the predictor variables.
36
Q

True or False: PCA can increase multicollinearity in the data.

A

False. PCA produces uncorrelated components, so it eliminates rather than increases multicollinearity.
37
Q

What does the command scale() do in R?

A

It standardizes the predictor variables.

38
Q

Fill in the blank: The first principal component is a combination of _______.

A

[Different Items Purchased and Purchase Visits]
39
Q

What is the minimum number of predictors needed for the eigenvalue criterion to be valid?

A

At least 20 predictors.
40
Q

What is the purpose of the varimax rotation in PCA?

A

To simplify interpretation by rotating the components so that each variable loads strongly on as few components as possible.

Footnote: Varimax rotation is a method used in factor analysis to achieve a simpler, more interpretable factor structure.
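scikit-learn's PCA has no built-in varimax, so as an illustration only, here is the common SVD-based NumPy implementation, applied to an assumed m x k loading matrix. After rotation the loadings are pushed toward 0 or +/-1, so each component reads as a bundle of a few original predictors.

```python
import numpy as np

def varimax(loadings, max_iter=100, tol=1e-6):
    """Rotate a loading matrix so each variable loads strongly on few components."""
    p, k = loadings.shape
    R = np.eye(k)                              # start from the identity rotation
    var = 0.0
    for _ in range(max_iter):
        L = loadings @ R
        # Gradient of the varimax criterion, solved via SVD
        u, s, vt = np.linalg.svd(
            loadings.T @ (L ** 3 - L @ np.diag((L ** 2).sum(axis=0)) / p)
        )
        R = u @ vt
        new_var = s.sum()
        if new_var < var * (1 + tol):          # criterion stopped improving
            break
        var = new_var
    return loadings @ R
```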
41
Q

What command is used to obtain the correlation of the components in the training data set?

A

round(cor(...)), with the matrix of component scores passed to cor().

Footnote: This command calculates the correlation matrix for the scores obtained from PCA, rounded for readability.
42
Q

What does VIF stand for in regression analysis?

A

Variance Inflation Factor

Footnote: VIF measures how much the variance of an estimated regression coefficient is inflated when predictors are correlated.

43
Q

True or False: Multicollinearity adversely affects the ability of the sample regression equation to predict the response variable.

A

False

Footnote: Multicollinearity does not significantly impact the model's predictive ability; it destabilizes the individual coefficient estimates, which harms interpretation.
44
Q

To what uses should a multicollinear model be strictly limited?

A

Estimation and prediction of the target variable

Footnote: Interpretation of such a model is inappropriate, because the individual coefficients can be nonsensical.
45
Q

What is the first step in running a regression model after obtaining PCA scores?

A

Save each component as its own variable.

Footnote: This allows for the use of the PCA components as predictors in the regression model.
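A sketch of that step on hypothetical data: the retained component scores are saved as named columns and used as the regression predictors; because the components are uncorrelated, their VIFs all equal 1.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
X = rng.normal(size=(400, 5))                      # hypothetical predictors
y = X @ rng.normal(size=5) + rng.normal(size=400)  # hypothetical target

scores = PCA(n_components=3).fit_transform(StandardScaler().fit_transform(X))
pcs = pd.DataFrame(scores, columns=["PC1", "PC2", "PC3"])  # each component as its own variable

model = LinearRegression().fit(pcs, y)
print(model.coef_)                                     # coefficients on uncorrelated predictors
print(np.round(np.corrcoef(scores, rowvar=False), 4))  # ~identity matrix, i.e. VIFs of 1
```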
46
Q

Fill in the blank: PCA replaces the original set of m predictors with _______.

A

[principal components]

Footnote: Principal components are linear combinations of the original variables that capture the maximum variance.

47
Q

What does the eigenvalue criterion help determine in PCA?

A

The number of components to retain, based on eigenvalues greater than 1.

Footnote: Components with eigenvalues greater than 1 explain more variance than a single original variable.

48
Q

What is the proportion of variance explained criterion in PCA?

A

It helps determine the number of components to retain by requiring them to explain a specified percentage of the total variance.

Footnote: Common thresholds are 70%, 80%, or 90% of total variance.
49
Q

What does a high VIF indicate in regression analysis?

A

That multicollinearity is a problem.

Footnote: A VIF above 5 or 10 is often considered indicative of multicollinearity.
50
Q

What is the relationship between the first and the other principal components?

A

Every other principal component is uncorrelated with the first and accounts for less of the variability.

Footnote: By construction, the principal components are mutually uncorrelated.
51
Q

How is multicollinearity defined in regression analysis?

A

The presence of high correlations among the predictor variables.

Footnote: Multicollinearity can inflate the variance of coefficient estimates, making them unstable.

52
Q

What does PCA stand for?

A

Principal Component Analysis

Footnote: PCA is a technique for dimensionality reduction that preserves as much variance as possible.

53
Q

Which command is used to run a linear regression model in R?

A

lm()

Footnote: The lm() function is used to fit linear models in R.

54
Q

What is the significance of the output from the vif() command?

A

It shows the variance inflation factors for the regression model.

Footnote: The VIF output helps identify potential multicollinearity among the predictors.
55
Q

What should be done to predictors before running PCA?

A

Standardize or normalize them.

Footnote: Standardization ensures that each predictor contributes equally to the analysis.

56
Q

What does the correlation matrix show in PCA analysis?

A

The relationships between the predictor variables.

Footnote: High correlations among variables indicate redundancy and potential multicollinearity.

57
Q

What is the target variable in the analysis of the red wine dataset?

A

Wine quality

Footnote: This dataset typically includes various chemical properties of red wine as predictors.
58
Q

How can the number of components to retain be determined?

A

By combining the recommendations of the eigenvalue criterion and the proportion of variance explained criterion.

Footnote: This approach ensures a comprehensive evaluation of component significance.

59
Q

What does the term 'high dimensionality' refer to in data science?

A

Having a large number of features or predictors in a dataset.

Footnote: High dimensionality can lead to overfitting and increased computational cost.
60
Q

What is the main purpose of dimension reduction methods?

A

To reduce the number of predictors while retaining the essential information.

Footnote: This helps improve model performance and interpretability.

61
Q

What does a plot of eigenvalues help determine?

A

How many components to retain, based on the variance each one explains.

Footnote: The eigenvalue (scree) plot visually represents the variance accounted for by each component.
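A sketch of such a plot (a scree plot) with matplotlib and hypothetical eigenvalues; the dashed line marks the eigenvalue-criterion cutoff of 1.

```python
import numpy as np
import matplotlib.pyplot as plt

eigvals = np.array([2.9, 1.4, 0.8, 0.5, 0.4])  # hypothetical eigenvalues
ks = np.arange(1, len(eigvals) + 1)

plt.plot(ks, eigvals, "o-")
plt.axhline(1.0, linestyle="--")               # eigenvalue-criterion cutoff
plt.xlabel("Component number")
plt.ylabel("Eigenvalue")
plt.title("Scree plot")
plt.show()
```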
62
Q

What is the correlation matrix used for in the context of predictors?

A

To identify highly correlated variables.

Footnote: Understanding the correlations helps in diagnosing multicollinearity issues.
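A sketch of that screening step, assuming a pandas DataFrame df of hypothetical predictors and an assumed cutoff of |r| > 0.8.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
df = pd.DataFrame(rng.normal(size=(200, 4)), columns=list("abcd"))
df["e"] = df["a"] + rng.normal(scale=0.2, size=200)   # engineered high correlation

corr = df.corr()
upper = np.triu(np.ones_like(corr, dtype=bool), k=1)  # upper triangle, no diagonal
pairs = corr.where(upper).stack()                     # Series: (var1, var2) -> r
print(pairs[pairs.abs() > 0.8])                       # flag the highly correlated pairs
```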