Data Science using Python and R - 12 Flashcards

1
Q

What is high dimensionality in data science?

A

High dimensionality refers to a data set with a large number of predictors; for example, 100 predictors describe a 100-dimensional space.

2
Q

What is multicollinearity?

A

Multicollinearity occurs when there is substantial correlation among predictor variables, leading to unstable regression models.

3
Q

What is double-counting in the context of regression models?

A

Double-counting occurs when highly correlated predictors overemphasize a particular aspect of the model.

4
Q

What is the curse of dimensionality?

A

As dimensionality increases, the volume of the predictor space grows exponentially, making the high-dimensional space sparse.

5
Q

What does the principle of parsimony state?

A

The principle of parsimony suggests that models should be simple and interpretable, keeping the number of predictors manageable.

6
Q

What is overfitting in regression models?

A

Overfitting occurs when too many predictors are included in the model, so that it fits noise in the training data and fails to generalize to new data.

7
Q

What is the risk of missing the bigger picture in data analysis?

A

Focusing solely on individual predictors may overlook the fundamental relationships among them, which can be grouped into components.

8
Q

What are the three main objectives of dimension reduction methods?

A
  • Reduce the number of predictor items
  • Ensure that these predictor items are uncorrelated
  • Provide a framework for interpretability of the results
9
Q

What does multicollinearity lead to in regression analysis?

A

Multicollinearity leads to instability in the solution space, causing unreliable regression coefficients.

10
Q

What happens to regression coefficients when predictors are correlated?

A

The coefficients can vary widely across different samples, making them unreliable for interpretation.

11
Q

How can variance inflation factors (VIFs) indicate multicollinearity?

A

A large VIF indicates that a predictor is highly correlated with other predictors, with VIF ≥ 5 indicating moderate and VIF ≥ 10 indicating severe multicollinearity.

12
Q

What is the formula for calculating VIF?

A

VIF_i = 1 / (1 - R_i^2), where R_i^2 is the R-squared value obtained by regressing predictor i on the other predictors.
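
A minimal Python sketch of this formula, assuming X is a pandas DataFrame of numeric predictors (a hypothetical name):

import pandas as pd
from sklearn.linear_model import LinearRegression

def vif(X: pd.DataFrame) -> pd.Series:
    """Compute VIF_i = 1 / (1 - R_i^2) for each predictor."""
    vifs = {}
    for col in X.columns:
        others = X.drop(columns=col)
        # R_i^2 from regressing predictor i on the remaining predictors
        r_squared = LinearRegression().fit(others, X[col]).score(others, X[col])
        vifs[col] = 1.0 / (1.0 - r_squared)
    return pd.Series(vifs)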

13
Q

What does a VIF of 6.85 indicate?

A

A VIF of 6.85 indicates moderate-to-strong multicollinearity for the predictor variable.

14
Q

What is principal components analysis (PCA)?

A

PCA seeks to account for the correlation structure of a set of predictor variables using a smaller set of uncorrelated linear combinations, called components.

15
Q

What is the significance of the first principal component?

A

The first principal component accounts for the greatest variability among the predictors.

16
Q

True or False: PCA considers the target variable during analysis.

A

False. PCA acts solely on the predictor variables and ignores the target variable.

17
Q

What should be done to predictors before applying PCA?

A

The predictors should be either standardized or normalized.
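
A minimal scikit-learn sketch, assuming X is a pandas DataFrame of numeric predictors (a hypothetical name):

from sklearn.preprocessing import StandardScaler

# Standardize so each predictor has mean 0 and standard deviation 1
X_z = StandardScaler().fit_transform(X)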

18
Q

Fill in the blank: The total variability produced by the complete set of m predictors can often be mostly accounted for by a smaller set of k < m __________.

A

[components]

19
Q

What is PCA?

A

Principal Component Analysis (PCA) is a technique for dimension reduction.

20
Q

What does PCA act on?

A

PCA acts solely on the predictor variables and ignores the target variable.

21
Q

What is the characteristic of the first principal component?

A

The first principal component accounts for greater variability among the predictors than any other component.

22
Q

How does the second principal component relate to the first?

A

The second principal component accounts for the second-most variability and is uncorrelated with the first.

23
Q

What is the purpose of varimax rotation in PCA?

A

Varimax rotation helps in the interpretability of the components.

24
Q

What is the cumulative variance explained by the first two components in the example?

A

The first two components account for about 52.2% of the variance.

25
Q

What does the eigenvalue criterion suggest for extracting components?

A

Only components with eigenvalues greater than one should be retained.
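
A sketch of the criterion in Python, assuming X_z holds the standardized predictors (a hypothetical name); on standardized data, PCA's explained variances are approximately the eigenvalues of the correlation matrix:

from sklearn.decomposition import PCA

pca = PCA().fit(X_z)
eigenvalues = pca.explained_variance_
k = int((eigenvalues > 1).sum())  # retain the k components with eigenvalue > 1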

26
Q

What is the eigenvalue of 1.0 indicative of?

A

An eigenvalue of 1.0 indicates that the component explains about ‘one predictor’s worth’ of variability.

27
Q

What is the proportion of variance explained criterion?

A

The criterion specifies the proportion of total variability that the principal components should account for.

28
Q

How many components should be extracted based on the consensus in the example?

A

k = 4 components should be extracted.

29
Q

What does the principal component matrix indicate after rotation?

A

The rotated component matrix provides clearer interpretations of the principal components.

30
Q

What should be done to validate PCA results?

A

The PCA results should be validated using the test data set.

31
Q

What is indicated by VIFs equal to 1 in the regression of Sales per Visit on principal components?

A

VIFs of 1 for all components indicate that multicollinearity has been eliminated.

32
Q

What is the purpose of standardizing predictor variables before PCA?

A

Standardizing ensures that all predictors contribute equally to the analysis.

33
Q

What command is used to perform PCA in Python?

A

The PCA() command from the sklearn.decomposition module is used.
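
A minimal usage sketch, assuming X_z holds the standardized predictors and k = 4 components are retained, as in the example:

from sklearn.decomposition import PCA

pca = PCA(n_components=4)
scores = pca.fit_transform(X_z)        # component scores, one column per component
print(pca.explained_variance_ratio_)   # proportion of variance per component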

34
Q

What is the significance of the cutoff in PCA loadings in R?

A

The cutoff suppresses small PCA weights to enhance interpretability.

35
Q

What is the first step in performing PCA using R?

A

Import the required data sets and separate the predictor variables.

36
Q

True or False: PCA can increase multicollinearity in the data.

A

False. PCA produces uncorrelated components, so it removes multicollinearity rather than increasing it.

37
Q

What does the command ‘scale()’ do in R?

A

It standardizes the predictor variables.

38
Q

Fill in the blank: The first principal component is a combination of _______.

A

Different Items Purchased and Purchase Visits

39
Q

What is the minimum number of predictors needed for the eigenvalue criterion to be reliable?

A

At least 20 predictors; with fewer, the eigenvalue criterion tends to extract too few components.

40
Q

What is the purpose of the varimax rotation in PCA?

A

To simplify the interpretation of the components by maximizing variance among them.

Varimax rotation is a method used in factor analysis to achieve a simpler and more interpretable factor structure.
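
scikit-learn has no built-in varimax, so the NumPy sketch below is one common way to rotate a loadings matrix (predictors x components); the factor_analyzer package's Rotator is another option:

import numpy as np

def varimax(loadings, gamma=1.0, max_iter=20, tol=1e-6):
    # Iteratively find the orthogonal rotation that maximizes the
    # variance of the squared loadings within each component
    p, k = loadings.shape
    R = np.eye(k)
    d = 0.0
    for _ in range(max_iter):
        L = loadings @ R
        u, s, vh = np.linalg.svd(
            loadings.T @ (L**3 - (gamma / p) * L @ np.diag(np.diag(L.T @ L)))
        )
        R = u @ vh
        d_old, d = d, s.sum()
        if d_old != 0 and d / d_old < 1 + tol:
            break
    return loadings @ R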

41
Q

What command is used to obtain the correlation of the components in the training data set?

A

round(cor(scores), 2), where scores (a placeholder name) holds the PCA component scores for the training set

This command calculates and rounds the correlation matrix of the component scores; off-diagonal values near 0 confirm the components are uncorrelated.

42
Q

What does the VIF stand for in regression analysis?

A

Variance Inflation Factor

VIF measures how much the variance of an estimated regression coefficient increases when your predictors are correlated.

43
Q

True or False: Multicollinearity adversely affects the ability of the sample regression equation to predict the response variable.

A

False

Multicollinearity does not significantly impact the predictive ability of the regression model.

44
Q

What should the use of a multicollinear model be strictly limited to?

A

Estimation and prediction of the target variable

Interpretation of the model is inappropriate, because multicollinearity renders the individual coefficients nonsensical.

45
Q

What is the first step in running a regression model after obtaining PCA scores?

A

Save each component as its own variable

This allows for the use of PCA components as predictors in the regression model.
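
A sketch of this step, assuming scores came from pca.fit_transform() and y holds the target variable (hypothetical names):

import pandas as pd
from sklearn.linear_model import LinearRegression

# Save each component score as its own named column
comps = pd.DataFrame(scores,
                     columns=[f"PC{i+1}" for i in range(scores.shape[1])])
model = LinearRegression().fit(comps, y)
# Because the components are uncorrelated, their VIFs all equal 1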

46
Q

Fill in the blank: PCA replaces the original set of m predictors with _______.

A

principal components

Principal components are linear combinations of the original variables that capture the maximum variance.

47
Q

What does the eigenvalue criterion help determine in PCA?

A

The number of components to retain based on eigenvalues greater than 1

Components with eigenvalues greater than 1 explain more variance than a single original variable.

48
Q

What is the proportion of variance explained criterion in PCA?

A

It helps determine the number of components to retain by explaining a specified percentage of total variance.

Common thresholds are 70%, 80%, or 90% of total variance.
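
A sketch of applying an 80% threshold, assuming pca was already fit on the standardized predictors (a hypothetical name):

import numpy as np

cum_var = np.cumsum(pca.explained_variance_ratio_)
# Smallest k whose components together explain at least 80% of the variance
k = int(np.argmax(cum_var >= 0.80)) + 1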

49
Q

What does a high VIF indicate in regression analysis?

A

That multicollinearity is a problem

A VIF value above 5 or 10 is often considered indicative of multicollinearity.

50
Q

What is the relationship between the first and other principal components?

A

The other principal components are uncorrelated with the first, each accounting for successively less of the variability.

Because the components are mutually uncorrelated by construction, regressing on them yields VIFs equal to 1.

51
Q

How is multicollinearity defined in regression analysis?

A

The presence of high correlations among predictor variables

Multicollinearity can inflate the variance of coefficient estimates, making them unstable.

52
Q

What does PCA stand for?

A

Principal Component Analysis

PCA is a technique used for dimensionality reduction while preserving as much variance as possible.

53
Q

Which command is used to run a linear regression model in R?

A

lm()

The lm() function is used to fit linear models in R.

54
Q

What is the significance of the output from the vif() command?

A

It shows the variance inflation factor for each predictor in the regression model.

The vif() function, from R's car package, helps identify potential multicollinearity among predictors.

55
Q

What should be done to predictors before running PCA?

A

Standardize or normalize them

Standardization ensures that each predictor contributes equally to the analysis.

56
Q

What does the correlation matrix show in PCA analysis?

A

The relationships between predictor variables

High correlations among variables indicate redundancy and potential multicollinearity.

57
Q

What is the target variable in the analysis of the red wine dataset?

A

Wine quality

This dataset typically includes various chemical properties of red wine as predictors.

58
Q

How can the number of components to retain be determined?

A

By combining recommendations from the eigenvalue criterion and the proportion of variance explained criterion.

This approach ensures a comprehensive evaluation of component significance.

59
Q

What does the term ‘high dimensionality’ refer to in data science?

A

Having a large number of features or predictors in a dataset

High dimensionality can lead to overfitting and increased computational cost.

60
Q

What is the main purpose of dimension reduction methods?

A

To reduce the number of predictors while retaining essential information

This helps improve model performance and interpretability.

61
Q

What does a plot of eigenvalues help determine?

A

How many components to retain based on their explained variance

The eigenvalue plot visually represents the variance accounted for by each component.
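
A minimal matplotlib sketch of such an eigenvalue (scree) plot, assuming pca was fit on the standardized predictors:

import matplotlib.pyplot as plt

components = range(1, len(pca.explained_variance_) + 1)
plt.plot(components, pca.explained_variance_, marker="o")
plt.axhline(1.0, linestyle="--")  # reference line for the eigenvalue criterion
plt.xlabel("Component")
plt.ylabel("Eigenvalue")
plt.show()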

62
Q

What is the correlation matrix used for in the context of predictors?

A

To identify highly correlated variables

Understanding correlations helps in diagnosing multicollinearity issues.
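
A one-line pandas sketch, assuming X is a DataFrame of predictors (a hypothetical name):

# Off-diagonal values near +/-1 flag candidates for multicollinearity
print(X.corr().round(2))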