Topics 5-9 Flashcards

1
Q

What are the popular subset selection methods, and when do you use them?

A
  • Best subset selection
  • Stepwise selection

They are used when n is not much larger than p: fitting least squares with all p predictors then gives high variance and poor test error, so selecting a smaller set of predictors helps.

2
Q

When can forward stepwise selection be used but backward stepwise cannot?

A

When n < p: backward stepwise starts from the full model containing all p predictors, and a least squares model with more predictors than observations cannot be fit. Forward stepwise starts from the null model and adds predictors one at a time, so it can still be used.

3
Q

What are the two approaches to estimating which subset model has the best prediction accuracy?

A

1: Indirectly estimate the test error by adjusting the training error to account for the bias due to overfitting (Cp, AIC, BIC, adjusted R^2)
2: Directly estimate the test error using cross-validation or the validation set approach
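A minimal sketch of the direct approach (2), assuming scikit-learn; the function name estimated_test_mse and the variable subset are illustrative, not from the source:

  import numpy as np
  from sklearn.linear_model import LinearRegression
  from sklearn.model_selection import cross_val_score

  # X: (n, p) predictor matrix, y: response vector, subset: column indices of one candidate model
  def estimated_test_mse(X, y, subset, k=10):
      # Estimate the test MSE of the least squares fit on this subset by k-fold cross-validation
      scores = cross_val_score(LinearRegression(), X[:, subset], y,
                               scoring="neg_mean_squared_error", cv=k)
      return -scores.mean()

  # Among the candidate subsets, choose the one with the smallest estimated test MSE.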

4
Q

Explain what Cp, AIC, BIC, and adjusted R^2 do, how they are related, and what special connections exist between them

A

For Cp, AIC, and BIC, a smaller value indicates a lower estimated test error
A higher adjusted R^2 indicates a lower estimated test error
BIC places a heavier penalty on the number of predictors than AIC (log n instead of 2 when n > 7), so it tends to select smaller models
Adjusted R^2 is not as well justified theoretically as the other three
For least squares models, Cp and AIC are proportional to each other, so they select the same model
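One common form of these criteria for a least squares model with d predictors (ISLR-style presentation, with $\hat\sigma^2$ an estimate of the error variance; treat the exact constants as an assumption of that presentation):

$$C_p = \frac{1}{n}\left(\mathrm{RSS} + 2d\hat\sigma^2\right), \qquad \mathrm{BIC} = \frac{1}{n}\left(\mathrm{RSS} + \log(n)\,d\hat\sigma^2\right), \qquad \text{Adjusted } R^2 = 1 - \frac{\mathrm{RSS}/(n-d-1)}{\mathrm{TSS}/(n-1)}$$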

5
Q

What is the advantage of finding the test error using CV over indirectly estimating it?

A

It makes fewer assumptions about the true underlying model and does not require an estimate of the error variance. The main reason for using AIC, BIC, Cp, and adjusted R^2 used to be limited computing power; now that computation is cheap, cross-validation is the more attractive option.

6
Q

What are the differences between Shrinkage methods and Subset Selection?

A

In subset selection you eliminate predictors from the model entirely; in shrinkage you keep all p predictors but shrink their coefficient estimates towards zero, most strongly for predictors that contribute little to the response.

7
Q

What are the 2 popular shrinkage methods?

A

- Ridge regression
- The lasso

8
Q

Explain to yourself how ridge regression works

A

It minimizes the RSS plus a penalty on the size of the coefficients, which constrains the values the parameters can take and shrinks them towards zero (though not exactly to zero).
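Concretely, ridge regression solves, for a tuning parameter $\lambda \ge 0$:

$$\hat\beta^{R} = \arg\min_{\beta}\left\{\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 + \lambda\sum_{j=1}^{p}\beta_j^2\right\}$$

Larger λ means heavier shrinkage; λ = 0 recovers ordinary least squares.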

9
Q

What is a downside of ridge regression?

A

It will never shrink a coefficient exactly to zero (unless λ = ∞), so the final model always contains all p predictors, which makes the model harder to interpret (inference).

10
Q

When is ridge regression preferable, and when is the lasso?

A

When many of the predictors are not actually related to the response, the lasso tends to perform better; when all (or most) of the predictors are useful, ridge regression tends to perform better.
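For comparison, the lasso replaces the ridge penalty $\lambda\sum_j\beta_j^2$ with $\lambda\sum_j|\beta_j|$; this penalty is what allows some coefficients to be set exactly to zero:

$$\hat\beta^{L} = \arg\min_{\beta}\left\{\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 + \lambda\sum_{j=1}^{p}|\beta_j|\right\}$$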

11
Q

What are the main methods to improve OLS fitting?

A

Subset Selection
Shrinkage Methods
Dimension Reduction

12
Q

Effect of the Tuning Parameter (λ) in Ridge and Lasso:

A

Ridge Regression:
As λ increases, coefficients shrink towards zero but are never exactly zero.
Controls multicollinearity and improves prediction accuracy for large p
Lasso Regression:
As λ increases, some coefficients are shrunk exactly to zero, promoting sparsity and performing variable selection.
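A minimal sketch of this effect, assuming scikit-learn (which calls the tuning parameter alpha rather than λ) and simulated data where only a few predictors truly matter:

  import numpy as np
  from sklearn.datasets import make_regression
  from sklearn.linear_model import Lasso, Ridge

  X, y = make_regression(n_samples=100, n_features=10, n_informative=3, noise=5.0, random_state=0)

  for lam in (0.01, 1.0, 100.0):
      ridge = Ridge(alpha=lam).fit(X, y)
      lasso = Lasso(alpha=lam).fit(X, y)
      # Ridge coefficients shrink but stay non-zero; the lasso zeroes out more of them as λ grows
      print(lam, np.sum(ridge.coef_ == 0), np.sum(lasso.coef_ == 0))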

13
Q

What is an internal node, and what is a terminal node?

A

Internal Node: Represents a split in the data based on a predictor variable and a threshold. It partitions the predictor space into two regions.
Terminal Node (Leaf): Represents the end of a branch in the tree, where predictions are made. For a regression tree the prediction is the mean response of the training observations that fall into that region; for a classification tree it is the most common class in that region.
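A minimal sketch, assuming scikit-learn, that prints a small fitted regression tree so the internal splits and terminal-node predictions are visible:

  from sklearn.datasets import make_regression
  from sklearn.tree import DecisionTreeRegressor, export_text

  X, y = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=0)
  tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)

  # Internal nodes appear as "feature_j <= threshold" splits; terminal nodes appear as
  # "value: [...]", the mean response of the training observations in that region.
  print(export_text(tree))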

14
Q

What are the pros and cons of tree methods?

A

Pros:
Easy to explain and interpret.
Handles non-linear relationships well.
Works with qualitative predictors without creating dummy variables.
Cons:
Prone to overfitting.
High variance: small changes in data can lead to different trees.
Generally less accurate than other, more advanced methods.

15
Q

What is the criterion used in each splitting (classification trees)?

A

Gini Index: Measures node purity. Smaller values indicate a purer node.
Cross-Entropy: Measures the uncertainty in node class probabilities.
Classification Error Rate: Proportion of misclassified observations (less commonly used for growing the tree, since it is not sensitive enough to changes in node purity).
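For a node (region) m with K classes, where $\hat p_{mk}$ is the proportion of training observations in class k:

$$G = \sum_{k=1}^{K}\hat p_{mk}(1-\hat p_{mk}), \qquad D = -\sum_{k=1}^{K}\hat p_{mk}\log\hat p_{mk}, \qquad E = 1 - \max_k \hat p_{mk}$$

(Gini index, cross-entropy, and classification error rate respectively; G and D are both small when the node is nearly pure.)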

16
Q

How to conduct tree pruning?

A

Grow a large tree: Fit a tree with many terminal nodes (leaves).
Cost Complexity Pruning:
Calculate a sequence of subtrees indexed by α, the penalty parameter for tree complexity.
Use cross-validation to choose the subtree with the lowest test error.
Select the optimal subtree and refit it on the full training data.
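A minimal sketch of cost complexity pruning with scikit-learn, where the penalty parameter α is called ccp_alpha:

  import numpy as np
  from sklearn.datasets import load_breast_cancer
  from sklearn.model_selection import cross_val_score
  from sklearn.tree import DecisionTreeClassifier

  X, y = load_breast_cancer(return_X_y=True)

  # 1. Grow a large tree and get the sequence of alphas indexing its nested subtrees
  path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

  # 2. Cross-validate each candidate alpha and keep the best one
  cv_scores = [cross_val_score(DecisionTreeClassifier(ccp_alpha=a, random_state=0), X, y, cv=5).mean()
               for a in path.ccp_alphas]
  best_alpha = path.ccp_alphas[int(np.argmax(cv_scores))]

  # 3. Refit the pruned tree on the full training data
  final_tree = DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=0).fit(X, y)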

17
Q

Do random forests perform better than bagging?

A

Yes, typically, because decorrelating the trees leads to a greater reduction in variance, which improves predictive accuracy.

18
Q

What value of m (number of predictors) should be used in random forests?

A

For classification: m ≈ √p
For regression: m ≈ p/3
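In scikit-learn this corresponds to the max_features argument of the random forest estimators; a minimal sketch:

  from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

  # Classification: consider about √p predictors at each split
  clf = RandomForestClassifier(n_estimators=500, max_features="sqrt", random_state=0)

  # Regression: consider about p/3 predictors at each split (a float is interpreted as a fraction of p)
  reg = RandomForestRegressor(n_estimators=500, max_features=1/3, random_state=0)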

19
Q

What is a margin?

A

The margin is the perpendicular distance between the separating hyperplane and the closest data points (called support vectors).
The goal of SVM is to maximize this margin to ensure a robust boundary between classes.

20
Q

What is a maximal margin classifier?

A

The maximal margin classifier finds the hyperplane that maximizes the margin between two classes.
It works under the assumption that the data is linearly separable.

21
Q

What are the support vectors of the support vector classifier?

A

Support vectors for the support vector classifier include:
Points lying on the margin.
Points violating the margin (on the wrong side).
Points misclassified (on the wrong side of the hyperplane).

22
Q

In using SVM with a polynomial kernel, what are the tuning parameters?

A

C: Regularization parameter.
d: Degree of the polynomial.
c: Constant term in the kernel function.
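In scikit-learn's SVC these correspond to the C, degree, and coef0 arguments; a minimal sketch:

  from sklearn.svm import SVC

  # Polynomial kernel: K(x, x') = (gamma * <x, x'> + coef0) ** degree
  svm_poly = SVC(kernel="poly",
                 C=1.0,      # regularization parameter (smaller C = stronger regularization, more margin violations allowed)
                 degree=3,   # d: degree of the polynomial
                 coef0=1.0)  # c: constant term in the kernel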

23
Q

What is the relationship between maximal margin classifier, support vector classifier, and SVM?

A

Maximal Margin Classifier:
For linearly separable data.
Maximizes the margin without allowing misclassification.
Support Vector Classifier:
Extends the maximal margin classifier to non-separable cases using a soft margin.
Linear decision boundary with slack variables.
SVM:
Extends the support vector classifier to handle non-linear decision boundaries using kernel functions.