Topics 5-9 Flashcards

1
Q

What are the popular subset selection methods, and when do you use them?

A
  • Best subset selection
  • Stepwise selection

They are used when n is not much larger than p: fitting least squares with all p predictors then gives high variance and poor test error, so selecting a smaller set of predictors helps.

2
Q

When can forward stepwise selection be used but backward stepwise cannot?

A

When n < p: backward stepwise starts from the full model containing all p predictors, and a least squares model with more predictors than observations cannot be fit. Forward stepwise starts from the null model and adds predictors one at a time, so it can still be used.

3
Q

What are the two approaches to estimating which subset model has the best prediction accuracy?

A

1: Indirectly estimate the test error by adjusting the training error to account for the bias due to overfitting (Cp, AIC, BIC, adjusted R^2)
2: Directly estimate the test error using cross-validation or the validation set approach
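A minimal sketch of the direct approach (2), assuming scikit-learn; the function name estimated_test_mse and the variable subset are illustrative, not from the source:

  import numpy as np
  from sklearn.linear_model import LinearRegression
  from sklearn.model_selection import cross_val_score

  # X: (n, p) predictor matrix, y: response vector, subset: column indices of one candidate model
  def estimated_test_mse(X, y, subset, k=10):
      # Estimate the test MSE of the least squares fit on this subset by k-fold cross-validation
      scores = cross_val_score(LinearRegression(), X[:, subset], y,
                               scoring="neg_mean_squared_error", cv=k)
      return -scores.mean()

  # Among the candidate subsets, choose the one with the smallest estimated test MSE.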

4
Q

Explain what Cp, AIC, BIC, and adjusted R^2 do, how they are related, and what special connections exist between them

A

For Cp, AIC, and BIC, a smaller value indicates a lower estimated test error
A higher adjusted R^2 indicates a lower estimated test error
BIC places a heavier penalty on the number of predictors than AIC (log n instead of 2 when n > 7), so it tends to select smaller models
Adjusted R^2 is not as well justified theoretically as the other three
For least squares models, Cp and AIC are proportional to each other, so they select the same model
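One common form of these criteria for a least squares model with d predictors (ISLR-style presentation, with $\hat\sigma^2$ an estimate of the error variance; treat the exact constants as an assumption of that presentation):

$$C_p = \frac{1}{n}\left(\mathrm{RSS} + 2d\hat\sigma^2\right), \qquad \mathrm{BIC} = \frac{1}{n}\left(\mathrm{RSS} + \log(n)\,d\hat\sigma^2\right), \qquad \text{Adjusted } R^2 = 1 - \frac{\mathrm{RSS}/(n-d-1)}{\mathrm{TSS}/(n-1)}$$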

5
Q

What is the advantage of finding the test error using CV over indirectly estimating it?

A

It makes fewer assumptions about the true underlying model and does not require an estimate of the error variance. The main reason for using AIC, BIC, Cp, and adjusted R^2 used to be limited computing power; now that computation is cheap, cross-validation is the more attractive option.

6
Q

What are the differences between Shrinkage methods and Subset Selection?

A

In subset selection you eliminate predictors from the model entirely; in shrinkage you keep all p predictors but shrink their coefficient estimates towards zero, most strongly for predictors that contribute little to the response.

7
Q

What are the 2 popular shrinkage methods?

A

- Ridge regression
- The lasso

8
Q

Explain to yourself how ridge regression works

A

It minimizes the RSS plus a penalty on the size of the coefficients, which constrains the values the parameters can take and shrinks them towards zero (though not exactly to zero).
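Concretely, ridge regression solves, for a tuning parameter $\lambda \ge 0$:

$$\hat\beta^{R} = \arg\min_{\beta}\left\{\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 + \lambda\sum_{j=1}^{p}\beta_j^2\right\}$$

Larger λ means heavier shrinkage; λ = 0 recovers ordinary least squares.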

9
Q

What is a downside of ridge regression?

A

It will never shrink a coefficient exactly to zero (unless λ = ∞), so the final model always contains all p predictors, which makes the model harder to interpret (inference).

10
Q

When is ridge regression preferable, and when is the lasso?

A

When many of the predictors are not actually related to the response, the lasso tends to perform better; when all (or most) of the predictors are useful, ridge regression tends to perform better.
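For comparison, the lasso replaces the ridge penalty $\lambda\sum_j\beta_j^2$ with $\lambda\sum_j|\beta_j|$; this penalty is what allows some coefficients to be set exactly to zero:

$$\hat\beta^{L} = \arg\min_{\beta}\left\{\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 + \lambda\sum_{j=1}^{p}|\beta_j|\right\}$$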

11
Q

What are the main methods to improve OLS fitting?

A

Subset Selection
Shrinkage Methods
Dimension Reduction

12
Q

Effect of the Tuning Parameter (λ) in Ridge and Lasso:

A

Ridge Regression:
As λ increases, coefficients shrink towards zero but are never exactly zero.
Controls multicollinearity and improves prediction accuracy for large p
Lasso Regression:
As λ increases, some coefficients are shrunk exactly to zero, promoting sparsity and performing variable selection.
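A minimal sketch of this effect, assuming scikit-learn (which calls the tuning parameter alpha rather than λ) and simulated data where only a few predictors truly matter:

  import numpy as np
  from sklearn.datasets import make_regression
  from sklearn.linear_model import Lasso, Ridge

  X, y = make_regression(n_samples=100, n_features=10, n_informative=3, noise=5.0, random_state=0)

  for lam in (0.01, 1.0, 100.0):
      ridge = Ridge(alpha=lam).fit(X, y)
      lasso = Lasso(alpha=lam).fit(X, y)
      # Ridge coefficients shrink but stay non-zero; the lasso zeroes out more of them as λ grows
      print(lam, np.sum(ridge.coef_ == 0), np.sum(lasso.coef_ == 0))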

13
Q

What is an internal node, and what is a terminal node?

A

Internal Node: Represents a split in the data based on a predictor variable and a threshold. It partitions the predictor space into two regions.
Terminal Node (Leaf): Represents the end of a branch in the tree, where predictions are made. For a regression tree the prediction is the mean response of the training observations that fall into that region; for a classification tree it is the most common class in that region.
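A minimal sketch, assuming scikit-learn, that prints a small fitted regression tree so the internal splits and terminal-node predictions are visible:

  from sklearn.datasets import make_regression
  from sklearn.tree import DecisionTreeRegressor, export_text

  X, y = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=0)
  tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)

  # Internal nodes appear as "feature_j <= threshold" splits; terminal nodes appear as
  # "value: [...]", the mean response of the training observations in that region.
  print(export_text(tree))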

14
Q

What are the pros and cons of tree methods?

A

Pros:
Easy to explain and interpret.
Handles non-linear relationships well.
Works with qualitative predictors without creating dummy variables.
Cons:
Prone to overfitting.
High variance: small changes in data can lead to different trees.
Generally less accurate than other, more advanced methods.

15
Q

What is the criterion used in each splitting (classification trees)?

A

Gini Index: Measures node purity. Smaller values indicate a purer node.
Cross-Entropy: Measures the uncertainty in node class probabilities.
Classification Error Rate: Proportion of misclassified observations (less commonly used for growing the tree, since it is not sensitive enough to changes in node purity).
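For a node (region) m with K classes, where $\hat p_{mk}$ is the proportion of training observations in class k:

$$G = \sum_{k=1}^{K}\hat p_{mk}(1-\hat p_{mk}), \qquad D = -\sum_{k=1}^{K}\hat p_{mk}\log\hat p_{mk}, \qquad E = 1 - \max_k \hat p_{mk}$$

(Gini index, cross-entropy, and classification error rate respectively; G and D are both small when the node is nearly pure.)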

16
Q

How to conduct tree pruning?

A

Grow a large tree: Fit a tree with many terminal nodes (leaves).
Cost Complexity Pruning:
Calculate a sequence of subtrees indexed by α, the penalty parameter for tree complexity.
Use cross-validation to choose the subtree with the lowest test error.
Select the optimal subtree and refit it on the full training data.
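A minimal sketch of cost complexity pruning with scikit-learn, where the penalty parameter α is called ccp_alpha:

  import numpy as np
  from sklearn.datasets import load_breast_cancer
  from sklearn.model_selection import cross_val_score
  from sklearn.tree import DecisionTreeClassifier

  X, y = load_breast_cancer(return_X_y=True)

  # 1. Grow a large tree and get the sequence of alphas indexing its nested subtrees
  path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

  # 2. Cross-validate each candidate alpha and keep the best one
  cv_scores = [cross_val_score(DecisionTreeClassifier(ccp_alpha=a, random_state=0), X, y, cv=5).mean()
               for a in path.ccp_alphas]
  best_alpha = path.ccp_alphas[int(np.argmax(cv_scores))]

  # 3. Refit the pruned tree on the full training data
  final_tree = DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=0).fit(X, y)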

17
Q

Do random forests perform better than bagging?

A

Yes, typically, because decorrelating the trees leads to a greater reduction in variance, which improves predictive accuracy.

18
Q

What value of m (number of predictors) should be used in random forests?

A

For classification: m ≈ √p
For regression: m ≈ p/3
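In scikit-learn this corresponds to the max_features argument of the random forest estimators; a minimal sketch:

  from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

  # Classification: consider about √p predictors at each split
  clf = RandomForestClassifier(n_estimators=500, max_features="sqrt", random_state=0)

  # Regression: consider about p/3 predictors at each split (a float is interpreted as a fraction of p)
  reg = RandomForestRegressor(n_estimators=500, max_features=1/3, random_state=0)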

19
Q

What is a margin?

A

The margin is the perpendicular distance between the separating hyperplane and the closest data points (called support vectors).
The goal of SVM is to maximize this margin to ensure a robust boundary between classes.

20
Q

What is a maximal margin classifier?

A

The maximal margin classifier finds the hyperplane that maximizes the margin between two classes.
It works under the assumption that the data is linearly separable.

21
Q

What are the support vectors of the support vector classifier?

A

Support vectors for the support vector classifier include:
Points lying on the margin.
Points violating the margin (on the wrong side).
Points misclassified (on the wrong side of the hyperplane).

22
Q

In using SVM with a polynomial kernel, what are the tuning parameters?

A

C: Regularization parameter.
d: Degree of the polynomial.
c: Constant term in the kernel function.
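In scikit-learn's SVC these correspond to the C, degree, and coef0 arguments; a minimal sketch:

  from sklearn.svm import SVC

  # Polynomial kernel: K(x, x') = (gamma * <x, x'> + coef0) ** degree
  svm_poly = SVC(kernel="poly",
                 C=1.0,      # regularization parameter (smaller C = stronger regularization, more margin violations allowed)
                 degree=3,   # d: degree of the polynomial
                 coef0=1.0)  # c: constant term in the kernel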

23
Q

What is the relationship between maximal margin classifier, support vector classifier, and SVM?

A

Maximal Margin Classifier:
For linearly separable data.
Maximizes the margin without allowing misclassification.
Support Vector Classifier:
Extends the maximal margin classifier to non-separable cases using a soft margin.
Linear decision boundary with slack variables.
SVM:
Extends the support vector classifier to handle non-linear decision boundaries using kernel functions.