Section 2: Specific Types of Models Flashcards

1
Q

Independence Assumption for LMs/GLMs

A

Given the predictor values, the observations of the target variable are independent (same for both LMs/GLMs)

2
Q

Target Distribution assumptions for LMs and GLMs

A

LMs: Given the predictor values, the target variable follows a normal distribution
GLMs: Given the predictor values, the target distribution is a member of the linear exponential family

3
Q

Mean assumptions for LMs and GLMs

A

LMs: the target mean directly equals the linear predictor (mu = B0 + B1X1 + … + BpXp)
GLMs: a function (the “link”) of the target mean equals the linear predictor (g(mu) = eta, where eta = B0 + B1X1 + … + BpXp)
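
A minimal R sketch of the two cases (the data frame dat and the variables y, x1, x2 are hypothetical):

# LM: identity link -- the target mean equals the linear predictor
lm_fit  <- glm(y ~ x1 + x2, family = gaussian(link = "identity"), data = dat)
# GLM: e.g., gamma target with a log link -- log(mu) equals the linear predictor eta
glm_fit <- glm(y ~ x1 + x2, family = Gamma(link = "log"), data = dat)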

4
Q

Variance assumptions for LMs and GLMs

A

LM: constant, regardless of the predictor values
GLM: varies with the target mean mu, and hence with the predictor values

5
Q

what is a target distribution?

A

A distribution in the linear exponential family; choose one that aligns with the characteristics of the target

6
Q

Important considerations when choosing a link function

A

1) ensure the predictions match the range of values of the target mean
2) ensure ease of interpretation (e.g., the log link)
3) canonical links make convergence more likely

7
Q

Common distributions

A

Normal, binomial, Poisson, gamma, inverse Gaussian, Tweedie

8
Q

Normal distribution variable type and common link

A

real-valued with a bell-shaped dist.

identity link

9
Q

Binomial variable type and common link

A

Binary (0/1)

logit link

10
Q

Poisson variable type and common link

A

Count (>=0, integers)

Log link

11
Q

Gamma, inverse Gaussian variable type and common link

A

positive, continuous with right skew

log link

12
Q

Tweedie variable type and common link

A

>= 0, continuous with a large probability mass at zero

log link

13
Q

methods for handling non-monotonic relations

A

GLMs, in their basic form, assume that numeric predictors have a monotonic relationship with the target variable

1) polynomial regression
2) binning
3) piecewise linear functions

14
Q

polynomial regression

A

add polynomial terms to the model equation

pros: can take care of more complex relationships between the target variable and predictors. the more polynomial terms included, the more flexible the fit

cons: a) coefficients become harder to interpret (all polynomial terms move together) b) usually no clear choice of the highest power; it can be tuned by CV
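
A minimal R sketch (the data frame dat and variables y, x1 are hypothetical):

# quadratic fit; the highest power is a tuning choice (e.g., by CV)
quad_fit <- glm(y ~ poly(x1, degree = 2), family = gaussian(), data = dat)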

15
Q

Binning

A

“bin” the numeric variable and convert it into a categorical variable with levels defined as non-overlapping intervals over the range of the original variable

pros: no definite order among the coefficients of the dummy variables corresponding to different bins -> target mean can vary highly irregularly over the bins

cons: a) usually no clear choice of the no. of bins and the associated boundaries b) results in a loss of information (exact values of the numeric predictor gone)
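
A minimal R sketch (dat, y, x1 hypothetical; the number of bins and their boundaries are judgment calls):

dat$x1_bin <- cut(dat$x1, breaks = 4)   # 4 equal-width intervals over the range of x1
bin_fit <- glm(y ~ x1_bin, family = gaussian(), data = dat)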

16
Q

adding piecewise linear functions

A

add features of the form (X - c)+ = max(X - c, 0), where c is a chosen break point

pros: a simple way to allow the relationship between a numeric variable and the target mean to vary over different intervals

cons: usually no clear choice of the break points
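
A minimal R sketch (dat, y, x1 hypothetical; c = 30 is an arbitrary break point):

dat$x1_hinge <- pmax(dat$x1 - 30, 0)    # the (X - c)+ feature
pw_fit <- glm(y ~ x1 + x1_hinge, family = gaussian(), data = dat)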

17
Q

Handling categorical predictors - binarization

A

how it works: a categorical predictor is converted into a collection of dummy (binary) variables, each indicating one and only one level; the dummy variables then serve as predictors in the model equation

18
Q

baseline level

A

the level at which all dummy variables equal 0

R’s default: the alphanumerically first level
Good practice: reset it to the most common level
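
A minimal R sketch (dat and the factor f are hypothetical):

dat$f <- relevel(dat$f, ref = names(which.max(table(dat$f))))   # most common level becomes the baseline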

19
Q

interactions

A

need to “manually” include interaction terms of the product form XiXj, where the coefficient of Xi will vary with the value of Xj
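
A minimal R sketch (dat, y, x1, x2 hypothetical):

int_fit <- glm(y ~ x1 * x2, family = gaussian(), data = dat)   # x1*x2 expands to x1 + x2 + x1:x2 (the product term)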

20
Q

interpretation of coefficients

A

coefficient estimates capture the effect (magnitude + direction) of features on the target mean

21
Q

p-value statistical significance

A

the smaller the p-value, the more significant the feature

22
Q

Offset: form of target variable and how they affect the mean/var of the target

A

form: aggregate (e.g., total number of claims in a group of similar policyholders)

effect: the target mean is directly proportional to the exposure

23
Q

Weights: form of target variable and how they affect the mean/var of the target

A

form: average (e.g., average number of claims in a group of similar policyholders)

effect: the variance is inversely related to the exposure; observations with a larger exposure play a more important role in model fitting
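
A minimal R sketch of the two setups (dat and the columns claims, claims_per_exposure, exposure, x1 are hypothetical):

# aggregate target: log(exposure) enters the linear predictor as an offset
agg_fit <- glm(claims ~ x1, family = poisson(link = "log"), offset = log(exposure), data = dat)
# average target: exposure enters as a weight
avg_fit <- glm(claims_per_exposure ~ x1, family = poisson(link = "log"), weights = exposure, data = dat)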

24
Q

stepwise selection

A

sequentially add/drop features, one at a time, until there is no improvement in the selection criterion

25
Q

Forward selection

A

start with intercept-only model, add variables until no improvement in model

tends to produce a simpler model

26
Q

backward selection

A

starts with full model, drop variables until no improvement

27
Q

selection criteria based on penalized likelihood

A

idea: prevent overfitting by requiring an included/retained feature to improve model fit by at least a specified amount

common choices are AIC and BIC

28
Q

AIC

A

AIC = -2l + 2(p + 1), where l is the maximized log-likelihood and p + 1 is the number of estimated parameters

penalty per parameter = 2

29
Q

BIC

A

BIC = -2l + ln(n) × (p + 1), where n is the number of training observations

penalty per parameter = ln(n)

30
Q

AIC vs BIC

A

for both, the lower the value, the better

BIC is more conservative and results in simpler models
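
A minimal R sketch of stepwise selection under either criterion (dat and y hypothetical; stepAIC() is in the MASS package):

library(MASS)
full_fit <- glm(y ~ ., family = gaussian(), data = dat)
aic_fit <- stepAIC(full_fit, direction = "backward", k = 2)               # AIC penalty
bic_fit <- stepAIC(full_fit, direction = "backward", k = log(nrow(dat)))  # BIC penalty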

31
Q

Manual Binarization

A

convert factor variables to dummy variables manually before running stepwise selection

pros: the ability to add/drop individual factor levels that are statistically significant/insignificant with respect to the baseline

cons: more steps in stepAIC() procedure; possibly non-intuitive results (e.g., only a few levels of a factor are retained)

32
Q

regularization

A

idea: reduce overfitting by shrinking the size of the coefficient estimates, especially those of non-predictive features

33
Q

how does regularization work?

A

maximize the training log-likelihood (equivalently, minimize the training deviance) adjusted by a penalty term that reflects the size of the coefficients, i.e., minimize
deviance + regularization penalty
this formulation strikes a balance between goodness of fit and model complexity

34
Q

common forms of penalty term

A

1) Lasso - some coef. may be zero
2) ridge regression - none reduced to zero
3) elastic net - some coef. may be zero

35
Q

Two hyperparameters of regularized regression

A

1) Lambda: regularization (a.k.a. shrinkage) parameter
2) alpha: mixing parameter

36
Q

lambda

A

a) controls the amount of regularization (bigger lambda, more shrinkage, less complexity, squared bias increases and variance decreases)
b) feature selection property: for elastic nets with alpha > 0 (lasso in particular) some coefficient estimates become exactly zero when lambda is large enough
c) typically tuned by CV: choose lambda with the smallest CV error

37
Q

Alpha

A

a) controls the mix between ridge (alpha=0) and lasso (alpha=1)
b) provided that lambda is large enough, increasing alpha from 0 to 1 makes more coefficient estimates zero
c) cannot be tuned by cv.glmnet(); need to tune manually
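
A minimal R sketch with the glmnet package (dat and y hypothetical; alpha is fixed manually, lambda is tuned by CV):

library(glmnet)
X <- model.matrix(y ~ ., data = dat)[, -1]    # binarization handled via the model matrix
cv_fit <- cv.glmnet(X, dat$y, family = "gaussian", alpha = 0.5)
cv_fit$lambda.min                             # lambda with the smallest CV error
pred <- predict(cv_fit, newx = X, s = "lambda.min")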

38
Q

Single decision Trees: basics

A

idea: divide the feature space into a set of non-overlapping regions containing relatively homogeneous observations w.r.t target

deliverable: a set of classification rules based on the values/levels of predictors and represented in the form of a “tree”

predictions: observations in the same terminal node share the same predicted mean (for numeric targets) or same predicted class (for categorical targets)

39
Q

recursive binary splitting

A

Greedy: at each step, adopt the split that leads to the greatest reduction in impurity at that point, instead of looking ahead and selecting a split that results in a better future step

top-down: start from the “top” of the tree, go “down” and sequentially partition the feature space in a series of splits

40
Q

node impurity measure properties

A

a) the smaller, the purer the observations in the node
b) Gini index and entropy are similar numerically
c) Gini index and entropy are more sensitive to node impurity than the classification error rate

41
Q

minbucket

A

minimum bucket size

min # of obs. in a terminal node

effect: higher, tree less complex

42
Q

cp

A

complexity parameter

the minimum reduction in the relative training error required for a split to be made

effect: higher, tree less complex

43
Q

maxdepth

A

maximum depth

no. of edges from root node to furthest node

effect: higher, tree more complex
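
A minimal R sketch setting all three control parameters (dat and y hypothetical; the values shown are illustrative):

library(rpart)
tree_fit <- rpart(y ~ ., data = dat, method = "anova",
                  control = rpart.control(minbucket = 5, cp = 0.01, maxdepth = 4))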

44
Q

What can be used to tune cp?

A

CV within rpart()

45
Q

what can be used to tune maxdepth and minbucket?

A

must be tuned by trial and error

46
Q

What can you comment on when interpreting trees?

A

1) number of tree splits
2) split sequence, e.g., start with X1, further split the larger bucket by X2
3) which are the most important predictors (usually those in the early splits)?
4) which terminal nodes have the most observations? any sparse nodes?
5) any prominent interactions?
6) (classification trees) combinations leading to the positive event

47
Q

Cost-complexity pruning rationale

A

to reduce tree complexity by pruning branches from the bottom that do not improve goodness of fit by a sufficient amount -> prevents overfitting and eases interpretation

48
Q

What is the process of cost-complexity pruning?

A

step 1) grow a large tree T0
step 2) minimize the penalized objective function = relative training error + cp × |T|, where |T| (the number of terminal nodes) measures tree complexity

training error: RSS for regression trees; number of misclassifications for classification trees

49
Q

what is the relationship between cp and the complexity of a tree?

A

as cp increases the tree is less complex (smaller)

50
Q

what is the alternative to choosing the cp value with the smallest CV error?

A

One-standard-error (1-SE) rule

how: choose the largest cp (i.e., the smallest tree) whose CV error is within one standard error of the minimum CV error -> a simpler and more interpretable tree with comparable prediction performance
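
A minimal R sketch continuing the hypothetical tree_fit above:

printcp(tree_fit)   # CV error (xerror) and its standard error (xstd) for each cp value
cp_min <- tree_fit$cptable[which.min(tree_fit$cptable[, "xerror"]), "CP"]
pruned <- prune(tree_fit, cp = cp_min)   # or choose the largest cp whose xerror is within 1 SE of the minimum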

51
Q

does transforming the target variable affect GLMs or Trees?

A

GLMs: Yes, the transformations alter the values of the predictors and target variable that go into the likelihood function

Trees: Yes, the transformations can alter the calculations of node impurity measures, e.g., RSS, that define the tree splits

52
Q

does transforming the predictors affect GLMs or Trees?

A

GLMs: Yes, by the same reasoning as for the target variable

Trees: Yes, unless the transformations are monotonic, e.g., log (monotonic transformations will not change the way tree splits are made.)

53
Q

random forests

A

(variance reduction) combine the results of multiple trees fitted to different bootstrapped training samples in parallel -> reduce variance of overall predictions

(randomization) take a random sample of predictors as candidates for each split -> reduce correlation between base trees -> further reduce variance of overall predictions

54
Q

key parameters of random forests

A

1) mtry: # of features sampled as candidates at each split

2) ntree: # of trees to be grown

55
Q

mtry general info

A

a) lower mtry -> greater variance reduction
b) common choice: sqrt(p) (classification) or p/3 (regression)
c) typically tuned by CV

56
Q

ntree general info

A

a) higher ntree, more variance reductions
b) often overfitting does not arise even if set to a large number
c) set to a relatively small value to save run time
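
A minimal R sketch (dat and y hypothetical; p denotes the number of predictors):

library(randomForest)
p <- ncol(dat) - 1
rf_fit <- randomForest(y ~ ., data = dat, mtry = floor(p / 3), ntree = 500)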

57
Q

Boosting

A

in each iteration, fit a tree to the residuals of the preceding tree and subtract a scaled-down version of the current tree’s predictions from the residuals to form the new residuals

each tree focuses on obs.’s the previous tree predicted poorly

58
Q

Key parameters of boosting

A

1) eta: learning rate (or shrinkage) parameter
2) nrounds: max # of rounds in the tree construction process

59
Q

eta general info

A

a) effects of eta: higher eta -> algorithm converges faster but is more prone to overfitting

b) rule of thumb: set to a relatively small value

60
Q

nrounds general info

A

a) effects of nrounds: higher nrounds -> algorithm learns better but is more prone to overfitting

b) rule of thumb: set to a relatively large value
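
A minimal R sketch with the xgboost package (dat and y hypothetical; a small eta paired with a large nrounds):

library(xgboost)
X <- model.matrix(y ~ ., data = dat)[, -1]
dtrain <- xgb.DMatrix(data = X, label = dat$y)
xgb_fit <- xgb.train(params = list(eta = 0.01, max_depth = 6, objective = "reg:squarederror"),
                     data = dtrain, nrounds = 1000)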

61
Q

fitting process for random forest vs boosting

A

random forest: in parallel

boosting: in series (sequential)

62
Q

focus for random forest vs boosting

A

random forest: variance

boosting: bias

63
Q

overfitting for random forest vs boosting

A

random forest: less vulnerable

boosting: more vulnerable

64
Q

hyperparameter for random forest vs boosting

A

random forest: less sensitive

boosting: more sensitive

65
Q

what are the two interpretation tools for ensemble trees?

A

variable importance plots and partial dependence plots

66
Q

Variable importance plots

A

definition of importance scores: the total drop in node impurity (RSS for regression trees and Gini index for classification trees) due to splits over a given predictor, averaged over all base trees

use: to identify important variables (those with a large score)

limitation: unclear how important variables affect the target

67
Q

partial dependence plots

A

definition of partial dependence: model predictions obtained after averaging the values/levels of variables not of interest

use: Plot PD(X1) against various X1 to show the marginal effect of X1 on the target variable

limitations: a) assume predictor of interest is independent of other predictors
b) some predictions may be based on practically unreasonable combinations of predictor values
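
A minimal R sketch for a random forest (continuing the hypothetical rf_fit above; x1 is a hypothetical predictor):

varImpPlot(rf_fit)                                   # variable importance scores
partialPlot(rf_fit, pred.data = dat, x.var = "x1")   # partial dependence of the target on x1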

68
Q

Pros of GLMS

A

1) (target distribution) GLMs excel in accommodating a wide variety of distributions for the target variable
2) (interpretability) the model equation clearly shows how the target mean depends on the features; coefficients = interpretable measure of directional effects of features
3) (implementation) simple to implement

69
Q

Cons of GLMS

A

1) (complex relationships) unable to capture non-monotonic (e.g., polynomial) or non-additive relationships (e.g., interaction), unless additional features are manually incorporated
2) (interpretability) for some link functions (e.g., inverse link), the coefficients may be difficult to interpret

70
Q

Pros of regularized GLMs

A

1) (categorical predictors) via the use of model matrices, binarization of categorical variables is done automatically, and each factor level is treated as a separate feature that can be retained or removed
2) (Tuning) an elastic net can be tuned by CV using the same criterion (e.g., MSE, accuracy) ultimately used to judge the model against unseen test data
3) (variable selection) for elastic nets with alpha > 0, variable selection can be done by making lambda large enough

71
Q

Cons of regularized GLMs

A

1) (categorical predictors) possible to see some non-intuitive or nonsensical results when only a handful of the levels of a categorical predictor are selected
2) (target distribution) limited/restricted model forms allowed by glmnet() (WEAK POINT!)
3) (interpretability) coefficient estimates are more difficult to interpret when variables are standardized (WEAK POINT!)

72
Q

pros of single trees

A

1) (interpretability) if there are not too many buckets, trees are easy to interpret because of the if/then nature of the classification rules and their graphical representation
2) (complex relationships) trees excel in handling non-monotonic and non-additive relationships without the need to insert extra features manually
3) (categorical variables) categorical predictors are automatically handled by separating their levels into two groups without the need for binarization
4) (variable selection) variables are automatically selected as part of the model-building process. Variables that do not appear in the tree are filtered out and the most important variables show up at the top of the tree

73
Q

cons of single trees

A

1) (overfitting) strongly dependent on training data (prone to overfitting) -> predictions unstable with a high variance -> lower user confidence
2) (numeric variables) usually need to split based on a numeric predictor repeatedly to capture its effect effectively -> tree becomes large, difficult to interpret
3) (categorical variables) tend to favor categorical predictors with a large no. of levels

74
Q

pros of ensemble trees

A

1) much more robust and predictive than base trees by combining the results of multiple trees

75
Q

cons of ensemble trees

A

1) Opaque (“black box”), difficult to interpret
2) computationally prohibitive to implement