Section 2: Specific Types of Models Flashcards

1
Q

Independence Assumption for LMs/GLMs

A

Given the predictor values, the observations of the target variable are independent (same for both LMs/GLMs)

2
Q

Target Distribution assumptions for LMs and GLMs

A

LMs: Given the predictor values, the target variable follows a normal distribution
GLMs: Given the predictor values, the target distribution is a member of the linear exponential family

3
Q

Mean assumptions for LMs and GLMs

A

LMs: the target mean directly equals the linear predictor (mu = B0 + B1X1 + … + BpXp)
GLMs: a function (the “link”) of the target mean equals the linear predictor (g(mu) = eta, where eta = B0 + B1X1 + … + BpXp)
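
A minimal R sketch of the two cases (the data frame dat and the variables y, x1, x2 are hypothetical):

# LM: identity link -- the target mean equals the linear predictor
lm_fit  <- glm(y ~ x1 + x2, family = gaussian(link = "identity"), data = dat)
# GLM: e.g., gamma target with a log link -- log(mu) equals the linear predictor eta
glm_fit <- glm(y ~ x1 + x2, family = Gamma(link = "log"), data = dat)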

4
Q

Variance assumptions for LMs and GLMs

A

LM: constant, regardless of the predictor values
GLM: varies with the target mean mu, and hence with the predictor values

5
Q

what is a target distribution?

A

A distribution in the linear exponential family; choose one that aligns with the characteristics of the target

6
Q

Important considerations when choosing a link function

A

1) ensure the predictions match the range of values of the target mean
2) ensure ease of interpretation (e.g., the log link)
3) canonical links make convergence more likely

7
Q

Common distributions

A

Normal, binomial, Poisson, gamma, inverse Gaussian, Tweedie

8
Q

Normal distribution variable type and common link

A

real-valued with a bell-shaped dist.

identity link

9
Q

Binomial variable type and common link

A

Binary (0/1)

logit link

10
Q

Poisson variable type and common link

A

Count (>=0, integers)

Log link

11
Q

Gamma, inverse Gaussian variable type and common link

A

positive, continuous with right skew

log link

12
Q

Tweedie variable type and common link

A

>= 0, continuous with a large probability mass at zero

log link

13
Q

methods for handling non-monotonic relations

A

GLMs, in their basic form, assume that numeric predictors have a monotonic relationship with the target variable

1) polynomial regression
2) binning
3) piecewise linear functions

14
Q

polynomial regression

A

add polynomial terms to the model equation

pros: can take care of more complex relationships between the target variable and predictors. the more polynomial terms included, the more flexible the fit

cons: a) coefficients become harder to interpret (all polynomial terms move together) b) usually no clear choice of the highest power; it can be tuned by CV
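
A minimal R sketch (the data frame dat and variables y, x1 are hypothetical):

# quadratic fit; the highest power is a tuning choice (e.g., by CV)
quad_fit <- glm(y ~ poly(x1, degree = 2), family = gaussian(), data = dat)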

15
Q

Binning

A

“bin” the numeric variable and convert it into a categorical variable with levels defined as non-overlapping intervals over the range of the original variable

pros: no definite order among the coefficients of the dummy variables corresponding to different bins -> target mean can vary highly irregularly over the bins

cons: a) usually no clear choice of the no. of bins and the associated boundaries b) results in a loss of information (exact values of the numeric predictor gone)
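
A minimal R sketch (dat, y, x1 hypothetical; the number of bins and their boundaries are judgment calls):

dat$x1_bin <- cut(dat$x1, breaks = 4)   # 4 equal-width intervals over the range of x1
bin_fit <- glm(y ~ x1_bin, family = gaussian(), data = dat)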

16
Q

adding piecewise linear functions

A

add features of the form (X - c)+ = max(X - c, 0), where c is a chosen break point

pros: a simple way to allow the relationship between a numeric variable and the target mean to vary over different intervals

cons: usually no clear choice of the break points
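
A minimal R sketch (dat, y, x1 hypothetical; c = 30 is an arbitrary break point):

dat$x1_hinge <- pmax(dat$x1 - 30, 0)    # the (X - c)+ feature
pw_fit <- glm(y ~ x1 + x1_hinge, family = gaussian(), data = dat)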

17
Q

Handling categorical predictors - binarization

A

how it works: a categorical predictor is converted into a collection of dummy (binary) variables, each indicating one and only one level; the dummy variables then serve as predictors in the model equation

18
Q

baseline level

A

the level at which all dummy variables equal 0

R’s default: the alphanumerically first level
Good practice: reset it to the most common level
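
A minimal R sketch (dat and the factor f are hypothetical):

dat$f <- relevel(dat$f, ref = names(which.max(table(dat$f))))   # most common level becomes the baseline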

19
Q

interactions

A

need to “manually” include interaction terms of the product form XiXj, where the coefficient of Xi will vary with the value of Xj
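
A minimal R sketch (dat, y, x1, x2 hypothetical):

int_fit <- glm(y ~ x1 * x2, family = gaussian(), data = dat)   # x1*x2 expands to x1 + x2 + x1:x2 (the product term)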

20
Q

interpretation of coefficients

A

coefficient estimates capture the effect (magnitude + direction) of features on the target mean

21
Q

p-value statistical significance

A

the smaller the p-value, the more significant the feature

22
Q

Offset: form of target variable and how they affect the mean/var of the target

A

form: aggregate (e.g., total number of claims in a group of similar policyholders)

effect: the target mean is directly proportional to the exposure

23
Q

Weights: form of target variable and how they affect the mean/var of the target

A

form: average (e.g., average number of claims in a group of similar policyholders)

effect: the variance is inversely related to the exposure; observations with a larger exposure play a more important role in model fitting
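
A minimal R sketch of the two setups (dat and the columns claims, claims_per_exposure, exposure, x1 are hypothetical):

# aggregate target: log(exposure) enters the linear predictor as an offset
agg_fit <- glm(claims ~ x1, family = poisson(link = "log"), offset = log(exposure), data = dat)
# average target: exposure enters as a weight
avg_fit <- glm(claims_per_exposure ~ x1, family = poisson(link = "log"), weights = exposure, data = dat)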

24
Q

stepwise selection

A

sequentially add/drop features, one at a time, until there is no improvement in the selection criterion

25
Q

Forward selection

A

start with intercept-only model, add variables until no improvement in model

tends to produce a simpler model

26
Q

backward selection

A

starts with full model, drop variables until no improvement

27
Q

selection criteria based on penalized likelihood

A

idea: prevent overfitting by requiring an included/retained feature to improve model fit by at least a specified amount

common choices are AIC and BIC

28
Q

AIC

A

AIC = -2l + 2(p + 1), where l is the maximized log-likelihood and p + 1 is the number of estimated parameters

penalty per parameter = 2

29
Q

BIC

A

BIC = -2l + ln(n) × (p + 1), where n is the number of training observations

penalty per parameter = ln(n)

30
Q

AIC vs BIC

A

for both, the lower the value, the better

BIC is more conservative and results in simpler models
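
A minimal R sketch of stepwise selection under either criterion (dat and y hypothetical; stepAIC() is in the MASS package):

library(MASS)
full_fit <- glm(y ~ ., family = gaussian(), data = dat)
aic_fit <- stepAIC(full_fit, direction = "backward", k = 2)               # AIC penalty
bic_fit <- stepAIC(full_fit, direction = "backward", k = log(nrow(dat)))  # BIC penalty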

31
Q

Manual Binarization

A

convert factor variables to dummy variables manually before running stepwise selection

pros: the ability to add/drop individual factor levels that are statistically significant/insignificant with respect to the baseline

cons: more steps in stepAIC() procedure; possibly non-intuitive results (e.g., only a few levels of a factor are retained)

32
Q

regularization

A

idea: reduce overfitting by shrinking the size of the coefficient estimates, especially those of non-predictive features

33
Q

how does regularization work?

A

maximize the training log-likelihood (equivalently, minimize the training deviance) adjusted by a penalty term that reflects the size of the coefficients, i.e., minimize
deviance + regularization penalty
this formulation strikes a balance between goodness of fit and model complexity

34
Q

common forms of penalty term

A

1) Lasso - some coef. may be zero
2) ridge regression - none reduced to zero
3) elastic net - some coef. may be zero

35
Q

Two hyperparameters of regularized regression

A

1) Lambda: regularization (a.k.a. shrinkage) parameter
2) alpha: mixing parameter

36
Q

lambda

A

a) controls the amount of regularization (bigger lambda, more shrinkage, less complexity, squared bias increases and variance decreases)
b) feature selection property: for elastic nets with alpha > 0 (lasso in particular) some coefficient estimates become exactly zero when lambda is large enough
c) typically tuned by CV: choose lambda with the smallest CV error

37
Q

Alpha

A

a) controls the mix between ridge (alpha=0) and lasso (alpha=1)
b) provided that lambda is large enough, increasing alpha from 0 to 1 makes more coefficient estimates zero
c) cannot be tuned by cv.glmnet(); need to tune manually
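
A minimal R sketch with the glmnet package (dat and y hypothetical; alpha is fixed manually, lambda is tuned by CV):

library(glmnet)
X <- model.matrix(y ~ ., data = dat)[, -1]    # binarization handled via the model matrix
cv_fit <- cv.glmnet(X, dat$y, family = "gaussian", alpha = 0.5)
cv_fit$lambda.min                             # lambda with the smallest CV error
pred <- predict(cv_fit, newx = X, s = "lambda.min")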

38
Q

Single decision Trees: basics

A

idea: divide the feature space into a set of non-overlapping regions containing relatively homogeneous observations w.r.t target

deliverable: a set of classification rules based on the values/levels of predictors and represented in the form of a “tree”

predictions: observations in the same terminal node share the same predicted mean (for numeric targets) or same predicted class (for categorical targets)

39
Q

recursive binary splitting

A

Greedy: at each step, adopt the split that leads to the greatest reduction in impurity at that point, instead of looking ahead and selecting a split that results in a better future step

top-down: start from the “top” of the tree, go “down” and sequentially partition the feature space in a series of splits

40
Q

node impurity measure properties

A

a) the smaller, the purer the observations in the node
b) Gini index and entropy are similar numerically
c) Gini index and entropy are more sensitive to node impurity than the classification error rate

41
Q

minbucket

A

minimum bucket size

min # of obs. in a terminal node

effect: higher, tree less complex

42
Q

cp

A

complexity parameter

the minimum reduction in the relative training error required for a split to be made

effect: higher, tree less complex

43
Q

maxdepth

A

maximum depth

no. of edges from root node to furthest node

effect: higher, tree more complex
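
A minimal R sketch setting all three control parameters (dat and y hypothetical; the values shown are illustrative):

library(rpart)
tree_fit <- rpart(y ~ ., data = dat, method = "anova",
                  control = rpart.control(minbucket = 5, cp = 0.01, maxdepth = 4))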

44
Q

What can be used to tune cp?

A

CV within rpart()

45
Q

what can be used to tune maxdepth and minbucket?

A

must be tuned by trial and error

46
Q

What can you comment on when interpreting trees?

A

1) number of tree splits
2) split sequence, e.g., start with X1, further split the larger bucket by X2
3) which are the most important predictors (usually those in the early splits)?
4) which terminal nodes have the most observations? any sparse nodes?
5) any prominent interactions?
6) (classification trees) combinations leading to the positive event

47
Q

Cost-complexity pruning rationale

A

to reduce tree complexity by pruning branches from the bottom that do not improve goodness of fit by a sufficient amount -> prevents overfitting and eases interpretation

48
Q

What is the process of cost-complexity pruning?

A

step 1) grow a large tree T0
step 2) minimize the penalized objective function = relative training error + cp × |T|, where |T| (the number of terminal nodes) measures tree complexity

training error: RSS for regression trees; number of misclassifications for classification trees

49
Q

what is the relationship between cp and the complexity of a tree?

A

as cp increases the tree is less complex (smaller)

50
Q

what is the alternative to choosing the cp value with the smallest CV error?

A

One-standard-error (1-SE) rule

how: choose the largest cp (i.e., the smallest tree) whose CV error is within one standard error of the minimum CV error -> a simpler and more interpretable tree with comparable prediction performance
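
A minimal R sketch continuing the hypothetical tree_fit above:

printcp(tree_fit)   # CV error (xerror) and its standard error (xstd) for each cp value
cp_min <- tree_fit$cptable[which.min(tree_fit$cptable[, "xerror"]), "CP"]
pruned <- prune(tree_fit, cp = cp_min)   # or choose the largest cp whose xerror is within 1 SE of the minimum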

51
Q

does transforming the target variable affect GLMs or Trees?

A

GLMs: Yes, the transformations alter the values of the predictors and target variable that go into the likelihood function

Trees: Yes, the transformations can alter the calculations of node impurity measures, e.g., RSS, that define the tree splits

52
Q

does transforming the predictors affect GLMs or Trees?

A

GLMs: Yes, by the same reasoning as for the target variable

Trees: Yes, unless the transformations are monotonic, e.g., log (monotonic transformations will not change the way tree splits are made.)

53
Q

random forests

A

(variance reduction) combine the results of multiple trees fitted to different bootstrapped training samples in parallel -> reduce variance of overall predictions

(randomization) take a random sample of predictors as candidates for each split -> reduce correlation between base trees -> further reduce variance of overall predictions

54
Q

key parameters of random forests

A

1) mtry: # of features sampled as candidates at each split

2) ntree: # of trees to be grown

55
Q

mtry general info

A

a) lower mtry -> greater variance reduction
b) common choice: sqrt(p) (classification) or p/3 (regression)
c) typically tuned by CV

56
Q

ntree general info

A

a) higher ntree, more variance reductions
b) often overfitting does not arise even if set to a large number
c) set to a relatively small value to save run time
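
A minimal R sketch (dat and y hypothetical; p denotes the number of predictors):

library(randomForest)
p <- ncol(dat) - 1
rf_fit <- randomForest(y ~ ., data = dat, mtry = floor(p / 3), ntree = 500)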

57
Q

Boosting

A

in each iteration, fit a tree to the residuals of the preceding tree and subtract a scaled-down version of the current tree’s predictions from the residuals to form the new residuals

each tree focuses on obs.’s the previous tree predicted poorly

58
Q

Key parameters of boosting

A

1) eta: learning rate (or shrinkage) parameter
2) nrounds: max # of rounds in the tree construction process

59
Q

eta general info

A

a) effects of eta: higher eta -> algorithm converges faster but is more prone to overfitting

b) rule of thumb: set to a relatively small value

60
Q

nrounds general info

A

a) effects of nrounds: higher nrounds -> algorithm learns better but is more prone to overfitting

b) rule of thumb: set to a relatively large value
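
A minimal R sketch with the xgboost package (dat and y hypothetical; a small eta paired with a large nrounds):

library(xgboost)
X <- model.matrix(y ~ ., data = dat)[, -1]
dtrain <- xgb.DMatrix(data = X, label = dat$y)
xgb_fit <- xgb.train(params = list(eta = 0.01, max_depth = 6, objective = "reg:squarederror"),
                     data = dtrain, nrounds = 1000)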

61
Q

fitting process for random forest vs boosting

A

random forest: in parallel

boosting: in series (sequential)

62
Q

focus for random forest vs boosting

A

random forest: variance

boosting: bias

63
Q

overfitting for random forest vs boosting

A

random forest: less vulnerable

boosting: more vulnerable

64
Q

hyperparameter for random forest vs boosting

A

random forest: less sensitive

boosting: more sensitive

65
Q

what are the two interpretation tools for ensemble trees?

A

variable importance plots and partial dependence plots

66
Q

Variable importance plots

A

definition of importance scores: the total drop in node impurity (RSS for regression trees and Gini index for classification trees) due to splits over a given predictor, averaged over all base trees

use: to identify important variables (those with a large score)

limitation: unclear how important variables affect the target

67
Q

partial dependence plots

A

definition of partial dependence: model predictions obtained after averaging the values/levels of variables not of interest

use: Plot PD(X1) against various X1 to show the marginal effect of X1 on the target variable

limitations: a) assume predictor of interest is independent of other predictors
b) some predictions may be based on practically unreasonable combinations of predictor values
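
A minimal R sketch for a random forest (continuing the hypothetical rf_fit above; x1 is a hypothetical predictor):

varImpPlot(rf_fit)                                   # variable importance scores
partialPlot(rf_fit, pred.data = dat, x.var = "x1")   # partial dependence of the target on x1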

68
Q

Pros of GLMS

A

1) (target distribution) GLMs excel in accommodating a wide variety of distributions for the target variable
2) (interpretability) the model equation clearly shows how the target mean depends on the features; coefficients = interpretable measure of directional effects of features
3) (implementation) simple to implement

69
Q

Cons of GLMS

A

1) (complex relationships) unable to capture non-monotonic (e.g., polynomial) or non-additive relationships (e.g., interaction), unless additional features are manually incorporated
2) (interpretability) for some link functions (e.g., inverse link), the coefficients may be difficult to interpret

70
Q

Pros of regularized GLMs

A

1) (categorical predictors) via the use of model matrices, binarization of categorical variables is done automatically, and each factor level is treated as a separate feature that can be retained or removed
2) (Tuning) an elastic net can be tuned by CV using the same criterion (e.g., MSE, accuracy) ultimately used to judge the model against unseen test data
3) (variable selection) for elastic nets with alpha > 0, variable selection can be done by making lambda large enough

71
Q

Cons of regularized GLMs

A

1) (categorical predictors) possible to see some non-intuitive or nonsensical results when only a handful of the levels of a categorical predictor are selected
2) (target distribution) limited/restricted model forms allowed by glmnet() (WEAK POINT!)
3) (interpretability) coefficient estimates are more difficult to interpret when variables are standardized (WEAK POINT!)

72
Q

pros of single trees

A

1) (interpretability) if there are not too many buckets, trees are easy to interpret because of the if/then nature of the classification rules and their graphical representation
2) (complex relationships) trees excel in handling non-monotonic and non-additive relationships without the need to insert extra features manually
3) (categorical variables) categorical predictors are automatically handled by separating their levels into two groups without the need for binarization
4) (variable selection) variables are automatically selected as part of the model-building process. Variables that do not appear in the tree are filtered out and the most important variables show up at the top of the tree

73
Q

cons of single trees

A

1) (overfitting) strongly dependent on training data (prone to overfitting) -> predictions unstable with a high variance -> lower user confidence
2) (numeric variables) usually need to split based on a numeric predictor repeatedly to capture its effect effectively -> tree becomes large, difficult to interpret
3) (categorical variables) tend to favor categorical predictors with a large no. of levels

74
Q

pros of ensemble trees

A

1) much more robust and predictive than base trees by combining the results of multiple trees

75
Q

cons of ensemble trees

A

1) Opaque (“black box”), difficult to interpret
2) computationally prohibitive to implement