Section 2: Specific Types of Models Flashcards

1
Q

Independence Assumption for LMs/GLMs

A

Given the predictor values, the observations of the target variable are independent (same for both LMs/GLMs)

2
Q

Target Distribution assumptions for LMs and GLMs

A

LMs: Given the predictor values, the target variable follows a normal distribution
GLMs: Given the predictor values, the target distribution is a member of the linear exponential family

3
Q

Mean assumptions for LMs and GLMs

A

LMs: the target mean directly equals the linear predictor (μ = β0 + β1X1 + … + βpXp)
GLMs: a function (“link”) of the target mean equals the linear predictor (g(μ) = η = β0 + β1X1 + … + βpXp)

4
Q

Variance assumptions for LMs and GLMs

A

LMs: constant, regardless of the predictor values
GLMs: varies with the target mean μ, and hence with the predictor values

5
Q

what is a target distribution?

A

A distribution in the linear exponential family; choose one that aligns with the characteristics of the target

6
Q

Important considerations when choosing a link function

A

1) ensure the predictions match the range of values of the target mean
2) ensure ease of interpretation (e.g., the log link gives multiplicative interpretations of the coefficients)
3) canonical links make convergence of the fitting algorithm more likely

7
Q

Common distributions

A

Normal, binomial, Poisson, gamma, inverse Gaussian, Tweedie

8
Q

Normal distribution variable type and common link

A

real-valued, with a bell-shaped distribution

identity link

9
Q

Binomial variable type and common link

A

Binary (0/1)

logit link

10
Q

Poisson variable type and common link

A

Count (>=0, integers)

Log link
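
A minimal R sketch of a GLM for count data, using toy data and illustrative variable names (age, claims):

# Poisson GLM with its canonical log link (toy data for illustration)
dat <- data.frame(age    = c(25, 30, 35, 40, 45, 50),
                  claims = c(0, 1, 0, 2, 1, 3))
fit <- glm(claims ~ age, data = dat, family = poisson(link = "log"))
summary(fit)        # coefficients are on the log scale
exp(coef(fit))      # multiplicative effects on the target mean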

11
Q

Gamma, inverse Gaussian variable type and common link

A

positive, continuous with right skew

log link

12
Q

Tweedie variable type and common link

A

>= 0, continuous with a large probability mass at zero

log link

13
Q

methods for handling non-monotonic relations

A

GLMs, in their basic form, assume that numeric predictors have a monotonic relationship with the target variable

1) polynomial regression
2) binning
3) piecewise linear functions

14
Q

polynomial regression

A

add polynomial terms to the model equation

pros: can capture more complex relationships between the target variable and predictors; the more polynomial terms included, the more flexible the fit

cons: a) coefficients become harder to interpret (all the polynomial terms of a variable move together) b) usually no clear choice of the highest power; it can be tuned by CV
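
A minimal R sketch of polynomial regression on toy data; the degree of 2 is an arbitrary choice that would normally be tuned by CV:

set.seed(1)
dat <- data.frame(x = runif(100, 0, 10))
dat$y <- 3 + 2 * dat$x - 0.2 * dat$x^2 + rnorm(100)
# poly() adds (orthogonal) polynomial terms of x up to the chosen degree
fit_poly <- lm(y ~ poly(x, 2), data = dat)
summary(fit_poly)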

15
Q

Binning

A

“bin” the numeric variable and convert it into a categorical variable with levels defined as non-overlapping intervals over the range of the original variable

pros: no definite order among the coefficients of the dummy variables corresponding to different bins -> target mean can vary highly irregularly over the bins

cons: a) usually no clear choice of the no. of bins and the associated boundaries b) results in a loss of information (exact values of the numeric predictor gone)
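
A minimal R sketch of binning with cut() on toy data; the four equal-width bins are an arbitrary choice:

set.seed(1)
dat <- data.frame(x = runif(100, 0, 10))
dat$y <- sin(dat$x) + rnorm(100, sd = 0.2)
dat$x_bin <- cut(dat$x, breaks = 4)     # factor with 4 non-overlapping intervals
fit_bin <- lm(y ~ x_bin, data = dat)    # one dummy variable per non-baseline bin
summary(fit_bin)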

16
Q

adding piecewise linear functions

A

add features of the form (X - c)+ = max(X - c, 0), i.e., hinge functions with break point c

pros: a simple way to allow the relationship between a numeric variable and the target mean to vary over different intervals

cons: usually no clear choice of the break points
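
A minimal R sketch on toy data; the break point c = 5 is an arbitrary choice:

set.seed(1)
dat <- data.frame(x = runif(100, 0, 10))
dat$y <- 1 + 0.5 * dat$x + 2 * pmax(dat$x - 5, 0) + rnorm(100, sd = 0.3)
dat$x_hinge <- pmax(dat$x - 5, 0)           # the feature (x - 5)+
fit_pw <- lm(y ~ x + x_hinge, data = dat)   # slope changes by coef(x_hinge) beyond c
summary(fit_pw)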

17
Q

Handling categorical predictors - binarization

A

how it works: a categorical predictor is converted into a collection of dummy (binary) variables, each indicating one and only one level; the dummy variables then serve as predictors in the model equation

18
Q

baseline level

A

the level at which all dummy variables equal 0

R’s default: the alpha-numerically first level
Good practice: reset it to the most common level
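
A minimal R sketch of resetting the baseline level with relevel(); the factor sex and the level "M" are illustrative toy values:

dat <- data.frame(sex = factor(c("F", "M", "M", "M", "F")),
                  y   = c(1.2, 0.8, 1.1, 0.9, 1.4))
levels(dat$sex)                          # "F" is the alphanumerically first level
dat$sex <- relevel(dat$sex, ref = "M")   # reset the baseline to the most common level
fit <- lm(y ~ sex, data = dat)
model.matrix(~ sex, data = dat)          # the dummy variable(s) actually used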

19
Q

interactions

A

need to “manually” include interaction terms of the product form XiXj, so that the effect of Xi on the target mean varies with the value of Xj
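
A minimal R sketch on toy data; in an R formula, x1 * x2 expands to x1 + x2 + x1:x2, where x1:x2 is the interaction term alone:

set.seed(1)
dat <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
dat$y <- 1 + 2 * dat$x1 + 3 * dat$x2 + 1.5 * dat$x1 * dat$x2 + rnorm(100)
fit_int <- lm(y ~ x1 * x2, data = dat)   # includes the product term x1:x2
summary(fit_int)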

20
Q

interpretation of coefficients

A

coefficient estimates capture the effect (magnitude + direction) of features on the target mean

21
Q

p-value statistical significance

A

the smaller the p-value, the more significant the feature

22
Q

Offset: form of target variable and how it affects the mean/var of the target

A

form: aggregate (e.g., total number of claims in a group of similar policyholders)

effect: the target mean is directly proportional to the exposure

23
Q

Weights: form of target variable and how it affects the mean/var of the target

A

form: average (e.g., average number of claims in a group of similar policyholders)

effect: the variance is inversely related to the exposure; observations with a larger exposure play a more important role in model fitting
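
A minimal R sketch contrasting the two approaches on toy grouped data (exposure, tot_claims, avg_claims, region are illustrative names):

dat <- data.frame(exposure   = c(10, 50, 200, 80),
                  tot_claims = c(2, 9, 35, 14),
                  region     = factor(c("A", "B", "A", "B")))
dat$avg_claims <- dat$tot_claims / dat$exposure
# offset: model the aggregate count; the mean is proportional to the exposure
fit_off <- glm(tot_claims ~ region + offset(log(exposure)),
               data = dat, family = poisson)
# weights: model the average count; the variance is inversely related to the exposure
# (R warns about non-integer counts here but still fits the model)
fit_wt <- glm(avg_claims ~ region, data = dat,
              family = poisson, weights = exposure)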

24
Q

stepwise selection

A

sequentially add/drop features, one at a time, until there is no improvement in the selection criterion

25
Forward selection
starts with the intercept-only model and adds variables one at a time until there is no improvement; tends to produce a simpler model than backward selection
26
backward selection
starts with the full model and drops variables one at a time until there is no improvement
27
selection criteria based on penalized likelihood
idea: prevent overfitting by requiring an added/retained feature to improve the model fit by at least a specified amount; common choices of selection criterion are AIC and BIC
28
AIC
AIC = -2l + 2(p + 1), where l is the maximized log-likelihood and p + 1 is the number of estimated parameters; penalty per parameter = 2
29
BIC
BIC = -2l + ln(n) × (p + 1), where n is the number of training observations; penalty per parameter = ln(n)
30
AIC vs BIC
for both, the lower the value, the better; because ln(n) > 2 once n ≥ 8, BIC imposes a heavier penalty, is more conservative, and results in simpler models
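
A minimal R sketch of backward stepwise selection with stepAIC() from the MASS package, on toy data where only x1 is truly predictive; the k argument switches the penalty between AIC and BIC:

library(MASS)
set.seed(1)
dat <- data.frame(x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))
dat$y <- 1 + 2 * dat$x1 + rnorm(100)
full <- lm(y ~ x1 + x2 + x3, data = dat)
step_aic <- stepAIC(full, direction = "backward", k = 2)               # AIC penalty
step_bic <- stepAIC(full, direction = "backward", k = log(nrow(dat)))  # BIC penalty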
31
Manual Binarization
convert factor variables to dummy variables manually before running stepwise selection
pros: able to add/drop individual factor levels that are statistically significant/insignificant with respect to the baseline
cons: more steps in the stepAIC() procedure; possibly non-intuitive results (e.g., only a few levels of a factor are retained)
32
regularization
idea: reduce overfitting by shrinking the size of the coefficient estimates, especially those of non-predictive features
33
how does regularization work?
optimize the training log-likelihood (equivalently, the training deviance) adjusted by a penalty term that reflects the size of the coefficients, i.e., minimize: training deviance + regularization penalty
this formulation strikes a balance between goodness of fit and model complexity
34
common forms of penalty term
1) lasso: some coefficient estimates may be reduced to exactly zero
2) ridge regression: no coefficient estimates are reduced to exactly zero
3) elastic net: a mix of the two; some coefficient estimates may be reduced to exactly zero
35
Two hyperparameters
1) lambda: the regularization (a.k.a. shrinkage) parameter
2) alpha: the mixing parameter
36
lambda
a) controls the amount of regularization: the bigger lambda, the more shrinkage, the less complex the model (squared bias increases, variance decreases)
b) feature selection property: for elastic nets with alpha > 0 (the lasso in particular), some coefficient estimates become exactly zero when lambda is large enough
c) typically tuned by CV: choose the lambda with the smallest CV error
37
Alpha
a) controls the mix between ridge (alpha = 0) and lasso (alpha = 1)
b) provided that lambda is large enough, increasing alpha from 0 to 1 makes more coefficient estimates exactly zero
c) cannot be tuned by cv.glmnet(); needs to be tuned manually
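
A minimal R sketch with the glmnet package on toy data: alpha is fixed manually and lambda is tuned by cv.glmnet():

library(glmnet)
set.seed(1)
X <- matrix(rnorm(100 * 5), ncol = 5)    # 5 predictors; only the first is predictive
y <- 1 + 2 * X[, 1] + rnorm(100)
cv_fit <- cv.glmnet(X, y, family = "gaussian", alpha = 0.5)   # elastic net
cv_fit$lambda.min                  # lambda with the smallest CV error
coef(cv_fit, s = "lambda.min")     # some coefficient estimates may be exactly zero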
38
Single decision Trees: basics
idea: divide the feature space into a set of non-overlapping regions containing relatively homogeneous observations w.r.t. the target
deliverable: a set of classification rules based on the values/levels of the predictors, represented in the form of a “tree”
predictions: observations in the same terminal node share the same predicted mean (for numeric targets) or the same predicted class (for categorical targets)
39
recursive binary splitting
greedy: at each step, adopt the split that leads to the greatest reduction in impurity at that point, instead of looking ahead and selecting a split that results in a better split in a future step
top-down: start from the “top” of the tree and go “down,” sequentially partitioning the feature space in a series of splits
40
node impurity measure properties
a) the smaller the value, the purer the observations in the node
b) Gini index and entropy are similar numerically
c) Gini index and entropy are more sensitive to node impurity than the classification error rate
41
minbucket
minimum bucket size: the minimum # of observations in a terminal node
effect: the higher the value, the less complex the tree
42
cp
complexity parameter: the minimum reduction in the relative training error required for a split to be made
effect: the higher the value, the less complex the tree
43
maxdepth
maximum depth: the number of edges from the root node to the furthest terminal node
effect: the higher the value, the more complex the tree
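
A minimal R sketch of growing a tree with rpart() on toy data; the control values shown are arbitrary illustrations:

library(rpart)
set.seed(1)
dat <- data.frame(x1 = runif(200), x2 = runif(200))
dat$y <- ifelse(dat$x1 > 0.5, 2, 0) + rnorm(200, sd = 0.3)
fit_tree <- rpart(y ~ x1 + x2, data = dat, method = "anova",
                  control = rpart.control(minbucket = 10, cp = 0.01, maxdepth = 4))
fit_tree$cptable     # cross-validated error by cp value, used later for pruning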
44
What can be used to tune cp?
CV within rpart()
45
what can be used to tune maxdepth and minbucket?
must be tuned by trial and error
46
What can you comment on when interpreting trees?
1) number of tree splits
2) split sequence, e.g., start with X1, then further split the larger bucket by X2
3) which are the most important predictors (usually those in the early splits)?
4) which terminal nodes have the most observations? any sparse nodes?
5) any prominent interactions?
6) (classification trees) which combinations of predictor values lead to the positive event?
47
Cost-complexity pruning rationale
to reduce tree complexity by pruning branches from the bottom that do not improve goodness of fit by a sufficient amount -> prevent overfitting and ease interpretation
48
What is the process of cost-complexity pruning?
step 1) grow a large tree T0
step 2) minimize the penalized objective function: relative training error + cp × |T|, where |T| (the number of terminal nodes) measures tree complexity; the training error is the RSS for regression trees and the # of misclassifications for classification trees
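
A minimal R sketch of the pruning step, continuing the fit_tree example grown with rpart() in the control-parameter sketch above:

cp_table <- fit_tree$cptable
best_cp  <- cp_table[which.min(cp_table[, "xerror"]), "CP"]   # cp with the smallest CV error
pruned   <- prune(fit_tree, cp = best_cp)                     # prune back the large tree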
49
what is the relationship between cp and the complexity of a tree?
as cp increases the tree is less complex (smaller)
50
what is the alternative to choosing the cp value with the smallest CV error?
the one-standard-error (1-SE) rule: select the smallest (simpler, more interpretable) tree whose CV error is within one standard error of the minimum, giving comparable prediction performance
51
does transforming the target variable affect GLMs or Trees?
GLMs: yes; the transformations alter the values of the target variable that enter the likelihood function
Trees: yes; the transformations can alter the calculations of the node impurity measures, e.g., RSS, that define the tree splits
52
does transforming the predictors affect GLMs or Trees?
GLMs: yes, by the same reasoning as for the target variable
Trees: yes, unless the transformations are monotonic, e.g., log (monotonic transformations do not change the way tree splits are made)
53
random forests
(variance reduction) combine the results of multiple trees fitted to different bootstrapped training samples in parallel -> reduces the variance of the overall predictions
(randomization) take a random sample of predictors as candidates for each split -> reduces the correlation between the base trees -> further reduces the variance of the overall predictions
54
key parameters of random forests
1) mtry: # of features sampled as candidates at each split
2) ntree: # of trees to be grown
55
mtry general info
a) lower mtry -> greater variance reduction
b) common choice: sqrt(p) (classification) or p/3 (regression)
c) typically tuned by CV
56
ntree general info
a) higher ntree -> more variance reduction
b) overfitting often does not arise even if ntree is set to a large number
c) can be set to a relatively small value to save run time
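
A minimal R sketch with the randomForest package on toy data (mtry = 1 is roughly p/3 here; ntree = 300 is an arbitrary illustration):

library(randomForest)
set.seed(1)
dat <- data.frame(x1 = rnorm(200), x2 = rnorm(200), x3 = rnorm(200))
dat$y <- 2 * dat$x1 + dat$x2^2 + rnorm(200)
rf <- randomForest(y ~ ., data = dat, ntree = 300, mtry = 1, importance = TRUE)
rf       # prints the OOB error and other fit summaries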
57
Boosting
in each iteration, fit a tree to the residuals of the preceding tree and subtract a scaled-down version of the current tree’s predictions from the residuals to form the new residuals
each tree focuses on the observations the previous tree predicted poorly
58
Key parameters of boosting
1) eta: the learning rate (or shrinkage) parameter
2) nrounds: the maximum # of rounds in the tree construction process
59
eta general info
a) effect of eta: higher eta -> the algorithm converges faster but is more prone to overfitting
b) rule of thumb: set to a relatively small value
60
nrounds general info
a) effect of nrounds: higher nrounds -> the algorithm learns better but is more prone to overfitting
b) rule of thumb: set to a relatively large value
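
A minimal R sketch with the xgboost package on toy data, pairing a small eta with a relatively large nrounds (the values are arbitrary illustrations; arguments follow the classic xgb.train() interface):

library(xgboost)
set.seed(1)
X <- matrix(rnorm(200 * 3), ncol = 3)
y <- 2 * X[, 1] + X[, 2]^2 + rnorm(200)
dtrain <- xgb.DMatrix(data = X, label = y)
bst <- xgb.train(params = list(eta = 0.05, max_depth = 3,
                               objective = "reg:squarederror"),
                 data = dtrain, nrounds = 200)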
61
fitting process for random forest vs boosting
random forest: in parallel
boosting: in series (sequential)
62
focus for random forest vs boosting
random forest: variance
boosting: bias
63
overfitting for random forest vs boosting
random forest: less vulnerable
boosting: more vulnerable
64
hyperparameter for random forest vs boosting
random forest: less sensitive
boosting: more sensitive
65
what are the two interpretation tools for ensemble trees?
variable importance plots and partial dependence plots
66
Variable importance plots
definition of importance scores: the total drop in node impurity (RSS for regression trees, Gini index for classification trees) due to splits over a given predictor, averaged over all base trees
use: to identify important variables (those with a large score)
limitation: unclear how the important variables affect the target
67
partial dependence plots
definition of partial dependence: the model prediction obtained after averaging over the values/levels of the variables not of interest
use: plot PD(X1) against a range of X1 values to show the marginal effect of X1 on the target variable
limitations: a) assumes the predictor of interest is independent of the other predictors b) some predictions may be based on practically unreasonable combinations of predictor values
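
A minimal R sketch of both tools, continuing the randomForest example above (rf and dat, fitted with importance = TRUE):

varImpPlot(rf)                                    # variable importance plot
importance(rf)                                    # the importance scores as a table
partialPlot(rf, pred.data = dat, x.var = "x1")    # partial dependence of x1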
68
Pros of GLMS
1) (target distribution) GLMs excel in accommodating a wide variety of distributions for the target variable
2) (interpretability) the model equation clearly shows how the target mean depends on the features; coefficients = interpretable measure of the directional effects of features
3) (implementation) simple to implement
69
Cons of GLMS
1) (complex relationships) unable to capture non-monotonic (e.g., polynomial) or non-additive relationships (e.g., interactions), unless additional features are manually incorporated
2) (interpretability) for some link functions (e.g., the inverse link), the coefficients may be difficult to interpret
70
Pros of regularized GLMs
1) (categorical predictors) via the use of model matrices, binarization of categorical variables is done automatically, and each factor level is treated as a separate feature that can be removed
2) (tuning) an elastic net can be tuned by CV using the same criterion (e.g., MSE, accuracy) ultimately used to judge the model against unseen test data
3) (variable selection) for elastic nets with alpha > 0, variable selection can be done by making lambda large enough
71
Cons of regularized GLMs
1) (categorical predictors) possible to see some non-intuitive or nonsensical results when only a handful of the levels of a categorical predictor are selected
2) (target distribution) limited/restricted model forms allowed by glmnet() (WEAK POINT!)
3) (interpretability) coefficient estimates are more difficult to interpret when the variables are standardized (WEAK POINT!)
72
pros of single trees
1) (interpretability) if there are not too many buckets, trees are easy to interpret because of the if/then nature of the classification rules and their graphical representation
2) (complex relationships) trees excel in handling non-monotonic and non-additive relationships without the need to insert extra features manually
3) (categorical variables) categorical predictors are automatically handled by separating their levels into two groups, without the need for binarization
4) (variable selection) variables are automatically selected as part of the model-building process; variables that do not appear in the tree are filtered out, and the most important variables show up at the top of the tree
73
cons of single trees
1) (overfitting) strongly dependent on the training data (prone to overfitting) -> predictions are unstable with a high variance -> lower user confidence
2) (numeric variables) usually need to split on a numeric predictor repeatedly to capture its effect effectively -> the tree becomes large and difficult to interpret
3) (categorical variables) tend to favor categorical predictors with a large # of levels
74
pros of ensemble trees
1) much more robust and predictive than base trees by combining the results of multiple trees
75
cons of ensemble trees
1) Opaque (“black box”), difficult to interpret
2) computationally prohibitive to implement