Section 2: Specific Types of Models Flashcards
Independence Assumption for LMs/GLMs
Given the predictor values, the observations of the target variable are independent (same for both LMs/GLMs)
Target Distribution assumptions for LMs and GLMs
LMs: Given the predictor values, the target variable follows a normal distribution
GLMs: Given the predictor values, the target distribution is a member of the linear exponential family
Mean assumptions for LMs and GLMs
LMs: the target mean directly equals the linear predictor (mu = B0 + B1X1 + … + BpXp)
GLMs: a function (“link”) of the target mean equals the linear predictor (g(mu) = eta, where eta = B0 + B1X1 + … + BpXp)
Variance assumptions for LMs and GLMs
LM: constant, regardless of the predictor values
GLM: varies with the target mean mu, which in turn depends on the predictor values
what is a target distribution?
The assumed distribution of the target variable (given the predictors); for a GLM it must be a member of the linear exponential family and should be chosen to align with the characteristics of the target
Important considerations when choosing a link function
1) ensure the predictions match the range of values of the target mean
2) ensure ease of interpretation (log link)
3) canonical links make convergence more likely
Common distributions
Normal, Binomial, Poisson, Gamma, inverse gaussian, tweedie
Normal distribution variable type and common link
real-valued with a bell-shaped dist.
identity link
Binomial variable type and common link
Binary (0/1)
logit link
Poisson variable type and common link
Count (>=0, integers)
Log link
Gamma, inverse gaussian variable type and common link
positive, continuous with right skew
log link
tweedie variable type and common link
>= 0, continuous, with a large mass at zero
log link
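A minimal R sketch (not from the source notes) of how the distribution/link pairings above are specified via the family argument of glm(); the data frame dat and its variables are simulated purely for illustration.

```r
# Simulated data for illustration only
set.seed(1)
dat <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
dat$count <- rpois(200, lambda = exp(0.5 + 0.3 * dat$x1))        # count target
dat$flag  <- rbinom(200, size = 1, prob = plogis(-0.5 + dat$x2)) # binary target

glm(count ~ x1 + x2, data = dat, family = poisson(link = "log"))    # Poisson / log
glm(flag  ~ x1 + x2, data = dat, family = binomial(link = "logit")) # binomial / logit
```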
methods for handling non-monotonic relations
GLMs, in their basic form, assume that numeric predictors have a monotonic relationship with the target variable
1) polynomial regression
2) binning
3) piecewise linear functions
polynomial regression
add polynomial terms to the model equation
pros: can capture more complex relationships between the target variable and predictors; the more polynomial terms included, the more flexible the fit
cons: a) coefficients become harder to interpret (all polynomial terms of a variable move together) b) usually no clear choice of the highest power; it can be tuned by CV
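A minimal R sketch, using simulated data and hypothetical variable names, of adding polynomial terms with poly().

```r
set.seed(1)
dat <- data.frame(x = runif(200, 0, 10))
dat$y <- 3 + 2 * dat$x - 0.2 * dat$x^2 + rnorm(200)   # non-monotonic relationship

fit_poly <- lm(y ~ poly(x, degree = 2), data = dat)   # adds (orthogonal) x and x^2 terms
summary(fit_poly)
```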
Binning
“bin” the numeric variable and convert it into a categorical variable with levels defined as non-overlapping intervals over the range of the original variable
pros: no definite order is imposed on the coefficients of the dummy variables corresponding to different bins -> the target mean can vary highly irregularly over the bins
cons: a) usually no clear choice of the no. of bins and the associated boundaries b) results in a loss of information (exact values of the numeric predictor gone)
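A minimal R sketch of binning with cut(); the data are simulated and the break points are arbitrary choices for illustration.

```r
set.seed(1)
dat <- data.frame(x = runif(200, 0, 10))
dat$y <- sin(dat$x) + rnorm(200, sd = 0.3)

dat$x_bin <- cut(dat$x, breaks = c(0, 2.5, 5, 7.5, 10), include.lowest = TRUE)
fit_bin <- lm(y ~ x_bin, data = dat)   # one dummy coefficient per non-baseline bin
summary(fit_bin)
```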
adding piecewise linear functions
add hinge features of the form (X - c)+ = max(X - c, 0), where c is a chosen break point
pros: a simple way to allow the relationship between a numeric variable and the target mean to vary over different intervals
cons: usually no clear choice of the break points
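A minimal R sketch of a piecewise linear (hinge) feature (X - c)+ on simulated data, with a hypothetical break point c = 5.

```r
set.seed(1)
dat <- data.frame(x = runif(200, 0, 10))
dat$y <- 1 + 0.5 * dat$x + 2 * pmax(dat$x - 5, 0) + rnorm(200)  # slope changes at x = 5

fit_pw <- lm(y ~ x + pmax(x - 5, 0), data = dat)  # (x - 5)+ hinge feature
summary(fit_pw)
```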
Handling categorical predictors - binarization
how it works: a categorical predictor is converted into a collection of dummy (binary) variables, each indicating one and only one level; these dummy variables then serve as predictors in the model equation
baseline level
the level at which all dummy variables equal 0
R’s default: the alpha-numerically first level
Good practice: reset it to the most common level
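A minimal R sketch, with a simulated factor, of resetting the baseline to the most common level via relevel() before fitting.

```r
set.seed(1)
dat <- data.frame(
  region = factor(sample(c("rural", "suburban", "urban"), 300, replace = TRUE,
                         prob = c(0.2, 0.3, 0.5))),
  y = rnorm(300)
)
# Reset the baseline to the most common level
dat$region <- relevel(dat$region, ref = names(which.max(table(dat$region))))
lm(y ~ region, data = dat)   # R creates dummies for every level except the baseline
```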
interactions
need to “manually” include interaction terms of the product form XiXj, so that the effect of Xi on the target varies with the value of Xj
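A minimal R sketch with simulated predictors: in a model formula, x1 * x2 expands to x1 + x2 + x1:x2, where x1:x2 is the product-form interaction term.

```r
set.seed(1)
dat <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
dat$y <- 1 + dat$x1 + 2 * dat$x2 + 1.5 * dat$x1 * dat$x2 + rnorm(200)

fit_int <- lm(y ~ x1 * x2, data = dat)   # the effect of x1 now varies with x2
summary(fit_int)
```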
interpretation of coefficients
coefficient estimates capture the effect (magnitude + direction) of features on the target mean
p-value statistical significance
the smaller the p-value, the more significant the feature
Offset: form of target variable and how they affect the mean/var of the target
form: aggregate (e.g., total number of claims in a group of similar policyholders)
effect: the target mean is directly proportional to the exposure
Weights: form of target variable and how they affect the mean/var of the target
form: average (e.g., average number of claims in a group of similar policyholders)
effect: the variance is inversely related to the exposure; observations with a larger exposure play a more important role in model fitting
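A minimal R sketch contrasting offsets and weights in a Poisson GLM; the grouped policy data and variable names are simulated/hypothetical. (R warns about non-integer counts when the average frequency is the target; quasipoisson() avoids the warning.)

```r
set.seed(1)
policies <- data.frame(exposure = rpois(100, 50) + 1, x = rnorm(100))
policies$claims <- rpois(100, lambda = policies$exposure * exp(-2 + 0.3 * policies$x))
policies$freq   <- policies$claims / policies$exposure

# Offset: target = aggregate claim count; mean proportional to exposure
glm(claims ~ x, data = policies, family = poisson(link = "log"),
    offset = log(exposure))

# Weights: target = average claim frequency; variance inversely related to exposure
glm(freq ~ x, data = policies, family = poisson(link = "log"),
    weights = exposure)
```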
stepwise selection
sequentially add/drop features, one at a time, until there is no improvement in the selection criterion
Forward selection
start with intercept-only model, add variables until no improvement in model
tends to produce a simpler model
backward selection
starts with full model, drop variables until no improvement
selection criteria based on penalized likelihood
idea: prevent overfitting by requiring an included/retained feature to improve model fit by at least a specified amount
common choices are AIC and BIC
AIC
AIC = -2l + 2(p+1)
penalty per parameter = 2
BIC
BIC = -2l + ln(n)*(p+1)
penalty per parameter = ln(n)
AIC vs BIC
for both, the lower the value, the better
BIC is more conservative and results in simpler models
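A minimal R sketch of stepwise selection with MASS::stepAIC() on simulated data; setting k = 2 uses the AIC penalty and k = log(n) the BIC penalty.

```r
library(MASS)
set.seed(1)
dat <- data.frame(x1 = rnorm(200), x2 = rnorm(200), x3 = rnorm(200))
dat$y <- 1 + 2 * dat$x1 + rnorm(200)   # x2 and x3 are pure noise

full <- glm(y ~ x1 + x2 + x3, data = dat, family = gaussian())
step_aic <- stepAIC(full, direction = "backward", k = 2)               # AIC penalty
step_bic <- stepAIC(full, direction = "backward", k = log(nrow(dat)))  # BIC penalty
```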
Manual Binarization
convert factor variables to dummy variables manually before running stepwise selection
pros: makes it possible to add/drop individual factor levels that are statistically significant/insignificant relative to the baseline level
cons: more steps in stepAIC() procedure; possibly non-intuitive results (e.g., only a few levels of a factor are retained)
regularization
idea: reduce overfitting by shrinking the size of the coefficient estimates, especially those of non-predictive features
how does regularization work?
to optimize training log-likelihood (equivalently, training deviance) adjusted by a penalty term that reflects the size of the coefficients, i.e., to minimize
deviance + regularization penalty
the formulation serves to strike a balance between goodness of fit and model complexity
common forms of penalty term
1) Lasso - some coef. may be reduced to exactly zero
2) ridge regression - coef. shrunk, but none reduced to exactly zero
3) elastic net - a mix of the two; some coef. may be reduced to exactly zero
Two hyperparameters
1) Lambda: regularization (a.k.a. shrinkage) parameter
2) alpha: mixing parameter
lambda
a) controls the amount of regularization (bigger lambda, more shrinkage, less complexity, squared bias increases and variance decreases)
b) feature selection property: for elastic nets with alpha > 0 (lasso in particular) some coefficient estimates become exactly zero when lambda is large enough
c) typically tuned by CV: choose lambda with the smallest CV error
Alpha
a) controls the mix between ridge (alpha=0) and lasso (alpha=1)
b) provided that lambda is large enough, increasing alpha from 0 to 1 makes more coefficient estimates zero
c) cannot be tuned by cv.glmnet(); need to tune manually
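A minimal R sketch of an elastic net with the glmnet package on simulated data: cv.glmnet() tunes lambda by CV, while alpha is fixed per call and must be tuned manually (e.g., over a grid).

```r
library(glmnet)
set.seed(1)
X <- matrix(rnorm(200 * 5), ncol = 5)
y <- 1 + 2 * X[, 1] + rnorm(200)       # only the first column is predictive

cv_fit <- cv.glmnet(X, y, family = "gaussian", alpha = 0.5)  # CV over lambda only
coef(cv_fit, s = "lambda.min")  # with alpha > 0, some coefficients may be exactly 0
```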
Single decision Trees: basics
idea: divide the feature space into a set of non-overlapping regions containing relatively homogeneous observations w.r.t target
deliverable: a set of classification rules based on the values/levels of predictors and represented in the form of a “tree”
predictions: observations in the same terminal node share the same predicted mean (for numeric targets) or same predicted class (for categorical targets)
recursive binary splitting
Greedy: at each step, adopt the split that leads to the greatest reduction in impurity at that point, instead of looking ahead and selecting a split that results in a better future step
top-down: start from the “top” of the tree, go “down” and sequentially partition the feature space in a series of splits
node impurity measure properties
a) the smaller, the purer the observations in the node
b) gini index and entropy are similar numerically
c) gini index and entropy are more sensitive to node impurity than classification error rate
minbucket
minimum bucket size
min # of obs. in a terminal node
effect: higher, tree less complex
cp
complexity parameter
min improvement in goodness of fit (as a fraction of the root-node error) required for a split to be attempted; also serves as the per-split penalty in cost-complexity pruning
effect: higher, tree less complex
maxdepth
maximum depth
no. of edges from root node to furthest node
effect: higher, tree more complex
What can be used to tune cp?
the cross-validation built into rpart() (the xerror column of the cptable)
what can be used to tune maxdepth and minbucket?
must be tuned by trial and error
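A minimal R sketch of a single regression tree with the rpart package on simulated data, showing where minbucket, cp and maxdepth enter via rpart.control(); the cptable holds the CV results used to tune cp.

```r
library(rpart)
set.seed(1)
dat <- data.frame(x1 = rnorm(300), x2 = rnorm(300))
dat$y <- ifelse(dat$x1 > 0 & dat$x2 > 0, 5, 0) + rnorm(300)   # interaction-type signal

tree0 <- rpart(y ~ x1 + x2, data = dat, method = "anova",
               control = rpart.control(minbucket = 5, cp = 0.001, maxdepth = 10))
tree0$cptable   # CV error (xerror) by cp value, used to tune cp
```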
What can you comment on when interpreting trees?
1) number of tree splits
2) split sequence, e.g., start with X1, further split the larger bucket by X2
3) which are the most important predictors (usually those in the early splits)?
4) which terminal nodes have the most observations? any sparse nodes?
5) any prominent interactions?
6) (classification trees) combinations leading to the positive event
Cost-complexity pruning rationale
to reduce tree complexity by pruning branches from the bottom that do not improve goodness of fit by a sufficient amount -> prevents overfitting and eases interpretation
What is the process of cost-complexity pruning?
step 1) grow a large tree T0
step 2) prune back by minimizing the penalized objective function = relative training error + cp x |T|, where |T| = number of terminal nodes (tree complexity)
training error: RSS for regression trees; number of misclassifications for classification trees
what is the relationship between cp and the complexity of a tree?
as cp increases the tree is less complex (smaller)
what is the alternative to cost-complexity pruning?
One-standard-error (1-SE) rule
how: select the smallest (simpler, more interpretable) tree whose CV error is within one standard error of the minimum CV error -> comparable prediction performance
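A minimal R sketch of cost-complexity pruning via rpart's cptable on simulated data, selecting cp either at the smallest CV error or by the 1-SE rule; the data and thresholds are illustrative.

```r
library(rpart)
set.seed(1)
dat <- data.frame(x1 = rnorm(300), x2 = rnorm(300))
dat$y <- ifelse(dat$x1 > 0 & dat$x2 > 0, 5, 0) + rnorm(300)

tree0 <- rpart(y ~ x1 + x2, data = dat, method = "anova",
               control = rpart.control(cp = 0.001))          # step 1: grow a large tree
cpt <- tree0$cptable

cp_min <- cpt[which.min(cpt[, "xerror"]), "CP"]               # cp with smallest CV error
pruned_min <- prune(tree0, cp = cp_min)                       # step 2: prune back

# 1-SE rule: largest cp whose CV error is within one SE of the minimum
thresh <- min(cpt[, "xerror"]) + cpt[which.min(cpt[, "xerror"]), "xstd"]
cp_1se <- max(cpt[cpt[, "xerror"] <= thresh, "CP"])
pruned_1se <- prune(tree0, cp = cp_1se)
```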
does transforming the target variable affect GLMs or Trees?
GLMs: Yes, the transformations alter the values of the predictors and target variable that go into the likelihood function
Trees: Yes, the transformations can alter the calculations of node impurity measures, e.g., RSS, that define the tree splits
does transforming the predictors affect GLMs or Trees?
GLMs: Yes, same reasoning as for transforming the target variable
Trees: Yes, unless the transformations are monotonic, e.g., log (monotonic transformations do not change the way tree splits are made)
random forests
(variance reduction) combine the results of multiple trees fitted to different bootstrapped training samples in parallel -> reduce variance of overall predictions
(randomization) take a random sample of predictors as candidates for each split -> reduce correlation between base trees -> further reduce variance of overall predictions
key parameters of random forests
1) mtry: # of features sampled as candidates at each split
2) ntree: # of trees to be grown
mtry general info
a) lower mtry -> greater variance reduction
b) common choice: sqrt(p) (classification) or p/3 (regression)
c) typically tuned by CV
ntree general info
a) higher ntree -> more variance reduction
b) overfitting often does not arise even if ntree is set to a large number
c) can be set to a relatively small value to save run time
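A minimal R sketch of a random forest with the randomForest package on simulated data; mtry is set near p/3 for regression and ntree to a moderate value.

```r
library(randomForest)
set.seed(1)
dat <- data.frame(x1 = rnorm(300), x2 = rnorm(300), x3 = rnorm(300))
dat$y <- dat$x1 * dat$x2 + rnorm(300)

rf <- randomForest(y ~ ., data = dat,
                   mtry = 1,            # ~ p/3 candidates per split (p = 3 here)
                   ntree = 500,         # large ntree rarely overfits, just costs time
                   importance = TRUE)
rf
```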
Boosting
in each iteration, fit a tree to the residuals of the preceding tree and subtract a scaled-down version of the current tree’s predictions from the residuals to form the new residuals
each tree focuses on the obs. that the previous tree predicted poorly
Key parameters of boosting
1) eta: learning rate (or shrinkage) parameter
2) nrounds: max # of rounds in the tree construction process
eta general info
a) effects of eta: higher eta -> algorithm converges faster but is more prone to overfitting
b) rule of thumb: set to a relatively small value
nrounds general info
a) effects of nrounds: higher nrounds -> algorithm learns better but is more prone to overfitting
b) rule of thumb: set to a relatively large value
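A minimal R sketch of boosting with the xgboost package on simulated data, using a small eta and a relatively large nrounds as suggested above; the objective and depth are illustrative choices.

```r
library(xgboost)
set.seed(1)
X <- matrix(rnorm(300 * 3), ncol = 3)
y <- X[, 1] * X[, 2] + rnorm(300)
dtrain <- xgb.DMatrix(data = X, label = y)

bst <- xgb.train(params = list(eta = 0.05,                 # small learning rate
                               max_depth = 3,
                               objective = "reg:squarederror"),
                 data = dtrain,
                 nrounds = 500)                            # many boosting rounds
```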
fitting process for random forest vs boosting
random forest: in parallel
boosting: in series (sequential)
focus for random forest vs boosting
random forest: variance
boosting: bias
overfitting for random forest vs boosting
random forest: less vulnerable
boosting: more vulnerable
hyperparameter for random forest vs boosting
random forest: less sensitive
boosting: more sensitive
what are the two interpretation tools for ensemble trees?
variable importance plots and partial dependence plots
Variable importance plots
definition of importance scores: the total drop in node impurity (RSS for regression trees and Gini index for classification trees) due to splits over a given predictor, averaged over all base trees
use: to identify important variables (those with a large score)
limitation: unclear how important variables affect the target
partial dependence plots
definition of partial dependence: model predictions for the variable of interest, obtained after averaging over the values/levels of the variables not of interest
use: plot PD(X1) against a range of X1 values to show the marginal effect of X1 on the target variable
limitations: a) assumes the predictor of interest is independent of the other predictors
b) some predictions may be based on practically unreasonable combinations of predictor values
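A minimal R sketch of both interpretation tools for a random forest, using varImpPlot() from randomForest and partial() from the pdp package on simulated data.

```r
library(randomForest)
library(pdp)
set.seed(1)
dat <- data.frame(x1 = rnorm(300), x2 = rnorm(300), x3 = rnorm(300))
dat$y <- sin(dat$x1) + dat$x2 + rnorm(300, sd = 0.3)

rf <- randomForest(y ~ ., data = dat, importance = TRUE)
varImpPlot(rf)                                    # variable importance scores
pd <- partial(rf, pred.var = "x1", train = dat)   # partial dependence of y on x1
plotPartial(pd)
```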
Pros of GLMs
1) (target distribution) GLMs excel in accommodating a wide variety of distributions for the target variable
2) (interpretability) the model equation clearly shows how the target mean depends on the features; coefficients = interpretable measure of directional effects of features
3) (implementation) simple to implement
Cons of GLMs
1) (complex relationships) unable to capture non-monotonic (e.g., polynomial) or non-additive relationships (e.g., interaction), unless additional features are manually incorporated
2) (interpretability) for some link functions (e.g., inverse link), the coefficients may be difficult to interpret
Pros of regularized GLMs
1) (categorical predictors) via the use of model matrices, binarization of categorical variables is done automatically, and each factor level is treated as a separate feature that can be removed (or retained) individually
2) (Tuning) an elastic net can be tuned by CV using the same criterion (e.g., MSE, accuracy) ultimately used to judge the model against unseen test data
3) (variable selection) for elastic nets with alpha > 0, variable selection can be done by making lambda large enough
Cons of regularized GLMs
1) (categorical predictors) possible to see some non-intuitive or nonsensical results when only a handful of the levels of a categorical predictor are selected
2) (target distribution) limited/restricted model forms allowed by glmnet() (WEAK POINT!)
3) (interpretability) coefficient estimates are more difficult to interpret when variables are standardized (WEAK POINT!)
pros of single trees
1) (interpretability) if there are not too many buckets, trees are easy to interpret because of the if/then nature of the classification rules and their graphical representation
2) (complex relationships) trees excel in handling non-monotonic and non-additive relationships without the need to insert extra features manually
3) (categorical variables) categorical predictors are automatically handled by separating their levels into two groups without the need for binarization
4) (variable selection) variables are automatically selected as part of the model-building process. Variables that do not appear in the tree are filtered out and the most important variables show up at the top of the tree
cons of single trees
1) (overfitting) strongly dependent on training data (prone to overfitting) -> predictions unstable with a high variance -> lower user confidence
2) (numeric variables) usually need to split based on a numeric predictor repeatedly to capture its effect effectively -> tree becomes large, difficult to interpret
3) (categorical variables) tend to favor categorical predictors with a large no. of levels
pros of ensemble trees
1) much more robust and predictive than base trees by combining the results of multiple trees
cons of ensemble trees
1) Opaque (“black box”), difficult to interpret
2) computationally prohibitive to implement