Section 2: Specific Types of Models Flashcards
Independence Assumption for LMs/GLMs
Given the predictor values, the observations of the target variable are independent (same for both LMs/GLMs)
Target Distribution assumptions for LMs and GLMs
LMs: Given the predictor values, the target variable follows a normal distribution
GLMs: Given the predictor values, the target distribution is a member of the linear exponential family
Mean assumptions for LMs and GLMs
LMs: the target mean directly equals the linear predictor (mu = B0 + B1X1 + … + BpXp)
GLMs: a function ("link") of the target mean equals the linear predictor (g(mu) = eta, where eta = B0 + B1X1 + … + BpXp)
Variance assumptions for LMs and GLMs
LM: constant, regardless of the predictor values
GLM: varies with mu (and hence with the predictor values) through the variance function of the chosen distribution
what is a target distribution?
A distribution in the linear exponential family; choose one that aligns with the characteristics of the target
Important considerations when choosing a link function
1) ensure the predictions match the range of values of the target mean
2) ensure ease of interpretation (e.g., the log link makes coefficient effects multiplicative)
3) canonical links make convergence more likely
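Illustration (a minimal R sketch; the data frame dat, target y, and predictor x are hypothetical):
  fit.log <- glm(y ~ x, data = dat, family = Gamma(link = "log"))      # log link: multiplicative, easy to interpret
  fit.can <- glm(y ~ x, data = dat, family = Gamma(link = "inverse"))  # canonical link for the gamma distribution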
Common distributions
Normal, binomial, Poisson, gamma, inverse Gaussian, Tweedie
Normal distribution variable type and common link
real-valued with a bell-shaped dist.
identity link
Binomial variable type and common link
Binary (0/1)
logit link
Poisson variable type and common link
Count (>=0, integers)
Log link
Gamma, inverse Gaussian variable type and common link
positive, continuous with right skew
log link
Tweedie variable type and common link
non-negative (>= 0), continuous with a large probability mass at zero
log link
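Illustration (a minimal R sketch; the column names are hypothetical) of matching the family to the target:
  glm(claim_count ~ age, data = dat, family = poisson(link = "log"))   # count target
  glm(lapse ~ age, data = dat, family = binomial(link = "logit"))      # binary target
(The Tweedie family is not among base R's glm families and is not shown here.)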
methods for handling non-monotonic relations
GLMs, in their basic form, assume that numeric predictors have a monotonic relationship with the target variable
1) polynomial regression
2) binning
3) piecewise linear functions
polynomial regression
add polynomial terms to the model equation
pros: can capture more complex relationships between the target variable and predictors; the more polynomial terms included, the more flexible the fit
cons: a) coefficients become harder to interpret (all polynomial terms move together) b) usually no clear choice of the highest power; it can be tuned by CV
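Illustration (a minimal R sketch; dat, y, and x are hypothetical):
  glm(y ~ poly(x, 3), data = dat, family = Gamma(link = "log"))  # degree-3 polynomial in x; the degree can be tuned by CV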
Binning
“bin” the numeric variable and convert it into a categorical variable with levels defined as non-overlapping intervals over the range of the original variable
pros: no definite order is imposed on the coefficients of the dummy variables corresponding to different bins -> the target mean can vary highly irregularly over the bins
cons: a) usually no clear choice of the number of bins and the associated boundaries b) results in a loss of information (the exact values of the numeric predictor are lost)
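Illustration (a minimal R sketch; the break points are hypothetical):
  dat$x.bin <- cut(dat$x, breaks = c(0, 25, 50, 75, 100))    # convert x into a factor with 4 non-overlapping bins
  glm(y ~ x.bin, data = dat, family = Gamma(link = "log"))   # each non-baseline bin gets its own dummy variable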
adding piecewise linear functions
add features of the form (X - c)+ = max(X - c, 0), where c is a break point
pros: a simple way to allow the relationship between a numeric variable and the target mean to vary over different intervals
cons: usually no clear choice of the break points
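Illustration (a minimal R sketch; the break point c = 50 is hypothetical):
  dat$x.hinge <- pmax(dat$x - 50, 0)                              # the (X - c)+ feature
  glm(y ~ x + x.hinge, data = dat, family = Gamma(link = "log"))  # slope is allowed to change at x = 50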
Handling categorical predictors - binarization
how it works: a categorical predictor becomes a collection of dummy (binary) variables, each indicating one and only one level, and the dummy variables serve as predictors in the model equation
baseline level
the level at which all dummy variables equal 0
R’s default: the alphanumerically first level
Good practice: reset it to the most common level
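Illustration (a minimal R sketch; the variable region and the level "Urban" are hypothetical):
  dat$region <- relevel(factor(dat$region), ref = "Urban")   # reset the baseline to the most common level
  glm(y ~ region, data = dat, family = Gamma(link = "log"))  # dummies are created for every non-baseline level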
interactions
need to “manually” include interaction terms of the product form XiXj, so that the effect of Xi on the target mean varies with the value of Xj
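Illustration (a minimal R sketch; x1 and x2 are hypothetical predictors):
  glm(y ~ x1 * x2, data = dat, family = Gamma(link = "log"))  # x1 * x2 expands to x1 + x2 + x1:x2 (the interaction term)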
interpretation of coefficients
coefficient estimates capture the effect (magnitude + direction) of features on the target mean
p-value statistical significance
the smaller the p-value, the more significant the feature
Offset: form of target variable and how they affect the mean/var of the target
form: aggregate (e.g., total number of claims in a group of similar policyholders)
effect: the target mean is directly proportional to the exposure
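Illustration (a minimal R sketch; total_claims and exposure are hypothetical column names):
  glm(total_claims ~ age, data = dat, family = poisson(link = "log"),
      offset = log(exposure))   # log link, so the offset enters on the log scale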
Weights: form of target variable and how they affect the mean/var of the target
form: average (e.g., average number of claims in a group of similar policyholders)
effect: the variance is inversely related to the exposure; observations with a larger exposure play a more important role in model fitting
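Illustration (a minimal R sketch; avg_claims and n_policies are hypothetical column names):
  glm(avg_claims ~ age, data = dat, family = poisson(link = "log"),
      weights = n_policies)   # groups with more policies carry more weight in the fit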
stepwise selection
sequentially add/drop features, one at a time, until there is no improvement in the selection criterion
Forward selection
start with the intercept-only model and add variables one at a time until there is no further improvement in the selection criterion
tends to produce a simpler model than backward selection
backward selection
start with the full model and drop variables one at a time until there is no further improvement in the selection criterion
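Illustration (a minimal R sketch; the full model formula is hypothetical):
  null.model <- glm(y ~ 1, data = dat, family = Gamma(link = "log"))
  full.model <- glm(y ~ x1 + x2 + x3, data = dat, family = Gamma(link = "log"))
  step(null.model, scope = formula(full.model), direction = "forward")  # forward selection by AIC
  step(full.model, direction = "backward")                              # backward selection by AIC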
selection criteria based on penalized likelihood
idea: prevent overfitting by requiring an included/retained feature to improve model fit by at least a specified amount
common choices are AIC and BIC
AIC
AIC = -2l + 2(p + 1), where l is the maximized loglikelihood and p + 1 is the number of estimated parameters (including the intercept)
penalty per parameter = 2
BIC
BIC = -2l + ln(n)*(p + 1), where n is the number of training observations
penalty per parameter = ln(n)
AIC vs BIC
for both, the lower the value, the better
BIC is more conservative and results in simpler models (ln(n) > 2 once n >= 8, so each added parameter is penalized more heavily)
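Illustration (a minimal R sketch; fit and full.model are hypothetical fitted glm objects):
  AIC(fit)   # penalty of 2 per parameter
  BIC(fit)   # penalty of ln(n) per parameter
  step(full.model, direction = "backward", k = log(nrow(dat)))  # stepwise selection using BIC instead of AIC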