GLM Flashcards
Describe the 2 components of a GLM
- Random component
Captures portion of variation driven by causes other than predictors in model (including pure randomness)
- Systematic component
Portion of variation in a model that can be explained using predictor variables
Our goal in modelling with GLM is to shift as much of the variability as possible away from random component into systematic component.
Identify the variance function for the following distributions:
1. Normal
2. Poisson
3. Gamma
4. Inv Gaussian
5. NB
6. Binomial
7. Tweedie
- Normal: 1
- Poisson: u
- Gamma: u^2
- Inv Gaussian: u^3
- NB: u(1+ku)
- Binomial: u(1-u)
- Tweedie: u^p
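These can be checked numerically; a minimal sketch using Python's statsmodels (the alpha and var_power values are arbitrary illustrations):

```python
import numpy as np
import statsmodels.api as sm

mu = np.array([0.5, 1.0, 2.0])   # candidate means
p = np.array([0.1, 0.5, 0.9])    # binomial means must lie in (0, 1)

# Each statsmodels family exposes its variance function V(u) as .variance
print(sm.families.Gaussian().variance(mu))                   # 1 (constant)
print(sm.families.Poisson().variance(mu))                    # u
print(sm.families.Gamma().variance(mu))                      # u^2
print(sm.families.InverseGaussian().variance(mu))            # u^3
print(sm.families.NegativeBinomial(alpha=0.5).variance(mu))  # u(1 + 0.5u)
print(sm.families.Binomial().variance(p))                    # u(1 - u)
print(sm.families.Tweedie(var_power=1.5).variance(mu))       # u^1.5
```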
Define the 2 most common link functions
- Log: g(u) = ln(u)
Used for rating
- Logit: g(u) = ln(u/(1-u))
Used for binary target (0,1)
List 3 advantages of log link function
Log link function transforms the linear predictor into a multiplicative structure: u = exp(b0 + b1x1 +…)
Which has 3 advantages:
1. Simple and practical to implement
2. Avoids negative premiums that could arise if additive structure
3. Impact of risk characteristics is more intuitive
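A minimal sketch of the multiplicative structure using Python's statsmodels on made-up severity data (variable names and values are hypothetical); exponentiating the coefficients yields multiplicative relativities:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Toy severity data with one rating variable (hypothetical)
df = pd.DataFrame({
    "sev":  [1000, 1200, 2500, 2300, 900, 2600],
    "terr": ["A", "A", "B", "B", "A", "B"],
})

# Gamma GLM with log link: ln(u) = b0 + b1 * (terr == B)
res = smf.glm("sev ~ terr", data=df,
              family=sm.families.Gamma(link=sm.families.links.Log())).fit()

# u = exp(b0) * exp(b1)^(terr == B): the exponentiated coefficients
# are the base level and the territory relativity
print(np.exp(res.params))
```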
List 2 uses of offset terms
Allows you to incorporate pre-determined values for certain variables into model so GLM takes them as given.
2 uses:
1. Deductible factors are often developed outside of model
ln(u) = b0 + b1x1 +…+ ln(Rel(1-LER))
2. Dependent variable varies directly with a particular measure (e.g. exposure in car-years)
ln(u) = b0 + b1x1 +…+ ln(exposure)
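A minimal sketch of the exposure offset in Python's statsmodels (data are made up); the log of exposure enters the linear predictor with its coefficient taken as given:

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical claim counts and exposures (car-years)
counts = np.array([0, 1, 2, 0, 3])
exposure = np.array([0.5, 1.0, 2.0, 0.8, 2.5])
X = sm.add_constant(np.array([0.0, 1.0, 0.0, 1.0, 1.0]))  # one dummy predictor

# ln(u) = b0 + b1*x1 + ln(exposure); the offset is not estimated
res = sm.GLM(counts, X, family=sm.families.Poisson(),
             offset=np.log(exposure)).fit()
print(res.params)
```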
Describe the steps to calculate offset
- Calculate the unbiased factor: 1 - LER
- Rebase factor: Rel = Factor(i) / Factor(base)
- Offset = g(rebased factor), i.e. ln(rebased factor) under a log link
- Include fixed offsets before running GLM so that all estimated coefficients for other predictors are optimal in their presence
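A worked numeric sketch of these steps (the LER values and the 500 base deductible are hypothetical):

```python
import numpy as np

# Hypothetical LERs by deductible; base level = 500 deductible
ler = {"250": 0.05, "500": 0.10, "1000": 0.20}

factor = {k: 1 - v for k, v in ler.items()}                  # step 1: 1 - LER
rebased = {k: v / factor["500"] for k, v in factor.items()}  # step 2: rebase
offset = {k: np.log(v) for k, v in rebased.items()}          # step 3: apply g = ln

print(offset)  # {'250': 0.054, '500': 0.0, '1000': -0.118} (rounded)
```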
Describe 3 methods to assess variable significance
- Standard error
Estimated std dev of the random process underlying the coefficient estimate
Small value indicates estimate is expected to be relatively close to true value.
Large value indicates that a wide range of estimates can be achieved through randomness.
- P-value
Probability of a coefficient at least as extreme as the estimate arising by pure chance.
H0: Beta(i) = 0
H1: Beta(i) different from 0
Small value indicates we have a small chance of observing the coefficient randomly.
- Confidence interval
Gives a range of possible values for a coefficient that would not be rejected at a given p-threshold
95% CI would be based on a 5% p-value
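All three diagnostics can be read off a fitted model; a minimal sketch with Python's statsmodels on simulated data:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(200, 2)))
y = rng.poisson(np.exp(X @ np.array([0.1, 0.5, 0.0])))  # third coef truly 0

res = sm.GLM(y, X, family=sm.families.Poisson()).fit()
print(res.bse)                    # standard errors
print(res.pvalues)                # p-values for H0: Beta(i) = 0
print(res.conf_int(alpha=0.05))   # 95% CIs (5% p-threshold)
```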
Describe 2 distributions appropriate for severity modelling and their 5 desired characteristics
Gamma and Inverse Gaussian
1. Right-skewed
2. Lower bound at 0
3. Sharp peaked (inv gauss > gamma)
4. Wide tail (inv gauss > gamma)
5. Larger claims have more variance (u^2, u^3)
Describe 2 distributions appropriate for frequency modeling
- Poisson
Dispersion parameter adds flexibility (allows var > mean)
Poisson and ODP will produce same coefficients but model diagnostics will change (understated variance distorts std errors and p-values)
- Negative Binomial
Poisson whose mean follows a gamma distribution
Describe 2 characteristics that a frequency error distribution should have
- Non-negative
- Multiplicative relationship fits frequency better than additive relationship
Describe which distribution is appropriate for pure premium / LR modeling and give 3 reasons/desired characteristics
Tweedie:
1. Mass point at zero (lots of insured have no claims)
2. Right-skewed
3. Power parameter allows some other distributions to be special cases (p=0: Normal, p=1: Poisson, p=2: Gamma, p=3: Inv Gauss)
What happens when the power parameter of the Tweedie is between 1 and 2
Compound poisson freq & gamma sev
Smoother curve with no apparent spike
Implicit assumption that freq & sev move in same direction (often not realistic but robust enough)
Calculate mean of Tweedie
lambda * alpha * theta (lambda = Poisson mean, alpha = gamma shape, theta = gamma scale)
Calculate power parameter of Tweedie
p = (alpha + 2) / (alpha + 1)
Calculate dispersion parameter of Tweedie
lambda^(1-p) * (alpha * theta)^(2-p) / (2-p)
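A worked sketch of these three formulas (the lambda, alpha, theta values are arbitrary illustrations):

```python
# Hypothetical compound Poisson-gamma parameters
lam, alpha, theta = 0.2, 2.0, 500.0  # Poisson mean, gamma shape, gamma scale

p = (alpha + 2) / (alpha + 1)                                # power: 4/3
mean = lam * alpha * theta                                   # Tweedie mean: 200.0
phi = lam ** (1 - p) * (alpha * theta) ** (2 - p) / (2 - p)  # dispersion: ~256.5

print(p, mean, phi)
```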
Identify 3 ways to determine p parameter of Tweedie
- Using model-fitting software (can slow down model fitting)
- Optimization of metric (e.g. log-likelihood)
- Judgmental selection (often practical choice as p tends to have small impact on model estimates)
Describe which distribution is appropriate for probability modeling
Binomial
Use mean as modelled prob of event occurring
Use logistic function:
u = 1/(1+exp(-x)), where x is the linear predictor
Odds = u/(1-u)
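A minimal numeric sketch of the logistic function and odds (the linear predictor value 0.4 is arbitrary):

```python
import numpy as np

def logistic(x):
    """Inverse logit: maps the linear predictor to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

x = 0.4             # hypothetical linear predictor b0 + b1x1 + ...
u = logistic(x)     # modeled probability of the event: ~0.599
odds = u / (1 - u)  # odds; note ln(odds) = x, i.e. the logit link
print(u, odds)      # odds = e^0.4 ~ 1.492
```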
It is good practice to log continuous variables before using in model.
Explain why and give 2 exceptions.
Aligns the scale of the predictor with that of the entity it is predicting (the log link means the model works on ln(u)). Allows flexibility in fitting different curve shapes.
2 exceptions:
1. Using a year variable to pick up trend effects
2. If variable contains values of 0 (since ln(0) is undefined)
Why do we prefer choosing the level with the most observations as the base level
Otherwise, there will be wide CIs around coefficient estimates (although the predicted relativities stay the same)
Discuss how high correlation between 2 predictor variables can impact GLM
Main benefit of GLM over univariate analysis is being able to handle exposure correlation.
However, GLMs run into problems when predictor variables are very highly correlated. This can result in an unstable model, erratic coefficients and high standard errors.
Describe 2 options to deal with very high correlation in GLM
- Remove all highly correlated variables except one
This eliminates high correlation in model, but also potentially loses some unique info contained in eliminated variables.
- Use dimensionality-reduction techniques such as principal components analysis or factor analysis to create a subset of variables from the correlated variables and use that subset in the GLM.
Downside is the additional time required to do that extra analysis.
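A minimal sketch of the second option using scikit-learn's PCA on two made-up, highly correlated predictors:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = x1 + rng.normal(scale=0.1, size=500)  # nearly a copy of x1
X = np.column_stack([x1, x2])

# Replace the correlated pair with its first principal component,
# which would then enter the GLM in place of x1 and x2
pc1 = PCA(n_components=1).fit_transform(X)
print(pc1[:5])
```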
Describe multicollinearity, its potential impacts and how to detect
Occurs when there is a near-perfect linear dependency among 3 or more predictor variables.
When exists, the model may become unstable with erratic coefficients and may not converge to a solution
One way to detect is to use the variance inflation factor (VIF), which measures the impact on the squared standard error of a predictor's coefficient due to the presence of collinearity with other predictors.
VIF of 10 or more is considered high and would indicate to look into collinearity structure to determine how to best adjust model.
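statsmodels provides a VIF helper; a minimal sketch on made-up data with one nearly collinear pair:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
x1 = rng.normal(size=300)
x2 = x1 + rng.normal(scale=0.1, size=300)  # nearly collinear with x1
x3 = rng.normal(size=300)                  # independent
X = sm.add_constant(np.column_stack([x1, x2, x3]))

# VIF per predictor (index 0 is the constant); values of 10+ are high
for i in range(1, X.shape[1]):
    print(variance_inflation_factor(X, i))
```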
Describe aliasing
Aliasing occurs when there is a perfect linear dependency among predictor variables (ex: when missing data are excluded)
The GLM will not converge (no unique solution), or if it does, the coefficients will make no sense.
Most GLM software will detect aliasing and automatically remove one of the variables.
Identify 2 important limitations of GLM
- Give full credibility to data
Estimated coefficients are not credibility-weighted to recognize low volumes of data or high volatility.
- Assume randomness of outcomes is uncorrelated
This is an issue in 2 cases:
a. Using a dataset with several renewals of the same policy, since these are likely to have correlated outcomes
b. When data can be affected by weather, which is likely to cause similar outcomes for risks in the same area
Some extensions of GLM (GLMM or GEE) can help account for such correlation in data
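As an illustration of one such extension, a minimal GEE sketch in Python's statsmodels with an exchangeable within-policy correlation structure (the panel data are simulated):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Hypothetical panel: 50 policies, each observed over 4 renewals
df = pd.DataFrame({
    "policy": np.repeat(np.arange(50), 4),
    "x1": rng.normal(size=200),
})
df["claims"] = rng.poisson(np.exp(0.1 + 0.3 * df["x1"]))

# GEE: Poisson mean model allowing correlated outcomes within a policy
res = sm.GEE.from_formula(
    "claims ~ x1", groups="policy", data=df,
    family=sm.families.Poisson(),
    cov_struct=sm.cov_struct.Exchangeable(),
).fit()
print(res.summary())
```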