GLM Flashcards
Describe the 2 components of a GLM
- Random component
Captures portion of variation driven by causes other than predictors in model (including pure randomness)
- Systematic component
Portion of variation in a model that can be explained using predictor variables
Our goal in modelling with GLM is to shift as much of the variability as possible away from random component into systematic component.
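A minimal Python sketch of the split (all coefficients and data below are made up for illustration): the systematic component is the linear predictor built from the rating variables, and the random component is the distribution of outcomes around the implied mean.
```python
import numpy as np

rng = np.random.default_rng(0)

# Systematic component: linear predictor from hypothetical rating variables
x1 = rng.uniform(0, 1, size=1000)        # e.g. a scaled driver-age variable
x2 = rng.integers(0, 2, size=1000)       # e.g. a territory indicator
eta = -2.0 + 0.8 * x1 + 0.5 * x2         # b0 + b1*x1 + b2*x2

# Log link: implied mean is g^{-1}(eta)
mu = np.exp(eta)

# Random component: observed outcomes vary around mu (Poisson counts here)
y = rng.poisson(mu)
```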
Identify the variance function for the following distributions:
1. Normal
2. Poisson
3. Gamma
4. Inv Gaussian
5. NB
6. Binomial
7. Tweedie
- 1
- u
- u^2
- u^3
- u(1+ku)
- u(1-u)
- u^p
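A quick Python mapping of the variance functions above; the k (overdispersion) and p (Tweedie power) values are arbitrary illustrative choices.
```python
# Variance functions V(u) by distribution
variance_fn = {
    "Normal":       lambda u: 1.0,
    "Poisson":      lambda u: u,
    "Gamma":        lambda u: u**2,
    "Inv Gaussian": lambda u: u**3,
    "Neg Binomial": lambda u, k=0.5: u * (1 + k * u),
    "Binomial":     lambda u: u * (1 - u),
    "Tweedie":      lambda u, p=1.5: u**p,
}

for name, V in variance_fn.items():
    print(f"{name:>13}: V(2) = {V(2.0):.3f}")
```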
Define the 2 most common link functions
- Log: g(u) = ln(u)
Used for rating
- Logit: g(u) = ln(u/(1-u))
Used for binary target (0,1)
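A small Python sketch of both links and their inverses (the 0.3 value is illustrative only):
```python
import numpy as np

# Log link (rating): g(u) = ln(u), inverse g^{-1}(eta) = exp(eta)
def log_link(u):
    return np.log(u)

def log_link_inv(eta):
    return np.exp(eta)

# Logit link (binary target): g(u) = ln(u/(1-u)), inverse is the logistic curve
def logit_link(u):
    return np.log(u / (1 - u))

def logit_link_inv(eta):
    return 1 / (1 + np.exp(-eta))

assert np.isclose(log_link_inv(log_link(0.3)), 0.3)
assert np.isclose(logit_link_inv(logit_link(0.3)), 0.3)
```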
List 3 advantages of log link function
Log link function transforms the linear predictor into a multiplicative structure: u = exp(b0 + b1x1 +…)
This has 3 advantages:
1. Simple and practical to implement
2. Avoids negative premiums that could arise with an additive structure
3. Impact of risk characteristics is more intuitive
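A short sketch of the multiplicative structure (hypothetical coefficients): the additive linear predictor on the log scale becomes a base rate times relativities, which stays positive and is easy to interpret.
```python
import numpy as np

# Hypothetical log-link coefficients: intercept plus two rating variables
b0, b1, b2 = 5.0, 0.18, -0.25
x1, x2 = 1, 1   # indicator variables for two non-base levels

# Additive on the log scale ...
mu = np.exp(b0 + b1 * x1 + b2 * x2)

# ... is multiplicative on the premium scale: base rate times relativities
base_rate = np.exp(b0)
rel1, rel2 = np.exp(b1), np.exp(b2)
assert np.isclose(mu, base_rate * rel1 ** x1 * rel2 ** x2)
```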
List 2 uses of offset terms
Allows you to incorporate pre-determined values for certain variables into model so GLM takes them as given.
2 uses:
1. Deductible factors are often developed outside of model
ln(u) = b0 + b1x1 +…+ ln(Rel(1-LER))
2. Dependent variable varies directly with a particular measure (e.g. exposures)
ln(u) = b0 + b1x1 +…+ ln(# car years)
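A hedged sketch of the exposure offset, assuming the statsmodels GLM interface and simulated data: ln(exposure) enters on the scale of the linear predictor with its coefficient fixed at 1.
```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 500
exposure = rng.uniform(0.25, 1.0, size=n)      # car-years
x = rng.integers(0, 2, size=n).astype(float)
mu = exposure * np.exp(-2.0 + 0.4 * x)         # frequency proportional to exposure
y = rng.poisson(mu)

X = sm.add_constant(x)
# ln(exposure) is passed as an offset: taken as given, not estimated
model = sm.GLM(y, X, family=sm.families.Poisson(), offset=np.log(exposure))
print(model.fit().summary())
```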
Describe the steps to calculate offset
- Calculate unbiased factor 1 - LER
- Rebase factor: Rel = Factor(i) / Factor(base)
- Offset = g(rebased factor), e.g. ln(rebased factor) under a log link
- Include fixed offsets before running GLM so that all estimated coefficients for other predictors are optimal in their presence
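A worked sketch of the steps using made-up deductible LERs and a $500 base level:
```python
import numpy as np

# Hypothetical loss elimination ratios by deductible, base level = $500
ler = {"500": 0.10, "1000": 0.18, "2500": 0.30}

factor = {ded: 1 - l for ded, l in ler.items()}                   # step 1: unbiased factor 1 - LER
rebased = {ded: f / factor["500"] for ded, f in factor.items()}   # step 2: rebase to the base level
offset = {ded: np.log(r) for ded, r in rebased.items()}           # step 3: apply the (log) link

print(offset)   # added to the linear predictor of each record before fitting
```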
Describe 3 methods to assess variable significance
- Standard error
Estimated std dev of random process
Small value indicates estimate is expected to be relatively close to true value.
Large value indicates that a wide range of estimates can be achieved through randomness.
- P-value
Probability of a coefficient at least as extreme as the estimate arising by pure chance.
H0: Beta(i) = 0
H1: Beta(i) different than 0
Small value indicates we have a small chance of observing the coefficient randomly.
- Confidence interval
Gives a range of possible values for a coefficient that would not be rejected at a given p-threshold
95% CI would be based on a 5% p-value
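A small sketch of the p-value and confidence interval obtained from a coefficient and its standard error, using a normal approximation and made-up numbers:
```python
from scipy import stats

beta_hat, se = 0.12, 0.05        # hypothetical coefficient estimate and standard error

z = beta_hat / se                            # test statistic for H0: Beta = 0
p_value = 2 * (1 - stats.norm.cdf(abs(z)))   # two-sided p-value

z_crit = stats.norm.ppf(0.975)               # 95% CI <-> 5% p-threshold
ci = (beta_hat - z_crit * se, beta_hat + z_crit * se)

print(round(p_value, 4), [round(c, 4) for c in ci])
```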
Describe 2 distributions appropriate for severity modelling and their 5 desired characteristics
Gamma and Inverse Gaussian
1. Right-skewed
2. Lower bound at 0
3. Sharp peaked (inv gauss > gamma)
4. Wide tail (inv gauss > gamma)
5. Larger claims have more variance (u^2, u^3)
Describe 2 distributions appropriate for frequency modeling
- Poisson
Dispersion parameter adds flexibility (allows var > mean)
Poisson and ODP will produce same coefficients but model diagnostics will change (understated variance distorts std errors and p-values)
- Negative Binomial
Poisson whose mean itself follows a gamma distribution
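A hedged statsmodels sketch of the Poisson vs ODP point on simulated overdispersed counts: the coefficients are unchanged, only the standard errors are rescaled by the estimated dispersion.
```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 2000
x = rng.integers(0, 2, size=n).astype(float)
X = sm.add_constant(x)

# Simulate overdispersed counts: Poisson mixed over a gamma (mean-1 mixing)
mu = np.exp(-1.0 + 0.3 * x)
y = rng.poisson(mu * rng.gamma(shape=2.0, scale=0.5, size=n))

pois = sm.GLM(y, X, family=sm.families.Poisson()).fit()
phi = pois.pearson_chi2 / pois.df_resid            # estimated dispersion parameter
print(pois.params)                                 # coefficients are the same under ODP
print(pois.bse, pois.bse * np.sqrt(phi))           # Poisson vs dispersion-corrected std errors
```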
Describe 2 characteristics a frequency error distribution should have
- Non-negative
- Multiplicative relationship fits frequency better than additive relationship
Describe which distribution is appropriate for pure premium / LR modeling and give 3 reasons/desired characteristics
Tweedie:
1. Mass point at zero (lots of insured have no claims)
2. Right-skewed
3. Power parameter allows some other distributions to be special cases (p=0 if normal, p=1 if poisson, p=2 if Gamma, p=3 if Inv Gauss)
What happens when the power parameter of the Tweedie is between 1 and 2
Compound poisson freq & gamma sev
Smoother curve with no apparent spike
Implicit assumption that freq & sev move in same direction (often not realistic but robust enough)
Calculate mean of Tweedie
lambda * alpha * theta (lambda = Poisson frequency, alpha = gamma shape, theta = gamma scale)
Calculate power parameter of Tweedie
p = (a+2) / (a+1)
Calculate dispersion parameter of Tweedie
lambda^(1-p) * (a*theta)^(2-p) / (2-p)
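A worked sketch of the three Tweedie formulas above with made-up compound Poisson-gamma parameters:
```python
# Hypothetical compound Poisson-gamma parameters
lam, alpha, theta = 0.1, 2.0, 5000.0   # claim frequency, gamma shape, gamma scale

mean = lam * alpha * theta                               # expected pure premium
p = (alpha + 2) / (alpha + 1)                            # power parameter
phi = lam**(1 - p) * (alpha * theta)**(2 - p) / (2 - p)  # dispersion parameter

print(mean, round(p, 4), round(phi, 2))
```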
Identify 3 ways to determine p parameter of Tweedie
- Using model-fitting software (can slow down model)
- Optimization of metric (e.g. log-likelihood)
- Judgmental selection (often practical choice as p tends to have small impact on model estimates)
Describe which distribution is appropriate for probability modeling
Binomial
Use mean as modelled prob of event occurring
Use logistic function:
u = 1/(1+exp(-x))
Odds = u/(1-u)
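A small sketch of the logistic function and the implied odds (the linear predictor value is made up):
```python
import numpy as np

def logistic(x):
    """Inverse logit: modelled probability of the event."""
    return 1 / (1 + np.exp(-x))

x = -0.4                 # hypothetical linear predictor for one risk
u = logistic(x)          # probability of the event
odds = u / (1 - u)       # equals exp(x) under the logit link
assert np.isclose(odds, np.exp(x))
```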
It is good practice to log continuous variables before using in model.
Explain why and give 2 exceptions.
Forces alignment of the predictor's scale to that of the entity it is predicting. Allows flexibility in fitting different curve shapes.
2 exceptions:
1. Using a year variable to pick up trend effects
2. If variable contains values of 0 (since ln(0) undefined)
Why do we prefer choosing level with most observations as base level
Otherwise, there will be wide CIs around coefficient estimates (although same predicted relativities)
Discuss how high correlation between 2 predictor variables can impact GLM
Main benefit of GLM over univariate analysis is being able to handle exposure correlation.
However, GLMs run into problems when predictor variables are very highly correlated. This can result in an unstable model, erratic coefficients and high standard errors.
Describe 2 options to deal with very high correlation in GLM
- Remove all highly correlated variables except one
This eliminates high correlation in the model, but also potentially loses some unique info contained in the eliminated variables.
- Use dimensionality-reduction techniques such as principal components analysis or factor analysis to create a subset of variables from the correlated variables and use that subset in the GLM.
Downside is the additional time required to do that extra analysis.
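A hedged sklearn sketch of the second option: principal components analysis collapses three simulated, highly correlated predictors into a single component that can then be used in the GLM.
```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
# Three highly correlated hypothetical predictors, e.g. vehicle value proxies
base = rng.normal(size=(1000, 1))
X_corr = base + rng.normal(scale=0.1, size=(1000, 3))

pca = PCA(n_components=1)
pc1 = pca.fit_transform(X_corr)    # one component replaces the correlated trio
print(pca.explained_variance_ratio_)
```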
Describe multicollinearity, its potential impacts and how to detect
Occurs when there is a near-perfect linear dependency among 3 or more predictor variables.
When it exists, the model may become unstable with erratic coefficients and may not converge to a solution
One way to detect it is the variance inflation factor (VIF), which measures the increase in the squared standard error of a predictor's coefficient due to the presence of collinearity with the other predictors.
VIF of 10 or more is considered high and would indicate to look into collinearity structure to determine how to best adjust model.
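A hedged statsmodels sketch of the VIF check on simulated data with two nearly collinear predictors:
```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(4)
x1 = rng.normal(size=500)
x2 = 0.95 * x1 + rng.normal(scale=0.3, size=500)   # nearly collinear with x1
x3 = rng.normal(size=500)

X = sm.add_constant(np.column_stack([x1, x2, x3]))
vifs = [variance_inflation_factor(X, i) for i in range(1, X.shape[1])]
print(vifs)   # values of 10 or more flag problematic collinearity
```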
Describe aliasing
Aliasing occurs when there is a perfect linear dependency among predictor variables (ex: when missing data are excluded)
The GLM will not converge (no unique solution) or, if it does, the coefficients will make no sense.
Most GLM software will detect this and automatically remove one of the variables.
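A small numpy sketch of aliasing: a predictor that is an exact linear combination of two others makes the design matrix rank-deficient, so no unique solution exists.
```python
import numpy as np

rng = np.random.default_rng(6)
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)
x3 = x1 + x2                                  # perfect linear dependency

X = np.column_stack([np.ones(100), x1, x2, x3])
print(np.linalg.matrix_rank(X), X.shape[1])   # rank 3 < 4 columns: aliased
```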
Identify 2 important limitations of GLM
- Give full credibility to data
Estimated coefficients are not cred-wtd to recognize low volumes of data or high volatility.
- Assume randomness of outcomes is uncorrelated
This is an issue in 2 cases:
a. Using dataset with several renewals of same policy since likely to have correlated outcomes
b. When data can be affected by weather, which is likely to cause similar outcomes for risks in the same area
Some extensions of GLM (GLMM or GEE) can help account for such correlation in data
List the 9 steps to build a model
Hint: Obeying Simple Directions Elicits Fully Outstanding Very Powerful Model Results
- setting goals and Objectives
- communicate with key Stakeholders
- collect & process Data
- conduct Exploratory data analysis
- specify model Form
- evaluate model Output
- Validate model
- translate model results into Product
- Maintain & Rebuild model
Discuss 2 considerations/potential issues in matching policy and claims
- Matching claims to specific vehicles/drivers or coverages
- Are there timing differences between datasets? How often is each updated? Timing diff can cause record matching problems
- Is there a unique key to merge data (ex: policy number)? Potential for orphaned claims or duplicated claims if there are multiple policy records.
- Level of aggregation before merging, time dimension (CY vs PY), policy level vs claimant level, location level or per risk level
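A small pandas sketch of the merge considerations (hypothetical policy and claim extracts keyed on policy number), including a check for orphaned claims:
```python
import pandas as pd

policies = pd.DataFrame({"policy_no": [1, 2, 3], "earned_exposure": [1.0, 0.5, 1.0]})
claims = pd.DataFrame({"policy_no": [2, 2, 4], "incurred": [1000, 250, 800]})

# Left join keeps every policy; claims with no matching policy (policy 4) are orphaned
merged = policies.merge(claims, on="policy_no", how="left")
orphans = claims[~claims["policy_no"].isin(policies["policy_no"])]
print(merged)
print(orphans)
```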
Discuss 2 considerations in modifying (cleaning) data prior to modeling
- check for duplicate records and remove them prior to aggregation
- check categorical fields against documentation (new codes, errors)
- check reasonability of numerical fields (negative premium, outliers)
- Decide how to handle errors and missing values (discard or replace with average values)
- Convert continuous variables to categorical (binning)
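A small pandas sketch of the cleaning checks above (the data, error-handling choices and bin boundaries are all hypothetical):
```python
import pandas as pd

df = pd.DataFrame({
    "policy_no": [1, 1, 2, 3],
    "veh_age": [3, 3, 25, -1],
    "premium": [500, 500, 800, -100],
})

df = df.drop_duplicates()                      # remove duplicate records
bad_premium = df[df["premium"] <= 0]           # reasonability check on a numeric field
df["veh_age_band"] = pd.cut(                   # bin a continuous variable
    df["veh_age"],
    bins=[-float("inf"), 0, 5, 15, float("inf")],
    labels=["unknown", "0-5", "6-15", "16+"],
)
print(df, bad_premium, sep="\n")
```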
Discuss possible data adjustments prior to modeling
- Cap large losses and remove cats
- Develop losses
- On-level premiums
- Trend exposures and losses
- Use a time variable in the model to control for these effects (not as good as the other adjustments), e.g. group calendar years into ranges
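A small numpy sketch of the loss adjustments; the cap level, development factors and trend rate are all made-up assumptions.
```python
import numpy as np

losses = np.array([2_000, 45_000, 900_000])   # hypothetical reported losses
acc_year = np.array([2019, 2020, 2021])

capped = np.minimum(losses, 250_000)          # cap large losses (cap is an assumption)
ldf = np.array([1.10, 1.25, 1.60])            # illustrative age-to-ultimate factors
developed = capped * ldf                      # develop losses
trend = 1.05 ** (2023 - acc_year)             # trend to an assumed common cost level
trended = developed * trend
print(trended.round(0))
```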
Why don’t we train and test on same dataset
Would be inappropriate since it would give biased results of model performance
More variables will always cause the model to fit the training data better (overfitting) but may not fit other datasets better, since the model begins treating random noise in the data as part of the systematic component
We want to pick as much signal as possible with minimal noise
Describe 3 model testing strategies
- Train & test
Split data into 1 training set and 1 testing set (usually 60/40 or 70/30)
Can split randomly or on time basis
Adv of splitting on time: weather events are not in both datasets, so results are not overly optimistic
- Train, validate & test
Split data into 3 sets (e.g. 40/30/30). Validation set can be used to refine and tweak the model before the test set.
- Cross validation
Most common is k-fold.
Pick number k and split data into k groups
For each fold, train model using k-1 folds and test model using kth fold.
Tends to be superior since more data is used in both training and testing, but it is extremely time-consuming
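A hedged sklearn sketch of a 70/30 train & test split and k-fold cross validation on simulated data (the fitting step inside the loop is left as a placeholder):
```python
import numpy as np
from sklearn.model_selection import train_test_split, KFold

rng = np.random.default_rng(5)
X = rng.normal(size=(1000, 3))
y = rng.poisson(np.exp(-1.0 + 0.3 * X[:, 0]))

# Train & test: 70/30 random split (could also split on a time field)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# k-fold cross validation: each fold is held out once for testing
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in kf.split(X):
    X_tr, X_te = X[train_idx], X[test_idx]
    y_tr, y_te = y[train_idx], y[test_idx]
    # ... fit the GLM on the training folds and score it on the held-out fold
```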
Identify 4 advantages of modeling freq and sev separately over pure premium
- Gain more insight and intuition about impact of each predictor variable
- Each is more stable (variable that only impacts freq will look less significant in pure premium model)
- PP can lead to overfitting if a predictor variable only impacts freq or sev but not both, since the randomness of the other component may be mistaken for signal
- Tweedie distribution assumes both freq and sev move in same direction, which may not be true
Identify 2 disadvantages of modeling freq and sev separately
- Requires claim count and claim amount data to be available separately
- Takes more time to build 2 models