GLM Flashcards
Describe the 2 components of a GLM
- Random component
- Random component
Captures portion of variation driven by causes other than predictors in model (including pure randomness)
- Systematic component
Portion of variation in a model that can be explained using predictor variables
Our goal in modelling with GLM is to shift as much of the variability as possible away from random component into systematic component.
Identify the variance function for the following distributions:
1. Normal
2. Poisson
3. Gamma
4. Inv Gaussian
5. NB
6. Binomial
7. Tweedie
- 1
- u
- u^2
- u^3
- u(1+ku)
- u(1-u)
- u^p
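A minimal Python sketch (illustrative only) that maps each family above to its variance function V(u); the function name and parameter defaults are assumptions for the example.

```python
import numpy as np

def variance_function(mu, family, k=1.0, p=1.5):
    """Return V(mu) for common GLM families (illustrative sketch)."""
    mu = np.asarray(mu, dtype=float)
    if family == "normal":
        return np.ones_like(mu)
    if family == "poisson":
        return mu
    if family == "gamma":
        return mu ** 2
    if family == "inverse_gaussian":
        return mu ** 3
    if family == "negative_binomial":   # k is the overdispersion parameter
        return mu * (1 + k * mu)
    if family == "binomial":
        return mu * (1 - mu)
    if family == "tweedie":             # p is the power parameter
        return mu ** p
    raise ValueError(f"unknown family: {family}")

# e.g. variance_function([0.1, 0.5], "tweedie", p=1.67)
```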
Define the 2 most common link functions
- Log: g(u) = ln(u)
Used for rating
- Logit: g(u) = ln(u/(1-u))
Used for binary target (0,1)
List 3 advantages of log link function
The log link function transforms the linear predictor into a multiplicative structure: u = exp(b0 + b1x1 + …)
Which has 3 advantages:
1. Simple and practical to implement
2. Avoids negative premiums that could arise if additive structure
3. Impact of risk characteristics is more intuitive
List 2 uses of offset terms
Allows you to incorporate pre-determined values for certain variables into model so GLM takes them as given.
2 uses:
1. Deductible factors are often developed outside of the model
ln(u) = b0 + b1x1 + … + ln(rebased (1 - LER) factor)
2. Target variable varies directly with a particular measure (e.g. exposures)
ln(u) = b0 + b1x1 + … + ln(# of car years)
Describe the steps to calculate offset
- Calculate unbiased factor 1 - LER
- Rebase factor: Rel = Factor(i) / Factor(base)
- Offset = g(rebased factor)
- Include fixed offsets before running the GLM so that all estimated coefficients for other predictors are optimal in their presence
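A minimal Python sketch of the offset steps above under a log link; the deductible factor values and base level are hypothetical.

```python
import numpy as np

# Hypothetical deductible (1 - LER) factors developed outside the GLM.
factors = {"500": 1.00, "1000": 0.95, "2500": 0.88}
base = "500"

# Rebase so the base deductible has a relativity of 1, then take the log:
# under a log link the offset is g(rebased factor) = ln(rebased factor).
offsets = {ded: np.log(f / factors[base]) for ded, f in factors.items()}

# These fixed offsets are added to each record's linear predictor before
# fitting, so the other coefficients are estimated in their presence.
```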
Describe 3 methods to assess variable significance
- Standard error
Estimated std dev of random process
Small value indicates estimate is expected to be relatively close to true value.
A large value indicates that a wide range of estimates can be achieved through randomness.
- P-value
Probability of obtaining a value at least as extreme as the estimate by pure chance.
H0: Beta(i) = 0
H1: Beta(i) different than 0
A small value indicates that such a coefficient is unlikely to have arisen by pure chance.
- Confidence interval
Gives a range of possible values for a coefficient that would not be rejected at a given p-threshold
95% CI would be based on a 5% p-value
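A hedged Python sketch of these significance measures for a single coefficient, using a normal (Wald) approximation; the function name and example inputs are assumptions.

```python
from scipy.stats import norm

def wald_summary(beta_hat, std_err, alpha=0.05):
    """Wald z-statistic, two-sided p-value and CI for one GLM coefficient."""
    z = beta_hat / std_err
    p_value = 2 * norm.sf(abs(z))              # H0: beta = 0 vs H1: beta != 0
    half_width = norm.ppf(1 - alpha / 2) * std_err
    ci = (beta_hat - half_width, beta_hat + half_width)
    return z, p_value, ci

# e.g. wald_summary(0.25, 0.10) -> small p-value, 95% CI excludes 0
```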
Describe 2 distributions appropriate for severity modelling and their 5 desired characteristics
Gamma and Inverse Gaussian
1. Right-skewed
2. Lower bound at 0
3. Sharp peaked (inv gauss > gamma)
4. Wide tail (inv gauss > gamma)
5. Larger claims have more variance (u^2, u^3)
Describe 2 distributions appropriate for frequency modeling
- Poisson
Dispersion parameter adds flexibility (allows var > mean)
Poisson and ODP will produce the same coefficients, but model diagnostics will change (understated variance distorts std errors and p-values)
- Negative Binomial
Poisson whose mean follows a gamma distribution
Describe 2 characteristics that a frequency error distribution should have
- Non-negative
- Multiplicative relationship fits frequency better than additive relationship
Describe which distribution is appropriate for pure premium / LR modeling and give 3 reasons/desired characteristics
Tweedie:
1. Mass point at zero (lots of insured have no claims)
2. Right-skewed
3. Power parameter allows some other distributions to be special cases (p=0 if normal, p=1 if poisson, p=2 if Gamma, p=3 if Inv Gauss)
What happens when the power parameter of the Tweedie is between 1 and 2
Compound poisson freq & gamma sev
Smoother curve with no apparent spike
Implicit assumption that freq & sev move in same direction (often not realistic but robust enough)
Calculate mean of Tweedie
lambda * alpha * theta
Calculate power parameter of Tweedie
p = (a+2) / (a+1)
Calculate dispersion parameter of Tweedie
lambda^(1-p) * (a*theta)^(2-p) / (2-p)
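A small Python helper (names assumed for illustration) applying the three Tweedie formulas above to Poisson frequency (lambda) and gamma severity (alpha, theta) parameters.

```python
def tweedie_params(lam, alpha, theta):
    """Tweedie mean, power and dispersion from Poisson/gamma parameters."""
    mu = lam * alpha * theta                                      # mean
    p = (alpha + 2) / (alpha + 1)                                 # power (1 < p < 2)
    phi = lam ** (1 - p) * (alpha * theta) ** (2 - p) / (2 - p)   # dispersion
    return mu, p, phi

# e.g. tweedie_params(lam=0.1, alpha=2.0, theta=5000.0)
```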
Identify 3 ways to determine p parameter of Tweedie
- Using model-fitting software (can slow down model)
- Optimization of metric (e.g. log-likelihood)
- Judgmental selection (often practical choice as p tends to have small impact on model estimates)
Describe which distribution is appropriate for probability modeling
Binomial
Use mean as modelled prob of event occurring
Use the logit link; inverting it gives the logistic function:
u = 1/(1+exp(-x)), where x is the linear predictor
Odds = u/(1-u)
It is good practice to log continuous variables before using them in a model.
Explain why and give 2 exceptions.
Forces alignment of the predictor's scale with that of the entity it is predicting. Allows flexibility in fitting different curve shapes.
2 exceptions:
1. Using a year variable to pick up trend effects
2. If variable contains values of 0 (since ln(0) is undefined)
Why do we prefer choosing level with most observations as base level
Otherwise, there will be wide CIs around the coefficient estimates (although the predicted relativities are the same)
Discuss how high correlation between 2 predictor variables can impact GLM
Main benefit of GLM over univariate analysis is being able to handle exposure correlation.
However, GLM run into problems when predictor variables are very highly correlated. This can result in unstable model, erratic coefficients and high standard errors.
Describe 2 options to deal with very high correlation in GLM
- Remove all highly correlated variables except one
This eliminates high correlation in the model, but also potentially loses some unique info contained in the eliminated variables.
- Use dimensionality-reduction techniques such as principal components analysis or factor analysis to create a subset of variables from the correlated variables and use that subset in the GLM.
Downside is the additional time required for that extra analysis.
Describe multicollinearity, its potential impacts and how to detect
Occurs when there is a near-perfect linear dependency among 3 or more predictor variables.
When it exists, the model may become unstable, with erratic coefficients, and may not converge to a solution
One way to detect it is the variance inflation factor (VIF), which measures how much the squared standard error of a predictor's coefficient is increased by collinearity with the other predictors.
A VIF of 10 or more is considered high and indicates the collinearity structure should be examined to determine how best to adjust the model.
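An illustrative numpy-only sketch of the VIF calculation: regress one predictor column on the others and take 1/(1 - R^2). It assumes X holds the predictor columns without an intercept.

```python
import numpy as np

def vif(X, j):
    """Variance inflation factor of column j of predictor matrix X (no intercept)."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    others = np.column_stack([np.ones(len(y)), others])   # add intercept
    coef, *_ = np.linalg.lstsq(others, y, rcond=None)
    resid = y - others @ coef
    r2 = 1 - resid.var() / y.var()
    return 1.0 / (1.0 - r2)

# A VIF of 10 or more is commonly treated as a flag for problematic collinearity.
```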
Describe aliasing
Aliasing occurs when there is a perfect linear dependency among predictor variables (ex: when missing data are excluded)
The GLM will not converge (no unique solution) or if it does, coefficients will make no sense.
Most GLM software will detect this and automatically remove one of the variables.
Identify 2 important limitations of GLM
- Give full credibility to data
Estimated coefficients are not credibility-weighted to recognize low volumes of data or high volatility.
- Assume randomness of outcomes is uncorrelated
This is an issue in 2 cases:
a. Using dataset with several renewals of same policy since likely to have correlated outcomes
b. when data can be affected by weather: likely to cause similar outcomes to risks in same area
Some extensions of GLM (GLMM or GEE) can help account for such correlation in data
List the 9 steps to build a model
Hint: Obeying Simple Directions Elicits Fully Outstanding Very Powerful Model Results
- setting goals and Objectives
- communicate with key Stakeholders
- collect & process Data
- conduct Exploratory data analysis
- specify model Form
- evaluate model Output
- Validate model
- translate model results into Product
- Maintain & Rebuild model
Discuss 2 considerations/potential issues in matching policy and claims
- Matching claims to specific vehicles/drivers or coverages
- Are there timing differences between datasets? How often is each updated? Timing diff can cause record matching problems
- Is there a unique key to merge data (ex: policy number). Potential for orphaned claims or duplicating claims if multiple policy records.
- Level of aggregation before merging, time dimension (CY vs PY), policy level vs claimant level, location level or per risk level
Discuss 2 considerations in modifying (cleaning) data prior to modeling
- check for duplicate records and remove them prior to aggregation
- check categorical fields against documentation (new codes, errors)
- check reasonability of numerical fields (negative premium, outliers)
- Decide how to handle errors and missing values (discard or replace with average values)
- Convert continuous variables to categorical (binning)
Discuss possible data adjustments prior to modeling
- Cap large losses and remove cats
- Develop losses
- On-level premiums
- Trend exposures and losses
- Use a time variable in the model to control for these effects (not as good as the other adjustments), e.g. group ages by range
Why don’t we train and test on same dataset
Doing so would give biased (overly optimistic) estimates of model performance
Adding more variables will always make the model fit the training data better, but it may not fit other datasets better since it begins treating random noise in the data as part of the systematic component (overfitting)
We want to pick as much signal as possible with minimal noise
Describe 3 model testing strategies
- Train & test
Split data into 1 training set and 1 testing set (usually 60/40 or 70/30)
Can split randomly or on time basis
Advantage of time-based split: weather events are not in both datasets, so results are not overly optimistic
- Train, validate & test
Split data into 3 sets (e.g. 40/30/30). The validation set can be used to refine and tweak the model before using the test set.
- Cross-validation
Most common is k-fold.
Pick number k and split data into k groups
For each fold, train model using k-1 folds and test model using kth fold.
Tends to be superior since more data is used in both training and testing, but it is extremely time-consuming
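A minimal numpy sketch of k-fold cross-validation splitting; the fold count and seed are arbitrary choices for the example.

```python
import numpy as np

def k_fold_indices(n_records, k=5, seed=0):
    """Split record indices into k roughly equal folds."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_records)
    return np.array_split(idx, k)

folds = k_fold_indices(10_000, k=5)
for i, test_idx in enumerate(folds):
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    # fit the GLM on train_idx, evaluate on test_idx
```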
Identify 4 advantages of modeling freq and sev separately over pure premium
- Gain more insight and intuition about impact of each predictor variable
- Each is more stable (variable that only impacts freq will look less significant in pure premium model)
- PP modeling can lead to overfitting if a predictor variable impacts only freq or only sev, since the randomness of the other component may be mistaken for signal
- Tweedie distribution assumes freq and sev move in the same direction, which may not be true
Identify 2 disadvantages of modeling freq and sev separately
- Requires data to be available
- Takes more time to build 2 models
Identify 4 considerations in variable selection
- Significance: we want to be confident effect of var is result of true relationship between predictor and target and not due to noise in data
- Cost-effectiveness of collecting data for variable
- Conformity with actuarial standards of practice and regulatory requirements
- Whether the quotation system can include the variable
How can we calculate partial residuals
ri = (yi - ui)g’(ui) + beta(j)xij
if log link:
(yi - ui)/ui + beta(j)*xij
They can then be plotted against x_j, and the line y = beta(j)*x_j can be drawn to see how well it matches the residual points
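A short Python sketch of partial residuals under a log link, following the formula above; function and argument names are illustrative.

```python
import numpy as np

def partial_residuals_log_link(y, mu, beta_j, x_j):
    """Partial residuals r_ij = (y_i - mu_i) * g'(mu_i) + beta_j * x_ij,
    with g'(mu) = 1/mu under a log link."""
    return (y - mu) / mu + beta_j * x_j

# Plot these against x_j and overlay the line y = beta_j * x_j to judge
# whether the fitted linear effect matches the data.
```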
If systematic deviation of residuals from line is observed, what do we do?
We will want to transform the predictor variable in one of 4 ways:
- Binning the variable: turning it into a categorical variable with separate bins
Removes the need to constrain the fit to any particular shape, but increases degrees of freedom, which can result in inconsistent or impractical patterns
- Add polynomial terms (ex: x^2, x^3)
Loss of interpretability without a graph, and can behave erratically at the edges of the data
- Add piecewise linear functions (ex: hinge functions)
Allows fit of a wide range of non-linear patterns and is easy to interpret, but breakpoints must be chosen manually and degrees of freedom increase
- Add natural cubic splines
Combines piecewise functions with polynomials, which better fits the data with a smooth curve, but a graph is needed to interpret
When do we want to use interaction variables
When the effect of one predictor depends on the level of another predictor (ex: gender affected losses only below a certain age)
Identify 4 advantages of centering at base level
- Other coefficients are easier to interpret, particularly true if interaction terms exist
- Intercept becomes more intuitive as avg frequency at base level
- Avoids counter-intuitive signs of coefficients when interaction
- Lower p-values for variable significance (tighter CI)
Describe 2 measures used in diagnostic tests for overall model fit
- Log-likelihood
Log of the product of likelihoods over all observations (= sum of log-likelihoods)
- Scaled deviance
D* = 2(ll_saturated - ll_model)
Unscaled deviance D = dispersion parameter × D*
GLMs are fit by maximizing ll so D is minimized
Describe the 2 conditions for validity of ll & D comparisons
- Same dataset is used to fit the 2 models
- Assumed distribution and dispersion param same for the 2 models
Describe 2 options to compare candidate models using ll & D
- F-Test
F = (D_S - D_B) / (# of added parameters × phi_B)
D_S is the unscaled deviance of the smaller model
D_B is the unscaled deviance of the bigger model
phi_B is the dispersion parameter of the bigger model
If F > F-table value (df_num = # of added parameters, df_denom = n - p_B, where n = # of observations and p_B = # of parameters in bigger model), we prefer the bigger model
For non-nested models, use penalized measures of fit:
- AIC = -2ll + 2p
- BIC = -2ll + p*ln(n)
Lower is better
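A hedged Python sketch of the nested-model F-test and the penalized fit measures above; argument names are assumptions and scipy's F distribution supplies the table value.

```python
import numpy as np
from scipy.stats import f as f_dist

def f_test(D_small, D_big, phi_big, added_params, n_obs, p_big, alpha=0.05):
    """Nested-model F-test on unscaled deviances."""
    f_stat = (D_small - D_big) / (added_params * phi_big)
    critical = f_dist.ppf(1 - alpha, dfn=added_params, dfd=n_obs - p_big)
    return f_stat, critical        # f_stat > critical favours the bigger model

def aic(loglik, p):
    return -2 * loglik + 2 * p

def bic(loglik, p, n_obs):
    return -2 * loglik + p * np.log(n_obs)
```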
Describe 3 measures of deviation of actual from predicted
- Raw residual (ex: yi - ui)
- Deviance residuals
d_i = sign(y_i - u_i) * sqrt(2*phi*[ln f(y_i | u_i = y_i) - ln f(y_i | u_i)])
Residual adjusted for the shape of the GLM distribution
Should follow a normal distribution with no predictable pattern
Homoscedasticity: normally distributed with constant variance
- Working residuals
wri = (yi - ui)*g’(ui)
if log link wri = (yi - ui)/ui
if logit link: wri = (yi - ui) / [ui(1-ui)]
Critical to bin residuals before analysis
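A minimal numpy sketch of working residuals under a log link and the binning step noted above; the bin count is arbitrary.

```python
import numpy as np

def working_residuals_log_link(y, mu):
    """Working residuals under a log link: (y - mu) / mu."""
    return (y - mu) / mu

def bin_residuals(resid, mu, n_bins=50):
    """Average working residuals within bins of predicted value
    (individual residuals are too noisy to read directly)."""
    order = np.argsort(mu)
    bins = np.array_split(order, n_bins)
    return np.array([resid[b].mean() for b in bins])
```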
Identify 3 options to measure model stability
- Cook’s distance
Higher indicates higher level of influence.
Records with the highest values should be given additional scrutiny as to whether they should be included.
- Cross-validation
Compare in-sample parameter estimates across model runs. Model should produce similar results when run on separate subsets of the initial dataset.
- Bootstrapping
Create new datasets with the same number of records by randomly sampling with replacement from the original dataset.
The model can then be refit on the different datasets to get statistics such as the mean and variance of each parameter estimate.
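A small numpy sketch of the bootstrapping step: resample record indices with replacement, then refit on each resampled dataset; sample counts and seed are illustrative.

```python
import numpy as np

def bootstrap_indices(n_records, n_samples=100, seed=0):
    """Resample record indices with replacement to build bootstrap datasets."""
    rng = np.random.default_rng(seed)
    return [rng.integers(0, n_records, size=n_records) for _ in range(n_samples)]

# Refit the GLM on each resampled dataset, then summarize the mean and
# variance of each coefficient across the refits to gauge stability.
```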
State 2 reasons why model refinement techniques may not be appropriate for model selection
- Some of the models may be proprietary
Info on the data & detailed model form needs to be available to evaluate a model
- Final decision is often a business call
Those deciding may know nothing about predictive modeling and actuarial science
Briefly explain scoring
Scoring is the process of using models to make predictions from individual records
It can be used in model selection
Should always score on holdout sample
Then we can use techniques for model selection
Describe 2 techniques for model selection
- Plot actual vs predicted
The closer the points are to the line y=x, the better the prediction
- Lift-based measures:
Model lift = economic value of model/ability to prevent adverse selection
Attempts to visually demonstrate model’s ability to charge each insured an actuarially sound rate
Requires 2 or more competing models
List 4 lift-based measures for model selection
- Simple Quintile Plots
- Double Lift Charts
- Loss Ratio Charts
- Gini Index
Briefly explain the Simple Quintile Plots measure for model selection.
For each model:
a. Sort holdout dataset based on model’s predicted loss cost
b. Bucket data into quantiles having equal exposures
c. Calculate average predicted loss cost & avg actual loss cost for each bucket and plot them on graph.
Winning model should be based on 3 criteria:
1. Predictive accuracy
Average predicted loss cost should be consistently close to the average actual loss cost in each quantile
2. Monotonicity
Actual PP should consistently increase across all quantiles
3. Vertical distance of actuals between 1st and last quantile
Indicates how well the model distinguishes between the best and worst risks.
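A minimal Python sketch of building a quantile plot's inputs: sort by predicted loss cost, bucket, and compare average predicted vs. actual; it assumes equal-exposure records for simplicity.

```python
import numpy as np

def quantile_lift(predicted, actual, n_buckets=5):
    """Average predicted vs. actual loss cost by predicted-loss-cost quantile
    (assumes roughly equal-exposure records)."""
    order = np.argsort(predicted)
    buckets = np.array_split(order, n_buckets)
    avg_pred = [predicted[b].mean() for b in buckets]
    avg_actual = [actual[b].mean() for b in buckets]
    return avg_pred, avg_actual  # plot both; check accuracy, monotonicity, spread
```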
Briefly explain the double lift chart measure for model selection
Compares 2 models on the same graph
Sort holdout data by the ratio of the two models' predicted loss costs, bucket into quantiles, and plot each model's average predicted loss cost and the average actual loss cost
Winning model is the one that best matches actual in each quantile
Briefly explain the Loss Ratio Charts measure for model selection
Generally easier to understand since LR is the most commonly-used metric in insurance profitability
The greater the vertical distance between the lowest and highest LRs, the better the model does at identifying segmentation opportunities not present in the current plan.
We want LRs not equal and increasing monotonically.
Briefly explain the Gini Index measure for model selection
Quantifies ability to identify best and worst risks
Gini index = 2 × area between the Lorenz curve and the line of equality
Higher value = better
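An illustrative numpy sketch of the Gini index: order records from best to worst predicted risk, build the Lorenz curve of cumulative losses vs. cumulative exposures, and take twice the area between it and the line of equality.

```python
import numpy as np

def gini_index(predicted, actual, exposure):
    """Gini index from numpy arrays of predicted loss cost, actual losses, exposure."""
    order = np.argsort(predicted)                     # best predicted risks first
    cum_exposure = np.append(0.0, np.cumsum(exposure[order]) / exposure.sum())
    cum_losses = np.append(0.0, np.cumsum(actual[order]) / actual.sum())
    area_under_lorenz = np.trapz(cum_losses, cum_exposure)
    return 2 * (0.5 - area_under_lorenz)              # higher = better segmentation
```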
How will changing the discrimination threshold impact the number of TP, FP, FN and TN in a logistic model
Decreasing discrimination threshold = more true positives and more false positives since more will be investigated
Define sensitivity
ratio of true positives to total event occurrences
also called true positive rate or hit rate
Define specificity
ratio of true negatives to total event non-occurrences
false positive rate = 1 - specificity
Describe ROC curve
All possible combinations of sensitivity and 1-specificity for different discrimination thresholds
Helps determine a target threshold (lower for large risks since we want to spend more time investigating them)
AUROC is area under ROC curve
The higher the AUROC, the better
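A minimal Python sketch that traces ROC points (sensitivity vs. 1 - specificity) across candidate discrimination thresholds; inputs and names are illustrative.

```python
import numpy as np

def roc_points(scores, labels, thresholds):
    """Sensitivity and (1 - specificity) at each discrimination threshold."""
    points = []
    for t in thresholds:
        flagged = scores >= t
        tp = np.sum(flagged & (labels == 1))
        fp = np.sum(flagged & (labels == 0))
        sensitivity = tp / np.sum(labels == 1)   # true positive rate (hit rate)
        fpr = fp / np.sum(labels == 0)           # 1 - specificity
        points.append((fpr, sensitivity))
    return points  # plot to trace the ROC curve; AUROC is the area under it
```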
List 3 purposes of model documentation
- Opportunity to check your work for errors and improve communication skills
- Transfer knowledge to others that maintain or rebuild model
- Comply with stakeholder demands (ASOP41)
Identify 4 items to include in model documentation
- Everything needed to reproduce model (from source data to model output)
- All assumptions and justifications for all decisions
- All data issues encountered and resolution
- Any reliance on external models or external stakeholders
- Model performance, structure and shortcomings
- Compliance with ASOP41 or local actuarial standards on communication
Why should coverage-related variables be priced outside of GLM and included in offset terms
Examples: deductibles, limits, covered perils
Can give counter-intuitive results in GLM such as indicating lower rate for more coverage.
Could be due to correlation with other variables outside of model, including possible selection effects (insured self-selecting higher limits since they know they are higher risks)
Describe how territories can be priced in conjunction with GLMs
Challenging due to their large number and aggregating them may cause you to lose important information.
Relativities developed with techniques like spatial smoothing can be included in the GLM as offset terms.
Territory model should also be offset for rest of classification plan = iterative process until each model converges to acceptable range
Discuss how ensemble models can improve performance of single model
Instead of choosing a single model from 2 or more, models can be combined into an ensemble of models (ex: avg of predictions = balances predictions)
Only works when model errors are as uncorrelated as possible, which happens when built by different people with little or no sharing.
Define intrinsic aliasing
Aliasing that arises from dependencies inherent in the definition of the covariates, e.g. when an indicator covariate is created for every level of a categorical variable (the indicators sum to 1, so one level must be dropped as the base)
Provide 2 arguments against the inclusion of deductible as predictor in GLM analysis.
- Coverage variables in GLMs can give counter-intuitive results, such as indicating lower rate for more coverage.
- Charging rates for coverage options that reflect anything other than pure loss elimination could lead to changes in insured behaviour, which means indicated rates based on past experience will no longer be appropriate for new policies.
A variable with second-order polynomial adds how many degrees of freedom?
2
Describe how the exclusion of missing data may cause problems for the company in developing the model, and suggest a solution.
Missing data can lead to extrinsic aliasing.
This occurs when there are linear dependencies in the observed data because of the nature of the data.
Ex: the missing level for prior auto policy will be perfectly correlated with the missing level for home policy.
This can lead to convergence problems or confusing results
A solution would be to exclude these missing data records, or to reclassify them to an appropriate level
How do you consider multiple offsets
They can simply be added together into a total offset
List 5 shortcomings of GLMs and propose a solution for each.
- By default, predictions are based on linear relationship to predictors.
Solution: GAMs or MARS
- Unstable when data is thin or when predictors are highly correlated.
Solution: GLMMs (thin data), Elastic Net GLMs (correlated predictors)
- Full credibility given to data for each coefficient regardless of thinness of data.
Solution: GLMMs, Elastic Net GLMs
- Random component of outcome is assumed to be uncorrelated among observations.
Solution: GLMMs
- Scale parameter must be constant across all observations.
Solution: DGLMs
Name a disadvantage of Neural Nets to address GLM shortcomings.
Make model harder to interpret and explain.
Name 5 GLM variations that can address shortcomings while maintaining interpretability of model.
- Generalized Linear Mixed Models (GLMMs)
- Double Generalized Linear Models (DGLMs) aka GLMs with Dispersion Modeling
- Generalized Additive Models (GAMs)
- Multivariate Adaptive Regression Splines (MARS)
- Elastic Net GLMs
Briefly describe Generalized Linear Mixed Models (GLMMs)
GLMM allows some coefficients to be modelled as random variables themselves.
Predictor variables are split into random and fixed effects.
All levels of categorical random effect variables are included as predictors (no base level)
g(u) = b0 + b1x1 + … + lambda1z1 + lambda2z2 + …
Explain the shrinkage effect in GLMMs
Resulting estimate of each lambda will be closer to the grand mean prediction compared to full credibility estimates from a GLM.
The less data, the closer to the grand mean.
Explain the 2 steps to fit GLMM
- betas and dispersion parameters are estimated, lambda distribution is assumed.
- Lambda parameters are estimated.
Name another utility of GLMMs
Can also be used to account for distortion in GLM results from correlation among observations in data.
Ex: if data had multiple renewals of same policy, we could add policy ID as random effect.
Briefly describe Double-Generalized Linear Models (aka GLMs with Dispersion Modeling)
DGLMs allow the scale parameter phi to vary by observation, with phi_i modelled as a linear combination of predictors.
This gives less weight to volatile data and more weight to stable data (ignore more noise and pick up more signal)
Describe 3 cases where DGLMs are particularly useful
- Some classes of business are more volatile than others
- You care about accurately predicting the variance (full distribution) of each observation, not just the mean
- When using the Tweedie distribution to model PP or loss ratios, DGLMs provide flexibility for underlying frequency and severity to move in opposite directions
How can we estimate DGLMs?
Can be estimated using statistical software.
If the target distribution is Tweedie, regular GLM software can be used through an iterative process.
Provide a downside of GAMs.
GAMs do not provide a simple coefficient for a predictor; its relationship to the target variable must be examined graphically.
Describe Generalized Additive Models (GAMs)
GAMs allow for non-linear effects in the model.
g(u) = b0 + f1(xi1) + f2(xi2) + … + fp(xip)
Shape of f( ) would be determined by modeling software.
Be careful: allowing too much flexibility will risk overfitting.
Briefly explain MARS models
MARS allow us to incorporate non-linearity by using piecewise linear (hinge) functions.
The model will choose functions and cut points automatically.
Be careful: allowing too much flexibility will risk overfitting.
2 useful features of MARS
- Unlike GLM, MARS performs its own variable selection (will only keep those that are significant)
- Can be used to find non-linear transformations or interactions
Briefly describe Elastic Net GLMs
They are specified the same as regular GLMs, but when fitting the model, a penalty based on the magnitude of the coefficients is added to the deviance (which we try to minimize).
Provide a powerful means against overfitting even in presence of many predictors.
Deviance + lambda(a * sum of abs(b) + (1-a) * sum of b^2)
b are coefficients excl. intercept b0
lambda is the tuning parameter controlling the severity of the penalty. As lambda increases, the penalty becomes more severe and coefficients shrink toward 0.
a is the weight parameter.
sum of abs(b) is penalty in lasso models
sum of b^2 is penalty in ridge models
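A small Python sketch of the elastic net penalty term above; lambda and a are the tuning and weight parameters, and the coefficient vector excludes the intercept.

```python
import numpy as np

def elastic_net_penalty(betas, lam, a):
    """Penalty added to the deviance: lambda * (a * L1 + (1 - a) * L2),
    where betas excludes the intercept."""
    l1 = np.sum(np.abs(betas))     # lasso component
    l2 = np.sum(betas ** 2)        # ridge component
    return lam * (a * l1 + (1 - a) * l2)

# a = 1 gives a lasso penalty, a = 0 gives a ridge penalty; a larger lambda
# shrinks coefficients more strongly toward zero.
```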
3 advantages and 1 disadvantage of Elastic Net GLMs
Advantages:
1. Equivalent to using credibility weighting for coefficients
2. Can perform automatic variable selection
3. Perform better than GLMs when predictors highly correlated
Disadvantage:
Computational complexity, especially with large datasets.