A2. GLMs Flashcards
Var(Y_i)
Formula
ΦV(μ_i) / ω_i
* Φ ~ Dispersion parameter
* V(μ_i) ~ Variance Function
* ω_i ~ Weights
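As a quick check on the formula above, here is a minimal sketch in Python. The function name `glm_variance` and all numbers are illustrative assumptions, not from the card.

```python
def glm_variance(phi, variance_fn, mu, weight=1.0):
    """Var(Y_i) = phi * V(mu_i) / w_i."""
    return phi * variance_fn(mu) / weight


def gamma_v(mu):
    """Gamma variance function, V(mu) = mu^2."""
    return mu ** 2


# Assumed values: dispersion 0.5, mean 1000, weight 4 (e.g. 4 claims)
var_y = glm_variance(phi=0.5, variance_fn=gamma_v, mu=1000.0, weight=4.0)

# Doubling the weight halves the variance of the observation
var_y_double_weight = glm_variance(0.5, gamma_v, 1000.0, weight=8.0)
```

A larger weight means the record is treated as less noisy, so it has more influence on the fit.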
Variance Function V(μ)
Normal, Poisson, Gamma, Inverse Gaussian, Tweedie + common uses
Normal: V(μ) = 1 (rarely used, since insurance data is generally not normally distributed)
Poisson: V(μ) = μ (best for freq models given its discrete nature)
Gamma: V(μ) = μ^2 (severity models, right skewed/tailed)
Inverse Gaussian: V(μ) = μ^3 (severity model, better for more skewed)
Tweedie: V(μ) = μ^p (pure prem models, assumes freq/sev move in same direction)
Variance Function V(μ)
Binomial, Negative Binomial + common uses
Binomial: V(μ) = μ(1-μ) (logistic regression models)
Negative Binomial: V(μ) = μ(1+Kμ) (frequency models)
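The variance functions on these two cards can be compared side by side with a small Python sketch. The Tweedie power `p = 1.5` and the negative binomial parameter `k = 0.2` are assumed values chosen only for illustration.

```python
TWEEDIE_P = 1.5   # assumed Tweedie power, between 1 and 2
NB_K = 0.2        # assumed negative binomial overdispersion parameter K

variance_functions = {
    "normal": lambda mu: 1.0,
    "poisson": lambda mu: mu,
    "gamma": lambda mu: mu ** 2,
    "inverse_gaussian": lambda mu: mu ** 3,
    "tweedie": lambda mu, p=TWEEDIE_P: mu ** p,
    "binomial": lambda mu: mu * (1 - mu),
    "negative_binomial": lambda mu, k=NB_K: mu * (1 + k * mu),
}

# Evaluate at mean 4 (binomial excluded: its mean must lie between 0 and 1)
at_mean_4 = {
    name: v(4.0) for name, v in variance_functions.items() if name != "binomial"
}
binomial_at_quarter = variance_functions["binomial"](0.25)
```

At the same mean, the Inverse Gaussian variance grows fastest, which matches its use for more heavily skewed severity.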
Deviance / F-Test
Formula for scaled and unscaled, F-Test formula, df, reject/accept Ho
Scaled D’ = 2(log-like saturated model - log-like model)
Unscaled D = ΦD’
F statistic = (D_s - D_b) / (# of added parameters × Φ_b), using UNSCALED deviances (s = smaller model, b = bigger model)
df1 (columns) = # of added parameters
df2 (rows) = n - p_b
Reject Ho (use bigger model) if F-stat > table value
Fail to reject Ho (keep smaller model) otherwise
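The F-test steps above can be sketched in Python as follows. All deviances, the dispersion, and the sample size are made-up numbers. The table value 3.1 is only an approximate 5% critical value for F(2, 94) and should be looked up properly in practice.

```python
def f_statistic(dev_small, dev_big, added_params, phi_big):
    """F = (D_s - D_b) / (added parameters * phi_b), with unscaled deviances."""
    return (dev_small - dev_big) / (added_params * phi_big)


# Hypothetical inputs
n = 100            # number of observations
p_big = 6          # parameters in the bigger model
added_params = 2   # parameters added relative to the smaller model

f_stat = f_statistic(dev_small=500.0, dev_big=480.0,
                     added_params=added_params, phi_big=4.0)

df1 = added_params   # numerator degrees of freedom (table columns)
df2 = n - p_big      # denominator degrees of freedom (table rows)

table_value = 3.1    # assumed approximate 5% critical value for F(df1, df2)
use_bigger_model = f_stat > table_value
```

Here the F statistic is 2.5, below the assumed table value, so the smaller model is kept.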
AIC / BIC
Formula, which is more reasonable in insurance
AIC = -2*log-like + 2p
BIC = -2*log-like + p*ln(n)
AIC is more reasonable: n is very large in typical insurance datasets, so BIC's p*ln(n) penalty becomes too heavy and favors overly simple models
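A short Python sketch of the two formulas, with an assumed dataset size to show how the per-parameter penalties compare. The function names and numbers are illustrative.

```python
import math


def aic(loglik, p):
    """AIC = -2 * log-likelihood + 2p."""
    return -2.0 * loglik + 2.0 * p


def bic(loglik, p, n):
    """BIC = -2 * log-likelihood + p * ln(n)."""
    return -2.0 * loglik + p * math.log(n)


# Penalty per added parameter for an assumed dataset of one million records
n = 1_000_000
aic_penalty_per_param = 2.0
bic_penalty_per_param = math.log(n)   # about 13.8
```

With a million records, each extra parameter costs roughly seven times more under BIC than under AIC.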
Offset Term
What is it, when do you add it to model, examples
Offset term allows you to incorporate pre-determined values for variables in your model (ex. deductible, policy term, etc)
Add offsets BEFORE running the GLM so that all estimated coefficients (B0, B1, etc) for other predictors are optimal in the presence of the offset
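To illustrate how an offset enters the prediction, here is a minimal Python sketch. It assumes a log link, which the card does not state, so the offset is the log of the pre-determined factor. The coefficients and the 0.90 deductible relativity are hypothetical.

```python
import math


def predict_log_link(beta0, beta1, x, offset=0.0):
    """mu = exp(beta0 + beta1 * x + offset); the offset's coefficient is fixed at 1."""
    return math.exp(beta0 + beta1 * x + offset)


# Hypothetical pre-determined deductible relativity from a separate analysis
deductible_relativity = 0.90
offset = math.log(deductible_relativity)

mu_without = predict_log_link(beta0=5.0, beta1=0.1, x=2.0)
mu_with = predict_log_link(beta0=5.0, beta1=0.1, x=2.0, offset=offset)

# Under a log link the offset scales the prediction by the fixed relativity
ratio = mu_with / mu_without
```

Because the offset is not estimated, the GLM fits the remaining coefficients around it.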
GLM Limitations
GLMs give full credibility
* The estimated coefficients are not credibility weighted to recognize low volumes of data or high volatility. This can be partially addressed by looking at p-values and standard errors
GLMs assume that the randomness of outcomes is uncorrelated, which may not be true in practice
* Weather events can cause similar outcomes for risks in the same area
* In a dataset spanning several renewals, the same insured appears more than once and will have correlated/similar outcomes
Correlation Among Predictor Variables
What happens to the model? How to Check? Solutions?
What could happen:
* Model may not converge
* Unstable model, unstable coefficients w/ high standard errors
How to Check
* Variance Inflation Factor (VIF) > 10
Solutions
* Remove all highly correlated variables except 1
* Use principal components analysis or factor analysis to create a new subset of variables to use in the GLM
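The VIF check can be sketched in plain Python for the simplest case of two predictors, where VIF = 1 / (1 - R^2) and R^2 is the squared correlation between them. With more predictors, R^2 would come from regressing one predictor on all the others. The data below is made up.

```python
def pearson_r(xs, ys):
    """Sample Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def vif_two_predictors(xs, ys):
    """VIF = 1 / (1 - R^2); with two predictors R^2 is the squared correlation."""
    r = pearson_r(xs, ys)
    return 1.0 / (1.0 - r * r)


x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [1.1, 1.9, 3.2, 3.9, 5.1]   # nearly a copy of x1
x3 = [2.0, 1.0, 3.0, 1.0, 2.0]   # uncorrelated with x1

vif_correlated = vif_two_predictors(x1, x2)
vif_uncorrelated = vif_two_predictors(x1, x3)

flag_correlated = vif_correlated > 10   # rule of thumb from the card
```

The near-duplicate pair is flagged by the VIF > 10 rule, while the uncorrelated pair has a VIF of about 1.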
Key Stakeholders of Predictive Modeling Project
- Regulators - need to check if variables are legal to use and this varies by state
- IT - consider IT limitations of project and the cost of programming changes
- Agents/UWs - these people sell the insurance, it is important for them to understand the new rating structure
Deductible or Limits in GLMs?
Yes or no, explain
Coverage related variables in GLMs may produce unintuitive results such as lower rates for lower deductibles. This could be due to correlations with other variables outside the model.
Instead, use a loss elimination ratio (LER) analysis or an increased limits factor (ILF) analysis and incorporate deductibles/limits as an offset term
Model Stability
What is it? How to check?
Stable model is not very sensitive to changes in the modeling data (add/remove a large loss)
Ways to Measure:
* The influence of an individual record can be measured using Cook’s distance. Records with a high Cook’s distance should be given extra thought as to whether they should be included in the dataset or not
* Cross Validation - compare parameter estimates across different model runs
* Bootstrapping - used to create new datasets from the original dataset by randomly sampling with replacement. Compare parameter estimates across different runs
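The bootstrapping idea can be sketched in Python using only the standard library. A sample mean stands in for a GLM parameter estimate here to keep the example short. The loss data, the number of runs, and the seed are all assumptions.

```python
import random
import statistics


def bootstrap_estimates(data, estimator, n_runs, seed=0):
    """Re-estimate on n_runs datasets drawn from data with replacement."""
    rng = random.Random(seed)
    return [
        estimator(rng.choices(data, k=len(data)))  # same size as the original
        for _ in range(n_runs)
    ]


# Hypothetical losses with one large loss
losses = [100, 120, 90, 110, 105, 95, 5000]

means = bootstrap_estimates(losses, statistics.mean, n_runs=200)

# Wide spread across runs signals an unstable estimate
spread = statistics.pstdev(means)
lowest, highest = min(means), max(means)
```

In a real project each run would refit the GLM and the coefficient estimates would be compared across runs.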
ROC Curve / Evaluation of Model
Sensitivity / Specificity / Discrimination Threshold / How to Plot
Sensitivity = True Positives / Total Actual Positives
Specificity = True Negatives / Total Actual Negatives
Discrimination Threshold = x
If predicted prob ≥ x → assign True otherwise False
Plotting ROC Curve:
* x-axis: 1 - specificity
* y-axis: sensitivity
* line of equality (0%,0%)→(100%,100%)
AUROC (area under the ROC curve): the higher the better; a model on the line of equality has AUROC = 0.5
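The definitions above can be worked through in a minimal Python sketch. The six actual outcomes and predicted probabilities are made up, and the function names are illustrative.

```python
def sensitivity_specificity(actual, probs, threshold):
    """Assign True when predicted prob >= threshold, then score against actuals."""
    pairs = list(zip(actual, probs))
    tp = sum(1 for a, p in pairs if a == 1 and p >= threshold)
    fn = sum(1 for a, p in pairs if a == 1 and p < threshold)
    tn = sum(1 for a, p in pairs if a == 0 and p < threshold)
    fp = sum(1 for a, p in pairs if a == 0 and p >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def roc_points(actual, probs):
    """(1 - specificity, sensitivity) for each threshold, from strictest down."""
    thresholds = [float("inf")] + sorted(set(probs), reverse=True)
    points = []
    for t in thresholds:
        sens, spec = sensitivity_specificity(actual, probs, t)
        points.append((1.0 - spec, sens))
    return points


def auroc(points):
    """Area under the ROC curve by the trapezoid rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


actual = [1, 1, 0, 1, 0, 0]
probs = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

sens, spec = sensitivity_specificity(actual, probs, threshold=0.5)
area = auroc(roc_points(actual, probs))   # 8/9 for this data
```

At a threshold of 0.5 this data gives sensitivity 1 and specificity 2/3. Lowering the threshold trades specificity for sensitivity, which is what the curve traces out.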
Pro/Con of Modeling Frequency & Severity Separately
Advantage
* More insight and intuition about the impact of each predictor variable
* Tweedie distribution (most common distribution for modeling pure premium) assumes both frequency and severity move in the same direction, but this is often unrealistic
* Modeling pure premium can lead to overfitting if a predictor variable only impacts frequency or severity but not both
* Frequency and severity modeled separately are each more stable than a combined pure premium model
Disadvantage
* Takes more time
* Claim level data may not be available to model frequency and severity separately