A2. Generalized Linear Models for Insurance Rating Flashcards
Problems with one-way analysis rating
- Can be distorted by correlations between rating variables
- youth are more concentrated in some areas => territory and age are correlated
- Does not consider interdependencies/interactions between rating variables
- youth+sport is extra risky, but elderly+sport is extra careful => sport interacts with age
Problems with minimum bias rating
- Lack of statistical framework to assess quality of model:
- 1 cannot test significance of a variable
- 2 no credibility ranges for parameters (GLMs can provide confidence intervals)
- Iterative calculations are computationally inefficient
Assumptions in classical linear/additive models
- all observations are independent
- observations are normally distributed
- each risk group has constant variance
- effects are additive: mean is a linear combination of covariates
2 limitations of classical linear/additive models
- difficult to assert normality and constant variance of response variables
- if Y>0 => then not normal
- if Y>0 and E(Y) tends to 0 => then Var(Y) tends to 0 (not constant)
- mean is not always a linear combination of covariates
- many insurance risks tend to vary multiplicatively with rating variables
- additive assumptions are not realistic for insurance applications
Assumptions in GLMs
- all observations are independent
- observations distribution is from the exponential family
- link function is differentiable and monotonic
- effects may be non-linear: the mean is the inverse of the link function applied to a linear combination of the covariates
So GLMs are no longer tied to:
- NORMALITY Assumption
- CONSTANT VARIANCE Assumption
- ADDITIVITY OF EFFECTS Assumption
Pro: adjusts for correlations, with less restrictive assumptions than classical linear models
Con: often difficult to explain results
How to transform an additive model into a multiplicative rating plan
Use a log link function
Advantages of GLMs with log link rating plan as compared to additive model
- simple and practical to implement
- guarantee positive premiums
- multiplicative impact of risk characteristic more intuitive
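A minimal sketch (not from the source text), using a tiny hypothetical data frame: a Poisson GLM with a log link turns additive effects on the linear predictor into a multiplicative rating plan, so the exponentiated coefficients can be read directly as relativities.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical policy-level data
policies = pd.DataFrame({
    "claim_count": [0, 1, 0, 2, 1, 0],
    "exposure":    [1.0, 0.5, 1.0, 1.0, 0.8, 1.0],
    "age_group":   ["youth", "adult", "adult", "youth", "senior", "senior"],
    "territory":   ["A", "A", "B", "B", "A", "B"],
})

model = smf.glm(
    "claim_count ~ C(age_group) + C(territory)",
    data=policies,
    family=sm.families.Poisson(),            # log link is the Poisson default
    offset=np.log(policies["exposure"]),     # exposure enters as a log offset
).fit()

# exp(coefficient) = multiplicative relativity vs. the base level
# (the exponentiated intercept is the base rate)
print(np.exp(model.params))
```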
Advantages/disadvantages of modeling frequency and severity separately
Advantages:
-more insight/intuition about the impact of each predictor
- more stability of both models since a predictor affecting only frequency may be diluted in a pure premium model
- less overfitting since a predictor affecting only frequency may also pick up the noise of severity in a pure premium model
- frequency may not move in the same direction as severity, but moving together is an implicit assumption of Tweedie/pure premium models
Disadvantages:
- more detailed data required => may not be available
- need to build 2 models => more time
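A minimal sketch of the separate approach, assuming hypothetical policy-level (`policies`) and claim-level (`claims`) frames with the named columns: frequency as a Poisson GLM with an exposure offset, severity as a Gamma GLM on claim amounts, both with log links so the two sets of relativities multiply into a pure premium.

```python
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Frequency model: one row per policy (hypothetical data frame `policies`)
freq_model = smf.glm(
    "claim_count ~ C(age_group) + C(territory)",
    data=policies,
    family=sm.families.Poisson(),
    offset=np.log(policies["exposure"]),
).fit()

# Severity model: one row per claim (hypothetical data frame `claims`)
sev_model = smf.glm(
    "claim_amount ~ C(age_group) + C(territory)",
    data=claims,
    family=sm.families.Gamma(link=sm.families.links.Log()),
).fit()

# With identical formulas the parameter indexes line up, so the
# pure premium relativity is the product of the two relativities
pure_premium_rel = np.exp(freq_model.params) * np.exp(sev_model.params)
```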
Why coverage related variables (deductible, limit) should be first priced outside of GLMs
- Violation of Tweedie assumption: frequency and severity move in opposite directions for those variables. If the deductible is higher => frequency decreases, severity increases
- Counterintuitive results: may indicate a lower rate for higher coverage
- 1 if there are correlations with other variables outside the model
- 2 if there is adverse selection from insureds self-selecting higher deductibles because they know they have higher loss potential and want to reduce the premium
- 3 if underwriters forcing high risk insureds to select higher deductibles
Deductible relativities should be determined based purely on loss elimination, outside of the GLM. They are then included in the GLM as an offset in the log link function (+ ln(relativity))
relativity = factor of level y / factor of base level
factor = 1 - LER (not the LER itself!)
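A small numeric sketch with assumed LER values (not from the source): the factor is 1 - LER, the relativity rebases each factor to the base deductible, and the log of the relativity is what enters the GLM as an offset.

```python
import numpy as np

ler = {"500": 0.10, "1000": 0.18, "2500": 0.30}   # hypothetical loss elimination ratios
base = "500"

factor = {ded: 1.0 - x for ded, x in ler.items()}                 # factor = 1 - LER, not LER
relativity = {ded: f / factor[base] for ded, f in factor.items()} # rebase to the base deductible
offset_term = {ded: np.log(r) for ded, r in relativity.items()}   # + ln(relativity) in the linear predictor

print(relativity)   # {'500': 1.0, '1000': 0.911..., '2500': 0.777...}
```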
What happens if coverage related variables (deductible, limit) are priced within GLMs
- rates will reflect things other than pure loss elimination
- insureds will change their behavior
- therefore rate based on past experience (and past behaviors) will no longer be predictive of new policies
Why territories should be priced outside of GLMs
- may be a large number of territories
- but aggregating the territories into a smaller number of groups may cause loss of information
How territories should be priced
Step 1: Estimate territory relativities using spatial smoothing, including the rest of the classification plan as an offset
Step 2: Estimate the rest of the classification plan using a GLM, including the territory relativities as an offset
Iterate steps 1 and 2 until both converge
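A structural sketch only, reusing the hypothetical `policies` frame from the earlier sketch; a crude actual-to-expected relativity stands in for true spatial smoothing. The point is how each step feeds the other through the log offset until the relativities stabilize.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

terr_rel = pd.Series(1.0, index=policies["territory"].unique())   # start all relativities at 1

for _ in range(20):                                               # iterate steps 1 and 2
    # Step 2: classification plan GLM with territory relativities in the offset
    log_offset = np.log(policies["exposure"]) + np.log(terr_rel[policies["territory"]].values)
    class_model = smf.glm("claim_count ~ C(age_group)", data=policies,
                          family=sm.families.Poisson(), offset=log_offset).fit()

    # Step 1 (crude stand-in for spatial smoothing): territory relativity = actual / expected,
    # where "expected" reflects the classification plan only
    expected = class_model.predict(policies, offset=np.log(policies["exposure"]))
    expected = pd.Series(np.asarray(expected), index=policies.index)
    new_rel = (policies.groupby("territory")["claim_count"].sum()
               / expected.groupby(policies["territory"]).sum())

    if np.allclose(new_rel.values, terr_rel[new_rel.index].values, atol=1e-6):
        break                                                     # both steps have converged
    terr_rel = new_rel
```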
Impact of choosing a level with fewer observations as the base level of a categorical variable
- higher standard errors and p-values => wider confidence intervals around the estimated coefficients
- but the predicted relativities will be the same (rebased to the chosen base level)
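A minimal sketch, reusing the hypothetical `policies` frame: the base level is set explicitly through the Treatment contrast, and the coefficient standard errors can be compared under a well-populated versus a thinly populated base.

```python
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Fit the same model twice, changing only the reference (base) level
fit_adult  = smf.glm("claim_count ~ C(age_group, Treatment(reference='adult'))",
                     data=policies, family=sm.families.Poisson()).fit()
fit_senior = smf.glm("claim_count ~ C(age_group, Treatment(reference='senior'))",
                     data=policies, family=sm.families.Poisson()).fit()

# Compare standard errors: a sparsely populated base level inflates them,
# but the relativities rebased to either level tell the same story
print(fit_adult.bse)
print(fit_senior.bse)
```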
When using a log link function, why continuous predictor variables should usually be logged and exceptions
Reason:
- logging the predictor allows more flexibility in fitting different curve shapes to the data (if not logged => only allows for exponential growth)
Exceptions:
- year variable (used to pick up trend effects)
- variable containing values of 0 (ln(0) is undefined, unless 1 is added to all observations before taking the log)
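A minimal sketch with hypothetical column names (`vehicle_value`, `policy_year`, `prior_claims`): continuous predictors are logged inside the formula, the year variable is left unlogged to pick up trend, and a variable containing zeros is shifted by 1 before logging.

```python
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

model = smf.glm(
    "claim_count ~ np.log(vehicle_value)"    # logged continuous predictor
    " + policy_year"                         # year/trend variable: not logged
    " + np.log(prior_claims + 1)",           # contains zeros: add 1 before logging
    data=policies,                           # hypothetical policy-level frame
    family=sm.families.Poisson(),
    offset=np.log(policies["exposure"]),
).fit()
```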
3 cautions when plotting actual vs predicted values for model selection
- use holdout data to prevent overfit
- aggregate data before plotting based on percentiles of predicted values
- take the log of all values before plotting to prevent large values from skewing the picture
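A minimal sketch, assuming a hypothetical `holdout_data` frame that already carries `actual` and `predicted` columns: the records are bucketed by percentiles of the predicted value and the bucket means are plotted on a log scale.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = holdout_data.copy()                                   # holdout data, not training data
df["bucket"] = pd.qcut(df["predicted"], q=100, duplicates="drop")   # percentile buckets
agg = df.groupby("bucket", observed=True)[["actual", "predicted"]].mean()

plt.plot(np.log(agg["predicted"].values), label="predicted (log)")
plt.plot(np.log(agg["actual"].values), label="actual (log)")
plt.xlabel("percentile bucket of predicted value")
plt.legend()
plt.show()
```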
GLM outputs for each estimated coefficient
- standard error:
- Definition: estimated standard deviation of the random process that produced the coefficient estimate
- Use: p-value and confidence interval
- Limitation: based on the Cramer-Rao lower bound => could be understated
- p-value:
- Definition: probability that an estimated coefficient of at least that magnitude arises by pure chance, given that the true coefficient is 0
- Focus: variable significance
- if the p-value is small, the variable should be included in the model
- Limitation: does not give the probability that the true coefficient is 0
- more observations = smaller p-value
- smaller dispersion parameter = smaller p-value
- confidence interval:
- Definition: range of estimates that would not be rejected given a selected threshold for the p-value
- If the interval is very narrow, the variable should be added to the model
- Focus: variable significance
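A minimal sketch, assuming a fitted statsmodels GLM result `model` as in the earlier sketches: all three outputs discussed above are available directly from the results object.

```python
import pandas as pd

summary = pd.DataFrame({
    "coefficient": model.params,
    "std_error":   model.bse,        # estimated std. dev. of the coefficient estimator
    "p_value":     model.pvalues,    # significance vs. a true coefficient of 0
})
ci = model.conf_int(alpha=0.05)      # 95% confidence interval
summary["ci_low"] = ci[0]
summary["ci_high"] = ci[1]
print(summary)
```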
Problem and options for GLMs with highly correlated variables
Problems:
- unstable model
- erratic coefficients
- high standard errors
Option 1: remove all highly correlated variables except one
- this eliminates the high correlation
- disadvantage: potentially loses some unique information in the eliminated variables
Option 2: use dimensionality-reduction techniques (principal components analysis)
- creates a new, smaller set of uncorrelated variables from the correlated variables that captures most of their variance
- allows the original highly correlated variables to be removed, resulting in a simpler model
- use this set of uncorrelated variables in the GLM
- disadvantage: additional time required
- well suited for developing individual, aggregate variables that summarize the signal in many correlated variables
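A minimal sketch of Option 2, assuming a hypothetical `X_corr` frame holding only the highly correlated continuous predictors (a subset of the columns of `policies`): PCA replaces them with a smaller set of uncorrelated components that can then enter the GLM.

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(X_corr)      # PCA is sensitive to scale
pca = PCA(n_components=2)                              # keep the top components
components = pd.DataFrame(pca.fit_transform(X_scaled),
                          columns=["pc1", "pc2"],
                          index=X_corr.index)

# pc1/pc2 are uncorrelated by construction and replace the original variables
policies_reduced = policies.drop(columns=X_corr.columns).join(components)
```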
Define multicollinearity and give a way to detect it
Definition:
- two or more predictors are strongly predictive of a third predictor => near-perfect linear dependency among 3 or more predictors
Problem:
- erratic coefficients
- unstable model
- model may not even converge
Detection: use the variance inflation factor (VIF)
- VIF measures how much the squared standard error for a predictor is increased due to the presence of collinearity with the other predictors; it is determined by running a linear model for each predictor using all the other predictors as inputs
- if VIF > 10 => the variable has multicollinearity
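A minimal sketch, assuming a hypothetical `predictors` data frame of continuous predictors: statsmodels' variance_inflation_factor computes the VIF for each column of the design matrix.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X = sm.add_constant(predictors)                 # add an intercept column
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif[vif > 10])                            # predictors flagged for severe multicollinearity
```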
Define aliasing and its solutions
Definition:
- perfect linear dependency among predictor variables
Problem:
-model will never converge
Solutions:
- Manually: remove the aliased records or reclassify them into another factor level
- GLM software: automatically removes one of the aliased variables
Types of aliasing
- Intrinsic aliasing:
- perfect dependency between 2 predictors is inherent to the definition of the variables
- ex 1: if the model includes all levels of a categorical variable, last = 1 - sum(others)
- ex 2: using both age and birth date
- Extrinsic aliasing:
- perfect dependency between 2 predictors arises from the nature of the data
- ex: all red cars in the data happen to be 2-door sedans AND vice versa
- Near-aliasing (similar to multicollinearity):
- almost perfect dependency between 2 or more predictors
- ex: all red cars in the data happen to be 2-door sedans (but not vice versa)
- convergence problem may occur
Deviance residual
- the amount that a given observation contributes to the overall deviance
- in a well-fit model, deviance residuals will:
- follow no predictable pattern
- be normally distributed
- have constant variance
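A minimal sketch, assuming a fitted statsmodels GLM result `model`: the deviance residuals can be checked against the three expectations above with a residual-vs-fitted plot and a normal Q-Q plot.

```python
import matplotlib.pyplot as plt
import statsmodels.api as sm

resid = model.resid_deviance
plt.scatter(model.fittedvalues, resid, s=8)   # should show no predictable pattern, constant spread
plt.xlabel("fitted value")
plt.ylabel("deviance residual")
plt.show()

sm.qqplot(resid, line="45")                   # should look roughly normal in a well-fit model
plt.show()
```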
Possible transformations after reviewing partial residual graph
- Binning into a categorical variable with separate bins
- each bin gets its own coefficient, helping the fit follow the pattern in the residuals
- Disadvantages:
- increases degrees of freedom
- can result in inconsistent/impractical patterns
- ignores variation within bins
- Adding polynomial terms
- Disadvantage: loss of interpretability without a graph
- Adding piecewise linear/hinge functions (see the sketch below)
- allows the fit to track the different slopes in the residuals
- Disadvantage: breakpoints must be chosen manually (judgmental)
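A minimal sketch with a hypothetical `driver_age` column on the `policies` frame: a hinge term lets the slope of a continuous predictor change at a manually chosen breakpoint, which is the judgmental step noted above.

```python
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hinge at age 45: zero below the breakpoint, (age - 45) above it
policies["age_hinge_45"] = np.maximum(policies["driver_age"] - 45, 0)

model = smf.glm(
    "claim_count ~ driver_age + age_hinge_45",   # slope changes by the hinge coefficient above 45
    data=policies,
    family=sm.families.Poisson(),
    offset=np.log(policies["exposure"]),
).fit()
```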
3 options for measuring model stability
- Cook’s distance
- measures the influence of an individual record on the model
- check whether records with the highest Cook's distance should be excluded (higher distance = more influence the record has on the model)
- Cross-validation
- create sub-datasets with fewer records by sampling without replacement
- split the data into k parts, run the model on k-1 parts, then validate the result on the last part
- check whether the parameters whose estimates vary the most across different model runs should be excluded
- superior since more data is used to train and test
- extremely time-consuming
- less common in insurance since variables are often hand-picked
- Bootstrapping (see the sketch below)
- create new datasets with the same number of records by sampling with replacement
- run the model on each sampled dataset
- check parameter means and variances after refitting the model on the many new datasets
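A minimal sketch of the bootstrapping option, reusing the hypothetical `policies` frame and formula from the earlier sketches (and assuming every factor level appears in each resample so the parameter names line up):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

boot_params = []
for seed in range(200):                                        # number of bootstrap samples
    sample = policies.sample(frac=1.0, replace=True, random_state=seed)  # same size, with replacement
    fit = smf.glm("claim_count ~ C(age_group) + C(territory)",
                  data=sample, family=sm.families.Poisson(),
                  offset=np.log(sample["exposure"])).fit()
    boot_params.append(fit.params)

boot = pd.DataFrame(boot_params)
print(boot.mean())    # bootstrap mean of each coefficient
print(boot.std())     # coefficients with a large spread are unstable

# For the Cook's distance option, per-record values are available via
# fit.get_influence().cooks_distance[0]
```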
2 measures used in diagnostic tests of overall model
- Log-likelihood
- Definition: log of the product of the likelihood for all observations
- Lower bound: log-likelihood of null model (no predictor)
- Upper bound: log-likelihood of saturated model (1 predictor for each observation)
- Deviance
- Definition: generalized form of the SSE
- Lower bound: 0 (saturated model)
- Upper bound: deviance of the null model (no predictors), which represents the total deviance inherent in the data
Can test between two "nested" models to see if including the additional factor improves the model enough given the extra parameters it adds to the model (see the sketch below)
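A minimal sketch of a nested-model (likelihood ratio) test, reusing the hypothetical `policies` frame: the drop in deviance from adding the factor is compared to a chi-squared distribution with the extra degrees of freedom.

```python
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy import stats

small = smf.glm("claim_count ~ C(age_group)", data=policies,
                family=sm.families.Poisson(),
                offset=np.log(policies["exposure"])).fit()
large = smf.glm("claim_count ~ C(age_group) + C(territory)", data=policies,
                family=sm.families.Poisson(),
                offset=np.log(policies["exposure"])).fit()

lr_stat = 2 * (large.llf - small.llf)          # equals small.deviance - large.deviance here
extra_df = large.df_model - small.df_model     # extra parameters added by the factor
p_value = stats.chi2.sf(lr_stat, extra_df)     # small p-value favors keeping the factor
print(lr_stat, p_value)
```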