A2. Generalized Linear Models for Insurance Rating Flashcards
Problems with one-way analysis rating
- Can be distorted by correlations between rating variables
- e.g., youth are more concentrated in some areas => territory and age are correlated
- Does not consider interdependencies/interactions between rating variables
- e.g., youth+sport is extra risky, but elderly+sport is extra careful => sport interacts with age
Problems with minimum bias rating
- Lack of statistical framework to assess the quality of the model:
- 1. cannot test the significance of a variable
- 2. no credibility ranges for parameters (a GLM can provide confidence intervals)
- Iterative calculations are computationally inefficient
Assumptions in classical linear/additive models
- all observations are independent
- observations are normally distributed
- each risk group has constant variance
- effects are additive: mean is a linear combination of covariates
2 limitations of classical linear/additive models
- difficult to justify normality and constant variance of response variables
- if Y > 0 => Y cannot be truly normal
- if Y > 0 and E(Y) tends to 0 => Var(Y) tends to 0 (not constant)
- mean is not always a linear combination of covariates
- many insurance risks tend to vary multiplicatively with rating variables
- additive assumptions are not realistic for insurance applications
Assumptions in GLMs
- all observations are independent
- observations come from a distribution in the exponential family
- link function is differentiable and monotonic
- effects may be non-linear: the link function of the mean is a linear combination of covariates, i.e. the mean is the inverse link function applied to that combination
So GLMs are no longer tied to:
- NORMALITY Assumption
- CONSTANT VARIANCE Assumption
- ADDITIVITY OF EFFECTS Assumption
pro: adjusts for correlations, with less restrictive assumptions than classical linear models
con: often difficult to explain results
How to transform an additive model into a multiplicative rating plan
log link function
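A minimal sketch of why the log link produces a multiplicative plan (symbols are generic, not from the source):

```latex
\ln(\mu) = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p
\quad\Longrightarrow\quad
\mu = e^{\beta_0}\, e^{\beta_1 x_1} \cdots e^{\beta_p x_p}
```

Each term e^{\beta_j x_j} acts as a multiplicative relativity for its rating variable (a factor per level when x_j is a 0/1 indicator).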
Advantages of a GLM with log link (multiplicative rating plan) compared to an additive model
- simple and practical to implement
- guarantee positive premiums
- multiplicative impact of risk characteristic more intuitive
Advantages/disadvantages of modeling frequency and severity separately
Advantages:
-more insight/intuition about the impact of each predictor
- more stability in both models, since a predictor affecting only frequency may be diluted in a pure premium model
- less overfitting, since a predictor affecting only frequency may also pick up the noise of severity in a pure premium model
- frequency and severity may not move in the same direction, yet the Tweedie/pure premium model implicitly assumes they do
Disadvantages:
- more detailed data required => may not be available
- need to build 2 models => more time
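A minimal sketch of fitting frequency and severity separately with statsmodels; the data frame `df` and its column names are hypothetical, not from the source:

```python
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical policy-level data: df has columns exposure, claim_count,
# avg_severity (total loss / claim_count), plus rating variables age and territory.

# Frequency: Poisson GLM with log link, exposure as the volume measure
freq_model = smf.glm(
    "claim_count ~ age + territory", data=df,
    family=sm.families.Poisson(),
    exposure=df["exposure"],
).fit()

# Severity: Gamma GLM with log link, fit on records with claims, weighted by claim count
has_claims = df["claim_count"] > 0
sev_model = smf.glm(
    "avg_severity ~ age + territory", data=df[has_claims],
    family=sm.families.Gamma(link=sm.families.links.Log()),
    var_weights=df.loc[has_claims, "claim_count"],
).fit()

# With log links, pure premium relativities are the product of the frequency
# and severity relativities: exp(freq coef) * exp(sev coef)
```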
Why coverage-related variables (deductible, limit) should first be priced outside of GLMs
- Violation of the Tweedie assumption: for these variables, frequency and severity move in opposite directions, e.g. a higher deductible => frequency decreases, severity increases
- Counterintuitive results: may indicate a lower rate for higher coverage
- 1. correlations with other variables outside the model
- 2. adverse selection: insureds self-select higher deductibles because they know they have higher loss potential and want to reduce their premium
- 3. underwriters force high-risk insureds to select higher deductibles
Deductible relativities should be determined based purely on loss elimination, outside the GLM, and then included in the GLM as an offset under the log link (+ ln(relativity))
relativity = factor of level y / factor of base level
factor = 1 - LER
(use 1 - LER, not the LER itself!)
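A small sketch of turning loss elimination ratios into deductible relativities and feeding them to a log-link GLM as an offset; the LER values, deductible levels, and data frame `df` are made up for illustration:

```python
import numpy as np

# Hypothetical loss elimination ratios by deductible, with $500 as the base level
ler = {500: 0.10, 1000: 0.18, 2500: 0.30}
factor = {d: 1.0 - l for d, l in ler.items()}                  # factor = 1 - LER, not LER
relativity = {d: f / factor[500] for d, f in factor.items()}   # rebase to the base level

# Offset term for a log-link GLM: ln(relativity) of each policy's deductible
df["offset"] = np.log(df["deductible"].map(relativity))
# ...then pass offset=df["offset"] when fitting, so the deductible effect is not re-estimated
```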
What happens if coverage related variables (deductible, limit) are priced within GLMs
- rates will reflect things other than pure loss elimination
- insureds will change their behavior
- therefore rate based on past experience (and past behaviors) will no longer be predictive of new policies
Why territories should be priced outside of GLMs
- may be a large number of territories
- but aggregating the territories into a smaller number of groups may cause loss of information
How territories should be priced
Step 1: Estimate territory relativities using spatial smoothing, with the rest of the classification plan included as an offset
Step 2: Estimate the rest of the classification plan using a GLM, with the territory relativities included as an offset
Iterate steps 1 and 2 until both converge
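A rough sketch of the two-step iteration; `data`, `fit_spatial_smoothing`, and `fit_classification_glm` are hypothetical placeholders, not real library calls:

```python
import numpy as np

# Hypothetical helpers, each returning log-relativities per record:
#   fit_spatial_smoothing(data, offset)    -> territory log-relativities
#   fit_classification_glm(data, offset)   -> classification log-relativities

class_log_rel = np.zeros(len(data))            # start with no classification effect
for _ in range(20):                            # iterate until both pieces stabilize
    terr_log_rel = fit_spatial_smoothing(data, offset=class_log_rel)       # Step 1
    new_class_log_rel = fit_classification_glm(data, offset=terr_log_rel)  # Step 2
    if np.max(np.abs(new_class_log_rel - class_log_rel)) < 1e-6:
        break
    class_log_rel = new_class_log_rel
```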
Impact of choosing a level with fewer observations as the base level of a categorical variable
- higher standard error and p-value => wider confidence intervals around the estimated coefficients
- but the predicted relativities will be the same (rebased to the chosen base level)
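A tiny worked example of the rebasing point (numbers are purely illustrative): with base rate b and relativities r,

```latex
\text{base } A:\; b = 100,\; r_A = 1.00,\; r_B = 1.25 \;\Rightarrow\; \hat{y}_A = 100,\; \hat{y}_B = 125 \\
\text{base } B:\; b = 125,\; r_A = 0.80,\; r_B = 1.00 \;\Rightarrow\; \hat{y}_A = 100,\; \hat{y}_B = 125
```

The intercept absorbs the shift, so predictions are identical either way.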
When using a log link function, why continuous predictor variables should usually be logged and exceptions
Reason:
- logging the predictor allows more flexibility in fitting different curve shapes to the data (if not logged, the model only allows exponential growth)
Exceptions:
- year variable (used to pick up trend effects)
- variable containing values of 0 (ln(0) is undefined, unless 1 is added to all observations before taking the log)
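A quick sketch of the difference under a log link (generic symbols, not from the source):

```latex
\text{unlogged: } \ln(\mu) = \beta x \;\Rightarrow\; \mu = e^{\beta x} \quad \text{(exponential curve only)} \\
\text{logged: } \ln(\mu) = \beta \ln(x) \;\Rightarrow\; \mu = x^{\beta} \quad \text{(power curves: concave, convex, or linear)}
```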
3 cautions when plotting actual vs predicted values for model selection
- use holdout data to prevent overfit
- aggregate data before plotting based on percentiles of predicted values
- take the log of all values before plotting to prevent large values from skewing picture
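A minimal sketch of such a check on holdout data; the data frame `holdout` and its `actual`/`predicted` columns are hypothetical names:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical holdout DataFrame with columns 'actual' and 'predicted'
holdout["bucket"] = pd.qcut(holdout["predicted"], q=100, labels=False, duplicates="drop")
binned = holdout.groupby("bucket")[["actual", "predicted"]].mean()

# Log scale keeps a few large values from dominating the picture
plt.plot(np.log(binned["predicted"]), label="predicted")
plt.plot(np.log(binned["actual"]), label="actual")
plt.legend()
plt.show()
```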
GLM outputs for each predicted coefficient
- standard error:
- Definition: estimated standard deviation of the random process that generated the coefficient estimate
- Use: p-value and confidence interval
- Limitation: based on the Cramér-Rao lower bound => could be understated
- p-value:
- Definition: probability of obtaining an estimated coefficient at least this large in magnitude by pure chance, given that the true coefficient is 0
- Focus: variable significance
- if the p-value is small, the variable should be included in the model
- Limitation: does not give the probability that the true coefficient is 0
- more observations => smaller p-value
- smaller dispersion parameter => smaller p-value
- confidence interval:
- Definition: range of estimates that would not be rejected given a selected threshold for the p-value
- If the interval is very narrow, the variable should be added to the model
- Focus: variable significance
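A small sketch of pulling these diagnostics from a fitted statsmodels GLM; `results` is a hypothetical fitted results object:

```python
# results: a fitted statsmodels GLM results object (hypothetical, e.g. sm.GLM(...).fit())
print(results.summary())               # coefficients, standard errors, p-values in one table

se     = results.bse                   # standard errors of the estimated coefficients
p_vals = results.pvalues               # small p-value => variable likely significant
ci     = results.conf_int(alpha=0.05)  # 95% confidence intervals for the coefficients
```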
Problem and options for GLMs with highly correlated variables
Problems:
- unstable model
- erratic coefficients
- high standard errors
Option 1: remove all highly correlated variables except one
- this eliminates the high correlation
- disadvantage: potentially loses some unique information in the eliminated variables
Option 2: use dimensionality-reduction techniques (principal component analysis)
- creates a new, smaller set of uncorrelated variables from the correlated variables, each new variable being a combination that captures as much of the variance in the originals as possible
- allows the other highly correlated variables to be removed, resulting in a simpler model
- use this subset of uncorrelated variables in the GLM
- disadvantage: additional time required
- well suited for developing individual aggregate variables that summarize the signal in a group of correlated predictors
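A minimal PCA sketch with scikit-learn; the choice of two components, the column names, and the data frame `df` are illustrative assumptions:

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical group of highly correlated predictors
corr_cols = ["vehicle_value", "vehicle_weight", "horsepower"]
X = StandardScaler().fit_transform(df[corr_cols])   # PCA is scale-sensitive

pca = PCA(n_components=2)
components = pca.fit_transform(X)                   # new, uncorrelated variables
df["vehicle_pc1"], df["vehicle_pc2"] = components[:, 0], components[:, 1]
# Use vehicle_pc1 / vehicle_pc2 in the GLM instead of the three correlated predictors
```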
Define multicollinearity and give a way to detect it
Definition:
- two or more predictors are strongly predictive of a third predictor => near-perfect linear dependency among three or more predictors
Problem:
- erratic coefficients
- unstable model
- model may not even converge
Detection: use variance inflation factor (VIF)
- VIF measures how much the squared standard error of a predictor is increased by collinearity with the other predictors; it is determined by running a linear model for each predictor using all the other predictors as inputs
- if VIF > 10 => the variable exhibits multicollinearity
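A short VIF sketch with statsmodels; the column names and data frame `df` are hypothetical:

```python
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical design matrix of continuous predictors plus an intercept column
X = sm.add_constant(df[["age", "vehicle_value", "horsepower"]])
vif = {col: variance_inflation_factor(X.values, i) for i, col in enumerate(X.columns)}
# Rule of thumb from the card: VIF > 10 flags multicollinearity
```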
Define aliasing and its solutions
Definition:
- perfect linear dependency among predictor variables
Problem:
-model will never converge
Solutions:
- Manually: remove aliased records or reclassify them into another factor level
- GLM software: automatically removes one of the aliased variables
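A tiny sketch of how aliasing shows up as a rank deficiency in the design matrix; the indicator columns are made up for illustration, and this is only a check, not a fix:

```python
import numpy as np

# Hypothetical dummy-coded design: the 'new business' indicator is exactly
# 1 - 'renewal', so together with the intercept the columns are perfectly
# linearly dependent (aliased)
X = np.column_stack([
    np.ones(6),                      # intercept
    np.array([1, 0, 1, 0, 0, 1]),    # renewal indicator
    np.array([0, 1, 0, 1, 1, 0]),    # new-business indicator = 1 - renewal
])
print(np.linalg.matrix_rank(X), "<", X.shape[1])   # rank 2 < 3 columns => aliasing
```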