A.2. Generalized Linear Models for Insurance Rating Flashcards

1
Q

GLM random component

A

Each yi is assumed to be independent and to come from the exponential family of distributions with mean µi and variance Var(yi) = φV(µi)/ωi

  • φ is called the dispersion parameter and is a constant used to scale the variance.
  • V(µ) is called the variance function and is given for a selected distribution type. It describes the relationship between the variance and mean. Note that the same distribution type (e.g., Poisson) must be assumed for all observations.
  • ωi are known as weights and assign a weight to each observation i.
2
Q

GLM systematic component

A

g(µi) = β0 + β1xi1 + β2xi2 + · · · + βpxip + offset

  • The right-hand side is known as the linear predictor.
  • The offset term is optional and allows you to manually specify the estimates for certain variables (usually based on other analyses).
  • The x predictor variables can be binary (as for levels of categorical variables) or continuous, or even transformations or combinations of other variables.
  • g(µ) is called the link function, and allows for transformations of the linear predictor.
  • β0 is called the intercept term, and the other β’s are called the coefficients of the model.
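
To make the components concrete, here is a minimal sketch, assuming Python's statsmodels and made-up data and column names, of fitting a Poisson frequency GLM with a log link and an exposure offset:

```python
# A minimal sketch (hypothetical data and column names) of fitting a GLM
# with a log link in statsmodels.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "claims":   [0, 1, 0, 2, 0, 1],               # target: claim counts
    "exposure": [1.0, 0.5, 1.0, 1.0, 0.25, 1.0],
    "age":      [25, 40, 33, 22, 60, 45],          # continuous predictor
    "urban":    [1, 0, 1, 1, 0, 0],                # binary predictor
})

# log(exposure) enters as an offset so the model rates frequency per exposure;
# the Poisson family's default log link makes the rating plan multiplicative.
model = smf.glm(
    "claims ~ np.log(age) + urban",
    data=df,
    family=sm.families.Poisson(),
    offset=np.log(df["exposure"]),
).fit()
print(model.params)   # the intercept β0 and the coefficients βj
```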

3
Q

Advantages of multiplicative rating plans

A
  • Simple and practical to implement.
  • They guarantee positive premiums (not true for additive terms).
  • Impact of risk characteristics is more intuitive.
4
Q

Variance Functions for exponential family distributions

A
Distribution         Variance Function
Normal               V(µ) = 1
Poisson              V(µ) = µ
Gamma                V(µ) = µ^2
Inverse Gaussian     V(µ) = µ^3
Negative Binomial    V(µ) = µ(1 + κµ)
Binomial             V(µ) = µ(1 − µ)
Tweedie              V(µ) = µ^p
5
Q

Choices for Severity distributions

A

In insurance data, claim severity distributions tend to be right-skewed and have a lower bound at 0. Both the Gamma and Inverse Gaussian distributions exhibit these properties, and as such are common choices for modeling severity. The Gamma distribution is the most commonly used, but the Inverse Gaussian has a sharper peak and wider tail, so it is more appropriate for more skewed severity distributions.

6
Q

Choices for Frequency distributions

A

Claim frequency is most often modeled using a Poisson distribution. The GLM implementation of Poisson allows the distribution to be continuous instead of discrete. Technically, the overdispersed Poisson is recommended, which allows φ to be different from 1, and thus allows the variance to be greater than the mean (instead of equal to it, as with the typical Poisson).

Another choice for frequency modeling is the Negative Binomial distribution, which is really just a Poisson distribution with a parameter that itself has a Gamma distribution. With the Negative Binomial, φ is restricted to 1, but it instead contains a dispersion parameter κ in its variance function that allows the variance to exceed the mean.

7
Q

Relationship between Poisson, Gamma, and Tweedie parameters

A

• Poisson has parameter λ, which equals both its mean and variance
• Gamma has mean αθ and variance αθ^2, and thus coefficient of variation 1/√α
• Tweedie has mean µ = λ × (αθ) and variance φµ^p
• p = (α + 2)/(α + 1), so it depends entirely on the Gamma coefficient of variation
• The Tweedie dispersion parameter is φ = [λ^(1−p) × (αθ)^(2−p)] / (2 − p)
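
A worked example of these relationships, with made-up Poisson and Gamma parameters:

```python
# Hypothetical parameters: λ = 0.10 claims per exposure; Gamma severity
# with shape α = 2 and scale θ = 5000.
lam, alpha, theta = 0.10, 2.0, 5000.0

mu  = lam * (alpha * theta)                        # Tweedie mean: 0.10 x 10,000 = 1,000
p   = (alpha + 2) / (alpha + 1)                    # power p = 4/3; always between 1 and 2
phi = lam ** (1 - p) * (alpha * theta) ** (2 - p) / (2 - p)

print(mu, p, phi)                                  # 1000.0, 1.333..., ≈ 1500
print(phi * mu ** p)                               # the Tweedie variance, φµ^p
```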

8
Q

Logit and Logistic Functions

A

Logit: g(µ) = ln[µ/(1 − µ)]. The ratio µ/(1 − µ) is known as the odds (e.g., a thousand to one).

Logistic function (inverse of logit): 1/(1 + e^(−x))
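
A minimal sketch of both functions in Python (numpy assumed), including the thousand-to-one odds example:

```python
import numpy as np

def logit(mu):
    return np.log(mu / (1 - mu))     # log of the odds mu/(1 - mu)

def logistic(x):
    return 1 / (1 + np.exp(-x))      # inverse of the logit; output in (0, 1)

mu = 1000 / 1001                     # odds of a thousand to one
print(logit(mu))                     # ln(1000) ≈ 6.91
print(logistic(logit(0.25)))         # round trip recovers 0.25
```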
9
Q

Why continuous predictor variables should usually be logged and exceptions

A

Continuous variables should usually be logged when a log link function is used, to allow the GLM flexibility in fitting different curve shapes to the data (other than just exponential growth).

Exceptions to the general rule of logging a continuous predictor variable exist, such as using a year variable to pick up trend effects. Also, if the variable contains values of 0, an adjustment such as adding 1 to all observations must first be made, since ln(0) is undefined.

10
Q

Impact of choosing a level with fewer observations as the base level of a categorical variable

A

This will still result in the same predicted relativities for that variable (re-based to the chosen base level), but there will be wider confidence intervals around the estimated coefficients.

11
Q

Matrix form of a GLM

A

g(µ) = Xβ, where µ is the vector of µi values, β is the vector of β parameters, and X is called the design matrix.

12
Q

Degrees of freedom for a model

A

The degrees of freedom of a model is the number of parameters that need to be estimated for the model.

13
Q

GLM outputs for each predicted coefficient

A

Standard error

p-value: the estimated probability that a coefficient at least as far from 0 (in absolute value) as the estimated β would arise by pure chance

Confidence interval

14
Q

How number of observations and dispersion parameter impact p-values

A

p-values (and standard errors and confidence intervals) will be smaller with larger datasets that have more observations. They will also be smaller with smaller values of φ.

15
Q

Problem and options for GLMs with highly correlated variables

A

This can result in an unstable model with erratic coefficients that have high standard errors. Two options for dealing with very high correlation include:

  1. Removing all highly correlated variables except one. This eliminates the high correlation in the model, but it also potentially loses some unique information contained in the eliminated variables.
  2. Using dimensionality-reduction techniques such as principal components analysis or factor analysis to create a new subset of variables from the correlated variables, and using this subset of variables in the GLM. The downside is the additional time required for this extra analysis.
16
Q

Define multicollinearity and give a way to detect it

A

Multicollinearity occurs when there is a near-perfect linear dependency among 3 or more predictor variables, e.g., x1 + x2 ≈ x3. This is more difficult to detect, since x1 and x2 may not individually be highly correlated with x3. When multicollinearity is present in a model, the model may become unstable with erratic coefficients, and it may not converge to a solution.

One way to detect multicollinearity is to use the variance inflation factor (VIF) statistic, which is given for each predictor variable and measures the impact on the squared standard error for that variable due to collinearity with other predictor variables, by seeing how well the other predictor variables can predict the variable in question. VIF values of 10 or greater are considered high.
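
A minimal sketch of a VIF check using statsmodels' variance_inflation_factor, with data simulated so that x1 + x2 ≈ x3 as in the example above:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = rng.normal(size=500)
x3 = x1 + x2 + rng.normal(scale=0.05, size=500)   # near-perfect linear dependency

X = sm.add_constant(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))
for j, col in enumerate(X.columns):
    if col != "const":
        # VIF_j = 1 / (1 - R²_j), regressing predictor j on the others;
        # all three here will far exceed the rule-of-thumb threshold of 10.
        print(col, variance_inflation_factor(X.values, j))
```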

17
Q

Define aliasing and how GLM software deals with it

A

When there is a perfect linear dependency among predictor variables, those variables are aliased. The GLM will not converge in this case, but most GLM software will detect this and automatically remove one of those variables from the model.

18
Q

2 important limitations of GLMs

A
  1. GLMs give full credibility: The estimated coefficients are not credibility-weighted to recognize low volumes of data or high volatility. This concern can be partially addressed by looking at p-values or standard errors.
  2. GLMs assume that the randomness of outcomes is uncorrelated. Two examples of violations of this are:

• Using a dataset with several renewals of the same policy, since the same insured over different renewals is likely to have correlated outcomes.

• When the data can be affected by weather, since the same weather events are likely to cause similar outcomes for risks in the same areas.

19
Q

Components of model-building process

A
  1. Setting goals and objectives
  2. Communication with key stakeholders
  3. Collecting and processing the data
  4. Conducting exploratory data analysis
  5. Specifying the form of the model
  6. Evaluating the model output
  7. Validating the model
  8. Translating the model results into a product
  9. Maintaining and rebuilding the model
20
Q

Considerations in merging policy and claim data

A

• Matching claims to specific vehicles/drivers (for auto) or specific coverages.

• Are there timing differences between the datasets? How often is each updated? Timing differences can cause record-matching problems.

• Is there a unique key to merge the data (e.g., policy number)? There is the potential for orphaned claims if there is no matching policy record, or duplicated claims if there are multiple policy records.

• Level of aggregation before merging? Time dimension (e.g., CY)? Policy level versus claimant/coverage level? For commercial, location level or policy level?

• Are there fields not needed? Are there fields desired that are not present?

21
Q

Considerations in Modifying the Data

A
• Check for duplicate records and remove them.

• Check categorical field values against documentation (i.e., are there code values not in the documentation, and are these new codes or errors?).

• Check reasonability of numerical fields (e.g., negative premiums, significant outliers).

• Decide how to handle errors and missing values (e.g., how much time to investigate, whether there is anything systematic about these records such as a specific location, and whether to discard these records or replace the bad values with average values or an error flag).

• Convert continuous variables into categorical (called binning)? Group levels in categorical variables? Combine or separate variables?

22
Q

Other possible data adjustments before modeling

A
  • Capping large losses
  • Removing catastrophe (cat) losses or giving them less weight
  • Developing losses
  • On-leveling premiums for LR models
  • Trending exposures and losses
23
Q

Purpose of using a separate dataset for testing

A

After we build a model on a set of data, it would be inappropriate to test the model on the same set of data, since that would give us biased results of the model's performance. More variables in the model will always make it fit the training data better, but the model may not fit other datasets better, since it begins to treat random noise in the data as part of the systematic signal. We want to pick up as much signal as possible with minimal noise. As such, before we build our model, we will want to split the data into at least 2 parts: the training set and the test (aka holdout) set.

24
Q

List 3 Model Testing Strategies

A

• Train and test: Split data into a single training set and a single test set. Can split randomly or on a time basis. The advantage of splitting on time is that a random split would place the same weather events in both datasets, which can result in over-optimistic validation results.

• Train, validate, and test: Split data into 3 parts: a training set, a validation set, and a test set. The validation set can be used to refine the model and make tweaks, but the test set should still be left until the model is final.

• Cross-validation: This is less common in insurance since variables are often hand-picked. There are different ways to do cross-validation, but the most common is called k-fold cross-validation.
25
Q

Steps for k-fold cross-validation

A
  1. Pick a number k (e.g., 10) and split data into k groups (called folds). Split can be random or based on time.
  2. For each fold, train the model using the other k − 1 folds, and test the model using this kth fold.
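
A minimal sketch of these steps using scikit-learn's KFold; the data and the fit/score step are placeholders:

```python
import numpy as np
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                # placeholder predictors
y = rng.poisson(lam=0.1, size=1000)           # placeholder claim counts

kf = KFold(n_splits=10, shuffle=True, random_state=0)   # step 1: k = 10 folds
for train_idx, test_idx in kf.split(X):                 # step 2: rotate the held-out fold
    X_train, y_train = X[train_idx], y[train_idx]       # train on the other k - 1 folds
    X_test, y_test = X[test_idx], y[test_idx]           # test on the held-out fold
    # fit_and_score(X_train, y_train, X_test, y_test)   # hypothetical model step
```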
26
Q

How to combine frequency and severity models into a pure premium model

A

When both have log link functions, you can multiply the corresponding relativities together (or you can just add their linear predictors together to get the model in equation form).
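
A small numeric illustration with made-up relativities for a single rating level:

```python
import numpy as np

freq_rel = 1.20    # frequency model relativity for the level (hypothetical)
sev_rel  = 0.90    # severity model relativity for the same level (hypothetical)

print(freq_rel * sev_rel)                          # pure premium relativity: 1.08
print(np.exp(np.log(freq_rel) + np.log(sev_rel)))  # same result via summed linear predictors
```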

27
Q

2 disadvantages of modeling frequency and severity separately

A

It requires detailed data to be available, and it takes more time to build 2 models.

28
Q

Advantages of modeling frequency and severity separately

A
  • Gaining more insight and intuition about the impact of each predictor variable.
  • Each of frequency and severity separately is more stable (e.g., a variable that only impacts frequency will look less significant in a pure premium model).
  • Pure premium modeling can lead to overfitting if a predictor variable only impacts frequency or severity but not both, since the randomness of the other component may be considered a signal effect.
  • The Tweedie distribution in a pure premium model assumes both frequency and severity move in the same direction, but this may not be true.

29
Q

Steps to combine separate models by peril/coverage

A
  1. Run each peril model separately to get expected losses from each peril for the same group of exposures.
  2. Aggregate the expected losses across all perils for all
    observations.
  3. Run a model using the all-peril loss cost as the target
    variable and the union of all predictor variables as the
    predictors. Since this target variable will be more stable,
    focus on using a dataset that will be more reflective of the future mix of business (e.g., the latest year instead of several years worth of data).
30
Q

Criteria for variable inclusion in a GLM

A

If our intent in building the GLM is just to update the rates for the existing rating algorithm, then we only want to use existing rating variables in our model.

Otherwise, the criteria for variable inclusion will include statistical significance (e.g., p-values), the cost-effectiveness of collecting data for the variable, actuarial standards of practice and legal requirements, and whether the quotation system can include the variable.
31
Q

Formula for partial residuals

A

ri = (yi − µi)g′(µi) + βj xij

With the log link function g(µi) = ln(µi), we have g′(µi) = 1/µi, so the above becomes ri = (yi − µi)/µi + βj xij
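
A minimal sketch, with made-up fitted values and a hypothetical coefficient, of computing partial residuals under a log link:

```python
import numpy as np

y   = np.array([120.0, 300.0, 80.0, 510.0])   # observed targets (made up)
mu  = np.array([150.0, 280.0, 100.0, 450.0])  # fitted means from the GLM
x_j = np.array([1.0, 2.0, 1.5, 3.0])          # the predictor under review
b_j = 0.45                                    # its fitted coefficient (hypothetical)

r = (y - mu) / mu + b_j * x_j                 # r_i = (y_i - mu_i)/mu_i + beta_j * x_ij
print(r)   # plot r against x_j; a line of slope b_j means no transformation is needed
```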

32
Q

Possible transformations after reviewing partial residual graph

A

• Binning the variable: i.e., turning it into a categorical variable with separate “bins”. Downsides include that this increases the degrees of freedom of the model, it can result in inconsistent and/or impractical patterns, and variation within bins is ignored.

• Adding polynomial terms: i.e., xj^2, xj^3, etc. Drawback is loss of interpretability without a graph.

• Adding piecewise linear functions: add hinge functions max(0, xj − c) at each break point c. Drawback is that break points must be manually chosen.
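
A minimal sketch of the three transformations in Python; the bins and break point are arbitrary:

```python
import numpy as np
import pandas as pd

x = pd.Series([0.5, 1.2, 3.8, 7.0, 12.4])

binned  = pd.cut(x, bins=[0, 2, 5, 15], labels=["low", "mid", "high"])  # binning
poly2   = x ** 2                                                        # polynomial term x^2
hinge_5 = np.maximum(0, x - 5)                                          # hinge max(0, x - c), c = 5
print(pd.DataFrame({"x": x, "binned": binned, "x^2": poly2, "hinge_5": hinge_5}))
```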

33
Q

2 measures used in diagnostic tests of the overall model

A

• Log-likelihood: This is the log of the product of the likelihoods for all observations under the model (or equivalently, the sum of the log-likelihoods for all observations). For a given dataset, it is bounded between the lowest possible log-likelihood of the null model (no predictors) and the highest possible log-likelihood of the saturated model (one predictor for each observation).

• Deviance = 2 × (ll_saturated − ll_model) = 2 × Σi [ln f(yi | µi = yi) − ln f(yi | µi = µ̂i)], where µ̂i is the model's fitted mean.

Note that adding more variables to a model always increases log-likelihood and reduces deviance, since there is more freedom to fit the data.

34
Q

Required conditions for a valid comparison of deviance between models

A

Identical datasets
Same distribution
Same dispersion parameter

35
Q

F statistic and table value

A

F = (Ds − Db) / [(# of added parameters) × φs], where s denotes the smaller model and b the bigger model.

The bigger model is considered better at a given significance level if F is larger than the F-distribution table value F(# of added parameters, n − ps).
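
A minimal sketch of the test with hypothetical deviances, using scipy for the table value:

```python
from scipy.stats import f

D_s, D_b = 4120.0, 4075.0    # deviances of the smaller (s) and bigger (b) models
phi_s    = 1.8               # dispersion estimate from the smaller model
added    = 3                 # number of added parameters
n, p_s   = 5000, 10          # observations and smaller-model parameter count

F = (D_s - D_b) / (added * phi_s)
crit = f.ppf(0.95, dfn=added, dfd=n - p_s)   # table value at 5% significance
print(F, crit, F > crit)     # prefer the bigger model if F exceeds the table value
```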

36
Q

AIC and BIC formulas

A
AIC = −2 × ll + 2p
BIC = −2 × ll + p ln(n)

Smaller values are better
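
A tiny worked example with hypothetical values:

```python
import numpy as np

ll, p, n = -1520.3, 12, 5000   # log-likelihood, parameters, observations (made up)

aic = -2 * ll + 2 * p
bic = -2 * ll + p * np.log(n)
print(aic, bic)   # since ln(5000) > 2, BIC penalizes added parameters more heavily
```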

37
Q

Deviance Residuals formula and implications

A

ln f(yi | µi = yi) − ln f(yi | µi = µ̂i)

This is changed to be a negative value when yi < µ̂i.

The deviance residual is the amount that a given observation contributes to the deviance. In a well-fit model, the deviance residuals will follow no predictable pattern and will be normally distributed with constant variance.

38
Q

3 options for measuring model stability

A

• The influence of an individual record on the model can be measured using Cook’s distance, which can be calculated by most GLM software. Records with the highest Cook’s distance should be given additional scrutiny as to whether they should be included in the dataset or not.

• Cross-validation can be used to assess model stability by comparing in-sample parameter estimates across different model runs.

• Bootstrapping can be used to create new datasets with the same number of records by randomly sampling with replacement from the original dataset. The model can then be refit on many different datasets, and we can get statistics like the mean and variance for each parameter estimate (see the sketch below).
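
A minimal sketch of the bootstrap idea; the coefficient fit is a crude stand-in for refitting the actual GLM on each resample:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.normal(size=400)})
df["y"] = rng.poisson(lam=np.exp(0.2 + 0.5 * df["x"]))   # simulated data

def fit_coefficients(d):
    # Stand-in for the real model refit: least squares on log(y + 1).
    return np.polyfit(d["x"], np.log(d["y"] + 1), deg=1)

# Resample with replacement, refit, and collect the parameter estimates.
boot = np.array([
    fit_coefficients(df.sample(n=len(df), replace=True, random_state=rng))
    for _ in range(200)
])
print(boot.mean(axis=0), boot.std(axis=0))   # mean and spread of each estimate
```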
39
Q

2 reasons that model refinement techniques are not appropriate for model selection

A
  1. Some of the models may be proprietary.
  2. The decision on the final model may be a business decision and not a technical one.

40
Q

3 cautions when plotting actual vs predicted values for model selection

A
  • Use holdout data (to prevent overfit)
  • It can help to aggregate data before plotting if the dataset is very large (e.g., into 100 buckets based on percentiles of predicted values)
  • Taking the log of all values before graphing prevents large values from skewing the picture
41
Q

Steps to create a simple quantile plot

A
  1. Sort the (holdout) dataset based on that model’s predicted loss costs.
  2. Bucket the data into quantiles with each quantile having equal exposures.
  3. Calculate the average predicted loss cost and average actual loss cost for each bucket and plot them on a graph. For ease of interpretation, it can be helpful to divide both values by the overall average predicted loss cost.
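
A minimal sketch of these steps in pandas; the column names are assumptions, and for simplicity the buckets hold equal record counts rather than equal exposures:

```python
import pandas as pd
import matplotlib.pyplot as plt

def quantile_plot(df, pred="predicted_lc", actual="actual_lc", n_bins=10):
    d = df.sort_values(pred)                                               # step 1: sort by prediction
    buckets = pd.qcut(d[pred].rank(method="first"), n_bins, labels=False)  # step 2: bucket
    g = d.groupby(buckets)[[pred, actual]].mean()                          # step 3: averages per bucket
    g = g / d[pred].mean()                      # divide by overall average predicted loss cost
    g.plot(marker="o")
    plt.xlabel("quantile"); plt.ylabel("relative loss cost")
    plt.show()
```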
42
Q

3 criteria for choosing winning model from simple quantile plots

A
  1. Predictive accuracy: Difference between actual and
    predicted in each quantile.
  2. Monotonicity: The actual pure premium should
    consistently increase across quantiles.
  3. Vertical distance of actual loss cost between first and last quantiles: This indicates how well the model distinguishes between the best and worst risks.
43
Q

Steps to create a double lift chart

A
  1. For each observation, calculate sort ratio = model 1
    predicted loss cost / model 2 predicted loss cost.
  2. Sort the data by sort ratio in ascending order.
  3. Bucket the data into quantiles with equal exposures.
  4. Calculate the average predicted loss cost for each model and average actual loss cost for each bucket, divide each by the overall average loss cost from that source, and plot the quantities on a graph.
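
A minimal sketch of a double lift chart in pandas, under the same assumptions (hypothetical column names, equal record counts per bucket):

```python
import pandas as pd
import matplotlib.pyplot as plt

def double_lift(df, m1="pred_m1", m2="pred_m2", actual="actual_lc", n_bins=10):
    d = df.assign(sort_ratio=df[m1] / df[m2]).sort_values("sort_ratio")    # steps 1-2
    buckets = pd.qcut(d["sort_ratio"].rank(method="first"), n_bins, labels=False)  # step 3
    g = d.groupby(buckets)[[m1, m2, actual]].mean()
    g = g / d[[m1, m2, actual]].mean()          # step 4: scale each column by its overall mean
    g.plot(marker="o")
    plt.xlabel("sort-ratio quantile")
    plt.show()                                  # the model tracking actual more closely wins
```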
44
Q

Steps to create a loss ratio chart

A
  1. Sort the (holdout) dataset based on that model’s predicted loss costs.
  2. Bucket the data into quantiles with each quantile having equal exposures.
  3. Calculate the actual loss ratio (based on the current rating plan, not on the model) for each bucket and plot them on a graph.
45
Q

Steps to create the Gini index

A
  1. Sort the (holdout) dataset based on that model’s predicted loss costs.
  2. Plot a graph with the x-axis being the cumulative percent of exposures and the y-axis being the cumulative percent of actual losses. This curve is the Lorenz curve.

Gini index = 2 × area between the Lorenz curve and the line of equality
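
A minimal sketch of the calculation at the individual-record level, using the trapezoid rule for the area:

```python
import numpy as np

def gini_index(pred, actual, exposure):
    order = np.argsort(pred)                           # sort by predicted loss cost
    x = np.cumsum(exposure[order]) / exposure.sum()    # cumulative % of exposures
    y = np.cumsum(actual[order]) / actual.sum()        # cumulative % of actual losses
    x = np.concatenate([[0.0], x])
    y = np.concatenate([[0.0], y])
    area_under_lorenz = np.sum((x[1:] - x[:-1]) * (y[1:] + y[:-1]) / 2)  # trapezoid rule
    return 2 * (0.5 - area_under_lorenz)   # 2 x area between Lorenz curve and equality line
```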

46
Q

Calculation of sensitivity, specificity, and false positive rate

A

Sensitivity = True positives / Total event occurrences

Specificity = True negatives / Total event non-occurrences

False positive rate = 1 - Specificity
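
A tiny worked example with made-up confusion-matrix counts:

```python
tp, fn = 80, 20     # event occurrences: 80 flagged correctly, 20 missed
tn, fp = 850, 50    # non-occurrences: 850 cleared correctly, 50 falsely flagged

sensitivity = tp / (tp + fn)            # 80 / 100  = 0.80
specificity = tn / (tn + fp)            # 850 / 900 ≈ 0.944
false_positive_rate = 1 - specificity   # ≈ 0.056
print(sensitivity, specificity, false_positive_rate)
```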

47
Q

Why coverage related variables should be first priced outside of GLMs

A

Coverage related variables (such as deductibles or limits) in GLMs can give counterintuitive results, such as indicating a lower rate for more coverage. This could be due to correlations with other variables outside of the model, including possible selection effects (e.g., insureds self-selecting to higher limits since they know they are higher risk, or underwriters forcing high-risk insureds to have higher deductibles). Charging rates for coverage options that reflect anything other than pure loss elimination could lead to changes in insured behavior, which means the indicated rates based on past experience would no longer be expected to be appropriate for new policies. As such, rates for coverage options should be estimated outside of the GLM first and included in the GLM as offset terms.

48
Q

Why territories should be priced outside of GLMs

A

Territories are challenging in GLMs since there may be a very large number of territories, and aggregating them into a smaller number of groups may cause you to lose important information. Techniques like spatial smoothing can be used to price territories, and then territorial rates can be included in the GLM as an offset term. However, the territory model should also be offset for the rest of the classification plan, so the process should be iterative until each model converges to an acceptable degree.

49
Q

Why ensemble models can offer improved predictions

A

Different models will over-predict and under-predict for different segments of the book, but using an average of multiple models helps balance these predictions out for those segments. However, this really only works when the model errors are as uncorrelated as possible, which generally happens when models are built separately by different people with little or no sharing of information.