GLM2 Flashcards

1
Q

things to check for when cleaning the data for a GLM

A
  • Check for duplicate records and remove them
  • Check categorical field values against documentation (i.e., are there code values not in the documentation, and are these new codes or errors?)
  • Check reasonability of numerical fields (e.g., negative premiums, significant outliers)
  • Decide how to handle errors and missing values (e.g., how much time to investigate, anything systematic about these records such as a specific location, maybe discard these records or replace the bad values with average values or an error flag)
  • Decide whether to convert continuous variables into categorical ones, group levels within categorical variables, or combine or separate variables (see the pandas sketch below)
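Not from the source, but a minimal pandas sketch of these checks; the file name and columns (policy_id, vehicle_use, premium) are hypothetical:

```python
import pandas as pd

# Hypothetical policy-level dataset; file and column names are illustrative.
df = pd.read_csv("policies.csv")

# Duplicate records: drop exact duplicates on the record key.
df = df.drop_duplicates(subset=["policy_id"])

# Categorical fields vs. documentation: flag codes not in the spec.
documented_uses = {"commute", "pleasure", "business"}
undocumented = df.loc[~df["vehicle_use"].isin(documented_uses), "vehicle_use"]
print("Undocumented vehicle_use codes:\n", undocumented.value_counts())

# Reasonability of numeric fields: negative premiums, extreme outliers.
print("Negative premiums:", (df["premium"] < 0).sum())
print("Premiums above the 99.9th percentile:",
      (df["premium"] > df["premium"].quantile(0.999)).sum())

# Missing values: flag them, then either discard or impute (e.g., the mean).
df["premium_missing"] = df["premium"].isna()
df["premium"] = df["premium"].fillna(df["premium"].mean())
```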
2
Q

how adding additional predictor variables in a model will impact the fit of a GLM on both the training dataset and on the testing dataset

A

Adding more variables to a model will always cause the model to fit the training dataset better, since they provide more freedom for the model to fit that data. However, these additional variables only add predictive power on other datasets (such as the testing dataset) up to a point, after which they are being fit to the noise in the training dataset in addition to the signal, and so are less useful when applied to new datasets.
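A toy illustration of this effect (simulated data, not from the source): nested linear models are fit with 5 signal variables plus pure-noise extras; training error keeps improving as predictors are added, while test error stops improving once the model starts fitting noise:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# 5 real signal variables plus 45 pure-noise variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 50))
y = X[:, :5].sum(axis=1) + rng.normal(size=400)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for k in (5, 10, 25, 50):  # number of predictors included
    fit = LinearRegression().fit(X_tr[:, :k], y_tr)
    train_mse = mean_squared_error(y_tr, fit.predict(X_tr[:, :k]))
    test_mse = mean_squared_error(y_te, fit.predict(X_te[:, :k]))
    print(f"{k:2d} predictors: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
# Train MSE falls monotonically; test MSE stops improving once the extra
# variables are only fitting noise in the training data.
```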

4
Q

why a GLM should be tested on a separate dataset from the one used to train it

A

Adding additional variables to a GLM will always result in a better fit on the training dataset, even when they add no real predictive power. As such, to test the predictive power of the model, we need to use a separate testing dataset.

5
Q

2 approaches that can be used to split individual records from a ratemaking dataset into a training dataset and a testing dataset

A

Records can be split on a time basis or randomly.

A time basis means that the records from a certain time period (e.g., certain accident years) would go in the training dataset and the remaining records would go in the testing dataset.

The advantage of splitting on time is that a random split would place records from the same weather events in both the training and testing datasets, which can result in over-optimistic validation results; a time-based split avoids this.
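A brief sketch of both splits, assuming a hypothetical claims.csv with an accident_year column:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical ratemaking dataset with an accident_year column.
df = pd.read_csv("claims.csv")

# Random split: 70% of records to train, 30% to test.
train_rand, test_rand = train_test_split(df, test_size=0.3, random_state=42)

# Time-based split: older accident years train, recent years test, so
# records from the same weather event cannot land on both sides.
train_time = df[df["accident_year"] <= 2021]
test_time = df[df["accident_year"] > 2021]
```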

6
Q

2 advantages and 2 disadvantages of modeling frequency and severity separately instead of modeling pure premium directly

A

Advantages:
  • Modeling frequency and severity separately allows you to gain more insight and intuition about the impact of each predictor variable.
  • Frequency and severity are each more stable when modeled separately (e.g., a variable that only impacts frequency will look less significant in a pure premium model because it is diluted by severity noise).
  • Pure premium modeling can lead to overfitting if a predictor variable only impacts frequency or severity but not both. For example, if a variable is significant for frequency but not for severity, the randomness of the severity for that variable might be treated as part of the signal instead of part of the random noise.
  • The Tweedie distribution in a pure premium model assumes frequency and severity move in the same direction, but this may not be true.

Disadvantages:
  • Creating separate frequency and severity models takes more time, since two models need to be built instead of a single pure premium model.
  • Claim-level data may not be available to model frequency and severity separately (see the modeling sketch below).
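To contrast the two approaches concretely, a hedged statsmodels sketch; the column names (claim_count, losses, exposure, age_group, territory) and the Tweedie power of 1.6 are illustrative assumptions:

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical policy-level data; all column names are illustrative.
policies = pd.read_csv("policies.csv")
policies["pure_premium"] = policies["losses"] / policies["exposure"]
claims = policies[policies["claim_count"] > 0].copy()
claims["severity"] = claims["losses"] / claims["claim_count"]

# Frequency: Poisson GLM with log link, exposure as the claim denominator.
freq = smf.glm("claim_count ~ age_group + territory", data=policies,
               family=sm.families.Poisson(),
               exposure=policies["exposure"]).fit()

# Severity: Gamma GLM with log link, weighted by claim count.
sev = smf.glm("severity ~ age_group + territory", data=claims,
              family=sm.families.Gamma(link=sm.families.links.Log()),
              var_weights=claims["claim_count"]).fit()

# Alternative: a single Tweedie pure premium model (power 1.6 is a guess).
pp = smf.glm("pure_premium ~ age_group + territory", data=policies,
             family=sm.families.Tweedie(var_power=1.6,
                                        link=sm.families.links.Log()),
             var_weights=policies["exposure"]).fit()
```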

7
Q

steps needed to combine pure premium models for different perils instead of a single all-peril pure premium model

A

i. Run each peril model separately to get expected losses from each peril for the same group of exposures.
ii. Aggregate the expected losses across all perils for all observations.
iii. Run a model using the all-peril loss cost as the target variable and the union of all predictor variables as the predictors. Since this target variable is more stable, focus on using a dataset that is more reflective of the future mix of business (a code sketch of these steps follows).
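A sketch of steps i–iii as a helper function; `peril_models` is a hypothetical dict of already-fit per-peril GLMs, and the formula and Tweedie power are illustrative:

```python
import statsmodels.api as sm
import statsmodels.formula.api as smf

def combine_peril_models(df, peril_models, formula):
    """peril_models: dict mapping peril name -> already-fit per-peril GLM."""
    # i. Score each peril model on the same group of exposures.
    for peril, model in peril_models.items():
        df[f"{peril}_expected"] = model.predict(df)
    # ii. Aggregate the expected losses across all perils per observation.
    cols = [f"{p}_expected" for p in peril_models]
    df["all_peril_expected"] = df[cols].sum(axis=1)
    # iii. Refit a single model on the combined loss cost; `formula` should
    # use the union of the predictors from the individual peril models.
    family = sm.families.Tweedie(var_power=1.6,
                                 link=sm.families.links.Log())
    return smf.glm(formula, data=df, family=family).fit()

# Usage (illustrative):
# combined = combine_peril_models(df, {"fire": fire_fit, "theft": theft_fit},
#                                 "all_peril_expected ~ age_group + territory")
```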

8
Q

highest and lowest possible values of the deviance

A

For a given dataset and error distribution, the lowest possible value of the deviance is 0, which would be the deviance of a saturated model with one parameter for every observation in the dataset.

The highest possible value of the deviance occurs when there are no predictor variables (an intercept-only model), in which case the deviance represents the total deviance inherent in the data.
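A small statsmodels example on simulated Poisson data, showing the fitted model's deviance landing between 0 (a saturated model) and the null deviance (no predictors):

```python
import numpy as np
import statsmodels.api as sm

# Simulated Poisson counts with one real predictor.
rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = rng.poisson(np.exp(0.3 + 0.5 * x))

fit = sm.GLM(y, sm.add_constant(x), family=sm.families.Poisson()).fit()
print("model deviance:", fit.deviance)       # between 0 and the null deviance
print("null deviance :", fit.null_deviance)  # no predictors (intercept only)
# A saturated model, with one parameter per observation, has deviance 0.
```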

9
Q

what conditions need to exist for deviance comparisons to be valid between 2 GLMs.

A
  • The datasets used for both models must be identical (including the same number of records used in each model).
  • Both models must use the same assumed distribution and dispersion parameter.
10
Q

sample plot of deviance residuals for a model if the deviance residuals are left-skewed

A

[Image answer: a histogram of deviance residuals skewed left, with a long tail of large negative residuals.]
11
Q

sample plot of deviance residuals for a model if the deviance residuals are right-skewed

A

[Image answer: a histogram of deviance residuals skewed right, with a long tail of large positive residuals.]
12
Q

deviance residuals

A

The deviance residual is the amount that a given observation contributes to the deviance.

It is effectively the residual adjusted for the shape of the assumed GLM distribution, so the deviance residuals will be approximately normally distributed if the assumed distribution is correct.
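A quick check on simulated Gamma data; `resid_deviance` on a fitted statsmodels GLM gives each observation's deviance residual, which should look roughly normal when the assumed distribution is right:

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

# Simulated Gamma data whose mean is log-linear in x.
rng = np.random.default_rng(2)
x = rng.normal(size=500)
y = rng.gamma(shape=2.0, scale=np.exp(0.2 * x) / 2.0)

fit = sm.GLM(y, sm.add_constant(x),
             family=sm.families.Gamma(link=sm.families.links.Log())).fit()

# Each observation's signed contribution to the deviance; if the assumed
# distribution is right, these should look approximately normal.
dev_resid = fit.resid_deviance
print(stats.shapiro(dev_resid))  # or inspect a histogram / Q-Q plot
```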

15
Q

3 options for measuring model stability

A
  • The influence of an individual record on the model can be measured using Cook's distance, which can be calculated by most GLM software. Records with the highest Cook's distance should be given additional scrutiny as to whether they should be included in the dataset or not.
  • Cross-validation can be used to assess model stability by comparing the parameter estimates produced across the different model runs.
  • Bootstrapping can be used to create new datasets with the same number of records by randomly sampling with replacement from the original dataset. The model can then be refit on many different datasets, giving statistics such as the mean and variance of each parameter estimate (see the bootstrap sketch below).
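A minimal bootstrap sketch (the formula and family passed in are placeholders); for the first bullet, statsmodels exposes Cook's distance on a fitted GLM via fit.get_influence().cooks_distance:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

def bootstrap_params(df, formula, family, n_boot=200, seed=0):
    """Refit the GLM on datasets resampled with replacement; return one
    row of parameter estimates per bootstrap sample."""
    rng = np.random.default_rng(seed)
    rows = []
    for _ in range(n_boot):
        sample = df.sample(n=len(df), replace=True,
                           random_state=int(rng.integers(2**32)))
        rows.append(smf.glm(formula, data=sample, family=family).fit().params)
    return pd.DataFrame(rows)

# Usage (illustrative): mean and std dev of each parameter across resamples.
# boots = bootstrap_params(df, "claim_count ~ age_group", sm.families.Poisson())
# print(boots.describe())
```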
16
Q

model building process

A

The first step is to set goals and objectives for the project, such as defining the end product, setting timelines, and identifying resources.

The second step is to communicate with key stakeholders such as legal, IT, and underwriters to identify their requirements and concerns for the project.

The third step is to collect and process the data, which includes data cleansing and splitting the data for testing.

The fourth step is exploratory data analysis to understand the data and potential relationships between variables.

The fifth step is to specify the form of the model, which in a GLM would include identifying the target variable and link function.

The sixth step is to evaluate the model output for each variable and in total and make adjustments to the model as needed.

The seventh step is to validate the model by testing it on a holdout dataset and picking the optimal model.

The eighth step is to translate the model results into a product, such as a final rating plan.

The last step is to maintain and rebuild the model as needed, since its predictive value will change over time.

17
Q

impact on the specificity and sensitivity of increasing the discrimination threshold

A

Increasing the discrimination threshold will result in fewer true positives and more true negatives.

Since true positives and true negatives are the numerators of sensitivity and specificity respectively, increasing the threshold causes the sensitivity of the model to decrease and the specificity to increase.
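A small demonstration on simulated labels and scores: raising the threshold lowers sensitivity and raises specificity:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def sens_spec(y_true, scores, threshold):
    """Sensitivity = TP/(TP+FN); specificity = TN/(TN+FP)."""
    preds = (scores >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, preds, labels=[0, 1]).ravel()
    return tp / (tp + fn), tn / (tn + fp)

# Simulated labels and scores loosely correlated with them.
rng = np.random.default_rng(3)
y = rng.integers(0, 2, size=1000)
scores = np.clip(0.4 * y + 0.6 * rng.uniform(size=1000), 0, 1)

for t in (0.3, 0.5, 0.7):
    s, sp = sens_spec(y, scores, t)
    print(f"threshold {t}: sensitivity {s:.2f}, specificity {sp:.2f}")
```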

18
Q

why coverage-related variables should not be priced using GLMs

A

Coverage-related variables (such as deductibles or limits) can give counterintuitive results in GLMs, such as indicating a lower rate for more coverage. This could be due to correlations with other variables outside of the model, including possible selection effects (e.g., insureds self-selecting into higher limits because they know they are higher risk, or underwriters forcing high-risk insureds to take higher deductibles).

Charging rates for coverage options that reflect anything other than pure loss elimination could lead to changes in insured behavior, which means the indicated rates based on past experience would no longer be expected to be appropriate for new policies. As such, rates for coverage options should be estimated outside of the GLM first and included in the GLM as offset terms.
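A sketch of the offset approach under a log link, assuming a hypothetical deductible_rel column holding the relativity estimated outside the GLM:

```python
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

def fit_with_coverage_offset(df):
    """Assumes df["deductible_rel"] holds a relativity estimated outside
    the GLM (e.g., via a loss elimination analysis). Under a log link, a
    fixed factor enters the linear predictor as log(relativity)."""
    family = sm.families.Tweedie(var_power=1.6,
                                 link=sm.families.links.Log())
    return smf.glm("pure_premium ~ age_group + territory", data=df,
                   family=family,
                   offset=np.log(df["deductible_rel"])).fit()
```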

19
Q

how territory should be modeled in conjunction with GLMs

A

Territories are challenging in GLMs since there may be a very large number of territories, and aggregating them into a smaller number of groups may cause you to lose important information.

Techniques like spatial smoothing can be used to price territories, and the resulting territorial rates can then be included in the GLM as offset terms. However, the territory model should also be offset for the rest of the classification plan, so the process should be iterated until each model converges to an acceptable degree.
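A schematic of the iterative offsetting (all names illustrative; a simple per-territory average of residual ratios stands in for real spatial smoothing):

```python
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

def iterate_class_and_territory(df, n_iter=5):
    """Alternate between the class-plan GLM (offset by territory) and a
    territory estimate (offset by the class plan) until both settle."""
    df = df.copy()
    df["terr_rel"] = 1.0  # start from flat territory relativities
    family = sm.families.Tweedie(var_power=1.6,
                                 link=sm.families.links.Log())
    for _ in range(n_iter):
        # Class-plan GLM with current territory relativities as the offset.
        fit = smf.glm("pure_premium ~ age_group + vehicle_use", data=df,
                      family=family, offset=np.log(df["terr_rel"])).fit()
        # Class-plan prediction with the territory offset zeroed out.
        df["class_rel"] = fit.predict(df, offset=np.zeros(len(df)))
        # Stand-in for spatial smoothing: the average ratio to the class
        # plan within each territory becomes the new territory relativity;
        # clip to guard against log(0) in territories with no losses.
        ratio = df["pure_premium"] / df["class_rel"]
        df["terr_rel"] = (ratio.groupby(df["territory"])
                               .transform("mean").clip(lower=1e-3))
    return fit, df["terr_rel"]
```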