GLM2 Flashcards

1
Q

things to check for when cleaning the data for a GLM

A
  • Check for duplicate records and remove them
  • Check categorical field values against documentation (i.e., are there code values not in the documentation, and are these new codes or errors?)
  • Check reasonability of numerical fields (e.g., negative premiums, significant outliers)
  • Decide how to handle errors and missing values (e.g., how much time to investigate, anything systematic about these records such as a specific location, maybe discard these records or replace the bad values with average values or an error flag)
  • Decide whether to convert continuous variables into categorical ones, group levels within categorical variables, or combine or separate variables (see the pandas sketch below)
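Not from the source, but a minimal pandas sketch of these checks; the file name and columns (policy_id, vehicle_use, premium) are hypothetical:

```python
import pandas as pd

# Hypothetical policy-level dataset; file and column names are illustrative.
df = pd.read_csv("policies.csv")

# Duplicate records: drop exact duplicates on the record key.
df = df.drop_duplicates(subset=["policy_id"])

# Categorical fields vs. documentation: flag codes not in the spec.
documented_uses = {"commute", "pleasure", "business"}
undocumented = df.loc[~df["vehicle_use"].isin(documented_uses), "vehicle_use"]
print("Undocumented vehicle_use codes:\n", undocumented.value_counts())

# Reasonability of numeric fields: negative premiums, extreme outliers.
print("Negative premiums:", (df["premium"] < 0).sum())
print("Premiums above the 99.9th percentile:",
      (df["premium"] > df["premium"].quantile(0.999)).sum())

# Missing values: flag them, then either discard or impute (e.g., the mean).
df["premium_missing"] = df["premium"].isna()
df["premium"] = df["premium"].fillna(df["premium"].mean())
```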
2
Q

how adding additional predictor variables in a model will impact the fit of a GLM on both the training dataset and on the testing dataset

A

Adding more variables to a model will always cause the model to fit the training dataset better, since they provide more freedom for the model to fit that data. However, these additional variables only add predictive power on other datasets (such as the testing dataset) up to a point, after which they are being fit to the noise in the training dataset in addition to the signal, and so are less useful when applied to new datasets.
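A toy illustration of this effect (simulated data, not from the source): nested linear models are fit with 5 signal variables plus pure-noise extras; training error keeps improving as predictors are added, while test error stops improving once the model starts fitting noise:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# 5 real signal variables plus 45 pure-noise variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 50))
y = X[:, :5].sum(axis=1) + rng.normal(size=400)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for k in (5, 10, 25, 50):  # number of predictors included
    fit = LinearRegression().fit(X_tr[:, :k], y_tr)
    train_mse = mean_squared_error(y_tr, fit.predict(X_tr[:, :k]))
    test_mse = mean_squared_error(y_te, fit.predict(X_te[:, :k]))
    print(f"{k:2d} predictors: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
# Train MSE falls monotonically; test MSE stops improving once the extra
# variables are only fitting noise in the training data.
```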

4
Q

why a GLM should be tested on a separate dataset from the one used to train it

A

Adding additional variables to a GLM will always result in a better fit on the training dataset, even when they add no real predictive power. As such, to test the predictive power of the model, we need to use a separate testing dataset.

5
Q

2 approaches that can be used to split individual records from a ratemaking dataset into a training dataset and a testing dataset

A

Records can be split on a time basis or randomly.

A time basis means that the records from a certain time period (e.g., certain accident years) would go in the training dataset and the remaining records would go in the testing dataset.

The advantage of splitting on time is that a random split would place records from the same weather events in both the training and testing datasets, which can result in over-optimistic validation results; a time-based split avoids this.
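A brief sketch of both splits, assuming a hypothetical claims.csv with an accident_year column:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical ratemaking dataset with an accident_year column.
df = pd.read_csv("claims.csv")

# Random split: 70% of records to train, 30% to test.
train_rand, test_rand = train_test_split(df, test_size=0.3, random_state=42)

# Time-based split: older accident years train, recent years test, so
# records from the same weather event cannot land on both sides.
train_time = df[df["accident_year"] <= 2021]
test_time = df[df["accident_year"] > 2021]
```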

6
Q

2 advantages and 2 disadvantages of modeling frequency and severity separately instead of modeling pure premium directly

A

Advantages:
  • Modeling frequency and severity separately allows you to gain more insight and intuition about the impact of each predictor variable.
  • Frequency and severity are each more stable when modeled separately (e.g., a variable that only impacts frequency will look less significant in a pure premium model because it is diluted by severity noise).
  • Pure premium modeling can lead to overfitting if a predictor variable only impacts frequency or severity but not both. For example, if a variable is significant for frequency but not for severity, the randomness of the severity for that variable might be treated as part of the signal instead of part of the random noise.
  • The Tweedie distribution in a pure premium model assumes frequency and severity move in the same direction, but this may not be true.

Disadvantages:
  • Creating separate frequency and severity models takes more time, since two models need to be built instead of a single pure premium model.
  • Claim-level data may not be available to model frequency and severity separately (see the modeling sketch below).
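To contrast the two approaches concretely, a hedged statsmodels sketch; the column names (claim_count, losses, exposure, age_group, territory) and the Tweedie power of 1.6 are illustrative assumptions:

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical policy-level data; all column names are illustrative.
policies = pd.read_csv("policies.csv")
policies["pure_premium"] = policies["losses"] / policies["exposure"]
claims = policies[policies["claim_count"] > 0].copy()
claims["severity"] = claims["losses"] / claims["claim_count"]

# Frequency: Poisson GLM with log link, exposure as the claim denominator.
freq = smf.glm("claim_count ~ age_group + territory", data=policies,
               family=sm.families.Poisson(),
               exposure=policies["exposure"]).fit()

# Severity: Gamma GLM with log link, weighted by claim count.
sev = smf.glm("severity ~ age_group + territory", data=claims,
              family=sm.families.Gamma(link=sm.families.links.Log()),
              var_weights=claims["claim_count"]).fit()

# Alternative: a single Tweedie pure premium model (power 1.6 is a guess).
pp = smf.glm("pure_premium ~ age_group + territory", data=policies,
             family=sm.families.Tweedie(var_power=1.6,
                                        link=sm.families.links.Log()),
             var_weights=policies["exposure"]).fit()
```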

7
Q

steps needed to combine pure premium models for different perils instead of a single all-peril pure premium model

A

i. Run each peril model separately to get expected losses from each peril for the same group of exposures.
ii. Aggregate the expected losses across all perils for all observations.
iii. Run a model using the all-peril loss cost as the target variable and the union of all predictor variables as the predictors. Since this target variable is more stable, focus on using a dataset that is more reflective of the future mix of business (a code sketch of these steps follows).
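A sketch of steps i–iii as a helper function; `peril_models` is a hypothetical dict of already-fit per-peril GLMs, and the formula and Tweedie power are illustrative:

```python
import statsmodels.api as sm
import statsmodels.formula.api as smf

def combine_peril_models(df, peril_models, formula):
    """peril_models: dict mapping peril name -> already-fit per-peril GLM."""
    # i. Score each peril model on the same group of exposures.
    for peril, model in peril_models.items():
        df[f"{peril}_expected"] = model.predict(df)
    # ii. Aggregate the expected losses across all perils per observation.
    cols = [f"{p}_expected" for p in peril_models]
    df["all_peril_expected"] = df[cols].sum(axis=1)
    # iii. Refit a single model on the combined loss cost; `formula` should
    # use the union of the predictors from the individual peril models.
    family = sm.families.Tweedie(var_power=1.6,
                                 link=sm.families.links.Log())
    return smf.glm(formula, data=df, family=family).fit()

# Usage (illustrative):
# combined = combine_peril_models(df, {"fire": fire_fit, "theft": theft_fit},
#                                 "all_peril_expected ~ age_group + territory")
```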

8
Q

highest and lowest possible values of the deviance

A

For a given dataset and error distribution, the lowest possible value of the deviance is 0, which would be the deviance of a saturated model with one parameter for every observation in the dataset.

The highest possible value of the deviance occurs when there are no predictor variables (an intercept-only model), in which case the deviance represents the total deviance inherent in the data.
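A small statsmodels example on simulated Poisson data, showing the fitted model's deviance landing between 0 (a saturated model) and the null deviance (no predictors):

```python
import numpy as np
import statsmodels.api as sm

# Simulated Poisson counts with one real predictor.
rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = rng.poisson(np.exp(0.3 + 0.5 * x))

fit = sm.GLM(y, sm.add_constant(x), family=sm.families.Poisson()).fit()
print("model deviance:", fit.deviance)       # between 0 and the null deviance
print("null deviance :", fit.null_deviance)  # no predictors (intercept only)
# A saturated model, with one parameter per observation, has deviance 0.
```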

9
Q

what conditions need to exist for deviance comparisons to be valid between 2 GLMs.

A
  • The datasets used for both models must be identical (including the same number of records used in each model).
  • Both models must use the same assumed distribution and dispersion parameter.
10
Q

sample plot of deviance residuals for a model if the deviance residuals are left-skewed

A

[Image answer: a histogram of deviance residuals skewed left, with a long tail of large negative residuals.]
11
Q

sample plot of deviance residuals for a model if the deviance residuals are right-skewed

A

[Image answer: a histogram of deviance residuals skewed right, with a long tail of large positive residuals.]
12
Q

deviance residuals

A

The deviance residual is the amount that a given observation contributes to the deviance.

It is effectively the residual adjusted for the shape of the assumed GLM distribution, so the deviance residuals will be approximately normally distributed if the assumed distribution is correct.
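A quick check on simulated Gamma data; `resid_deviance` on a fitted statsmodels GLM gives each observation's deviance residual, which should look roughly normal when the assumed distribution is right:

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

# Simulated Gamma data whose mean is log-linear in x.
rng = np.random.default_rng(2)
x = rng.normal(size=500)
y = rng.gamma(shape=2.0, scale=np.exp(0.2 * x) / 2.0)

fit = sm.GLM(y, sm.add_constant(x),
             family=sm.families.Gamma(link=sm.families.links.Log())).fit()

# Each observation's signed contribution to the deviance; if the assumed
# distribution is right, these should look approximately normal.
dev_resid = fit.resid_deviance
print(stats.shapiro(dev_resid))  # or inspect a histogram / Q-Q plot
```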

15
Q

3 options for measuring model stability

A
  • The influence of an individual record on the model can be measured using Cook's distance, which can be calculated by most GLM software. Records with the highest Cook's distance should be given additional scrutiny as to whether they should be included in the dataset or not.
  • Cross-validation can be used to assess model stability by comparing the parameter estimates produced across the different model runs.
  • Bootstrapping can be used to create new datasets with the same number of records by randomly sampling with replacement from the original dataset. The model can then be refit on many different datasets, giving statistics such as the mean and variance of each parameter estimate (see the bootstrap sketch below).
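A minimal bootstrap sketch (the formula and family passed in are placeholders); for the first bullet, statsmodels exposes Cook's distance on a fitted GLM via fit.get_influence().cooks_distance:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

def bootstrap_params(df, formula, family, n_boot=200, seed=0):
    """Refit the GLM on datasets resampled with replacement; return one
    row of parameter estimates per bootstrap sample."""
    rng = np.random.default_rng(seed)
    rows = []
    for _ in range(n_boot):
        sample = df.sample(n=len(df), replace=True,
                           random_state=int(rng.integers(2**32)))
        rows.append(smf.glm(formula, data=sample, family=family).fit().params)
    return pd.DataFrame(rows)

# Usage (illustrative): mean and std dev of each parameter across resamples.
# boots = bootstrap_params(df, "claim_count ~ age_group", sm.families.Poisson())
# print(boots.describe())
```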
16
Q

model building process

A

The first step is to set goals and objectives for the project, such as defining the end product, setting timelines, and identifying resources.

The second step is to communicate with key stakeholders such as legal, IT, and underwriters to identify their requirements and concerns for the project.

The third step is to collect and process the data, which includes data cleansing and splitting the data for testing.

The fourth step is exploratory data analysis to understand the data and potential relationships between variables.

The fifth step is to specify the form of the model, which in a GLM would include identifying the target variable and link function.

The sixth step is to evaluate the model output for each variable and in total and make adjustments to the model as needed.

The seventh step is to validate the model by testing it on a holdout dataset and picking the optimal model.

The eighth step is to translate the model results into a product, such as a final rating plan.

The last step is to maintain and rebuild the model as needed, since its predictive value will change over time.

17
Q

impact on the specificity and sensitivity of increasing the discrimination threshold

A

Increasing the discrimination threshold will result in fewer true positives and more true negatives.

Since true positives and true negatives are the numerators of sensitivity and specificity respectively, increasing the threshold causes the sensitivity of the model to decrease and the specificity to increase.
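A small demonstration on simulated labels and scores: raising the threshold lowers sensitivity and raises specificity:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def sens_spec(y_true, scores, threshold):
    """Sensitivity = TP/(TP+FN); specificity = TN/(TN+FP)."""
    preds = (scores >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, preds, labels=[0, 1]).ravel()
    return tp / (tp + fn), tn / (tn + fp)

# Simulated labels and scores loosely correlated with them.
rng = np.random.default_rng(3)
y = rng.integers(0, 2, size=1000)
scores = np.clip(0.4 * y + 0.6 * rng.uniform(size=1000), 0, 1)

for t in (0.3, 0.5, 0.7):
    s, sp = sens_spec(y, scores, t)
    print(f"threshold {t}: sensitivity {s:.2f}, specificity {sp:.2f}")
```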

18
Q

why coverage-related variables should not be priced using GLMs

A

Coverage-related variables (such as deductibles or limits) can give counterintuitive results in GLMs, such as indicating a lower rate for more coverage. This could be due to correlations with other variables outside of the model, including possible selection effects (e.g., insureds self-selecting into higher limits because they know they are higher risk, or underwriters forcing high-risk insureds to take higher deductibles).

Charging rates for coverage options that reflect anything other than pure loss elimination could lead to changes in insured behavior, which means the indicated rates based on past experience would no longer be expected to be appropriate for new policies. As such, rates for coverage options should be estimated outside of the GLM first and included in the GLM as offset terms.
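A sketch of the offset approach under a log link, assuming a hypothetical deductible_rel column holding the relativity estimated outside the GLM:

```python
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

def fit_with_coverage_offset(df):
    """Assumes df["deductible_rel"] holds a relativity estimated outside
    the GLM (e.g., via a loss elimination analysis). Under a log link, a
    fixed factor enters the linear predictor as log(relativity)."""
    family = sm.families.Tweedie(var_power=1.6,
                                 link=sm.families.links.Log())
    return smf.glm("pure_premium ~ age_group + territory", data=df,
                   family=family,
                   offset=np.log(df["deductible_rel"])).fit()
```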

19
Q

how territory should be modeled in conjunction with GLMs

A

Territories are challenging in GLMs since there may be a very large number of territories, and aggregating them into a smaller number of groups may cause you to lose important information.

Techniques like spatial smoothing can be used to price territories, and the resulting territorial rates can then be included in the GLM as offset terms. However, the territory model should also be offset for the rest of the classification plan, so the process should be iterated until each model converges to an acceptable degree.
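A schematic of the iterative offsetting (all names illustrative; a simple per-territory average of residual ratios stands in for real spatial smoothing):

```python
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

def iterate_class_and_territory(df, n_iter=5):
    """Alternate between the class-plan GLM (offset by territory) and a
    territory estimate (offset by the class plan) until both settle."""
    df = df.copy()
    df["terr_rel"] = 1.0  # start from flat territory relativities
    family = sm.families.Tweedie(var_power=1.6,
                                 link=sm.families.links.Log())
    for _ in range(n_iter):
        # Class-plan GLM with current territory relativities as the offset.
        fit = smf.glm("pure_premium ~ age_group + vehicle_use", data=df,
                      family=family, offset=np.log(df["terr_rel"])).fit()
        # Class-plan prediction with the territory offset zeroed out.
        df["class_rel"] = fit.predict(df, offset=np.zeros(len(df)))
        # Stand-in for spatial smoothing: the average ratio to the class
        # plan within each territory becomes the new territory relativity;
        # clip to guard against log(0) in territories with no losses.
        ratio = df["pure_premium"] / df["class_rel"]
        df["terr_rel"] = (ratio.groupby(df["territory"])
                               .transform("mean").clip(lower=1e-3))
    return fit, df["terr_rel"]
```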