A2. Generalized Linear Models for Insurance Rating Flashcards

1
Q

Problems with one-way analysis rating

A
  1. Can be distorted by correlations between rating variables
    - youth are more concentrated in some areas => territory and age are correlated
  2. Does not consider interdependencies/interactions between rating variables
    - youth+sport is extra risky, but elderly+sport is extra careful => sport interacts with age
2
Q

Problems with minimum bias rating

A
  1. Lack of statistical framework to assess quality of model:
    - cannot test significance of a variable
    - no credibility ranges for parameters (a GLM can provide confidence intervals)
  2. Iterative calculations are computationally inefficient
3
Q

Assumptions in classical linear/additive models

A
  1. all observations are independent
  2. observations are normally distributed
  3. each risk group has constant variance
  4. effects are additive: mean is a linear combination of covariates
4
Q

2 limitations of classical linear/additive models

A
  1. difficult to assert normality and constant variance of response variables
    - if Y>0 => then not normal
    - if Y>0 and E(Y) tends to 0 => then Var(Y) tends to 0 (not constant)
  2. mean is not always a linear combination of covariates
    - many insurance risks tend to vary multiplicatively with rating variables
    - additive assumptions are not realistic for insurance applications
5
Q

Assumptions in GLMs

A
  1. all observations are independent
  2. observations follow a distribution from the exponential family
  3. link function is differentiable and monotonic
  4. effects may be non-linear: the mean is the inverse link function applied to a linear combination of covariates

So GLMs are no longer tied to:

  1. NORMALITY Assumption
  2. CONSTANT VARIANCE Assumption
  3. ADDITIVITY OF EFFECTS Assumption

pro : adjusts for correlation, with less restrictive assumptions than classical linear models

con : Often difficult to explain results

6
Q

How to transform an additive model into a multiplicative rating plan

A

log link function
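A quick worked check of why the log link does this: the linear predictor is additive on the log scale, so exponentiating turns it into a product of factors (the coefficients below are made up):

```python
import math

# hypothetical fitted coefficients, stored on the log scale
b0 = math.log(100.0)   # base premium
b1 = math.log(1.25)    # relativity for rating variable 1
b2 = math.log(0.90)    # relativity for rating variable 2

# additive on the log scale...
log_premium = b0 + b1 + b2
# ...multiplicative after applying the inverse of the log link
premium = math.exp(log_premium)     # 100 * 1.25 * 0.90 = 112.5
```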

7
Q

Advantages of GLMs with log link rating plan as compared to additive model

A
  • simple and practical to implement ***
  • guarantee positive premiums
  • multiplicative impact of risk characteristic more intuitive
8
Q

Advantages/disadvantages of modeling frequency and severity separately

A

Advantages:

  • more insight/intuition about the impact of each predictor
  • more stability of both models, since a predictor affecting only frequency may be diluted in a pure premium model
  • less overfitting, since a predictor affecting only frequency might otherwise pick up the noise of severity in a pure premium model
  • frequency may not move in the same direction as severity, but a Tweedie/pure premium model implicitly assumes they do

Disadvantages:

  • more detailed data required => may not be available
  • need to build 2 models => more time
9
Q

Why coverage related variables (deductible, limit) should be first priced outside of GLMs

A
  1. Violation of the Tweedie assumption: frequency and severity never move in the same direction for those variables. A higher deductible => frequency decreases, severity increases
  2. Counterintuitive results: may indicate a lower rate for higher coverage, caused by:
    - 1: correlations with other variables outside the model
    - 2: adverse selection, with insureds self-selecting higher deductibles because they know they have higher loss potential and want to reduce the premium
    - 3: underwriters forcing high-risk insureds to select higher deductibles

*Deductible relativities should be determined based purely on loss elimination, outside of the GLM, then included in the GLM as an offset in the log link function (+ ln(relativities))

relativities = factor y / factor base level

factor = 1 - LER
not LER !!
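A small worked example of this arithmetic (the LER values are hypothetical):

```python
import math

# hypothetical loss elimination ratios by deductible; base level = 500
ler = {500: 0.10, 1000: 0.18, 2500: 0.30}

# factor = 1 - LER (NOT the LER itself)
factor = {d: 1.0 - r for d, r in ler.items()}

# relativity = factor at the level / factor at the base level
base = 500
relativity = {d: factor[d] / factor[base] for d in factor}

# offset term added to the log-link linear predictor: + ln(relativity)
offset = {d: math.log(rel) for d, rel in relativity.items()}
```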

10
Q

What happens if coverage related variables (deductible, limit) are priced within GLMs

A
  • rates will reflect other things than pure loss elimination
  • insureds will change their behavior
  • therefore rate based on past experience (and past behaviors) will no longer be predictive of new policies
11
Q

Why territories should be priced outside of GLMs

A
  • may be a large number of territories
    - but aggregating the territories into a smaller number of groups may cause loss of information

12
Q

How territories should be priced

A

Step 1: Estimate territory relativities using spatial smoothing and by including the rest of the classification model in offset

Step 2: Estimate the rest of the classification plan using GLM and by including the territory relativities in offset

Iterate steps 1 and 2 until both converge

13
Q

Impact of choosing a level with fewer observations as the base level of a categorical variable

A
  • higher standard error and p-value => wider confidence intervals around the estimated coefficients
  • but the predicted relativities will be the same (rebased to the chosen base level)
14
Q

When using a log link function, why continuous predictor variables should usually be logged and exceptions

A

Reason:
- logging the predictor allows more flexibility in fitting different curve shapes to the data (if not logged => only allows for exponential growth)

Exceptions:

  • year variable (used to pick up trend effects)
  • variable containing values of 0 (ln(0) is undefined, unless 1 is added to all observations before taking the log)
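A one-line illustration of the zero-value exception, using NumPy:

```python
import numpy as np

vehicle_age = np.array([0.0, 1.0, 3.0, 10.0])   # made-up predictor containing a 0

# np.log(vehicle_age) would produce -inf for the 0; add 1 to all observations first
logged = np.log1p(vehicle_age)                  # log1p(x) = ln(x + 1)
```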
15
Q

3 cautions when plotting actual vs predicted values for model selection

A
  • use holdout data to prevent overfit
  • aggregate data before plotting based on percentiles of predicted values
  • take the log of all values before plotting to prevent large values from skewing picture
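A sketch of the aggregation step (simulated holdout data; assumes pandas), bucketing records by percentiles of the predicted value before comparing:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
pred = rng.gamma(2.0, 500.0, 10_000)                 # made-up predicted pure premiums
actual = pred * rng.lognormal(0.0, 0.5, 10_000)      # noisy "actual" holdout outcomes

df = pd.DataFrame({"pred": pred, "actual": actual})

# aggregate into deciles of the predicted value before plotting
df["bucket"] = pd.qcut(df["pred"], q=10, labels=False)
plot_data = df.groupby("bucket")[["pred", "actual"]].mean()

# plotting np.log(plot_data) keeps large values from skewing the picture
```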
16
Q

GLM outputs for each estimated coefficient

A
  1. standard error:
    - Definition: estimated standard deviation of the random process generating the estimated coefficient
    - Use: p-value and confidence interval
    - Limitation: based on the Cramer-Rao lower bound => could be understated
  2. p-value:
    - Definition: probability of an estimated coefficient being at least this different from 0 by pure chance, given that the true coefficient is 0
    - Focus: variable significance
    - if the p-value is small, the variable should be included in the model
    - Limitation: does not give the probability that the true coefficient is 0
    - more observations => smaller p-value
    - smaller dispersion parameter => smaller p-value
  3. confidence interval:
    - Definition: range of estimates that would not be rejected given a selected threshold for the p-value
    - if the interval is very narrow (and excludes 0), the variable should be added to the model
    - Focus: variable significance
17
Q

Problem and options for GLMs with highly correlated variables

A

Problems:

  • unstable model
  • erratic coefficients
  • high standard errors

Option 1: remove all highly correlated variables except one

  • this eliminates the high correlation
  • disadvantage: potentially loses some unique information in the eliminated variables

Option 2: use dimensionality-reduction techniques (principal component analysis)

  • creates a new subset of uncorrelated variables from the correlated variables by identifying the combinations that explain the most variance
  • allows the other highly correlated variables to be removed, resulting in a simpler model
  • use this subset of uncorrelated variables in the GLM
  • disadvantage: additional time required
  • suited for developing individual, aggregate variables that summarize signal
18
Q

Define multicollinearity and give a way to detect it

A

Definition:
- two or more predictors are strongly predictive of a third => near-perfect linear dependency among 3 or more predictors

Problem:

  • erratic coefficients
  • unstable model
  • model may even not converge

Detection: use the variance inflation factor (VIF)
- VIF measures how much the squared standard error for a predictor is increased due to the presence of collinearity with the other predictors
- it is determined by running a linear model for each predictor using all the other predictors as inputs
- if VIF > 10 => variable has multicollinearity
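A minimal sketch of that definition, computing VIF directly from its R-squared formulation on simulated data (pure NumPy; not a production implementation):

```python
import numpy as np

def vif(X):
    """VIF per column: regress it on the others; VIF = 1 / (1 - R^2)."""
    n, p = X.shape
    out = []
    for j in range(p):
        y = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ coef
        r2 = 1.0 - resid.var() / y.var()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(2)
x1 = rng.normal(size=1000)
x2 = x1 + rng.normal(scale=0.1, size=1000)   # nearly collinear with x1
x3 = rng.normal(size=1000)                   # unrelated

vifs = vif(np.column_stack([x1, x2, x3]))
# VIFs for x1 and x2 exceed 10 (multicollinearity); x3 stays near 1
```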

19
Q

Define aliasing and its solutions

A

Definition:
- perfect linear dependency among predictor variables

Problem:
-model will never converge

Solutions:

  • Manually: remove or reclassify aliased records into another factor level
  • GLM software: automatically removes one of the aliased variables
20
Q

Types of aliasing

A
  1. Intrinsic aliasing:
    - perfect dependency between predictors is inherent to the definition of the variables
    - ex 1: if the model includes all levels of a categorical variable, last = 1 - sum(others)
    - ex 2: using age & birth date together
  2. Extrinsic aliasing:
    - perfect dependency between predictors arises from the nature of the data
    - ex: all red cars in the data happen to be 2-door sedans AND vice-versa
  3. Near-aliasing: (same as multicollinearity)
    - almost perfect dependency between 2 or more predictors
    - ex: all red cars in the data happen to be 2-door sedans (but not vice-versa)
    - convergence problems may occur
21
Q

Deviance residual

A

this is the amount that a given observation contributes to the overall deviance

in a well-fit model, deviance residuals will:

  • follow no predictable pattern
  • be normally distributed
  • have constant variance

22
Q

Possible transformations after reviewing partial residual graph

A
  1. Binning into a categorical variable with separate bins
    - helps capture differences in residuals between ranges of the variable
    - Disadvantages:
    - increases degrees of freedom
    - can result in inconsistent/impractical patterns
    - ignores variations within bins ***
  2. Adding polynomial terms
    - Disadvantage: loss of interpretability without a graph
  3. Adding piecewise linear/hinge functions
    - allows tracking the changing slope of the residuals
    - Disadvantage: break points must be chosen manually (judgmental)
23
Q

3 options for measuring model stability

A
  1. Cook’s distance
    - measures the influence of an individual record on the model
    - check whether records with the highest Cook’s distance should be excluded (higher distance = more influence the record has on the model)
  2. Cross-validation
    - creates sub-datasets with fewer records by sampling without replacement
    - split data into k parts; run the model on k-1 parts, then validate the results on the remaining part
    - check whether the parameter estimates that vary the most across model runs should be excluded
    - superior since more data is used to train and test
    - extremely time-consuming
    - less common in insurance since variables are often hand-picked
  3. Bootstrapping
    - creates new datasets with the same number of records by sampling with replacement
    - run the model on each sampled dataset
    - check parameter means and variances after refitting the model on many new datasets
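A bootstrap sketch on a toy linear model (simulated data): resample with replacement, refit, and summarize the refitted coefficient:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2000
x = rng.normal(size=n)
y = 2.0 + 0.5 * x + rng.normal(size=n)       # made-up response with true slope 0.5

# bootstrap: resample n records WITH replacement, refit, collect the coefficient
slopes = []
for _ in range(200):
    idx = rng.integers(0, n, n)
    X = np.column_stack([np.ones(n), x[idx]])
    coef, *_ = np.linalg.lstsq(X, y[idx], rcond=None)
    slopes.append(coef[1])

slopes = np.array(slopes)
# the mean and spread of `slopes` across refits measure the estimate's stability
```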
24
Q

2 measures used in diagnostic tests of overall model

A
  1. Log-likelihood
    - Definition: log of the product of the likelihoods of all observations
    - Lower bound: log-likelihood of the null model (no predictors)
    - Upper bound: log-likelihood of the saturated model (one parameter for each observation)
  2. Deviance
    - Definition: generalized form of the SSE
    - Lower bound: 0 (saturated model)
    - Upper bound: deviance of the null model (no predictors); represents the total deviance inherent in the data
    - can test between two “nested” models to see if the inclusion of the additional factor improves the model enough given the extra parameter it adds to the model
25
Q

Define underfitting and overfitting

A

Underfitting:

  • too few parameters
  • does not use enough of the useful info
  • does not capture enough of the signal

Overfitting:

  • too many parameters (or too few obs)
  • captures too much of the noise
26
Q

What is the purpose of a holdout/testing sample

A

a holdout sample is a separate sample of data not used in training

if the model predicts the holdout sample poorly, either:

  1. the model has been overfit to the training sample data
  2. OR it has poor predictive power in general

Uses:

  1. Validate how well the results of a predictive model generalize to other data:
    - to test how accurately the predictive model will perform in practice
    - test the predictive power/stability of a model
  2. Prevent overfitting:
    - the model will always fit the training data better with more parameters, but it could also pick up random variation as if it were a predictive signal
    - this is particularly likely when (1) the size of the training data set is small or (2) the number of parameters in the model is large
    - make sure the model has not captured noise of the training data during the fitting procedure
27
Q

Criteria to select a holdout/testing sample

A
  1. unbiased sample from the same population as the training dataset ***
  2. large enough to fit a model
  3. can select randomly or split from training data by time
    - if split randomly => any weather event will be in both training and test => over-optimistic validation results
    - if split using time (even vs odd) => better since training and holdout will be equally impacted by seasonality and trends => realistic validation results. If there is a large weather event, it will only be included in one of the two sets, so we won't get the overly optimistic results that would occur if the event affected both sets.
28
Q

Degrees of freedom for a model

A

number of parameters that need to be estimated
model complexity is defined by its degrees of freedom
a more complex model has more degrees of freedom

29
Q

How is GLM estimation affected by the number of variables

A

GLM estimation:

  • maximising the log-likelihood
  • or minimizing the deviance (equivalent)

Impact of number of variables:
-adding more variables will always increase the log-likelihood (and reduce the deviance) because there is more freedom to explain randomness of outcomes from non-systematic effects.

  • since GLMs maximize the log-likelihood, they will always use the additional variables
  • however, these additional variables may only pick up noise of the training set and therefore deteriorate model performance on the testing set
30
Q

2 measures used in diagnostic tests of rating variable selection

A
  1. Beta test:
    - test if the parameter is significantly different from 0 (or the relativity different from 1)
    - threshold defined by a Student's t-test if using betas, or a chi-square test if using relativities
  2. Deviance test:
    - test if the inclusion of the additional variable improves the model significantly (i.e. decreases the deviance enough)
    - threshold defined by a chi-square test if the scale parameter is known, or an F-test if the scale parameter is unknown
31
Q

Required conditions for a valid comparison of deviance between models

A

Standard conditions:

  • same number of records in datasets
  • same distribution
  • same dispersion parameter

If we want to compare models using the F-test, additional condition:
- one model must be a subset of (nested within) the other

32
Q

Why deviance residuals are not useful for discrete distributions (or distributions with a point mass, like Tweedie)

A

deviance residuals do not adjust for the discrete nature of those distributions

33
Q

3 strategies to test model

A
  1. Train and test:
    - split data into a single training set and a single test set
  2. Train, validate and test:
    - split data into 3: a training set, a validation set and a test set
    - validation set used to refine and tweak the model
  3. Cross-validation:
    - split data into k folds; train on k-1 folds and validate on the remaining fold, rotating through all k

** for all strategies:

  • only use the test data set when the model is complete
  • if too many decisions are made based on the test set, it effectively becomes a training set (leads to overfitting)

34
Q

2 reasons that model refinement techniques are not appropriate for model selection

A
  • some models may be proprietary
  • final model may be a business decision instead of a technical decision

35
Q

3 reasons GLMs may not be appropriate

A
  • if there is significant correlation in the data => aliasing problems => model will not converge
  • need to select an error structure => not clear which to use
  • if new product => may not have losses => no response variable to fit a GLM **
36
Q

2 important limitations of GLM

+ solution for the 1st problem

A
  1. GLMs give full credibility:
    - the estimated coefficients are not credibility-weighted to recognize low volumes or high variability
    - solution: look at standard errors and p-values **
  2. GLMs assume the randomness of outcomes is uncorrelated: **
    - if several renewals of the same policy in dataset => the same insured is likely to have correlated outcomes
    - if there are weather events in dataset => the same weather event is likely to cause similar outcomes to risks in the same areas
37
Q

Steps to combine separate models by peril/coverage

A
  1. Run each peril model separately to get expected losses from each peril for the same group of exposures
  2. Aggregate the expected losses across all perils for all observations
  3. Run a model using the all-peril loss cost as the target variable and the union of all predictors as the predictors.
  4. This target variable will be more stable since volatility was fit away, therefore you can focus on using only data reflecting the future mix of business (latest year instead of all historical years)
38
Q

Why ensemble models can offer improved predictions

A

Reasons:

  • some models will over-predict for some segments of the book
  • other models will under-predict for other segments of the book
  • therefore using an average can balance the predictions for all segments

Conditions:

  • the errors of the models must be as uncorrelated as possible (otherwise they will over-predict the same segments at the same time)
  • therefore the models should be built separately, by different people, without sharing information
39
Q

Considerations in data preparation when merging policy and claims data

A
  • claims matching: need to match specific vehicles or specific coverages?
  • record matching: are there timing differences between the datasets?
  • time dimension: is the level of aggregation CY or AY consistent between datasets before merging?
  • fields required/necessary: are there fields not needed or desired fields not present?
40
Q

Considerations in modifying the data

A
  • check for duplicate records => remove them
  • check for codes of categorical fields against documentation => document new codes or correct errors
  • check for reasonability of numerical fields => correct negative premiums or significant outliers
  • check for convergence problems/confusing results => handle errors/missing values by replacing them by average values/error flag/reclassify them/remove them
  • check for binning of continuous fields
41
Q

Other possible data adjustments before modeling

A
  • capping large losses
  • removing cats (or giving less weight)
  • developing losses to ultimate (or add a year variable to pick up trends/development)
  • trending exposures and losses (or add a year variable to pick up trends/development)
  • on-leveling premiums
42
Q

When to use weights in a GLM

A
  • When a single observation contains grouped information
  • Different observations represent different time periods

If neither apply, weights are all 1

43
Q

lift

A

economic value of the model

ability to prevent adverse selection **

44
Q

ROC CURVE

A

Used for logistic models ***
shows the trade-off between the true positive rate and false positive rate at different discrimination thresholds.

A better model will be pushed out further from the line of equality, meaning that a small increase in the false positive rate will yield a larger increase in the true positive rate as the threshold is lowered
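A small self-contained sketch of that TPR/FPR trade-off at a few thresholds (simulated scores; not a full ROC curve implementation):

```python
import numpy as np

def roc_points(y_true, score, thresholds):
    """True/false positive rates at each discrimination threshold."""
    pts = []
    for t in thresholds:
        flagged = score >= t
        tpr = (flagged & (y_true == 1)).sum() / (y_true == 1).sum()
        fpr = (flagged & (y_true == 0)).sum() / (y_true == 0).sum()
        pts.append((fpr, tpr))
    return pts

rng = np.random.default_rng(5)
y = rng.integers(0, 2, 1000)                 # made-up binary outcomes
score = y + rng.normal(size=1000)            # a useful (but noisy) model score

pts = roc_points(y, score, thresholds=[0.25, 0.5, 0.75])
# a good model sits above the 45-degree line: tpr > fpr at every threshold
```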

45
Q

OFFSET

A
  • makes the GLM aware of non-modeled rating factors so that they are reflected in the model and the estimated coefficients for the new variables are optimal in their presence

if there are multiple offsets, they can simply be added together into a total offset

46
Q

OFFSET USE

A

Offset is useful for deductibles, which are better estimated outside GLMs (e.g. LER analysis), since
the GLM often produces counterintuitive results due to the effect of selection and correlation with
variables outside the model.

Territory rating is impractical to use in a GLM since there are hundreds or even thousands of
territories with no easy way to group them without losing signal. However, territory differences
are significant so it’s important that the rating plan be offset for territory rates. Thus it’s best to
include territory factors as an offset in GLM.

If you’re creating a model on renewal business after having already made a model for new
business only, you would likely use an offset for many of the variables. This would ensure
consistency between the sets of business that you do not expect to change over time.

When including the effect of a coverage limit in a pure premium model. Limits may be correlated
with other covariates not being accounted for in the model and this might lead to inconsistent
ILFs based on model results, so it’s better to do loss elimination analysis outside of the modeling
process and include the effect of a coverage limit as an offset.

** when the target variable varies by an exposure base
example: claim count per policy is the target
  - policies have different policy lengths in years
  - a policy of 2 years will have an offset of ln(2)

47
Q

centering variable at their base level

A

Intercept represents all variables at their base levels ***
-> easier to interpret

Sign of interacted variables:
- when a variable is not centered, a coefficient may sometimes have the opposite sign than expected; this is especially true when an interaction term is present, so coefficients are more intuitive to understand when variables are centered

48
Q

Methods for Estimating Distribution Parameters

A

method of moments

maximum likelihood

minimum chi squared

minimum distance ***