GLM Flashcards
Describe the 2 components of a GLM
- Random component
- Random component
Captures portion of variation driven by causes other than predictors in model (including pure randomness)
- Systematic component
Portion of variation in a model that can be explained using predictor variables
Our goal in modelling with GLM is to shift as much of the variability as possible away from random component into systematic component.
Identify the variance function for the following distributions:
1. Normal
2. Poisson
3. Gamma
4. Inv Gaussian
5. NB
6. Binomial
7. Tweedie
- 1
- u
- u^2
- u^3
- u(1+ku)
- u(1-u)
- u^p
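A minimal Python sketch (illustrative only) that maps each family above to its variance function V(u); the function name and parameter defaults are assumptions for the example.

```python
import numpy as np

def variance_function(mu, family, k=1.0, p=1.5):
    """Return V(mu) for common GLM families (illustrative sketch)."""
    mu = np.asarray(mu, dtype=float)
    if family == "normal":
        return np.ones_like(mu)
    if family == "poisson":
        return mu
    if family == "gamma":
        return mu ** 2
    if family == "inverse_gaussian":
        return mu ** 3
    if family == "negative_binomial":   # k is the overdispersion parameter
        return mu * (1 + k * mu)
    if family == "binomial":
        return mu * (1 - mu)
    if family == "tweedie":             # p is the power parameter
        return mu ** p
    raise ValueError(f"unknown family: {family}")

# e.g. variance_function([0.1, 0.5], "tweedie", p=1.67)
```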
Define the 2 most common link functions
- Log: g(u) = ln(u)
Used for rating
- Logit: g(u) = ln(u/(1-u))
Used for binary target (0,1)
List 3 advantages of log link function
The log link function transforms the linear predictor into a multiplicative structure: u = exp(b0 + b1x1 + …)
Which has 3 advantages:
1. Simple and practical to implement
2. Avoids negative premiums that could arise if additive structure
3. Impact of risk characteristics is more intuitive
List 2 uses of offset terms
Allows you to incorporate pre-determined values for certain variables into model so GLM takes them as given.
2 uses:
1. Deductible factors are often developed outside of the model
ln(u) = b0 + b1x1 + … + ln(rebased (1 - LER) factor)
2. Target variable varies directly with a particular measure (e.g. exposures)
ln(u) = b0 + b1x1 + … + ln(# of car years)
Describe the steps to calculate offset
- Calculate unbiased factor 1 - LER
- Rebase factor: Rel = Factor(i) / Factor(base)
- Offset = g(rebased factor)
- Include fixed offsets before running the GLM so that all estimated coefficients for other predictors are optimal in their presence
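A minimal Python sketch of the offset steps above under a log link; the deductible factor values and base level are hypothetical.

```python
import numpy as np

# Hypothetical deductible (1 - LER) factors developed outside the GLM.
factors = {"500": 1.00, "1000": 0.95, "2500": 0.88}
base = "500"

# Rebase so the base deductible has a relativity of 1, then take the log:
# under a log link the offset is g(rebased factor) = ln(rebased factor).
offsets = {ded: np.log(f / factors[base]) for ded, f in factors.items()}

# These fixed offsets are added to each record's linear predictor before
# fitting, so the other coefficients are estimated in their presence.
```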
Describe 3 methods to assess variable significance
- Standard error
Estimated std dev of random process
Small value indicates estimate is expected to be relatively close to true value.
A large value indicates that a wide range of estimates can be achieved through randomness.
- P-value
Probability of obtaining a value at least as extreme as the estimate by pure chance.
H0: Beta(i) = 0
H1: Beta(i) different than 0
A small value indicates that such a coefficient is unlikely to have arisen by pure chance.
- Confidence interval
Gives a range of possible values for a coefficient that would not be rejected at a given p-threshold
95% CI would be based on a 5% p-value
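A hedged Python sketch of these significance measures for a single coefficient, using a normal (Wald) approximation; the function name and example inputs are assumptions.

```python
from scipy.stats import norm

def wald_summary(beta_hat, std_err, alpha=0.05):
    """Wald z-statistic, two-sided p-value and CI for one GLM coefficient."""
    z = beta_hat / std_err
    p_value = 2 * norm.sf(abs(z))              # H0: beta = 0 vs H1: beta != 0
    half_width = norm.ppf(1 - alpha / 2) * std_err
    ci = (beta_hat - half_width, beta_hat + half_width)
    return z, p_value, ci

# e.g. wald_summary(0.25, 0.10) -> small p-value, 95% CI excludes 0
```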
Describe 2 distributions appropriate for severity modelling and their 5 desired characteristics
Gamma and Inverse Gaussian
1. Right-skewed
2. Lower bound at 0
3. Sharp peaked (inv gauss > gamma)
4. Wide tail (inv gauss > gamma)
5. Larger claims have more variance (u^2, u^3)
Describe 2 distributions appropriate for frequency modeling
- Poisson
Dispersion parameter adds flexibility (allows var > mean)
Poisson and ODP will produce the same coefficients, but model diagnostics will change (understated variance distorts std errors and p-values)
- Negative Binomial
Poisson whose mean follows a gamma distribution
Describe 2 characteristics that a frequency error distribution should have
- Non-negative
- Multiplicative relationship fits frequency better than additive relationship
Describe which distribution is appropriate for pure premium / LR modeling and give 3 reasons/desired characteristics
Tweedie:
1. Mass point at zero (lots of insured have no claims)
2. Right-skewed
3. Power parameter allows some other distributions to be special cases (p=0 if normal, p=1 if poisson, p=2 if Gamma, p=3 if Inv Gauss)
What happens when the power parameter of the Tweedie is between 1 and 2
Compound poisson freq & gamma sev
Smoother curve with no apparent spike
Implicit assumption that freq & sev move in same direction (often not realistic but robust enough)
Calculate mean of Tweedie
lambda * alpha * theta
Calculate power parameter of Tweedie
p = (a+2) / (a+1)
Calculate dispersion parameter of Tweedie
lambda^(1-p) * (a*theta)^(2-p) / (2-p)
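A small Python helper (names assumed for illustration) applying the three Tweedie formulas above to Poisson frequency (lambda) and gamma severity (alpha, theta) parameters.

```python
def tweedie_params(lam, alpha, theta):
    """Tweedie mean, power and dispersion from Poisson/gamma parameters."""
    mu = lam * alpha * theta                                      # mean
    p = (alpha + 2) / (alpha + 1)                                 # power (1 < p < 2)
    phi = lam ** (1 - p) * (alpha * theta) ** (2 - p) / (2 - p)   # dispersion
    return mu, p, phi

# e.g. tweedie_params(lam=0.1, alpha=2.0, theta=5000.0)
```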
Identify 3 ways to determine p parameter of Tweedie
- Using model-fitting software (can slow down model)
- Optimization of metric (e.g. log-likelihood)
- Judgmental selection (often practical choice as p tends to have small impact on model estimates)
Describe which distribution is appropriate for probability modeling
Binomial
Use mean as modelled prob of event occurring
Use the logit link; inverting it gives the logistic function:
u = 1/(1+exp(-x)), where x is the linear predictor
Odds = u/(1-u)
It is good practice to log continuous variables before using them in a model.
Explain why and give 2 exceptions.
Forces alignment of the predictor's scale with that of the entity it is predicting. Allows flexibility in fitting different curve shapes.
2 exceptions:
1. Using a year variable to pick up trend effects
2. If variable contains values of 0 (since ln(0) is undefined)
Why do we prefer choosing level with most observations as base level
Otherwise, there will be wide CIs around the coefficient estimates (although the predicted relativities are the same)
Discuss how high correlation between 2 predictor variables can impact GLM
Main benefit of GLM over univariate analysis is being able to handle exposure correlation.
However, GLM run into problems when predictor variables are very highly correlated. This can result in unstable model, erratic coefficients and high standard errors.
Describe 2 options to deal with very high correlation in GLM
- Remove all highly correlated variables except one
This eliminates high correlation in the model, but also potentially loses some unique info contained in the eliminated variables.
- Use dimensionality-reduction techniques such as principal components analysis or factor analysis to create a subset of variables from the correlated variables and use that subset in the GLM.
Downside is the additional time required for that extra analysis.
Describe multicollinearity, its potential impacts and how to detect
Occurs when there is a near-perfect linear dependency among 3 or more predictor variables.
When it exists, the model may become unstable, with erratic coefficients, and may not converge to a solution
One way to detect it is the variance inflation factor (VIF), which measures how much the squared standard error of a predictor's coefficient is increased by collinearity with the other predictors.
A VIF of 10 or more is considered high and indicates the collinearity structure should be examined to determine how best to adjust the model.
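An illustrative numpy-only sketch of the VIF calculation: regress one predictor column on the others and take 1/(1 - R^2). It assumes X holds the predictor columns without an intercept.

```python
import numpy as np

def vif(X, j):
    """Variance inflation factor of column j of predictor matrix X (no intercept)."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    others = np.column_stack([np.ones(len(y)), others])   # add intercept
    coef, *_ = np.linalg.lstsq(others, y, rcond=None)
    resid = y - others @ coef
    r2 = 1 - resid.var() / y.var()
    return 1.0 / (1.0 - r2)

# A VIF of 10 or more is commonly treated as a flag for problematic collinearity.
```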
Describe aliasing
Aliasing occurs when there is a perfect linear dependency among predictor variables (ex: when missing data are excluded)
The GLM will not converge (no unique solution) or if it does, coefficients will make no sense.
Most GLM software will detect this and automatically remove one of the variables.
Identify 2 important limitations of GLM
- Give full credibility to data
Estimated coefficients are not credibility-weighted to recognize low volumes of data or high volatility.
- Assume randomness of outcomes is uncorrelated
This is an issue in 2 cases:
a. Using dataset with several renewals of same policy since likely to have correlated outcomes
b. when data can be affected by weather: likely to cause similar outcomes to risks in same area
Some extensions of GLM (GLMM or GEE) can help account for such correlation in data
List the 9 steps to build a model
Hint: Obeying Simple Directions Elicits Fully Outstanding Very Powerful Model Results
- setting goals and Objectives
- communicate with key Stakeholders
- collect & process Data
- conduct Exploratory data analysis
- specify model Form
- evaluate model Output
- Validate model
- translate model results into Product
- Maintain & Rebuild model
Discuss 2 considerations/potential issues in matching policy and claims
- Matching claims to specific vehicles/drivers or coverages
- Are there timing differences between datasets? How often is each updated? Timing diff can cause record matching problems
- Is there a unique key to merge data (ex: policy number). Potential for orphaned claims or duplicating claims if multiple policy records.
- Level of aggregation before merging, time dimension (CY vs PY), policy level vs claimant level, location level or per risk level
Discuss 2 considerations in modifying (cleaning) data prior to modeling
- check for duplicate records and remove them prior to aggregation
- check categorical fields against documentation (new codes, errors)
- check reasonability of numerical fields (negative premium, outliers)
- Decide how to handle errors and missing values (discard or replace with average values)
- Convert continuous variables to categorical (binning)
Discuss possible data adjustments prior to modeling
- Cap large losses and remove cats
- Develop losses
- On-level premiums
- Trend exposures and losses
- Use a time variable in the model to control for these effects (not as good as the other adjustments), e.g. group ages by range
Why don’t we train and test on same dataset
Doing so would give biased (overly optimistic) estimates of model performance
Adding more variables will always make the model fit the training data better, but it may not fit other datasets better since it begins treating random noise in the data as part of the systematic component (overfitting)
We want to pick as much signal as possible with minimal noise
Describe 3 model testing strategies
- Train & test
Split data into 1 training set and 1 testing set (usually 60/40 or 70/30)
Can split randomly or on time basis
Advantage of time-based split: weather events are not in both datasets, so results are not overly optimistic
- Train, validate & test
Split data into 3 sets (e.g. 40/30/30). The validation set can be used to refine and tweak the model before using the test set.
- Cross-validation
Most common is k-fold.
Pick number k and split data into k groups
For each fold, train model using k-1 folds and test model using kth fold.
Tends to be superior since more data is used in both training and testing, but it is extremely time-consuming
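A minimal numpy sketch of k-fold cross-validation splitting; the fold count and seed are arbitrary choices for the example.

```python
import numpy as np

def k_fold_indices(n_records, k=5, seed=0):
    """Split record indices into k roughly equal folds."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_records)
    return np.array_split(idx, k)

folds = k_fold_indices(10_000, k=5)
for i, test_idx in enumerate(folds):
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    # fit the GLM on train_idx, evaluate on test_idx
```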
Identify 4 advantages of modeling freq and sev separately over pure premium
- Gain more insight and intuition about impact of each predictor variable
- Each is more stable (variable that only impacts freq will look less significant in pure premium model)
- PP modeling can lead to overfitting if a predictor variable impacts only freq or only sev, since the randomness of the other component may be mistaken for signal
- Tweedie distribution assumes freq and sev move in the same direction, which may not be true
Identify 2 disadvantages of modeling freq and sev separately
- Requires data to be available
- Takes more time to build 2 models
Identify 4 considerations in variable selection
- Significance: we want to be confident effect of var is result of true relationship between predictor and target and not due to noise in data
- Cost-effectiveness of collecting data for variable
- Conformity with actuarial standards of practice and regulatory requirements
- Whether the quotation system can include the variable
How can we calculate partial residuals
ri = (yi - ui)g’(ui) + beta(j)xij
if log link:
(yi - ui)/ui + beta(j)*xij
They can then be plotted against x_j, and the line y = beta(j)*x_j can be drawn to see how well it matches the residual points
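A short Python sketch of partial residuals under a log link, following the formula above; function and argument names are illustrative.

```python
import numpy as np

def partial_residuals_log_link(y, mu, beta_j, x_j):
    """Partial residuals r_ij = (y_i - mu_i) * g'(mu_i) + beta_j * x_ij,
    with g'(mu) = 1/mu under a log link."""
    return (y - mu) / mu + beta_j * x_j

# Plot these against x_j and overlay the line y = beta_j * x_j to judge
# whether the fitted linear effect matches the data.
```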
If systematic deviation of residuals from line is observed, what do we do?
We will want to transform the predictor variable in one of 4 ways:
- Binning the variable: turning it into a categorical variable with separate bins
Removes the need to constrain the fit to any particular shape, but increases degrees of freedom, which can result in inconsistent or impractical patterns
- Add polynomial terms (ex: x^2, x^3)
Loss of interpretability without a graph, and can behave erratically at the edges of the data
- Add piecewise linear functions (ex: hinge functions)
Allows fit of a wide range of non-linear patterns and is easy to interpret, but breakpoints must be chosen manually and degrees of freedom increase
- Add natural cubic splines
Combines piecewise functions with polynomials, which better fits the data with a smooth curve, but a graph is needed to interpret
When do we want to use interaction variables
When the effect of one predictor depends on the level of another predictor (ex: gender affected losses only below a certain age)
Identify 4 advantages of centering at base level
- Other coefficients are easier to interpret, particularly true if interaction terms exist
- Intercept becomes more intuitive as avg frequency at base level
- Avoids counter-intuitive signs of coefficients when interaction
- Lower p-values for variable significance (tighter CI)
Describe 2 measures used in diagnostic tests for overall model fit
- Log-likelihood
Log of the product of likelihoods over all observations (= sum of log-likelihoods)
- Scaled deviance
D* = 2(ll_saturated - ll_model)
Unscaled deviance D = dispersion parameter × D*
GLMs are fit by maximizing ll so D is minimized
Describe the 2 conditions for validity of ll & D comparisons
- Same dataset is used to fit the 2 models
- Assumed distribution and dispersion param same for the 2 models
Describe 2 options to compare candidate models using ll & D
- F-Test
F = (D_S - D_B) / (# of added parameters × phi_B)
D_S is the unscaled deviance of the smaller model
D_B is the unscaled deviance of the bigger model
phi_B is the dispersion parameter of the bigger model
If F > F-table value (df_num = # of added parameters, df_denom = n - p_B, where n = # of observations and p_B = # of parameters in bigger model), we prefer the bigger model
For non-nested models, use penalized measures of fit:
- AIC = -2ll + 2p
- BIC = -2ll + p*ln(n)
Lower is better
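A hedged Python sketch of the nested-model F-test and the penalized fit measures above; argument names are assumptions and scipy's F distribution supplies the table value.

```python
import numpy as np
from scipy.stats import f as f_dist

def f_test(D_small, D_big, phi_big, added_params, n_obs, p_big, alpha=0.05):
    """Nested-model F-test on unscaled deviances."""
    f_stat = (D_small - D_big) / (added_params * phi_big)
    critical = f_dist.ppf(1 - alpha, dfn=added_params, dfd=n_obs - p_big)
    return f_stat, critical        # f_stat > critical favours the bigger model

def aic(loglik, p):
    return -2 * loglik + 2 * p

def bic(loglik, p, n_obs):
    return -2 * loglik + p * np.log(n_obs)
```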
Describe 3 measures of deviation of actual from predicted
- Raw residual (ex: yi - ui)
- Deviance residuals
d_i = sign(y_i - u_i) * sqrt(2*phi*[ln f(y_i | u_i = y_i) - ln f(y_i | u_i)])
Residual adjusted for the shape of the GLM distribution
Should follow a normal distribution with no predictable pattern
Homoscedasticity: normally distributed with constant variance
- Working residuals
wri = (yi - ui)*g’(ui)
if log link wri = (yi - ui)/ui
if logit link: wri = (yi - ui) / [ui(1-ui)]
Critical to bin residuals before analysis
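A minimal numpy sketch of working residuals under a log link and the binning step noted above; the bin count is arbitrary.

```python
import numpy as np

def working_residuals_log_link(y, mu):
    """Working residuals under a log link: (y - mu) / mu."""
    return (y - mu) / mu

def bin_residuals(resid, mu, n_bins=50):
    """Average working residuals within bins of predicted value
    (individual residuals are too noisy to read directly)."""
    order = np.argsort(mu)
    bins = np.array_split(order, n_bins)
    return np.array([resid[b].mean() for b in bins])
```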
Identify 3 options to measure model stability
- Cook’s distance
Higher indicates higher level of influence.
Records with the highest values should be given additional scrutiny as to whether they should be included.
- Cross-validation
Compare in-sample parameter estimates across model runs. Model should produce similar results when run on separate subsets of the initial dataset.
- Bootstrapping
Create new datasets with the same number of records by randomly sampling with replacement from the original dataset.
The model can then be refit on the different datasets to get statistics such as the mean and variance of each parameter estimate.
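A small numpy sketch of the bootstrapping step: resample record indices with replacement, then refit on each resampled dataset; sample counts and seed are illustrative.

```python
import numpy as np

def bootstrap_indices(n_records, n_samples=100, seed=0):
    """Resample record indices with replacement to build bootstrap datasets."""
    rng = np.random.default_rng(seed)
    return [rng.integers(0, n_records, size=n_records) for _ in range(n_samples)]

# Refit the GLM on each resampled dataset, then summarize the mean and
# variance of each coefficient across the refits to gauge stability.
```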
State 2 reasons why model refinement techniques may not be appropriate for model selection
- Some of the models may be proprietary
Info on the data & detailed model form needs to be available to evaluate a model
- Final decision is often a business call
Those deciding may know nothing about predictive modeling and actuarial science
Briefly explain scoring
Scoring is the process of using models to make predictions from individual records
It can be used in model selection
Should always score on holdout sample
Then we can use techniques for model selection
Describe 2 techniques for model selection
- Plot actual vs predicted
The closer the points are to the line y=x, the better the prediction
- Lift-based measures:
Model lift = economic value of model/ability to prevent adverse selection
Attempts to visually demonstrate model’s ability to charge each insured an actuarially sound rate
Requires 2 or more competing models
List 4 lift-based measures for model selection
- Simple Quintile Plots
- Double Lift Charts
- Loss Ratio Charts
- Gini Index
Briefly explain the Simple Quintile Plots measure for model selection.
For each model:
a. Sort holdout dataset based on model’s predicted loss cost
b. Bucket data into quantiles having equal exposures
c. Calculate average predicted loss cost & avg actual loss cost for each bucket and plot them on graph.
Winning model should be based on 3 criteria:
1. Predictive accuracy
Average predicted loss cost should be consistently close to the average actual loss cost in each quantile
2. Monotonicity
Actual PP should consistently increase across all quantiles
3. Vertical distance of actuals between 1st and last quantile
Indicates how well the model distinguishes between the best and worst risks.
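A minimal Python sketch of building a quantile plot's inputs: sort by predicted loss cost, bucket, and compare average predicted vs. actual; it assumes equal-exposure records for simplicity.

```python
import numpy as np

def quantile_lift(predicted, actual, n_buckets=5):
    """Average predicted vs. actual loss cost by predicted-loss-cost quantile
    (assumes roughly equal-exposure records)."""
    order = np.argsort(predicted)
    buckets = np.array_split(order, n_buckets)
    avg_pred = [predicted[b].mean() for b in buckets]
    avg_actual = [actual[b].mean() for b in buckets]
    return avg_pred, avg_actual  # plot both; check accuracy, monotonicity, spread
```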
Briefly explain the double lift chart measure for model selection
Compares 2 models on the same graph
Sort holdout data by the ratio of the two models' predicted loss costs, bucket into quantiles, and plot each model's average predicted loss cost and the average actual loss cost
Winning model is the one that best matches actual in each quantile
Briefly explain the Loss Ratio Charts measure for model selection
Generally easier to understand since LR is the most commonly-used metric in insurance profitability
The greater the vertical distance between the lowest and highest LRs, the better the model does at identifying segmentation opportunities not present in the current plan.
We want LRs not equal and increasing monotonically.
Briefly explain the Gini Index measure for model selection
Quantifies ability to identify best and worst risks
Gini index = 2 × area between the Lorenz curve and the line of equality
Higher value = better
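An illustrative numpy sketch of the Gini index: order records from best to worst predicted risk, build the Lorenz curve of cumulative losses vs. cumulative exposures, and take twice the area between it and the line of equality.

```python
import numpy as np

def gini_index(predicted, actual, exposure):
    """Gini index from numpy arrays of predicted loss cost, actual losses, exposure."""
    order = np.argsort(predicted)                     # best predicted risks first
    cum_exposure = np.append(0.0, np.cumsum(exposure[order]) / exposure.sum())
    cum_losses = np.append(0.0, np.cumsum(actual[order]) / actual.sum())
    area_under_lorenz = np.trapz(cum_losses, cum_exposure)
    return 2 * (0.5 - area_under_lorenz)              # higher = better segmentation
```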
How will changing the discrimination threshold impact the number of TP, FP, FN and TN in a logistic model
Decreasing discrimination threshold = more true positives and more false positives since more will be investigated
Define sensitivity
ratio of true positives to total event occurrences
also called true positive rate or hit rate
Define specificity
ratio of true negatives to total event non-occurrences
false positive rate = 1 - specificity
Describe ROC curve
All possible combinations of sensitivity and 1-specificity for different discrimination thresholds
Helps determine a target threshold (lower for large risks since we want to spend more time investigating them)
AUROC is area under ROC curve
The higher the AUROC, the better
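A minimal Python sketch that traces ROC points (sensitivity vs. 1 - specificity) across candidate discrimination thresholds; inputs and names are illustrative.

```python
import numpy as np

def roc_points(scores, labels, thresholds):
    """Sensitivity and (1 - specificity) at each discrimination threshold."""
    points = []
    for t in thresholds:
        flagged = scores >= t
        tp = np.sum(flagged & (labels == 1))
        fp = np.sum(flagged & (labels == 0))
        sensitivity = tp / np.sum(labels == 1)   # true positive rate (hit rate)
        fpr = fp / np.sum(labels == 0)           # 1 - specificity
        points.append((fpr, sensitivity))
    return points  # plot to trace the ROC curve; AUROC is the area under it
```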
List 3 purposes of model documentation
- Opportunity to check your work for errors and improve communication skills
- Transfer knowledge to others that maintain or rebuild model
- Comply with stakeholder demands (ASOP41)
Identify 4 items to include in model documentation
- Everything needed to reproduce model (from source data to model output)
- All assumptions and justifications for all decisions
- All data issues encountered and resolution
- Any reliance on external models or external stakeholders
- Model performance, structure and shortcomings
- Compliance with ASOP41 or local actuarial standards on communication
Why should coverage-related variables be priced outside of GLM and included in offset terms
Examples: deductibles, limits, covered perils
Can give counter-intuitive results in GLM such as indicating lower rate for more coverage.
Could be due to correlation with other variables outside of model, including possible selection effects (insured self-selecting higher limits since they know they are higher risks)
Describe how territories can be priced in conjunction with GLMs
Challenging due to their large number and aggregating them may cause you to lose important information.
Relativities developed with techniques like spatial smoothing can be included in the GLM as offset terms.
Territory model should also be offset for rest of classification plan = iterative process until each model converges to acceptable range
Discuss how ensemble models can improve performance of single model
Instead of choosing a single model from 2 or more, models can be combined into an ensemble of models (ex: avg of predictions = balances predictions)
Only works when model errors are as uncorrelated as possible, which happens when built by different people with little or no sharing.
Define intrinsic aliasing
Aliasing that arises from dependencies inherent in the definition of the covariates, e.g. when an indicator covariate is created for every level of a categorical variable (the indicators sum to 1, so one level must be dropped as the base)
Provide 2 arguments against the inclusion of deductible as predictor in GLM analysis.
- Coverage variables in GLMs can give counter-intuitive results, such as indicating lower rate for more coverage.
- Charging rates for coverage options that reflect anything other than pure loss elimination could lead to changes in insured behaviour, which means indicated rates based on past experience will no longer be appropriate for new policies.
A variable with second-order polynomial adds how many degrees of freedom?
2
Describe how the exclusion of missing data may cause problems for the company in developing the model, and suggest a solution.
Missing data can lead to extrinsic aliasing.
This occurs when there are linear dependencies in the observed data because of the nature of the data.
Ex: the missing level for prior auto policy will be perfectly correlated with the missing level for home policy.
This can lead to convergence problems or confusing results
A solution would be to exclude these missing data records, or to reclassify them to an appropriate level
How do you consider multiple offsets
They can simply be added together into a total offset
List 5 shortcomings of GLMs and propose a solution for each.
- By default, predictions are based on linear relationship to predictors.
Solution: GAMs or MARS
- Unstable when data is thin or when predictors are highly correlated.
Solution: GLMMs (thin data), Elastic Net GLMs (correlated predictors)
- Full credibility given to data for each coefficient regardless of thinness of data.
Solution: GLMMs, Elastic Net GLMs
- Random component of outcome is assumed to be uncorrelated among observations.
Solution: GLMMs
- Scale parameter must be constant across all observations.
Solution: DGLMs
Name a disadvantage of Neural Nets to address GLM shortcomings.
Make model harder to interpret and explain.
Name 5 GLM variations that can address shortcomings while maintaining interpretability of model.
- Generalized Linear Mixed Models (GLMMs)
- Double Generalized Linear Models (DGLMs) aka GLMs with Dispersion Modeling
- Generalized Additive Models (GAMs)
- Multivariate Adaptive Regression Splines (MARS)
- Elastic Net GLMs
Briefly describe Generalized Linear Mixed Models (GLMMs)
GLMM allows some coefficients to be modelled as random variables themselves.
Predictor variables are split into random and fixed effects.
All levels of categorical random effect variables are included as predictors (no base level)
g(u) = b0 + b1x1 + … + lambda1z1 + lambda2z2 + …
Explain the shrinkage effect in GLMMs
Resulting estimate of each lambda will be closer to the grand mean prediction compared to full credibility estimates from a GLM.
The less data, the closer to the grand mean.
Explain the 2 steps to fit GLMM
- betas and dispersion parameters are estimated, lambda distribution is assumed.
- Lambda parameters are estimated.
Name another utility of GLMMs
Can also be used to account for distortion in GLM results from correlation among observations in data.
Ex: if data had multiple renewals of same policy, we could add policy ID as random effect.
Briefly describe Double-Generalized Linear Models (aka GLMs with Dispersion Modeling)
DGLMs allow the scale parameter phi to vary by observation, with phi_i modelled as a linear combination of predictors.
This gives less weight to volatile data and more weight to stable data (ignore more noise and pick up more signal)
Describe 3 cases where DGLMs are particularly useful
- Some classes of business are more volatile than others
- You care about accurately predicting the variance (full distribution) of each observation, not just the mean
- When using the Tweedie distribution to model PP or loss ratios, DGLMs provide flexibility for underlying frequency and severity to move in opposite directions
How can we estimate DGLMs?
Can be estimated using statistical software.
If the target distribution is Tweedie, regular GLM software can be used through an iterative process.
Provide a downside of GAMs.
GAMs do not provide a simple coefficient for a predictor; its relationship to the target variable must be examined graphically.
Describe Generalized Additive Models (GAMs)
GAMs allow for non-linear effects in the model.
g(u) = b0 + f1(xi1) + f2(xi2) + … + fp(xip)
Shape of f( ) would be determined by modeling software.
Be careful: allowing too much flexibility will risk overfitting.
Briefly explain MARS models
MARS allow us to incorporate non-linearity by using piecewise linear (hinge) functions.
The model will choose functions and cut points automatically.
Be careful: allowing too much flexibility will risk overfitting.
2 useful features of MARS
- Unlike GLM, MARS performs its own variable selection (will only keep those that are significant)
- Can be used to find non-linear transformations or interactions
Briefly describe Elastic Net GLMs
They are specified the same as regular GLMs, but when fitting the model, a penalty based on the magnitude of the coefficients is added to the deviance (which we try to minimize).
Provide a powerful means against overfitting even in presence of many predictors.
Deviance + lambda(a * sum of abs(b) + (1-a) * sum of b^2)
b are coefficients excl. intercept b0
lambda is the tuning parameter controlling the severity of the penalty. As lambda increases, the penalty becomes more severe and coefficients shrink toward 0.
a is the weight parameter.
sum of abs(b) is penalty in lasso models
sum of b^2 is penalty in ridge models
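A small Python sketch of the elastic net penalty term above; lambda and a are the tuning and weight parameters, and the coefficient vector excludes the intercept.

```python
import numpy as np

def elastic_net_penalty(betas, lam, a):
    """Penalty added to the deviance: lambda * (a * L1 + (1 - a) * L2),
    where betas excludes the intercept."""
    l1 = np.sum(np.abs(betas))     # lasso component
    l2 = np.sum(betas ** 2)        # ridge component
    return lam * (a * l1 + (1 - a) * l2)

# a = 1 gives a lasso penalty, a = 0 gives a ridge penalty; a larger lambda
# shrinks coefficients more strongly toward zero.
```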
3 advantages and 1 disadvantage of Elastic Net GLMs
Advantages:
1. Equivalent to using credibility weighting for coefficients
2. Can perform automatic variable selection
3. Perform better than GLMs when predictors highly correlated
Disadvantage:
Computational complexity, especially with large datasets.