A2. GLM Flashcards
3 advantages of using a log link function for ratemaking
- Simple and practical to implement
- Guarantees positive premiums
- Impact of risk characteristics is more intuitive (multiplicative rating structure)
2 uses of offset terms
- Incorporate pre-determined values for certain variables
- When the target variable varies directly with a particular measure (e.g., exposure)
2 solutions to deal with correlation
- Remove all except one (but could lose unique information)
- Use a dimensionality reduction technique like PCA/factor analysis (but takes additional time)
Problem with correlation among variables
Can produce an unstable model with erratic coefficients that have high standard errors
2 uses of weights assigned to each observation
- When an observation contains grouped information (e.g., an average over several records)
- When different observations represent different time periods
Define multicollinearity
A near-perfect linear dependency between two or more predictor variables
How to detect multicollinearity
Use the variance inflation factor (VIF)
A VIF >= 10 is considered high
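A minimal sketch of the VIF check using statsmodels' variance_inflation_factor; the column names and simulated data are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools import add_constant

# Hypothetical predictors; replace with the actual GLM design matrix.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "driver_age": rng.normal(45, 12, 1000),
    "vehicle_age": rng.normal(8, 3, 1000),
})
# A predictor built from vehicle_age, to illustrate a high VIF.
X["vehicle_value"] = 30 - 2 * X["vehicle_age"] + rng.normal(0, 1, 1000)

X_const = add_constant(X)  # include an intercept column before computing VIFs
vifs = {col: variance_inflation_factor(X_const.values, i)
        for i, col in enumerate(X_const.columns) if col != "const"}
print(vifs)  # flag any predictor with VIF >= 10
```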
Define aliasing
A perfect linear dependency between predictor variables
The GLM will not converge
2 GLM limitations
- GLMs give full credibility to the estimates, even with a low volume of data or high volatility
- GLMs assume the randomness of outcomes is uncorrelated (e.g., renewals of the same policy, weather events)
4 Advantages of modeling frequency/severity over pure premium
- Gain more insight and intuition about the impact of each predictor variable
- Each of frequency and severity is more stable when modeled separately
- Pure premium modeling can lead to overfitting if a predictor variable impacts only frequency or only severity, but not both
- The Tweedie distribution used for pure premium models assumes frequency and severity move in the same direction (which may not be true)
2 disadvantages of modeling frequency/severity over pure premium
- Requires more data
- Takes more time to build two models
4 ways to transform variables in GLM
- Bin the variable (adds parameters to estimate, using more degrees of freedom, which may lead to overfitting; may produce inconsistent or impractical patterns; variation within bins is ignored)
- Add polynomial terms (loss of interpretability without a graph; higher-order polynomials can behave erratically at the edges of the data)
- Add piecewise linear terms (add a hinge function max(0, Xj - c) at each break point c; break points c must be chosen manually; see the sketch after this list)
- Natural cubic splines (combine piecewise functions and polynomials; produce a continuous curve that fits the edges of the data better, but a graph is needed to interpret the model)
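A minimal sketch of the piecewise linear (hinge) transform; the variable name and break points are hypothetical and would normally be chosen by inspecting the data.

```python
import numpy as np
import pandas as pd

def hinge(x, c):
    """Piecewise-linear basis term max(0, x - c) for break point c."""
    return np.maximum(0.0, x - c)

# Hypothetical continuous rating variable with manually chosen break points.
df = pd.DataFrame({"driver_age": np.arange(16, 90, dtype=float)})
for c in (25, 65):  # assumed break points
    df[f"driver_age_hinge_{c}"] = hinge(df["driver_age"], c)
# df now holds the base term plus the hinge terms, ready for the GLM design matrix.
```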
Why is model selection different from model refinement
- Some candidate models may be proprietary
- The decision on the final model may be a business decision rather than a technical one
3 methods to test model stability
- Cook’s distance for individual records (records with a high Cook’s distance should get additional scrutiny on whether to include them)
- Cross-validation, comparing the parameter estimates across the in-sample model fits
- Bootstrapping, comparing the mean and variance of the parameter estimates across resamples (see the sketch below)
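A minimal sketch of the bootstrapping check using statsmodels; the Poisson frequency model, column names, and simulated data are assumptions.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical frequency data set.
rng = np.random.default_rng(1)
data = pd.DataFrame({
    "driver_age": rng.uniform(18, 80, 2000),
    "exposure": rng.uniform(0.2, 1.0, 2000),
})
data["claim_count"] = rng.poisson(0.1 * data["exposure"])

def fit_coefs(df):
    """Fit a Poisson frequency GLM with a log-exposure offset and return its coefficients."""
    model = smf.glm("claim_count ~ driver_age", data=df,
                    family=sm.families.Poisson(),
                    offset=np.log(df["exposure"]))
    return model.fit().params

# Refit on bootstrap resamples and compare the mean and spread of each coefficient.
boot = pd.DataFrame([
    fit_coefs(data.sample(frac=1.0, replace=True).reset_index(drop=True))
    for _ in range(50)
])
print(boot.agg(["mean", "std"]))  # a large std relative to the mean suggests instability
```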
4 lift-based measures
- Simple quantile plots
- double lift chart
- loss ratio charts
- Gini index
Describe double lift charts
Calculate the sort ratio (sort ratio = model 1 predicted loss cost / model 2 predicted loss cost)
Sort by the sort ratio and bucket into quantiles
Calculate the average predicted loss cost for each model and the average actual loss cost in each quantile; divide each by the overall average loss cost and plot
Winning model: the one that best matches the actual loss cost in each quantile
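A minimal sketch of building a double lift table in pandas; the column names and simulated predictions are assumptions, and exposure weighting is omitted for brevity.

```python
import numpy as np
import pandas as pd

def double_lift(df, n_bins=10):
    """Bucket by the sort ratio and return each model's average predicted loss cost
    and the average actual loss cost, scaled by the overall average actual."""
    out = df.copy()
    out["sort_ratio"] = out["pred_model1"] / out["pred_model2"]
    out["bucket"] = pd.qcut(out["sort_ratio"].rank(method="first"), n_bins, labels=False)
    table = out.groupby("bucket")[["pred_model1", "pred_model2", "actual"]].mean()
    return table / out["actual"].mean()

# Example usage with simulated predictions (assumed column names).
rng = np.random.default_rng(2)
df = pd.DataFrame({"actual": rng.gamma(2.0, 500, 5000)})
df["pred_model1"] = df["actual"] * rng.lognormal(0, 0.3, 5000)
df["pred_model2"] = df["actual"] * rng.lognormal(0, 0.5, 5000)
print(double_lift(df))  # the model that tracks the actual column most closely wins
```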
Describe simple quantile plots
Sort data based on predicted loss costs
Then bucket into quantiles with equal exposures
Calculate average predicted loss cost & average actual loss cost for each bucket and graph
Winning model: predictive accuracy (small differences between actual and predicted) and monotonicity (the actual pure premium should increase across quantiles)
The vertical distance in actual loss cost between the first and last quantiles should be large
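A minimal sketch of building the quantile plot table with equal-exposure buckets; the column names and simulated holdout data are assumptions.

```python
import numpy as np
import pandas as pd

def quantile_table(df, n_bins=5):
    """Sort by predicted loss cost, cut into equal-exposure buckets, and return
    the exposure-weighted average predicted and actual loss cost per bucket."""
    out = df.sort_values("predicted").copy()
    cum_expo = out["exposure"].cumsum() / out["exposure"].sum()
    out["bucket"] = np.minimum((cum_expo * n_bins).astype(int), n_bins - 1)
    return out.groupby("bucket").apply(lambda g: pd.Series({
        "avg_predicted": np.average(g["predicted"], weights=g["exposure"]),
        "avg_actual": np.average(g["actual"], weights=g["exposure"]),
    }))

# Assumed columns: predicted/actual loss cost and exposure on a holdout set.
rng = np.random.default_rng(3)
df = pd.DataFrame({"exposure": rng.uniform(0.5, 1.0, 5000)})
df["actual"] = rng.gamma(2.0, 300, 5000)
df["predicted"] = df["actual"] * rng.lognormal(0, 0.4, 5000)
print(quantile_table(df))  # check accuracy, monotonicity, and first-to-last lift
```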
Describe loss ratio charts
Sort based on predicted LRs
Bucket into quantiles with equal exposures
Calculate the actual LRs for each quantile & plot
The greater the vertical distance between the lowest and highest LRs, the better the model is at identifying segmentation opportunities not captured by the current rating plan
The LR should increase monotonically across quantiles
This is the easiest measure to understand
Describe Gini index
Measures the model's ability to identify the best and worst risks
Sort holdout dataset based on the predicted loss cost.
Plot the cumulative percent of exposures on the x-axis and the cumulative percent of actual losses on the y-axis
The curve formed is the Lorenz Curve.
Compare it with the line of equality
The Gini index is twice the area between the Lorenz Curve and the line of equality
The higher the Gini index, the better
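A minimal sketch of the Gini index calculation from a Lorenz curve, following the steps above; the simulated holdout data is an assumption.

```python
import numpy as np

def gini_index(exposure, predicted, actual_loss):
    """Sort by predicted loss cost, build the Lorenz curve (cumulative % of exposure
    vs. cumulative % of actual losses), and return twice the area between it and
    the line of equality."""
    order = np.argsort(predicted)
    x = np.concatenate([[0.0], np.cumsum(exposure[order]) / exposure.sum()])
    y = np.concatenate([[0.0], np.cumsum(actual_loss[order]) / actual_loss.sum()])
    lorenz_area = np.sum(np.diff(x) * (y[1:] + y[:-1]) / 2)  # trapezoid rule
    return 2 * (0.5 - lorenz_area)

# Hypothetical holdout data.
rng = np.random.default_rng(4)
exposure = rng.uniform(0.5, 1.0, 10000)
actual = rng.gamma(1.5, 400, 10000)
predicted = actual * rng.lognormal(0, 0.5, 10000)
print(gini_index(exposure, predicted, actual))  # higher is better
```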
Sensitivity formula
true positives/total event occurrences
Specificity formula
true negatives/total event non-occurrences
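A minimal sketch of both formulas from 0/1 arrays of actual and predicted event outcomes; the example data is hypothetical.

```python
import numpy as np

def sensitivity_specificity(actual, predicted):
    """actual and predicted are 0/1 arrays (1 = event occurred / event predicted)."""
    actual, predicted = np.asarray(actual), np.asarray(predicted)
    tp = np.sum((predicted == 1) & (actual == 1))
    tn = np.sum((predicted == 0) & (actual == 0))
    sensitivity = tp / np.sum(actual == 1)  # true positives / total event occurrences
    specificity = tn / np.sum(actual == 0)  # true negatives / total event non-occurrences
    return sensitivity, specificity

print(sensitivity_specificity([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))  # (0.667, 0.5)
```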
Partial Residual formula
r_i = (y_i - mu_i) * g'(mu_i) + Beta_j * x_ij
Scaled Deviance formula
2 x (log-likelihood of the saturated model - log-likelihood of the model)
Unscaled Deviance formula
Dispersion parameter x scaled deviance
F statistic formula
[Unscaled deviance(small) - Unscaled deviance(big)] / [(# of parameters added) x (dispersion parameter of the big model)]
Degrees of freedom of the F statistic
df of the numerator: # of parameters added
df of the denominator: # of observations - # of parameters in the big model
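A minimal sketch of the deviance-based F test using statsmodels; the gamma severity model, column names, and simulated data are assumptions, and statsmodels' .deviance is taken as the unscaled deviance.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical severity data.
rng = np.random.default_rng(5)
df = pd.DataFrame({
    "driver_age": rng.uniform(18, 80, 3000),
    "vehicle_age": rng.uniform(0, 20, 3000),
})
df["severity"] = rng.gamma(2.0, 1000 + 5 * df["driver_age"])

family = sm.families.Gamma(link=sm.families.links.Log())
small = smf.glm("severity ~ driver_age", df, family=family).fit()
big = smf.glm("severity ~ driver_age + vehicle_age", df, family=family).fit()

# .deviance is the unscaled deviance; .scale is the estimated dispersion parameter.
added = big.df_model - small.df_model            # number of parameters added
f_stat = (small.deviance - big.deviance) / (added * big.scale)
df_num, df_den = added, big.df_resid             # df_resid = observations - parameters in the big model
print(f_stat, df_num, df_den)
```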
working residual formula
wr_i = (y_i - mu_i) * g'(mu_i)
How does binning working residuals work
Bin to reduce clutter in the scatterplot
Working weights: ww_i = w_i / [V(mu_i) x g'(mu_i)^2]
Binned working residual = weighted average of the working residuals in the bin, using the working weights
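A minimal sketch of binning working residuals for a fitted Poisson frequency model in statsmodels; the column names, simulated data, and unit prior weights are assumptions.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical fitted frequency model.
rng = np.random.default_rng(6)
df = pd.DataFrame({"driver_age": rng.uniform(18, 80, 5000)})
df["claim_count"] = rng.poisson(np.exp(-2.5 + 0.01 * df["driver_age"]))
fit = smf.glm("claim_count ~ driver_age", df, family=sm.families.Poisson()).fit()

mu = fit.fittedvalues
family = fit.model.family

# wr_i = (y_i - mu_i) * g'(mu_i);  ww_i = w_i / [V(mu_i) * g'(mu_i)^2]
g_prime = family.link.deriv(mu)
wr = (df["claim_count"] - mu) * g_prime
ww = 1.0 / (family.variance(mu) * g_prime ** 2)  # prior weights w_i taken as 1

# Bin by the predictor and take the working-weight-weighted average residual per bin.
bins = pd.qcut(df["driver_age"], 20)
binned = wr.groupby(bins).apply(lambda r: np.average(r, weights=ww[r.index]))
print(binned.head())
```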
Model building process
- set goals & objectives (such as purpose, time frame, resources)
- communicate with key stakeholders to identify the requirements and concerns
- Collect and process the data
- Exploratory data analysis
- Specify the form of the model
- Evaluate the model output
- validate the model by testing it using a holdout dataset and pick optimal model
- translate the model results into a product
- maintain and rebuild
Merging policy and claim data considerations
- Matching claims to the specific policy
- Check for timing differences in when each data source is updated
- Is there a unique key to merge on
- Level of aggregation and time dimension (PY vs. CY)
- fields not needed or missing
things to check when cleaning the data
- remove duplicates
- check categorical fields against documentation
- Reasonability of numerical fields
- How to handle errors or missing values
- Whether to convert continuous variables into categorical ones