A2. GLM Flashcards

1
Q

3 advantages of using a log link function for ratemaking

A
  1. Simple and practical to implement
  2. Guarantees positive premiums
  3. Impact of risk characteristics is more intuitive (rating factors become multiplicative)
2
Q

2 uses of offset terms

A
  1. Incorporate pre-determined values for certain variables
  2. Account for a measure that the target variable varies directly with (e.g., exposure)
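A minimal sketch of the second use in Python with statsmodels, assuming a Poisson claim-count model with a log link; the data and column names (claim_count, exposure, territory) are hypothetical:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical policy-level data: claim counts vary directly with exposure.
df = pd.DataFrame({
    "claim_count": [0, 1, 0, 2, 1, 0],
    "exposure":    [0.5, 1.0, 0.75, 2.0, 1.0, 0.25],
    "territory":   ["A", "B", "A", "B", "A", "B"],
})

# With a log link, ln(exposure) enters as an offset: a term whose
# coefficient is fixed at 1 rather than estimated.
model = smf.glm(
    "claim_count ~ territory",
    data=df,
    family=sm.families.Poisson(),          # log link is the Poisson default
    offset=np.log(df["exposure"]),
).fit()
print(model.summary())
```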
3
Q

2 solutions to deal with correlation

A
  1. Remove all correlated variables except one (but could lose unique info)
  2. Use a dimensionality reduction technique like PCA or factor analysis (but takes additional time)
4
Q

Problem with correlations among variables

A

Could produce an unstable model with erratic coefficients that have high standard errors

5
Q

2 uses of weight assigned to each observation

A
  1. When an observation contains grouped information (e.g., the record is an average over several claims)
  2. When different observations represent different time periods
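A sketch of the first use, assuming severity records that are averages over groups of claims; in a recent statsmodels the group size can be passed as var_weights (the data and column names are hypothetical):

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical grouped data: each row is the average severity over n_claims
# claims, so rows backed by more claims carry more information.
df = pd.DataFrame({
    "avg_severity": [1200.0, 950.0, 3100.0, 1800.0],
    "n_claims":     [10, 3, 1, 25],
    "vehicle_use":  ["pleasure", "commute", "business", "commute"],
})

# Weight each observation by the number of claims it represents.
model = smf.glm(
    "avg_severity ~ vehicle_use",
    data=df,
    family=sm.families.Gamma(link=sm.families.links.Log()),
    var_weights=df["n_claims"],
).fit()
print(model.params)
```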
6
Q

define multicollinearity

A

A nearly perfect linear dependency among two or more predictor variables

7
Q

how to detect multicollinearity

A

Use the VIF (variance inflation factor)
VIF >= 10 is considered high
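A quick way to compute VIFs in Python with statsmodels; the predictors here are simulated and hypothetical, with vehicle_age and vehicle_value deliberately correlated:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Simulated predictors: vehicle_value is largely driven by vehicle_age.
rng = np.random.default_rng(0)
age = rng.uniform(0, 15, 500)
X = pd.DataFrame({
    "vehicle_age":   age,
    "vehicle_value": 40 - 2 * age + rng.normal(0, 3, 500),
    "driver_age":    rng.uniform(18, 80, 500),
})
X = sm.add_constant(X)

# VIF for each predictor (skip the intercept); VIF >= 10 flags severe collinearity.
vifs = {col: variance_inflation_factor(X.values, i)
        for i, col in enumerate(X.columns) if col != "const"}
print(vifs)
```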

8
Q

define aliasing

A

A perfect linear dependency among predictor variables
The GLM will not converge

9
Q

2 GLM limitations

A
  1. GLMs give full credibility to the data, even when the volume is low or the volatility is high
  2. GLMs assume the randomness of outcomes is uncorrelated, which may not hold (e.g., renewals of the same policy, weather events)
10
Q

4 Advantages of modeling frequency/severity over pure premium

A
  1. Gain more insight and intuition about the impact of each predictor variable
  2. Each of frequency and severity is more stable when modeled separately
  3. Pure premium modeling can lead to overfitting if a predictor variable impacts only frequency or only severity, but not both
  4. The Tweedie distribution used for pure premium models assumes frequency and severity move in the same direction (which may not be true)
11
Q

2 disadvantages of modeling frequency/severity over pure premium

A
  1. Requires more data
  2. Takes more time to build two models
12
Q

4 ways to transform variables in GLM

A
  1. Bin the variable (uses more degrees of freedom, so there is more to estimate, which may lead to overfitting; may result in inconsistent or impractical patterns; variation within bins is ignored)
  2. Add polynomial terms (loss of interpretability without a graph; higher-order polynomials can behave erratically at the edges of the data)
  3. Add piecewise linear functions (add a hinge function max(0, Xj - C) at each break point C; the break points C must be chosen manually; see the sketch after this list)
  4. Natural cubic splines (combine piecewise functions and polynomials; produce a continuous curve that fits the edges of the data better, but a graph is needed to interpret the model)
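A sketch of the piecewise-linear option (item 3), with one hypothetical break point chosen at age 25; the hinge term is just another column added to the design matrix, and a squared term shows the polynomial alternative (item 2):

```python
import numpy as np
import pandas as pd

# Hypothetical continuous predictor.
df = pd.DataFrame({"driver_age": [18, 22, 25, 30, 45, 60, 75]})

# Piecewise-linear term: hinge function max(0, Xj - C) at break point C = 25.
C = 25
df["age_hinge_25"] = np.maximum(0, df["driver_age"] - C)

# Polynomial alternative: add a squared term instead of a hinge.
df["age_sq"] = df["driver_age"] ** 2
print(df)
```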
13
Q

Why is model selection different from model refinement

A
  1. Some candidate models may be proprietary
  2. The decision on the final model may be a business decision, not a technical one
14
Q

3 methods to test model stability

A
  1. Cook’s distance for individual records (records with a high Cook’s distance should be given additional scrutiny as to whether to include them)
  2. Cross-validation, comparing the parameter estimates produced on each fold
  3. Bootstrapping, comparing the mean and variance of the parameter estimates
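A minimal bootstrap sketch for the third method, refitting a hypothetical Poisson frequency model on resampled data and looking at the spread of one coefficient; the data are simulated:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Simulated modeling data with a single predictor x.
rng = np.random.default_rng(1)
n = 1000
df = pd.DataFrame({"x": rng.uniform(0, 1, n)})
df["claims"] = rng.poisson(np.exp(-2 + 0.8 * df["x"]))

# Refit the model on bootstrap resamples and collect the coefficient of x.
coefs = []
for _ in range(200):
    sample = df.sample(n=n, replace=True)
    fit = smf.glm("claims ~ x", data=sample,
                  family=sm.families.Poisson()).fit()
    coefs.append(fit.params["x"])

# A stable model shows a tight distribution around the full-data estimate.
print(np.mean(coefs), np.std(coefs))
```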
15
Q

4 lift-based measures

A
  1. simple quintile plot
  2. double lift chart
  3. loss ratio charts
  4. Gini index
16
Q

Describe double lift charts

A

Calculate the sort ratio for each record (sort ratio = model 1 predicted loss cost / model 2 predicted loss cost)
Sort by the sort ratio and bucket into quantiles
Calculate the average predicted loss cost for each model and the average actual loss cost in each quantile, divide each by the overall average loss cost, and plot
Winning model: the one that best matches the actual loss cost in each quantile
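A pandas sketch of the construction, assuming a holdout DataFrame with hypothetical columns pred1 and pred2 (the two models' predicted loss costs), actual, and exposure:

```python
import pandas as pd

def double_lift_data(df, n_bins=10):
    """Average predicted and actual loss cost by sort-ratio quantile,
    each divided by the overall average actual loss cost."""
    out = df.copy()
    out["sort_ratio"] = out["pred1"] / out["pred2"]   # model 1 / model 2
    out = out.sort_values("sort_ratio")
    # Bucket into quantiles of (roughly) equal exposure via cumulative exposure.
    out["bin"] = pd.cut(out["exposure"].cumsum(), bins=n_bins, labels=False)
    grouped = out.groupby("bin")[["pred1", "pred2", "actual"]].mean()
    return grouped / df["actual"].mean()   # plot the three lines by bin
```

The winning model is the one whose curve tracks the actual curve most closely across the bins.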

17
Q

describe simple quintile plot

A

Sort the data based on predicted loss cost
Bucket into quantiles with equal exposures
Calculate the average predicted loss cost and the average actual loss cost for each bucket and graph them
Winning model: predictive accuracy (look at the difference between actual and predicted) and monotonicity (the actual pure premium should increase across quantiles)
The vertical distance of the actual loss cost between the first and last quantiles should be large
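The same grouping for a single model, assuming hypothetical columns pred, actual, and exposure on the holdout set:

```python
import pandas as pd

def quantile_plot_data(df, n_bins=5):
    """Average predicted vs. actual loss cost by predicted-loss-cost quantile."""
    out = df.sort_values("pred").copy()
    # Equal-exposure buckets via cumulative exposure.
    out["bin"] = pd.cut(out["exposure"].cumsum(), bins=n_bins, labels=False)
    return out.groupby("bin")[["pred", "actual"]].mean()
```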

18
Q

Describe loss ratio charts

A

Sort based on predicted loss ratios
Bucket into quantiles with equal exposures
Calculate the actual loss ratio for each quantile and plot
The greater the vertical distance between the lowest and highest loss ratios, the better the model is at identifying segmentation opportunities not present in the current rating plan
The loss ratio should increase monotonically across quantiles
This is the easiest one to understand
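The construction mirrors the quantile plot above, only sorting and grouping on loss ratios; the premium column is hypothetical:

```python
import pandas as pd

def loss_ratio_chart_data(df, n_bins=10):
    """Actual loss ratio by predicted-loss-ratio quantile."""
    out = df.copy()
    out["pred_lr"] = out["pred"] / out["premium"]
    out = out.sort_values("pred_lr")
    out["bin"] = pd.cut(out["exposure"].cumsum(), bins=n_bins, labels=False)
    g = out.groupby("bin")[["actual", "premium"]].sum()
    return g["actual"] / g["premium"]   # actual loss ratio per quantile
```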

19
Q

Describe Gini index

A

Measures the ability to identify the best and worst risks
Sort the holdout dataset based on predicted loss cost
Plot the cumulative percent of exposures on the x-axis and the cumulative percent of actual losses on the y-axis
The curve formed is the Lorenz curve
Compare it with the line of equality
The Gini index is twice the area between the Lorenz curve and the line of equality
The higher the Gini index, the better
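A numpy/pandas sketch of the calculation, assuming the same hypothetical pred, actual, and exposure columns and using the common normalization of twice the area:

```python
import numpy as np
import pandas as pd

def gini_index(df):
    """Gini index of a holdout set sorted by predicted loss cost."""
    out = df.sort_values("pred")
    x = (out["exposure"].cumsum() / out["exposure"].sum()).to_numpy()
    y = (out["actual"].cumsum() / out["actual"].sum()).to_numpy()

    # Trapezoidal area under the Lorenz curve, starting from the origin.
    xs = np.concatenate([[0.0], x])
    ys = np.concatenate([[0.0], y])
    lorenz_area = np.sum(0.5 * (ys[1:] + ys[:-1]) * np.diff(xs))

    # Twice the area between the line of equality (area 0.5) and the Lorenz curve.
    return 2 * (0.5 - lorenz_area)
```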

20
Q

Sensitivity formula

A

true positives / total event occurrences

21
Q

Specificity formula

A

true negatives/total event non-occurrences
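A tiny worked example covering both formulas, with hypothetical 0/1 actual outcomes and already-thresholded predictions:

```python
import numpy as np

actual = np.array([1, 1, 0, 0, 1, 0, 0, 1])   # event occurrences
pred   = np.array([1, 0, 0, 1, 1, 0, 0, 1])   # predicted classes

true_pos = np.sum((pred == 1) & (actual == 1))
true_neg = np.sum((pred == 0) & (actual == 0))

sensitivity = true_pos / np.sum(actual == 1)   # true positives / total occurrences
specificity = true_neg / np.sum(actual == 0)   # true negatives / total non-occurrences
print(sensitivity, specificity)                # 0.75 0.75
```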

22
Q

Partial Residual formula

A

r_i = (y_i - μ_i) * g'(μ_i) + β_j * x_ij
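A sketch of the computation for one predictor x_j from a fitted log-link GLM, where g'(μ) = 1/μ; y, mu, x_j, and beta_j are hypothetical arrays/values taken from the fitted model:

```python
import numpy as np

def partial_residuals(y, mu, x_j, beta_j):
    """r_i = (y_i - mu_i) * g'(mu_i) + beta_j * x_ij, with g'(mu) = 1/mu for a log link."""
    return (y - mu) / mu + beta_j * x_j

# Plotting these (or a binned version) against x_j shows whether the current
# form of x_j captures its effect adequately.
```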

23
Q

Scaled Deviance formula

A

2 × (log-likelihood of the saturated model - log-likelihood of the model)
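A small worked numpy example for a Poisson model, where the saturated model sets each fitted mean equal to the observation and the formula reduces to 2 × sum(y·ln(y/μ) - (y - μ)); the values are hypothetical:

```python
import numpy as np

y  = np.array([0.0, 1.0, 2.0, 0.0, 3.0])   # observed counts
mu = np.array([0.4, 0.9, 1.5, 0.2, 2.8])   # fitted means from the model

# 2 * (log-likelihood of saturated model - log-likelihood of the model),
# taking y*ln(y/mu) as 0 when y = 0.
term = np.zeros_like(y)
pos = y > 0
term[pos] = y[pos] * np.log(y[pos] / mu[pos])
scaled_deviance = 2 * np.sum(term - (y - mu))
print(scaled_deviance)
```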

24
Q

Unscaled Deviance formula

A

Dispersion parameter x scaled deviance

25
Q

F statistic formula

A

[unscaled deviance (small model) - unscaled deviance (big model)] / [# of parameters added × dispersion parameter of the big model]

26
Q

degrees of freedom of the F statistic

A

df of the numerator: # of parameters added
df of the denominator: # of observations - # of parameters in the big model
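A numpy/scipy sketch of the test, with hypothetical deviances, parameter counts, dispersion, and observation count plugged in:

```python
from scipy import stats

n       = 10_000    # observations
p_small = 5         # parameters in the small model
p_big   = 9         # parameters in the big model
D_small = 12_450.0  # unscaled deviance, small model
D_big   = 12_300.0  # unscaled deviance, big model
phi_big = 1.3       # estimated dispersion parameter of the big model

added = p_big - p_small
f_stat = (D_small - D_big) / (added * phi_big)

# Compare to an F distribution with (added, n - p_big) degrees of freedom.
p_value = stats.f.sf(f_stat, added, n - p_big)
print(f_stat, p_value)
```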

27
Q

working residual formula

A

wr_i = (y_i - μ_i) * g'(μ_i)

28
Q

how does binning working residuals work

A

Bin to reduce the clutter (noise) in the scatterplot
Working weights: ww_i = w_i / (V(μ_i) * [g'(μ_i)]^2)
Binned working residuals: weighted average of the working residuals using the working weights
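A pandas sketch of the binning, assuming arrays of working residuals and working weights already computed from a fitted model and binned along a predictor x (all names hypothetical):

```python
import pandas as pd

def binned_working_residuals(x, working_resid, working_weight, n_bins=20):
    """Weighted-average working residual per quantile bin of the predictor x."""
    df = pd.DataFrame({"x": x, "wr": working_resid, "ww": working_weight})
    df["bin"] = pd.qcut(df["x"], q=n_bins, labels=False, duplicates="drop")
    df["wr_ww"] = df["wr"] * df["ww"]
    g = df.groupby("bin").agg(x_mid=("x", "mean"),
                              wr_ww=("wr_ww", "sum"),
                              ww=("ww", "sum"))
    # Weighted average of the working residuals using the working weights.
    g["binned_wr"] = g["wr_ww"] / g["ww"]
    return g[["x_mid", "binned_wr"]]
```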

29
Q

Model building process

A
  1. Set goals and objectives (such as purpose, time frame, resources)
  2. Communicate with key stakeholders to identify their requirements and concerns
  3. Collect and process the data
  4. Exploratory data analysis
  5. Specify the form of the model
  6. Evaluate the model output
  7. Validate the model by testing it on a holdout dataset and pick the optimal model
  8. Translate the model results into a product
  9. Maintain and rebuild
30
Q

Merging policy and claim data considerations

A
  1. Matching claims to the specific policy
  2. Check for timing differences in when each dataset is updated
  3. Is there a unique key to merge on?
  4. Level of aggregation, time dimension (PY vs. CY)
  5. Fields not needed or missing
31
Q

things to check when cleaning the data

A
  1. Remove duplicates
  2. Check categorical fields against documentation
  3. Reasonability of numerical fields
  4. How to handle errors or missing values
  5. Whether to convert continuous variables into categorical ones