A2. GLM Flashcards
3 advantages of using a log link function for ratemaking
- Simple and practical to implement
- Guarantees positive premiums
- Impact of risk characteristics is more intuitive (multiplicative rating structure)
2 uses of offset terms
- Incorporate pre-determined values for certain variables
- When the target variable varies directly with a particular measure (e.g., exposure)
2 solutions to deal with correlation
- Remove all except one (but could lose unique information)
- Use a dimensionality reduction technique like PCA/factor analysis (but takes additional time)
Problem with correlation among variables
Can produce an unstable model with erratic coefficients that have high standard errors
2 uses of weights assigned to each observation
- When an observation contains grouped information (e.g., an average over several records)
- When different observations represent different time periods
Define multicollinearity
A near-perfect linear dependency between two or more predictor variables
How to detect multicollinearity
Use the variance inflation factor (VIF)
A VIF >= 10 is considered high
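A minimal sketch of the VIF check using statsmodels' variance_inflation_factor; the column names and simulated data are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools import add_constant

# Hypothetical predictors; replace with the actual GLM design matrix.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "driver_age": rng.normal(45, 12, 1000),
    "vehicle_age": rng.normal(8, 3, 1000),
})
# A predictor built from vehicle_age, to illustrate a high VIF.
X["vehicle_value"] = 30 - 2 * X["vehicle_age"] + rng.normal(0, 1, 1000)

X_const = add_constant(X)  # include an intercept column before computing VIFs
vifs = {col: variance_inflation_factor(X_const.values, i)
        for i, col in enumerate(X_const.columns) if col != "const"}
print(vifs)  # flag any predictor with VIF >= 10
```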
Define aliasing
A perfect linear dependency between predictor variables
The GLM will not converge
2 GLM limitations
- GLMs give full credibility to the estimates, even with a low volume of data or high volatility
- GLMs assume the randomness of outcomes is uncorrelated (e.g., renewals of the same policy, weather events)
4 Advantages of modeling frequency/severity over pure premium
- Gain more insight and intuition about the impact of each predictor variable
- Each of frequency and severity is more stable when modeled separately
- Pure premium modeling can lead to overfitting if a predictor variable impacts only frequency or only severity, but not both
- The Tweedie distribution used for pure premium models assumes frequency and severity move in the same direction (which may not be true)
2 disadvantages of modeling frequency/severity over pure premium
- Requires more data
- Takes more time to build two models
4 ways to transform variables in GLM
- Bin the variable (adds parameters to estimate, using more degrees of freedom, which may lead to overfitting; may produce inconsistent or impractical patterns; variation within bins is ignored)
- Add polynomial terms (loss of interpretability without a graph; higher-order polynomials can behave erratically at the edges of the data)
- Add piecewise linear terms (add a hinge function max(0, Xj - c) at each break point c; break points c must be chosen manually; see the sketch after this list)
- Natural cubic splines (combine piecewise functions and polynomials; produce a continuous curve that fits the edges of the data better, but a graph is needed to interpret the model)
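A minimal sketch of the piecewise linear (hinge) transform; the variable name and break points are hypothetical and would normally be chosen by inspecting the data.

```python
import numpy as np
import pandas as pd

def hinge(x, c):
    """Piecewise-linear basis term max(0, x - c) for break point c."""
    return np.maximum(0.0, x - c)

# Hypothetical continuous rating variable with manually chosen break points.
df = pd.DataFrame({"driver_age": np.arange(16, 90, dtype=float)})
for c in (25, 65):  # assumed break points
    df[f"driver_age_hinge_{c}"] = hinge(df["driver_age"], c)
# df now holds the base term plus the hinge terms, ready for the GLM design matrix.
```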
Why is model selection different from model refinement
- Some candidate models may be proprietary
- The decision on the final model may be a business decision rather than a technical one
3 methods to test model stability
- Cook’s distance for individual records (records with a high Cook’s distance should get additional scrutiny on whether to include them)
- Cross-validation, comparing the parameter estimates across the in-sample model fits
- Bootstrapping, comparing the mean and variance of the parameter estimates across resamples (see the sketch below)
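A minimal sketch of the bootstrapping check using statsmodels; the Poisson frequency model, column names, and simulated data are assumptions.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical frequency data set.
rng = np.random.default_rng(1)
data = pd.DataFrame({
    "driver_age": rng.uniform(18, 80, 2000),
    "exposure": rng.uniform(0.2, 1.0, 2000),
})
data["claim_count"] = rng.poisson(0.1 * data["exposure"])

def fit_coefs(df):
    """Fit a Poisson frequency GLM with a log-exposure offset and return its coefficients."""
    model = smf.glm("claim_count ~ driver_age", data=df,
                    family=sm.families.Poisson(),
                    offset=np.log(df["exposure"]))
    return model.fit().params

# Refit on bootstrap resamples and compare the mean and spread of each coefficient.
boot = pd.DataFrame([
    fit_coefs(data.sample(frac=1.0, replace=True).reset_index(drop=True))
    for _ in range(50)
])
print(boot.agg(["mean", "std"]))  # a large std relative to the mean suggests instability
```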
4 lift-based measures
- Simple quantile plots
- double lift chart
- loss ratio charts
- Gini index
Describe double lift charts
Calculate the sort ratio (sort ratio = model 1 predicted loss cost / model 2 predicted loss cost)
Sort by the sort ratio and bucket into quantiles
Calculate the average predicted loss cost for each model and the average actual loss cost in each quantile; divide each by the overall average loss cost and plot
Winning model: the one that best matches the actual loss cost in each quantile
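A minimal sketch of building a double lift table in pandas; the column names and simulated predictions are assumptions, and exposure weighting is omitted for brevity.

```python
import numpy as np
import pandas as pd

def double_lift(df, n_bins=10):
    """Bucket by the sort ratio and return each model's average predicted loss cost
    and the average actual loss cost, scaled by the overall average actual."""
    out = df.copy()
    out["sort_ratio"] = out["pred_model1"] / out["pred_model2"]
    out["bucket"] = pd.qcut(out["sort_ratio"].rank(method="first"), n_bins, labels=False)
    table = out.groupby("bucket")[["pred_model1", "pred_model2", "actual"]].mean()
    return table / out["actual"].mean()

# Example usage with simulated predictions (assumed column names).
rng = np.random.default_rng(2)
df = pd.DataFrame({"actual": rng.gamma(2.0, 500, 5000)})
df["pred_model1"] = df["actual"] * rng.lognormal(0, 0.3, 5000)
df["pred_model2"] = df["actual"] * rng.lognormal(0, 0.5, 5000)
print(double_lift(df))  # the model that tracks the actual column most closely wins
```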
Describe simple quantile plots
Sort data based on predicted loss costs
Then bucket into quantiles with equal exposures
Calculate average predicted loss cost & average actual loss cost for each bucket and graph
Winning model: predictive accuracy (small differences between actual and predicted) and monotonicity (the actual pure premium should increase across quantiles)
The vertical distance in actual loss cost between the first and last quantiles should be large
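A minimal sketch of building the quantile plot table with equal-exposure buckets; the column names and simulated holdout data are assumptions.

```python
import numpy as np
import pandas as pd

def quantile_table(df, n_bins=5):
    """Sort by predicted loss cost, cut into equal-exposure buckets, and return
    the exposure-weighted average predicted and actual loss cost per bucket."""
    out = df.sort_values("predicted").copy()
    cum_expo = out["exposure"].cumsum() / out["exposure"].sum()
    out["bucket"] = np.minimum((cum_expo * n_bins).astype(int), n_bins - 1)
    return out.groupby("bucket").apply(lambda g: pd.Series({
        "avg_predicted": np.average(g["predicted"], weights=g["exposure"]),
        "avg_actual": np.average(g["actual"], weights=g["exposure"]),
    }))

# Assumed columns: predicted/actual loss cost and exposure on a holdout set.
rng = np.random.default_rng(3)
df = pd.DataFrame({"exposure": rng.uniform(0.5, 1.0, 5000)})
df["actual"] = rng.gamma(2.0, 300, 5000)
df["predicted"] = df["actual"] * rng.lognormal(0, 0.4, 5000)
print(quantile_table(df))  # check accuracy, monotonicity, and first-to-last lift
```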
Describe loss ratio charts
Sort based on predicted LRs
Bucket into quantiles with equal exposures
Calculate the actual LRs for each quantile & plot
The greater the vertical distance between the lowest and highest LRs, the better the model is at identifying segmentation opportunities not captured by the current rating plan
The LR should increase monotonically across quantiles
This is the easiest measure to understand
Describe Gini index
Measures the model's ability to identify the best and worst risks
Sort holdout dataset based on the predicted loss cost.
Plot the cumulative percent of exposures on the x-axis and the cumulative percent of actual losses on the y-axis
The curve formed is the Lorenz Curve.
Compare it with the line of equality
The Gini index is twice the area between the Lorenz Curve and the line of equality
The higher the Gini index, the better
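A minimal sketch of the Gini index calculation from a Lorenz curve, following the steps above; the simulated holdout data is an assumption.

```python
import numpy as np

def gini_index(exposure, predicted, actual_loss):
    """Sort by predicted loss cost, build the Lorenz curve (cumulative % of exposure
    vs. cumulative % of actual losses), and return twice the area between it and
    the line of equality."""
    order = np.argsort(predicted)
    x = np.concatenate([[0.0], np.cumsum(exposure[order]) / exposure.sum()])
    y = np.concatenate([[0.0], np.cumsum(actual_loss[order]) / actual_loss.sum()])
    lorenz_area = np.sum(np.diff(x) * (y[1:] + y[:-1]) / 2)  # trapezoid rule
    return 2 * (0.5 - lorenz_area)

# Hypothetical holdout data.
rng = np.random.default_rng(4)
exposure = rng.uniform(0.5, 1.0, 10000)
actual = rng.gamma(1.5, 400, 10000)
predicted = actual * rng.lognormal(0, 0.5, 10000)
print(gini_index(exposure, predicted, actual))  # higher is better
```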
Sensitivity formula
true positives/total event occurrences
Specificity formula
true negatives/total event non-occurrences
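A minimal sketch of both formulas from 0/1 arrays of actual and predicted event outcomes; the example data is hypothetical.

```python
import numpy as np

def sensitivity_specificity(actual, predicted):
    """actual and predicted are 0/1 arrays (1 = event occurred / event predicted)."""
    actual, predicted = np.asarray(actual), np.asarray(predicted)
    tp = np.sum((predicted == 1) & (actual == 1))
    tn = np.sum((predicted == 0) & (actual == 0))
    sensitivity = tp / np.sum(actual == 1)  # true positives / total event occurrences
    specificity = tn / np.sum(actual == 0)  # true negatives / total event non-occurrences
    return sensitivity, specificity

print(sensitivity_specificity([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))  # (0.667, 0.5)
```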
Partial Residual formula
r_i = (y_i - mu_i) * g'(mu_i) + Beta_j * x_ij
Scaled Deviance formula
2 x (log-likelihood of the saturated model - log-likelihood of the model)
Unscaled Deviance formula
Dispersion parameter x scaled deviance
F statistic formula
[Unscaled deviance(small) - Unscaled deviance(big)] / [(# of parameters added) x (dispersion parameter of the big model)]
Degrees of freedom of the F statistic
df of the numerator: # of parameters added
df of the denominator: # of observations - # of parameters in the big model
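A minimal sketch of the deviance-based F test using statsmodels; the gamma severity model, column names, and simulated data are assumptions, and statsmodels' .deviance is taken as the unscaled deviance.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical severity data.
rng = np.random.default_rng(5)
df = pd.DataFrame({
    "driver_age": rng.uniform(18, 80, 3000),
    "vehicle_age": rng.uniform(0, 20, 3000),
})
df["severity"] = rng.gamma(2.0, 1000 + 5 * df["driver_age"])

family = sm.families.Gamma(link=sm.families.links.Log())
small = smf.glm("severity ~ driver_age", df, family=family).fit()
big = smf.glm("severity ~ driver_age + vehicle_age", df, family=family).fit()

# .deviance is the unscaled deviance; .scale is the estimated dispersion parameter.
added = big.df_model - small.df_model            # number of parameters added
f_stat = (small.deviance - big.deviance) / (added * big.scale)
df_num, df_den = added, big.df_resid             # df_resid = observations - parameters in the big model
print(f_stat, df_num, df_den)
```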
working residual formula
wr_i = (y_i - mu_i) * g'(mu_i)
How does binning working residuals work
Bin to reduce clutter in the scatterplot
Working weights: ww_i = w_i / [V(mu_i) x g'(mu_i)^2]
Binned working residual = weighted average of the working residuals in the bin, using the working weights
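A minimal sketch of binning working residuals for a fitted Poisson frequency model in statsmodels; the column names, simulated data, and unit prior weights are assumptions.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical fitted frequency model.
rng = np.random.default_rng(6)
df = pd.DataFrame({"driver_age": rng.uniform(18, 80, 5000)})
df["claim_count"] = rng.poisson(np.exp(-2.5 + 0.01 * df["driver_age"]))
fit = smf.glm("claim_count ~ driver_age", df, family=sm.families.Poisson()).fit()

mu = fit.fittedvalues
family = fit.model.family

# wr_i = (y_i - mu_i) * g'(mu_i);  ww_i = w_i / [V(mu_i) * g'(mu_i)^2]
g_prime = family.link.deriv(mu)
wr = (df["claim_count"] - mu) * g_prime
ww = 1.0 / (family.variance(mu) * g_prime ** 2)  # prior weights w_i taken as 1

# Bin by the predictor and take the working-weight-weighted average residual per bin.
bins = pd.qcut(df["driver_age"], 20)
binned = wr.groupby(bins).apply(lambda r: np.average(r, weights=ww[r.index]))
print(binned.head())
```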
Model building process
- set goals & objectives (such as purpose, time frame, resources)
- communicate with key stakeholders to identify the requirements and concerns
- Collect and process the data
- Exploratory data analysis
- Specify the form of the model
- Evaluate the model output
- validate the model by testing it using a holdout dataset and pick optimal model
- translate the model results into a product
- maintain and rebuild
Merging policy and claim data considerations
- Matching claims to the specific policy
- Check for timing differences in when each data source is updated
- Is there a unique key to merge on
- Level of aggregation and time dimension (PY vs. CY)
- fields not needed or missing
things to check when cleaning the data
- remove duplicates
- check categorical fields against documentation
- Reasonability of numerical fields
- How to handle errors or missing values
- Whether to convert continuous variables into categorical ones