GLM1 Flashcards
2 main components of a Generalized Linear Model.
random component
systematic component
random component
each yi is assumed to be independent and to come from the exponential family of distributions, with mean µi and variance Var(yi) = φV(µi)/ωi.
φ is called the dispersion parameter and is a constant used to scale the variance.
V(µ) is called the variance function and it describes the relationship between the variance and mean for a selected distribution type.
ωi are known as the weights and allow each observation i to be given its own weight.
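As an illustration (not part of the original card), the sketch below uses Python's statsmodels to evaluate the variance function V(µ) for a few common exponential-family choices; the value of µ is arbitrary.

    import statsmodels.api as sm

    mu = 100.0
    # fam.variance(mu) evaluates V(mu): mu for Poisson, mu**2 for Gamma,
    # and mu**3 for Inverse Gaussian
    for fam in [sm.families.Poisson(), sm.families.Gamma(), sm.families.InverseGaussian()]:
        print(type(fam).__name__, fam.variance(mu))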
systematic component
which is of the form g(µi) = β0 + β1xi1 + β2xi2 + ··· + βpxip + offset
right hand side is known as the linear predictor
offset term is optional and allows you to manually specify fixed estimates for certain variables rather than having the model fit them
x’s are the predictor variables
g(µ) is called the link function
β0 is called the intercept term, and the other β’s are called the coefficients of the model. These are what we want to estimate. Once we know the β’s, we can plug in known values for the xi variables and calculate the predicted values of the yi variables (i.e., µi).
link function
allows for transformations of the linear predictor
For rating plans, the log link function g(µ) = ln(µ) is typically used since it transforms the linear predictor into a multiplicative structure.
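A minimal sketch (with made-up data and column names) of fitting a log-link GLM in Python's statsmodels:

    import pandas as pd
    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    # toy severity data (hypothetical)
    df = pd.DataFrame({
        "severity": [500.0, 1200.0, 800.0, 3000.0, 650.0, 2100.0],
        "age":      [25, 40, 33, 55, 29, 47],
        "gender":   ["F", "M", "F", "M", "F", "M"],
    })

    # Gamma random component with a log link (the typical rating-plan setup)
    result = smf.glm("severity ~ age + gender", data=df,
                     family=sm.families.Gamma(link=sm.families.links.Log())).fit()
    print(result.params)  # fitted betas, on the log scale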
3 advantages of using a log link function when building a pure premium model for use in creating a rating plan
A log link allows for a multiplicative rating plan, which has the following advantages:
- Simple and practical to implement.
- It guarantees positive premiums.
- Impact of risk characteristics is more intuitive.
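To see the multiplicative structure concretely, here is a tiny sketch with hypothetical coefficient values: exponentiating each β turns the additive linear predictor into a product of positive rating factors.

    import numpy as np

    beta0, beta_male, beta_terrA = 6.0, 0.10, -0.25  # hypothetical fitted values
    # base rate * gender relativity * territory relativity; every factor is
    # positive, so the predicted pure premium is guaranteed positive
    mu = np.exp(beta0) * np.exp(beta_male) * np.exp(beta_terrA)
    print(np.exp(beta0), np.exp(beta_male), np.exp(beta_terrA), mu)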
choice of the base level for the Gender variable
Since Female has more observations than Male, Female should be the base level for Gender.
Choosing a level with fewer observations as the base level will still result in the same predicted relativities for that variable (re-based to the chosen base level), but there will be wider confidence intervals around the estimated coefficients
Both the standard error and p-value will increase, since a level with fewer observations is being made the new base level and the model will have less confidence in the base level estimate (in addition to its confidence in the difference between the base level and each other level).
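In statsmodels/patsy formula syntax (an assumed tool choice, with made-up data), the base level of a categorical variable can be set explicitly with Treatment coding:

    import pandas as pd
    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    df = pd.DataFrame({"loss":   [500.0, 900.0, 650.0, 1200.0, 700.0, 1000.0],
                       "gender": ["F", "M", "F", "M", "F", "M"]})

    # Treatment('F') makes Female the base level; the fitted gender coefficient
    # is then the (log-scale) difference of Male relative to Female
    result = smf.glm("loss ~ C(gender, Treatment('F'))", data=df,
                     family=sm.families.Gamma(link=sm.families.links.Log())).fit()
    print(result.params)
    print(result.bse)  # standard errors widen if a sparse level is made the base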
g(µ) = β0 + β1·ln(InsuredAge) + β2·Male + β3·TerrA + β4·TerrC + β5·Male·TerrA + β6·Male·TerrC
explain the meaning of each of the β parameters in the model.
β0 is the intercept (the base level: female in territory B with InsuredAge = 1, so that ln(InsuredAge) = 0)
β1 is the change in g(µ) for a unit change in the natural log of insured age
β2 is the change in g(µ) for being male instead of female
β3 is the change in g(µ) for being in territory A instead of B
β4 is the change in g(µ) for being in territory C instead of B
β5 is the additional interaction effect on g(µ) for being male and in territory A
β6 is the additional interaction effect on g(µ) for being male and in territory C
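The same model can be written in statsmodels formula syntax; everything below (data frame, column names) is hypothetical, and the * operator expands to the main effects plus the interaction terms (β2 through β6).

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    # hypothetical data, named to match the model above
    df = pd.DataFrame({
        "loss":       [900.0, 1500.0, 700.0, 2400.0, 1100.0, 1900.0,
                       800.0, 2100.0, 1300.0, 1000.0, 1700.0, 950.0],
        "InsuredAge": [25, 40, 33, 55, 29, 47, 36, 52, 44, 31, 60, 27],
        "Gender":     ["F", "M", "F", "M", "F", "M", "F", "M", "M", "F", "M", "F"],
        "Territory":  ["A", "A", "B", "B", "C", "C", "A", "B", "C", "B", "A", "C"],
    })

    # Treatment coding fixes the base levels (Female, territory B)
    formula = ("loss ~ np.log(InsuredAge)"
               " + C(Gender, Treatment('F')) * C(Territory, Treatment('B'))")
    result = smf.glm(formula, data=df,
                     family=sm.families.Gamma(link=sm.families.links.Log())).fit()
    print(result.params)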
2 common choices for modeling claim severity
and why
Gamma and Inverse Gaussian distributions
Claim severity distributions tend to be right-skewed and have a lower bound at 0. Both Gamma and Inverse Gaussian distributions exhibit these properties.
Inverse Gaussian has a sharper peak and wider tail than Gamma (for the same mean and variance), so the Inverse Gaussian is more appropriate for severity distributions that are more skewed.
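The sketch below matches a Gamma and an Inverse Gaussian to the same mean and variance with scipy; the parameter algebra assumes scipy's invgauss parameterization (mean = µ·scale, variance = µ³·scale²).

    import numpy as np
    from scipy import stats

    a, theta = 2.0, 50.0                  # Gamma: mean = 100, variance = 5000
    gamma = stats.gamma(a, scale=theta)
    ig = stats.invgauss(1.0 / a, scale=a**2 * theta)  # same mean and variance

    x = np.array([10.0, 50.0, 100.0, 200.0, 400.0])
    print(gamma.pdf(x))
    print(ig.pdf(x))  # sharper peak (near x = 50 here) and a heavier right tail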
When creating a GLM with a log link function, it is generally recommended that continuous predictor variables
& 2 exceptions
be logged
Logging continuous variables allows for more flexibility in fitting different curve shapes to the data: with a log link, an unlogged variable forces an exponential relationship (µ grows like exp(βx)), while a logged variable fits a power curve (µ grows like x^β)
2 exceptions: using a time variable (e.g., year) to pick up time effects, and when the variable contains values of 0, since ln(0) is undefined
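A small numeric sketch of the difference, with hypothetical coefficients:

    import numpy as np

    b0, b1 = 1.0, 0.5                 # hypothetical coefficients
    x = np.array([1.0, 2.0, 4.0, 8.0])
    mu_unlogged = np.exp(b0 + b1 * x)          # exponential growth in x
    mu_logged = np.exp(b0 + b1 * np.log(x))    # = exp(b0) * x**b1, a power curve
    print(mu_unlogged)
    print(mu_logged)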
main benefit of GLMs over univariate analysis
able to account for exposure correlation between predictor variables (a univariate analysis double-counts effects when exposures are correlated across variables, while a GLM estimates each variable's effect with the others held constant)
GLMs also run into problems when predictor variables are very highly correlated.
what can happen in GLMs with highly correlated predictor variables
This can result in an unstable model with erratic coefficients that have high standard errors.
two options for dealing with highly correlated predictor variables
i. Removing all highly correlated variables except one. This eliminates the high correlation in the model, but it also potentially loses some unique information contained in the eliminated variables.
ii. Use dimensionality-reduction techniques such as principal components analysis or factor analysis to create a new subset of variables from the correlated variables, and use this subset of variables in the GLM. The downside is the additional time required to do this extra analysis.
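A minimal sketch of option ii using scikit-learn's PCA (an assumed tool choice, not from the card): the correlated columns are replaced by a smaller set of uncorrelated component scores before fitting the GLM.

    import numpy as np
    import pandas as pd
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    x1 = rng.normal(size=200)
    X = pd.DataFrame({"x1": x1,
                      "x2": x1 + 0.05 * rng.normal(size=200),   # highly correlated
                      "x3": x1 + 0.05 * rng.normal(size=200)})

    # keep enough components to explain most of the variance, then use the
    # component scores as GLM predictors in place of x1, x2, x3
    pca = PCA(n_components=1)
    scores = pca.fit_transform(X)
    print(pca.explained_variance_ratio_)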
multicollinearity
Multicollinearity occurs when there is a near-perfect linear dependency among 3 or more predictor variables.
For example, suppose x1 + x2 ≈ x3.
When multicollinearity is present in a model, the model may become unstable with erratic coefficients, and it may not converge to a solution
one way to detect multicollinearity in a model.
Use the variance inflation factor (VIF) statistic, which is given for each predictor variable. It measures the impact of collinearity with the other predictors on the squared standard error of that variable's coefficient, and is computed by seeing how well the other predictor variables can predict the variable in question.
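A sketch of computing VIFs with statsmodels on made-up data; a common rule of thumb treats VIFs above 10 as severe, and the near-dependency x1 + x2 ≈ x3 below is constructed to trigger it.

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    rng = np.random.default_rng(0)
    X = pd.DataFrame({"x1": rng.random(100), "x2": rng.random(100)})
    X["x3"] = X["x1"] + X["x2"] + 0.01 * rng.random(100)  # x1 + x2 ~ x3

    Xc = sm.add_constant(X)
    for i, col in enumerate(Xc.columns):
        print(col, variance_inflation_factor(Xc.values, i))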
aliasing
When there is a perfect linear dependency among predictor variables, those variables are aliased. The model then has no unique solution, so one of the aliased variables must be removed (many GLM software packages detect this and remove a variable automatically).
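As a quick illustration, a perfectly aliased design matrix is rank-deficient, which is why the coefficients cannot be uniquely estimated:

    import numpy as np

    x1 = np.arange(5.0)
    X = np.column_stack([np.ones(5), x1, 2.0 * x1])  # third column = 2 * second
    print(np.linalg.matrix_rank(X))  # 2 instead of 3: the columns are aliased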