Chapter 18: GLM Flashcards
List assumptions of the classical linear model
response variable modelled as a linear combination of explanatory variables
error terms have normal distribution
error terms have constant variance
error terms are independent
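The four assumptions above can be written compactly. The symbols beta_j, x_ij, epsilon_i and sigma^2 are notation assumed here for illustration; they do not appear in the flashcards.

```latex
Y_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} + \varepsilon_i,
\qquad \varepsilon_i \overset{\text{iid}}{\sim} N(0, \sigma^2),
\qquad i = 1, \dots, n
```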
Describe drawbacks of the classical linear model
model assumes a normally distributed response with constant variance, which may not be appropriate
adds together the effects of different explanatory variables, which often does not reflect reality
may become long-winded with more than 2 explanatory variables
When do intrinsic and extrinsic aliasing occur?
Intrinsic aliasing occurs:
because of dependencies inherent in the definition of the explanatory variables
this is dealt with by modelling software
Extrinsic aliasing occurs:
when two or more explanatory variables contain levels that are perfectly correlated
“Near aliasing” occurs when this correlation is almost, but not quite, perfect
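A minimal sketch of both kinds of aliasing, using made-up data. The factor names ("region", "sales_channel") and their levels are hypothetical and chosen only for illustration.

```python
# Hypothetical data: a three-level factor "region" coded with one dummy per level.
regions = ["north", "south", "east", "south", "north", "east"]
levels = ["north", "south", "east"]

# Design matrix rows: [intercept, is_north, is_south, is_east]
design = [[1] + [1 if r == lvl else 0 for lvl in levels] for r in regions]

# Intrinsic aliasing: in every row the dummies sum to the intercept column,
# so one column is an exact linear combination of the others.
intrinsically_aliased = all(row[0] == sum(row[1:]) for row in design)

# Usual remedy (typically applied by modelling software): drop one level,
# here the last one, and treat it as the base level.
base_level = levels[-1]
reduced_design = [row[:-1] for row in design]

# Extrinsic aliasing: two different variables whose levels coincide in the data,
# e.g. a hypothetical "sales_channel" that is perfectly tied to region.
channels = ["broker" if r == "north" else "direct" for r in regions]
is_north = [1 if r == "north" else 0 for r in regions]
is_broker = [1 if c == "broker" else 0 for c in channels]
extrinsically_aliased = is_north == is_broker

print(intrinsically_aliased, extrinsically_aliased)
```

In the extrinsic case the "north" and "broker" columns are identical, so the data cannot separate the two effects.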
Why use a GLM over one-way analysis?
one-way analysis ignores correlation and interaction effects:
for example, the effect of smoker status on claim amount may be higher for males than females
for example, the effect of smoker status on claim amount may be higher for older ages compared to younger ages
as a result, the one-way analysis may underestimate the effect of smoker status on claim amount when considering older ages
GLM appropriately accounts for correlations and interactions:
by simultaneously modelling the effects of explanatory variables on the response variable
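The smoker and age example can be illustrated with a small made-up portfolio. All exposures and mean claim amounts below are hypothetical; the sketch only shows how a one-way relativity can sit below the relativity seen at older ages.

```python
# Hypothetical portfolio: (smoker status, age band) -> (lives exposed, mean claim amount).
cells = {
    ("non-smoker", "young"): (800, 100.0),
    ("non-smoker", "old"): (200, 200.0),
    ("smoker", "young"): (300, 110.0),
    ("smoker", "old"): (100, 300.0),
}


def mean_claim(smoker_status: str) -> float:
    """One-way view: exposure-weighted mean claim, ignoring age."""
    rows = [v for (status, _), v in cells.items() if status == smoker_status]
    total_cost = sum(lives * mean for lives, mean in rows)
    total_lives = sum(lives for lives, _ in rows)
    return total_cost / total_lives


one_way_relativity = mean_claim("smoker") / mean_claim("non-smoker")

# Smoker relativity within each age band (what an interaction term can capture).
relativity_young = cells[("smoker", "young")][1] / cells[("non-smoker", "young")][1]
relativity_old = cells[("smoker", "old")][1] / cells[("non-smoker", "old")][1]

print(one_way_relativity, relativity_young, relativity_old)
```

With these numbers the one-way smoker relativity lies between the two age-specific relativities, so it understates the smoker effect at older ages.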
Why use a GLM over a classical linear model?
model not limited to the normal distribution:
can take on any distribution from the exponential family, for example Poisson or gamma
model not limited to the additive effects of explanatory variables:
can model the multiplicative effects of explanatory variables through use of a link function (which transforms the relationship to a linear scale)
variance of the response variable is a function of its mean and can often increase with the value of its mean:
for example Poisson
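A short sketch of how a log link turns additive terms in the linear predictor into multiplicative effects on the mean. The coefficients are hypothetical, chosen so the factors are easy to read.

```python
import math

# Hypothetical coefficients on the log scale (not taken from the flashcards).
intercept = math.log(100.0)   # base level: expected claim of 100
beta_smoker = math.log(1.5)   # smokers cost 1.5 times the base
beta_older = math.log(2.0)    # older lives cost 2 times the base


def expected_claim(smoker: int, older: int) -> float:
    """Inverse of the log link: mean = exp(linear predictor)."""
    linear_predictor = intercept + beta_smoker * smoker + beta_older * older
    return math.exp(linear_predictor)


base = expected_claim(0, 0)
older_smoker = expected_claim(1, 1)
print(base, older_smoker)
```

The terms add on the log scale, so the older smoker's expected claim is the base multiplied by both factors.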
Define the total deviance and the scaled deviance
total deviance:
deviance is a measure of the distance between the observed value (Y_i) and the fitted value (u_i)
with allowance for weights w_i, with higher importance assigned to errors where the variance should be small
the sum of each observation’s contribution to the deviance, d(Y_i, u_i), is the total deviance for a model
D, total deviance = SUM (from i = 1 to n) of d(Y_i, u_i)
scaled deviance:
total deviance adjusted by the scale parameter phi
D*, scaled deviance = D/phi
this standardises the deviance so that it can be used when comparing different models
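The flashcards do not give a formula for d(Y_i, u_i). As an illustration only, the sketch below assumes a Poisson response, for which the standard contribution is 2 * w_i * [Y_i * log(Y_i / u_i) - (Y_i - u_i)]. The observed and fitted values are made up.

```python
import math


def poisson_unit_deviance(y: float, mu: float, weight: float = 1.0) -> float:
    """Weighted contribution d(Y_i, u_i) for a Poisson response (assumed family)."""
    # y * log(y / mu) is taken as 0 when y == 0, its limiting value.
    log_term = y * math.log(y / mu) if y > 0 else 0.0
    return 2.0 * weight * (log_term - (y - mu))


observed = [0, 2, 1, 4]          # hypothetical claim counts
fitted = [0.5, 1.5, 1.0, 3.5]    # hypothetical fitted means, all positive

# Total deviance D: sum of the contributions over all observations.
total_deviance = sum(poisson_unit_deviance(y, mu) for y, mu in zip(observed, fitted))

# Scaled deviance D* = D / phi; phi = 1 for a standard Poisson model.
phi = 1.0
scaled_deviance = total_deviance / phi
print(total_deviance, scaled_deviance)
```

An observation that is fitted exactly (here the third one) contributes zero to the deviance.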
List 3 goodness of fit tests
chi-squared statistic:
used when comparing nested models and where the scale parameter is known
test statistic = (D_1)* - (D_2)*
which has the chi-squared distribution with degrees of freedom = df_1 - df_2
degrees of freedom is the number of observations less number of parameters
F-statistic:
used when comparing nested models and where the scale parameter is unknown
test statistic = [D_1 - D_2] / [(df_1 - df_2)*(D_2/df_2)]
which has the F-distribution with degrees of freedom df_1 - df_2 and df_2
AIC:
can be used when models are not necessarily nested
AIC = -2 * log-likelihood +2 * number of parameters
the lower the AIC, the better the model
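The three statistics above can be computed directly from fit summaries. The deviances, log-likelihoods and parameter counts below are hypothetical, and the sketch stops at the test statistics rather than looking up p-values.

```python
# Hypothetical fit summaries for two nested models (model 2 has more parameters).
n_obs = 1000
model_1 = {"deviance": 1250.0, "n_params": 4, "log_likelihood": -2100.0}
model_2 = {"deviance": 1220.0, "n_params": 6, "log_likelihood": -2085.0}

# Degrees of freedom: number of observations less number of parameters.
df_1 = n_obs - model_1["n_params"]
df_2 = n_obs - model_2["n_params"]

# Scale parameter known (phi = 1 assumed here): difference in scaled deviances.
phi = 1.0
chi_sq_stat = model_1["deviance"] / phi - model_2["deviance"] / phi
chi_sq_df = df_1 - df_2

# Scale parameter unknown: F-statistic, with phi estimated by D_2 / df_2.
f_stat = (model_1["deviance"] - model_2["deviance"]) / (
    (df_1 - df_2) * (model_2["deviance"] / df_2)
)


def aic(model: dict) -> float:
    """AIC = -2 * log-likelihood + 2 * number of parameters."""
    return -2.0 * model["log_likelihood"] + 2.0 * model["n_params"]


print(chi_sq_stat, chi_sq_df, f_stat, aic(model_1), aic(model_2))
```

Here the chi-squared statistic would be compared with the upper tail of a chi-squared distribution on 2 degrees of freedom (5% critical value 5.991), and model 2 also has the lower AIC.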
Explain 5 ways we can test for appropriateness of models
hat matrix:
H such that y_hat = H * y
The diagonal entries are called leverages; each measures the influence an observed value has on its own fitted value
deviance residuals:
which measure the distance between the observed and fitted values. Any large deviations may indicate that distributional assumptions are being violated
standardised Pearson residuals:
which measure the distance between the observed value and fitted value, adjusted for the leverage of the observed value and the variance of the fitted value
Cook’s distance:
an alternative to the diagonal entries of the hat matrix for measuring influence; a Cook’s distance > 1 may be cause for concern
residual plot:
where residuals are plotted against the fitted values. Residuals should be scattered symmetrically about the x-axis, with an average of zero
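To make leverage concrete, the sketch below uses the simplest special case: a classical linear model y = a + b*x fitted by least squares, where the diagonal of the hat matrix has the closed form h_ii = 1/n + (x_i - x_bar)^2 / S_xx. This is an assumption made for illustration; in a GLM the hat matrix comes from the weighted fit and has no such simple form. The x values are made up.

```python
# Hypothetical covariate values; the last one lies far from the rest.
x = [1.0, 2.0, 3.0, 4.0, 10.0]
n = len(x)
x_bar = sum(x) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)

# Diagonal of the hat matrix for simple least-squares regression:
# h_ii = 1/n + (x_i - x_bar)^2 / S_xx
leverages = [1.0 / n + (xi - x_bar) ** 2 / s_xx for xi in x]

# The leverages sum to the number of fitted parameters (intercept and slope).
total_leverage = sum(leverages)
print(leverages, total_leverage)
```

The outlying x value carries most of the leverage, so its observed response has a large influence on its own fitted value.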
List examples of GLMs
gamma model:
may be a good model for claim amounts, log link
Poisson model:
may be a good model for claim frequency, log link
logistic regression model:
may be a good model for binary outcome, logit link
consider odds ratios, together with their p-values, to assess significance
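For the logistic model, a coefficient is turned into an odds ratio by exponentiating it. The coefficient and standard error below are hypothetical, and the interval is an approximate Wald interval based on the normal approximation.

```python
import math

# Hypothetical logistic-regression output for a "smoker" indicator.
beta = 0.47        # estimated coefficient on the logit scale
std_error = 0.15   # its standard error

odds_ratio = math.exp(beta)

# Approximate 95% confidence interval for the odds ratio (Wald interval).
z = 1.96
ci_lower = math.exp(beta - z * std_error)
ci_upper = math.exp(beta + z * std_error)

# An interval that excludes 1 suggests significance at roughly the 5% level.
significant = ci_lower > 1.0 or ci_upper < 1.0
print(odds_ratio, ci_lower, ci_upper, significant)
```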
What can a GLM be used to model?
cost plpm - cost per life per month
List key items to mention when suggesting a model
specify explanatory variables, response variable
specify model, link function
consider interactions
consider the significance of coefficients (p-value, 95% CI)
Outline the advantages of the Tweedie distribution for modelling PMI claims
The Tweedie distribution is a special member of the exponential family that has a point mass (large spike) at zero and corresponds to the compound distribution of a Poisson claim number process and a gamma claim size distribution.
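The point mass at zero can be seen by simulating the compound Poisson-gamma process directly. This is a rough sketch using only the standard library. The claim rate and gamma parameters are hypothetical, and the Poisson sampler is Knuth's simple multiplication method, which is adequate only for small rates.

```python
import math
import random


def poisson_sample(rng: random.Random, lam: float) -> int:
    """Draw from Poisson(lam) by Knuth's multiplication method (small lam only)."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def simulate_total_claim(rng: random.Random, claim_rate: float,
                         gamma_shape: float, gamma_scale: float) -> float:
    """Total claim cost for one life: Poisson number of gamma-sized claims."""
    n_claims = poisson_sample(rng, claim_rate)
    return sum(rng.gammavariate(gamma_shape, gamma_scale) for _ in range(n_claims))


rng = random.Random(2024)   # fixed seed so the run is repeatable
claim_rate = 0.3            # hypothetical expected claims per life
totals = [simulate_total_claim(rng, claim_rate, 2.0, 500.0) for _ in range(20000)]

# Lives with no claims have a total of exactly zero: the point mass.
share_zero = sum(1 for t in totals if t == 0) / len(totals)
expected_share_zero = math.exp(-claim_rate)   # P(no claims) under the Poisson
mean_total = sum(totals) / len(totals)
print(share_zero, expected_share_zero, mean_total)
```

The simulated share of zero totals should sit close to exp(-claim_rate), while the positive totals follow a continuous distribution.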
List two properties of the exponential family of distributions
Distribution completely specified in terms of its mean and variance
Variance of the response is a function of its mean
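Both properties follow from the usual way of writing an exponential family density. The notation below (theta, phi, a, b, c, V) is the conventional one and is assumed here, since the flashcards do not state it.

```latex
f(y; \theta, \phi) = \exp\!\left\{ \frac{y\theta - b(\theta)}{a(\phi)} + c(y, \phi) \right\},
\qquad E[Y] = \mu = b'(\theta),
\qquad \operatorname{Var}(Y) = a(\phi)\, b''(\theta) = a(\phi)\, V(\mu)
```

For example, V(mu) = mu for the Poisson and V(mu) = mu^2 for the gamma, which is why the variance grows with the mean in both cases.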