Chapter 18 - GLM Flashcards by Sian Bourdin

Explanatory variables

inputs into the model that are expected to influence the response variable
choice of explanatory variables depends on the purpose of the model

Response variables

outputs from the model that are likely to be affected by explanatory variables

Categorical variables

explanatory variables
aka factors
values of each level are distinct
often cannot be given a natural ordering or score
continuous numerical variables (e.g. age) are often converted into categorical variables (e.g. by banding)

Non-categorical variables

can take numerical values

Interaction terms

included where pattern of response variable is better modelled by including parameters for each combination of two or more factors

What does a GLM do?

A GLM unpicks relationships and produces estimates of the true values of the relativities. It does this by taking account of correlations and allowing for investigation of any interactions between variables in the model

Assumptions of classic linear model

the error terms are independent and come from a normal distribution
the mean is a linear combination of the explanatory variables
the error terms have constant variance

Can estimate the parameters B0, B1, B2 using method of maximum likelihood

pg.635

Drawbacks of the normal model for multiple linear regression

assumes that the response variable has a normal distribution which may not be appropriate for the variable being modelled
the normal distribution has a constant variance which may not be appropriate for the variable being modelled
adds together the effects of different explanatory variables, but this is seldom what is observed in practice
with more than two explanatory variables, a manual solution becomes increasingly long-winded

How do GLMs address these problems?

the response variable can take any distribution from the exponential family
a link function is introduced which acts to remove the assumption that the effects of different variables must simply be added together
allow an offset term to be included within the linear predictor

GLM form

Pg. 639
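
The page cited above is not reproduced here. As a reminder only, one common way of writing a GLM is sketched below; the notation (link function g, linear predictor eta, offset xi, scale parameter phi, variance function V, prior weights omega) is an assumption and may differ from the notation on that page.

```latex
g(\mu_i) = \eta_i = \sum_{j=1}^{p} X_{ij}\,\beta_j + \xi_i,
\qquad \mu_i = E[Y_i],
\qquad \operatorname{Var}(Y_i) = \frac{\phi\, V(\mu_i)}{\omega_i}
```

Here X is the design matrix and the beta_j are the parameters to be estimated; the offset xi_i is a known term, so no parameter is fitted for it.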

Properties of members of the exponential family

the distribution is completely specified in terms of its mean and variance
the variance of Yi is a function of its mean

Requirements for link function

differentiable
monotonic
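
As an illustration of these two requirements, here is a minimal sketch of two commonly used link functions (log and logit) with their inverses. The function names are hypothetical, and the monotonicity check only inspects a finite grid of values, so it is a sanity check rather than a proof.

```python
import math

def log_link(mu):
    """Log link: g(mu) = log(mu), for mu > 0."""
    return math.log(mu)

def log_link_inverse(eta):
    """Inverse of the log link: mu = exp(eta)."""
    return math.exp(eta)

def logit_link(mu):
    """Logit link: g(mu) = log(mu / (1 - mu)), for 0 < mu < 1."""
    return math.log(mu / (1.0 - mu))

def logit_link_inverse(eta):
    """Inverse of the logit link: mu = 1 / (1 + exp(-eta))."""
    return 1.0 / (1.0 + math.exp(-eta))

def is_strictly_increasing(link, grid):
    """Check that the link is strictly increasing on an ordered grid of means."""
    values = [link(m) for m in grid]
    return all(a < b for a, b in zip(values, values[1:]))

probabilities = [0.1, 0.3, 0.5, 0.7, 0.9]
logit_is_monotonic = is_strictly_increasing(logit_link, probabilities)
```

Because each link is monotonic it can be inverted, which is what allows a fitted linear predictor to be converted back into a fitted mean.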

Obtaining the predicted values from a simple GLM

specify the design matrix X and the vector of parameters B
choose the distribution for the response variable and the link function
identify the likelihood function
take the log to convert the product into a sum
maximise the log-likelihood function by taking partial derivatives with respect to each parameter, setting them to zero and solving the resulting system of equations
compute the predicted values
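
The steps above can be sketched for one simple case: a Poisson response with a log link, an intercept and a single explanatory variable. The data are made up, the function name is hypothetical, and Newton-Raphson is used here as one way of solving the score equations; it is not necessarily the method in the source text.

```python
import math

def fit_poisson_log_link(x, y, iterations=25):
    """Fit E[Y_i] = exp(b0 + b1 * x_i) by maximising the Poisson log-likelihood."""
    b0 = math.log(sum(y) / len(y))  # start from the intercept-only fit
    b1 = 0.0
    for _ in range(iterations):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Score vector: partial derivatives of the log-likelihood.
        s0 = sum(yi - mi for yi, mi in zip(y, mu))
        s1 = sum((yi - mi) * xi for xi, yi, mi in zip(x, y, mu))
        # Entries of the negative Hessian (the information matrix).
        a = sum(mu)
        b = sum(mi * xi for xi, mi in zip(x, mu))
        c = sum(mi * xi * xi for xi, mi in zip(x, mu))
        det = a * c - b * b
        # Newton step: solve the 2x2 system (information) * step = score.
        b0 += (c * s0 - b * s1) / det
        b1 += (a * s1 - b * s0) / det
    return b0, b1

x = [0, 1, 2, 3, 4]    # made-up explanatory variable
y = [1, 2, 4, 7, 12]   # made-up observed counts
b0, b1 = fit_poisson_log_link(x, y)

# Predicted values: apply the inverse link to the linear predictor.
fitted = [math.exp(b0 + b1 * xi) for xi in x]
```

At the maximum the partial derivatives are zero, so with an intercept in the model the fitted values sum to the observed total.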

Degrees of freedom

number of observations less the number of parameters

Deviance formula

Compares observed value Y to fitted value μ, with allowance for weights
pg.649
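
The general formula on the cited page is not reproduced here. As one concrete instance, the sketch below computes the unscaled deviance for a Poisson response; the choice of the Poisson distribution and the function name are assumptions made for illustration.

```python
import math

def poisson_deviance(observed, fitted, weights=None):
    """Unscaled Poisson deviance: 2 * sum of w * (y * log(y / mu) - (y - mu))."""
    if weights is None:
        weights = [1.0] * len(observed)
    total = 0.0
    for y, mu, w in zip(observed, fitted, weights):
        # y * log(y / mu) is taken as 0 when y == 0 (its limiting value).
        log_term = y * math.log(y / mu) if y > 0 else 0.0
        total += w * (log_term - (y - mu))
    return 2.0 * total

perfect_fit = poisson_deviance([1, 2, 3], [1.0, 2.0, 3.0])
imperfect_fit = poisson_deviance([0, 2], [1.0, 1.0])
```

The deviance is zero when every fitted value equals its observation and grows as the fitted values move away from the data.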

Nested models

Two models are nested if one model contains explanatory variables that are a subset of the explanatory variables in the other model.

How to compare two nested models

chi-square test for the change in scaled deviance
this measures whether the inclusion of one or more additional explanatory variables in a model improves the model fit significantly
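
A minimal sketch of this test, assuming the scaled deviances of the two nested models are already known. The deviance figures are made up and the function names are hypothetical. The chi-square tail probability is computed from standard finite-sum formulas for integer degrees of freedom so that only the standard library is needed.

```python
import math

def chi_square_sf(x, df):
    """Upper-tail probability P(X > x) for X ~ chi-square(df), df a positive integer."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) times a finite sum of (x/2)^k / k!.
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    # Odd df: normal tail plus a finite correction sum.
    root = math.sqrt(x)
    term, total = root, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    return math.erfc(root / math.sqrt(2.0)) + math.sqrt(2.0 / math.pi) * math.exp(-half) * total

def compare_nested(scaled_deviance_small, scaled_deviance_large, extra_parameters, alpha=0.05):
    """Test whether the larger model's extra parameters improve the fit significantly."""
    change = scaled_deviance_small - scaled_deviance_large
    p_value = chi_square_sf(change, extra_parameters)
    return change, p_value, p_value < alpha

# Made-up scaled deviances: the larger model has 2 extra parameters.
change, p_value, significant = compare_nested(120.4, 112.1, 2)
```

The degrees of freedom of the test equal the number of additional parameters in the larger model.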

F statistics

where the scale parameter for the model is unknown (e.g. for the gamma distribution), it has to be estimated
the estimator of the scale parameter follows a (scaled) chi-squared distribution
the ratio of the change in deviance (divided by the number of additional parameters) to the estimated scale parameter follows an F distribution

How to compare models that are not nested?

AIC = -2 × log-likelihood + 2 × number of parameters

looks at the tradeoff of the likelihood of a model against the number of parameters; the lower the AIC, the better the model. If two models fit the data equally well in terms of the log-likelihood, then the model with fewer parameters is better.
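
A short sketch of the AIC comparison, using made-up log-likelihoods and parameter counts for two hypothetical models that fit equally well.

```python
def aic(log_likelihood, n_parameters):
    """Akaike Information Criterion: -2 * log-likelihood + 2 * number of parameters."""
    return -2.0 * log_likelihood + 2.0 * n_parameters

# Two made-up models with the same log-likelihood but different sizes.
aic_small = aic(log_likelihood=-100.0, n_parameters=3)
aic_large = aic(log_likelihood=-100.0, n_parameters=5)

# The model with the lower AIC is preferred.
preferred = "small" if aic_small < aic_large else "large"
```

With equal log-likelihoods the penalty term decides the comparison, so the model with fewer parameters is preferred.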

Use of CRLB

can be used to measure the uncertainty in the parameter estimators used in a GLM. A poorly defined parameter will have a large standard error.
standard errors can be found from the Hessian matrix (matrix of second derivatives of the log-likelihood)
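the variances of the parameter estimators are estimated by the diagonal entries of -G^(-1), where G is the Hessian matrix; the standard errors are their square roots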
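
A minimal sketch of this calculation for a model with two parameters. The Hessian entries are made up for illustration and the function name is hypothetical; the 2x2 inverse is written out by hand so that no external library is needed.

```python
import math

def standard_errors_2x2(hessian):
    """Standard errors from a 2x2 Hessian G of the log-likelihood.

    The estimated covariance matrix is -G^(-1); the standard errors are
    the square roots of its diagonal entries.
    """
    (a, b), (c, d) = hessian
    det = a * d - b * c
    # Inverse of G is (1 / det) * [[d, -b], [-c, a]]; negate it for the covariance.
    covariance = [[-d / det, b / det], [c / det, -a / det]]
    return [math.sqrt(covariance[0][0]), math.sqrt(covariance[1][1])]

# Made-up Hessian evaluated at the maximum likelihood estimates.
G = [[-26.0, -52.0], [-52.0, -156.0]]
se_b0, se_b1 = standard_errors_2x2(G)
```

A log-likelihood that is flat in the direction of a parameter gives a small (in magnitude) Hessian entry and hence a large standard error, which is what a poorly defined parameter looks like.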

Other ways to test significance

consider spread of relativity values for each level, combined with the standard errors at each level
comparison over time - analysis of claims frequency by factor by year will indicate whether claims frequencies have been stable over time. Can fit a model that includes interaction of a single factor with measure of time.
consistency checks with other factors e.g. age and region

Hat matrix

maps the vector of observed values to the vector of fitted values

ith leverage

ith diagonal element of the hat matrix (lies between 0 and 1)
measure of how much influence the ith observation has over its own fitted value
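
As an illustration, the sketch below computes leverages for the special case of a classic linear model with an intercept and one explanatory variable, where the diagonal of the hat matrix has a simple closed form. This special case is an assumption made to keep the example short (in a GLM the hat matrix also involves the fitting weights), and the data are made up.

```python
def leverages_simple_regression(x):
    """Diagonal of the hat matrix for y = b0 + b1 * x fitted by least squares.

    h_ii = 1 / n + (x_i - mean(x))^2 / sum_j (x_j - mean(x))^2
    """
    n = len(x)
    mean_x = sum(x) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return [1.0 / n + (xi - mean_x) ** 2 / sxx for xi in x]

x = [1, 2, 3, 4, 10]   # made-up data; the last point lies far from the rest
h = leverages_simple_regression(x)
```

The point furthest from the mean of x has the highest leverage, and the leverages sum to the number of parameters in the model (two here).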

Deviance residual

measure of distance between actual observation and fitted value
pg.656

Standardised Pearson residual

difference between observed response and predicted value, adjusted for the standard deviation of the predicted value and the leverage of the observed response
standardised Pearson residuals can be compared with one another
does not adjust for the shape of the distribution
pg.657

Testing appropriateness of models

deviance residuals
standardised Pearson residuals
residual plots
Cook's distance and leverage

Residual plots

If the distribution chosen for the response variable is appropriate, then the residual chart should show residuals that:
are symmetrical about the x-axis
have an average of zero
are fairly constant across the width of the fitted values

Cook's distance and leverage

data points with large residuals and/or high leverage may distort the outcome and accuracy of a regression model
Cook's distance is used to estimate the influence of a data point on the model results
a Cook's distance of more than 1 merits closer analysis

Model refinement

interactions
aliasing
restrictions
smoothing

Interactions

interactions relate to the effect that factors have upon the risk; an interaction term is necessary where the effects of two or more factors depend on each other
Complete interactions: a single factor represents every combination of the two factors
Marginal interactions: the single-factor effects of factors 1 and 2 are modelled, plus the additional effect of the interaction term over and above those single-factor effects

How to decide which interaction terms to test for inclusion in a GLM

analyse every possible combination of pairs and test each for statistical significance and reasonableness
look at the structure of the existing rating algorithm and see what interactions can be included without the need for IT support
use experience of the product and market
speak to underwriters and other experts to see whether there are any parts of the account where your rates might be out of line with the market

Aliasing

Aliasing occurs when there is a linear dependency among observed covariates (one covariate may be identical to some linear combination of other covariates).
Intrinsic aliasing: dependencies inherent in the definition of the covariates (most common when categorical variables are included)
Extrinsic aliasing: the dependency results from the nature of the data, rather than from the inherent properties of the covariates (e.g. pg.662)
Near aliasing: two or more factors contain levels that are almost, but not quite, perfectly correlated; convergence problems can arise
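
A small sketch of intrinsic aliasing, using a made-up categorical factor with three levels. If the design matrix contains an intercept column and an indicator column for every level, the indicator columns add up to the intercept column, so one of them is redundant. The variable names are hypothetical.

```python
levels = ["A", "B", "C"]
observations = ["A", "B", "C", "A", "C"]   # made-up factor values

# Intercept column plus one indicator (dummy) column per level.
intercept = [1] * len(observations)
dummies = {
    level: [1 if obs == level else 0 for obs in observations]
    for level in levels
}

# Linear dependency: the indicator columns sum to the intercept column.
row_sums = [sum(dummies[level][i] for level in levels) for i in range(len(observations))]
is_aliased = row_sums == intercept

# One common remedy: drop the indicator for a chosen base level.
base_level = "A"
design_columns = {"intercept": intercept}
design_columns.update({lvl: col for lvl, col in dummies.items() if lvl != base_level})
```

After the base level is dropped, the remaining parameters are measured relative to that level.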

How can factors be simplified (reduce granularity)?

1. grouping and summarising the data prior to loading
- carried out in order to clean the data and prevent anomalies, rather than to smooth results
- requires knowledge of the expected pattern
2. grouping in the modelling package
- simply assigns a single parameter to represent the relativity for multiple levels of a factor
- a factor where two or more levels have been grouped together is called a custom factor; the GLM would only calculate one parameter estimate per group of levels
- a factor that has not been grouped is a simple factor; parameter estimates are calculated for each level

Restrictions

when the use of certain factors is restricted, the model may be able to compensate by adjusting the fitted relativities for correlated factors
this is achieved using the offset term: the known, predetermined values of the parameters corresponding to the restricted factor are added to the offset term
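
A minimal sketch of how an offset carries a restricted factor, assuming a Poisson response with a log link and an intercept as the only free parameter. The relativities and claim counts are made up and the function name is hypothetical. With a log link, the fixed relativities enter the linear predictor as the log of each relativity.

```python
import math

def fit_intercept_with_offset(y, offsets):
    """Maximum likelihood intercept for E[Y_i] = exp(b0 + offset_i), Poisson response.

    Setting the derivative of the log-likelihood to zero gives
    exp(b0) = sum(y) / sum(exp(offset_i)).
    """
    return math.log(sum(y) / sum(math.exp(o) for o in offsets))

# Restricted factor: relativities fixed in advance, not estimated by the model.
fixed_relativities = [1.0, 1.5, 1.0, 1.5]
offsets = [math.log(r) for r in fixed_relativities]

y = [2, 3, 4, 6]   # made-up observed counts
b0 = fit_intercept_with_offset(y, offsets)
fitted = [math.exp(b0 + o) for o in offsets]
```

Only the intercept is estimated; the fitted values keep the predetermined relativity of 1.5 between the two levels of the restricted factor.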
