Chapter 18 - GLM Flashcards

1
Q

Explanatory variables

A
  • inputs into the model that are expected to influence the response variable
  • choice of explanatory variables depends on the purpose of the model
2
Q

Response variables

A
  • outputs from the model that are likely to be affected by explanatory variables
3
Q

Categorical variables

A
  • explanatory variables
  • aka factors
  • values of each level are distinct
  • often cannot be given a natural ordering or score
  • continuous numerical variables (e.g. age) are often banded into groups and treated as categorical
4
Q

Non-categorical variables

A
  • can take numerical values
5
Q

Interaction terms

A
  • included where pattern of response variable is better modelled by including parameters for each combination of two or more factors
6
Q

What does a GLM do?

A

A GLM unpicks relationships and produces estimates of the true values of the relativities. It does this by taking account of correlations and allowing for investigation of any interactions between variables in the model

7
Q

Assumptions of classic linear model

A
  • the error terms are independent and come from a normal distribution
  • the mean is a linear combination of the explanatory variables
  • the error terms have constant variance
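As a reminder (standard notation, not taken from the card), these assumptions can be written as:

```latex
Y_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip} + \varepsilon_i,
\qquad \varepsilon_i \sim N(0, \sigma^2) \ \text{independently}
```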
8
Q

Can estimate the parameters β0, β1, β2 using the method of maximum likelihood

A

pg.635

9
Q

Drawbacks of the normal model for multiple linear regression

A
  • assumes that the response variable has a normal distribution which may not be appropriate for the variable being modelled
  • the normal distribution has a constant variance which may not be appropriate for the variable being modelled
  • adds together the effects of different explanatory variables, but this is seldom what is observed in practice
  • with more than two explanatory variables, a manual solution becomes increasingly long-winded
10
Q

How do GLMs address these problems?

A
  • the response variable can take any distribution from the exponential family
  • a link function is introduced which acts to remove the assumption that the effects of different variables must simply be added together
  • allow an offset term to be included within the linear predictor
11
Q

GLM form

A

Pg. 639
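The page reference is kept as-is; for convenience, the usual statement of the GLM form in standard notation (not taken from the page) is:

```latex
Y_i \sim \text{a member of the exponential family}, \qquad
g\big(E[Y_i]\big) = \eta_i = \sum_j X_{ij}\beta_j + \xi_i
```

where g is the link function, η_i is the linear predictor and ξ_i is an offset term.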

12
Q

Properties of members of the exponential family

A
  1. the distribution is completely specified in terms of its mean and variance
  2. the variance of Yi is a function of its mean
13
Q

Requirements for link function

A
  • differentiable
  • monotonic

14
Q

Obtaining the predicted values from a simple GLM

A
  1. specify the design matrix X and the vector of parameters β
  2. choose the distribution for the response variable and the link function
  3. identify the likelihood function
  4. take the log to convert the product into a sum
  5. maximise the log-likelihood by taking partial derivatives with respect to each parameter, setting them to zero and solving the resulting system of equations
  6. compute the predicted values (a sketch follows below)
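A minimal sketch of these steps in Python, assuming a Poisson response with a log link, invented data, and scipy available; the numerical optimiser stands in for solving the score equations by hand:

```python
# A minimal sketch (invented data) of steps 1-6 for a Poisson GLM with a
# log link, fitted by maximising the log-likelihood numerically with scipy.
import numpy as np
from scipy.optimize import minimize

# 1. design matrix X (intercept plus one two-level factor) and parameters beta
X = np.array([[1.0, 0.0],
              [1.0, 0.0],
              [1.0, 1.0],
              [1.0, 1.0]])
y = np.array([2.0, 3.0, 5.0, 7.0])          # observed claim counts (made up)

# 2.-4. Poisson response, log link; log-likelihood up to an additive constant
def neg_log_lik(beta):
    mu = np.exp(X @ beta)                   # inverse link applied to the linear predictor
    return -np.sum(y * np.log(mu) - mu)     # minus the Poisson log-likelihood, constants dropped

# 5. maximise the log-likelihood (here by minimising its negative)
beta_hat = minimize(neg_log_lik, x0=np.zeros(2)).x

# 6. predicted values
mu_hat = np.exp(X @ beta_hat)
print(beta_hat, mu_hat)
```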
15
Q

Degrees of freedom

A

number of observations less the number of parameters

16
Q

Deviance formula

A

Compares the observed value Y to the fitted value μ, with allowance for weights
pg.649
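The page reference is retained; as a hedged reminder (standard results, not taken from the page), the scaled deviance is twice the log-likelihood gap to the saturated model, and for a Poisson response the deviance with weights ω_i is:

```latex
D^{*} = 2\left(\ell_{\text{saturated}} - \ell_{\text{model}}\right), \qquad
D_{\text{Poisson}} = 2\sum_i \omega_i\left[Y_i \ln\frac{Y_i}{\mu_i} - (Y_i - \mu_i)\right]
```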

17
Q

Nested models

A

Two models are nested if one model contains explanatory variables that are a subset of the explanatory variables in the other model.

18
Q

How to compare two nested models

A
  • chi-square test for the change in scaled deviance
  • this measures whether the inclusion of one or more additional explanatory variables in a model improves the model fit significantly
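In symbols (standard result, assuming model 2 contains the extra explanatory variables and the scale parameter is known):

```latex
D^{*}_{1} - D^{*}_{2} \;\sim\; \chi^{2}_{\,df_{1} - df_{2}}
```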
19
Q

F statistics

A
  • in the case where the scale parameter for the model is unknown (e.g. for a gamma model) it has to be estimated
  • the estimate of the scale parameter is chi-square distributed
  • the ratio of the change in deviance to the estimated scale parameter follows an F distribution
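In symbols (standard form, with φ̂ denoting the estimated scale parameter):

```latex
F = \frac{(D_{1} - D_{2}) / (df_{1} - df_{2})}{\hat{\varphi}}
\;\sim\; F_{\,df_{1} - df_{2},\; df_{2}}
```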
20
Q

How to compare models that are not nested?

A

AIC = -2 × log-likelihood + 2 × number of parameters

looks at the tradeoff of the likelihood of a model against the number of parameters; the lower the AIC, the better the model. If two models fit the data equally well in terms of the log-likelihood, then the model with fewer parameters is better.

21
Q

Use of CRLB

A
  • can be used to measure the uncertainty in the parameter estimators used in a GLM. A poorly defined parameter will have a large standard error.
  • standard errors can be found from the Hessian matrix (the matrix of second derivatives of the log-likelihood)
  • the standard errors are the square roots of the diagonal entries of the inverse of the negative Hessian (see the sketch below)
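A minimal sketch (not from the course notes), assuming numpy and an illustrative Hessian H of the log-likelihood evaluated at the fitted parameters:

```python
# Extracting standard errors from the Hessian of the log-likelihood.
import numpy as np

H = np.array([[-40.0,  -8.0],        # illustrative Hessian at the fitted parameters
              [ -8.0, -12.0]])

cov = np.linalg.inv(-H)              # inverse of the negative Hessian approximates the covariance matrix
std_errors = np.sqrt(np.diag(cov))   # standard errors are the square roots of the diagonal entries
print(std_errors)
```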
22
Q

Other ways to test significance

A
  • consider spread of relativity values for each level, combined with the standard errors at each level
  • comparison over time - analysis of claims frequency by factor by year will indicate whether claims frequencies have been stable over time. Can fit a model that includes interaction of a single factor with measure of time.
  • consistency checks with other factors e.g. age and region
23
Q

Hat matrix

A

maps the vector of observed values to the vector of fitted values
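For the normal linear model this can be written as (standard result; GLMs use a weighted analogue such as X(XᵀWX)⁻¹XᵀW):

```latex
\hat{\mathbf{y}} = H\mathbf{y}, \qquad H = X\left(X^{\mathsf{T}}X\right)^{-1}X^{\mathsf{T}}
```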

24
Q

ith leverage

A
  • ith diagonal element of the hat matrix (lies between 0 and 1)
  • measure of how much influence the ith observation has over its own fitted value
25
Q

deviance residual

A
  • measure of the distance between the actual observation and the fitted value
  • pg.656
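In symbols (standard form, with d_i denoting observation i's contribution to the deviance):

```latex
r^{D}_{i} = \operatorname{sign}(Y_i - \hat{\mu}_i)\,\sqrt{d_i}
```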
26
Q

standardised pearson residual

A
  • difference between observed response and predicted value, adjusted for the standard deviation of the predicted value and the leverage of the observed response
  • can compare standardised Pearson residuals
  • does not adjust for the shape of the distribution
  • pg.657
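One common way of writing this (standard form, with V the variance function, φ̂ the estimated scale parameter and h_ii the leverage) is:

```latex
r^{P}_{i} = \frac{Y_i - \hat{\mu}_i}{\sqrt{\hat{\varphi}\,V(\hat{\mu}_i)\,(1 - h_{ii})}}
```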
27
Q

Testing appropriateness of models

A
  • deviance residual
  • standardised Pearson residuals
  • residual plots
  • Cook’s distance and leverage
28
Q

Residual plots

A

If the distribution chosen for the response variable is appropriate, then the residual chart should show residuals that:

  • are symmetrical about the x-axis
  • have an average residual of zero
  • have a fairly constant spread across the width of the fitted values (a sketch follows below)
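A minimal plotting sketch, assuming matplotlib and using made-up fitted values and residuals purely to illustrate the shape of the check:

```python
# Deviance residuals plotted against fitted values should look symmetric
# about the x-axis with a roughly constant spread.
import numpy as np
import matplotlib.pyplot as plt

fitted = np.linspace(1.0, 10.0, 50)                      # illustrative fitted values
residuals = np.random.default_rng(1).normal(0, 1, 50)    # illustrative residuals

plt.scatter(fitted, residuals)
plt.axhline(0.0, linestyle="--")
plt.xlabel("Fitted values")
plt.ylabel("Deviance residuals")
plt.show()
```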
29
Q

Cook’s distance and leverage

A
  • data points with large residuals and/or high leverage may distort the outcome/accuracy of a regression model
  • Cook's distance is used to estimate the influence of a data point on the model results
  • a Cook's distance of more than 1 merits closer analysis
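One common form for the linear model (standard result, with r_i the standardised residual, p the number of parameters and h_ii the leverage):

```latex
D_i = \frac{r_i^{2}}{p}\cdot\frac{h_{ii}}{1 - h_{ii}}
```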
30
Q

Model refinement

A
  • interactions
  • aliasing
  • restrictions
  • smoothing
31
Q

Interactions

A
  • interactions relate to the effect that factors have upon the risk. An interaction is needed where the effects of two or more factors depend on each other
  • Complete interactions:
    one way of representing an interaction is to consider a single factor representing every combination of the two factors
  • Marginal interactions:
    consider the single-factor effects of the two factors plus an additional effect for the interaction term over and above the single-factor effects
32
Q

How to decide which interaction terms to test for inclusion in a GLM

A
  • analyse every possible combination of pairs and test each for statistical significance and reasonableness
  • look at structure of existing rating algorithm and see what interactions can be included without need for IT support
  • use experience of the product and market
  • speak to underwriters and other experts to see whether there are any parts of the account where your rates might be out of line with the market
33
Q

Aliasing

A

Aliasing occurs when there is a linear dependency among observed covariates (one covariate may be identical to some linear combination of other covariates)

Intrinsic aliasing
  • dependencies inherent in the definition of the covariates (most common when categorical variables are included)

Extrinsic aliasing
  • dependency results from the nature of the data, rather than from the inherent properties of the covariates
  • e.g. pg.662

Near aliasing

  • two or more factors contain levels that are almost, but not quite, perfectly correlated
  • convergence problems can arise
34
Q

How can factors be simplified (reduce granularity)?

A
  1. grouping and summarising the data prior to loading
    - carried out in order to clean the data and prevent anomalies, rather than to smooth results
    - requires knowledge of expected pattern
  2. grouping in the modelling package
    - simply assigns a single parameter to represent the relativity for multiple levels of a factor
    - a factor where two or more levels have been grouped together is called a custom factor. The GLM would only calculate parameter estimates for the grouped levels.
    - not grouped: simple factor. Parameter estimates are calculated for each level.
35
Q

Restrictions

A
  • when the use of certain factors is restricted, the model may be able to compensate by adjusting the fitted relativities for correlated factors. This is achieved using the offset term
  • the known, predetermined values of the parameters corresponding to this factor are added to the offset term.
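A minimal sketch of a restriction via the offset, assuming the statsmodels package and invented data; the known log-relativities of the restricted factor are passed as the offset, so only the free factor is fitted around them:

```python
# Restricting a factor: its predetermined log-relativities enter through the
# offset term, so the GLM only fits the remaining, unrestricted factor.
import numpy as np
import statsmodels.api as sm

y = np.array([3.0, 1.0, 4.0, 2.0])                           # claim counts (made up)
X = sm.add_constant(np.array([0.0, 1.0, 0.0, 1.0]))          # free factor being fitted
known_log_relativities = np.array([0.10, 0.10, 0.25, 0.25])  # restricted factor, fixed in advance

model = sm.GLM(y, X, family=sm.families.Poisson(), offset=known_log_relativities)
result = model.fit()
print(result.params)                                         # parameters for the free factor only
```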