CH 18 (WM) Flashcards
List the assumptions of classical linear models. [1.75]
- error terms are independent and come from a normal distribution ✓✓
- the error terms have constant variance✓✓ (or homoscedasticity) ✓
- the mean is a linear combination of the explanatory variables ✓✓
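In symbols, the three assumptions above can be written compactly as below. The notation ($Y_i$ for the response, $x_{ij}$ for the explanatory variables, $\beta_j$ for the coefficients, $\varepsilon_i$ for the error terms and $\sigma^2$ for their variance) is an assumption made for illustration and does not come from the flashcards.

```latex
Y_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} + \varepsilon_i,
\qquad
\varepsilon_i \overset{\text{iid}}{\sim} N(0, \sigma^2)
```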
What are the drawbacks of the normal model for multiple linear regression? [2]
- it assumes the response variable has a normal distribution ✓✓
- the normal distribution has a constant variance which may not be appropriate ✓✓
- it adds together the effects of different explanatory variables, but this is often not what is observed ✓✓
- it becomes long-winded with more than two explanatory variables ✓✓
Define the term “explanatory variables”. [1.5]
Explanatory variables are inputs into a model that are expected to influence the response variable.✓✓
In a pricing context, the explanatory variables would be rating factors.✓✓
It is important that explanatory variables make intuitive sense.✓✓
Define the term “response variables”. [1]
Response variables are outputs from the model that are likely to be affected by the explanatory variables.✓✓
In an overall pricing context, the response variable would be the price.✓✓
Define the terms “categorical and non-categorical variables”, together with examples of each. [3]
Categorical variables are explanatory variables that are used for modelling where the values of each level are distinct✓✓, and often cannot be given any natural ordering or score✓✓. An example of this would be gender, which can take the value of male or female✓✓.
By contrast, non-categorical variables can take numerical values, eg age.✓✓
Categorical variables are sometimes referred to as factors.✓✓ The majority of explanatory variables used in practice within GLMs for insurance are factors.✓✓
What are meant by the levels of a categorical variable? [1]
The levels of a categorical variable are simply the distinct values that the variable can take.✓✓
So, if gender is a variable in a GLM and it can only take the values “male” or “female”, then gender would be said to have two levels.✓✓
Explain how continuous numerical variables like age can be treated. [1.5]
Often, continuous numerical variables like age can be treated as categorical variables.✓✓
For example, if the “age of policyholder” variable was grouped into age bands (of 5 years for example), the new variable “age band” would be a categorical variable✓✓. This is because each such band is effectively a discrete category, ie a level of a categorical variable✓✓.
Memory aid: the term “categorical variable” appears in every sentence of this answer.
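As a minimal sketch of the banding idea, the snippet below maps exact ages to 5-year bands and then lists the resulting levels. The function name `age_band`, the label format and the sample ages are all hypothetical choices for illustration.

```python
def age_band(age: int, width: int = 5) -> str:
    """Map an exact age to the label of its band, e.g. 32 -> '30-34'."""
    if age < 0:
        raise ValueError("age must be non-negative")
    lower = (age // width) * width   # first age in the band containing `age`
    upper = lower + width - 1        # last whole age in that band
    return f"{lower}-{upper}"


# Hypothetical policyholder ages, for illustration only.
ages = [23, 32, 34, 35, 61]
bands = [age_band(a) for a in ages]

# The distinct bands are the levels of the new categorical variable.
levels = sorted(set(bands))
```

Here ages 32 and 34 fall into the same band, so the five ages produce only four levels.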
List the various techniques used to analyse the significance of the explanatory variables used in a model. [1]
- The chi-squared test
- The F statistic
- The Akaike Information Criterion (“AIC”)
- Other
Explain what is meant by a nested model. [2.25]
Two models are nested if one model contains explanatory variables that are a subset of the explanatory variables in the other model.✓✓
For example, if Model 1 has linear predictor: a + bx ✓✓
and Model 2 has linear predictor: a + bx + cx^2 ✓✓
then Model 1 is a subset of Model 2 ✓✓, ie Models 1 and 2 are nested.✓
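The subset idea can be sketched as a simple set comparison. Representing each linear predictor as a set of term labels (with "1" standing for the intercept a) is an assumption made for illustration, as are the names `is_nested` and `model_3`.

```python
def is_nested(smaller_terms, larger_terms) -> bool:
    """True if every term of the smaller model also appears in the larger one."""
    return set(smaller_terms) <= set(larger_terms)


model_1 = {"1", "x"}          # a + bx
model_2 = {"1", "x", "x^2"}   # a + bx + cx^2
model_3 = {"1", "z"}          # hypothetical model using a different variable

nested_12 = is_nested(model_1, model_2)
# Models 1 and 3 are nested only if one contains the other.
nested_13 = is_nested(model_1, model_3) or is_nested(model_3, model_1)
```

Models 1 and 2 are nested, whereas Models 1 and 3 are not, because neither set of terms contains the other.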
Describe how you will apply the Chi-squared statistic to analyse the significance of the explanatory variables used in a model. [2]
If Models 1 and 2 are nested, then the change in scaled deviance follows a chi-squared distribution, ✓✓ ie:
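Formula: $D_1^* - D_2^* \sim \chi^2_{df_1 - df_2}$ ✓✓✓✓, where $D_1^*$ and $D_2^*$ are the scaled deviances of Models 1 and 2 and $df_1$ and $df_2$ are their degrees of freedom.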
This measures whether the inclusion of one or more additional explanatory variables in a model improves the model fit significantly.✓✓
Suppose that Model A and Model B are nested models with 6 and 10 parameters respectively.
The scaled deviance of Model A is 17.80 and for Model B is 11.08. Explain whether Model B is a significant improvement on Model A. [2]
(Question 18.10)
The difference in the scaled deviance is 6.72. ✓✓
The difference in the number of degrees of freedom is the same as the difference in the numbers of parameters in the models, ie 4. ✓✓
Since 6.72 < 9.488, the upper 5% point of the chi-squared distribution with 4 degrees of freedom, ✓✓
there is insufficient evidence at the 5% significance level to reject Model A in favour of Model B. ✓✓
(page 168 of the Tables.)
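The comparison above can be checked numerically. The sketch below computes the upper-tail probability directly instead of looking up the 5% point. It relies on a closed form for the chi-squared tail that holds only for an even number of degrees of freedom, which happens to apply here (4). The function name `chi2_sf_even_df` is hypothetical.

```python
import math


def chi2_sf_even_df(x: float, df: int) -> float:
    """P(X > x) for a chi-squared variable with an EVEN number of df.

    Uses exp(-x/2) * sum_{k < df/2} (x/2)^k / k!, valid only for even df.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive even df")
    half = x / 2.0
    partial_sum = sum(half ** k / math.factorial(k) for k in range(df // 2))
    return math.exp(-half) * partial_sum


dev_a, params_a = 17.80, 6    # Model A: scaled deviance, parameters
dev_b, params_b = 11.08, 10   # Model B: scaled deviance, parameters

change = dev_a - dev_b            # change in scaled deviance
df_diff = params_b - params_a     # difference in degrees of freedom
p_value = chi2_sf_even_df(change, df_diff)
significant = p_value < 0.05      # test at the 5% level
```

The p-value comes out at roughly 0.15, well above 0.05, which agrees with the conclusion that Model B is not a significant improvement.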
Describe how you will apply “F statistics” to analyse the significance of the explanatory variables used in a model. [3.25]
In cases where the scale parameter for the model is unknown, for example when using the gamma distribution, it has to be estimated.✓✓
The estimate of the scale parameter follows a chi-squared distribution.✓✓
The ratio of the change in the deviance to the scale parameter estimate follows an F distribution ✓✓, since the F distribution is the ratio of two chi-squared distributions ✓:
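Formula: $\dfrac{(D_1 - D_2)/(df_1 - df_2)}{D_2/df_2} \sim F_{df_1 - df_2,\, df_2}$ ✓✓✓✓, where $D_1$ and $D_2$ are the deviances of Models 1 and 2, $df_1$ and $df_2$ are their degrees of freedom, and $D_2/df_2$ is the estimate of the scale parameter.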
Note that the models need to be nested for this result to be valid.✓✓
Suppose Model C and Model D are nested models with 8 and 16 parameters respectively, and have been fitted to a set of 50 observations. The deviance for Model C is 40.89 and the deviance for Model D is 26.40. The scale parameter is unknown.
Explain whether Model D is a significant improvement on Model C. [3.25]
Question 18.11
The difference in the deviance is 14.49. ✓✓
The difference in the number of degrees of freedom is (50 – 8) – (50 – 16) = 8. ✓✓
The number of degrees of freedom in Model D is 34. ✓✓
So the value of the test statistic is: (14.49 / 8) / (26.40 / 34) = 2.33. ✓✓
From page 172 of the Tables, the upper 5% point of F(8,34) is 2.225. ✓✓
Since our test statistic exceeds this value✓✓, we reject Model C in favour of Model D. ✓
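The arithmetic in this answer can be reproduced as follows. The sketch assumes the scale parameter is estimated as the deviance of Model D divided by its degrees of freedom, which is consistent with the 2.33 quoted above. The critical value is the tabulated figure from the answer, not something the code computes.

```python
n_obs = 50
dev_c, params_c = 40.89, 8    # Model C: deviance, parameters
dev_d, params_d = 26.40, 16   # Model D: deviance, parameters

df_c = n_obs - params_c       # degrees of freedom of Model C
df_d = n_obs - params_d       # degrees of freedom of Model D
df_diff = df_c - df_d         # numerator degrees of freedom

# Assumed scale parameter estimate: deviance of the larger model over its df.
scale_estimate = dev_d / df_d

f_stat = ((dev_c - dev_d) / df_diff) / scale_estimate

critical_value = 2.225        # upper 5% point of F(8, 34), as quoted from the Tables
reject_c = f_stat > critical_value
```

Rounded to two decimal places, `f_stat` is 2.33, so `reject_c` is true and Model D is preferred.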
Describe how you will apply the Akaike Information Criterion (AIC) to analyse the significance of the explanatory variables used in a model. [3]
In cases where models are not nested, the AIC can be used to compare them.✓✓
The AIC for a model is calculated as: -2 × log-likelihood + 2 × number of parameters.✓✓
The AIC looks at the trade-off of the likelihood of a model against the number of parameters✓✓: the lower the AIC, the better the model.✓
For example, if two models fit the data equally well in terms of the log-likelihood ✓✓, then the model with fewer parameters is the more parsimonious✓✓, ie simpler, (and therefore “better”).✓
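A minimal sketch of the parsimony example: two models with the same log-likelihood but different numbers of parameters. The log-likelihood of -120 and the parameter counts 4 and 7 are hypothetical values chosen only to illustrate the calculation.

```python
def aic(log_likelihood: float, n_params: int) -> float:
    """AIC = -2 x log-likelihood + 2 x number of parameters."""
    return -2.0 * log_likelihood + 2.0 * n_params


# Hypothetical fits: equal log-likelihoods, different parameter counts.
aic_small = aic(-120.0, 4)
aic_large = aic(-120.0, 7)

# The lower AIC identifies the preferred model.
preferred = "small" if aic_small < aic_large else "large"
```

With equal log-likelihoods the only difference is the penalty term, so the model with fewer parameters has the lower AIC.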
Define the term “generalised linear model” (“GLM”). [2.75]
A generalised linear model (GLM) is a flexible generalisation of linear regression.✓✓
Generalised linear models are used to assess and quantify the relationship between a response variable and a set of possible explanatory variables.✓✓
For example, a GLM can be used to model the behaviour of a random variable✓ that is believed to depend on the values of several characteristics, eg age, gender and chronic condition✓✓.
These kinds of models can be used in a number of applications for private medical insurance✓ including risk modelling, pricing, financial projections and overall modelling of the business.✓✓✓✓