Chapter 14 (GLM) Flashcards
Explanatory variables.
These are inputs to the model that affect the response variables.
In a pricing context, the explanatory variables would typically be rating factors.
Response variables.
These are outputs from the model that are likely to be affected by the explanatory variables.
In an overall pricing context, the response variable would be the price.
Categorical variables
These are explanatory variables that are used for modelling where the values of each level are distinct, and often cannot be given any natural ordering or score. An example of this would be chronic status, which can take the value of yes or no.
Non-categorical variables
These are explanatory variables that can take numerical values, for example, age.
Interaction term
This is used where the pattern in the response variable is better modelled by including extra parameters for each combination of two or more factors. An interaction exists when the effect of one factor varies depending on the value of another factor.
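As a rough illustration of what an interaction looks like in data, the sketch below uses made-up claim frequencies for two factors (age band and chronic status; every figure and name is hypothetical) and checks whether the old/young relativity is the same at each level of chronic status.

```python
# Hypothetical claim frequencies per policy-year, for illustration only.
no_interaction = {
    ("young", "not chronic"): 0.10,
    ("old", "not chronic"): 0.20,
    ("young", "chronic"): 0.30,
    ("old", "chronic"): 0.60,
}
with_interaction = {
    ("young", "not chronic"): 0.10,
    ("old", "not chronic"): 0.20,
    ("young", "chronic"): 0.30,
    ("old", "chronic"): 0.90,
}


def age_relativities(freq):
    """Return the old/young frequency ratio at each level of chronic status."""
    return {
        status: freq[("old", status)] / freq[("young", status)]
        for status in ("not chronic", "chronic")
    }


def has_interaction(freq, tol=1e-9):
    """True if the age effect differs between the chronic-status levels."""
    rel = age_relativities(freq)
    return abs(rel["not chronic"] - rel["chronic"]) > tol


print(age_relativities(no_interaction), has_interaction(no_interaction))
print(age_relativities(with_interaction), has_interaction(with_interaction))
```

In the second table the age effect is larger for chronic lives than for non-chronic lives, so one age parameter and one chronic parameter cannot reproduce all four cells; an extra parameter for the combination would be needed.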
Describe a one-way analysis and state its shortcomings.
This is where GLMs are used to look at the effect on frequency and severity of each rating factor separately. Note that a one-way analysis ignores correlations and interaction effects between variables, for example, age and disease, age and family size, or maternity and gender. As a result, the model may underestimate or double count the effects of variables.
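The double-counting problem can be sketched with a small made-up portfolio. The exposures and frequencies below are assumptions chosen for illustration: chronic status doubles the claim frequency, age has no true effect, but older lives are far more likely to be chronic.

```python
# Hypothetical exposure (policy-years) by (age band, chronic status).
exposure = {
    ("young", "no"): 900,
    ("young", "yes"): 100,
    ("old", "no"): 200,
    ("old", "yes"): 800,
}

# Assumed "true" frequencies: chronic status doubles the frequency and
# age has no effect at all.
BASE_FREQUENCY = 0.10
CHRONIC_FACTOR = 2.0


def true_frequency(chronic):
    """Claim frequency for a cell, which depends on chronic status only."""
    return BASE_FREQUENCY * (CHRONIC_FACTOR if chronic == "yes" else 1.0)


def one_way_frequency(position, level):
    """Claim frequency for one level of one factor, ignoring the other factor.

    position is 0 for age band and 1 for chronic status.
    """
    cells = [key for key in exposure if key[position] == level]
    claims = sum(exposure[key] * true_frequency(key[1]) for key in cells)
    return claims / sum(exposure[key] for key in cells)


age_relativity = one_way_frequency(0, "old") / one_way_frequency(0, "young")
chronic_relativity = one_way_frequency(1, "yes") / one_way_frequency(1, "no")

print(round(age_relativity, 3))
print(round(chronic_relativity, 3))
print(round(age_relativity * chronic_relativity, 3))
```

Under these assumptions the one-way view shows an age relativity of about 1.64 even though age has no real effect, because the old group contains more chronic lives. Multiplying the two one-way relativities then gives more than 3 for an old chronic life, against a true relativity of 2.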
Link function
This removes the assumption that the effects of different variables must simply be added together. Instead, it defines a more complex yet appropriate relationship between the explanatory and response variables. It must be both differentiable and monotonic (either strictly increasing or strictly decreasing). Typical link functions include the log, logit, and identity functions. The log link function is of particular interest in pricing because its use results in a model where the effects of different rating factors are multiplied together.
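A minimal sketch of the three link functions named above, each paired with its inverse. The coefficients and factor names are hypothetical; the point is only that effects which add on the linear-predictor scale multiply on the response scale under a log link.

```python
import math

# Each entry is (link g, inverse link). g maps the mean to the linear
# predictor; the inverse maps the linear predictor back to the mean.
LINKS = {
    "identity": (lambda mu: mu, lambda eta: eta),
    "log": (math.log, math.exp),
    "logit": (
        lambda mu: math.log(mu / (1 - mu)),
        lambda eta: 1 / (1 + math.exp(-eta)),
    ),
}

# Hypothetical coefficients on the linear-predictor scale.
COEFFICIENTS = {"intercept": -2.0, "smoker": 0.5, "urban": 0.3}


def predicted_mean(link_name, smoker, urban):
    """Add up the linear predictor, then apply the inverse link."""
    eta = COEFFICIENTS["intercept"]
    if smoker:
        eta += COEFFICIENTS["smoker"]
    if urban:
        eta += COEFFICIENTS["urban"]
    _, inverse_link = LINKS[link_name]
    return inverse_link(eta)


base = predicted_mean("log", smoker=False, urban=False)
both = predicted_mean("log", smoker=True, urban=True)

# Under the log link, the combined relativity is the product of the
# individual relativities exp(0.5) and exp(0.3).
print(both / base, math.exp(0.5) * math.exp(0.3))
```

With the identity link the same coefficients would simply be added to the mean, and with the logit link the fitted mean always lies strictly between 0 and 1.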
Maximum likelihood estimation (MLE)
This refers to the statistical approach of estimating population parameters such as the mean and variance. The approach uses sample data and finds the parameter values that maximise the probability (the likelihood) of obtaining the observed data.
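As a small illustration, the sketch below estimates a Poisson claim rate by maximum likelihood using a simple grid search. The claim counts are made up, and the grid search is only for demonstration; for the Poisson distribution the maximum likelihood estimate is known to equal the sample mean.

```python
import math

# Hypothetical claim counts for eight policies.
claim_counts = [0, 1, 0, 2, 1, 0, 3, 1]


def poisson_log_likelihood(rate, counts):
    """Log of the probability of observing `counts` under Poisson(rate)."""
    return sum(x * math.log(rate) - rate - math.lgamma(x + 1) for x in counts)


# Candidate rates 0.01, 0.02, ..., 3.00.
candidate_rates = [i / 100 for i in range(1, 301)]

# The MLE is the candidate that makes the observed data most probable.
mle_rate = max(
    candidate_rates,
    key=lambda rate: poisson_log_likelihood(rate, claim_counts),
)
sample_mean = sum(claim_counts) / len(claim_counts)

print(mle_rate, sample_mean)
```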
GLM vs linear regression
A GLM is a generalised form of the linear regression model. It is more flexible than linear regression because it can still be used when the response variable is not continuous or is bounded. Furthermore, a GLM allows unconstrained inputs (inputs that can take any value) to affect the response variable.
Assumptions of classical linear models
Assumptions of classical linear models include:
• The error terms are independent and come from a normal distribution.
• The mean is a linear combination of the explanatory variables.
• The error terms have constant variance (homoscedasticity).
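Written out, these assumptions correspond to the usual form of the classical linear model. The symbols below are standard textbook notation rather than anything taken from the flashcards: Y_i is the response for observation i, x_{ij} are the explanatory variables, and \varepsilon_i is the error term.

```latex
Y_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip} + \varepsilon_i,
\qquad
\varepsilon_i \overset{\text{iid}}{\sim} N(0, \sigma^2),
\qquad
\mathrm{E}[Y_i] = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip}
```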
The normal model for multiple linear regression has a number of drawbacks.
• It assumes that the response variable, Y, has a normal distribution, which may not be appropriate for the variable being modelled.
• The normal distribution has a constant variance, which may not be appropriate for the variable being modelled. For example, the variance of claim numbers tends to increase as the expected value increases. The Poisson distribution has this property, so would be a more sensible choice.
• The normal model for multiple linear regression adds together the effects of the different explanatory variables, but this is seldom what is observed in practice. For example, the effects of ‘age’ and ‘family size’ might be multiplicative, rather than additive: if large families have three times as many claims as small ones, and old people have four times as many claims as young people, it might be expected that the combination of ‘old’ and ‘large family size’ results in 12 times as many claims. Indeed, this is often what is observed in practice.
• With more than two explanatory variables, a manual solution becomes increasingly long-winded.
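The age and family-size example can be worked through numerically. The sketch below takes the base level of claims (young, small family) as 1 unit, which is an assumption made purely for illustration, and compares what an additive and a multiplicative structure would predict for the 'old' and 'large family size' combination.

```python
BASE_CLAIMS = 1.0          # young, small family (hypothetical unit)
LARGE_FAMILY_RATIO = 3.0   # large families have three times as many claims
OLD_AGE_RATIO = 4.0        # old people have four times as many claims

# Additive structure: each factor adds its own excess over the base level.
additive_prediction = (
    BASE_CLAIMS
    + (LARGE_FAMILY_RATIO - 1) * BASE_CLAIMS
    + (OLD_AGE_RATIO - 1) * BASE_CLAIMS
)

# Multiplicative structure: the ratios are multiplied together.
multiplicative_prediction = BASE_CLAIMS * LARGE_FAMILY_RATIO * OLD_AGE_RATIO

print(additive_prediction)
print(multiplicative_prediction)
```

On these figures the additive structure predicts 6 times the base level of claims for the combination, whereas the multiplicative structure gives the 12 times quoted above.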
Chapter question 6
Explain whether a normal distribution would be appropriate for modelling claims costs per policyholder per month for a PMI contract.
Chapter solution 6
There is likely to be a large number of policyholders with zero or very small claims and a small number of people with very large claims, i.e. the true distribution will be positively skewed. The normal distribution does not have this property. A normal distribution can also take negative values, which would be inappropriate.
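Both points in the solution can be seen on a toy data set. The claim costs below are invented for illustration, and fitting a normal distribution by matching the sample mean and standard deviation is just one simple way of showing the problem.

```python
from statistics import NormalDist, mean, median

# Hypothetical monthly claim costs for 100 policyholders: most claim nothing,
# a few make small claims and a couple make very large ones.
claim_costs = [0] * 90 + [50] * 8 + [5000] * 2

# Positive skew: the mean is pulled well above the median by the large claims.
print(mean(claim_costs), median(claim_costs))

# Fit a normal distribution using the sample mean and standard deviation,
# then ask how much probability it places on negative claim costs.
fitted = NormalDist.from_samples(claim_costs)
prob_negative = fitted.cdf(0)

print(round(prob_negative, 2))
```

On this data the fitted normal distribution places a large share of its probability on negative claim costs, which cannot occur, while giving no particular weight to the cluster of zero claims.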