Chapter 18 - GLM Flashcards
What are explanatory variables
They are inputs into a model that are expected to influence the response variable.
What are response variables
They are outputs from the model that are likely to be affected by the explanatory variables
What are categorical variables and non-categorical variables
Categorical variables are explanatory variables whose levels are distinct and often cannot be given any natural ordering or score, e.g. gender, which takes the value male or female.
Non-categorical variables take numerical values, e.g. age.
What are the drawbacks of using a normal model for linear regression
CLAN
- The normal distribution has a CONSTANT variance, which may not be appropriate for the variable being modelled
- With more than 2 explanatory variables, the model becomes LONG-winded to compute
- The normal model ADDS together the effects of the different explanatory variables, but this is seldom what is observed in practice
- It assumes that the response variable, Y, has a NORMAL distribution, which may not be appropriate for the variable being modelled
How do GLMs address the problems of the normal model for linear regression
- The response variable can take any distribution from the exponential family - it no longer has to take the normal distribution
- A link function is introduced - this acts to remove the assumption that the effects of different variables must simply be added together
What is the purpose of the link function
- The link function acts to remove the assumption that the effects of different variables must simply be added together
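As a minimal sketch of this point, assuming a log link and made-up coefficients (the names `beta_0`, `beta_age` and `beta_region` are hypothetical), the linear predictor is still a sum, but the fitted mean becomes a product of the factor effects:

```python
import math

# Hypothetical coefficients on the linear-predictor scale (illustrative values only)
beta_0 = math.log(100.0)      # base level
beta_age = math.log(1.2)      # effect of an age band
beta_region = math.log(0.9)   # effect of a region

def fitted_mean_log_link(*terms):
    """Invert a log link: mu = exp(sum of the linear-predictor terms)."""
    return math.exp(sum(terms))

# Additive on the log scale ...
mu = fitted_mean_log_link(beta_0, beta_age, beta_region)

# ... which is the same as multiplying the effects on the original scale
multiplicative = 100.0 * 1.2 * 0.9
```

With an identity link the same coefficients would simply be added, which is the behaviour of the normal model.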
What does the deviance of a model compare
It compares the observed value Y_i to the fitted value \mu_i
In essence, the deviance is a measure of how much the fitted values differ from the observations
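One common way to write this down, given here as a sketch rather than the definition used in any particular set of notes, is as twice the gap between the log-likelihood of the saturated model (where each fitted value equals its observation) and that of the fitted model. The Poisson case is shown as an example, writing y_i for the observed value and \hat{\mu}_i for the fitted value:

```latex
% Scaled deviance: saturated model versus fitted model
D^{*} = 2\left(\ell_{\text{saturated}} - \ell_{\text{model}}\right)

% Poisson example (dispersion parameter equal to 1)
D = 2\sum_{i}\left[\, y_i \log\!\left(\frac{y_i}{\hat{\mu}_i}\right) - \left(y_i - \hat{\mu}_i\right) \right]
```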
What does the chi-squared statistic measure
This measures whether the inclusion of one or more additional explanatory variables in a model improves the fit significantly
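A minimal sketch of such a test, assuming two nested models that differ by exactly one parameter and made-up scaled deviances. The difference in scaled deviance is compared with a chi-squared distribution whose degrees of freedom equal the number of extra parameters. The helper `chi2_sf_1df` is a hypothetical name and only handles the 1 degree of freedom case:

```python
import math

def chi2_sf_1df(x):
    """P(X > x) for a chi-squared variable with 1 degree of freedom.

    Uses the identity P(X > x) = P(|Z| > sqrt(x)) for a standard normal Z.
    """
    return math.erfc(math.sqrt(x / 2.0))

# Hypothetical scaled deviances (illustrative values only)
deviance_small = 120.0   # model without the extra explanatory variable
deviance_large = 113.0   # model with one extra parameter

test_statistic = deviance_small - deviance_large
p_value = chi2_sf_1df(test_statistic)

# A small p-value suggests the extra variable improves the fit significantly
significant_at_5pct = p_value < 0.05
```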
What can be used to measure the uncertainty in the parameter estimators used in a GLM
The Cramér-Rao lower bound, which gives the asymptotic variance of the maximum likelihood estimators
What are deviance residuals
They are a measure of the distance between the actual observations and the fitted values
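As a sketch under stated assumptions (a Poisson model, made-up observed and fitted values, hypothetical function names), the deviance residual is the signed square root of each observation's contribution to the deviance:

```python
import math

def poisson_unit_deviance(y, mu):
    """Contribution of one observation to the Poisson deviance."""
    # y * log(y / mu) is taken as 0 when y == 0 (its limiting value)
    log_term = y * math.log(y / mu) if y > 0 else 0.0
    return 2.0 * (log_term - (y - mu))

def deviance_residual(y, mu):
    """sign(y - mu) * sqrt(unit deviance)."""
    d = max(poisson_unit_deviance(y, mu), 0.0)  # guard against tiny negative rounding
    return math.copysign(math.sqrt(d), y - mu)

# Hypothetical data (illustrative values only)
observed = [0, 2, 5, 1]
fitted = [0.5, 2.0, 3.5, 1.5]

residuals = [deviance_residual(y, mu) for y, mu in zip(observed, fitted)]
deviance = sum(poisson_unit_deviance(y, mu) for y, mu in zip(observed, fitted))
```

The squared residuals add up to the total deviance, so a large residual flags an observation that contributes heavily to the lack of fit.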
What are standardised Pearson residuals
A standardised Pearson residual is the difference between the observed response and the predicted value, adjusted for the standard deviation of the predicted value and the leverage of the observed response
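One common form of this residual, shown as a sketch that assumes unit prior weights, uses y_i for the observed response, \hat{\mu}_i for the predicted value, V(\cdot) for the variance function, \hat{\phi} for the estimated dispersion parameter and h_i for the leverage:

```latex
r_i = \frac{y_i - \hat{\mu}_i}{\sqrt{\hat{\phi}\, V(\hat{\mu}_i)\,(1 - h_i)}}
```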
What is Cook’s distance used for
It is used to estimate the influence of a data point on the model results
A Cook’s distance of 1 or more is generally considered to merit closer examination in the analysis
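A minimal sketch of how the distance combines residual size and leverage, using the standard formula for a linear model (for a GLM an analogous approximation is usually based on the standardised Pearson residual). The function name and input values are hypothetical:

```python
def cooks_distance(std_residual, leverage, n_params):
    """Cook's distance from a standardised residual, a leverage and the parameter count."""
    if not 0.0 <= leverage < 1.0:
        raise ValueError("leverage must lie in [0, 1)")
    return (std_residual ** 2 / n_params) * leverage / (1.0 - leverage)

# Same residual, different leverage (illustrative values only)
high_leverage_point = cooks_distance(std_residual=2.0, leverage=0.5, n_params=2)
low_leverage_point = cooks_distance(std_residual=2.0, leverage=0.1, n_params=2)

# Apply the rule of thumb of 1 or more
needs_review = high_leverage_point >= 1.0
```

The same residual gives a much larger distance when the leverage is high, which is why both quantities are needed to judge influence.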
What is aliasing
Aliasing occurs when there is a dependency among the observed covariates, i.e. one covariate may be identical to some linear combination of other covariates
What is intrinsic aliasing
This occurs because of dependencies inherent in the definition of the covariates.
These intrinsic dependencies arise most commonly whenever categorical variables are included in the model.
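A small sketch with made-up data for a two-level factor shows the dependency. If the model has an intercept and a dummy column for every level, the dummy columns always add up to the intercept column:

```python
# Hypothetical data: one categorical factor with two levels
gender = ["male", "female", "female", "male", "female"]

intercept = [1] * len(gender)
is_male = [1 if g == "male" else 0 for g in gender]
is_female = [1 if g == "female" else 0 for g in gender]

# The two dummy columns sum to the intercept column in every row, so the
# three parameters cannot all be estimated separately (one is aliased).
aliased = all(m + f == c for m, f, c in zip(is_male, is_female, intercept))

# Usual remedy: drop one level (treated as the base level) from the design matrix
design_matrix = [[c, m] for c, m in zip(intercept, is_male)]
```

Here "female" acts as the base level and the remaining parameter measures the effect of "male" relative to it.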
What is extrinsic aliasing
It also arises from a dependency among the covariates, but here the dependency results from the nature of the data itself rather than from the inherent properties of the covariates.
What is an interaction term
It is used where the pattern in the response variable is better modelled by including extra parameters for each combination of two or more factors
An interaction exists when the effect of one factor varies depending on the value of another factor
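As a sketch, assuming a log link and two made-up binary factors (all names and values are hypothetical), an interaction adds an extra parameter that applies only to the combination of levels:

```python
import math

# Hypothetical log-link coefficients (illustrative values only)
base = math.log(100.0)
young = math.log(1.5)                  # main effect of being a young driver
sports_car = math.log(1.4)             # main effect of driving a sports car
young_and_sports_car = math.log(1.3)   # interaction parameter

def fitted(is_young, is_sports, with_interaction):
    """Fitted mean for one combination of factor levels (0/1 indicators)."""
    eta = base + young * is_young + sports_car * is_sports
    if with_interaction:
        eta += young_and_sports_car * is_young * is_sports
    return math.exp(eta)

main_effects_only = fitted(1, 1, with_interaction=False)
with_interaction = fitted(1, 1, with_interaction=True)

# Effect of the sports car for young drivers versus other drivers
effect_for_young = fitted(1, 1, True) / fitted(1, 0, True)
effect_for_others = fitted(0, 1, True) / fitted(0, 0, True)
```

With the interaction included, the sports car effect is no longer the same for both groups of drivers.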
What are the assumptions of the classical normal model for linear regression
- The mean is a linear combination of the explanatory variables
- The error terms are independent and come from a normal distribution
- The error terms have constant variance
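These assumptions can be summarised in one line. The notation is an assumption for illustration: x_{ij} is the value of explanatory variable j for observation i, and \beta_j is its coefficient:

```latex
Y_i = \beta_0 + \sum_{j=1}^{k} \beta_j x_{ij} + \varepsilon_i,
\qquad \varepsilon_i \overset{\text{iid}}{\sim} N(0, \sigma^2)
```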
What 2 properties do the members of the exponential family have
- The distribution is completely specified in terms of its mean and variance
- The variance of Yi is a function of its mean
What is special about the Tweedie distribution
- The Tweedie distribution is a special member of the exponential family that has a variance function proportional to \mu^{p}, with p being an additional parameter
- In the case of 1<p<2, the Tweedie distribution has a point mass at zero and corresponds to the compound distribution of a Poisson claim number process and a gamma claim size distribution
- The distribution can be Poisson-like (as p tends to 1) or gamma-like (as p tends to 2)
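A minimal sketch of the variance function, assuming a dispersion parameter `phi` as the constant of proportionality (the function name and the values are hypothetical):

```python
def tweedie_variance(mu, p, phi=1.0):
    """Variance of the response under a power variance function: phi * mu**p."""
    if mu <= 0:
        raise ValueError("mu must be positive")
    return phi * mu ** p

mu = 4.0
poisson_like = tweedie_variance(mu, p=1.0)   # variance proportional to the mean
gamma_like = tweedie_variance(mu, p=2.0)     # variance proportional to the mean squared
in_between = tweedie_variance(mu, p=1.5)     # compound Poisson-gamma range, 1 < p < 2
```

Moving p between 1 and 2 moves the variance smoothly between the Poisson-like and gamma-like cases.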
How is the number of degrees of freedom calculated
It is the number of observations less the number of parameters
What are nested models
Two models are nested if one model contains explanatory variables that are a subset of the explanatory variables in the other model
What is the equation of the Akaike Information Criterion (AIC) and what does it measure
AIC = -2 \times log-likelihood + 2 \times number of parameters
It looks at the trade-off of the likelihood of a model against the number of parameters: the lower the AIC, the better the model
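A short sketch of the comparison, using made-up log-likelihoods and parameter counts for two hypothetical models:

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: -2 * log-likelihood + 2 * number of parameters."""
    return -2.0 * log_likelihood + 2.0 * n_params

# Hypothetical fitted models (illustrative values only)
model_a = aic(log_likelihood=-520.0, n_params=4)
model_b = aic(log_likelihood=-517.5, n_params=7)

# The lower AIC is preferred
preferred = "A" if model_a < model_b else "B"
```

Model B has the higher likelihood, but not by enough to justify its three extra parameters, so model A has the lower AIC.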
What does the deviance residual measure
It measures the distance between the actual observation and fitted value
What is the difference between correlations and interactions
- Correlations occur when the distribution of exposure across the levels of one factor is related to the levels of one or more other factors
- GLMs automatically take account of correlations (unlike one-way tables)
- Interactions relate to the effect that factors have upon the risk
- To define the risk accurately, an interaction would be necessary where the effects of two (or more) factors depend on each other. GLMs can be specified to include interactions.
What are GLMs used for
- They are used to assess and quantify the relationship between a response variable and a set of possible explanatory variables
What does the leverage measure
It measures how much influence each observed value has on the fitted value for that observation
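As a sketch, assuming the simplest case of a normal linear model with an intercept and one explanatory variable (in a GLM the leverages come from a weighted version of the same calculation), the leverage depends only on how far each x value lies from the mean. The data are made up:

```python
def leverages(x):
    """Leverages for a straight-line fit: 1/n + (x_i - mean)^2 / S_xx."""
    n = len(x)
    x_bar = sum(x) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    return [1.0 / n + (xi - x_bar) ** 2 / s_xx for xi in x]

# Hypothetical explanatory variable with one isolated value
x = [1.0, 2.0, 3.0, 4.0, 10.0]
h = leverages(x)

# The leverages sum to the number of parameters (intercept and slope)
total = sum(h)
most_influential_index = h.index(max(h))
```

The isolated point at x = 10 has a leverage close to 1, so its fitted value is determined almost entirely by its own observation.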