Chapter 4 - Generalized Linear Models Flashcards
Ways in which a GLM generalizes a linear model (2; 4.1)
1) Distribution of the target variable - the target variable of a GLM is no longer confined to the class of normal random variables; it need only be a member of the exponential family of distributions. GLMs therefore provide a unifying approach to modeling binary, discrete, and continuous target variables with different mean-variance relationships
2) Relationship between the target mean and linear predictor - instead of equating the mean of the target variable directly with the linear combination of predictors, a GLM sets a function of the TARGET MEAN (important) to be linearly related to the predictors (target variable is left untransformed)
Equation of a GLM (4.1)
g(u) = B0 + B1X1 + B2X2 + … + BpXp, where u is the target mean, g is the link function, and the right-hand side is the linear predictor
Relationship between linear model and GLM (4.1)
A linear model is a special case of a GLM where the target variable is normally distributed and the link function is the identity function g(u) = u
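A minimal sketch verifying this equivalence, assuming Python with statsmodels and simulated data (all names below are illustrative):

# A Gaussian GLM with identity link reproduces the ordinary linear model fit
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(100, 2)))       # intercept + 2 predictors
y = X @ [1.0, 2.0, -0.5] + rng.normal(size=100)      # normally distributed target

ols = sm.OLS(y, X).fit()
glm = sm.GLM(y, X, family=sm.families.Gaussian()).fit()  # identity link is the default
print(ols.params, glm.params)                        # coefficient estimates agree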
Target distributions for variables that can be analyzed in the GLM framework (4; 4.1.1)
1) Positive, continuous, and right skewed data - gamma and inverse Gaussian distributions capture the skewness of these variables without the use of transformations (values cannot be zero)
2) Binary data - binomial (or Bernoulli) distribution is the most reasonable choice for the target distribution
3) Count data (non-negative whole number of occurrences, possibly with a right skew) - Poisson distribution is the most natural for the target distribution. Poisson requires the target mean and variance to be equal; if that’s not the case, use the quasi-Poisson distribution, which accounts for overdispersion (variance > mean) or underdispersion (variance < mean)
4) Aggregate data - see next card on Tweedie distribution
Tweedie distribution (4.1.1)
The Tweedie distribution has a mixture of discrete and continuous components. A random variable S follows this distribution if:
S = SUM(i = 1 to N) Xi, with S = 0 when N = 0
N is a Poisson random variable and the Xi’s are i.i.d. gamma random variables independent of N
1) As a Poisson sum of gamma random variables, Tweedie is an “in-between” distribution of Poisson and gamma
2) Has discrete probability mass at zero and a probability density function on the positive real line, making Tweedie particularly suitable for modeling aggregate claim losses
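A minimal simulation sketch of this compound Poisson-gamma structure (the Poisson and gamma parameter values below are illustrative assumptions):

# Simulate Tweedie aggregate losses as a Poisson sum of gamma severities
import numpy as np

rng = np.random.default_rng(1)
lam, shape, scale = 0.7, 2.0, 500.0      # assumed claim frequency and severity parameters

def simulate_aggregate_loss():
    n = rng.poisson(lam)                 # N: number of claims
    if n == 0:
        return 0.0                       # no claims -> discrete mass at zero
    return rng.gamma(shape, scale, size=n).sum()  # sum of N gamma severities

samples = np.array([simulate_aggregate_loss() for _ in range(10_000)])
print((samples == 0).mean())             # sizable probability of an exact zero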
Considerations for choosing the link function for a GLM (3; 4.1.1)
1) Appropriateness of predictions - The range of values of the target mean implied by the GLM should be consistent with the range of values of the target mean in that given situation (examples on next card)
2) Interpretability - a GLM is easy to interpret if it is easy to form statements that describe the association between the predictors of the model and the target mean in terms of model coefficients
3) Canonical link - using the canonical link associated with each target distribution has the advantage of simplifying the mathematics of the estimation procedure and making it more likely to converge
Canonical and common link functions for common GLM distributions (4.1.1)
1) Normal - identity link g(u) = u (canonical and most common)
2) Binomial - logit link g(u) = ln(u/(1-u)) (canonical and most common)
3) Poisson - log link g(u) = ln(u) (canonical and most common)
4) Gamma - inverse link g(u) = 1/u (canonical); log link is common in practice
5) Inverse Gaussian - g(u) = 1/u^2 (canonical); log link is common in practice
Log and logit link functions (4.1.1)
1) Log link - g(u) = ln(u)
a) Ensures the mean is always positive and unbounded; generates multiplicative model
b) Target observations can be zero, just not the target mean itself
1.1) Interpretation - target mean u = e^B0 * e^B1X1 * … * e^BpXp
a) When all other variables are held fixed, a unit increase in Xj is associated with a multiplicative increase in the target mean by a factor of e^Bj, or a percentage change in target mean of 100*(e^Bj - 1)%
2) Logit link - g(u) = ln (u/(1-u)) = ln(odds)
a) Used for binary target variables, where u is the event probability; ensures the target mean always lies between 0 and 1
2.1) Interpretation - when all other variables are held fixed, a unit increase in Xj is associated with a percentage change IN ODDS of 100*(e^Bj - 1)%
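A short sketch of reading a fitted coefficient under each link (the coefficient value is an illustrative assumption):

# Interpreting a coefficient Bj under the log and logit links
import numpy as np

b_j = 0.25                                # assumed fitted coefficient for Xj

# Log link: a unit increase in Xj multiplies the target mean by e^Bj
print(np.exp(b_j))                        # multiplicative factor, about 1.284
print(100 * (np.exp(b_j) - 1))            # about a 28.4% increase in the target mean

# Logit link: the same unit increase multiplies the ODDS by e^Bj
# (a percentage change in odds, not in the probability itself)
print(100 * (np.exp(b_j) - 1))            # about a 28.4% increase in the odds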
Weights and offsets in GLMs (3; 4.1.2)
Ways of taking exposure into account to improve the fit of a GLM (when grouped data is provided)
1) Weights - assign higher weights to grouped observations with larger exposure, which have smaller variance and therefore higher precision
a) Use when variance of each observation is inversely related to the size of exposure
b) Observations of the target variable should be averaged by exposure
c) Weights do not affect the mean of the target variable
2) Offsets - when the target mean is directly proportional to the exposure (typical for count data modeled with a log link), we can add an offset ln(Ei), the log of the exposure of the i’th observation
a) Observations are values aggregated over the exposure units
b) Exposure is in direct proportion to the mean of the target, but otherwise leaves its variance unaffected
c) Use of an exposure variable (E) as an offset - assumes that target mean varies in direct proportion to E
d) Use of E as an ordinary predictor - assumes that target mean varies with E as determined by the link function. The coefficient of E is an additional parameter that needs to be estimated.
3) Weights and offsets produce the same fitted model in the Poisson GLM with the log link (averaging the target with weights is equivalent to aggregating it with an offset); see the sketch below
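A sketch of the equivalence in Python with statsmodels (simulated data; offset and var_weights are existing GLM arguments):

# Offsets vs. weights in a Poisson GLM with log link
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
exposure = rng.uniform(1, 10, size=200)              # e.g., policy-years per group
X = sm.add_constant(rng.normal(size=(200, 1)))
counts = rng.poisson(exposure * np.exp(0.5 + 0.3 * X[:, 1]))  # mean proportional to exposure

# 1) Offset: model aggregate counts, add ln(exposure) to the linear predictor
fit_offset = sm.GLM(counts, X, family=sm.families.Poisson(),
                    offset=np.log(exposure)).fit()

# 2) Weights: model average counts (counts/exposure), weight each observation by exposure
fit_weights = sm.GLM(counts / exposure, X, family=sm.families.Poisson(),
                     var_weights=exposure).fit()

print(fit_offset.params, fit_weights.params)         # identical coefficient estimates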
Fitting a GLM (4.1.3)
1) Commonly use the maximum likelihood estimation (MLE) method - choosing the parameter estimates so as to maximize the likelihood of observing the given data, typically by running an optimization algorithm.
2) MLE produces estimates with desirable statistical properties, such as asymptotic unbiasedness, efficiency, and normality
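A bare-bones illustration of MLE for a Poisson GLM with log link using a generic optimizer (hand-rolled for exposition; real software uses specialized algorithms):

# Fit a Poisson GLM by directly maximizing the loglikelihood
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
X = np.column_stack([np.ones(500), rng.normal(size=500)])
y = rng.poisson(np.exp(1.0 + 0.5 * X[:, 1]))     # true coefficients (1.0, 0.5)

def neg_loglik(beta):
    mu = np.exp(X @ beta)                        # inverse of the log link
    return -(y * np.log(mu) - mu).sum()          # Poisson loglikelihood up to a constant

beta_hat = minimize(neg_loglik, x0=np.zeros(2)).x
print(beta_hat)                                  # close to the true values (1.0, 0.5)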
GLM goodness of fit measures (2; 4.1.3)
1) Deviance (global) - measures the extent to which the GLM departs from the most elaborate GLM, the saturated model
a) Saturated model has the same target distribution and link function as the fitted GLM, but with as many model parameters as the size of the training set. Perfectly fits each training observation and is very (perhaps too!) flexible
b) Deviance formula: D = 2*(l.SAT - l), where l.SAT and l denote the maximized loglikelihoods of the saturated and fitted models, respectively
c) Lower the deviance, the better the fit
d) Deviance reduces to RSS for linear models
e) Can only be used to compare GLMs having the same target distribution
2) Deviance residuals (local) - defined as the SIGNED (positive if observed value > fitted target mean, negative otherwise) square root of the contribution of the i’th observation to D.
a) Approximately normally distributed for most target distributions (binomial is a notable exception)
b) Have no systematic patterns when considered on their own and with respect to the predictors
c) Have approximately constant variance upon standardization (using the standard error implied by the GLM)
Note: Deviance parallels RSS in a linear model - selecting models by training deviance will always favor the most elaborate GLM, which can lead to overfitting
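A sketch of retrieving the deviance and deviance residuals from a fitted statsmodels GLM (simulated data):

# Deviance (global) and deviance residuals (local) of a fitted GLM
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
X = sm.add_constant(rng.normal(size=(300, 2)))
y = rng.poisson(np.exp(0.2 + 0.4 * X[:, 1] - 0.3 * X[:, 2]))

fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()
print(fit.deviance)                  # D = 2*(l.SAT - l); lower means better fit
res = fit.resid_deviance             # signed square roots of each observation's contribution
print(np.isclose((res ** 2).sum(), fit.deviance))  # contributions sum back to D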
GLM Likelihood Ratio Test (4.1.3)
1) Classical way to compare and select among different GLMs; generalization of the t-test and F-test for linear models
2) Consider two GLMs sharing the same target distribution and link function, where GLM.1 = GLM.0 + additional BjXj’s
Null Hypothesis: all additional Bj’s are zero (i.e. GLM.0 = GLM.1)
Test Statistic: LRT = 2(l.1 - l.0) = D.0 - D.1; where
l.0 and D.0 are the maximized loglikelihood and deviance for GLM.0 (and likewise l.1 and D.1 for GLM.1)
3) Because GLM.1 is a more flexible model, l.1 >= l.0 and D.0 >= D.1, so the LRT must be non-negative
4) If LRT is sufficiently large, then GLM.1 has a significantly better fit to the training data than GLM.0, so we reject the null hypothesis and select GLM.1
5) Drawbacks: can only be used to compare one pair of GLMs at a time; only applicable to cases where the simpler GLM is nested in the more elaborate GLM
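A sketch of the test on two nested Poisson GLMs (the reference distribution is chi-square with degrees of freedom equal to the number of additional coefficients):

# Likelihood ratio test: does the extra predictor earn its keep?
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

rng = np.random.default_rng(5)
X_full = sm.add_constant(rng.normal(size=(400, 2)))
y = rng.poisson(np.exp(0.1 + 0.5 * X_full[:, 1]))    # second predictor is pure noise

fit0 = sm.GLM(y, X_full[:, :2], family=sm.families.Poisson()).fit()  # GLM.0 (nested)
fit1 = sm.GLM(y, X_full, family=sm.families.Poisson()).fit()         # GLM.1 (full)

lrt = 2 * (fit1.llf - fit0.llf)          # equivalently D.0 - D.1; always non-negative
df = fit1.df_model - fit0.df_model       # number of additional Bj's being tested
print(lrt, chi2.sf(lrt, df))             # small p-value => reject H0, select GLM.1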
Regularization for GLMs (4.1.3)
Works for GLMs in the same way as for linear models, with the goodness of fit metric set to deviance (instead of RSS). All other fundamental properties of regularization apply.
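statsmodels supports this through GLM's fit_regularized method (elastic net); a minimal sketch, where the alpha value is an assumed penalty strength and L1_wt mixes lasso (1) with ridge (0):

# Regularized GLM: penalized deviance instead of penalized RSS
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
X = sm.add_constant(rng.normal(size=(300, 5)))
y = rng.poisson(np.exp(0.3 + 0.4 * X[:, 1]))         # remaining predictors are noise

fit = sm.GLM(y, X, family=sm.families.Poisson()).fit_regularized(alpha=0.05, L1_wt=1.0)
print(fit.params)    # lasso shrinks the noise coefficients toward (or exactly to) zero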
Confusion Matrices (4.1.4)
1) A binary classifier produces a prediction of the probability that the event of interest occurs. To translate the predicted probabilities to the predicted classes, we need a pre-specified cutoff
a) If predicted probability >= cutoff, event is predicted to occur
b) If predicted probability < cutoff, event is predicted not to occur
2) Confusion matrix - 2x2 matrix of possible outcomes:
a) True Positive (TP): predicted true, observed true
b) True Negative (TN): predicted false, observed false
c) False Positive (FP): predicted true, observed false
d) False Negative (FN): predicted false, observed true
3) Metrics
a) Classification error rate = (FN + FP)/n
b) Accuracy = (TP + TN)/n
c) Sensitivity (true positive rate) = TP/(TP + FN)
d) Specificity (true negative rate) = TN/(TN + FP)
e) Precision (proportion of positive predictions that truly belong to the positive class) = TP/(FP + TP)
4) As cutoff increases, specificity increases while sensitivity decreases
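A small sketch computing the metrics from assumed example counts:

# Confusion-matrix metrics from assumed counts
TP, TN, FP, FN = 40, 120, 30, 10
n = TP + TN + FP + FN                     # 200 observations

accuracy = (TP + TN) / n                  # 0.80
error_rate = (FN + FP) / n                # 0.20 = 1 - accuracy
sensitivity = TP / (TP + FN)              # 0.80, true positive rate
specificity = TN / (TN + FP)              # 0.80, true negative rate
precision = TP / (FP + TP)                # ~0.571, share of positive predictions that are correct
print(accuracy, error_rate, sensitivity, specificity, precision)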
Unbalanced Data (3; 4.1.4)
1) Definition - unbalanced data occurs when one class of a binary target variable is much more dominant than the other class in terms of proportion of observations (ex. policyholders without a claim much more numerous than policyholders with a claim)
2) Problem - a classifier implicitly places more weight on the majority class without paying enough attention to the minority class (especially problematic when the minority class is the positive class)
3) Solution - Undersampling and oversampling are methods to balance the training data (see next card)
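A minimal sketch of random oversampling with numpy (resampling indices of the minority class; applied to the training data only so the test set stays representative):

# Random oversampling of the minority class
import numpy as np

rng = np.random.default_rng(7)
y_train = np.array([0] * 950 + [1] * 50)     # unbalanced: 95% negatives, 5% positives

minority = np.where(y_train == 1)[0]
majority = np.where(y_train == 0)[0]
extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
balanced_idx = np.concatenate([majority, minority, extra])
print(np.bincount(y_train[balanced_idx]))    # classes now equal: [950 950]
# Undersampling would instead keep only a random subset of the majority indices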