Chapter 4: GLMs Flashcards
what is a GLM?
Comparatively, a GLM is more flexible than a linear model.
GLMs provide flexibility in two aspects:
- Distribution of the target variable:
- the target variable is not confined to the normal distribution; it only needs to belong to the linear exponential family (which contains both continuous and discrete distributions)
- GLMs thus provide a unifying approach to modelling binary, discrete, and continuous target variables with different mean-variance relationships
- Relationship between the target mean and the linear predictor:
- instead of equating the target mean directly with the linear combination of predictors, a GLM sets a function of the target mean (the link function) to be linearly related to the predictors (see the sketch below)
- the link function can be any monotonic function (monotonic because it needs to be invertible)
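In symbols (standard GLM notation, not specific to these cards): for target mean mu = E[Y] and link function g,
g(mu) = beta0 + beta1*x1 + ... + betap*xp
and predictions are recovered by inverting the link: mu = g^(-1)(linear predictor).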
can we use all feature generation (binarization, polynomial terms, interaction terms) and all feature selection techniques for GLM models?
yes
what does it mean to say "transformations are applied internally vs. externally"?
GLMs: internally transforming the data
- the target variable is not transformed and the transformation plays its role only within the GLM itself
Linear models: externally transforming the target variable
what target distribution would we choose for a positive, continuous, right-skewed target variable?
gamma and inverse gaussian capture the skewness of the target variable directly, without the use of transformations
inv. gaussian is more highly skewed than gamma
gamma is the #1 choice here
what target distribution would we choose for a binary target variable?
binomial
the mean of the target variable is the probability that the event of interest occurs
what target distribution would we choose for a count variable?
count variable = the number of times a certain event of interest happens over a reference time period
these variables only have non-negative integer values.
Poisson!
what target distribution would we choose for aggregate data?
Tweedie. it is a (compound) Poisson-gamma mixture.
discrete probability mass at zero and pdf on the positive real line
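In the standard Tweedie parameterization (a known fact about the family, not from these cards): the variance is Var(Y) = phi * mu^p, and power parameters 1 < p < 2 give exactly this compound Poisson-gamma case with a point mass at zero.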
when is it good to use a log link?
for poisson, gamma and inverse gaussian
the target mean is positive and unbounded from above
g(mu) = ln(mu)
so the inverse is mu = exp(linear predictor), which is always positive and unbounded from above
when is it a good idea to use a logit link?
binary variables
logit link = ln(odds)
the logit link ensures that the target mean is between 0 and 1 (needs to be for a binary variable)
while the linear predictor itself can take any value from -inf to +inf
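Unwinding the link (standard logistic algebra): if ln(mu / (1 - mu)) = eta, then
mu = exp(eta) / (1 + exp(eta)) = 1 / (1 + exp(-eta)),
which lies strictly between 0 and 1 for any real value of eta.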
what is a logistic regression model?
a GLM with a binary target variable and a logit link function
two factors need to be considered when choosing a link function:
- whether the predictions provided by the link align with the characteristics of the target variable
- whether the resulting GLM is easy to interpret (ex. logit is easier to interpret than probit)
T/F: the link function is to transform the target variable of a GLM so that the resulting distribution more closely resembles a normal distribution
false.
the link function is applied to the mean of the target variable; the target variable itself is left untransformed.
T/F: the main reason for using the log link in a GLM is to reduce the skewness of a non-negative, right-skewed target variable.
false. The log link is chosen because it ensures appropriate predictions and eases model interpretation. The skewness can be accommodated by an appropriate target distribution.
T/F: if some of the observations of the target variable are 0, then the log link cannot be used because ln(0) is not defined.
False.
the log link is applied to the target mean, not to the individual observations, so zero values in the data pose no problem (the mean itself is positive)
what two link functions are easiest to interpret ?
logit and log link
how to interpret GLM coefficients with log link for numeric predictors?
Multiplicative changes:
when all other variables are held fixed, a unit increase in X is associated with a multiplicative increase in the target mean by a factor of exp(beta)
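As a one-line check (standard algebra for a log link): if ln(mu) = beta0 + beta1*x, then
mu(x + 1) / mu(x) = exp(beta1),
so each unit increase in x multiplies the target mean by a factor of exp(beta1).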
how to interpret GLM coefficients with log link for categorical predictors?
if X is a dummy variable, then:
- at the baseline level, X = 0, and at the non-baseline level, X = 1.
so, comparing the means, the target mean when the categorical predictor lies in the non-baseline level is exp(beta) times the target mean when it lies in the baseline level, holding all other predictors fixed.
how to interpret GLM coefficients with logit link?
the logit link is almost always used with binary data.
ln(odds) = f(x) or odds = exp( f(x) )
the interpretations are just phrased in terms of multiplicative changes in the odds of the event of interest.
a unit increase in a numeric predictor with coefficient beta is associated with a multiplicative change of exp(beta) in the odds.
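The same algebra on the odds scale (standard logistic regression identity): if ln(odds) = beta0 + beta1*x, then
odds(x + 1) / odds(x) = exp(beta1),
i.e. a unit increase in x multiplies the odds of the event by exp(beta1).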
what are weights and offsets?
modeling tools that are commonly used with GLMs.
they are designed to incorporate a measure of exposure into a GLM to improve the fitting
what is the idea behind using weights in a GLM?
to take advantage of the fact that different observations in the data may have different exposures and thus different degrees of precision, we can attach a higher weight to the observations with a larger exposure.
So that the more credible observations carry more weight in the estimation of the model coefficients
what is the idea behind using offsets in a GLM?
usually used with (not limited to) count data
we make the assumption that the target mean is directly proportional to the exposure.
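In formula form (the standard offset setup for a log-link count model): with exposure E_i for observation i,
ln(mu_i) = ln(E_i) + beta0 + beta1*x_i1 + ...
so mu_i = E_i * exp(linear predictor) and the target mean is directly proportional to the exposure. the offset ln(E_i) acts as a predictor whose coefficient is fixed at 1.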
how do we determine if an exposure variable should be used as an offset or a weight?
to use weights properly:
- the observations of the target variable should be averaged by exposure.
- due to the averaging, the variance of each observation is inversely related to the size of the exposure.
- the weights do not affect the mean of the target variable
offsets:
- observations are values aggregated over the exposure units.
- the exposure, when serving as an offset, is in direct proportion to the mean of the target variable
- the variance of the target variable is unaffected
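One way to remember the contrast (standard GLM conventions): a weight w_i scales the variance, Var(Y_i) = phi * V(mu_i) / w_i, and leaves the mean alone; an offset shifts the mean, g(mu_i) = offset_i + linear predictor_i, and leaves the variance function alone.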
what is the technique used to estimate coefficients in a GLM?
maximum likelihood estimation (MLE), instead of ordinary least squares (OLS) as in linear models
what is the goodness of fit measure used in GLMs?
deviance.
why cant we use r^2 in glms to measure goodness of fit?
because r^2 operates on the assumption that the underlying distribution behind the target variable is normal.
what does deviance measure?
the extent to which the GLM departs from the most elaborate GLM (the saturated model)
the saturated model has as many model parameters as the number of training observations; it perfectly fits every training observation and is the most flexible GLM.
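In symbols (the standard definition, stated up to the dispersion parameter): the deviance of a fitted GLM is
D = 2 * [ loglik(saturated model) - loglik(fitted model) ],
so a perfect fit has D = 0 and poorer fits have larger D.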
do we want a low or high deviance?
we want a lower deviance.
the lower the deviance, the closer the GLM is to the model with a perfect fit, and the better its goodness of fit on the training set.
what is a drawback of using deviance as a goodness of fit measure for a GLM?
it can only be used to compare GLMs having the same target distribution (so that they share the same maximized log-likelihood of the saturated model).
why are raw residuals not useful in a GLM?
because they are no longer normally distributed, nor do they have constant variance (their variance varies with the target mean, which differs across observations)
what type of residuals do we use in GLMs?
deviance residuals
deviance residuals satisfy the following properties which are parallel to those of raw residuals in a linear model (3)
- they are approximately normally distributed (not for binomial)
- they have no systematic patterns when considered on their own and with respect to the predictors
- they have approx. constant variance upon standardization
why is it important for deviance residuals to be approximately normal?
because it provides the basis for comparing the distribution of the deviance residuals with the normal dist (qq plots)
for GLMs, a regularized model results from minimizing the penalized objective function given by:
deviance (goodness of fit) + regularization penalty (complexity)
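Written out (a glmnet-style elastic net penalty; lambda and alpha are the usual tuning parameters, named here as an assumption):
minimize over beta: deviance(beta) + lambda * [ (1 - alpha)/2 * sum(beta_j^2) + alpha * sum(|beta_j|) ]
where lambda >= 0 controls the strength of the penalty and alpha in [0, 1] mixes ridge (alpha = 0) and lasso (alpha = 1).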
what is a performance metric used for numeric target variables in GLMs?
Test RMSE
What is used to measure the performance of a binary classifier (GLM)?
we could use the classification error rate on its own, but more commonly used is the confusion matrix
confusion matrices: how do we translate the predicted probabilities into predicted classes?
using a pre-specified cutoff.
- if the predicted probability of the event for an observation is higher than the cutoff, then the event is predicted to occur
- if the predicted probability < cutoff, then the event is not predicted to occur
what kinds of performance metrics can be calculated from confusion matrices? 4
- classification error rate
- accuracy
- sensitivity
- specificity
confusion matrices: explain what the classification error rate is and how to calculate it.
= (FP + FN) / n
this is the proportion of misclassifications
confusion matrices: explain what the accuracy measure is and how to calculate it.
= (TN + TP) / n
the proportion of correctly classified observations
confusion matrices: explain what the sensitivity measure is and how to calculate it.
= TP / ( TP + FN )
relative frequency of correctly predicting the event occurring when the event does happen
how sensitive a classifier is at identifying positive cases
confusion matrices: explain what the specificity measure is and how to calculate it.
= TN / ( TN + FP )
opposite of sensitivity
relative frequency of correctly predicting an event not to occur when it actually did not
larger specificity - better the classifier is at confirming negative cases
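A quick numeric check (hypothetical counts, purely for illustration): suppose TP = 40, FN = 10, TN = 30, FP = 20, so n = 100. then:
- classification error rate = (20 + 10) / 100 = 0.30
- accuracy = (30 + 40) / 100 = 0.70
- sensitivity = 40 / (40 + 10) = 0.80
- specificity = 30 / (30 + 20) = 0.60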
T/F: accuracy is a weighted average of sensitivity and specificity
(confusion matrices)
true
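To see why (direct algebra from the definitions): with n+ = TP + FN actual positives and n- = TN + FP actual negatives,
accuracy = (TP + TN) / n = (n+ / n) * sensitivity + (n- / n) * specificity,
so the weights are the proportions of actual positives and negatives. in the numeric example above: (50/100)(0.80) + (50/100)(0.60) = 0.70.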
confusion matrices: does changing the cutoff involve a trade-off?
yes, a trade off between specificity and sensitivity.
we want them both to be as close to 1 as possible.
confusion matrices: what happens if the cutoff is set to 0? what are the sensitivity and specificity values?
all predicted probabilities will exceed the cutoff
every observation is predicted to be positive
sensitivity = 1, specificity = 0 (there are no true negatives, since nothing is predicted negative)
draw what the confusion matrix will look like
confusion matrices: what happens if the cutoff increases from 0? what are the sensitivity and specificity values?
more and more observations will be classified as negative and the entries in the matrix will move to the first row
sensitivity decreases
specificity increases
draw the arrows in the confusion matrix
confusion matrices: what happens if the cutoff is set to 1? what are the sensitivity and specificity values?
all predicted probabilities will be less than the cutoff, they will all be predicted to be negative.
sensitivity = 0 specificity = 1
draw the matrix
confusion matrices: how do we choose a cutoff value? explain
using a ROC curve.
it is a graphical tool plotting the sensitivity against the specificity of a given classifier as the cutoff ranges from 0 to 1.
ROC curve: how can the predictive performance of a classifier be summarized?
by computing the AUC. the higher the better
ROC curve: what happens with AUC = 1?
the highest possible value of AUC is 1.
this classifier has perfect discriminatory power. specificity and sens both equal 1.
ROC curve: what happens with AUC = 0.5?
this is a useful baseline comparison. this is the naive classifier that classifies the observations purely randomly without using the information contained in the predictors.
what is the problem with unbalanced data in the context of a classifier?
the classifier will place more weight on the majority class and try to match the training observations in that class, without paying enough attention to the minority class
what are two solutions to imbalanced data?
- undersampling (keeps the minority class and draws only a subset of the majority class to reduce the imbalance)
- oversampling (samples the minority class with replacement to reduce the imbalance)
how to calculate the RMSE of a model?
use the RMSE() function from the caret package:
predictions <- predict(model, newdata = data.test, type = "response")
RMSE(predictions, data.test$target)
how to look at model diagnostics, how do we output the residuals vs. fitted values plot ,etc. ?
plot( model )
how to fit a GLM using the glm() function?
glm(target ~ . + interaction, family = distribution(link = "link"), data = dataset)
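A minimal concrete sketch (hypothetical variable names, just one possibility — a gamma severity model with a log link):
model <- glm(claim_size ~ age + region + age:region,
             family = Gamma(link = "log"),
             data = data.train)
summary(model)
note that the gamma family is spelled Gamma() with a capital G in R.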
when you are asked to interpret the results of a GLM, what 3-part structure could you use?
- interpret the precise values of the estimated coefficients (ex. every unit increase in a continuous predictor is associated with a multiplicative change of exp(beta) in the expected value of the target variable, holding everything else constant)
- comment on whether the sign of the estimated coefficients makes sense. (common knowledge)
- relate the findings to the business problem, how can these results help the clients?
when we use a variable in a GLM as an exposure variable, can we keep it in the original GLM?
no, we have to take it out of the model
glm(target ~ . - exposure.var, family = dist(link = "link"), data = dataset)
how to include an offset in a GLM model in r?
glm(target ~ . - exposure.var, data = data.train, offset = log(exposure.var), family = dist(link = "link"))
the offset has to be entered on the scale of the link function (here the log, to match the log link).
how do you construct a confusion matrix in r? what package?
library(caret)
1. pre-specify the cutoff value (the mean of the target variable is a common starting point)
2. generate predicted probabilities using the predict() function
3. convert the predicted probabilities into predicted classes, assigning 1 if above the cutoff and 0 otherwise:
class <- ifelse(predictions > cutoff, 1, 0)
4. create the confusion matrix:
confusionMatrix(factor(class), factor(data.test$target), positive = "1")
why do the two first arguments of confusionMatrix need to be factors?
because confusionMatrix() cross-tabulates classes, not numbers: it expects the predicted classes and the actual classes as factors with matching levels, and errors out if given numeric vectors.
what package has to be installed for ROC and AUC? What function is used to create the ROC curve?
pROC
roc(data.train$target, predicted probabilities from the training set), or the same on the test set
how to calculate the AUC?
first make the ROC curve
then,
auc(roc curve)
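Putting both steps together (hypothetical names; assumes a fitted binary-target GLM called model):
library(pROC)
probs <- predict(model, newdata = data.test, type = "response")
roc_obj <- roc(data.test$target, probs)
plot(roc_obj)  # draws the ROC curve
auc(roc_obj)   # area under the curve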
how do we add binarized variables to the original dataset? What needs to be done following this?
using the function cbind()
delete the old variables from the dataset by setting them to NULL
do the test/train split again
refit the model
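A sketch of that workflow (hypothetical names; caret's dummyVars() is one common way to binarize, used here as an assumption):
library(caret)
binarizer <- dummyVars(~ region, data = dat, fullRank = TRUE)
binarized <- predict(binarizer, newdata = dat)
dat <- cbind(dat, binarized)  # add the binarized columns
dat$region <- NULL            # delete the original variable
# then redo the train/test split and refit the model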
how to model weight in a GLM?
glm(target ~ ., family = dist(link = "link"), data = data.train, weights = exposure.var)