Chapter 4: GLMs Flashcards

1
Q

what is a GLM?

A

Comparatively, a GLM is more flexible than a linear model.

GLMs provide flexibility in two aspects:

  1. Distribution of the target variable:
    - the target variable is not confined to the normal distribution; it only needs to belong to the linear exponential family (which contains both continuous and discrete distributions)
    - GLMs provide a unifying approach to modelling binary, discrete and continuous target variables with different mean-variance relationships
  2. Relationship between the target mean and the linear predictors:
    - instead of equating the target mean directly with the linear combination of predictors, a GLM sets a function of the target mean to be linearly related to the predictors.
    - the link function can be any monotonic function (monotonic because it needs to be invertible)
2
Q

can we use all feature generation (binarization, polynomial terms, interaction terms) and all feature selection techniques for GLM models?

A

yes

3
Q

what does it mean to say “transformations are applied internally vs. externally”?

A

GLMs: internally transforming the data
- the target variable is not transformed and the transformation plays its role only within the GLM itself

Linear models: externally transforming the target variable

4
Q

what target distribution would we choose for a positive, continuous, right-skewed target variable?

A

gamma and inverse gaussian capture the skewness of the target variable directly without the use of transformations

inv. gaussian is more highly skewed than gamma

gamma is #1 choice here

5
Q

what target distribution would we choose for a binary target variable?

A

binomial

the mean of the target variable is the probability that the event of interest occurs

6
Q

what target distribution would we choose for a count variable?

A

count variable = represents the number of times a certain event of interest happens over a reference time period

these variables only have non-negative integer values.

Poisson!

7
Q

what target distribution would we choose for aggregate data?

A

tweedie. it is a poisson-gamma mixture.

discrete probability mass at zero and pdf on the positive real line

8
Q

when is it good to use a log link?

A

for poisson, gamma and inverse gaussian

the target mean is positive and unbounded from above

g(mu) = ln(mu)
so the inverse is mu = exp(linear predictor), which guarantees a target mean that is positive and unbounded from above
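A quick base-R sketch (not from the source card) of why the log link guarantees a valid mean:

```r
# inverse of the log link: mu = exp(eta)
eta <- c(-10, 0, 2.5)   # the linear predictor can be any real number
mu  <- exp(eta)         # the implied target means
all(mu > 0)             # TRUE: always positive, unbounded from above
```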

9
Q

when is it a good idea to use a logit link?

A

binary variables
logit link = ln(odds)

the logit link ensures that the target mean stays between 0 and 1 (as it must for a binary variable)
while the linear predictor can take any value from -inf to +inf
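A base-R sketch (illustrative values) showing the inverse logit squeezing any linear-predictor value into (0, 1):

```r
# inverse of the logit link: mu = exp(eta) / (1 + exp(eta)), i.e. plogis(eta)
eta <- c(-5, 0, 5)     # the linear predictor can be any real number
mu  <- plogis(eta)     # implied probabilities: roughly 0.0067, 0.5, 0.9933
all(mu > 0 & mu < 1)   # TRUE
```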

10
Q

what is a logistic regression model?

A

a GLM with a binary target variable and a logit link function

11
Q

two factors need to be considered when choosing a link function:

A
  1. whether the predictions provided by the link align with the characteristics of the target variable
  2. whether the resulting GLM is easy to interpret (ex. logit is easier to interpret than probit)
12
Q

T/F: the link function is to transform the target variable of a GLM so that the resulting distribution more closely resembles a normal distribution

A

false.

the link function is applied to the mean of the target variable; the target variable itself is left untransformed.

13
Q

T/F: the main reason for using the log link in a GLM is to reduce the skewness of a non-negative, right skewed target variable.

A

false. The log link is chosen because it ensures appropriate predictions and eases model interpretation. The skewness can be accommodated by an appropriate target distribution

14
Q

T/F: if some of the observations of the target variable are 0, then the log link cannot be used because ln(0) is not defined.

A

False.

the log link is not applied directly to the target variable itself

15
Q

what two link functions are easiest to interpret ?

A

logit and log link

16
Q

how to interpret GLM coefficients with log link for numeric predictors?

A

Multiplicative changes:
when all other variables are held fixed, a unit increase in X is associated with a multiplicative increase in the target mean by a factor of exp(beta)
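A worked example (coefficient value invented for illustration): with beta = 0.05 on a numeric predictor under a log link,

```r
beta <- 0.05
exp(beta)   # ≈ 1.0513: each unit increase in X multiplies the target mean by about 1.0513 (a 5.1% increase)
```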

17
Q

how to interpret GLM coefficients with log link for categorical predictors?

A

if X is a dummy variable, then:
- at the baseline level, X = 0, and at the non-baseline level, X = 1.

SO
comparing the means, we see that the target mean when the categorical predictor lies in the non-baseline level is exp(beta) times that when the categorical predictor is in the baseline level, holding all other predictors fixed.

18
Q

how to interpret GLM coefficients with logit link?

A

the logit link is almost always used with binary data.
ln(odds) = f(x) or odds = exp( f(x) )

the interpretations are just phrased in terms of multiplicative changes in the odds of the event of interest.

a unit increase in a numeric predictor with coefficient beta is associated with a multiplicative change of exp(beta) in the odds.
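A worked example (coefficient invented): with beta = -0.7 under a logit link,

```r
beta <- -0.7
exp(beta)   # ≈ 0.497: each unit increase in X roughly halves the odds of the event
```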

19
Q

what are weights and offsets?

A

modeling tools that are commonly used with GLMs.

they are designed to incorporate a measure of exposure into a GLM to improve the fitting

20
Q

what is the idea behind using weights in a GLM?

A

to take advantage of the fact that different observations in the data may have different exposures and thus different degrees of precision, we can attach a higher weight to the observations with a larger exposure.

So that the more credible observations carry more weight in the estimation of the model coefficients

21
Q

what is the idea behind using offsets in a GLM?

A

usually used with (not limited to) count data

we make the assumption that the target mean is directly proportional to the exposure.

22
Q

how do we determine if an exposure variable should be used as an offset or a weight?

A

to use weights properly:

  • the observations of the target variable should be averaged by exposure.
  • due to the averaging, the variance of each observation is inversely related to the size of the exposure.
  • the weights do not affect the mean of the target variable

offsets:
- observations are values aggregated over the exposure units.
- the exposure, when serving as an offset, is in direct proportion to the mean of the target variable
- the variance of the target variable is unaffected

23
Q

what is the technique used to estimate coefficients in a GLM?

A

MLE instead of OLS (linear models)

24
Q

what is the goodness of fit measure used in GLMs?

A

deviance.

25
Q

why can't we use r^2 in glms to measure goodness of fit?

A

because r^2 operates on the assumption that the underlying distribution behind the target variable is normal.

26
Q

what does deviance measure?

A

the extent to which the GLM departs from the most elaborate GLM (the saturated model)

the saturated model has as many parameters as there are training observations; it perfectly fits every training observation and is a very flexible GLM.

27
Q

do we want a low or high deviance?

A

we want a lower deviance.

the lower the deviance, the closer the GLM is to the model with a perfect fit, and the better its goodness of fit on the training set.

28
Q

what is a drawback of using deviance as a goodness of fit measure for a GLM?

A

it can only be used to compare GLMs having the same target distribution (so that they share the same maximized log-likelihood of the saturated model).

29
Q

why are raw residuals not useful in a GLM?

A

because they are no longer normally distributed, nor do they have constant variance (their variance varies with the target mean, which differs across observations)

30
Q

what type of residuals do we use in GLMs?

A

deviance residuals

31
Q

deviance residuals satisfy the following properties which are parallel to those of raw residuals in a linear model (3)

A
  1. they are approximately normally distributed (not for binomial)
  2. they have no systematic patterns when considered on their own and with respect to the predictors
  3. they have approximately constant variance upon standardization
32
Q

why is it important for deviance residuals to be approximately normal?

A

because it provides the basis for comparing the distribution of the deviance residuals with the normal dist (qq plots)

33
Q

for GLMs, a regularized model results from minimizing the penalized objective function given by:

A

deviance (goodness of fit) + regularization penalty (complexity)

34
Q

what is a performance metric used for numeric target variables in GLMs?

A

Test RMSE

35
Q

What is used to measure the performance of a binary classifier (GLM)?

A

we could use the classification error rate, but more commonly used is the confusion matrix

36
Q

confusion matrices: how do we translate the predicted probabilities into predicted classes?

A

using a pre-specified cutoff.

  • if the predicted probability of the event for an observation is higher than the cutoff, then the event is predicted to occur
  • if the predicted probability is less than the cutoff, then the event is not predicted to occur
37
Q

what kinds of performance metrics can be calculated from confusion matrices? 4

A
  1. classification error rate
  2. accuracy
  3. sensitivity
  4. specificity
38
Q

confusion matrices: explain what the classification error rate is and how to calculate it.

A

= (FP + FN) / n

this is the proportion of misclassifications

39
Q

confusion matrices: explain what the accuracy measure is and how to calculate it.

A

= (TN + TP) / n

the proportion of correctly classified observations

40
Q

confusion matrices: explain what the sensitivity measure is and how to calculate it.

A

= TP / ( TP + FN )
relative frequency of correctly predicting the event occurring when the event does happen

how sensitive a classifier is at identifying positive cases

41
Q

confusion matrices: explain what the specificity measure is and how to calculate it.

A

= TN / ( TN + FP )

opposite of sensitivity
relative frequency of correctly predicting an event not to occur when it actually did not

the larger the specificity, the better the classifier is at confirming negative cases
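The four confusion-matrix metrics above can be computed together; a base-R sketch with invented counts:

```r
# hypothetical confusion-matrix counts (invented for illustration)
TN <- 50; FP <- 10; FN <- 5; TP <- 35
n  <- TN + FP + FN + TP

error_rate  <- (FP + FN) / n    # 0.15: proportion misclassified
accuracy    <- (TN + TP) / n    # 0.85: proportion correctly classified
sensitivity <- TP / (TP + FN)   # 0.875: true positive rate
specificity <- TN / (TN + FP)   # 0.8333...: true negative rate

# check: accuracy is the class-size-weighted average of sensitivity and specificity
((TP + FN) * sensitivity + (TN + FP) * specificity) / n   # 0.85
```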

42
Q

T/F: accuracy is a weighted average of sensitivity and specificity
(confusion matrices)

A

true

43
Q

confusion matrices: does changing the cutoff involve a trade-off?

A

yes, a trade off between specificity and sensitivity.

we want them both to be as close to 1 as possible.

44
Q

confusion matrices: what happens if the cutoff is set to 0? what are the sensitivity and specificity values?

A

all predicted probabilities will exceed the cutoff
so every observation is predicted to be positive

sensitivity = 1
specificity = 0 (nothing is predicted negative, so there are no true negatives)

draw what the confusion matrix will look like

45
Q

confusion matrices: what happens if the cutoff increases from 0? what are the sensitivity and specificity values?

A

more and more observations will be classified as negative and the entries in the matrix will move to the first row

sensitivity decreases
specificity increases

draw the arrows in the confusion matrix

46
Q

confusion matrices: what happens if the cutoff is set to 1? what are the sensitivity and specificity values?

A

all predicted probabilities will be less than the cutoff, they will all be predicted to be negative.

sensitivity = 0 
specificity = 1

draw the matrix

47
Q

confusion matrices: how do we choose a cutoff value? explain

A

using a ROC curve.

it is a graphical tool plotting the sensitivity against the specificity of a given classifier for every cutoff ranging from 0 to 1.

48
Q

ROC curve: how can the predictive performance of a classifier be summarized?

A

by computing the AUC. the higher the better

49
Q

ROC curve: what happens with AUC = 1?

A

the highest possible value of AUC is 1.

this classifier has perfect discriminatory power; specificity and sensitivity both equal 1.

50
Q

ROC curve: what happens with AUC = 0.5?

A

this is a useful baseline comparison. this is the naive classifier that classifies the observations purely randomly without using the information contained in the predictors.

51
Q

what is the problem with unbalanced data in the context of a classifier?

A

the classifier will place more weight on the majority class and tries to match the training observations in that class, without paying enough attention to the minority class

52
Q

what are two solutions to imbalanced data?

A
  1. undersampling (drops observations from the majority class to reduce the imbalance)
  2. oversampling (samples, with replacement, from the minority class to reduce the imbalance)

53
Q

how to calculate the RMSE of a model?

A

RMSE function

RMSE(data.test$target, predictions)

the predictions are made using the predict() function:
predict(model, newdata = data.test, type = "response")
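RMSE() just implements the usual root-mean-square formula; a base-R sketch with invented numbers:

```r
actual <- c(10, 20, 30)
pred   <- c(12, 18, 33)
sqrt(mean((actual - pred)^2))   # sqrt((4 + 4 + 9) / 3) ≈ 2.38
```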

54
Q

how to look at model diagnostics, how do we output the residuals vs. fitted values plot ,etc. ?

A

plot( model )

55
Q

how to fit a GLM using the glm() function?

A

glm(target ~ . + interaction, family = distribution(link = "link"), data = dataset)
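A runnable sketch of this pattern on simulated data (the variable names and the Poisson choice are invented for illustration):

```r
set.seed(42)
dat <- data.frame(age    = rnorm(200, mean = 40, sd = 10),
                  region = factor(sample(c("A", "B"), 200, replace = TRUE)))
dat$claims <- rpois(200, lambda = exp(0.01 * dat$age))

# Poisson GLM with a log link
fit <- glm(claims ~ age + region, family = poisson(link = "log"), data = dat)
coef(fit)   # coefficients are on the log scale; exponentiate to interpret
```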

56
Q

when you are asked to interpret the results of a GLM, what 3-part structure could you use?

A
  1. interpret the precise values of the estimated coefficients (ex. every unit increase in a continuous predictor is associated with a multiplicative change of exp(beta) in the expected value of the target variable, holding everything else constant)
  2. comment on whether the sign of the estimated coefficients makes sense. (common knowledge)
  3. relate the findings to the business problem, how can these results help the clients?
57
Q

when we use a variable in a GLM as an exposure variable, can we keep it in the original GLM?

A

no, we have to take it out of the model

glm(target ~ . - exposure.var, family = dist(link = "link"), data = dataset)

58
Q

how to include an offset in a GLM model in r?

A

glm(target ~ . - exposure.var, data = data.train, offset = log(exposure.var), family = dist(link = "link"))

the offset has to enter the model on the scale of the linear predictor, i.e. transformed by the link function (hence log() for a log link).
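A runnable sketch (data simulated for illustration) of a Poisson GLM with the exposure entering as a log offset:

```r
set.seed(1)
exposure <- runif(300, 0.5, 2)    # e.g. policy years (invented)
x        <- rnorm(300)
counts   <- rpois(300, lambda = exposure * exp(0.3 * x))
dat      <- data.frame(counts, x, exposure)

# log link, so the offset is log(exposure); its coefficient is implicitly fixed at 1
fit <- glm(counts ~ x, data = dat, offset = log(exposure),
           family = poisson(link = "log"))
coef(fit)   # the estimate for x should land near the true value 0.3
```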

59
Q

how do you construct a confusion matrix in r? what package?

A

library(caret)

  1. pre-specify the cutoff value (usually the mean of the target variable)
  2. generate predicted probabilities using the predict() function
  3. convert the predicted probabilities to classes: assign 1 if the probability exceeds the cutoff, 0 otherwise

ex. class <- ifelse(predictions > cutoff, 1, 0)
  4. create the confusion matrix

confusionMatrix(factor(class), factor(data.test$target), positive = "1")
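If caret is unavailable, the same cross-tabulation can be sketched in base R (the probabilities, labels, and cutoff are invented):

```r
probs  <- c(0.9, 0.2, 0.7, 0.4, 0.8, 0.1)   # predicted probabilities
truth  <- c(1,   0,   1,   1,   0,   0)     # actual classes
cutoff <- 0.5

class <- ifelse(probs > cutoff, 1, 0)
tab <- table(Predicted = class, Actual = truth)
tab   # rows are predicted classes, columns are actual classes
```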

60
Q

why do the two first arguments of confusionMatrix need to be factors?

A

confusionMatrix() cross-tabulates predicted classes against actual classes, so both arguments must be factors (with the same levels); numeric vectors would cause an error.

61
Q

what package has to be installed for ROC and AUC? What function is used to create the ROC curve?

A

pROC

roc(data.train$target, predicted values from the training set) or could do this on the test set

62
Q

how to calculate the AUC?

A

first make the ROC curve

then,

auc(roc curve)

63
Q

how do we add binarized variables to the original dataset? What needs to be done following this?

A

using the function cbind()

delete the old variables from the dataset using NULL

do the test/train split again

refit the model
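A base-R sketch of the binarize-and-rebind steps (the data and names are invented; model.matrix() is one way to create the dummies):

```r
dat <- data.frame(region = factor(c("A", "B", "C", "A")), y = c(1, 0, 1, 0))

dummies <- model.matrix(~ region, data = dat)[, -1]   # drop the intercept column
dat <- cbind(dat, dummies)   # add the binarized columns
dat$region <- NULL           # delete the old variable using NULL
names(dat)                   # "y" "regionB" "regionC"
```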

64
Q

how to model weight in a GLM?

A

glm(target ~ . , family = dist(link = "link"), data = data.train, weights = exposure.var)
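A runnable sketch (data simulated for illustration): the target is averaged by exposure, and the exposure is passed as the weight:

```r
set.seed(7)
exposure <- sample(1:10, 100, replace = TRUE)   # e.g. number of policies per cell (invented)
x        <- rnorm(100)
# average count per unit of exposure; its variance shrinks as exposure grows
avg_target <- rpois(100, lambda = exposure * exp(0.2 * x)) / exposure
dat <- data.frame(avg_target, x, exposure)

fit <- glm(avg_target ~ x, family = poisson(link = "log"),
           data = dat, weights = exposure)
coef(fit)   # note: R warns about non-integer responses but still fits the model
```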