GLM Flashcards
Definition
Final:
GLM is a model built to make predictions for a variable of interest (target) using the relationship between input variables (predictors) and the target. GLMs have two main components an analyst must choose. The first is the distribution, which describes how the actual target may deviate from the mean target. The second is the link function, which describes how the predictors are related to the target. Together these choices provide a prediction formula made of predictor inputs and estimated parameters.
Detailed:
GLMs are models that assume a distribution for the target and a functional relationship between the target and predictors, both chosen by the analyst. The link function refers to the part of the functional relationship that can be chosen, which eventually provides a prediction formula made of predictor inputs and estimated parameters.
Simple:
GLM is a model built to explore the relationship between variables and a target, i.e. the item that we want to analyze. For a GLM to calculate and predict values of a target, assumptions need to be set by an analyst in two areas. One is the pattern of the target, i.e. how it varies around its mean (the distribution). The other is the relationship between the selected variables/conditions and the target (the functional relationship, or link function).
Linear Exponential Family
Distributions:
-Binomial - binary/Bernoulli
-Gaussian (i.e. normal) - continuous, real
-Gamma - continuous, positive only
-Inverse Gaussian - continuous, positive only
-Poisson - count
-Negative binomial - Alternative to Poisson for count targets with less restrictive variance.
-Tweedie - Many counts of 0 but otherwise continuous on positive values; partly discrete and partly continuous
It can be shown that E[Y] = mu = b’(theta) -> there is a connection between the mean and theta; the mean is embedded in the probability function.
Predictions are estimates of the mean mu (theta is the canonical parameter, not the mean itself)
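For reference, a standard way to write a linear exponential family density in LaTeX notation (phi is the dispersion parameter and c(y, phi) a normalizing term; these symbols are not used elsewhere in these notes):
f(y; \theta, \phi) = \exp\left\{ \frac{y\theta - b(\theta)}{\phi} + c(y, \phi) \right\}
E[Y] = b'(\theta) = \mu, \qquad \mathrm{Var}(Y) = \phi \, b''(\theta)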
Two Modeling Choices
- Distribution (random component)
Describes how the actual target may deviate from the mean target
select any distribution from linear exponential family
-continuous, real -> normal
-continuous, positive only -> gamma or inverse gaussian
-binary -> binomial (Bernoulli)
-count -> poisson
- Link function (systematic component)
Describes how the mean target is related to the predictors
-g(mu) = x^T B
-dictates how mu is connected to the predictors, and it establishes the systematic relationship between the target and the predictors
-Link functions have different shapes and can fit different nonlinear relationships between the predictors and the target variable. When the link function matches the true shape of that relationship, the predictions will be closer to the actual target values, which results in smaller residuals and more significant p-values
-for the possible values of mu, g(mu) should be able to output any real number (matching the unrestricted range of x^T B)
-canonical link ideal, but not only option
g(mu)
-Identity link = mu
-Logit link = ln(mu/(1-mu))
-Log link = ln(mu)
Bernoulli - logit link
Normal - identity link
Gamma - 1/mu or log link
Inverse Gaussian - 1/(mu^2) or log link
Poisson - log link
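A minimal R sketch of these two choices, assuming a hypothetical data frame dat with a count target y and predictors x1, x2 (names are placeholders):
# Random component: Poisson; systematic component: log link
fit <- glm(y ~ x1 + x2, data = dat, family = poisson(link = "log"))
summary(fit)
# Swap family/link for other targets, e.g. Gamma(link = "log"),
# binomial(link = "logit"), gaussian(link = "identity")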
Objective function
GLM estimates B using maximum likelihood estimation (MLE), which finds the values of B that maximize the log-likelihood function.
-maximize l(B) or minimize -l(B)
If the algorithm fails to converge, try the canonical link
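A sketch of the MLE idea for a Poisson GLM with log link, minimizing -l(B) directly with optim (hypothetical vectors y and x assumed to exist):
negll <- function(b, y, X) {
  mu <- exp(X %*% b)                         # inverse of the log link: mu = exp(x^T B)
  -sum(dpois(y, lambda = mu, log = TRUE))    # negative log-likelihood -l(B)
}
X <- cbind(1, x)                             # intercept plus one predictor
opt <- optim(c(0, 0), negll, y = y, X = X)   # minimize -l(B)
opt$par                                      # should be close to coef(glm(y ~ x, family = poisson()))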
Key metrics
mu^ denotes the model's prediction; mu^ = g^(-1)(x^T b)
comparing models (in absence of test set):
1. RMSE (lowest)
2. Maximized log-likelihood (highest)
3. Deviance (lowest)
4. Pearson chi-square (lowest)
5. AIC and BIC (lowest)
->to compare training to test, divide by the # of observations in each of the train/test sets (for #2-4)
2) Log-likelihood: l_NULL <= l(b) <= l_Sat; the fitted model's l(b) falls between the null and saturated models
–A high l(b) is preferable, except when it is too close to l_Sat, which suggests overfitting
3) Deviance = 2*psi*[l_Sat - l(b)], where psi is the dispersion parameter -> for MLR (normal target, identity link), deviance = SSE
4) Pearson chi-square statistic = Sum_i (y_i - mu^_i)^2 / sigma^_i^2 -> each denominator is the estimated variance of Y_i
5) AIC = -2*l(b) + 2p; BIC = -2*l(b) + ln(n)*p -> the first term, -2*l(b), may be referred to as the 'deviance' in software output
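These metrics can be pulled from a fitted glm object in R (using the hypothetical fit from the earlier sketch):
logLik(fit)                              # maximized log-likelihood l(b) (higher is better)
deviance(fit)                            # deviance (lower is better)
sum(residuals(fit, type = "pearson")^2)  # Pearson chi-square statistic (lower is better)
AIC(fit); BIC(fit)                       # penalized criteria (lower is better)
sqrt(mean((dat$y - fitted(fit))^2))      # training RMSE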
Model Performance
Test set preferred (RMSE or other). Otherwise, lowest AIC.
Deviance residual analysis to look for violations of model assumptions, similar to residual analysis in MLR (for continuous targets)
–deviance residuals that are not well behaved indicate a poor choice of distribution and/or link function
Also check the number of iterations until convergence - fewer iterations = less concern
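A sketch of these diagnostics in R (same hypothetical fit as above):
dev_res <- residuals(fit, type = "deviance")
plot(fitted(fit), dev_res)         # look for patterns or non-constant spread
qqnorm(dev_res); qqline(dev_res)   # roughly normal if distribution/link are reasonable
fit$iter                           # number of iterations until convergence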
Interpretation
Identity Link: For every unit increase in x_j, the predicted target changes by b_j
Logit Link: For every unit increase in x_j, the predicted odds change by a factor of exp(b_j)
Log Link: For every unit increase in x_j, the predicted target changes by a factor of exp(b_j)
–Assuming all other predictors remain constant and that x_j has no higher-order terms or interaction with another predictor
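For a log (or logit) link, the multiplicative interpretation can be read off directly in R (hypothetical fit from the earlier sketch):
exp(coef(fit))   # factor change in the predicted target (or odds) per unit increase in x_j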
Hypothesis tests and Overdispersion
Hypothesis tests - same idea as MLR BUT can we trust results?? Not if overdispersion is a concern.
Overdispersion: when the variability in the data is larger than the variance the model assumes. Common for distributions with a restrictive variance (Poisson and binomial).
–When deviance is noticeably larger than its degrees of freedom -> overdispersion
–When deviance is noticeably smaller than its degrees of freedom-> underdispersion
Addressing it with the quasi-likelihood method aims to make certain hypothesis tests more reliable, e.g. changing z tests to t tests. Nothing about the systematic component changes, including b; only the random component is altered.
Run another GLM but use family = quasipoisson() or quasibinomial() in R (see the sketch below)
-does not improve model’s predictive ability
-does make hypothesis tests more reliable
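A sketch of the quasi-likelihood refit in R (same hypothetical data as above; note the R family functions are spelled quasipoisson() and quasibinomial()):
fit_quasi <- glm(y ~ x1 + x2, data = dat, family = quasipoisson())
summary(fit_quasi)   # same coefficient estimates b; dispersion is estimated and z tests become t tests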
Advantages and Disadvantages
Advantages:
–Accommodates a large variety of distribution and link function combos
–Can express the prediction with a formula
–Able to make statistical inferences
Disadvantages:
–Requires other techniques to aid in variable selection
–Validity of results highly dependent on assumptions
Regularized GLM
Same idea as regularized MLR.
Replace SSE with -l(b) in the penalized objective.
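One common implementation (not named in these notes) is the glmnet package, which minimizes -l(b)/n plus the elastic net penalty (hypothetical dat, y, x1, x2 as before):
library(glmnet)
X <- model.matrix(y ~ x1 + x2, data = dat)[, -1]           # predictor matrix, intercept column dropped
cv <- cv.glmnet(X, dat$y, family = "poisson", alpha = 1)   # alpha = 1: lasso penalty
coef(cv, s = "lambda.min")                                  # coefficients at the lambda with lowest CV error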