GLM Flashcards
Definition
Final:
GLM is a model built to make predictions for a variable of interest (target) using the relationship between input variables (predictors) and the target. GLMs have two main components an analyst must choose. The first is the distribution, which describes how the actual target may deviate from the mean target. The second is the link function, which describes how the predictors are related to the target. Together these choices provide a prediction formula made of predictor inputs and estimated parameters.
Detailed:
GLMs are models that assume a distribution for the target and a functional relationship between the target and predictors, both chosen by the analyst. The link function refers to the part of the functional relationship that can be chosen, which eventually provides a prediction formula made of predictor inputs and estimated parameters.
Simple:
GLM is a model built to explore the relationship between variables and a target, i.e. the item that we want to analyze. For a GLM to calculate and predict values of a target, assumptions need to be set by an analyst in two areas. One is the pattern of the target, i.e. how it varies around its mean (the distribution). The other is the relationship between the selected variables/conditions and the target (the functional relationship, or link function).
Linear Exponential Family
Distributions:
-Binomial - binary/Bernoulli
-Gaussian (i.e. normal) - continuous, real
-Gamma - continuous, positive only
-Inverse Gaussian - continuous, positive only
-Poisson - count
-Negative binomial - Alternative to Poisson for count targets with less restrictive variance.
-Tweedie - Many counts of 0 but otherwise continuous on positive values; partly discrete and partly continuous
It can be shown that E[Y] = mu = b’(theta) -> there is a connection between the mean and theta; the mean is embedded in the probability function.
Predictions are estimates of the mean mu (theta is the canonical parameter, not the mean itself)
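For reference, a standard way to write a linear exponential family density in LaTeX notation (phi is the dispersion parameter and c(y, phi) a normalizing term; these symbols are not used elsewhere in these notes):
f(y; \theta, \phi) = \exp\left\{ \frac{y\theta - b(\theta)}{\phi} + c(y, \phi) \right\}
E[Y] = b'(\theta) = \mu, \qquad \mathrm{Var}(Y) = \phi \, b''(\theta)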
Two Modeling Choices
- Distribution (random component)
Describes how the actual target may deviate from the mean target
select any distribution from linear exponential family
-continuous, real -> normal
-continuous, positive only -> gamma or inverse gaussian
-binary -> binomial (Bernoulli)
-count -> poisson
- Link function (systematic component)
Describes how the mean target is related to the predictors
-g(mu) = x^T B
-dictates how mu is connected to the predictors, and it establishes the systematic relationship between the target and the predictors
-Link functions have different shapes and can fit different nonlinear relationships between the predictors and the target variable. When the link function matches the true shape of that relationship, the predictions will be closer to the actual target values, which results in smaller residuals and more significant p-values
-for the possible values of mu, g(mu) should be able to output any real number (matching the unrestricted range of x^T B)
-canonical link ideal, but not only option
g(mu)
-Identity link = mu
-Logit link = ln(mu/(1-mu))
-Log link = ln(mu)
Bernoulli - logit link
Normal - identity link
Gamma - 1/mu or log link
Inverse Gaussian - 1/(mu^2) or log link
Poisson - log link
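A minimal R sketch of these two choices, assuming a hypothetical data frame dat with a count target y and predictors x1, x2 (names are placeholders):
# Random component: Poisson; systematic component: log link
fit <- glm(y ~ x1 + x2, data = dat, family = poisson(link = "log"))
summary(fit)
# Swap family/link for other targets, e.g. Gamma(link = "log"),
# binomial(link = "logit"), gaussian(link = "identity")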
Objective function
GLM estimates B using maximum likelihood estimation (MLE), which finds the values of B that maximize the log-likelihood function.
-maximize l(B) or minimize -l(B)
If the algorithm fails to converge, try the canonical link
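A sketch of the MLE idea for a Poisson GLM with log link, minimizing -l(B) directly with optim (hypothetical vectors y and x assumed to exist):
negll <- function(b, y, X) {
  mu <- exp(X %*% b)                         # inverse of the log link: mu = exp(x^T B)
  -sum(dpois(y, lambda = mu, log = TRUE))    # negative log-likelihood -l(B)
}
X <- cbind(1, x)                             # intercept plus one predictor
opt <- optim(c(0, 0), negll, y = y, X = X)   # minimize -l(B)
opt$par                                      # should be close to coef(glm(y ~ x, family = poisson()))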
Key metrics
mu^ denotes the model's prediction; mu^ = g^(-1)(x^T b)
comparing models (in absence of test set):
1. RMSE (lowest)
2. Maximized log-likelihood (highest)
3. Deviance (lowest)
4. Pearson chi-square (lowest)
5. AIC and BIC (lowest)
->to compare training to test, divide by the # of observations in each of the train/test sets (for #2-4)
2) Log-likelihood: l_NULL <= l(b) <= l_Sat; the fitted model's l(b) falls between the null and saturated models
–A high l(b) is preferable, except when it is too close to l_Sat, which suggests overfitting
3) Deviance = 2*psi*[l_Sat - l(b)], where psi is the dispersion parameter -> for MLR (normal target, identity link), deviance = SSE
4) Pearson chi-square statistic = Sum_i (y_i - mu^_i)^2 / sigma^_i^2 -> each denominator is the estimated variance of Y_i
5) AIC = -2*l(b) + 2p; BIC = -2*l(b) + ln(n)*p -> the first term, -2*l(b), may be referred to as the 'deviance' in software output
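These metrics can be pulled from a fitted glm object in R (using the hypothetical fit from the earlier sketch):
logLik(fit)                              # maximized log-likelihood l(b) (higher is better)
deviance(fit)                            # deviance (lower is better)
sum(residuals(fit, type = "pearson")^2)  # Pearson chi-square statistic (lower is better)
AIC(fit); BIC(fit)                       # penalized criteria (lower is better)
sqrt(mean((dat$y - fitted(fit))^2))      # training RMSE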
Model Performance
Test set preferred (RMSE or other). Otherwise, lowest AIC.
Deviance residual analysis to look for violations of model assumptions, similar to residual analysis in MLR (for continuous targets)
–deviance residuals that are not well behaved indicate a poor choice of distribution and/or link function
Also check the number of iterations until convergence - fewer iterations = less concern
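A sketch of these diagnostics in R (same hypothetical fit as above):
dev_res <- residuals(fit, type = "deviance")
plot(fitted(fit), dev_res)         # look for patterns or non-constant spread
qqnorm(dev_res); qqline(dev_res)   # roughly normal if distribution/link are reasonable
fit$iter                           # number of iterations until convergence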
Interpretation
Identity Link: For every unit increase in x_j, the predicted target changes by b_j
Logit Link: For every unit increase in x_j, the predicted odds change by a factor of exp(b_j)
Log Link: For every unit increase in x_j, the predicted target changes by a factor of exp(b_j)
–Assuming all other predictors remain constant and that x_j has no higher-order terms or interaction with another predictor
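For a log (or logit) link, the multiplicative interpretation can be read off directly in R (hypothetical fit from the earlier sketch):
exp(coef(fit))   # factor change in the predicted target (or odds) per unit increase in x_j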
Hypothesis tests and Overdispersion
Hypothesis tests - same idea as MLR BUT can we trust results?? Not if overdispersion is a concern.
Overdispersion: when the variability in the data is larger than the variance the model assumes. Common for distributions with a restrictive variance (Poisson and binomial).
–When deviance is noticeably larger than its degrees of freedom -> overdispersion
–When deviance is noticeably smaller than its degrees of freedom-> underdispersion
Addressing it with the quasi-likelihood method aims to make certain hypothesis tests more reliable, e.g. changing z tests to t tests. Nothing about the systematic component changes, including b; only the random component is altered.
Run another GLM but use family = quasipoisson() or quasibinomial() in R (see the sketch below)
-does not improve model’s predictive ability
-does make hypothesis tests more reliable
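A sketch of the quasi-likelihood refit in R (same hypothetical data as above; note the R family functions are spelled quasipoisson() and quasibinomial()):
fit_quasi <- glm(y ~ x1 + x2, data = dat, family = quasipoisson())
summary(fit_quasi)   # same coefficient estimates b; dispersion is estimated and z tests become t tests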
Advantages and Disadvantages
Advantages:
–Accommodates a large variety of distribution and link function combos
–Can express the prediction with a formula
–Able to make statistical inferences
Disadvantages:
–Requires other techniques to aid in variable selection
–Validity of results highly dependent on assumptions
Regularized GLM
Same idea as regularized MLR.
Replace SSE with -l(b) in the penalized objective.
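One common implementation (not named in these notes) is the glmnet package, which minimizes -l(b)/n plus the elastic net penalty (hypothetical dat, y, x1, x2 as before):
library(glmnet)
X <- model.matrix(y ~ x1 + x2, data = dat)[, -1]           # predictor matrix, intercept column dropped
cv <- cv.glmnet(X, dat$y, family = "poisson", alpha = 1)   # alpha = 1: lasso penalty
coef(cv, s = "lambda.min")                                  # coefficients at the lambda with lowest CV error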