3.4 Generalized Linear Models Flashcards
How does a generalized linear model (GLM) extend multiple linear regression (MLR)?
A GLM extends MLR by allowing for non-normal distributions of the target variable and (through the link function) non-linear relationships between the target mean and the linear predictor.
Describe what characterizes the linear exponential family of distributions. What are some common distributions in this family?
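A distribution belongs to the linear exponential family if its density or probability function can be written as f(y) = exp{[yθ - b(θ)]/φ + c(y, φ)}, where θ is the canonical parameter and φ is the dispersion parameter. As a consequence, the mean is μ = b'(θ) and the variance is φ times a function of the mean. Common distributions include normal, binomial, Poisson, gamma, and inverse Gaussian.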
In GLMs, what are the two crucial modeling choices that must be made? How do they relate to the modeling components of a GLM?
The two crucial modeling choices are the distribution, which relates to the random component, and the link function, which relates to the systematic component by connecting the mean of the target to the linear combination of predictors.
Explain the role of the link function in GLMs. What characteristics should a good link function have?
The link function g connects the mean of the target variable (μ) to the predictors and their coefficients:
g(μ) = η = β0 + β1X1 + ... + βpXp
A good link function should map the range of possible μ values onto the entire real number line, so that any value of the linear predictor translates back into a predicted mean within the range allowed by the target distribution (for example, the log link keeps a Poisson mean positive).
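The range-mapping property can be checked numerically. The following is a minimal Python sketch (Python is used here only for illustration; the function names and the sample values of the linear predictor are assumptions, not part of the source):

```python
import math

def logit(mu):
    """Logit link: maps a probability in (0, 1) to the real line."""
    return math.log(mu / (1.0 - mu))

def inv_logit(eta):
    """Inverse logit: maps any real number back into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))

def log_link(mu):
    """Log link: maps a positive mean to the real line."""
    return math.log(mu)

def inv_log(eta):
    """Inverse log link: maps any real number back to a positive mean."""
    return math.exp(eta)

# Hypothetical values of the linear predictor, negative and positive
etas = [-5.0, -1.0, 0.0, 1.0, 5.0]

# Whatever the linear predictor is, the implied means stay in range
probs = [inv_logit(eta) for eta in etas]   # valid Bernoulli means
means = [inv_log(eta) for eta in etas]     # valid Poisson means

for eta, p, m in zip(etas, probs, means):
    print(f"eta={eta:5.1f}  inverse logit={p:.4f}  inverse log={m:.4f}")
```

With the identity link, by contrast, a negative linear predictor would produce a negative predicted mean, which is not valid for a count or a probability.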
How does parameter estimation in GLM differ from that in MLR?
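GLMs use maximum likelihood estimation (MLE) instead of ordinary least squares (OLS). The MLEs generally have no closed form and are found by an iterative algorithm; for a normal target with the identity link, MLE gives the same coefficient estimates as OLS.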
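To make the idea of maximizing the likelihood concrete, here is a small Python sketch, assuming an intercept-only Poisson model with a log link and made-up counts. It uses Newton-Raphson steps, which is one common way to maximize a GLM likelihood; it is not meant to reproduce what any particular software does internally.

```python
import math

y = [2, 0, 3, 1, 4]  # hypothetical observed counts

def log_likelihood(b0, y):
    """Poisson log-likelihood when every observation has mean exp(b0)."""
    mu = math.exp(b0)
    return sum(yi * b0 - mu - math.lgamma(yi + 1) for yi in y)

b0 = 0.0  # starting value for the intercept
for _ in range(50):
    mu = math.exp(b0)
    score = sum(yi - mu for yi in y)   # first derivative of the log-likelihood
    info = len(y) * mu                 # negative second derivative
    step = score / info
    b0 += step                         # Newton-Raphson update
    if abs(step) < 1e-12:
        break

print("estimated intercept:", b0)
print("log of sample mean :", math.log(sum(y) / len(y)))
```

For this simple model the iterations settle at the log of the sample mean, which is the known MLE of the intercept.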
What are the key metrics used to assess GLM performance? How do they differ from those used in MLR?
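Key metrics for GLMs include the maximized log-likelihood, deviance, Pearson chi-square statistic, AIC, and BIC. Unlike the R-squared and residual sum of squares used in MLR, these metrics are built on the likelihood function or on residuals scaled by the model's variance function, rather than on raw squared residuals.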
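As an illustration of how these metrics are computed, the Python sketch below assumes a Poisson GLM. The observed counts, the fitted means, and the number of coefficients p are all hypothetical values chosen for the example, not output from a real fit.

```python
import math

y = [2, 0, 3, 1, 4]              # hypothetical observed counts
mu = [1.5, 0.8, 2.6, 1.9, 3.2]   # hypothetical fitted means
p = 2                            # assumed number of estimated coefficients
n = len(y)

def poisson_loglik(y, mu):
    """Poisson log-likelihood evaluated at the fitted means."""
    return sum(yi * math.log(mi) - mi - math.lgamma(yi + 1)
               for yi, mi in zip(y, mu))

def poisson_deviance(y, mu):
    """Twice the gap between the saturated and the fitted log-likelihood."""
    total = 0.0
    for yi, mi in zip(y, mu):
        term = yi * math.log(yi / mi) if yi > 0 else 0.0  # y*ln(y) -> 0 at y = 0
        total += term - (yi - mi)
    return 2.0 * total

loglik = poisson_loglik(y, mu)
deviance = poisson_deviance(y, mu)
# Pearson chi-square: squared residuals scaled by the Poisson variance (= mean)
pearson = sum((yi - mi) ** 2 / mi for yi, mi in zip(y, mu))
aic = -2.0 * loglik + 2.0 * p
bic = -2.0 * loglik + p * math.log(n)

print(f"log-likelihood={loglik:.4f} deviance={deviance:.4f} "
      f"Pearson={pearson:.4f} AIC={aic:.4f} BIC={bic:.4f}")
```

Lower deviance, AIC, and BIC indicate a better fit or a better trade-off between fit and complexity; BIC penalizes each extra coefficient by ln(n) instead of 2.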
Describe the process of residual analysis in GLMs. How does it differ from residual analysis in MLR?
In GLMs, regular or raw residuals are not typically examined. Instead, deviance residuals or Pearson residuals are used. The interpretation of residual plots is similar to MLR, but the underlying assumptions differ.
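The two residual types can be sketched in Python as follows, assuming a Poisson GLM (variance equal to the mean). The counts and fitted means are hypothetical values used only to show the arithmetic.

```python
import math

y = [2, 0, 3, 1, 4]              # hypothetical observed counts
mu = [1.5, 0.8, 2.6, 1.9, 3.2]   # hypothetical fitted means

def pearson_residual(yi, mi):
    """Raw residual divided by the Poisson standard deviation sqrt(mu)."""
    return (yi - mi) / math.sqrt(mi)

def deviance_residual(yi, mi):
    """Signed square root of the observation's contribution to the deviance."""
    term = yi * math.log(yi / mi) if yi > 0 else 0.0
    d_i = 2.0 * (term - (yi - mi))
    return math.copysign(math.sqrt(max(d_i, 0.0)), yi - mi)

pearson_res = [pearson_residual(yi, mi) for yi, mi in zip(y, mu)]
deviance_res = [deviance_residual(yi, mi) for yi, mi in zip(y, mu)]

# Squared residuals add up to the corresponding goodness-of-fit statistics
pearson_stat = sum(r ** 2 for r in pearson_res)
deviance_stat = sum(r ** 2 for r in deviance_res)

for yi, mi, rp, rd in zip(y, mu, pearson_res, deviance_res):
    print(f"y={yi} mu={mi:.1f} Pearson={rp:+.4f} deviance={rd:+.4f}")
```

Both residuals account for the fact that the variance changes with the mean, which is why they are preferred to raw residuals in diagnostic plots.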
What is logistic regression, and how does it handle binary target variables?
Logistic regression is a GLM for binary targets using the Bernoulli distribution with a logit link function. It models the log-odds of the target being 1 as a linear function of predictors.
Explain how coefficients are interpreted in logistic regression (i.e., with the logit link). How does this differ from linear regression (i.e., with the identity link)?
In logistic regression, the coefficients represent the change in log-odds for a unit increase in the predictor. Exponentiating the coefficient provides the multiplicative change in odds. This contrasts with linear regression where the coefficients represent the change in the target variable for a unit increase in the predictor.
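A quick numerical check of the odds interpretation, written as a Python sketch. The intercept b0 and slope b1 are hypothetical values, not estimates from any dataset.

```python
import math

b0, b1 = -1.5, 0.4  # hypothetical intercept and slope on the log-odds scale

def prob(x):
    """Predicted probability under the logit link."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

def odds(x):
    """Odds of the target being 1 at predictor value x."""
    p = prob(x)
    return p / (1.0 - p)

# A one-unit increase in x multiplies the odds by exp(b1), whatever x is
odds_ratio = odds(3.0) / odds(2.0)
print("odds ratio for x: 2 -> 3 :", odds_ratio)
print("exp(b1)                  :", math.exp(b1))

# The change in probability, by contrast, depends on the starting point
print("probability change 0 -> 1:", prob(1.0) - prob(0.0))
print("probability change 2 -> 3:", prob(3.0) - prob(2.0))
```

The multiplicative effect on the odds is constant, while the additive effect on the probability is not, which is the key contrast with the identity link.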
What is the ROC curve, and how is it used to assess binary classification models?
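The ROC curve plots sensitivity (true positive rate) against 1 minus specificity (false positive rate) for all possible cutoff values. The area under this curve (AUC) measures overall model performance: 0.5 corresponds to random guessing and 1 to perfect classification, so higher values indicate better performance.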
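The construction of the curve can be sketched in a few lines of Python. The labels and predicted probabilities below are made up for the example, and the helper names roc_points and auc are hypothetical; the sketch assumes the data contain at least one positive and one negative case.

```python
def roc_points(labels, scores):
    """Return (1 - specificity, sensitivity) pairs, one per cutoff."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]  # cutoff above every score: nothing predicted positive
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for l, s in zip(labels, scores) if s >= cutoff and l == 1)
        fp = sum(1 for l, s in zip(labels, scores) if s >= cutoff and l == 0)
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

labels = [1, 1, 0, 1, 0, 0]               # hypothetical true classes
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]   # hypothetical predicted probabilities

points = roc_points(labels, scores)
area = auc(points)
print("ROC points:", points)
print("AUC:", area)
```

Here 8 of the 9 positive-negative pairs are ranked correctly, so the AUC equals 8/9.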
Describe Poisson regression. When is it typically used, and what link function is commonly associated with it?
Poisson regression is used for modeling count data. It typically employs the log link function, which ensures positive predictions and serves as the canonical link for the Poisson distribution.
What are exposures in Poisson regression, and how are they incorporated into the model?
In Poisson regression, exposures represent the interval over which counts are measured.
They are incorporated as an offset term in the model:
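ln(μ_i) = ln(w_i) + x_i^T β, where w_i is the exposure of observation i, x_i is its vector of predictor values, and β is the coefficient vector.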
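The effect of the offset can be seen in a short Python sketch. The coefficients and exposure values are hypothetical, and expected_count is an illustrative helper rather than a library function.

```python
import math

beta0, beta1 = -2.0, 0.3  # hypothetical intercept and slope

def expected_count(exposure, x):
    """Mean count under a log link with ln(exposure) as an offset."""
    eta = math.log(exposure) + beta0 + beta1 * x  # offset has a coefficient fixed at 1
    return math.exp(eta)

# Same predictor value, different exposures
mu_one_unit = expected_count(1.0, 1.0)
mu_two_units = expected_count(2.0, 1.0)

print("expected count, exposure 1:", mu_one_unit)
print("expected count, exposure 2:", mu_two_units)
print("count per unit of exposure:", mu_two_units / 2.0)
```

Because the offset enters with a coefficient of exactly 1, the expected count is proportional to the exposure, and the remaining terms model the count per unit of exposure.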
Explain how coefficients are interpreted with the log link. How does this differ from the identity link?
With the log link, the coefficients represent the change in the log of the mean for a unit increase in the predictor. Exponentiating the coefficient gives the multiplicative change in the mean. This contrasts with the identity link where the coefficients represent the change in the mean for a unit increase in the predictor.
Explain the concepts of overdispersion and underdispersion in GLMs. How can they be detected?
Overdispersion occurs when the variability in the data is larger than the variance implied by the fitted model (for a Poisson target, a variance greater than the mean). Underdispersion is the opposite. They can be detected by comparing the deviance (or the Pearson chi-square statistic) to its residual degrees of freedom. A ratio well above 1 indicates overdispersion; a ratio well below 1 suggests underdispersion.
How are dispersion issues addressed in R for binary targets and count targets?
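In R, dispersion issues are addressed using quasi-likelihood models, which estimate a dispersion parameter from the data. For binary targets, family = quasibinomial is passed to glm(), while for count targets, family = quasipoisson is used. The coefficient estimates are the same as under the binomial or Poisson family; only the standard errors are rescaled.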
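The arithmetic behind the quasi-Poisson adjustment can be sketched in Python. The example assumes an intercept-only Poisson model with a log link and made-up counts, and it estimates the dispersion as the Pearson chi-square statistic divided by the residual degrees of freedom.

```python
import math

y = [0, 7, 1, 9, 0, 8]   # hypothetical counts with large spread
n = len(y)
p = 1                    # intercept-only model: one estimated coefficient

mu = sum(y) / n          # fitted mean of the intercept-only Poisson model
pearson = sum((yi - mu) ** 2 / mu for yi in y)
dispersion = pearson / (n - p)   # estimated dispersion parameter

# Standard error of the intercept under the Poisson model: 1 / sqrt(sum of fitted means)
se_poisson = 1.0 / math.sqrt(n * mu)
# Quasi-Poisson keeps the estimate but inflates the standard error
se_quasi = se_poisson * math.sqrt(dispersion)

print("estimated dispersion:", dispersion)
print("Poisson standard error:", se_poisson)
print("quasi-Poisson standard error:", se_quasi)
```

A dispersion estimate well above 1, as here, signals overdispersion, and the larger standard error leads to wider confidence intervals and more conservative tests.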
Describe how GLMs are implemented in R. What are some key functions and arguments used?
GLMs are implemented in R using the glm() function. Key arguments include:
- formula: Specifies the model structure.
- family: Specifies the distribution and link function, e.g., family = poisson(link = "log").
- data: Specifies the dataset.
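- weights: Specifies prior weights for the observations (for weighted GLMs).
- offset: For including offset terms on the scale of the linear predictor (e.g., the log of the exposure in Poisson regression with the log link).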