GLMs Flashcards
Explain how a linear model can be viewed as a generalized linear model
A linear model is a special case of a GLM when the target variable is normally distributed and the link function is the identity function
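A minimal sketch of this equivalence in Python (statsmodels, simulated data): an OLS fit and a Gaussian-family GLM with its default identity link recover the same coefficient estimates.

```python
import numpy as np
import statsmodels.api as sm

# Simulated data: y is a linear function of two predictors plus normal noise
rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(100, 2)))
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=100)

ols_fit = sm.OLS(y, X).fit()
glm_fit = sm.GLM(y, X, family=sm.families.Gaussian()).fit()  # identity link is the default

print(ols_fit.params)
print(glm_fit.params)  # same estimates as OLS
```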
Explain the difference between weights and offsets when applied to a GLM
Weights and offsets both take exposure into account to improve fitting, however, the key differences are:
* Weights: the observations of the target variable should be averaged by exposure -> the variance of each observation is inversely related to the size of the exposure -> weights do not affect the mean of the target variable
* Offsets: the observations of the target variable are aggregated over the exposure -> the mean of the target variable is in direct proportion to the exposure, but its variance is unaffected
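A sketch of the two approaches in Python (statsmodels, simulated data with hypothetical columns `claims`, `exposure`, and a single predictor `x`): the offset model for aggregate counts and the exposure-weighted model for average frequencies produce the same coefficient estimates.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical policy data: 'claims' = aggregate claim counts, 'exposure' = earned exposure
rng = np.random.default_rng(1)
df = pd.DataFrame({"exposure": rng.uniform(0.5, 2.0, 500),
                   "x": rng.normal(size=500)})
df["claims"] = rng.poisson(df["exposure"] * np.exp(0.3 + 0.5 * df["x"]))

# Offset: model the aggregate counts directly, with ln(exposure) as an offset
offset_fit = smf.glm("claims ~ x", data=df, family=sm.families.Poisson(),
                     offset=np.log(df["exposure"])).fit()

# Weights: model the average frequency (claims per unit of exposure), weighted by exposure
df["freq"] = df["claims"] / df["exposure"]
weight_fit = smf.glm("freq ~ x", data=df, family=sm.families.Poisson(),
                     var_weights=df["exposure"]).fit()

print(offset_fit.params)
print(weight_fit.params)  # coefficient estimates agree with the offset model
```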
State the statistical method typically used to estimate the parameters of a GLM
Maximum Likelihood Estimation (MLE)
Explain the problem with deviance as a model selection criterion
Deviance is merely a goodness-of-fit measure on the training set and always decreases (never increases) when new predictors are added. Selecting the GLM with the smallest deviance therefore always picks the most elaborate GLM, which has the lowest training error but not necessarily the lowest test error, and is likely overfitted.
Explain the limitations of the likelihood ratio test as a model selection method
- It can only be used to compare one pair of GLMs at a time
- The simpler GLM must be a special case/nested within the more complex GLM in order to use LRT
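A sketch of a likelihood ratio test between two nested Poisson GLMs (Python/statsmodels, simulated data; the reduced GLM drops two predictors from the full GLM):

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(2)
X_full = sm.add_constant(rng.normal(size=(200, 3)))       # constant + 3 predictors
y = rng.poisson(np.exp(0.2 + 0.4 * X_full[:, 1]))         # only the first predictor matters

full = sm.GLM(y, X_full, family=sm.families.Poisson()).fit()
reduced = sm.GLM(y, X_full[:, :2], family=sm.families.Poisson()).fit()  # nested in the full GLM

lr_stat = 2 * (full.llf - reduced.llf)        # equals the drop in deviance
df_diff = full.df_model - reduced.df_model    # number of extra parameters in the full GLM
p_value = stats.chi2.sf(lr_stat, df_diff)
print(lr_stat, p_value)
```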
Explain the importance of setting a cutoff for a binary classifier
It’s important to set a cutoff for a binary classifier to translate the predicted probabilities into predicted classes. For example, we want to know whether we test positive or negative for COVID, not the predicted probability of being infected.
Explain the relationship between accuracy, sensitivity, and specificity
Accuracy is a weighted average of specificity and sensitivity, where the weights are the proportions of observations belonging to the two classes
* Sensitivity = proportion of correctly classified positive observations
* Specificity = proportion of correctly classified negative observations
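A small sketch (plain Python/NumPy) computing the three measures from a confusion matrix and verifying the weighted-average relationship:

```python
import numpy as np

def confusion_rates(y_true, y_pred):
    """Return accuracy, sensitivity, and specificity for 0/1 labels."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    sensitivity = tp / (tp + fn)   # correctly classified positives
    specificity = tn / (tn + fp)   # correctly classified negatives
    n_pos, n_neg = tp + fn, tn + fp
    # Accuracy as the class-proportion-weighted average of sensitivity and specificity
    accuracy = (n_pos * sensitivity + n_neg * specificity) / (n_pos + n_neg)
    return accuracy, sensitivity, specificity

print(confusion_rates([1, 1, 0, 0, 0], [1, 0, 0, 0, 1]))  # (0.6, 0.5, 0.667)
```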
Explain how the cutoff of a binary classifier affects sensitivity and specificity
The selection of the cutoff for a binary classifier involves a trade-off between having high sensitivity and having high specificity
* cutoff = 0 -> every observation is classified as positive, so sensitivity = 1 and specificity = 0
* as the cutoff increases -> more observations are classified as negative, which means more true negatives (and false negatives) and fewer true positives (and false positives), so sensitivity decreases and specificity increases
* cutoff = 1 -> every observation is classified as negative, so sensitivity = 0 and specificity = 1
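A sketch of the trade-off (Python/NumPy, simulated probabilities): sweeping the cutoff from 0 to 1 moves sensitivity from 1 down to 0 and specificity from 0 up to 1.

```python
import numpy as np

rng = np.random.default_rng(3)
p_hat = rng.uniform(size=1000)        # predicted probabilities (hypothetical classifier output)
y = rng.binomial(1, p_hat)            # simulated true classes

for cutoff in [0.0, 0.25, 0.5, 0.75, 1.0]:
    pred = (p_hat >= cutoff).astype(int)        # cutoff = 0 labels everything positive
    sens = np.mean(pred[y == 1] == 1)
    spec = np.mean(pred[y == 0] == 0)
    print(f"cutoff={cutoff:.2f}  sensitivity={sens:.2f}  specificity={spec:.2f}")
```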
Explain the problem with unbalanced data
A classifier implicitly places more weight on the majority class without paying enough attention to the minority class. As a result, a high accuracy can be deceptive.
Explain how undersampling and oversampling work to make unbalanced data more balanced
Undersampling produces roughly balanced data by drawing fewer observations from the negative class and retaining all of the positive observations. However, less data means the training becomes less robust and the classifier becomes more prone to overfitting
Oversampling produces roughly balanced data by retaining all observations in the dataset and sampling the positive class with replacement. However, more data means a heavier computational burden
Explain why oversampling must be performed after splitting the full data into training and test data
Oversampling keeps all the original data, but resamples the positive class with replacement to reduce the imbalance between the two classes. If oversampling is not performed after splitting the data, some positive-class observations may appear in both the training and test sets, and the test set will not be truly unseen to the trained classifier.
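A sketch of both resampling schemes from the last two cards (Python, pandas + scikit-learn, hypothetical column names `x` and `target`), with the rebalancing applied only to the training set after the split:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Hypothetical unbalanced data: target = 1 is the rare positive class
rng = np.random.default_rng(4)
df = pd.DataFrame({"x": rng.normal(size=1000),
                   "target": rng.binomial(1, 0.05, size=1000)})

# Split FIRST, then rebalance only the training set
train, test = train_test_split(df, test_size=0.3, stratify=df["target"], random_state=0)
pos, neg = train[train["target"] == 1], train[train["target"] == 0]

# Oversampling: keep everything, resample the positive class with replacement
over = pd.concat([neg, resample(pos, replace=True, n_samples=len(neg), random_state=0)])

# Undersampling: keep all positives, draw fewer negatives without replacement
under = pd.concat([pos, resample(neg, replace=False, n_samples=len(pos), random_state=0)])

print(over["target"].value_counts())
print(under["target"].value_counts())
```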
Explain one reason for using oversampling over undersampling, and one reason for using undersampling over oversampling
Oversampling can be used to retain the information about the positive class.
Undersampling can be used to ease the computational burden and reduce run time when the training data is excessively large
Explain the Tweedie distribution
The Tweedie distribution has a mixture of discrete and continuous components.
* Tweedie is an “in-between” distribution of Poisson and gamma (variance power parameter between 1 and 2); it is a Poisson sum of gamma random variables
* Tweedie has a discrete probability mass at zero and a probability density function on the positive real line.
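A sketch in Python (statsmodels, simulated data): simulate aggregate losses as a Poisson sum of gamma severities, then fit a Tweedie GLM with variance power 1.5 (between the Poisson value of 1 and the gamma value of 2). The simulated target shows the point mass at zero.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 2000
x = rng.normal(size=n)
mu = np.exp(0.5 + 0.3 * x)
counts = rng.poisson(mu)                                # claim counts
# Aggregate losses: a Poisson number of gamma severities (zero when there are no claims)
y = np.array([rng.gamma(shape=2.0, scale=0.5, size=c).sum() for c in counts])

X = sm.add_constant(x)
fit = sm.GLM(y, X, family=sm.families.Tweedie(var_power=1.5)).fit()
print(fit.params)
print(np.mean(y == 0))   # a nonzero proportion of exact zeros, as expected for Tweedie
```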
What are canonical links?
A canonical link is the link function that equates the linear predictor with the natural parameter of the target distribution. Canonical links have the advantage of simplifying the mathematics of the estimation process and making it more likely to converge, but they shouldn’t always be used.
* Normal: identity, g(μ) = μ
* Binomial: logit, g(π) = ln[π / (1 − π)]
* Poisson: log, g(μ) = ln(μ)
* Gamma: inverse, g(μ) = 1/μ
* Inverse Gaussian: squared inverse, g(μ) = 1/μ²
For example, the canonical link for the gamma distribution (the inverse link) does not guarantee positive predictions, nor is it easy to interpret, so the log link is more commonly used
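A sketch contrasting the two links for a gamma GLM (Python/statsmodels, simulated data); the log link gives multiplicative, easy-to-interpret effects and guaranteed positive predictions:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
x = rng.uniform(0.0, 1.0, size=300)
X = sm.add_constant(x)
mu = 1.0 / (1.0 + 0.3 * x)                      # mean kept safely positive
y = rng.gamma(shape=2.0, scale=mu / 2.0)        # gamma target with mean mu

# Canonical (inverse) link: simpler estimation mathematics, but fitted means can
# go negative in general and the coefficients are hard to interpret
canonical = sm.GLM(y, X, family=sm.families.Gamma()).fit()

# Log link (non-canonical): positive predictions and multiplicative effects,
# so it is often preferred in practice
loglink = sm.GLM(y, X, family=sm.families.Gamma(link=sm.families.links.Log())).fit()

print(canonical.params)
print(loglink.params)
```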
Debunk a common misconception about link functions and GLMs
Link functions are applied to the mean of the target variable; the target variable itself is left untransformed. For example, if the log link is used, it is fine for some observations of the target variable to be zero, because the log link is not applied to the individual target observations
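A sketch of the point (Python/statsmodels, simulated counts): a Poisson GLM with the log link handles data containing exact zeros without any issue, because the log is applied to the mean rather than to the observations (log-transforming the target itself would produce -inf at the zeros).

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(9)
X = sm.add_constant(rng.normal(size=(200, 1)))
y = rng.poisson(np.exp(0.1 + 0.6 * X[:, 1]))     # count data containing exact zeros
print((y == 0).sum(), "zeros in the target")

# Log LINK: applied to the mean only, so zero observations are no problem
fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()
print(fit.params)

# A log TRANSFORM of the target would fail: np.log(y) is -inf wherever y == 0
```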
Why might we apply weights and offsets when fitting GLMs?
To take advantage of the fact that different observations in the data may have different exposures and thus different degrees of precision. The goal is to improve the reliability of the fitting procedure
What happens when we use logged exposure as an offset?
We are assuming that the target mean varies in direct proportion to the exposure E (not to ln E)
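A quick sketch of the algebra (writing x'β for the linear predictor): with a log link and ln(E) as the offset,
ln(μ) = ln(E) + x'β  =>  μ = E · exp(x'β),
so the fitted mean scales in direct proportion to E, not to ln E.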
Explain the pros and cons of using MLE
Pros: MLE produces estimates with desirable statistical properties such as asymptotic unbiasedness, efficiency, and normality
Cons: The optimization algorithm for MLE is occasionally plagued by convergence issues (which may happen when a non-canonical link is used), which means no estimates may be produced and the GLM cannot be fitted or applied as a result
What are some facts about deviance?
- Deviance reduces to RSS for linear models, which means deviance can be seen as a generalization of RSS that works for non-normal target variables in the exponential family (sketched after this list)
- Deviance should only be used to compare GLMs having the same target distribution (so they have the same maximum loglikelihood of the saturated model)
- Deviance provides the foundation of an important model diagnostic tool for GLMs, the deviance residual
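A quick sketch of the RSS connection, under the common convention that the deviance is D = 2φ[l_SAT − l(β̂)], where l_SAT is the maximized loglikelihood of the saturated model and φ is the dispersion parameter: for a normal target, 2[l_SAT − l(β̂)] = Σ(y_i − μ̂_i)²/σ² and φ = σ², so D = Σ(y_i − μ̂_i)² = RSS.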
What are the properties of deviance residuals (that parallel those of the raw residuals of a linear model)?
- They are approximately normally distributed for most target distributions in the linear exponential family (except binomial) -> q-q plots of deviance residuals are valid even if the target distribution is not normal
- They have no systematic patterns when considered on their own and with respect to the predictors
- They have approximately constant variance upon standardization (using the standard error implied by the GLM)
What does it mean when observed points on the qq plot for deviance residuals are far from the reference straight line?
The target distribution, link function, and/or the form of the model equation may not be appropriate
Explain what standardized deviance residuals are
They are deviance residuals scaled by their standard error implied by the GLM, and should be approximately homoscedastic if a correct model is used
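A sketch tying the last few cards together (Python, statsmodels + matplotlib, simulated gamma data): extract the deviance residuals, standardize them crudely, and draw a q-q plot against the normal reference line.

```python
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

rng = np.random.default_rng(7)
X = sm.add_constant(rng.normal(size=(400, 2)))
mu = np.exp(X @ np.array([0.2, 0.5, -0.3]))
y = rng.gamma(shape=5.0, scale=mu / 5.0)            # gamma target with mean mu

fit = sm.GLM(y, X, family=sm.families.Gamma(link=sm.families.links.Log())).fit()

dev_resid = fit.resid_deviance                      # deviance residuals
std_dev_resid = dev_resid / np.std(dev_resid)       # crude standardization, for illustration only

# Points should fall close to the 45-degree reference line if the model is adequate
sm.qqplot(std_dev_resid, line="45")
plt.show()
```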
Explain what an AUC of almost 1, 0.5, and near 0 indicates
A perfect model that predicts the correct class for new data each time will have a ROC plot showing the curve approaching the top left corner with an AUC near 1.0 (perfect classifier)
When a model has an AUC of 0.5, it classifies the observations purely randomly without using the information contained in the predictors (naive classifier). Any model having an AUC less than 0.5 means it is providing predictions that are worse than random selection, with a near 0 AUC indicating that the model makes the wrong classification almost every time.
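A sketch of the three regimes (Python/scikit-learn, simulated scores):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(8)
y = rng.binomial(1, 0.5, size=1000)                  # true classes

informative = np.clip(y * 0.7 + rng.normal(0, 0.2, size=1000), 0, 1)  # strong classifier
random_guess = rng.uniform(size=1000)                                  # naive classifier

print(roc_auc_score(y, informative))      # close to 1
print(roc_auc_score(y, random_guess))     # close to 0.5
print(roc_auc_score(y, 1 - informative))  # close to 0: systematically wrong predictions
```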
What is overdispersion?
When the variance of the target variable exceeds its mean. For example, the Poisson distribution can be used to model count variables, but it’s vulnerable to the problem of overdispersion (Poisson requires its mean and variance be equal)
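A sketch of an informal overdispersion check (Python/statsmodels, simulated overdispersed counts): after a Poisson fit, a Pearson chi-square statistic well above the residual degrees of freedom suggests the variance exceeds what the Poisson assumption allows.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(10)
X = sm.add_constant(rng.normal(size=(500, 1)))
mu = np.exp(0.3 + 0.5 * X[:, 1])
y = rng.negative_binomial(n=2, p=2 / (2 + mu))      # counts with variance > mean (overdispersed)

fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()
print(fit.pearson_chi2 / fit.df_resid)              # well above 1 signals overdispersion
```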