5/6: GLM & Overdispersion Flashcards
Generalized Linear Model (GLM) & 3 components
A model for a response variable and any number of explanatory variables (continuous and/or categorical). The data-generating process can follow any distribution in the exponential dispersion family.
1. Error structure: the distribution assumed for the response (e.g., Normal, Binomial, Poisson)
2. Linear equation: the linear combination of the predictor (independent) variables
3. Link function: connects the linear equation to the mean of the response, e.g., the logit link for the Binomial distribution and the log link for the Poisson distribution (see the formula below)
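In symbols, the three components combine as follows (standard GLM notation, added here for reference; g is the link function):

```latex
g(\mathbb{E}[Y_i]) \;=\; \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip},
\qquad Y_i \sim \text{Normal / Binomial / Poisson (exponential dispersion family)}
```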
Poisson GLM
- models counts (discrete data, like the number of apples); the expected rate (λ) can still be a fraction or decimal
- counts cannot be negative (no −3 apples) -> transform with the canonical link function, the log link (log of the expected count)
- with this canonical link, the GLM predicts an exponential relationship between the outcome and the explanatory variable
- only a limit at the bottom (no negative counts), no upper limit (see the sketch below)
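A minimal sketch of fitting a Poisson GLM with the log link in Python's statsmodels (the apple-count data and variable names here are made up for illustration):

```python
import numpy as np
import statsmodels.api as sm

# made-up example data: hours of sunlight vs. number of apples counted
hours = np.array([2, 4, 5, 6, 8, 10, 12], dtype=float)
apples = np.array([1, 3, 4, 6, 9, 14, 20])

X = sm.add_constant(hours)  # adds the intercept column
# the log link is the canonical (default) link for the Poisson family
fit = sm.GLM(apples, X, family=sm.families.Poisson()).fit()

# the model is linear on the log scale; back-transforming with exp()
# gives the expected count lambda, which can never be negative
eta = fit.params[0] + fit.params[1] * hours   # linear predictor (log scale)
lam = np.exp(eta)                             # expected counts, all > 0
print(fit.summary())
```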
95% prediction/confidence interval difference
- prediction interval: expected range of future observations
- confidence interval: range of plausible values for a model parameter or fitted mean (e.g., intercept/slope) -> says something about λ itself (see the sketch below)
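A rough illustration of the difference, using a hypothetical Poisson fit like the one sketched above (the prediction interval here is a deliberate simplification):

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import poisson

# same made-up apple-count data as in the Poisson sketch
hours = np.array([2, 4, 5, 6, 8, 10, 12], dtype=float)
apples = np.array([1, 3, 4, 6, 9, 14, 20])
fit = sm.GLM(apples, sm.add_constant(hours), family=sm.families.Poisson()).fit()

new_X = sm.add_constant(np.array([3.0, 7.0, 11.0]))  # hypothetical new x values
pred = fit.get_prediction(new_X)

# confidence interval: plausible range for the mean rate lambda at each new x
ci = pred.summary_frame(alpha=0.05)[["mean_ci_lower", "mean_ci_upper"]]

# crude prediction interval: plausible range for *future counts*, built here only
# from Poisson sampling noise around the estimated lambda (parameter uncertainty ignored)
lam_hat = pred.predicted_mean
pi_low, pi_high = poisson.ppf(0.025, lam_hat), poisson.ppf(0.975, lam_hat)
```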
Regression analysis
is a statistical method used to understand the relationship between two or more variables
Binomial GLM
- estimate how the probability of success (p) depends on the explanatory variable (x)
- binomial ratio: number of successes (correct answers) out of the number of trials (questions)
- transform with the canonical link function, the logit link (log odds of success), which converts probabilities into log odds; back-transforming gives an S-shaped curve that stays between 0 and 1 -> logistic regression
- the problem with fitting a straight line to such data is that below about 3 hours of studying it predicts a negative number of correct answers, and above about 18 hours it predicts more correct answers than there are questions on the exam -> need to restrict predictions to valid outcomes -> link function
- a limit at the bottom and a limit at the top (see the sketch below)
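A minimal sketch of the corresponding Binomial GLM (logistic regression) in statsmodels; the study-hours and exam data are made up for illustration:

```python
import numpy as np
import statsmodels.api as sm

# made-up example: hours studied vs. questions answered correctly out of 20
hours   = np.array([1, 3, 5, 8, 10, 14, 18], dtype=float)
correct = np.array([2, 5, 9, 13, 15, 18, 19])
total   = np.full_like(correct, 20)

# response given as (successes, failures); the logit link is the canonical default
y = np.column_stack([correct, total - correct])
fit = sm.GLM(y, sm.add_constant(hours), family=sm.families.Binomial()).fit()

# back-transform the linear predictor (log odds) to a probability:
# the inverse logit keeps every prediction between 0 and 1 (the S-shaped curve)
eta = fit.params[0] + fit.params[1] * hours
p = 1 / (1 + np.exp(-eta))
print(np.round(p, 2))
```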
Ordinary Linear Regression (OLR)
OLR shares the same assumptions as simple/multiple linear regression, including normally distributed residuals.
An ordinary linear model is a GLM with a Normal error structure and an identity link, so a Normal GLM gives the same results as an ordinary linear model.
Why are model diagnostics necessary?
because the validity of the output depends on the model's assumptions being met
Back transformation
is the process of applying the inverse of the link function to obtain predictions on the scale of the original response variable.
The log link is back-transformed with the exponential function, and the logit link with the inverse logit (which also uses the exponential), returning predictions to the original count and probability scales (see the formulas below).
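In formulas (standard inverse-link expressions, added for reference; η is the linear predictor):

```latex
\text{log link: } \log(\lambda) = \eta \;\Rightarrow\; \lambda = e^{\eta}
\qquad
\text{logit link: } \log\!\left(\frac{p}{1-p}\right) = \eta \;\Rightarrow\; p = \frac{e^{\eta}}{1 + e^{\eta}}
```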
Overdispersion + 4 causes
occurs when the observed variance is larger than the variance predicted by the statistical model; in other words, the spread of the data is larger than the model expects, which can signal that the model isn't fully explaining what is going on -> the model underestimates the variance in the outcome.
Causes are:
1. zero-inflation
2. external influences
3. outliers
4. dependency (non-independent data)
DHARMa is used as a diagnostic tool for multi-level/mixed models; it gives a better check than the default plot() diagnostics for GLMs.
Overdispersion: Fat tails at the bottom (more low counts) and wider spread overall.
Underdispersion: Thinner tails at the top (fewer high counts) and a more compact distribution.
S-Shape: Indicates overdispersion with wide tails in both directions, reflecting greater variability.
Overdispersion only happens in… and reasons why
Binomial and Poisson, because in these models the variance is fixed by the mean (rate or probability) rather than estimated separately.
Lack of independence: If observations are not truly independent, meaning events or trials influence each other, like wind when flipping a coin, it adds extra variability not accounted for by the model.
Non-fixed rate/probability: If the rate (in Poisson) or probability of success (in Binomial) is not constant across observations, this also increases variability, leading to overdispersion.
In both cases, the model assumes constant rates or probabilities and independence, so violations of these assumptions create extra variability, or overdispersion.
How does overdispersion show?
- Standard errors too small
- P-values biased towards significance
- Confidence, prediction intervals too narrow
Underdispersion + 3 causes
A situation in which the observed variance in the data is less than what the model expects, which can also indicate model inadequacy. -> the model overestimates the variance in the outcome.
Causes:
1. Excessive Correlation: When observations are not independent, the variability might be reduced.
2. Model Misspecification: The model might be too simple, or important explanatory variables may be missing.
3. Overfitting: The model might fit the data too tightly, leaving little room for residual variability.
For a Poisson distribution, the variance is expected to equal the mean; underdispersion happens when the variance is less than the mean.
In a Binomial model, underdispersion occurs when the observed variability is less than the expected binomial variance (see the formulas below).
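The expected variances these statements compare against (standard formulas):

```latex
\text{Poisson: } \operatorname{Var}(Y) = \lambda \;(= \text{mean}),
\qquad
\text{Binomial: } \operatorname{Var}(Y) = n\,p\,(1-p)
```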
Model fit calculation for over-/underdispersion
Residual deviance / residual degrees of freedom (df)
This ratio is used to check model fit:
- ratio close to 1: the model fits the data reasonably well
- ratio much larger than 1: overdispersion (more variability than expected by the model)
- ratio much smaller than 1: underdispersion (less variability than expected)
A quick way to compute it is sketched below.
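A minimal way to compute this ratio for a fitted statsmodels GLM (a sketch; `fit` can be any fitted Poisson or Binomial GLM, such as the ones sketched above):

```python
def dispersion_ratio(fit):
    """Residual deviance divided by residual degrees of freedom.
    ~1: adequate fit, >>1: overdispersion, <<1: underdispersion."""
    return fit.deviance / fit.df_resid

# e.g. with the Poisson fit from the earlier sketch:
# print(dispersion_ratio(fit))
```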