5/6: GLM & Overdispersion Flashcards
Generalized Linear Model (GLM) & 3 components
A model for a response variable and any number of explanatory variables (continuous and/or categorical). The data-generating process can follow any distribution in the exponential dispersion family.
1. Error structure: the distribution assumed for the response (e.g., Normal, Binomial, Poisson)
2. Linear equation: the linear combination of the predictor (independent) variables
3. Link function: connects the linear equation to the mean of the response, e.g., the logit link for the Binomial distribution and the log link for the Poisson distribution (see the formula below)
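In symbols, the three components combine as follows (standard GLM notation, added here for reference; g is the link function):

```latex
g(\mathbb{E}[Y_i]) \;=\; \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip},
\qquad Y_i \sim \text{Normal / Binomial / Poisson (exponential dispersion family)}
```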
Poisson GLM
- models counts (discrete data, like the number of apples); the expected rate (λ) can still be a fraction or decimal
- counts cannot be negative (no −3 apples) -> transform with the canonical link function, the log link (log of the expected count)
- with this canonical link, the GLM predicts an exponential relationship between the outcome and the explanatory variable
- only a limit at the bottom (no negative counts), no upper limit (see the sketch below)
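A minimal sketch of fitting a Poisson GLM with the log link in Python's statsmodels (the apple-count data and variable names here are made up for illustration):

```python
import numpy as np
import statsmodels.api as sm

# made-up example data: hours of sunlight vs. number of apples counted
hours = np.array([2, 4, 5, 6, 8, 10, 12], dtype=float)
apples = np.array([1, 3, 4, 6, 9, 14, 20])

X = sm.add_constant(hours)  # adds the intercept column
# the log link is the canonical (default) link for the Poisson family
fit = sm.GLM(apples, X, family=sm.families.Poisson()).fit()

# the model is linear on the log scale; back-transforming with exp()
# gives the expected count lambda, which can never be negative
eta = fit.params[0] + fit.params[1] * hours   # linear predictor (log scale)
lam = np.exp(eta)                             # expected counts, all > 0
print(fit.summary())
```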
95% prediction/confidence interval difference
- prediction interval: expected range of future observations
- confidence interval: range of plausible values for a model parameter or fitted mean (e.g., intercept/slope) -> says something about λ itself (see the sketch below)
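A rough illustration of the difference, using a hypothetical Poisson fit like the one sketched above (the prediction interval here is a deliberate simplification):

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import poisson

# same made-up apple-count data as in the Poisson sketch
hours = np.array([2, 4, 5, 6, 8, 10, 12], dtype=float)
apples = np.array([1, 3, 4, 6, 9, 14, 20])
fit = sm.GLM(apples, sm.add_constant(hours), family=sm.families.Poisson()).fit()

new_X = sm.add_constant(np.array([3.0, 7.0, 11.0]))  # hypothetical new x values
pred = fit.get_prediction(new_X)

# confidence interval: plausible range for the mean rate lambda at each new x
ci = pred.summary_frame(alpha=0.05)[["mean_ci_lower", "mean_ci_upper"]]

# crude prediction interval: plausible range for *future counts*, built here only
# from Poisson sampling noise around the estimated lambda (parameter uncertainty ignored)
lam_hat = pred.predicted_mean
pi_low, pi_high = poisson.ppf(0.025, lam_hat), poisson.ppf(0.975, lam_hat)
```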
Regression analysis
is a statistical method used to understand the relationship between two or more variables
Binomial GLM
- estimate how the probability of success (p) depends on the explanatory variable (x)
- binomial ratio: number of successes (correct answers) out of the number of trials (questions)
- transform with the canonical link function, the logit link (log odds of success), which converts probabilities into log odds; back-transforming gives an S-shaped curve that stays between 0 and 1 -> logistic regression
- the problem with fitting a straight line to such data is that below about 3 hours of studying it predicts a negative number of correct answers, and above about 18 hours it predicts more correct answers than there are questions on the exam -> need to restrict predictions to valid outcomes -> link function
- a limit at the bottom and a limit at the top (see the sketch below)
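A minimal sketch of the corresponding Binomial GLM (logistic regression) in statsmodels; the study-hours and exam data are made up for illustration:

```python
import numpy as np
import statsmodels.api as sm

# made-up example: hours studied vs. questions answered correctly out of 20
hours   = np.array([1, 3, 5, 8, 10, 14, 18], dtype=float)
correct = np.array([2, 5, 9, 13, 15, 18, 19])
total   = np.full_like(correct, 20)

# response given as (successes, failures); the logit link is the canonical default
y = np.column_stack([correct, total - correct])
fit = sm.GLM(y, sm.add_constant(hours), family=sm.families.Binomial()).fit()

# back-transform the linear predictor (log odds) to a probability:
# the inverse logit keeps every prediction between 0 and 1 (the S-shaped curve)
eta = fit.params[0] + fit.params[1] * hours
p = 1 / (1 + np.exp(-eta))
print(np.round(p, 2))
```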
Ordinary Linear Regression (OLR)
OLR shares the same assumptions as simple/multiple linear regression, including normally distributed residuals.
An ordinary linear model is a GLM with a Normal error structure and an identity link, so a Normal GLM gives the same results as an ordinary linear model.
Why are model diagnostics necessary?
because the validity of the output depends on the model's assumptions being met
Back transformation
is the process of applying the inverse of the link function to obtain predictions on the scale of the original response variable.
The log link is back-transformed with the exponential function, and the logit link with the inverse logit (which also uses the exponential), returning predictions to the original count and probability scales (see the formulas below).
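In formulas (standard inverse-link expressions, added for reference; η is the linear predictor):

```latex
\text{log link: } \log(\lambda) = \eta \;\Rightarrow\; \lambda = e^{\eta}
\qquad
\text{logit link: } \log\!\left(\frac{p}{1-p}\right) = \eta \;\Rightarrow\; p = \frac{e^{\eta}}{1 + e^{\eta}}
```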
Overdispersion + 4 causes
occurs when the observed variance is larger than the variance predicted by the statistical model; in other words, the spread of the data is larger than the model expects, which can signal that the model isn't fully explaining what is going on -> the model underestimates the variance in the outcome.
Causes are:
1. zero-inflation
2. external influences
3. outliers
4. dependency (non-independent data)
DHARMa is used as a diagnostic tool for multi-level/mixed models; it gives a better check than the default plot() diagnostics for GLMs.
Overdispersion: Fat tails at the bottom (more low counts) and wider spread overall.
Underdispersion: Thinner tails at the top (fewer high counts) and a more compact distribution.
S-Shape: Indicates overdispersion with wide tails in both directions, reflecting greater variability.
Overdispersion only happens in… and reasons why
Binomial and Poisson, because in these models the variance is fixed by the mean (rate or probability) rather than estimated separately.
Lack of independence: If observations are not truly independent, meaning events or trials influence each other, like wind when flipping a coin, it adds extra variability not accounted for by the model.
Non-fixed rate/probability: If the rate (in Poisson) or probability of success (in Binomial) is not constant across observations, this also increases variability, leading to overdispersion.
In both cases, the model assumes constant rates or probabilities and independence, so violations of these assumptions create extra variability, or overdispersion.
How does overdispersion show?
- Standard errors too small
- P-values biased towards significance
- Confidence, prediction intervals too narrow
Underdispersion + 3 causes
A situation in which the observed variance in the data is less than what the model expects, which can also indicate model inadequacy. -> the model overestimates the variance in the outcome.
Causes:
1. Excessive Correlation: When observations are not independent, the variability might be reduced.
2. Model Misspecification: The model might be too simple, or important explanatory variables may be missing.
3. Overfitting: The model might fit the data too tightly, leaving little room for residual variability.
For a Poisson distribution, the variance is expected to equal the mean; underdispersion happens when the variance is less than the mean.
In a Binomial model, underdispersion occurs when the observed variability is less than the expected binomial variance (see the formulas below).
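The expected variances these statements compare against (standard formulas):

```latex
\text{Poisson: } \operatorname{Var}(Y) = \lambda \;(= \text{mean}),
\qquad
\text{Binomial: } \operatorname{Var}(Y) = n\,p\,(1-p)
```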
Model fit calculation for over-/underdispersion
Residual deviance / residual degrees of freedom (df)
This ratio is used to check model fit:
- ratio close to 1: the model fits the data reasonably well
- ratio much larger than 1: overdispersion (more variability than expected by the model)
- ratio much smaller than 1: underdispersion (less variability than expected)
A quick way to compute it is sketched below.
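A minimal way to compute this ratio for a fitted statsmodels GLM (a sketch; `fit` can be any fitted Poisson or Binomial GLM, such as the ones sketched above):

```python
def dispersion_ratio(fit):
    """Residual deviance divided by residual degrees of freedom.
    ~1: adequate fit, >>1: overdispersion, <<1: underdispersion."""
    return fit.deviance / fit.df_resid

# e.g. with the Poisson fit from the earlier sketch:
# print(dispersion_ratio(fit))
```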