Models for Count Data I Flashcards

Question 1

Q

What are count variables?

Answer

A

Count variables are discrete, take non-negative integer values (0, 1, 2, …), and represent the number of occurrences of an event

Question 2

Q

Give three examples of count variables:

Answer

A

Number of hospital visits, number of deaths by horse kick, and number of appointments with a counsellor

Question 3

Q

What must be considered when measuring count variables over different time periods or populations?

Answer

A

Counts should be adjusted using a rate (e.g., number of crimes per 100,000 people)

Question 4

Q

What is the formula for the Poisson probability mass function?

Answer

A

P(Y = y) = (μ^y × e^(−μ)) / y!
- μ is the expected (mean) count (mean number of times that event occurs)
- Let Y be a random (count) variable that indicates the number of times a certain event occurs
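
A minimal Python sketch of this formula, for illustration only. The function name poisson_pmf is an assumption, not something from the course materials.

```python
import math

def poisson_pmf(y: int, mu: float) -> float:
    """P(Y = y) for Y ~ Poisson(mu): mu^y * exp(-mu) / y!"""
    return mu ** y * math.exp(-mu) / math.factorial(y)

# Probabilities of observing counts 0..4 when the expected count is mu = 1
probs = [poisson_pmf(y, 1.0) for y in range(5)]
print([round(p, 3) for p in probs])  # [0.368, 0.368, 0.184, 0.061, 0.015]
```

With μ = 1, counts of 0 and 1 are equally likely and higher counts become rare quickly.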

Question 5

Q

What is a key property of the Poisson distribution?

Answer

A

The mean and variance are equal (equidispersion)

Question 6

Q

What happens when the mean μ of a Poisson distribution is large?

Answer

A

It approximates a normal distribution
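
A rough numerical check of this, sketched in Python. It assumes the comparison is against a normal distribution with mean μ and variance μ; the helper names are illustrative.

```python
import math

def poisson_pmf(y, mu):
    # Computed on the log scale to avoid overflow when mu is large
    return math.exp(y * math.log(mu) - mu - math.lgamma(y + 1))

def normal_pdf(y, mean, variance):
    return math.exp(-(y - mean) ** 2 / (2 * variance)) / math.sqrt(2 * math.pi * variance)

def largest_gap(mu):
    """Largest absolute difference between Poisson(mu) and N(mu, mu) over y = 0..3*mu."""
    return max(abs(poisson_pmf(y, mu) - normal_pdf(y, mu, mu)) for y in range(3 * mu + 1))

# The gap shrinks as the mean grows
for mu in (1, 10, 100):
    print(mu, round(largest_gap(mu), 4))
```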

Question 7

Q

What is the general form of a Poisson regression model?

Answer

A

log(μi) = β0 + β1X1i + … + βpXpi
- Where μi is the expected count

Question 8

Q

Why do we use a log transformation in a Poisson regression?

Answer

A

To ensure predicted counts are always positive

Question 9

Q

What assumptions must be met for Poisson regression?

Answer

A

The outcome is a count variable (non-negative integers)
The variance equals the mean (no overdispersion). This implies heteroscedasticity (different to what’s seen in the normal distribution): the predicted variance depends on the predicted mean.
Observations are independent (e.g., no clustering)
The transformed outcome (log(μ)) is linearly related to continuous predictors
No multicollinearity
Each subject’s count is measured over the same unit of time or space, or the same population size

Question 10

Q

What is overdispersion in Poisson regression?

Answer

A

When the variance is larger than the mean, suggesting a need for a different model
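
A quick descriptive check, sketched in Python with made-up counts (the data are hypothetical). A sample variance far above the sample mean hints at overdispersion, although a formal check should be based on the fitted model.

```python
import statistics

# Hypothetical counts with many zeros and a few large values
counts = [0, 0, 0, 0, 1, 1, 2, 3, 8, 15]

mean = statistics.mean(counts)
variance = statistics.variance(counts)  # sample variance (n - 1 denominator)
dispersion_ratio = variance / mean      # close to 1 under equidispersion

print(mean, round(variance, 2), round(dispersion_ratio, 2))
```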

Question 11

Q

What can cause overdispersion?

Answer

A

Excess zeroes (zero-inflation)
An important predictor is missing
A highly skewed count variable

Question 12

Q

What models can be used to handle overdispersion?

Answer

A

Negative binomial regression and zero-inflated Poisson models

Question 13

Q

How do we interpret a coefficient β in Poisson regression?

Answer

A

The exponentiated coefficient e^β represents the incidence rate ratio (IRR)

Question 14

Q

What does an IRR indicate?

Answer

A

IRR = 1: No effect of predictor
IRR > 1: Predictor increases the outcome rate
IRR < 1: Predictor decreases the outcome rate

Question 15

Q

How do we test for the significance of variables in Poisson regression?

Answer

A

Using a likelihood ratio (LR) test

Question 16

Q

What is an offset in Poisson regression?

Answer

A

A term added to account for different observation periods, population sizes, or area sizes
This may involve devising a rate

Question 17

Q

How do we include an offset in Stata?

Answer

A

poisson <outcome> <predictor(s)>, exposure(offset_variable)

Question 18

Q

What is an example of using an offset?

Answer

A

Analysing the number of crimes per 1,000 residents rather than total crimes

Question 19

Q

What command is used for a basic Poisson regression in Stata?

Answer

A

poisson <outcome> <predictor(s)>

Question 20

Q

How do we check if a Poisson model fits the data well?

Answer

A

Compare observed vs. predicted counts using the prcounts command

Question 21

Q

How do we test for overdispersion?

Answer

A

Compare a Poisson model with a negative binomial model

Question 22

Q

Among the numeric variables, what two types can be established?

Answer

A

Continuous: e.g., age, height, blood pressure, etc. They can take the form of fractions
Discrete: e.g., number of siblings, number of hospital visits, etc. i.e., things you can actually count
Some variables are strictly speaking discrete, but in practice can be treated as continuous such as household income

Question 23

Q

What chart do we use to display count variables?

Answer

A

Bar (not histogram)
- Each bar represents one number
- Spaces between bars because of discrete values, not continuous

Question 24

Q

What statistical distribution can we use to model count variables?

Answer

A

Poisson distribution

Question 25

Q

What does the Poisson distribution specify?

Answer

A

The relationship between the expected count μ and the probability of observing any observed count y

Question 26

Q

What is the notation to denote that Y follows a Poisson distribution with mean μ?

Answer

A

Y ~ Poisson(μ)

Question 27

Q

Describe the Poisson distribution with μ = 1

Answer

A

Most of the counts will be zero or 1, and higher counts than 1 are rarer. Distribution is highly skewed

Question 28

Q

Describe the Poisson distribution with μ = 4

Answer

A

Mostly seeing 3s and 4s

Question 29

Q

Describe the Poisson distribution with μ = 10

Answer

A

Distribution appears normal and symmetric

Question 30

Q

What does the shape of the distribution depend on?

Answer

A

The μ parameter

Question 31

Q

What happens to the variance as μ increases?

Answer

A

Because the variance equals μ, the variance increases as the mean increases

Question 32

Q

Properties of the Poisson distribution:

Answer

A

One parameter, μ, which is equal to the mean and variance (equidispersion)
Positive skew - although the shape of the distribution depends on the mean
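
Equidispersion can be checked numerically. This Python sketch sums over counts 0 to 100, which is assumed to be enough to capture essentially all of the probability for the means shown.

```python
import math

def poisson_pmf(y, mu):
    # Log-scale form of mu^y * exp(-mu) / y!
    return math.exp(y * math.log(mu) - mu - math.lgamma(y + 1))

def mean_and_variance(mu, max_count=100):
    """Mean and variance of Poisson(mu), computed directly from the probabilities."""
    ys = range(max_count + 1)
    mean = sum(y * poisson_pmf(y, mu) for y in ys)
    variance = sum((y - mean) ** 2 * poisson_pmf(y, mu) for y in ys)
    return mean, variance

for mu in (1, 4, 10):
    print(mu, mean_and_variance(mu))  # both values are approximately mu
```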

Question 33

Q

What is the probability density function of the normal distribution?

Answer

A

f(y) = 1 / √(2πσ²) × exp(−(y − μ)² / (2σ²))

Question 34

Q

What are the properties of the normal distribution?

Answer

A

Two parameters: the mean μ and variance σ², which are independent of one another

Question 35

Q

What is the notation that denotes “Y is normally distributed with mean μ and variance σ²”?

Answer

A

Y ~ N(μ, σ²)

Question 36

Q

How can we initially check whether the Poisson distribution might be an adequate model for our outcome?

Answer

A

Graphical comparison of observed proportions and Poisson predicted probabilities, using the observed mean of Y as an estimate of μ
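
The numbers behind such a plot could be produced as follows. This is a Python sketch with hypothetical data, and the variable names are illustrative.

```python
import math
from collections import Counter

def poisson_pmf(y, mu):
    return mu ** y * math.exp(-mu) / math.factorial(y)

# Hypothetical observed counts
counts = [0, 1, 1, 2, 0, 3, 1, 0, 2, 1]

mu_hat = sum(counts) / len(counts)  # observed mean used as the estimate of mu
observed = Counter(counts)

# (count, observed proportion, Poisson predicted probability)
comparison = [
    (y, observed[y] / len(counts), poisson_pmf(y, mu_hat))
    for y in range(max(counts) + 1)
]
for y, obs_prop, pred_prob in comparison:
    print(y, obs_prop, round(pred_prob, 3))
```

Plotting the two columns side by side as bars gives the graphical comparison.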

Question 37

Q

What kind of numbers do the count and mean have to be?

Answer

A

Count has to be a whole number, but the mean doesn’t have to be

Question 38

Q

Why may our observation not follow the Poisson distribution?

Answer

A

Overdispersion: The variance is larger than the mean. This is a frequent occurrence in practice e.g., length of hospital stay (in days) typically has a long tail to the right
Excess zeroes: Observe more zero counts than expected by the Poisson for a given mean e.g., number of accidents in a workplace (more days have no accidents)
No zeroes (zero-truncation): There are no zeros in the data by design, or because we had no chance to observe them e.g., number of appointments of psychotherapy clients with their therapist (you only become a client after attending the first appointment, so we don’t observe clients with zero appointments)

Question 39

Q

How may overdispersion be distributed?

Answer

A

Fatter tails on both sides

Question 40

Q

How was SARS-CoV-2 an example of overdispersion?

Answer

A

R number: if the number of people each infected person goes on to infect were modelled with a Poisson distribution, the data would be overdispersed. Most people stayed at home to reduce infection and infected few or no others, while a small number of people infected many others.
Variance would have been much higher than the mean

Question 41

Q

What is the equation for the Poisson regression model?

Answer

A

log(μi) = β0 + β1x1i + β2x2i + … + βpxpi
log(μi) - can model log of the mean for the ith case
Same right-hand side (linear predictor) as the other regression models seen so far

Question 42

Q

How can we rewrite the Poisson regression equation for the mean rather than the log of the mean?

Answer

A

Exponentiate both sides:
μi = exp(β0 + β1x1i + β2x2i + … + βpxpi)
The log-transformation ensures that our predicted counts from Poisson distribution, μi, cannot be negative (to be a count variable, there cannot be any negative values)
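
A small Python sketch of this back-transformation. The coefficient values are hypothetical; the point is only that exp() of any linear predictor is positive.

```python
import math

def predicted_mean(beta0, betas, xs):
    """Predicted count: exp of the linear predictor beta0 + sum(beta_k * x_k)."""
    linear_predictor = beta0 + sum(b * x for b, x in zip(betas, xs))
    return math.exp(linear_predictor)

# Even a strongly negative linear predictor gives a positive predicted count
print(predicted_mean(-5.0, [0.1, -0.3], [2.0, 4.0]))
```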

Question 43

Q

What should you do if you have overdispersion and/or excess zeroes or no zeroes?

Answer

A

If assumptions of Poisson regression aren’t met, it’s generally advisable to use another model.
Sometimes, residual overdispersion or excess zeroes may result from failure to include an important predictor

Question 44

Q

Why is using linear regression not advisable for count outcomes?

Answer

A

Can result in negative predicted counts i.e., nonsensical predictions such as -3 appointments with a therapist
Count data often violate assumption of homoscedasticity (Poisson regression assumes heteroscedasticity)
The log transformation used in Poisson regression often gives better predictions

Question 45

Q

When may linear regression be appropriate for count outcomes?

Answer

A

When the mean of the count outcome is large (Poisson distributions with a large mean look similar to a normal distribution with the same mean)

Question 46

Q

What is homoscedasticity in linear regression?

Answer

A

The variance of the residuals is the same at each value of x, no matter what the predicted mean is. E.g., the scatter of points at x = 30 is the same as at x = 50

Question 47

Q

What is heteroscedasticity in Poisson regression?

Answer

A

The variance of the outcome increases with the predicted mean. E.g., at x = 50, there is a much wider distribution and more scatter between points compared to x = 30

Question 48

Q

As with other models, what should you do to continuous variables before analysis?

Answer

A

Centre them

Question 49

Q

On what assumption are hypothesis tests and confidence intervals for coefficients based?

Answer

A

The coefficients are normally distributed - called the “normal approximation”
This is realistic for large samples, and when the Poisson model assumptions are met

Question 50

Q

What is routinely displayed when conducting Poisson regression?

Answer

A

A z-test of H0: β = 0
Where:
- z = β^ / SE^, where β^ is the estimated coefficient and SE^ is its estimated standard error
- The p-value provides information about the strength of the evidence against H0

Question 51

Q

How is the 95% CI calculated?

Answer

A

β^ ± 1.96 × SE^

Question 52

Q

How can we obtain a more interpretable indicator of the effect of an IV?

Answer

A

By exponentiating the coefficients (back-transforming from the log-scale to the scale of the count variable). The same can be done for 95% CIs for the raw coefficients
The exponentiated coefficients are called incidence rate ratios (IRRs) or rate ratios (RRs)
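
A worked example in Python, using a hypothetical estimate (β^ = 0.1, SE^ = 0.04). It computes the z-test and 95% CI under the normal approximation and then exponentiates to get the IRR and its CI.

```python
import math

beta_hat = 0.1   # hypothetical estimated coefficient (log scale)
se_hat = 0.04    # hypothetical estimated standard error

z = beta_hat / se_hat
p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value from the standard normal

ci_low = beta_hat - 1.96 * se_hat
ci_high = beta_hat + 1.96 * se_hat

# Back-transform from the log scale to the rate-ratio scale
irr = math.exp(beta_hat)
irr_ci = (math.exp(ci_low), math.exp(ci_high))

print(round(z, 2), round(p_value, 4))
print(round(irr, 3), tuple(round(v, 3) for v in irr_ci))
```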

Question 53

Q

For continuous predictors, the size of the IRR depends on what?

Answer

A

The scale on which I measure the predictor. The scale can make the effect look large or small, or more or less meaningful
E.g., an additional 1,000 people in the population may not make a big difference to the outcome, but a difference of 10,000 people might

Question 54

Q

If the population is in 1000s, and I want to get an IRR (IRR = 1.007) associated with a 10,000 difference in population, what could I do?

Answer

A

Either:
- Take the IRR for population in 1000s and take the 10th power: IRR_10k = IRR_1k^10 = 1.007^10 = 1.072. This means a 10,000 population difference is associated with about a 7.2% higher number of the outcome.
Or:
- To get the IRR_10k via software, I could recode my population variable so that it measures population in tens of thousands, such that: pop10k = population1000/10, and use pop10k as my predictor
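
The two routes give the same answer, as this short Python check shows. It uses the IRR of 1.007 from the question; the variable names are illustrative.

```python
import math

irr_1k = 1.007                  # IRR per 1,000 people (from the question)

# Route 1: raise the IRR to the 10th power
irr_10k = irr_1k ** 10

# Route 2: rescaling the predictor by 1/10 multiplies the coefficient by 10
beta_1k = math.log(irr_1k)
beta_10k = 10 * beta_1k
irr_10k_rescaled = math.exp(beta_10k)

print(round(irr_10k, 3), round(irr_10k_rescaled, 3))  # 1.072 1.072
```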

Question 55

Q

Does scaling change the predictions?

Answer

A

No, just that the IRR is on a different scale

Question 56

Q

What decides the extent to which a variable should be scaled?

Answer

A

The context (no particular formula)

Question 57

Q

What is it helpful to do at the start of the analysis?

Answer

A

Code variables at the start to have them scaled before analysis

Question 58

Q

How are coefficient estimates for Poisson regression found?

Answer

A

Via maximum likelihood estimation
Every model has a likelihood, and in general nested models are compared using LRTs

Question 59

Q

In what ways are methods for model comparison (e.g., LRTs) useful?

Answer

A

Single test of a hypothesis about multiple predictors
Single test of several dummy variables
Tests of interaction effects
When models are nested, LRTs can be used (although in case of testing multiple variables, the test cannot tell which of those variables are redundant in predicting the outcome)

Question 60

Q

When doing an LRT, how should you formulate your H0?

Answer

A

In respect to the study’s specific context

Question 61

Q

What are the degrees of freedom in an LRT equal to?

Answer

A

The number of additional parameters in the larger model
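
A sketch of the LRT calculation in Python. The log-likelihoods are hypothetical, and chi2_sf is an illustrative helper that only covers 1 or 2 degrees of freedom, where the chi-square tail probability has a simple closed form.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square distribution (df = 1 or 2 only)."""
    if df == 1:
        return math.erfc(math.sqrt(x / 2))
    if df == 2:
        return math.exp(-x / 2)
    raise ValueError("this sketch only handles df = 1 or df = 2")

# Hypothetical log-likelihoods of two nested Poisson models
ll_smaller = -520.4   # model without the two extra predictors
ll_larger = -515.1    # model with the two extra predictors
df = 2                # number of additional parameters in the larger model

lrt_stat = 2 * (ll_larger - ll_smaller)
p_value = chi2_sf(lrt_stat, df)
print(round(lrt_stat, 2), round(p_value, 4))
```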

Question 62

Q

In the output, what would a coefficient of 0.1 mean?

Answer

A

Since interpretation is in rates (not absolute changes like in other models), a coefficient of 0.1 means a 10.5% increase in the rate (e^0.1 = 1.105)

Question 63

Q

Logistic regression transforms probabilities, what does Poisson regression transform?

Question 64

Q

Under H0, the LRT statistic has what kind of distribution?

Answer

A

Chi-square

Answer 64

A

Yes - by setting these two coefficients to zero, we get the smaller model

Answer 65

A

The comparison may be biased - we may adjust the analysis by a suitable indicator of cycling frequency e.g., total number of miles cycled in each city per year

Answer 66

A

log(mean no. of eye tests / (population size / 1000)) = β0 + β1X1 + β2X2
The outcome is a rate: number of eye tests per 1,000 people

Answer 67

A

Rates are not necessarily integers, so are not modelled well by a Poisson distribution
The solution is to use algebra and put the population size on the right-hand side of the equation:
log(mean eye tests) = β0 + β1X1 + β2X2 + log(pop.size/1000) - this is the offset
This makes use of the result that log (a/b) = log(a) - log(b)
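
A numerical check of this algebra in Python. The coefficients and population size are hypothetical; the rate form and the offset form describe the same model.

```python
import math

beta0, beta1 = 1.2, 0.3   # hypothetical coefficients
x1 = 1.0                  # hypothetical predictor value
pop_size = 25_000         # hypothetical population size

linear_predictor = beta0 + beta1 * x1

# Rate form: log(mean count / (pop_size / 1000)) = linear predictor
rate_per_1000 = math.exp(linear_predictor)

# Offset form: log(mean count) = linear predictor + log(pop_size / 1000)
offset = math.log(pop_size / 1000)
mean_count = math.exp(linear_predictor + offset)

# Dividing the predicted count by the population (in 1000s) recovers the rate
print(round(rate_per_1000, 3), round(mean_count / (pop_size / 1000), 3))
```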

Answer 68

A

Like an additional variable in the equation, but the coefficient is set to 1 and not estimated (way of redefining the outcome to ensure we can use Poisson regression)
Offsets can be readily incorporated into count regression models using standard software

Answer 69

A

We could use log(pop.size) instead, and the slope coefficients, IRRs & SEs would be the same. However, dividing by 1,000 gives:
- A more sensible intercept (log predicted number of eye tests in areas with IMD I per 1,000 people, rather than per person)
- More sensible predicted counts (predicted number of eye tests per 1,000 people, rather than per person)

Answer 70

A

No, but should state somewhere that it is being used

Answer 71

A

Log transformation (log link): Poisson regression relates the logarithm of the mean count to a set of predictors. This implies a curvilinear relationship between numeric predictors and the count outcome

Answer 72

A

Log-scale, but can be exponentiated to give rate ratios

Answer 73

A

Maximum likelihood, and LRTs can be used to compare nested models