Logistic Regression I Flashcards
What is the purpose of logistic regression?
It models the relationship between a binary outcome and one or more independent variables
How is logistic regression different from linear regression?
Unlike linear regression, logistic regression is used for binary outcomes and applies a logit transformation to ensure predicted probabilities stay between 0 and 1
What types of variables can be used as predictors in logistic regression?
Continuous and categorical
What is the formula for the logistic regression model?
The log odds of the response probability is expressed as a linear function of the predictor:
ln(π / (1 - π)) = β0 + β1X
Where:
- ln(π / (1 - π)) = log odds or logit
- π / (1 - π) = odds
- β0 = intercept/constant
- β1 = coefficient for predictor X (slope)
How do you convert logistic regression coefficients to an OR?
Take the exponent of the coefficient:
Odds ratio = e^β1
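As a quick numeric sketch (the coefficient below is made up for illustration, not taken from any model in this deck), exponentiating a fitted log-odds coefficient gives the OR:

```python
import math

# Hypothetical fitted log-odds coefficient for predictor X
beta1 = 0.405

# Exponentiating the coefficient gives the odds ratio
odds_ratio = math.exp(beta1)  # close to 1.5
```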
What is an OR?
The ratio of the odds of an event occurring in one group compared to another
Odds ratio = odds (group 1) / odds (group 2)
How do you interpret OR?
OR = 1: No effect (same odds in both groups)
OR > 1: Higher odds of the event occurring
OR < 1: Lower odds of the event occurring
What are probabilities and odds?
- Probability (risk): π = occurrences/opportunities (range = 0 - 1)
- Odds: π / (1 - π) (probability of the event occurring / probability of the event not occurring; range 0 to plus infinity)
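The two definitions above can be sketched as a pair of helper functions (illustrative, not part of the deck):

```python
def odds_from_prob(pi):
    # odds = pi / (1 - pi)
    return pi / (1 - pi)

def prob_from_odds(odds):
    # invert the mapping: pi = odds / (1 + odds)
    return odds / (1 + odds)
```

For example, a probability of 0.5 gives odds of 1, and odds of 1.5 correspond to a probability of 0.6.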
What does an OR of 1.5 mean?
The odds of the event in the exposed group are 50% higher than in the reference group
What does an OR of 0.75 mean?
The odds of the event in the exposed group are 25% lower than in the reference group
What is the likelihood ratio test (LRT) used for?
To compare two nested models and determine if adding a predictor improves the model
What are the hypotheses for the LRT?
H0: Adding the variable makes no difference
H1: Adding the variable improves the model
How do you calculate the LRT statistic?
LRT = 2 [ ln (L1) − ln (L0) ]
Where L1 is the likelihood of the full model and L0 is the likelihood of the nested model
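A numeric sketch of the calculation, using made-up log-likelihoods; when the full model adds one extra parameter, the statistic is compared against a chi-square distribution with 1 df, whose upper-tail p-value can be obtained from the normal distribution via the complementary error function:

```python
import math

# Made-up log-likelihoods; the full model (L1) always fits at least as well
ln_L0 = -120.4  # nested (smaller) model
ln_L1 = -117.1  # full model with the extra predictor

lrt = 2 * (ln_L1 - ln_L0)  # LRT statistic

# Under H0, LRT ~ chi-square with 1 df (one extra parameter);
# upper-tail p-value for chi-square(1) via erfc
p_value = math.erfc(math.sqrt(lrt / 2))  # here, roughly 0.01
```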
What are the key assumptions of logistic regression?
- Observations are independent
- The outcome variable is binary
- No multicollinearity (high correlation between predictors)
- The relationship between continuous predictors and log-odds is linear after adjusting for any other covariates
- No significant interactions unless explicitly modelled. The effect of an exposure is the same regardless of the value of any other independent variable, and vice versa.
- No unobserved confounding
Why do we use multivariable logistic regression?
To adjust for confounders like age, sex, and BMI
How do you fit a logistic regression model in Stata?
logistic <outcome> <predictor1> <predictor2> ... <predictork>
Note: Adding 'i.' before a categorical variable tells Stata to interpret it as categorical
How do you compare two models using LRT in Stata?
Store each model to memory:
est store a
est store b
Then compute the LRT:
lrtest a b
What is the logit link function?
A BLR model uses a logit (or logistic) transformation of probability π
π = exp(β0 + β1x) / (1 + exp(β0 + β1x))
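A sketch of this inverse-logit mapping, using made-up coefficients (β0 = -1, β1 = 0.5, chosen so the linear predictor is 0 at x = 2):

```python
import math

def inv_logit(eta):
    # pi = exp(eta) / (1 + exp(eta)), always strictly between 0 and 1
    return math.exp(eta) / (1 + math.exp(eta))

# Made-up coefficients: at x = 2 the linear predictor is 0, so pi = 0.5
beta0, beta1 = -1.0, 0.5
pi = inv_logit(beta0 + beta1 * 2.0)

# Taking the logit of pi recovers the linear predictor
eta_back = math.log(pi / (1 - pi))
```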
What are the properties of logit transformation of π?
- Produces values of π between 0 and 1
- Symmetric about π = 0.5
- The curve is almost a straight line for 0.2 < π < 0.8
What does it mean if π = 0.5?
The event is equally likely to occur or not occur, using the equation the odds would be:
odds = 0.5 / (1 - 0.5) = 0.5 / 0.5 = 1
What does it mean if π = 0.4?
The event is less likely to occur than not
Odds = 0.4 / (1 - 0.4) = 0.4 / 0.6 ≈ 0.67
What is the log odds a result of?
The relationship between Y and X is specified as linear after making a logit transformation. This results in the log odds, which can be positive or negative
What does it mean if π > 0.5? E.g., if π = 0.6
The event is more likely to occur than not
Odds = 0.6 / (1 - 0.6) = 0.6 / 0.4 = 1.5
How do you fit a logit model in Stata?
logit <outcome> <predictors>
What’s the difference between the logit and logistic commands?
The logistic command reports ORs
Exponentiating the coefficients from logit gives the ORs reported by logistic
How would you interpret an OR of 0.89?
The odds of the event for [Group 1] are 0.89 times the odds for [Group 2]; equivalently, the odds in [Group 1] are lower than in [Group 2] by 11% (= 100% × (1 - 0.89))
What does significance testing involve in logistic regression?
H0: β1 = 0
H1: β1 ≠ 0
Tests the null hypothesis that the true parameter value is 0, and hence that the associated OR is 1
Z-ratio statistic is calculated and compared with a normal distribution (only used with the logit function)
Note: With one independent variable, the Wald test is equivalent to the z-ratio statistic
How is the z-ratio statistic calculated?
Using the output of the logit function in Stata
z = β^ / SE (β^)
How is the Wald statistic calculated?
W = (β^)² / var(β^)
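A numeric sketch of both statistics, using a made-up estimate and standard error:

```python
# Made-up coefficient estimate and its standard error
beta_hat = -0.35
se = 0.12

z = beta_hat / se   # z-ratio, compared against a standard normal
wald = z ** 2       # Wald statistic; with one predictor, W = z^2
```

Here |z| is about 2.9, well beyond the usual 1.96 cut-off for a 5% two-sided test.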
What does a 95% CI of 0.75 and 0.97 tell us about where the OR lies?
The OR lies between 0.75 and 0.97
If we repeated our study 100 times, the computed 95% CIs would contain the true OR in about 95 of the 100 repetitions
Overall, we are 95% confident that the true OR likely lies between 0.75 and 0.97
What does it mean if 95% CIs do not include 1?
If the 95% CIs do not include the null hypothesis of no effect (OR = 1), there is evidence that the difference between the exposed and the reference group is significant
Which command should be used to obtain ORs and to make explicit the category of the slope parameter?
logistic <outcome> i.<exposure>
What is the utility of using ‘i.’ before a categorical variable in regression?
The 'i.' prefix prompts Stata to create a set of binary indicator variables, one per category
They all use the same reference category which by default is the first category
If we have age with multiple categories as a predictor, how can we estimate the overall effect of age on the outcome?
Use the testparm command (a Wald test) to test joint hypotheses
Taking the age variables as a whole, it tests whether the coefficients of all age categories are simultaneously zero
What is the issue with Wald test with small sample sizes?
The Wald test evaluates whether it is likely that an estimated coefficient could be zero (no effect). However, the Wald test is not very reliable with small sample sizes
What is the likelihood ratio test (LRT)?
Test of the goodness-of-fit between two nested models
What is a nested model?
Two models are nested when the smaller model contains a subset of the covariates of the larger; the larger model includes the same covariates but specifies at least one additional parameter to be estimated
When are models not deemed to be nested?
If the models are fitted on different numbers of observations
LRT hypotheses:
H0: The two nested models are equivalent - adding the new variable makes no difference to the model
H1: The two nested models are different, and a variable contributes to the model
LRT statistic
LRT = 2[ln(L1) - ln(L0)]
L1: likelihood of the model that includes the variable you want to test (model 1)
L0: Likelihood of the model that does not include the variable (the “nested” model; model 0)
What does a p-value < 0.05 indicate in an LRT?
There is evidence for an effect of the predictor on the outcome after accounting for other variables
In the context of an LRT, if the p-value < 0.05, what should you do?
Opt for the larger model: there is evidence that the added variable improves the fit. (If the p-value is ≥ 0.05, opt for the simpler model)
What kind of test is the LRT?
chi-square test
For continuous exposures, what does the exponentiated estimated coefficient (in a logit model) represent?
The multiplicative change in the odds of the outcome for a one-unit increase in the exposure: each additional unit multiplies the odds by the OR
What’s the procedure to comparing two models using an LRT?
- Obtain L1 by fitting the larger model: quietly: logistic <outcome> <predictor1> <predictor2>
- Save this model: est store a
- Obtain L0 by fitting the smaller model: quietly: logistic <outcome> <predictor1>
- Save this model: est store b
- Compare L1 and L0: lrtest a b
What happens if adding variables changes the sample size between two models in the context of an LRT?
If L1 has a different sample size from L0 (e.g., because the added variable has missing values), the models are no longer nested; L0 must be refitted on the same observations as L1 before the LRT is valid. Note that the missingness may not be random, depending on the mechanism by which the missing data occur