Logistic Regression II Flashcards

Question 1

Q

What is logistic regression used for?

Answer

A

To model the relationship between m independent variables (exposures, continuous or categorical) and a binary outcome (y)

Question 2

Q

What is the logistic regression equation?

Answer

A

ln(π / (1 − π)) = β0 + β1x1 + … + βmxm
This equation models the log-odds of an event occurring. βm represents the effect of predictor m, holding all other predictors constant
ln(π / (1 − π)) = log-odds or logit
ln(π) = natural logarithm of the probability
π / (1 − π) = odds
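
A rough numeric illustration of the probability, the odds, and the logit, written as a small Python sketch. It is not part of the original cards, and the probability 0.2 is an arbitrary example value.

```python
import math

def odds(p):
    """Odds of an event with probability p (requires 0 < p < 1)."""
    return p / (1 - p)

def logit(p):
    """Log-odds (logit) of an event with probability p."""
    return math.log(odds(p))

p = 0.2                     # arbitrary example probability
example_odds = odds(p)      # 0.2 / 0.8
example_logit = logit(p)    # negative, because p is below 0.5
print(example_odds, example_logit)
```

A probability of 0.5 corresponds to odds of 1 and a logit of 0. Probabilities below 0.5 give negative logits and probabilities above 0.5 give positive ones.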

Question 3

Q

How do you obtain OR from a logistic regression coefficient?

Answer

A

OR = exp(β)
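
A minimal Python sketch of this conversion. The coefficient values are hypothetical and chosen only to show the direction of the effect.

```python
import math

def odds_ratio(beta):
    """Convert a logistic regression coefficient (log-odds scale) to an OR."""
    return math.exp(beta)

# Hypothetical coefficients, not taken from any fitted model
for beta in (-0.5, 0.0, 0.693):
    print(beta, round(odds_ratio(beta), 3))
```

A coefficient of 0 gives an OR of 1 (no association), a negative coefficient gives an OR below 1, and a positive coefficient gives an OR above 1.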

Question 4

Q

What are the key assumptions of a BLR?

Answer

A

Observations are independent
No multicollinearity among independent variables
The outcome variable is binary
No unobserved confounders
The log-odds of the dependent variable are linearly related to continuous predictors

Question 5

Q

How do you interpret a logistic regression coefficient?

Answer

A

A positive coefficient means the predictor increases the log-odds of the event occurring
A negative coefficient means the predictor decreases the log-odds

Question 6

Q

What does a 95% CI for an OR tell you?

Answer

A

If the CI includes 1, the predictor is not statistically significant
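
As a sketch of where such an interval comes from, the Python below builds a Wald-type 95% CI on the log-odds scale and exponentiates both limits. The coefficients and standard errors are hypothetical, and the multiplier 1.96 assumes a normal approximation.

```python
import math

def or_confidence_interval(beta, se, z=1.96):
    """Wald-type CI for an OR: exponentiate beta +/- z * SE."""
    return math.exp(beta - z * se), math.exp(beta + z * se)

def includes_one(lower, upper):
    """True when the interval contains the null value OR = 1."""
    return lower <= 1.0 <= upper

# Hypothetical estimates, for illustration only
weak = or_confidence_interval(beta=0.4, se=0.25)
strong = or_confidence_interval(beta=0.8, se=0.2)
print(weak, includes_one(*weak))
print(strong, includes_one(*strong))
```

An OR interval that includes 1 is the same finding as a coefficient interval that includes 0, because exp(0) = 1.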

Question 7

Q

What is the LRT used for?

Answer

A

To compare nested logistic regression models

Question 8

Q

What are the hypotheses for the LRT?

Answer

A

H0: The simpler model (without the extra parameter) is sufficient
H1: The more complex model (with the extra parameter) provides a better fit

Question 9

Q

How do you interpret the LRT p-value?

Answer

A

If p < 0.05, reject H0 (favour the more complex model)
If p > 0.05, do not reject H0 (favour the simpler model)
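
A hedged Python sketch of the arithmetic behind the test, assuming the two nested models differ by a single parameter (1 degree of freedom). The log-likelihood values are hypothetical. For 1 degree of freedom the chi-square upper tail can be written with the complementary error function, so only the standard library is needed.

```python
import math

def lrt_statistic(loglik_simple, loglik_complex):
    """LRT statistic: twice the gain in log-likelihood."""
    return 2 * (loglik_complex - loglik_simple)

def lrt_p_value_1df(statistic):
    """Upper-tail chi-square probability with 1 df: erfc(sqrt(x / 2))."""
    return math.erfc(math.sqrt(max(statistic, 0.0) / 2))

# Hypothetical log-likelihoods of the simpler (L0) and more complex (L1) model
ll0 = -520.4
ll1 = -517.1
stat = lrt_statistic(ll0, ll1)
p_value = lrt_p_value_1df(stat)
print(round(stat, 2), round(p_value, 4))
```

With more than one extra parameter the degrees of freedom increase and a general chi-square routine is needed instead of this 1 df shortcut.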

Question 10

Q

How do you compute predicted probabilities?

Answer

A

π = exp(β0 + β1x1 + … + βmxm) / (1 + exp(β0 + β1x1 + … + βmxm))
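
A minimal Python sketch of this calculation. The intercept, coefficients, and predictor values are hypothetical. The function uses 1 / (1 + exp(−η)), which is algebraically the same as exp(η) / (1 + exp(η)), where η is the linear predictor.

```python
import math

def predicted_probability(intercept, coefs, values):
    """Inverse logit of the linear predictor b0 + b1*x1 + ... + bm*xm."""
    linear_predictor = intercept + sum(b * x for b, x in zip(coefs, values))
    return 1 / (1 + math.exp(-linear_predictor))

# Hypothetical model: intercept, then coefficients for age and diabetes
b0 = -3.0
coefs = [0.05, 0.9]
person = [60, 1]          # age 60, diabetes = 1
p = predicted_probability(b0, coefs, person)
print(round(p, 3))
```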

Question 11

Q

What is an interaction effect?

Answer

A

An interaction occurs when the effect of one predictor on the outcome depends on the value of another predictor

Question 12

Q

How do you interpret predicted probabilities?

Answer

A

They indicate the probability of an event occurring for a given set of predictor values

Question 13

Q

How do you interpret an interaction term?

Answer

A

If the interaction term is significant, the relationship between a predictor and the outcome varies by the interacting variable

Question 14

Q

How is the interaction tested in logistic regression?

Answer

A

Include an interaction term x1 × x2 in the model
Use an LRT to compare models with and without the interaction

Question 15

Q

What is the Stata command for an interaction term?

Answer

A

logit <outcome> i.<predictor1>##i.<predictor2>

Question 16

Q

What is centring a continuous variable?

Answer

A

Subtracting the mean from each value so that 0 represents the mean, which makes interaction terms easier to interpret

Question 17

Q

Why is centring useful in logistic regression?

Answer

A

It allows meaningful interpretation of interaction effects
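
A small Python sketch of centring, using hypothetical ages. After centring, a value of 0 corresponds to the mean age, so the intercept and the main effects in an interaction model refer to a person of average age rather than to age 0.

```python
# Hypothetical ages, for illustration only
ages = [34, 47, 52, 61, 66]

mean_age = sum(ages) / len(ages)
centred_ages = [age - mean_age for age in ages]

print(mean_age)
print(centred_ages)
```

The centred values sum to zero, and someone at exactly the mean age has a centred value of 0.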

Question 18

Q

How do you interpret the coefficient for each predictor in a multivariable model?

Answer

A

The effect of x1 adjusted for x2, …, xm

Question 19

Q

What does the logit model give us and what does this mean for interpretation?

Answer

A

The log odds, but the OR is much better for interpretation

Question 20

Q

Even though the OR is easier to interpret, why should we still use the logit model?

Answer

A

We still need to know the function of the logit because the significance of the OR (z statistic) is derived from the SEs and the log odds. We also need to know the logit to use the probabilities from the model

Question 21

Q

How do we turn the equation for the log odds into the equation for the odds?

Answer

A

π / (1 − π) = exp(β0 + β1x1 + … + βmxm)

Question 22

Q

What are the mathematical steps for calculating probabilities?

Answer

A

We model the log odds or logit of a probability π
Which we turn into an equation for the odds
Which we turn into an equation to calculate probabilities

Question 23

Q

What is the preferred method for presenting probabilities?

Answer

A

Use graphs: predicted probabilities are easier to display in a graph than ORs

Question 24

Q

What model do we use to get the probabilities?

Question 25

Q

What values can coefficients take in a logit model?

Answer

A

Any value from minus infinity to positive infinity

Question 26

Q

What does the 95% CI tell you in a logit model?

Answer

A

When the CI doesn’t cross zero, this indicates a significant difference (the threshold is 1 for OR)

Question 27

Q

What do the margins and marginsplot commands do?

Answer

A

margins calculates the predicted probabilities
marginsplot allows you to plot the predicted probabilities

Question 28

Q

What range of values do probabilities take?

Answer

A

Any value from 0 to 1

Question 29

Q

How would you use Stata to plot the predicted probabilities of CVD for males with diabetes aged 20 - 99 with age intervals of 10?

Answer

A

logit cvd i.diabetes i.sex age
margins diabetes, at(age=(20(10)99) sex=0)
marginsplot, title("Predicted probabilities of CVD for males, by age and diabetes") legend(subtitle("diabetes"))
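
To show roughly what the predicted probabilities look like numerically, here is a Python sketch that evaluates a logit model over the same age grid. All coefficients are hypothetical placeholders, not estimates from any data, and the sketch gives point predictions only (no standard errors or confidence intervals).

```python
import math

# Hypothetical log-odds coefficients, NOT estimates from real data
B0, B_DIABETES, B_SEX, B_AGE = -6.0, 0.9, 0.4, 0.07

def prob_cvd(age, diabetes, sex):
    """Predicted probability of CVD from the hypothetical logit model."""
    eta = B0 + B_DIABETES * diabetes + B_SEX * sex + B_AGE * age
    return 1 / (1 + math.exp(-eta))

# Ages 20, 30, ..., 90, mirroring the numlist 20(10)99
ages = list(range(20, 100, 10))

# sex = 0 is taken to mean male, following the card above
for diabetes in (0, 1):
    row = [round(prob_cvd(age, diabetes, sex=0), 3) for age in ages]
    print("diabetes =", diabetes, row)
```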

Question 30

Q

What is the use of probabilities?

Answer

A

To communicate results to policymakers and the public. These would need to be plotted

Question 31

Q

When should margins be run?

Answer

A

Immediately after the regression

Question 32

Q

What is a common assumption violation of LRTs?

Answer

A

The number of observations in the models differ due to missing data on the variable in question

Question 33

Q

How do you rectify differences in observations between two nested models in LRTs?

Answer

A

Examine the variable first; missing data need to be coded as ‘.’
Then rerun the simpler model (L0) excluding those with missing data, so that both models use the same observations
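
The idea of restricting both models to the same observations can be sketched in Python as below. The records and the variable name bmi are hypothetical; None stands in for Stata's ‘.’ missing value.

```python
# Hypothetical records; None marks a missing value (Stata's '.')
records = [
    {"cvd": 1, "age": 63, "bmi": 31.2},
    {"cvd": 0, "age": 45, "bmi": None},
    {"cvd": 0, "age": 52, "bmi": 24.8},
    {"cvd": 1, "age": 70, "bmi": None},
    {"cvd": 1, "age": 58, "bmi": 29.1},
]

def complete_cases(rows, variable):
    """Keep only rows where `variable` is observed."""
    return [row for row in rows if row[variable] is not None]

# Both the simpler and the more complex model would be fitted on this sample
analysis_sample = complete_cases(records, "bmi")
print(len(records), len(analysis_sample))
```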

Question 34

Q

What may change after excluding missing data in LRTs?

Answer

A

Estimates may change - we need to examine the missing data using descriptive statistics

Question 35

Q

What would linearity indicate between the log odds of CVD and age?

Answer

A

The change in log odds of CVD is the same magnitude at different age ranges
However, a non-straight line may model ln(odds) better, depending on the data (check by graphing the predicted probabilities of the outcome against the continuous exposure)

Question 36

Q

What are some non-linear relationships?

Answer

A

Quadratic, cubic, logarithmic, and exponential

Question 37

Q

What is the easiest departure from linearity?

Answer

A

Quadratic (square term of continuous exposure)

Question 38

Q

What must you have when testing a quadratic term in LRTs?

Answer

A

The quadratic effect must be included with the linear term. These terms together give the shape of the line

Question 39

Q

What are the hypotheses when testing a quadratic term?

Answer

A

H0: The log odds of the outcome change linearly with the exposure (we don’t need the squared term)
H1: The relationship is not linear (i.e., quadratic)

Question 40

Q

How do you use an LRT to test a quadratic term?

Answer

A

Generate a new variable with the squared term to be added to the L1 model
If there is some evidence of a quadratic term (or generally non-linear), the variable can be categorised into meaningful categories rather than looking for more complex functions

Question 41

Q

What would indicate a slight curve after testing for non-linearity?

Answer

A

A positive quadratic term adds a further increase on top of the linear term. If the linear term is positive and the quadratic term is negative, there is first an increase then a decrease, and vice versa. Sometimes one term is significant and the other is not - an LRT would confirm the overall association
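
A Python sketch of how the signs of the linear and quadratic terms shape the curve on the log-odds scale. The coefficients are hypothetical. For a quadratic b0 + b1·x + b2·x², the curve changes direction at x = −b1 / (2·b2).

```python
def log_odds(x, b0, b1, b2):
    """Log-odds under a model with a linear and a squared term."""
    return b0 + b1 * x + b2 * x ** 2

def turning_point(b1, b2):
    """Value of x where the quadratic curve changes direction."""
    return -b1 / (2 * b2)

# Hypothetical coefficients: positive linear term, negative quadratic term
b0, b1, b2 = -4.0, 0.12, -0.001
peak = turning_point(b1, b2)

for x in (40, 60, 80):
    print(x, round(log_odds(x, b0, b1, b2), 3))
print("turning point:", round(peak, 1))
```

With these signs the log-odds rise up to the turning point and fall after it. Reversing both signs gives a decrease followed by an increase.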

Question 42

Q

What does it mean if the quadratic term is significant?

Answer

A

There is evidence that the association is non-linear. In this case, it’s advised to categorise the continuous variable

Question 43

Q

Where does the assumption of linearity apply?

Answer

A

To any regression model

Question 44

Q

How do we first test for interaction before fitting a model?

Answer

A

Descriptive statistics to compare prevalence

Question 45

Q

In the context of interaction, what are we estimating with this command: logistic cvd i.diabetes i.agecategories ?

Answer

A

We estimated the joint effect of diabetes and age assuming constant ORs across strata. Not looking at any specific strata

Question 46

Q

What does adding an interaction term between age and diabetes look like in the regression equation?

Answer

A

ln(π / (1 − π)) = β0 + β1 × diabetes + β2 × agecategories + β3 × (diabetes × agecategories)
We are introducing a new variable - the product of diabetes and age - to look at a specific stratum

Question 47

Q

What if we wanted to test interaction in a model with 10 predictors?

Answer

A

May be too unwieldy - you can use theory as a guide rather than testing for all possible interactions

Question 48

Q

What happens to the output when we add an interaction term?

Answer

A

The output changes
The OR for the exposure should be reported separately for each level of the effect-modifying (interacting) variable

Question 49

Q

With this command logistic cvd i.diabetes##i.agecategories - what does the intercept represent (reference group refers to those aged 50 years old or younger and without diabetes)?
- what would an OR of 3.143 against diabetes represent?
- what would an interaction term of 0.8 represent?

Answer

A

The odds of CVD in the reference group, i.e., those aged 50 years old or younger and without diabetes
3.143 is the OR for CVD comparing people with diabetes to those without diabetes among those aged 50 or younger (stratum-specific OR), i.e., in the reference age group the odds of having CVD for those with diabetes are higher than the odds for those without diabetes by a factor of 3.1
0.8 is the interaction term: the factor by which the OR for diabetes is multiplied among those aged 51+. It is used to obtain the OR for the category not shown directly in the output (i.e., 1 for age and 1 for diabetes)

Question 50

Q

With this command logistic cvd i.diabetes##i.agecategories - how would we use the interaction term (0.8) to get the effect of diabetes on those aged 51+? (reference = those aged 50 years old or younger and without diabetes)
Note: The OR 3.143 is used

Answer

A

Multiply the OR for diabetes (3.143) by the interaction term (0.8)
This is equal to 2.514 (the effect of diabetes on CVD is decreasing with age). The odds of CVD for those with diabetes are higher than the odds of having CVD for those without diabetes by a factor of 2.5 among those aged 51+

Question 51

Q

With this command logistic cvd i.diabetes##i.agecategories
- how would we use the interaction term (0.8) to get the OR for the effect of age on CVD among those with diabetes? (reference = those aged 50 years old or younger and without diabetes)
Note: The OR 4.23 is used

Answer

A

Multiply the OR for agecategories (4.23) by the interaction term (0.8) which is equal to 3.4.
- The odds of CVD for those aged 51+ are higher than the odds of CVD for those aged 50 or younger by a factor of 4.2 in the reference group (i.e., among those without diabetes)
- The odds of CVD for those aged 51+ are higher than the odds of CVD for those aged 50 or younger by a factor of 3.4 among those with diabetes
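
The multiplications on this card and the previous one can be checked with a short Python sketch. The three ORs are the values quoted on the cards. The last line, the OR for having both exposures versus the reference group, is an extra step that follows from the same model and is not stated on the cards.

```python
import math

# ORs quoted on the cards
or_diabetes = 3.143       # diabetes vs no diabetes, among those aged 50 or younger
or_age = 4.23             # aged 51+ vs 50 or younger, among those without diabetes
or_interaction = 0.8      # interaction term

# Stratum-specific ORs
or_diabetes_older = or_diabetes * or_interaction   # diabetes effect among 51+
or_age_diabetic = or_age * or_interaction          # age effect among diabetics

# Both exposures vs the reference group (aged 50 or younger, no diabetes)
or_both = or_diabetes * or_age * or_interaction

# Same result on the log-odds scale: add coefficients, then exponentiate
check = math.exp(math.log(or_diabetes) + math.log(or_interaction))

print(round(or_diabetes_older, 3), round(or_age_diabetic, 3), round(or_both, 3))
```

Multiplying ORs is equivalent to adding the corresponding coefficients on the log-odds scale, which is why the two routes agree.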

Question 52

Q

What does it mean if the interaction term is non-significant?

Answer

A

There is no evidence that the effect differs between strata - the observed differences may be due to chance

Question 53

Q

What can we do instead of adding an interaction term to estimate the effect of age and diabetes on CVD?

Answer

A

Stratify the analysis i.e., run the logistic regression for CVD by each category (i.e., age = 0 & 1 / diabetes = 0 & 1)
The interaction term computes all these stratum-specific ORs

Question 54

Q

When would you stratify the analysis instead of adding an interaction term?

Answer

A

Only if the interaction term is significant. Separate regressions reduce the sample size in each model, and stratifying without evidence of interaction may lead you to conclude erroneously that the differences are real when they are not

Question 55

Q

How can you get CI and p-value for interaction?

Answer

A

Use lincom e.g., lincom 1.diabetes + 1.diabetes#1.agecategories
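
A rough Python sketch of the calculation behind a linear combination of this kind: the two coefficients are added on the log-odds scale, the standard error of the sum uses both variances and the covariance, and the result is exponentiated. The ORs 3.143 and 0.8 come from the earlier cards. The variances and covariance are hypothetical, and a Wald-type interval is assumed.

```python
import math

def combined_or(b_main, b_inter, var_main, var_inter, cov, z=1.96):
    """OR and Wald-type CI for the sum of two log-odds coefficients."""
    estimate = b_main + b_inter
    se = math.sqrt(var_main + var_inter + 2 * cov)
    return (math.exp(estimate),
            math.exp(estimate - z * se),
            math.exp(estimate + z * se))

b_main = math.log(3.143)    # coefficient for diabetes
b_inter = math.log(0.8)     # coefficient for the interaction term

# Hypothetical variances and covariance of the two coefficients
var_main, var_inter, cov = 0.09, 0.16, -0.08

or_est, ci_lower, ci_upper = combined_or(b_main, b_inter, var_main, var_inter, cov)
print(round(or_est, 3), round(ci_lower, 3), round(ci_upper, 3))
```

The covariance matters: ignoring it would give the wrong standard error, which is one reason to let the software compute the interval rather than combining the two reported CIs by hand.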

Question 56

Q

What are the hypotheses for using an LRT to test for an interaction between age and diabetes on the effect of CVD?

Answer

A

H0: The odds of CVD do not differ in diabetics and non-diabetics according to age groups (i.e., there is no effect modification/interaction between diabetes and age groups) - if the two models are the same, an interaction term is not needed
H1: The odds of CVD differ in diabetics and non-diabetics according to age groups

Question 57

Q

What does Stata assume the first value of a continuous variable is?

Answer

A

0 - this is not always meaningful for variables like BMI as this doesn’t start from 0

Question 58

Q

What should you do to a continuous variable before adding an interaction term?

Answer

A

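Centre it at the mean (so that 0 represents the mean). When fitting the model, add a ‘c.’ before the continuous variable so that Stata treats it as continuous in the interaction - note that ‘c.’ does not centre the variable itself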

Question 59

Q

What is regression equation when adding an interaction term between an exposure and a continuous predictor?

Answer

A

ln(π / (1 − π)) = β0 + β1 × diabetes + β2 × agecentred + β3 × (diabetes × agecentred)
- The slope of agecentred differs for those with and without diabetes

Question 60

Q

When testing for an interaction term when estimating odds of CVD, if the OR for agecentred is 1.05 and the interaction term between agecentred and diabetes is 0.99, how would this be interpreted?

Answer

A

0.99 is the interaction effect: the ratio of the per-year OR for age among those with diabetes to that among those without diabetes (on the log-odds scale, the difference in slope)
1.05 is the increase in the odds of CVD per year increase in age, among those without diabetes
1.04 ≈ 1.05 × 0.99 is the increase in the odds of CVD per year increase in age, among those with diabetes
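
A short Python check of this arithmetic, using the two ORs from the question. The 10-year comparison is an added illustration that assumes the per-year OR is constant across the age range.

```python
or_age_no_diabetes = 1.05     # per-year OR among those without diabetes
or_interaction = 0.99         # interaction term

# Per-year OR among those with diabetes
or_age_diabetes = or_age_no_diabetes * or_interaction

# ORs for a 10-year increase in age: raise the per-year OR to the power 10
ten_year_no_diabetes = or_age_no_diabetes ** 10
ten_year_diabetes = or_age_diabetes ** 10

print(round(or_age_diabetes, 4))
print(round(ten_year_no_diabetes, 3), round(ten_year_diabetes, 3))
```

The product 1.05 × 0.99 works out to about 1.04. A small per-year difference between the groups becomes more visible over a 10-year span.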