Logistic Regression II Flashcards

1
Q

What is logistic regression used for?

A

To model the relationship between one or more independent variables (m exposures, continuous or categorical) and a binary outcome (y)

2
Q

What is the logistic regression equation?

A

ln( π / (1−π) ) = β0 + β1x1 + … + βmxm
This equation models the log-odds of an event occurring. βm represents the effect of predictor m on the log-odds when all other predictors are held constant
ln( π / (1−π) ) = log-odds or logit
ln( π ) = natural logarithm of the probability
π / (1−π) = odds

3
Q

How do you obtain OR from a logistic regression coefficient?

A

OR = exp(β)
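
As a minimal sketch of this conversion, the Python snippet below exponentiates a coefficient and, using a Wald interval, the limits of its 95% CI. The coefficient (0.405) and standard error (0.15) are hypothetical values chosen for illustration, and the function name is an assumption.

```python
import math

def odds_ratio_with_ci(beta, se, z=1.96):
    """Return (OR, lower, upper) from a log-odds coefficient and its SE.

    The interval is a Wald interval: exponentiate beta -/+ z * se.
    """
    return (
        math.exp(beta),
        math.exp(beta - z * se),
        math.exp(beta + z * se),
    )

# Hypothetical coefficient and standard error, for illustration only
or_, lower, upper = odds_ratio_with_ci(beta=0.405, se=0.15)
print(f"OR = {or_:.2f}, 95% CI {lower:.2f} to {upper:.2f}")
```

The interval is built on the log-odds scale and only then exponentiated, which is why a CI for an OR is not symmetric around the point estimate.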

4
Q

What are the key assumptions of a BLR?

A
  • Observations are independent
  • No multicollinearity among independent variables
  • The outcome variable is binary
  • No unobserved confounders
  • The log-odds of the dependent variable are linearly related to continuous predictors
5
Q

How do you interpret a logistic regression coefficient?

A
  • A positive coefficient means the predictor increases the log-odds of the event occurring
  • A negative coefficient means the predictor decreases the log-odds
6
Q

What does a 95% CI for an OR tell you?

A

The range of values within which the true OR plausibly lies. If the CI includes 1, the predictor is not statistically significant at the 5% level

7
Q

What is the LRT used for?

A

To compare nested logistic regression models

8
Q

What are the hypotheses for the LRT?

A

H0: The simpler model (without the extra parameter) is sufficient
H1: The more complex model (with the extra parameter) provides a better fit

9
Q

How do you interpret the LRT p-value?

A
  • If p < 0.05, reject H0 (favour the more complex model)
  • If p ≥ 0.05, do not reject H0 (favour the simpler model)
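
The LRT statistic behind this p-value is twice the difference in log-likelihoods between the two nested models, compared against a chi-squared distribution with degrees of freedom equal to the number of extra parameters. The Python sketch below covers the case of one extra parameter. The log-likelihood values are hypothetical, and the function name is an assumption.

```python
import math

def lrt_one_extra_parameter(loglik_simple, loglik_complex):
    """Likelihood ratio test for nested models differing by ONE parameter.

    Returns (statistic, p_value). For 1 degree of freedom the chi-squared
    upper tail equals erfc(sqrt(statistic / 2)), so no extra library is needed.
    """
    statistic = 2 * (loglik_complex - loglik_simple)
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# Hypothetical log-likelihoods for the simpler (L0) and more complex (L1) model
stat, p = lrt_one_extra_parameter(loglik_simple=-520.0, loglik_complex=-516.5)
decision = "reject H0" if p < 0.05 else "do not reject H0"
print(f"LRT statistic = {stat:.2f}, p = {p:.4f}: {decision}")
```

With more than one extra parameter the same statistic applies, but the tail probability needs a general chi-squared function (for example from a statistics library).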
10
Q

How do you compute predicted probabilities?

A

π = exp(β0 + β1x1 + … + βmxm) / (1 + exp(β0 + β1x1 + … + βmxm))
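
A minimal Python sketch of this formula is shown below. The intercept, coefficients and predictor values are hypothetical, and the function name is an assumption.

```python
import math

def predicted_probability(intercept, coefs, xs):
    """Predicted probability from a logistic model.

    intercept: beta_0; coefs: [beta_1..beta_m]; xs: [x_1..x_m].
    """
    linear_predictor = intercept + sum(b * x for b, x in zip(coefs, xs))
    odds = math.exp(linear_predictor)   # log-odds -> odds
    return odds / (1 + odds)            # odds -> probability

# Hypothetical model: log-odds = -3.0 + 0.05 * age + 0.9 * diabetes
p_example = predicted_probability(-3.0, [0.05, 0.9], [60, 1])
print(f"Predicted probability = {p_example:.3f}")
```

A linear predictor of 0 gives a probability of exactly 0.5, which is a quick check that the function behaves as expected.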

11
Q

What is an interaction effect?

A

An interaction occurs when the effect of one predictor on the outcome depends on the value of another predictor

11
Q

How do you interpret predicted probabilities?

A

They indicate the probability of an event occurring for a given set of predictor values

12
Q

How do you interpret an interaction term?

A

If the interaction term is significant, the relationship between a predictor and the outcome varies by the interacting variable

13
Q

How is the interaction tested in logistic regression?

A
  • Include an interaction term x1 × x2 in the model
  • Use an LRT to compare models with and without the interaction
14
Q

What is the Stata command for an interaction term?

A

logit <outcome> <i.predictor1##i.predictor2>

15
Q

What is centring a continuous variable?

A

Subtracting the mean from each value, so that 0 represents the mean, to make interaction terms easier to interpret

16
Q

Why is centring useful in logistic regression?

A

It allows meaningful interpretation of interaction effects

17
Q

How do you interpret the coefficient for each predictor in a multivariable model?

A

The effect of x1 on the log-odds of the outcome, adjusted for x2, …, xm

18
Q

What does the logit model give us and what does this mean for interpretation?

A

The log odds, but the OR is much better for interpretation

19
Q

Even though the OR is easier to interpret, why should we still use the logit model?

A

We still need the logit because the significance test for the OR (z statistic) is computed on the log-odds scale, from the coefficient and its SE. We also need the logit to calculate predicted probabilities from the model

20
Q

How do we turn the equation for the log odds into the equation for the odds?

A

π / (1−π) = exp(β0 + β1x1 + … + βmxm)

21
Q

What are the mathematical steps for calculating probabilities?

A
  • We model the log odds or logit of a probability π
  • Which we turn into an equation for the odds
  • Which we turn into an equation to calculate probabilities
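
The three steps can be written out as a short derivation. Here η is shorthand introduced for this sketch (it does not appear in the cards) for the linear predictor β0 + β1x1 + … + βmxm:

```latex
\begin{align*}
\ln\!\left(\frac{\pi}{1-\pi}\right) &= \eta,
  \qquad \eta = \beta_0 + \beta_1 x_1 + \dots + \beta_m x_m \\
\frac{\pi}{1-\pi} &= e^{\eta} \\
\pi = e^{\eta}(1-\pi)
  \;&\Rightarrow\; \pi\,(1 + e^{\eta}) = e^{\eta}
  \;\Rightarrow\; \pi = \frac{e^{\eta}}{1 + e^{\eta}}
\end{align*}
```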
22
Q

What is the preferred method for presenting probabilities?

A

Use graphs: predicted probabilities are easier to display and read in a graph than ORs

23
Q

What model do we use to get the probabilities?

A

Logit

24
Q

What values can coefficients take in a logit model?

A

Any value from minus infinity to positive infinity

25
Q

What does the 95% CI tell you in a logit model?

A

When the CI for a coefficient doesn’t include zero, the association is statistically significant (for an OR, the null value is 1 rather than 0)

26
Q

What do the margins and marginsplot commands do?

A

margins calculates the predicted probabilities
marginsplot allows you to plot the predicted probabilities

27
Q

What range of values do probabilities take?

A

Between 0 and 1

28
Q

How would you use Stata to plot the predicted probabilities of CVD for males with diabetes aged 20 - 99 with age intervals of 10?

A

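* diabetes must be in the model as a factor variable for margins to use it; sex = 0 is assumed to code males
logit cvd i.diabetes i.sex age
margins diabetes, at(age=(20(10)99) sex=0)
marginsplot, title("Predicted probabilities of CVD for males, by age and diabetes") legend(subtitle("diabetes"))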

29
Q

What is the use of probabilities?

A

To communicate results to policymakers and the public. These would need to be plotted

30
Q

When should margins be run?

A

Immediately after the regression

31
Q

What is a common assumption violation of LRTs?

A

The number of observations differs between the two models because of missing data on the variable in question

32
Q

How do you rectify differences in observations between two nested models in LRTs?

A
  • Examine the variable first; missing data need to be coded as ‘.’
  • Run the simpler model (L0) excluding those with missing data on that variable, so that both models use the same observations
32
Q

What may change after excluding missing data in LRTs?

A

Estimates may change - we need to examine the missing data using descriptive statistics

33
Q

What would linearity indicate between the log odds of CVD and age?

A

The change in the log odds of CVD per unit increase in age is the same at all ages
However, depending on the data, a curve may model ln(odds) better than a straight line (check by graphing the predicted probabilities of the outcome against the continuous exposure)

34
Q

What are some non-linear relationships?

A

Quadratic, cubic, logarithmic, and exponential

35
Q

What is the easiest departure from linearity?

A

Quadratic (square term of continuous exposure)

36
Q

What must you have when testing a quadratic term in LRTs?

A

The quadratic effect must be included with the linear term. These terms together give the shape of the line

37
Q

What are the hypotheses when testing a quadratic term?

A

H0: The log odds of an outcome increase linearly with an exposure (we don’t need the squared term)
H1: The relationship is not linear (i.e., quadratic)

38
Q

How do you use an LRT to test a quadratic term?

A
  • Generate a new variable with the squared term, add it to the L1 model alongside the linear term, and compare L1 with the linear-only model (L0)
  • If there is some evidence of a quadratic term (or generally non-linear), the variable can be categorised into meaningful categories rather than looking for more complex functions
39
Q

What would indicate a slight curve after testing for non-linearity?

A

A squared term that adds to the linear term. If both the linear and quadratic terms are positive, there is an additional increase beyond the linear trend. If the linear term is positive and the quadratic term is negative, there is first an increase then a decrease, and vice versa. Sometimes one term may be significant and the other non-significant; an LRT would confirm the overall association

40
Q

What does it mean if the quadratic term is significant?

A

There is evidence that the association is non-linear. In this case, it’s advised to categorise the continuous variable

41
Q

Where does the assumption of linearity apply?

A

To any regression model

42
Q

How do we first test for interaction before fitting a model?

A

Descriptive statistics to compare prevalence

43
Q

In the context of interaction, what are we estimating with this command: logistic cvd i.diabetes i.agecategories ?

A

We estimate the effects of diabetes and age on CVD, each adjusted for the other, assuming the ORs are constant across strata. No stratum-specific effects are estimated

44
Q

What does adding an interaction term between age and diabetes look like in the regression equation?

A

ln( π / (1−π) ) = β0 + β1 × diabetes + β2 × agecategories + β3 × (diabetes × agecategories)
We are introducing a new variable, the product of diabetes and age, which lets us look at a specific stratum

45
Q

What if we wanted to test interaction in a model with 10 predictors?

A

May be too unwieldy - you can use theory as a guide rather than testing for all possible interactions

46
Q

What happens to the output when we add an interaction term?

A

The main-effect ORs now apply only to the reference level of the other variable
The ORs for the exposure should be reported separately for different levels of the effect-modifying (interacting) variable

47
Q

With this command logistic cvd i.diabetes##i.agecategories - what does the intercept represent (reference group refers to those aged 50 years old or younger and without diabetes)?
- what would an OR of 3.143 against diabetes represent?
- what would an interaction term of 0.8 represent?

A
  • The odds of CVD in the reference group: those aged 50 years old or younger and without diabetes
  • 3.143 is the OR for CVD comparing people with diabetes to those without, among those aged 50 or younger (stratum-specific OR), i.e., the odds of having CVD for those with diabetes are higher than the odds of having CVD for those without diabetes by a factor of 3.1 in the reference age group
  • 0.8 is the interaction term: the factor by which an OR changes in the category not shown directly in the output (i.e., 1 for age and 1 for diabetes)
48
Q

With this command logistic cvd i.diabetes##i.agecategories - how would we use the interaction term (0.8) to get the effect of diabetes on those aged 51+? (reference = those aged 50 years old or younger and without diabetes)
Note: The OR 3.143 is used

A

Multiply the OR for diabetes (3.143) by the interaction term (0.8)
This is equal to 2.514 (the effect of diabetes on CVD is decreasing with age). The odds of CVD for those with diabetes are higher than the odds of having CVD for those without diabetes by a factor of 2.5 among those aged 51+

49
Q

With this command logistic cvd i.diabetes##i.agecategories
- how would we use the interaction term (0.8) to get the OR for the effect of age on CVD among those with diabetes? (reference = those aged 50 years old or younger and without diabetes)
Note: The OR 4.23 is used

A

Multiply the OR for agecategories (4.23) by the interaction term (0.8), which is equal to 3.4.
- The odds of CVD for those aged 51+ are higher than the odds of CVD for those aged 50 or younger by a factor of 4.2 in the reference group (i.e., among those without diabetes)
- The odds of CVD for those aged 51+ are higher than the odds of CVD for those aged 50 or younger by a factor of 3.4 among those with diabetes
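
The multiplications in this card and the previous one can be checked with a short Python sketch. The three ORs are the ones quoted in the cards; the variable names are assumptions. Multiplying ORs is the same as adding coefficients on the log-odds scale and exponentiating the sum.

```python
import math

or_diabetes_ref = 3.143  # OR for diabetes among those aged 50 or younger
or_age_ref = 4.23        # OR for age 51+ among those without diabetes
or_interaction = 0.8     # interaction term (a ratio of ORs)

# Stratum-specific ORs for the non-reference stratum
or_diabetes_51plus = or_diabetes_ref * or_interaction
or_age_with_diabetes = or_age_ref * or_interaction

# Same result via the log-odds scale: add coefficients, then exponentiate
or_diabetes_51plus_log = math.exp(
    math.log(or_diabetes_ref) + math.log(or_interaction)
)

print(f"Diabetes OR among those aged 51+: {or_diabetes_51plus:.3f}")
print(f"Age OR among those with diabetes: {or_age_with_diabetes:.3f}")
```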

50
Q

What does it mean if the interaction term is non-significant?

A

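There is no evidence that the effect of one predictor differs across levels of the other, so the stratum-specific differences may be due to chance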

51
Q

What can we do instead of adding an interaction term to estimate the effect of age and diabetes on CVD?

A

Stratify the analysis i.e., run the logistic regression for CVD by each category (i.e., age = 0 & 1 / diabetes = 0 & 1)
The interaction term computes all these stratum-specific ORs

52
Q

When would you stratify the analysis instead of adding an interaction term?

A

Only if the interaction term is significant. Separate regressions reduce the sample size, and stratifying without evidence of interaction may lead you to erroneously conclude the differences are real when they are not

53
Q

How can you get CI and p-value for interaction?

A

Use lincom, e.g., lincom 1.diabetes + 1.diabetes#1.agecategories, which gives the OR for diabetes among those in age category 1, with its CI and p-value

54
Q

What are the hypotheses for using an LRT to test for an interaction between age and diabetes on the effect of CVD?

A

H0: The odds of CVD do not differ in diabetics and non-diabetics according to age groups (i.e., there is no effect modification/interaction between diabetes and age groups) - if the two models are the same, an interaction term is not needed
H1: The odds of CVD differ in diabetics and non-diabetics according to age groups

55
Q

What does Stata assume the first value of a continuous variable is?

A

0. This is not always meaningful for variables like BMI, where a value of 0 does not occur

56
Q

What should you do to a continuous variable before adding an interaction term?

A

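Centre it at the mean (so that 0 represents the mean), e.g., by generating a new variable equal to each value minus the mean. When fitting the model, add a ‘c.’ before the variable to tell Stata it is continuous; the ‘c.’ prefix itself does not centre the variable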
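
Centring is just a subtraction. The Python sketch below illustrates it, using hypothetical ages chosen only for illustration:

```python
from statistics import mean

# Hypothetical ages, for illustration only
ages = [34, 47, 52, 61, 76]

age_mean = mean(ages)
age_centred = [age - age_mean for age in ages]

# After centring, 0 corresponds to the mean age and the values sum to zero
print(age_mean, age_centred)
```

After centring, coefficients for the other terms in an interaction model are read at the mean age rather than at age 0.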

57
Q

What is the regression equation when adding an interaction term between an exposure and a continuous predictor?

A

ln( π / (1−π) ) = β0 + β1 × diabetes + β2 × agecentred + β3 × (diabetes × agecentred)
- The slope of agecentred differs for those with and without diabetes

58
Q

When testing for an interaction term when estimating odds of CVD, if the OR for agecentred is 1.05 and the interaction term between agecentred and diabetes is 0.99, how would this be interpreted?

A
  • 0.99 is the interaction effect: the ratio of the per-year ORs for age in those with diabetes versus those without (i.e., the difference in slope on the log-odds scale)
  • 1.05 is the OR for CVD per one-year increase in age, among those without diabetes
  • 1.04 ≈ 1.05 × 0.99 (1.0395) is the OR for CVD per one-year increase in age, among those with diabetes
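
The Python sketch below works through this arithmetic, using the two ORs quoted in the card. It also extends the comparison to a 10-year age difference; this extension is an added illustration and assumes the per-year OR is constant across ages.

```python
or_age_no_diabetes = 1.05  # per-year OR for age, those without diabetes
or_interaction = 0.99      # interaction term (ratio of per-year ORs)

# Per-year OR for age among those with diabetes
or_age_diabetes = or_age_no_diabetes * or_interaction  # 1.0395

# ORs for a 10-year age difference: raise the per-year OR to the power 10
or_10y_no_diabetes = or_age_no_diabetes ** 10
or_10y_diabetes = or_age_diabetes ** 10

print(f"Per-year OR, with diabetes: {or_age_diabetes:.4f}")
print(f"10-year OR, without vs with diabetes: "
      f"{or_10y_no_diabetes:.2f} vs {or_10y_diabetes:.2f}")
```

A per-year interaction of 0.99 looks negligible, but it compounds over larger age differences.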