Intro to generalised linear models/Categorical Data Flashcards

1
Q

For binary variables, what characterises the distribution in each sample?

A

The proportion of “1”s in each sample

E.g. if the binary outcomes = presence of disease, then the data set is completely characterised by the two group sizes and the relative frequency of the disease in each group

2
Q

What indices exist for measuring the effect/impact of the group membership (typically Exposure)?

A

Relative risk (RR)
Odds ratio (OR)
Risk difference – not considered here

3
Q

What can population indices be estimated by (without bias for large samples)?

What can also be derived?

A

Relevant indices constructed from the independent samples.

Confidence intervals

4
Q

What remains to introduce?

What does this amount to?

A

Tests for formally assessing whether the distributions of binary outcomes differ between the two categories.

Testing the null hypotheses (all equivalent):

  • Equal proportions of “1”s in the two populations (RR=1)
  • Equal odds in the two groups (OR=1)

To assess whether the distributions of binary outcomes differ between the two categories, a chi-squared test or Fisher’s exact test can be used.
5
Q

How can we estimate the probability of a category?

A

By its relative frequency in the sample

An exact 95% CI for the proportion can be generated by the ci command applied to a binary variable where the category coded “1” is the one we are interested in.
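
As a rough illustration of what an "exact" interval involves, the sketch below computes the relative frequency and a Clopper-Pearson style interval in plain Python. It is not the ci command itself: the function names and the sample counts are hypothetical, and the assumption that "exact" here means the Clopper-Pearson construction is mine.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def solve_on_unit_interval(f, iters=100):
    """Bisection for a root of f on [0, 1]; assumes f changes sign exactly once."""
    lo, hi = 0.0, 1.0
    f_lo_positive = f(lo) > 0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if (f(mid) > 0) == f_lo_positive:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def exact_ci(k, n, level=0.95):
    """Clopper-Pearson interval for a proportion with k '1's out of n."""
    alpha = 1 - level
    # Lower limit: the p at which P(X >= k) equals alpha/2 (0 if k == 0).
    lower = 0.0 if k == 0 else solve_on_unit_interval(
        lambda p: (1 - binom_cdf(k - 1, n, p)) - alpha / 2)
    # Upper limit: the p at which P(X <= k) equals alpha/2 (1 if k == n).
    upper = 1.0 if k == n else solve_on_unit_interval(
        lambda p: binom_cdf(k, n, p) - alpha / 2)
    return lower, upper

# Hypothetical sample: 12 "1"s out of 40 observations
k, n = 12, 40
estimate = k / n                      # relative frequency
lower, upper = exact_ci(k, n)
print(f"estimate = {estimate:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```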

6
Q

What does a contingency table summarise?

A
  • The frequency distribution of each of two categorical variables as well as the association between two categorical variables

In its simplest form, each cell of a two-way table contains the frequency counts of a variable’s category in relation to a category of another variable

  • the row and column totals represent the (marginal) distributions of the variables
  • the concept can be extended to multi-way contingency tables (not here)
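
A small Python sketch of a two-way table may help make the cells and margins concrete. The records below are invented for illustration, and the variable names are assumptions rather than anything from the course material.

```python
from collections import Counter

# Hypothetical records, one (exposure, outcome) pair per person
records = (
    [("exposed", "disease")] * 3
    + [("exposed", "no disease")] * 1
    + [("unexposed", "disease")] * 1
    + [("unexposed", "no disease")] * 3
)

cells = Counter(records)                      # joint frequencies (the table cells)
row_totals = Counter(e for e, _ in records)   # marginal distribution of exposure
col_totals = Counter(o for _, o in records)   # marginal distribution of outcome

outcomes = ("disease", "no disease")
for exposure in ("exposed", "unexposed"):
    counts = [cells[(exposure, o)] for o in outcomes]
    print(exposure, counts, "row total:", row_totals[exposure])
print("column totals:", [col_totals[o] for o in outcomes])
```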
7
Q

What does the command bitest bin==p0 provide?

A

Provides an exact test for the null hypothesis that the probability of the “1” category is a specified value p0.

E.g. test the null hypothesis that the proportions of female and male births in the UK in 1958 were the same (Prob(female) = Prob(male) = 0.5).
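
The idea behind an exact binomial test can be sketched in a few lines of Python. This is not the bitest command: it uses one common definition of the two-sided exact p-value (summing the null probabilities of all outcomes no more likely than the observed one), and the counts are hypothetical, not the 1958 birth data.

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def exact_binomial_test(k, n, p0):
    """Two-sided exact p-value for H0: Prob('1') = p0."""
    observed = binom_pmf(k, n, p0)
    tolerance = 1e-9  # guards against floating-point ties
    return sum(
        binom_pmf(i, n, p0)
        for i in range(n + 1)
        if binom_pmf(i, n, p0) <= observed * (1 + tolerance)
    )

# Hypothetical sample: 3 "1"s out of 20 observations, tested against p0 = 0.5
k, n = 3, 20
p_value = exact_binomial_test(k, n, 0.5)
print(f"two-sided exact p-value = {p_value:.4f}")
```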

8
Q

How can you calculate the odds that an exposed person develops disease?

A

Divide the number of exposed people with disease (a) by the number of exposed people without disease (b)

a/b

9
Q

What is an Odds ratio?

A

The ratio of the odds of developing the disease in the exposed to the odds of developing the disease in the non-exposed: (a/b)/(c/d)
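
A worked example in Python, using the a, b, c, d cell labels with made-up counts, shows how the two odds combine into the odds ratio:

```python
# Hypothetical 2x2 counts, using the a, b, c, d labels from the cards
a, b = 20, 80   # exposed: with disease, without disease
c, d = 10, 90   # non-exposed: with disease, without disease

odds_exposed = a / b                            # odds of disease in the exposed
odds_non_exposed = c / d                        # odds of disease in the non-exposed
odds_ratio = odds_exposed / odds_non_exposed    # (a/b)/(c/d)

# The same quantity written as a cross-product ratio
cross_product = (a * d) / (b * c)

print(f"odds (exposed) = {odds_exposed:.3f}")
print(f"odds (non-exposed) = {odds_non_exposed:.3f}")
print(f"odds ratio = {odds_ratio:.3f}")
```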

10
Q

How can you calculate the odds that a non-exposed person develops disease?

A

Divide non-exposed with disease (c) by non-exposed without disease (d)

c/d

11
Q

What is the risk ratio?

A

The ratio of the risk of developing the disease in the exposed to the risk of developing the disease in the non-exposed:

(a/(a+b))/(c/(c+d))
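
The same calculation as a short Python sketch, with hypothetical counts in the a, b, c, d cells:

```python
# Hypothetical 2x2 counts, using the a, b, c, d labels from the cards
a, b = 20, 80   # exposed: with disease, without disease
c, d = 10, 90   # non-exposed: with disease, without disease

risk_exposed = a / (a + b)          # proportion of the exposed who develop disease
risk_non_exposed = c / (c + d)      # proportion of the non-exposed who develop disease
risk_ratio = risk_exposed / risk_non_exposed

print(f"risk (exposed) = {risk_exposed:.2f}")
print(f"risk (non-exposed) = {risk_non_exposed:.2f}")
print(f"risk ratio = {risk_ratio:.2f}")
```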

12
Q

How can you calculate the risk that an exposed person develops disease?

A

Divide the number of exposed people with disease (a) by the total number of exposed people, with and without disease (a+b)

a/(a+b)

13
Q

How can you calculate the risk that a non-exposed person develops disease?

A

Divide the number of non-exposed people with disease (c) by the total number of non-exposed people, with and without disease (c+d)

c/(c+d)

14
Q

What does OR=1 mean?

A

Exposure does not affect the odds of outcome

E.g. there is no difference in the odds of suffering malaise between males and females.

15
Q

What does OR < 1 mean?

A

Indicates that the exposure is associated with reduced odds of developing the outcome

E.g. if the odds ratio = 0.339, then the odds of a male suffering from malaise are about a third (0.339 times) those of a female.

16
Q

What does OR > 1 mean?

A

Indicates that the exposure is associated with increased odds of developing the disease

E.g. if the odds ratio = 1.5, then the odds of a male suffering from malaise are 1.5 times those of a female.

17
Q

What can the null hypothesis of statistical independence between the row and column variables be tested using?

What commands are included in Stata to achieve this?

A

A chi-squared test or an exact test.

tab2 with options provides output for both tests:

chi2 gives the chi-squared test (test statistic, d.f. and p-value)

expected prints the absolute cell frequencies expected under the null hypothesis
- should be at least 5 for the chi-squared test to be used
- comparison with the observed counts can be helpful for interpreting associations

cchi2 shows the contribution of each cell to the test statistic

exact provides the p-value for Fisher’s exact test
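
To show what the expected frequencies and cell contributions are, here is a minimal Python sketch of the chi-squared calculation for a 2x2 table. It is not Stata output: the counts are hypothetical, and the p-value shortcut via erfc only holds for 1 degree of freedom.

```python
from math import erfc, sqrt

# Hypothetical 2x2 table: rows = exposure groups, columns = outcome categories
observed = [[20, 80],
            [10, 90]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count under independence: row total * column total / n
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Contribution of each cell to the test statistic: (O - E)^2 / E
contributions = [
    [(o - e) ** 2 / e for o, e in zip(obs_row, exp_row)]
    for obs_row, exp_row in zip(observed, expected)
]
chi2 = sum(sum(row) for row in contributions)

# Upper-tail probability of a chi-squared variable with 1 d.f.
p_value = erfc(sqrt(chi2 / 2))

# Rule of thumb from the card: every expected count should be at least 5
all_expected_ok = all(e >= 5 for row in expected for e in row)

print("expected:", expected)
print(f"chi2 = {chi2:.3f}, p = {p_value:.4f}, expected counts ok: {all_expected_ok}")
```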

18
Q
  1. What does the McNemar test assess?
  2. How does the McNemar test achieve this?
A
  1. The null hypothesis that the distribution of a binary outcome (e.g. the proportion of ‘1’s) is the same in each of the groups making up the pairs

E.g. the proportion of individuals who are 40 years or older at marriage is the same for males and females.

  2. The McNemar test calculates
  • the proportion of pairs that show a (‘1’,‘0’) discrepancy
  • and the proportion of pairs that show a (‘0’,‘1’) discrepancy
  • and compares these proportions with those expected if both discrepancies were equally likely.

When there is a significant exposure-outcome (E-O) association, the off-diagonal cells in the 2x2 table can be examined to understand its direction.
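
The comparison of discordant pairs can be sketched in Python as follows. The pair counts are hypothetical, and the sketch uses the chi-squared form of the test without a continuity correction; an exact binomial version on the discordant pairs is an alternative.

```python
from math import erfc, sqrt

# Hypothetical discordant-pair counts from a paired 2x2 table
n_10 = 15   # pairs with a ('1','0') discrepancy
n_01 = 5    # pairs with a ('0','1') discrepancy

# If both discrepancies were equally likely, each count would be expected
# to equal half of all discordant pairs
expected = (n_10 + n_01) / 2

# Sum of (observed - expected)^2 / expected over the two off-diagonal cells
chi2 = (n_10 - expected) ** 2 / expected + (n_01 - expected) ** 2 / expected

# Upper-tail probability of a chi-squared variable with 1 d.f.
p_value = erfc(sqrt(chi2 / 2))

print(f"McNemar chi2 = {chi2:.2f}, p = {p_value:.4f}")
```

Only the off-diagonal (discordant) cells enter the statistic; concordant pairs carry no information about a difference between the paired groups.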

19
Q

What does a stratified analysis involve?

A

Restriction of the E-O association analysis to narrow ranges (the strata) of an extraneous variable (e.g. the putative confounder)

and pooling of the information over all strata (if appropriate)

20
Q

What steps does the stratified analysis involve?

A
  1. Categorise the variable(s) to be controlled (the putative confounder(s));
  2. Estimate the association between exposure and outcome within each stratum separately;
  3. If the stratum-specific indices are similar, combine them into a single index (loosely, by taking some kind of weighted mean);
  4. Carry out a test for this combined index.

21
Q
  1. What is the best known stratified analysis approach?
  2. What does the procedure provide?
A
  1. The Mantel-Haenszel test for analysing odds ratios within strata.
  2. The procedure provides (no details here):

OR inferences (estimate, 95% CI and test for independence) for each stratum

Tests for assessing the homogeneity of the ORs across strata
- a significant p-value speaks for effect modification and against combining the OR values

A test of the null hypothesis that the (constant) within-stratum OR is 1 (i.e. no association)

An estimate and 95% CI for the within-stratum OR.
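
For intuition about how stratum-specific ORs are pooled, the sketch below computes the standard Mantel-Haenszel pooled odds ratio estimate, sum(a*d/n) / sum(b*c/n), in Python. The strata and their counts are hypothetical, and the tests and confidence intervals listed above are not reproduced here.

```python
# Hypothetical strata of a putative confounder, each a 2x2 table with the
# usual labels: a, b = exposed with/without disease; c, d = non-exposed
strata = [
    {"a": 10, "b": 20, "c": 5, "d": 20},
    {"a": 20, "b": 10, "c": 10, "d": 10},
]

def odds_ratio(s):
    """Stratum-specific odds ratio (a/b)/(c/d)."""
    return (s["a"] * s["d"]) / (s["b"] * s["c"])

def mantel_haenszel_or(tables):
    """Pooled OR: a weighted combination of the stratum-specific tables."""
    numerator = 0.0
    denominator = 0.0
    for s in tables:
        n = s["a"] + s["b"] + s["c"] + s["d"]   # stratum size
        numerator += s["a"] * s["d"] / n
        denominator += s["b"] * s["c"] / n
    return numerator / denominator

stratum_ors = [odds_ratio(s) for s in strata]
pooled_or = mantel_haenszel_or(strata)

print("stratum ORs:", stratum_ors)
print(f"Mantel-Haenszel pooled OR = {pooled_or:.3f}")
```

In this made-up example the two stratum ORs are identical, so the pooled value equals them; pooling is only sensible when the stratum ORs are reasonably homogeneous.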

22
Q
  1. How can a logistic model be derived?
  2. What is an issue with this?
A
  1. By starting from a linear model for the probability of the event (coded 1 in the binary variable) and forcing the predicted probability to lie between 0 and 1

That is, force Prob(y=1 | x) = xb + e to lie between 0 and 1.

  2. Left unconstrained, the linear probability model can produce probabilities greater than 1 and less than 0.
23
Q

How can the issue with the linear probability model be overcome?

A

By transforming the probability into odds, and then into log odds

The odds of event A can be defined as the probability that A does happen divided by the probability that A does not happen.

Odds(A) = prob(A happens)/prob(A does not happen) = prob(A)/(1-prob(A))

For example, if prob(A) = 1/2 then Odds(A) = (1/2)/(1-1/2) = 1

Odds lie between 0 (when prob(A)=0) and ∞ (as prob(A) approaches 1). This is still limited, as it does not include negative numbers.

That is: 0 ≤ prob(A)/(1-prob(A)) < ∞

But the log of the odds (the logit) is unbounded in both directions, so we can fit a linear model to it. It includes negative numbers.
That is: -∞ < ln( prob(A)/(1-prob(A)) ) < ∞

The logit ln( prob(A)/(1-prob(A)) ) allows us to express the binary outcome on a continuous scale and create a linear relationship between the exposure/predictor variable and the outcome
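
The chain from probability to odds to logit, and back again, can be checked numerically with a short Python sketch (the function names are just illustrative):

```python
from math import exp, log

def odds(p):
    """Odds of an event with probability p (requires 0 <= p < 1)."""
    return p / (1 - p)

def logit(p):
    """Log odds of an event with probability p (requires 0 < p < 1)."""
    return log(odds(p))

def inverse_logit(x):
    """Map any real number back to a probability strictly between 0 and 1."""
    return 1 / (1 + exp(-x))

for p in (0.1, 0.2, 0.5, 0.8, 0.9):
    print(f"p = {p:.1f}  odds = {odds(p):.3f}  logit = {logit(p):+.3f}")
```

Probabilities below 1/2 give negative logits and probabilities above 1/2 give positive ones, which is why the logit scale has no lower or upper bound.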

24
Q

Describe the logistic regression terminology and what it means

A

ln [p / (1-p)] = a + B1x1 + B2x2 + … (compare y = a + B1x1 + B2x2 + … in linear regression)

ln [p / (1-p)] = log-odds = logit(p)

L = a + B1x1 + B2x2 + … is called the linear predictor

The model is linear in the logit but non-linear in the probability p

When x1 is continuous, the intercept a is the value of the log odds when x1=0 (and all other predictors are 0), and the slope B1 represents the change in log odds (of a positive outcome, i.e. y=1) when x1 increases by 1 unit.

To be able to interpret the parameters as natural odds or odds ratios we need to exponentiate them.
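
A minimal Python sketch of this interpretation, with a single predictor and made-up coefficient values (a = -2.0 and B1 = 0.5 are assumptions, not fitted estimates):

```python
from math import exp

# Hypothetical coefficients on the logit scale
a = -2.0    # intercept: log odds when x1 = 0
b1 = 0.5    # slope: change in log odds per 1-unit increase in x1

def linear_predictor(x1):
    return a + b1 * x1

def odds(x1):
    """Odds of y = 1 at a given x1: exponentiate the linear predictor."""
    return exp(linear_predictor(x1))

def probability(x1):
    """Predicted probability of y = 1: always strictly between 0 and 1."""
    return 1 / (1 + exp(-linear_predictor(x1)))

baseline_odds = exp(a)           # odds when x1 = 0
odds_ratio_per_unit = exp(b1)    # odds ratio for a 1-unit increase in x1

print(f"baseline odds = {baseline_odds:.3f}")
print(f"odds ratio per unit of x1 = {odds_ratio_per_unit:.3f}")
for x1 in (0, 2, 4, 6):
    print(f"x1 = {x1}: probability = {probability(x1):.3f}")
```

The ratio of the odds at x1 + 1 to the odds at x1 is the same for every x1, namely exp(B1), whereas the change in probability per unit of x1 is not constant.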