Causal inference Flashcards

1
Q

Study designs for causal inference

A

Randomised controlled trials
Natural or quasi experimental studies
Longitudinal studies

2
Q

Analysis methods for causal inference

A

Confounder adjustment/stratification
Propensity Scores

3
Q

RCT:

A

Manipulation: The scientist actively interacts with the environment by modifying certain aspects (X) (according to F. Bacon)
Randomization: Subjects are assigned to the active (X=1) or control (X=0) condition at random, i.e., regardless of their characteristics. After successful randomization, the samples assigned to the active and control conditions are equivalent, on average, in all aspects (Z) but the exposure (X).
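
A minimal sketch of why randomization balances Z, assuming a single hypothetical covariate (an age-like variable) and a coin-flip assignment. The numbers are simulated, not from any study.

```python
import random
import statistics

random.seed(0)

# Hypothetical covariate Z (an age-like variable) for 10,000 subjects.
z = [random.gauss(50, 10) for _ in range(10_000)]

# Randomization: X is assigned by a coin flip, regardless of Z.
x = [random.randint(0, 1) for _ in z]

# Mean of Z in the active (X=1) and control (X=0) groups.
mean_z_active = statistics.mean(zi for zi, xi in zip(z, x) if xi == 1)
mean_z_control = statistics.mean(zi for zi, xi in zip(z, x) if xi == 0)

print(round(mean_z_active, 2), round(mean_z_control, 2))
```

The two group means should be close because assignment ignores Z; the same holds, on average, for covariates that were never measured.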

4
Q

Issues with RCTs

A

Lack of external validity/generalizability

Both groups balanced for confounders

5
Q

Pyramid of evidence

A

Level 1: Systematic review of RCTs
Level 2: Single RCT
Level 3: Systematic review of observational studies
Level 4: Single observational study
Level 5: Qualitative studies
Level 6: Expert opinion/case studies

6
Q

Pros of RCTs

A

Treatment and control groups can be matched
Strong evidence for cause-effect relationships
Statistical methods are relatively straightforward

7
Q

Cons of RCTs

A

Expensive, take a long time
Potentially unethical in some circumstances
Prone to selection bias
It might not be generalizable to the real world (too artificial)

8
Q

Natural or Quasi-experimental designs

A

Not everything can be studied with an RCT; it may not be possible or ethical
Can use longitudinal prospective samples
Chronology is used to parse out causality: you look at how t0 affects t1, but you cannot look at this in reverse to determine the direction of the effect

9
Q

Twin studies

A

Popular because twins are matched for genes and home environment, almost like a natural RCT

Monozygotic twins are naturally “matched” for many common confounders
Grew up in the same family at the same time, and share genetic risk factors

Differences in outcomes associated with discordant exposures within twin pairs are often described as causal

Discordant twin design (twins with different in utero outcomes)

By design the sample contains only twins, whose experience may be specific and cannot be easily generalized

10
Q

Instrumental Variable design

A

No manipulation or randomization

Although there is no manipulation or randomization of the exposure variable (X), there is manipulation and/or randomization of an instrumental variable (I).

An instrumental variable (I) must:
be correlated with an exposure X (ideally explaining much of the variance in X),
not be correlated with the error Z,
be correlated with an outcome Y only through X
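
A minimal sketch of the three conditions at work, assuming simulated data with an unobserved confounder `u`, a randomized binary instrument `i`, and a true effect of X on Y of 2. All names and numbers are hypothetical. The ratio cov(I, Y) / cov(I, X) is the simple (Wald-type) instrumental variable estimate.

```python
import random

random.seed(1)
n = 20_000

def cov(a, b):
    """Sample covariance of two equal-length lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / (len(a) - 1)

u = [random.gauss(0, 1) for _ in range(n)]      # unobserved confounder
i = [random.randint(0, 1) for _ in range(n)]    # randomized instrument I
# I affects X; u affects both X and Y; I reaches Y only through X.
x = [1.0 * ii + ui + random.gauss(0, 1) for ii, ui in zip(i, u)]
y = [2.0 * xi + 3.0 * ui + random.gauss(0, 1) for xi, ui in zip(x, u)]

naive = cov(x, y) / cov(x, x)   # regression slope of Y on X, confounded by u
iv = cov(i, y) / cov(i, x)      # instrumental variable estimate

print(round(naive, 2), round(iv, 2))
```

The naive slope should overshoot the assumed effect of 2 because `u` drives both X and Y, while the instrument-based ratio should land near 2.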

11
Q

Counterfactual

A

Used when there are no twins, instruments or policies to exploit
No exposure is a causal factor in itself, in isolation.

In this context, causality is thought of as a difference between two groups that are otherwise identical; it only exists as a contrast

Causality may only be derived as part of a well-defined contrast between one condition (e.g., exposure) and an alternative condition (e.g., no exposure), while holding everything else constant.

Causal contrasts can be estimated by using substitutes for the counterfactual condition.

To the extent that substitutes are equivalent to the factual condition in all aspects but the exposure (i.e., they are exchangeable with the counterfactual condition), substitutes can be used to infer causality.

In epidemiology, substitutes are generally either a population other than the target population during the same etiological period or the target population observed at a time other than the etiological period

12
Q

Cholera counterfactual case

A

In 1854, John Snow mapped cholera cases in Soho and found that most people who had died from cholera had drunk the water from the Broad Street Pump.

Snow argued for closing the pump; its handle was removed and the cholera outbreak ended.

Cannot actually be sure this was the cause because it was not a formal manipulation, but…

Observed outcome (Y), cholera deaths

Exposure (X), Water pump

Two conditions:
0. Pump is closed (X=0), no water is coming out, the neighbourhood is unexposed to the potentially contaminated water
1. Pump is left open (X=1), water is coming out the pump, the neighbourhood is exposed to the potentially contaminated water.

Condition 0 is what happened:
Observed outcome, cholera deaths (Y), when the exposure (X) was “closed” (X=0).
Y | X=closed = Y_{X=0} = Y_0

Unknown potential outcome (Y) that would have happened if the exposure (X) had been, counter to the fact, left “open” (X=1).
Y | X=open = Y_{X=1} = Y_1

Potential outcomes: To identify the causal effect of closing the pump on mortality, we would need to compare:
Y_0: The number of deaths (Y) when the pump was closed (X=0)
Y_1: The number of deaths (Y) when the pump was left open (X=1)

13
Q

FUNDAMENTAL PROBLEM OF CAUSAL INFERENCE

A

We can never know the potential outcome for a counterfactual exposure!
For each ‘unit of analysis’ (condition) we can only observe one potential outcome
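
A minimal sketch of the problem, assuming simulated units where both potential outcomes are known (which is only possible in a simulation) and a constant individual effect of 3. All values are hypothetical.

```python
import random
import statistics

random.seed(2)
n = 10_000

# Both potential outcomes per unit; real data never contain both.
y0 = [random.gauss(10, 2) for _ in range(n)]   # outcome if unexposed
y1 = [v + 3 for v in y0]                       # outcome if exposed

# Each unit reveals only the outcome for the exposure it actually received.
x = [random.randint(0, 1) for _ in range(n)]
observed = [b if xi == 1 else a for a, b, xi in zip(y0, y1, x)]

# Known here only because the data are simulated.
true_effect = statistics.mean(b - a for a, b in zip(y0, y1))

# With random assignment, the unexposed group substitutes for the counterfactual.
estimate = (statistics.mean(o for o, xi in zip(observed, x) if xi == 1)
            - statistics.mean(o for o, xi in zip(observed, x) if xi == 0))

print(round(true_effect, 2), round(estimate, 2))
```

No unit contributes both outcomes to `observed`, yet the group contrast should recover the average effect because the groups are exchangeable.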

14
Q

Exchangeability

A

Causal contrasts can be estimated by using substitutes for the counterfactual condition.
To the extent that substitutes are equivalent to the factual condition in all aspects but the exposure (i.e., they are exchangeable with the counterfactual condition), substitutes can be used to infer causality.
Different statistical approaches can be taken to improve equivalence/exchangeability.

15
Q

Statistical approaches for exchangeability

A

Stratification in regression
Propensity scores

16
Q

Directed Acyclic Graphs (DAGs)

A

Use DAGs to visualise your theoretical causal model, between exposure and outcome of interest, including all Confounders (C) and Mediators (M)
X – Smoking
Y – Lung cancer

Must also consider other possible factors that cause X and Y, known as confounders
These are unfortunately also associated with each other
Must also consider mediators
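
A minimal sketch of a DAG as a plain dictionary, assuming placeholder nodes `C` (a confounder) and `M` (a mediator) alongside X and Y. The helper names are hypothetical, and the confounder/mediator checks only look at direct arrows, which is a simplification.

```python
# Edges point from cause to effect; C and M are placeholder names.
dag = {
    "C": ["X", "Y"],   # confounder: X <- C -> Y
    "X": ["M", "Y"],   # exposure (e.g., smoking)
    "M": ["Y"],        # mediator: X -> M -> Y
    "Y": [],           # outcome (e.g., lung cancer)
}

def is_acyclic(graph):
    """Return True if no path leads from a node back to itself."""
    visiting, done = set(), set()

    def visit(node):
        if node in done:
            return True
        if node in visiting:      # reached a node already on the current path
            return False
        visiting.add(node)
        ok = all(visit(child) for child in graph[node])
        visiting.discard(node)
        done.add(node)
        return ok

    return all(visit(node) for node in graph)

# Direct common causes of X and Y, and direct X -> M -> Y intermediates.
confounders = [n for n, kids in dag.items() if "X" in kids and "Y" in kids]
mediators = [n for n in dag["X"] if "Y" in dag[n]]

print(is_acyclic(dag), confounders, mediators)
```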

17
Q

Confounders

A

Other possible factors that cause both X and Y
X <- C -> Y

18
Q

Mediators

A

anything caused by X that in turn causes Y
X -> M -> Y

19
Q

Moderators

A

a third variable that affects the strength of the relationship between a dependent and independent variable

20
Q

Stratification in regression models

A

Stratifying is sorting data or people into distinct layers, for example, male and female

Allows adjustment for confounders

Mediators are already included in the model
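
A minimal sketch of adjustment by stratification, assuming simulated data with one binary confounder `c` and an exposure that has no effect on the outcome. All names and numbers are hypothetical.

```python
import random
import statistics

random.seed(3)

rows = []
for _ in range(20_000):
    c = random.randint(0, 1)                                # binary confounder
    x = 1 if random.random() < (0.8 if c else 0.2) else 0   # c makes exposure likelier
    y = 5.0 * c + random.gauss(0, 1)                        # X has no effect on Y
    rows.append((c, x, y))

def diff_in_means(data):
    """Mean outcome in the exposed minus mean outcome in the unexposed."""
    exposed = [y for _, x, y in data if x == 1]
    unexposed = [y for _, x, y in data if x == 0]
    return statistics.mean(exposed) - statistics.mean(unexposed)

crude = diff_in_means(rows)

# Compare exposed and unexposed within each layer of c, then average the layers.
within = [diff_in_means([r for r in rows if r[0] == level]) for level in (0, 1)]
adjusted = statistics.mean(within)

print(round(crude, 2), round(adjusted, 2))
```

The crude contrast should suggest a sizeable effect that is entirely due to `c`; within strata the contrast should be close to zero.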

21
Q

Propensity scores

A

Attempts to use confounders to split the process into two stages:
Looking at everything that is related to X
Looking at everything that is related to Y

Can use this to create a propensity score, to see how likely a person is to have an exposure, and use that score as a covariate

It is a balancing score to allow comparisons between subjects with similar prior probabilities of experiencing a certain exposure.

Exchangeability is limited to the variables included in the model.

Exchangeability should be examined by testing if the distribution of confounding variables is similar between exposed and non-exposed with similar propensity scores (e.g., standardized mean difference).
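
A minimal sketch of estimating a propensity score and checking balance, assuming one simulated confounder `z` and a logistic model fitted by plain gradient ascent. The 0.45 to 0.55 score band and all names are hypothetical choices for illustration.

```python
import math
import random
import statistics

random.seed(4)
n = 4_000

# Simulated data: exposure is more likely when the confounder z is high.
z = [random.gauss(0, 1) for _ in range(n)]
x = [1 if random.random() < 1 / (1 + math.exp(-zi)) else 0 for zi in z]

# Fit logistic regression P(X=1 | z) by gradient ascent on the mean log-likelihood.
b0, b1 = 0.0, 0.0
for _ in range(200):
    p = [1 / (1 + math.exp(-(b0 + b1 * zi))) for zi in z]
    g0 = sum(xi - pi for xi, pi in zip(x, p)) / n
    g1 = sum((xi - pi) * zi for xi, pi, zi in zip(x, p, z)) / n
    b0, b1 = b0 + g0, b1 + g1

# Propensity score: fitted probability of exposure for each subject.
ps = [1 / (1 + math.exp(-(b0 + b1 * zi))) for zi in z]

def smd(a, b):
    """Standardized mean difference between two groups."""
    pooled_sd = math.sqrt((statistics.variance(a) + statistics.variance(b)) / 2)
    return (statistics.mean(a) - statistics.mean(b)) / pooled_sd

smd_all = smd([zi for zi, xi in zip(z, x) if xi == 1],
              [zi for zi, xi in zip(z, x) if xi == 0])

# Same check restricted to subjects with similar scores.
band = [(zi, xi) for zi, xi, pi in zip(z, x, ps) if 0.45 <= pi <= 0.55]
smd_band = smd([zi for zi, xi in band if xi == 1],
               [zi for zi, xi in band if xi == 0])

print(round(smd_all, 2), round(smd_band, 2))
```

The imbalance in `z` should shrink markedly among subjects with similar scores. Variables left out of the model get no such guarantee.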

22
Q

Two ways to do propensity scores

A

Stratification of Propensity Score
Propensity score matching

Both basically use all confounders to create a risk score to compare with the outcome, instead of one causal factor

23
Q

Stratification of Propensity Score

A

Subjects are stratified into subsets based on previously defined thresholds of the estimated propensity score (e.g., quintiles).

Within each propensity score stratum, treated and untreated subjects will have roughly similar values of the propensity score.

The average treatment effect is computed first within each stratum then pooled across strata.
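
A minimal sketch of the three steps, assuming simulated data with a true effect of 2 and using the simulation's true propensity score as a stand-in for an estimated one. All names and numbers are hypothetical.

```python
import math
import random
import statistics

random.seed(5)
n = 20_000

rows = []
for _ in range(n):
    z = random.gauss(0, 1)                       # confounder
    ps = 1 / (1 + math.exp(-z))                  # true score, stand-in for an estimate
    x = 1 if random.random() < ps else 0
    y = 2.0 * x + 3.0 * z + random.gauss(0, 1)   # assumed true effect of X: 2
    rows.append((ps, x, y))

def diff_in_means(data):
    """Mean outcome in the exposed minus mean outcome in the unexposed."""
    exposed = [y for _, x, y in data if x == 1]
    unexposed = [y for _, x, y in data if x == 0]
    return statistics.mean(exposed) - statistics.mean(unexposed)

crude = diff_in_means(rows)

# Sort by score and cut into quintiles (five equal-sized strata).
rows.sort(key=lambda r: r[0])
size = n // 5
strata = [rows[k * size:(k + 1) * size] for k in range(5)]

# Effect within each stratum, then pooled; equal sizes allow a simple average.
stratified = statistics.mean(diff_in_means(s) for s in strata)

print(round(crude, 2), round(stratified, 2))
```

The pooled estimate should sit much closer to 2 than the crude one. Some residual confounding typically remains inside each stratum, since scores within a quintile are similar rather than identical.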

24
Q

Propensity score matching

A

It consists of forming matched sets of exposed and non-exposed subjects with similar values of propensity score.

Matching is generally 1:1 with nearest neighbor, without or with replacement (non-exposed subject used once or more).

Compute average treatment effect (‘unbiased’ effect of exposure) as average of effect in matched pairs.
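
A minimal sketch of 1:1 nearest-neighbour matching with replacement, assuming simulated data with a true effect of 2 and the simulation's true propensity score as a stand-in for an estimated one. All names and numbers are hypothetical.

```python
import math
import random
import statistics

random.seed(8)

rows = []
for _ in range(2_000):
    z = random.gauss(0, 1)                       # confounder
    ps = 1 / (1 + math.exp(-z))                  # true score, stand-in for an estimate
    x = 1 if random.random() < ps else 0
    y = 2.0 * x + 3.0 * z + random.gauss(0, 1)   # assumed true effect of X: 2
    rows.append((ps, x, y))

exposed = [(ps, y) for ps, x, y in rows if x == 1]
unexposed = [(ps, y) for ps, x, y in rows if x == 0]

crude = (statistics.mean(y for _, y in exposed)
         - statistics.mean(y for _, y in unexposed))

# For each exposed subject, pick the non-exposed subject with the closest score.
# Matching is with replacement: a non-exposed subject may be used more than once.
pair_effects = []
for ps_e, y_e in exposed:
    _, y_match = min(unexposed, key=lambda u: abs(u[0] - ps_e))
    pair_effects.append(y_e - y_match)

matched = statistics.mean(pair_effects)

print(round(crude, 2), round(matched, 2))
```

Because every exposed subject is matched, this averages the effect over the exposed; with the constant effect assumed here, that coincides with the overall average effect.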

25
Q

Why is Missing Data a problem

A

Missing data is a threat to causal inference, regardless of study design.

Like confounders, specific patterns of missingness may generate artificial associations between exposure and outcome.

Missing data can introduce significant bias in the analysis by influencing the exchangeability/balance of the causal contrast

26
Q

Types of missing data

A

Missing completely at random (MCAR)

Missing at random (MAR)

Missing not at random (MNAR)
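
A minimal sketch contrasting the three mechanisms, assuming simulated data where X is always observed, only Y can be missing, and the true mean of Y is about 1. The missingness probabilities and helper names are hypothetical.

```python
import random
import statistics

random.seed(6)
n = 20_000

data = []
for _ in range(n):
    x = random.randint(0, 1)
    y = 2.0 * x + random.gauss(0, 1)   # mean of Y is about 1.0
    data.append((x, y))

def keep(rows, p_missing):
    """Drop each row with the probability returned by p_missing(x, y)."""
    return [(x, y) for x, y in rows if random.random() >= p_missing(x, y)]

def complete_case_mean(rows):
    """Mean of Y using only the rows that were not dropped."""
    return statistics.mean(y for _, y in rows)

def x_adjusted_mean(rows):
    """Observed mean of Y within each X group, weighted by full-sample X shares."""
    total = 0.0
    for level in (0, 1):
        share = sum(1 for x, _ in data if x == level) / len(data)
        total += share * statistics.mean(y for x, y in rows if x == level)
    return total

mcar = keep(data, lambda x, y: 0.3)                      # unrelated to X or Y
mar = keep(data, lambda x, y: 0.6 if x == 1 else 0.1)    # depends on observed X
mnar = keep(data, lambda x, y: 0.6 if y > 1 else 0.1)    # depends on Y itself

for name, rows in (("MCAR", mcar), ("MAR", mar), ("MNAR", mnar)):
    print(name, round(complete_case_mean(rows), 2), round(x_adjusted_mean(rows), 2))
```

Under MCAR the complete-case mean should stay near 1. Under MAR it should drift, but reweighting by X should bring it back. Under MNAR the estimate should stay off even after using X.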

27
Q

Missing completely at random (MCAR)

A

Missingness is completely unrelated to all other variables in the dataset

ex. all participants have the same likelihood of having missing data (a researcher lost some interview booklets)

28
Q

Missing at random (MAR)

A

Missingness is related to exposure X (or, more generally, to an observed variable)

Given X, missingness does not depend on Y (or, more generally, unobserved data).

Because missingness is related to X, exposed and unexposed groups are not balanced.

Analysis of complete observations would give a biased estimate of the effect of exposure.

However, it is possible to achieve balance by using information in X.

ex. victims of maltreatment are more likely to have missing data (drop out of the study).

29
Q

Missing not at random (MNAR)

A

Missingness is related to outcome Y (or, more generally, to an unobserved variable).

Missingness depends on Y (or, more generally, unobserved data).

Because missingness is related to Y, we do not know if exposed and unexposed groups are balanced.

Analysis of complete observations would give a biased estimate of the effect of exposure.

It is not possible to achieve balance by using information in X.

ex. individuals with depression are more likely to have missing data

30
Q

Methods to deal with missing data

A

Naive methods
List-wise deletion
Pair-wise deletion
Single imputation

Modern methods
Multiple imputation (explicit)
Maximum likelihood estimation (implicit).

We can only identify mechanisms of missingness based on X (observed data), that is, we can only discriminate between MCAR and MAR.

Multiple imputation and Maximum likelihood estimation assume MAR – they generate imputed values by considering how patterns of missingness relate to exposure X and other observed variables.

Neither method deals effectively with MNAR conditions, which are handled through structural equation modeling (maximum likelihood estimation) and sensitivity analysis under different assumptions.

31
Q

Pair-wise deletion

A

eliminates cases with missing values on an analysis-by-analysis basis [assumes MCAR].

31
Q

List-wise deletion

A

eliminates cases with missing values from all analyses [assumes MCAR].

32
Q

Single imputation

A

replaces all missing values with a single set of values, such as
(1) the arithmetic mean [assumes MCAR],
(2) a score at a prior assessment,
(3) a score from another individual with a similar set of background characteristics, or
(4) a predicted score from regression analysis [assumes MAR; does not include a measure of uncertainty for the imputed value, so the SE is unrealistically small].
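
A minimal sketch of why single imputation understates uncertainty, assuming simulated scores with about 40% missing completely at random and mean imputation (option 1). All numbers are hypothetical.

```python
import random
import statistics

random.seed(7)

scores = [random.gauss(100, 15) for _ in range(1_000)]

# Roughly 40% of the scores go missing completely at random.
observed = [s for s in scores if random.random() >= 0.4]
n_missing = len(scores) - len(observed)

# Mean imputation: every missing value is replaced by the observed mean.
mean_obs = statistics.mean(observed)
imputed = observed + [mean_obs] * n_missing

sd_observed = statistics.stdev(observed)
sd_imputed = statistics.stdev(imputed)

# Standard error of the mean, treating imputed values as if they were real data.
se_observed = sd_observed / len(observed) ** 0.5
se_imputed = sd_imputed / len(imputed) ** 0.5

print(round(sd_observed, 2), round(sd_imputed, 2))
print(round(se_observed, 3), round(se_imputed, 3))
```

The filled-in values add no spread but inflate the sample size, so both the SD and the SE shrink even though no new information was gained.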

33
Q

Multiple imputation (explicit)

A

It generates multiple filled-in datasets from an imputation algorithm, analyses each one, and pools (averages) the resulting estimates. Assumes MAR.
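
A minimal sketch of pooling with Rubin's rules, one standard way to combine results across imputed datasets. The five estimates and variances below are made-up numbers standing in for the results of five analyses.

```python
import statistics

# Hypothetical effect estimates and squared standard errors from 5 imputed datasets.
estimates = [2.0, 2.2, 1.9, 2.1, 2.3]
variances = [0.04, 0.04, 0.04, 0.04, 0.04]
m = len(estimates)

# Pooled estimate: the average across imputed datasets.
pooled_estimate = statistics.mean(estimates)

# Within-imputation variance: average of the per-dataset variances.
within = statistics.mean(variances)

# Between-imputation variance: how much the estimates disagree across datasets.
between = statistics.variance(estimates)

# Total variance adds the extra uncertainty caused by the missing data.
total_variance = within + (1 + 1 / m) * between
pooled_se = total_variance ** 0.5

print(round(pooled_estimate, 2), round(total_variance, 3), round(pooled_se, 3))
```

The between-imputation term is what single imputation leaves out, which is why its standard errors come out too small.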

34
Q

Maximum likelihood estimation (implicit)

A

It maximizes the likelihood of the observed multivariate data and uses the resulting distribution to produce imputed values. Uncertainty is accounted for within the likelihood function.