Missing data concepts and multiple imputation with Stata Flashcards
What are missing data?
Observations that could have been made but were not.
How can values come to be missing?
By design (intentionally), or unintentionally
In randomised controlled trials how are missing values most likely to occur?
In outcome variable(s)
- In terms of statistical analysis models, the response variables have missing values (MVs), while the explanatory variables tend to be fully observed.
- Randomisation groups, baseline values of the outcome variable and centres (for multi-centre trials) are typically fully available.
In observational studies the explanatory variables (covariates) are just as likely to contain what?
Missing values as the outcome measures, that is:
- the explanatory variable of interest may contain MVs
- or the covariates included in the model for the purpose of explaining background variability or adjustment (confounders) may contain MVs
Why can’t we simply analyse the observed data using an appropriate analysis method?
There are a number of potential problems:
- Estimation method no longer valid
- Loss of precision
- Departure from the intention-to-treat principle (RCTs)
- Lack of generalisability
What is the main issue with missing data?
Bias
What is the best solution to dealing with missing data?
Avoid it
A valid estimator is one that is what?
- unbiased for the parameter of interest
- and its precision (standard error) can be quantified
What is an example of providing a valid estimator?
If there were no missing data, fitting an ANCOVA model to an RCT with a before-after design would provide a valid estimator of the therapy difference.
What is a problem that affects the use of valid estimators?
In the presence of missing data, analysis methods that would provide valid estimators for the complete data do not necessarily provide valid estimators when applied to the observed data.
What issue may still persist even under circumstances where a valid estimator can be obtained from the observed data?
What is an example?
This might not be the most efficient estimator.
For example, as implemented in most software packages, repeated measures (M)ANOVA uses only subjects for which the response has been observed at all time points.
- Complete case analysis suffers a loss of precision since information from cases with partially observed multivariate responses (48%) is ignored
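The loss of information from complete case analysis can be illustrated with a small simulation. This is only a sketch on made-up data: the sample size, the number of time points and the 10% chance of a missing measurement are assumptions chosen for illustration, not figures from the course material.

```python
import random

random.seed(1)

N_SUBJECTS = 1000   # hypothetical sample size
N_TIMES = 4         # hypothetical number of repeated measurements
P_MISSING = 0.10    # hypothetical chance that any single measurement is missing

# Each subject is a list of responses; None marks a missing value.
data = [
    [None if random.random() < P_MISSING else random.gauss(50, 10)
     for _ in range(N_TIMES)]
    for _ in range(N_SUBJECTS)
]

# Complete case analysis keeps only subjects observed at every time point.
complete_cases = [row for row in data if None not in row]

observed_total = sum(v is not None for row in data for v in row)
observed_used = len(complete_cases) * N_TIMES

share_subjects_kept = len(complete_cases) / N_SUBJECTS
share_values_discarded = 1 - observed_used / observed_total

print(f"subjects kept by complete case analysis: {share_subjects_kept:.1%}")
print(f"observed measurements thrown away:       {share_values_discarded:.1%}")
```

Even a modest rate of missingness per time point removes a sizeable share of subjects, and every discarded subject takes their observed measurements with them.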
What do we ideally want?
An analysis method that provides valid inferences in the presence of missing data and uses all the available information.
What does the intention to treat (ITT) principle refer to?
A type of analysis specific to RCTs. It states that all subjects should be analysed as part of the treatment group to which they were originally assigned, irrespective of the level of treatment received and protocol adherence.
What is the purpose of the intention to treat principle?
The principle is aimed at maintaining the benefits of randomisation, that is, avoiding confounding of the group effect (i.e. avoiding selection bias).
What leads to a departure from the ITT principle which can introduce selection bias?
Missing values
Example:
- Patients with less chronic mental illness may be less likely to adhere to intensive management, and are then more likely to be lost to follow-up in this group.
- If more chronic cases also tend to have more psychopathology and intensive therapy is beneficial, then the group difference will tend to be underestimated when based on the observed data.
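The example above can be sketched as a simulation. Everything numeric here is an assumption made for illustration: a symptom score where higher is worse, a true benefit of 5 points for intensive therapy, chronic cases scoring 10 points higher, and a 60% dropout rate among less chronic patients in the intensive group against 10% for everyone else.

```python
import random
from statistics import mean

random.seed(2)

N_PER_GROUP = 10000
TRUE_EFFECT = -5.0  # hypothetical: intensive therapy lowers the symptom score by 5

def simulate_patient(intensive):
    """Return (symptom score, observed at follow-up?) for one patient."""
    chronic = random.random() < 0.5
    score = 50 + 10 * chronic + TRUE_EFFECT * intensive + random.gauss(0, 5)
    # Assumed dropout pattern: less chronic patients in the intensive
    # group are far more likely to be lost to follow-up.
    p_drop = 0.6 if (intensive and not chronic) else 0.1
    observed = random.random() >= p_drop
    return score, observed

intensive_group = [simulate_patient(True) for _ in range(N_PER_GROUP)]
standard_group = [simulate_patient(False) for _ in range(N_PER_GROUP)]

def group_difference(treated, control, observed_only):
    """Mean score difference (treated minus control)."""
    t = [s for s, obs in treated if obs or not observed_only]
    c = [s for s, obs in control if obs or not observed_only]
    return mean(t) - mean(c)

full_diff = group_difference(intensive_group, standard_group, observed_only=False)
observed_diff = group_difference(intensive_group, standard_group, observed_only=True)

print(f"difference using everyone randomised: {full_diff:.2f}")
print(f"difference using observed cases only: {observed_diff:.2f}")
```

Under these assumptions the observed intensive group over-represents chronic cases, so the observed difference sits closer to zero than the difference among all randomised patients.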
What is generalisability?
Extent to which study results apply to the target population.
What can missing data affect?
The generalisability of the results from a trial or an observational study.
What is an example of this in an RCT?
Suppose the most severely ill were the most likely to be lost to follow-up (in both randomisation groups).
Then the observed results would be representative of a population in which the less severely ill are over-represented.
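A rough simulation of this scenario shows how the observed sample drifts away from the target population. The 0 to 10 severity scale, the cut-off for "severe" and the dropout probabilities are hypothetical values chosen for illustration.

```python
import random
from statistics import mean

random.seed(3)

N_PATIENTS = 5000

# Hypothetical severity score, uniform between 0 (mild) and 10 (severe).
severity = [random.uniform(0, 10) for _ in range(N_PATIENTS)]

# Assumed dropout pattern: the chance of being lost to follow-up
# rises linearly with severity, up to 80% for the most severe.
observed = [s for s in severity if random.random() >= 0.08 * s]

def share_severe(scores, cutoff=7.0):
    """Proportion of patients whose severity is above the cut-off."""
    return sum(s > cutoff for s in scores) / len(scores)

mean_all = mean(severity)
mean_observed = mean(observed)
severe_all = share_severe(severity)
severe_observed = share_severe(observed)

print(f"mean severity, all randomised: {mean_all:.2f}")
print(f"mean severity, observed only:  {mean_observed:.2f}")
print(f"share severe, all randomised:  {severe_all:.1%}")
print(f"share severe, observed only:   {severe_observed:.1%}")
```

The patients who remain are milder on average, so results based on them describe a milder population than the one the trial set out to study.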
What is the aim of data analysis?
Inference for a target population.
All data analyses are based on model assumptions about what?
- the target population
- the sampling process
What sampling method is typically used?
Random sampling
When data are missing and analyses are based on observed data, further assumptions are being made for what reason?
To describe how the observed data came about