Missing data concepts and multiple imputation with Stata Flashcards
What are missing data?
Observations that could have been made but were not.
How can values be missed?
By design (intentionally), or unintentionally
In randomised controlled trials how are missing values most likely to occur?
In outcome variable(s)
- that is, in terms of statistical analysis models, the response variables have MVs, while the explanatory variables tend to be fully observed.
- Randomisation groups, baseline values of the outcome variable and centres (for multi-centre trials) are typically fully available.
In observational studies the explanatory variables (covariates) are just as likely to contain what?
Missing values as the outcome measures, that is:
- the explanatory variable of interest may contain MVs
- or the covariates included in the model for the purpose of explaining background variability or adjustment (confounders) may contain MVs
Why can’t we simply analyse the observed data using an appropriate analysis method?
There are a number of potential problems:
- Estimation method no longer valid
- Loss of precision
- Departure from the intention-to-treat principle (RCTs)
- Lack of generalisability
What is the main issue with missing data?
Bias
What is the best solution to dealing with missing data?
Avoid it
A valid estimator is one that is what?
- unbiased for the parameter of interest
- and its precision (standard error) can be quantified
What is an example of providing a valid estimator?
If there were no missing data, then fitting an ANCOVA model to an RCT that used a before-after design would provide a valid estimator for the therapy difference.
What is a problem that affects the use of valid estimators?
In the presence of missing data, analysis methods that would provide valid estimators for the complete data do not necessarily provide valid estimators when applied to the observed data.
What issue may still persist even under circumstances where a valid estimator can be obtained from the observed data?
What is an example?
This might not be the most efficient estimator.
For example, as implemented in most software packages, repeated measures (M)ANOVA uses only subjects for which the response has been observed at all time points.
- Complete case analysis suffers a loss of precision since information from cases with partially observed multivariate responses (48%) is ignored
What do we ideally want?
An analysis method that provides valid inferences in the presence of missing data and uses all the available information.
What does the intention to treat (ITT) principle refer to?
A type of analysis specific to RCTs. It states that all subjects should be analysed as part of the treatment group to which they were originally assigned, irrespective of the level of treatment received and of protocol adherence.
What is the purpose of the intention to treat principle?
To maintain the benefits of randomisation, that is, to avoid confounding of the group effect (i.e. to avoid selection bias).
What leads to a departure from the ITT principle which can introduce selection bias?
Missing values
Example:
* Less chronically mentally ill patients may be less likely to adhere to intensive management and are then more likely to be lost to follow-up in this group.
- If more chronic cases also tend to have more psychopathology and intensive therapy is beneficial, then the group difference will tend to be underestimated based on the observed data.
What is generalisability?
Extent to which study results apply to the target population.
What can missing data affect?
The generalisability of the results from a trial or an observational study.
What is an example of missing data affecting generalisability in an RCT?
Suppose the most severely ill were most likely to be lost to follow-up (in both randomisation groups)
Then the observed results would be representative of a population in which the less severely ill are over-represented
What is the aim of data analysis?
Inference for a target population.
All data analyses are based on model assumptions about what?
target population
sampling process
What sampling method is typically used?
Random sampling
When data are missing and analyses are based on observed data, further assumptions are being made for what reason?
To describe how the observed data came about
Formally, the missing value generating mechanism is the probability of what?
The missing value pattern given the values taken by the (later observed or missing) observations.
Under MCAR:
- What does the probability of a missing value pattern not depend on?
- The observed data are in fact what, and what is this mechanism also known as?
- Any observed or unobserved measurements or characteristics
- A random sample of the intended measures.
Examples:
* A lab sample is dropped.
* The interviewer overlooks a question by accident.
The mechanism is also known as uniform non-response.
A complete case analysis, albeit less precise, remains valid.
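A minimal simulated sketch of the MCAR case, assuming made-up variable names (score, lost, score_obs) and an arbitrary 30% loss rate. Because the losses are unrelated to any measurement, the observed values behave like a random subsample.

```stata
* Illustration only: simulated data, hypothetical names and values
clear
set seed 12345
set obs 5000
generate score = rnormal(50, 10)          // the intended measurements
generate byte lost = runiform() < 0.3     // loss unrelated to any variable (MCAR)
generate score_obs = score if !lost       // what is actually observed
summarize score                           // full-data summary
summarize score_obs                       // complete-case summary: fewer observations, same target
```

Comparing the two summaries shows the loss of precision (smaller N) without a systematic shift in the mean.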
Under MAR, P(miss) may depend on what?
Some observed characteristics, but conditional on these characteristics the probability of a missing pattern does not depend on unobserved data (in other words, it is MCAR, or random, within classes of the observed characteristics).
Example:
- The chance of missing a language test differs between boys and girls
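The language test example can be sketched with simulated data. The variable names (girl, lang, miss, lang_obs), effect sizes and missingness rates are assumptions chosen for illustration only.

```stata
* Illustration only: simulated data, hypothetical names and values
clear
set seed 2024
set obs 5000
generate byte girl = runiform() < 0.5
generate lang = 50 + 5*girl + rnormal(0, 10)            // test score, related to sex
generate byte miss = runiform() < cond(girl, 0.1, 0.4)  // MAR: P(miss) depends only on observed sex
generate lang_obs = lang if !miss
summarize lang                     // full-data summary
summarize lang_obs                 // complete cases over-represent girls
bysort girl: summarize lang_obs    // within sex, missingness is completely at random
```

The overall complete-case summary mixes the two groups in the wrong proportions, while the within-group summaries are unaffected apart from sample size.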
What are MAR and MCAR referred to as?
Non-informative or ignorable mechanisms, because if MCAR or MAR holds, analyses can ignore P(miss)
What does MCAR stand for?
Missing completely at random
What does MAR stand for?
Missing at random
Under MNAR (missing not at random), even after considering the information in the observed data, the reason for a value being missing still depends on what?
The unseen observations.
Example:
Patients miss their hospital appointments because their condition has deteriorated under treatment.
Such a mechanism is referred to as an informative MV mechanism.
No mainstream software exists to deal with MNAR data.
True or false
TRUE
If the data are missing by design then what do we know?
The mechanism by which they were generated (MCAR or MAR).
If the data are not missing by design then what do we have to choose between?
An informative and a non-informative missingness mechanism, on theoretical (subject-matter) grounds.
We can observe all the variables that drive missingness under MNAR.
FALSE
We can never observe all the variables that drive missingness under MNAR.
It is not possible to determine empirically whether the mechanism by which missing values are generated is informative or not.
TRUE
How can we look at departures from MCAR?
By assessing whether any (fully) observed variables are associated with the MV mechanism.
What methods can be used to assess departures from MCAR?
Formal or informal methods may be used.
E.g. for a set of fully observed baseline variables
- compare summaries of baseline variables between subjects with different MV patterns
- plot respective summaries against MV patterns
- model probabilities of MV patterns as function of baseline variables and test their effects
e.g. using a logistic regression model, where the response is an indicator variable of missingness in the variable of interest, coded 1 for observed values and 0 for missing values.
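The informal and formal checks above might look as follows. This is a sketch on simulated data; the variables (age, female, bmi, observed) and the way missingness is generated are assumptions for illustration.

```stata
* Illustration only: simulated data, hypothetical names and values
clear
set seed 99
set obs 1000
generate age = rnormal(45, 12)
generate byte female = runiform() < 0.5
generate bmi = 25 + 0.05*age + rnormal(0, 4)
replace bmi = . if runiform() < invlogit(-3 + 0.04*age)  // missingness depends on age
generate byte observed = !missing(bmi)    // 1 = observed, 0 = missing
tabulate observed, summarize(age)         // informal: compare baseline summaries by MV pattern
logit observed age female                 // formal: model P(observed) from baseline variables
test age female                           // joint Wald test of the baseline effects
```

An association between the baseline variables and the indicator is evidence against MCAR; it cannot distinguish MAR from MNAR.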
What is multiple imputation?
A three-step process which helps us to analyse data that are missing at random.
Before we carry out multiple imputation, what do we have to find?
Fully observed variables that are correlated with the partially observed data (and maybe with P(miss)).
What is step 1 of a multiple imputation?
We multiply impute (fill-in) the missing values with values randomly drawn from a distribution
- We do this by deploying what is called the imputation model.
- The imputation model relies on the correlates of the incomplete data.
- We create several different multiply imputed datasets.
What is step 2 of a multiple imputation?
We analyse each imputed dataset separately and obtain estimates for the quantities of interest
What is step 3 of a multiple imputation?
We combine the multiple estimates from step 2
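The combination in step 3 is conventionally done with Rubin's rules, which the cards do not spell out. As a sketch, with notation introduced here for illustration: let $\hat\theta_m$ be the estimate and $U_m$ its estimated variance from imputed dataset $m = 1, \dots, M$.

```latex
\begin{align*}
\bar\theta &= \frac{1}{M}\sum_{m=1}^{M}\hat\theta_m
  && \text{(pooled estimate)}\\
\bar U &= \frac{1}{M}\sum_{m=1}^{M} U_m
  && \text{(within-imputation variance)}\\
B &= \frac{1}{M-1}\sum_{m=1}^{M}\bigl(\hat\theta_m-\bar\theta\bigr)^2
  && \text{(between-imputation variance)}\\
T &= \bar U + \Bigl(1+\frac{1}{M}\Bigr)B
  && \text{(total variance of } \bar\theta\text{)}
\end{align*}
```

The between-imputation term is what carries the extra uncertainty due to the missing values into the pooled standard error $\sqrt{T}$.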
What steps of analysis are involved in a multiple imputation?
- Setup
- Imputation
- Analysis
- Combining
- Postestimation
- Importing
- Data management
What is involved in the setup of a multiple imputation?
Choose an mi style (how imputations are stored)
* wide
* mlong
* flong
* flongsep
Register variables
* mi register imputed bmi
* mi register regular attack smokes age hsgrad female
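A minimal sketch of the setup step. The data are simulated here so the block runs on its own; the variable names follow the cards, but the values, sample size and missingness rate are assumptions.

```stata
* Illustration only: simulated data standing in for a real dataset
clear
set seed 1
set obs 200
generate age = rnormal(55, 10)
generate byte female = runiform() < 0.3
generate bmi = 25 + rnormal(0, 4)
replace bmi = . if runiform() < 0.2    // bmi is the only incomplete variable
mi set mlong                           // choose how imputations are stored
mi register imputed bmi                // variable(s) to be imputed
mi register regular age female         // complete variables, never imputed
mi describe                            // confirm style and registered variables
mi misstable summarize                 // counts of missing values per variable
```

The mlong style is chosen here only as an example; it stores just the observations with missing values for each imputation, which keeps the file small.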
What is the imputation step of a multiple imputation dependent upon?
Pattern and type of data.
There are different methods for univariate (situations where we only wish to impute one variable), monotone and arbitrary missing data patterns.
What variable type follows a univariate pattern and what imputation method is used during the imputation stage?
Continuous- regress, pmm, truncreg, intreg
Binary- logit
Categorical- ologit, mlogit
Count- poisson, nbreg
What variable type follows a monotone pattern and what imputation method is used during the imputation stage?
Mixture- monotone
What variable type follows an arbitrary pattern and what imputation method is used during the imputation stage?
Continuous- mvn
Mixture- chained
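For the univariate pattern, a sketch of imputing one continuous variable with a linear regression imputation model. The data are simulated and the number of imputations and seeds are arbitrary assumptions.

```stata
* Illustration only: simulated data, hypothetical values and seeds
clear
set seed 321
set obs 300
generate age = rnormal(55, 10)
generate byte female = runiform() < 0.3
generate bmi = 25 + 0.03*age + rnormal(0, 4)
replace bmi = . if runiform() < 0.2    // one incomplete variable: univariate pattern
mi set mlong
mi register imputed bmi
mi register regular age female
* Continuous variable, so a linear regression imputation model is used
mi impute regress bmi age female, add(20) rseed(2232)
* Alternative for continuous data, predictive mean matching, e.g.
*   mi impute pmm bmi age female, add(20) knn(5)
```

For a binary or count variable the same call would swap regress for logit or poisson, as listed in the cards above.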
Multiple imputation using chained equations (ICE) is performed by what?
mi impute chained
How are variables imputed using chained equations (ICE)?
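Variables are imputed iteratively using conditional univariate imputation models. Conditional means that each of the variables to be imputed is regressed on the other variables in the imputation model (by default, the other imputation variables plus the fully observed variables).
By default, Stata imputes the variables in order from the most observed to the least observed, i.e. the variable with the fewest missing values first.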
What does inserting ‘regress’ before a variable we are imputing ensure?
Variable is imputed as continuous
What does inserting ‘logit’ before a variable we are imputing ensure?
Variable is imputed as binary
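A sketch of chained-equations imputation with one continuous and one binary incomplete variable. The data are simulated, and all values, rates and seeds are assumptions; the method given in parentheses before each variable sets its imputation model.

```stata
* Illustration only: simulated data, hypothetical values and seeds
clear
set seed 4711
set obs 500
generate age = rnormal(55, 10)
generate byte female = runiform() < 0.3
generate bmi = 25 + 0.03*age + rnormal(0, 4)
generate byte smokes = runiform() < invlogit(-1 + 0.02*age)
replace bmi = . if runiform() < 0.15       // arbitrary pattern: two variables
replace smokes = . if runiform() < 0.10    // are missing independently
mi set mlong
mi register imputed bmi smokes
mi register regular age female
* bmi imputed as continuous, smokes as binary; age and female are complete predictors
mi impute chained (regress) bmi (logit) smokes = age female, add(10) rseed(1234)
```

Variables to the right of the equals sign are complete predictors used in every conditional model.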
How can we analyse the imputed data in Stata?
mi estimate: estimation_command
* regress - Linear regression
* logit - Logistic regression
* poisson - Poisson regression
* stcox - Cox proportional hazards model
* glm - generalised linear model
* xtreg - Fixed- and random-effects linear regression
* mixed - Multilevel mixed-effects linear regression
* svy: Estimation commands for survey data
For a full list type help mi estimate
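Putting imputation and analysis together, a sketch with a logistic analysis model. The data are simulated with variable names borrowed from the cards; coefficients, rates and seeds are assumptions. Note that the outcome is included in the imputation model.

```stata
* Illustration only: simulated data, hypothetical values and seeds
clear
set seed 777
set obs 500
generate age = rnormal(55, 10)
generate byte smokes = runiform() < 0.4
generate bmi = 25 + 0.03*age + rnormal(0, 4)
generate byte attack = runiform() < invlogit(-4 + 0.5*smokes + 0.03*age + 0.05*bmi)
replace bmi = . if runiform() < 0.2
mi set mlong
mi register imputed bmi
mi register regular attack smokes age
mi impute regress bmi attack smokes age, add(20) rseed(2232)
* Fits the logit model in each imputed dataset and pools the results
mi estimate: logit attack smokes age bmi
```

The output reports the pooled coefficients and standard errors along with the number of imputations used.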
What does Stata classify all commands that can be used after main analysis as?
Postestimation
What are examples of post-estimation analysis?
Transformation
What concepts do not have a clear interpretation within the multiple imputation framework and therefore are not directly applicable to multiple imputation results?
Likelihood-based concepts, e.g. likelihood-ratio tests
What are examples of post estimation analyses?
Transformation - specify the transformation within mi estimate, so that it is calculated within each imputed dataset before the results are combined.
Include as many transformations as needed by giving each one an arbitrary name (e.g. diff) followed by an expression (e.g. the difference between the parameter estimates for smokes and bmi).
Joint testing - test whether more than one term has a significant effect on the outcome when tested jointly (mi test).
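A sketch of both postestimation tasks on simulated data. The name diff and the expression follow the card; all data values and seeds are assumptions, and the model is refitted without the transformation before the joint test purely to keep the two steps separate.

```stata
* Illustration only: simulated data, hypothetical values and seeds
clear
set seed 888
set obs 500
generate age = rnormal(55, 10)
generate byte smokes = runiform() < 0.4
generate bmi = 25 + 0.03*age + rnormal(0, 4)
generate byte attack = runiform() < invlogit(-4 + 0.5*smokes + 0.03*age + 0.05*bmi)
replace bmi = . if runiform() < 0.2
mi set mlong
mi register imputed bmi
mi register regular attack smokes age
mi impute regress bmi attack smokes age, add(20) rseed(99)
* Transformation named diff, evaluated in each imputed dataset and then pooled
mi estimate (diff: _b[smokes] - _b[bmi]): logit attack smokes age bmi
* Joint test of two terms after a plain mi estimate
mi estimate: logit attack smokes age bmi
mi test smokes bmi
```

Several named transformations can be listed, each in its own parentheses, before the colon.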