Missing data concepts and multiple imputation with Stata Flashcards

Question 1

Q

What are missing data?

Answer

A

Observations that could have been made but were not.

Question 2

Q

How can values come to be missing?

Answer

A

By design (intentionally), or unintentionally

Question 3

Q

In randomised controlled trials, where are missing values most likely to occur?

Answer

A

In the outcome variable(s).

That is, in terms of statistical analysis models, the response variables have missing values (MVs), while the explanatory variables tend to be fully observed.
Randomisation groups, baseline values of the outcome variable, and centres (for multi-centre trials) are typically fully available.

Question 4

Q

In observational studies the explanatory variables (covariates) are just as likely to contain what?

Answer

A

Missing values as the outcome measures, that is:

* the explanatory variable of interest may contain MVs, or
* the covariates included in the model to explain background variability or to adjust for confounders may contain MVs.

Question 5

Q

Why can’t we simply analyse the observed data using an appropriate analysis method?

Answer

A

There are a number of potential problems:

Estimation method no longer valid
Loss of precision
Departure from the intention-to-treat principle (RCTs)
Lack of generalisability

Question 6

Q

What is the main issue with missing data?

Question 7

Q

What is the best solution to dealing with missing data?

Question 8

Q

A valid estimator is one that is what?

Answer

A

unbiased for the parameter of interest
and its precision (standard error) can be quantified

Question 9

Q

What is an example of providing a valid estimator?

Answer

A

If there were no missing data, then fitting an ANCOVA model to an RCT with a before-after design would provide a valid estimator for the therapy difference.

Question 10

Q

What is a problem that affects the use of valid estimators?

Answer

A

In the presence of missing data, analysis methods that would provide valid estimators for the complete data do not necessarily provide valid estimators when applied to the observed data.

Question 11

Q

What issue may still persist even under circumstances where a valid estimator can be obtained from the observed data?

What is an example?

Answer

A

This might not be the most efficient estimator.

For example, as implemented in most software packages, repeated measures (M)ANOVA uses only subjects for which the response has been observed at all time points.

A complete case analysis suffers a loss of precision because information from cases with partially observed multivariate responses (48%) is ignored.

Question 12

Q

What do we ideally want?

Answer

A

An analysis method that provides valid inferences in the presence of missing data and uses all the available information.

Question 13

Q

What does the intention to treat (ITT) principle refer to?

Answer

A

A principle specific to RCTs, which states that all subjects should be analysed as part of the treatment group to which they were originally assigned, irrespective of the level of treatment received and of protocol adherence.

Question 14

Q

What is the purpose of the intention to treat principle?

Answer

A

The principle aims to maintain the benefits of randomisation, that is, to avoid confounding of the group effect (i.e. to avoid selection bias).

Question 15

Q

What leads to a departure from the ITT principle which can introduce selection bias?

Answer

A

Missing values

Example:
* Less chronically mentally ill patients may be less likely to adhere to intensive management and are then more likely to be lost to follow-up in this group.

If more chronic cases also tend to have more psychopathology and intensive therapy is beneficial, then the group difference will tend to be underestimated based on the observed data.

Question 16

Q

What is generalisability?

Answer

A

Extent to which study results apply to the target population.

Question 17

Q

What can missing data affect?

Answer

A

The generalisability of the results from a trial or an observational study.

Question 18

Q

What is an example of missing data affecting generalisability in an RCT?

Answer

A

Suppose the most severely ill were most likely to be lost to follow-up (in both randomisation groups).

Then the observed results would be representative of a population in which the less severely ill are over-represented.

Question 19

Q

What is the aim of data analysis?

Answer

A

Inference for a target population.

Question 20

Q

All data analyses are based on model assumptions about what?

Answer

A

target population

sampling process

Question 21

Q

What sampling method is typically used?

Answer

A

Random sampling

Question 22

Q

When data are missing and analyses are based on the observed data, further assumptions are being made for what reason?

Answer

A

To describe how the observed data came about

Question 23

Q

Formally, the missing value generating mechanism is the probability of what?

Answer

A

The missing value pattern, given the values taken by the (later observed or missing) observations.

Question 24

Q

If data are missing completely at random (MCAR), what does the probability of a missing value pattern not depend on?
What are the observed data in that case, and what is this mechanism also known as?

Answer

A

Any observed or unobserved measurements or characteristics
A random sample of the intended measures.

Examples:
* A lab sample is dropped.
* The interviewer overlooks a question by accident.

The mechanism is also known as uniform non-response.

A complete case analysis, albeit less precise, remains valid.

Question 25

Q

If data are missing at random (MAR), what may P(miss) depend on?

Answer

A

Some observed characteristics, but conditional on these characteristics the probability of a missing value pattern does not depend on unobserved data (in other words, the data are missing completely at random within classes of the observed characteristics).

Example:
* The chance of missing a language test differs between boys and girls.

Question 26

Q

What are MAR and MCAR referred to as?

Answer

A

Non-informative or ignorable mechanisms, because if MCAR or MAR holds, analyses can ignore P(miss)

Question 27

Q

What does MCAR stand for?

Answer

A

Missing Completely at Random

Question 28

Q

What does MAR stand for?

Answer

A

Missing at Random

Question 29

Q

Even after considering the information in the observed data, the reason for a value being missing still depends on what?

Answer

A

The unseen observations.

Example:
Patients miss their hospital appointments because the treatment has caused their condition to deteriorate.

Such a mechanism is referred to as an informative MV mechanism, or missing not at random (MNAR).
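The three mechanisms described in Questions 24, 25 and 29 can be written compactly. This is a sketch using notation that does not appear in the cards and is assumed here for illustration: R is the missing value pattern, and Y_obs and Y_mis are the observed and unobserved parts of the data.

```latex
% Assumed notation: R = missing value pattern,
% Y_obs = observed values, Y_mis = values that are missing
\begin{align*}
\text{MCAR:} \quad & P(R \mid Y_{\mathrm{obs}}, Y_{\mathrm{mis}}) = P(R) \\
\text{MAR:}  \quad & P(R \mid Y_{\mathrm{obs}}, Y_{\mathrm{mis}}) = P(R \mid Y_{\mathrm{obs}}) \\
\text{MNAR:} \quad & P(R \mid Y_{\mathrm{obs}}, Y_{\mathrm{mis}}) \text{ still depends on } Y_{\mathrm{mis}}
\end{align*}
```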

Question 30

Q

No mainstream software exists to deal with MNAR data.

True or false

Question 31

Q

If the data are missing by design then what do we know?

Answer

A

The mechanism by which they were generated (MCAR or MAR).

Question 32

Q

If the data are not missing by design then what do we have to choose between?

Answer

A

Between an informative and a non-informative missingness mechanism, on theoretical (subject-matter) grounds.

Question 33

Q

We can observe all the variables that drive missingness under MNAR.

Answer

A

FALSE

We can never observe all the variables that drive missingness under MNAR.

Question 34

Q

It is not possible to determine empirically whether the mechanism by which the missing values were generated is informative (MNAR) or not.

Question 35

Q

How can we look at departures from MCAR?

Answer

A

By assessing whether any (fully) observed variables are associated with the MV mechanism.

Question 36

Q

What methods can be used to assess departures from MCAR?

Answer

A

Formal or informal methods may be used.

E.g. for a set of fully observed baseline variables:

* compare summaries of baseline variables between subjects with different MV patterns
* plot respective summaries against MV patterns
* model probabilities of MV patterns as a function of baseline variables and test their effects, e.g. using a logistic regression model where the response is an indicator variable of missingness in the variable of interest, coded 1 for observed values and 0 for missing values.
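The checks above might be sketched in Stata as follows. This is an illustration only: it assumes Stata's example dataset mheart0 (loaded with webuse), in which bmi is partially observed, and the indicator name bmi_obs is made up for the example.

```stata
* Assumption: example dataset mheart0, in which bmi has missing values
webuse mheart0, clear

* Count the missing values in each variable
misstable summarize

* Indicator of missingness in bmi: 1 = observed, 0 = missing
generate byte bmi_obs = !missing(bmi)

* Compare summaries of fully observed variables between the two groups
tabstat age smokes female hsgrad, by(bmi_obs) statistics(mean sd n)

* Model the probability of bmi being observed as a function of observed variables
logit bmi_obs attack smokes age female hsgrad

* Joint test of all covariate effects; a small p-value suggests a departure from MCAR
test attack smokes age female hsgrad
```

A non-significant result does not prove MCAR; it only means that no association with these particular observed variables was detected.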

Question 37

Q

What is multiple imputation?

Answer

A

A three-step process which helps us to analyse data that are missing at random.

Question 38

Q

Before we carry out multiple imputation, what do we have to find?

Answer

A

Fully observed variables that are correlated with the partially observed data (and maybe with P(miss)).

Question 39

Q

What is step 1 of a multiple imputation?

Answer

A

We multiply impute (fill in) the missing values with values randomly drawn from a distribution.

We do this by deploying what is called the imputation model.
The imputation model relies on the correlates of the incomplete data.
We create several different imputed datasets.

Question 40

Q

What is step 2 of a multiple imputation?

Answer

A

We analyse each imputed dataset separately and obtain estimates for the quantities of interest.

Question 41

Q

What is step 3 of a multiple imputation?

Answer

A

We combine the multiple estimates from step 2
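The card does not say how the estimates are combined. Assuming the standard combination rules (Rubin's rules), the pooling could be written as below. The symbols are introduced here for illustration: M is the number of imputed datasets, and each dataset m gives an estimate with its squared standard error U_m.

```latex
% Assumed notation: \hat{\theta}_m = estimate from imputed dataset m,
% U_m = its estimated variance (squared standard error), m = 1, ..., M
\begin{align*}
\bar{\theta} &= \frac{1}{M} \sum_{m=1}^{M} \hat{\theta}_m
  && \text{(pooled estimate)} \\
W &= \frac{1}{M} \sum_{m=1}^{M} U_m
  && \text{(within-imputation variance)} \\
B &= \frac{1}{M-1} \sum_{m=1}^{M} \bigl(\hat{\theta}_m - \bar{\theta}\bigr)^2
  && \text{(between-imputation variance)} \\
T &= W + \Bigl(1 + \frac{1}{M}\Bigr) B
  && \text{(total variance of } \bar{\theta}\text{)}
\end{align*}
```

The pooled standard error is the square root of T, so it reflects both sampling variability and the extra uncertainty due to the missing values.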

Question 42

Q

What steps of analysis are involved in a multiple imputation?

Answer

A

Setup
Imputation
Analysis
Combining
Postestimation

Importing
Data management

Question 43

Q

What is involved in the setup of a multiple imputation?

Answer

A

Choose an mi style (how imputations are stored)
* wide
* mlong
* flong
* flongsep

Register variables
* mi register imputed bmi
* mi register regular attack smokes age hsgrad female
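Put together, the setup step might look like the sketch below. It assumes Stata's example dataset mheart0 (loaded with webuse), which contains the variables named on this card; the choice of the mlong style is arbitrary.

```stata
* Assumption: example dataset mheart0, in which only bmi has missing values
webuse mheart0, clear

* Choose a storage style; mlong stores only the incomplete observations in each imputation
mi set mlong

* Declare the variable whose missing values are to be imputed
mi register imputed bmi

* Declare the variables that are not to be imputed
mi register regular attack smokes age hsgrad female

* Summarise the mi settings
mi describe
```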

Question 44

Q

What is the imputation step of a multiple imputation dependent upon?

Answer

A

Pattern and type of data.

There are different methods for univariate (situations where we only wish to impute one variable), monotone and arbitrary data.

Question 45

Q

What variable type follows a univariate pattern and what imputation method is used during the imputation stage?

Answer

A

Continuous- regress, pmm, truncreg, intreg

Binary- logit

Categorical- ologit, mlogit

Count- poisson, nbreg
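As an illustration of the univariate case, a continuous variable could be imputed with the regress method as sketched here. The example dataset mheart0, the number of imputations and the seed are all assumptions made for the example.

```stata
* Assumption: example dataset mheart0, in which only bmi has missing values
webuse mheart0, clear
mi set mlong
mi register imputed bmi

* Impute the continuous variable bmi by linear regression on the complete variables;
* add(20) creates 20 imputations, rseed() makes the random draws reproducible
mi impute regress bmi attack smokes age hsgrad female, add(20) rseed(2024)
```

The other univariate methods on this card follow the same pattern, with the method name swapped for one that suits the variable type.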

Question 46

Q

What variable type follows a monotone pattern and what imputation method is used during the imputation stage?

Answer

A

Mixture- monotone
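A monotone pattern means the incomplete variables can be ordered so that a missing value in one implies missing values in all later ones. The sketch below creates such a pattern artificially, purely for illustration, in the assumed example dataset mheart0 and then uses mi impute monotone.

```stata
* Assumption: example dataset mheart0, in which only bmi has missing values
webuse mheart0, clear
set seed 2024

* Illustration only: make age missing for about half of the cases that already lack bmi,
* so that a missing age always implies a missing bmi (a monotone pattern)
replace age = . if missing(bmi) & runiform() < 0.5

mi set mlong
mi register imputed age bmi

* Check that the missing value pattern is nested (monotone)
mi misstable nested

* Impute both variables with a sequence of regressions
mi impute monotone (regress) age bmi = attack smokes hsgrad female, add(10) rseed(2024)
```

With a mixture of variable types, each variable would be given its own method in parentheses.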

Question 47

Q

What variable type follows an arbitrary pattern and what imputation method is used during the imputation stage?

Answer

A

Continuous- mvn

Mixture- chained
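For continuous variables with an arbitrary pattern, a sketch using the mvn method is given below. The missing values in age are created artificially for illustration, and the example dataset mheart0 is an assumption.

```stata
* Assumption: example dataset mheart0, in which only bmi has missing values
webuse mheart0, clear
set seed 2024

* Illustration only: delete about 10% of the age values at random, independently of bmi,
* so that age and bmi will typically be missing in an arbitrary (non-monotone) pattern
replace age = . if runiform() < 0.1

mi set mlong
mi register imputed age bmi

* Display the missing value patterns
mi misstable patterns

* Impute age and bmi jointly from a multivariate normal model
mi impute mvn age bmi = attack smokes hsgrad female, add(10) rseed(2024)
```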

Question 48

Q

Multiple imputation using chained equations (ICE) is performed by what?

Answer

A

mi impute chained

Question 49

Q

How are variables imputed using chained equations (ICE)?

Answer

A

Variables are imputed iteratively using conditional univariate imputation models. Conditional means that each variable to be imputed is regressed on the other variables in the imputation model (the fully observed variables and the other variables being imputed).

Stata first imputes the variable with the fewest missing values.

Question 50

Q

What does inserting ‘regress’ before a variable we are imputing ensure?

Answer

A

Variable is imputed as continuous

Question 51

Q

What does inserting ‘logit’ before a variable we are imputing ensure?

Answer

A

Variable is imputed as binary
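The two method keywords from the last two cards can be combined in one chained-equations call, as in this sketch. The missing values in smokes are created artificially for illustration, and the example dataset mheart0 is an assumption.

```stata
* Assumption: example dataset mheart0, in which only bmi has missing values
webuse mheart0, clear
set seed 2024

* Illustration only: delete about 10% of the binary smokes values at random
replace smokes = . if runiform() < 0.1

mi set mlong
mi register imputed bmi smokes

* (regress) imputes bmi as continuous; (logit) imputes smokes as binary
mi impute chained (regress) bmi (logit) smokes = attack age hsgrad female, add(10) rseed(2024)
```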

Question 52

Q

How can we analyse multiply imputed data in Stata?

Answer

A

mi estimate: estimation_command
* regress - Linear regression
* logit - Logistic regression
* poisson - Poisson regression
* stcox - Cox proportional hazards model
* glm - generalised linear model
* xtreg - Fixed- and random-effects linear regression
* mixed - Multilevel mixed-effects linear regression
* svy: Estimation commands for survey data

For a full list type help mi estimate
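A minimal end-to-end sketch of the analysis step is shown below, with logit as the estimation command. The example dataset mheart0 and the choice of analysis model are assumptions made for the illustration.

```stata
* Assumption: example dataset mheart0, in which only bmi has missing values
webuse mheart0, clear
mi set mlong
mi register imputed bmi
mi impute regress bmi attack smokes age hsgrad female, add(20) rseed(2024)

* Fit the logistic regression in each imputed dataset and combine the results
mi estimate: logit attack smokes age bmi hsgrad female
```

The prefix performs the analysis and combining steps in one call, so the separate estimates are not displayed by default.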

Question 53

Q

What does Stata classify all commands that can be used after main analysis as?

Answer

A

Postestimation

Question 54

Q

What are examples of post-estimation analysis?

Answer

A

Transformation

Question 55

Q

What concepts do not have a clear interpretation within the multiple imputation framework and are therefore not directly applicable to multiple imputation results?

Answer

A

Likelihood-ratio tests

Question 56

Q

What are examples of post estimation analyses?

Answer

A

Transformation - specify the transformation within mi estimate, so that it is calculated within each imputed dataset and then combined.
Include as many transformations as needed by giving each an arbitrary name, e.g. diff, followed by its expression, e.g. the difference between the parameter estimates for smokes and bmi.

Testing - test whether more than one term has a significant effect on the outcome when the terms are tested jointly.
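Both kinds of postestimation analysis might be sketched as follows. The example dataset mheart0 and the analysis model are assumptions; diff is the arbitrary name mentioned on this card.

```stata
* Assumption: example dataset mheart0, in which only bmi has missing values
webuse mheart0, clear
mi set mlong
mi register imputed bmi
mi impute regress bmi attack smokes age hsgrad female, add(20) rseed(2024)

* Fit the model and estimate the transformation named diff in each imputed dataset
mi estimate (diff: _b[smokes] - _b[bmi]): logit attack smokes age bmi hsgrad female

* Joint test that the coefficients of smokes and bmi are both zero
mi test smokes bmi
```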