W6: Missing Data Flashcards
Name 2 reasons why missing data is problematic.
- Analysing only the non-missing data can give biased results
- Loss of efficiency (less data, so larger standard errors)
What is list-wise deletion?
Analysing only complete cases
E.g. discarding a case's observed values of x and z because its observation on y is missing
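A minimal sketch of list-wise deletion using pandas (the toy data and column names are illustrative, not from the course):

```python
import numpy as np
import pandas as pd

# Toy dataset: y is missing for one case; x and z are fully observed.
df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0],
    "y": [10.0, np.nan, 30.0, 40.0],
    "z": [5.0, 6.0, 7.0, 8.0],
})

# List-wise deletion: keep only complete cases.
# Row 1's observed x and z values are discarded because its y is missing.
complete = df.dropna()
print(len(complete))  # 3 complete cases remain
```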
What is MCAR (missing completely at random)?
Missingness is completely independent of the values of the variable(s) of interest, observed or unobserved
E.g. the dog (the missingness mechanism) eats homework at random, regardless of the homework's value
Is there an empirical way to determine which missing data assumption is correct (i.e. MCAR / MAR / MNAR)?
No
What is 1 way to assess missing data assumptions?
Sensitivity analysis: compare results from imputed data with results from list-wise deletion and see whether the conclusions change
What is MAR (missing at random)?
When missingness is conditionally independent of the unobserved values of our variable(s) of interest
It depends only on values of variables we did observe
E.g. the dog eats homework only if the student is female (and sex is recorded)
What is MNAR (missing not at random)?
Missingness is associated with the unobserved values of the variable(s) of interest
It depends on the unobserved/uncollected values of the outcome itself
E.g. the dog eats only bad homework: missingness is directly related to the missing values
Will listwise deletion yield biased or unbiased estimates for
a) MCAR?
b) MAR?
c) MNAR?
a) Unbiased
b) Unbiased estimates can be recovered if the observed variables that missingness depends on are conditioned on in the model
But estimates are biased if we simply analyse the complete cases (i.e. list-wise deletion)
c) Cannot recover unbiased estimates (data needed to recover them is itself missing)
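The contrast between (a) and (c) can be shown with a small simulation; a Python sketch with a made-up distribution and missingness rules (the numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
y = rng.normal(loc=50.0, scale=10.0, size=n)  # true mean = 50

# MCAR: each value is missing with probability 0.3, independent of y.
mcar_observed = y[rng.random(n) > 0.3]

# MNAR: low values ("bad homework") are missing, so missingness depends on y.
mnar_observed = y[y > 45.0]

print(round(mcar_observed.mean(), 1))  # close to 50: complete-case mean unbiased
print(round(mnar_observed.mean(), 1))  # well above 50: complete-case mean biased
```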
What are multiple imputations?
Each missing value is estimated (imputed) several times, producing multiple completed datasets; the analysis is then run on each dataset
What is the outcome of multiple imputations?
Pooled estimates (they incorporate sampling uncertainty, V hat, as well as imputation uncertainty, so they are not simply called average estimates)
What type of uncertainty/variation is calculated in each imputed dataset?
Sampling variation (the usual variance of the estimate); averaging this across the imputed datasets gives V hat
What 3 types of variation/uncertainty does the average of a set of imputed estimates (i.e. pooled estimates) have? These add up to total uncertainty
- Sampling variation (V hat): the average within-dataset variance
- Between-imputed-dataset variation (B) / missing data uncertainty: how the estimates change as the imputed values themselves change across datasets
- Uncertainty from not generating an infinite number of imputed datasets (B/m)
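These three components add up by Rubin's rules: total variance = V hat + B + B/m. A Python sketch with hypothetical per-dataset results (the estimates and variances are made up for illustration):

```python
import numpy as np

# Hypothetical results from m = 5 imputed datasets:
# per-dataset point estimates and their sampling variances.
estimates = np.array([2.1, 1.9, 2.3, 2.0, 2.2])
variances = np.array([0.25, 0.30, 0.28, 0.26, 0.27])
m = len(estimates)

pooled_estimate = estimates.mean()      # average of the m estimates
v_hat = variances.mean()                # within-imputation (sampling) variance
b = estimates.var(ddof=1)               # between-imputation variance
total_variance = v_hat + b + b / m      # = v_hat + (1 + 1/m) * b

print(round(pooled_estimate, 2))  # 2.1
```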
What can you do to lower the level of uncertainty in multiple imputations?
Is it a good solution?
Increase m, the number of imputed datasets
No: returns diminish as m grows, and the analysis becomes slow
How many imputed datasets are recommended?
25-100
MI steps in R (step 1)
What dataset do you start with?
Raw dataset with missing data