Missing Data Flashcards

Question 1

Q

Types of missing data (3)

Answer

A

1) Missing completely at random (MCAR): elements of Xj are removed at random, independently of the data
2) Missing at random (MAR): the pattern of missingness depends on other, observed predictors, but not on the missing values themselves
3) Censoring: the pattern of missingness is closely related to the missing variable itself
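
To make the three mechanisms concrete, here is a minimal simulation sketch. The function name `make_missing`, the two-column layout, the 30% rate, and the cutoff of 1.0 are all illustrative assumptions, not part of the cards.

```python
import random

def make_missing(rows, mechanism, rate=0.3, seed=0):
    """Return a copy of rows [(x1, x2), ...] with some x2 values set to None."""
    rng = random.Random(seed)
    out = []
    for x1, x2 in rows:
        if mechanism == "mcar":
            # Dropped by a coin flip that ignores the data entirely.
            drop = rng.random() < rate
        elif mechanism == "mar":
            # Dropped only when the observed predictor x1 is positive.
            drop = x1 > 0 and rng.random() < 2 * rate
        elif mechanism == "censoring":
            # Dropped because of the very value being hidden.
            drop = x2 > 1.0
        else:
            raise ValueError(mechanism)
        out.append((x1, None if drop else x2))
    return out

data_rng = random.Random(1)
rows = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(1000)]

censored = make_missing(rows, "censoring")
observed_after_censoring = [x2 for _, x2 in censored if x2 is not None]
```

Under censoring, every surviving x2 is at most 1.0, so the mean of the observed values understates the true mean. Under MCAR the observed values remain a random subsample.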

Question 2

Q

Ways to deal with missing data (3)

Answer

A

1) tree-based methods, which deal with missing data naturally 2) single imputation 3) multiple imputation

Question 3

Q

Explain single imputation (overall, options, repercussions)

Answer

A

overall: replace each missing value with a single number.
options: the replacement value could be 1) the mean or median of the column, 2) a random sample from the non-missing values in the column, 3) a regression estimate from the other predictors X-j
repercussions: methods 1 and 2 can give biased coefficients if the data isn’t missing completely at random; method 3 avoids this bias if the missingness is predicted well by X-j, but its std errors are artificially small
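
The three replacement options can be sketched as follows. This is a toy version assuming a single other predictor and `None` as the missing marker; the function names are hypothetical.

```python
import random
import statistics

def impute_mean(col):
    """Option 1: fill with the mean of the observed values."""
    m = statistics.mean(v for v in col if v is not None)
    return [m if v is None else v for v in col]

def impute_random(col, seed=0):
    """Option 2: fill with a random draw from the observed values."""
    rng = random.Random(seed)
    observed = [v for v in col if v is not None]
    return [rng.choice(observed) if v is None else v for v in col]

def impute_regression(target, predictor):
    """Option 3: fill with a least-squares prediction from one other predictor."""
    pairs = [(x, y) for x, y in zip(predictor, target) if y is not None]
    xbar = statistics.mean(x for x, _ in pairs)
    ybar = statistics.mean(y for _, y in pairs)
    sxx = sum((x - xbar) ** 2 for x, _ in pairs)
    slope = sum((x - xbar) * (y - ybar) for x, y in pairs) / sxx
    intercept = ybar - slope * xbar
    return [intercept + slope * x if y is None else y
            for x, y in zip(predictor, target)]

x_other = [1.0, 2.0, 3.0, 4.0, 5.0]   # fully observed predictor
x_j = [2.0, 4.0, None, 8.0, 10.0]     # column with one missing value

by_mean = impute_mean(x_j)
by_sample = impute_random(x_j)
by_regression = impute_regression(x_j, x_other)
```

Each function returns one completed column with no extra variability attached to the filled-in entry, which is why downstream standard errors come out too small.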

Question 4

Q

Explain multiple imputation

Answer

A

overall: we replace each missing value in Xj with a regression estimate from the other predictors X-j, plus some noise. This is repeated several times, giving several completed datasets.

If the regression fit of Xj from X-j is good, the std errors can be unbiased

you could also bootstrap
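
A minimal sketch of the regression-plus-noise idea, assuming one other predictor, Gaussian noise scaled by the residual standard deviation, and coefficients fit once rather than redrawn per imputation. The names `fit_line` and `multiple_impute` and the simulated data are illustrative assumptions.

```python
import random
import statistics

def fit_line(xs, ys):
    """Least-squares intercept, slope, and residual standard deviation."""
    xbar, ybar = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - xbar) ** 2 for x in xs)
    slope = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
    intercept = ybar - slope * xbar
    resid = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    sigma = (sum(r * r for r in resid) / (len(xs) - 2)) ** 0.5
    return intercept, slope, sigma

def multiple_impute(target, predictor, m=20, seed=0):
    """Return m completed copies of target; each missing entry gets fit + noise."""
    rng = random.Random(seed)
    obs = [(x, y) for x, y in zip(predictor, target) if y is not None]
    a, b, sigma = fit_line([x for x, _ in obs], [y for _, y in obs])
    return [
        [a + b * x + rng.gauss(0, sigma) if y is None else y
         for x, y in zip(predictor, target)]
        for _ in range(m)
    ]

# Simulated example: x_j depends linearly on x_other, every 4th value is hidden.
data_rng = random.Random(42)
x_other = [data_rng.gauss(0, 1) for _ in range(200)]
x_j_full = [1 + 2 * x + data_rng.gauss(0, 0.5) for x in x_other]
x_j = [None if i % 4 == 0 else v for i, v in enumerate(x_j_full)]

completed = multiple_impute(x_j, x_other)
estimates = [statistics.mean(d) for d in completed]   # one estimate per dataset
pooled = statistics.mean(estimates)                   # combined point estimate
between = statistics.stdev(estimates)                 # spread due to imputation
```

The spread `between` is the piece single imputation discards: it measures how much the estimate moves purely because the missing values are uncertain.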

Question 5

Q

Missing data in more than 1 variable

Answer

A

1) iterative multiple imputation: start with single imputation, then repeated multiple imputation of Xj from X-j
2) matrix completion
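
Option 1 can be sketched for two columns as below. This is a simplified, deterministic version (no noise term, one predictor per regression), and the name `iterative_impute` and the toy data are assumptions for illustration.

```python
import statistics

def fit_line(xs, ys):
    """Least-squares intercept and slope for one predictor."""
    xbar, ybar = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - xbar) ** 2 for x in xs)
    slope = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
    return ybar - slope * xbar, slope

def iterative_impute(cols, n_sweeps=20):
    """cols: two equal-length lists with None marking missing entries."""
    masks = [[v is None for v in col] for col in cols]
    # Step 1: start from a single (mean) imputation of every column.
    filled = []
    for col in cols:
        m = statistics.mean(v for v in col if v is not None)
        filled.append([m if v is None else v for v in col])
    # Step 2: repeatedly re-impute each column from the other one.
    for _ in range(n_sweeps):
        for j in (0, 1):
            k = 1 - j
            rows = [i for i, miss in enumerate(masks[j]) if not miss]
            a, b = fit_line([filled[k][i] for i in rows],
                            [filled[j][i] for i in rows])
            for i, miss in enumerate(masks[j]):
                if miss:
                    filled[j][i] = a + b * filled[k][i]
    return filled

# Toy data with an exact relationship x2 = 3 * x1 + 1 and one gap per column.
x1 = [1.0, None, 3.0, 4.0, 5.0, 6.0]
x2 = [4.0, 7.0, 10.0, 13.0, None, 19.0]
filled = iterative_impute([x1, x2])
```

Adding a noise draw to each prediction and keeping several runs would turn this into the multiple-imputation variant the card describes.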

Question 6

Q

Matrix completion

Answer

A

yhat is the projection of y onto the space spanned by the columns of X (the column space is what determines the fit). In matrix completion, algorithms consider many candidate matrices X’ that are full (no missing entries) and close to X on its observed entries. Among these, you choose the X’ with the lowest rank.
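
As a toy illustration of the low-rank idea, the sketch below fixes the rank at one instead of searching over ranks, and fits X’ = u vᵀ to the observed entries by alternating least squares. The function name `complete_rank_one` is hypothetical, and the sketch assumes every row and column has at least one observed entry.

```python
def complete_rank_one(X, n_iter=200):
    """Fill None entries of X with a rank-one fit u[i] * v[j] to the observed ones."""
    n, p = len(X), len(X[0])
    u = [0.0] * n
    v = [1.0] * p
    for _ in range(n_iter):
        # Update each row factor using only that row's observed entries.
        for i in range(n):
            obs = [j for j in range(p) if X[i][j] is not None]
            u[i] = sum(X[i][j] * v[j] for j in obs) / sum(v[j] ** 2 for j in obs)
        # Update each column factor using only that column's observed entries.
        for j in range(p):
            obs = [i for i in range(n) if X[i][j] is not None]
            v[j] = sum(X[i][j] * u[i] for i in obs) / sum(u[i] ** 2 for i in obs)
    # Keep observed entries as they are; fill the gaps from the low-rank fit.
    return [[X[i][j] if X[i][j] is not None else u[i] * v[j] for j in range(p)]
            for i in range(n)]

# Outer product of [1, 2, 3, 4] and [1, 2, 4], with two entries hidden.
X = [
    [1.0, 2.0, 4.0],
    [2.0, 4.0, None],   # true value 8
    [3.0, 6.0, 12.0],
    [None, 8.0, 16.0],  # true value 4
]
X_completed = complete_rank_one(X)
```

Because the toy matrix really is rank one, the hidden entries are recovered almost exactly. On real data a higher rank or a rank penalty would usually be needed.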

Question 7

Q

6 practical considerations/questions

Answer

A

1) plot the data to look for patterns of missingness (POM)
2) if the POM is informative, include it as a dummy variable
3) if an Xj has too many missing points, is it worth including at all?
4) if your method allows it, weight predictors according to their rate of missingness
5) some predictors can be bounded (e.g., just positive)
6) are there any variables that are non-linear functions of others?
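
A few of these checks are easy to script. The sketch below covers the missingness rate per predictor (points 3 and 4), a missingness dummy (point 2), and clipping an imputed value to a lower bound (point 5). The helper names and the tiny dataset are made up for illustration.

```python
def missing_rate(rows, name):
    """Fraction of rows in which the named predictor is missing."""
    return sum(r[name] is None for r in rows) / len(rows)

def add_indicator(rows, name):
    """Add a 0/1 dummy column that flags where the named predictor is missing."""
    return [{**r, name + "_missing": int(r[name] is None)} for r in rows]

def clip_lower(value, lower=0.0):
    """Keep an imputed value inside a known lower bound."""
    return max(value, lower)

rows = [
    {"age": 31, "income": 52.0},
    {"age": 45, "income": None},
    {"age": None, "income": 38.5},
    {"age": 28, "income": None},
]

rates = {name: missing_rate(rows, name) for name in ("age", "income")}
flagged = add_indicator(rows, "income")
bounded = clip_lower(-3.2)   # e.g. a regression imputation that went negative
```

The dummy column lets a model use the pattern of missingness itself as a predictor, alongside whatever value is imputed.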