Missing Data Flashcards

1
Q

Types of missing data (3)

A

1) Missing completely at random (MCAR): elements of Xj are removed at random, independently of every variable.
2) Missing at random (MAR): the pattern of missingness depends on other, observed predictors, but not on the missing value itself.
3) Censoring (missing not at random): the pattern of missingness is closely related to the value of the missing variable itself.
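
The three mechanisms can be simulated to see how each one distorts what is observed. This is a minimal sketch, not part of the original card: the function names, the 0.3 and 0.6 drop probabilities, and the cutoff of 1.0 are all arbitrary assumptions chosen for illustration.

```python
import random

def make_missing(x1, x2, mechanism, rng):
    """Return a copy of x2 with dropped entries replaced by None."""
    out = []
    for a, b in zip(x1, x2):
        if mechanism == "mcar":
            drop = rng.random() < 0.3             # ignores both columns
        elif mechanism == "mar":
            drop = a > 0 and rng.random() < 0.6   # driven by the observed x1 only
        elif mechanism == "censoring":
            drop = b > 1.0                        # driven by the hidden value itself
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        out.append(None if drop else b)
    return out

def observed_mean(col):
    """Mean of the entries that are still observed."""
    vals = [v for v in col if v is not None]
    return sum(vals) / len(vals)

# Hypothetical data: x2 is correlated with the fully observed x1.
rng = random.Random(0)
x1 = [rng.gauss(0, 1) for _ in range(2000)]
x2 = [a + rng.gauss(0, 1) for a in x1]

for mechanism in ("mcar", "mar", "censoring"):
    col = make_missing(x1, x2, mechanism, rng)
    print(mechanism, round(observed_mean(col), 3))
```

Under MCAR the observed mean of x2 should stay close to the full-data mean; under the other two mechanisms it is pulled downward, because large values are more likely to be dropped.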

2
Q

Ways to deal with missing data (3)

A

1) tree-based methods, which deal with missing data naturally
2) single imputation
3) multiple imputation

3
Q

Explain single imputation (overall, options, repercussions)

A

overall: replace each missing value with a single number.
options: the replacement value can be 1) the mean or median of the column, 2) a random sample from the non-missing values in the column, or 3) a regression estimate from the other predictors X-j.
repercussions: methods 1 and 2 can give biased coefficients if the data isn’t missing completely at random; method 3 avoids this bias if the missing values of Xj are predicted well by the other predictors X-j; method 3 still has std errors that are artificially small, because the imputed values are treated as if they had been observed.
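
The three replacement options can be sketched in a few lines. This is an illustrative sketch only: the function names and the toy data are assumptions, and the regression uses a single other predictor to keep the arithmetic visible.

```python
import random
import statistics

def observed(col):
    return [v for v in col if v is not None]

def impute_mean(col):
    """Option 1: fill every gap with the column mean."""
    fill = statistics.mean(observed(col))
    return [fill if v is None else v for v in col]

def impute_random_draw(col, rng):
    """Option 2: fill each gap with a random non-missing value."""
    pool = observed(col)
    return [rng.choice(pool) if v is None else v for v in col]

def impute_regression(col, other):
    """Option 3: fill each gap with a least-squares prediction from `other`."""
    pairs = [(o, v) for o, v in zip(other, col) if v is not None]
    mo = statistics.mean(o for o, _ in pairs)
    mv = statistics.mean(v for _, v in pairs)
    slope = (sum((o - mo) * (v - mv) for o, v in pairs)
             / sum((o - mo) ** 2 for o, _ in pairs))
    intercept = mv - slope * mo
    return [intercept + slope * o if v is None else v
            for o, v in zip(other, col)]

# Toy data (hypothetical): xj is exactly twice the other predictor.
other = [1.0, 2.0, 3.0, 4.0, 5.0]
xj = [2.0, 4.0, 6.0, 8.0, None]

print(impute_mean(xj)[-1])               # 5.0
print(impute_regression(xj, other)[-1])  # 10.0
print(impute_random_draw(xj, random.Random(0))[-1])
```

Here the mean fill ignores the trend in the data, while the regression fill follows it. Each method still returns one number per gap, which is why the downstream standard errors come out too small.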

4
Q

Explain multiple imputation

A

overall: replace each missing value in Xj with a regression estimate from the other predictors X-j, plus some random noise. This is repeated several times, giving several completed datasets.

If the regression fit of Xj from X-j is good, the resulting std errors can be unbiased

you could also bootstrap
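
A minimal sketch of the idea follows, assuming a single other predictor and Gaussian noise scaled by the residual standard deviation. The helper names, the data, and m=5 are hypothetical. It also simplifies: the regression coefficients are held fixed across repetitions, whereas a fuller treatment would redraw them too (for instance by refitting on a bootstrap sample each time).

```python
import random
import statistics

def fit_line(xs, ys):
    """Least-squares line plus the residual standard deviation."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    resid = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    sigma = (sum(r * r for r in resid) / (len(xs) - 2)) ** 0.5
    return intercept, slope, sigma

def multiple_impute(col, other, m, rng):
    """Return m completed copies of col; gaps get prediction + fresh noise."""
    xs = [o for o, v in zip(other, col) if v is not None]
    ys = [v for v in col if v is not None]
    intercept, slope, sigma = fit_line(xs, ys)
    return [
        [intercept + slope * o + rng.gauss(0, sigma) if v is None else v
         for o, v in zip(other, col)]
        for _ in range(m)
    ]

# Hypothetical data with roughly 30% of xj removed completely at random.
rng = random.Random(1)
other = [rng.gauss(0, 1) for _ in range(200)]
truth = [2 * o + rng.gauss(0, 0.5) for o in other]
xj = [None if rng.random() < 0.3 else t for t in truth]

datasets = multiple_impute(xj, other, m=5, rng=rng)
estimates = [statistics.mean(d) for d in datasets]   # one estimate per dataset
print("pooled estimate:", statistics.mean(estimates))
print("between-imputation variance:", statistics.variance(estimates))
```

The spread of the estimates across the completed datasets is the extra uncertainty that single imputation hides.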

5
Q

Missing data in more than 1 variable

A

1) iterative multiple imputation: start with a single imputation of every variable, then repeatedly re-impute each Xj from X-j using multiple imputation, cycling through the variables
2) matrix completion
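
The cycling in option 1 can be sketched for two columns that each have gaps. This is an assumption-laden illustration: the names are hypothetical, and the noise term is left out so the example is deterministic. Adding a residual draw to each fill would turn it into the multiple-imputation version.

```python
import statistics

def fit_line(xs, ys):
    """Least-squares intercept and slope of ys on xs."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx, slope

def chained_impute(cols, n_iter=20):
    """cols: two equal-length lists, with None marking missing entries."""
    n = len(cols[0])
    missing = [[v is None for v in c] for c in cols]
    # Step 0: single imputation with the column mean.
    filled = []
    for c in cols:
        m = statistics.mean(v for v in c if v is not None)
        filled.append([m if v is None else v for v in c])
    # Then cycle: re-impute each column from the current fill of the other.
    for _ in range(n_iter):
        for j in (0, 1):
            k = 1 - j
            rows = [i for i in range(n) if not missing[j][i]]
            intercept, slope = fit_line([filled[k][i] for i in rows],
                                        [filled[j][i] for i in rows])
            for i in range(n):
                if missing[j][i]:
                    filled[j][i] = intercept + slope * filled[k][i]
    return filled

# Toy data (hypothetical): y is exactly 2 * x, with one gap in each column.
x = [1.0, 2.0, 3.0, None, 5.0, 6.0]
y = [2.0, 4.0, None, 8.0, 10.0, 12.0]
x_full, y_full = chained_impute([x, y])
print(round(x_full[3], 3), round(y_full[2], 3))
```

The mean fills from step 0 are poor starting values here, and the repeated passes pull both gaps back toward the line that the observed rows lie on.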

6
Q

Matrix completion

A

yhat is the projection of y onto the space spanned by the columns of X, so the column space is what determines the fit. In matrix completion, algorithms look for candidate matrices X’ that are full (no missing entries) and close to X on the observed entries. Among these candidates, you choose the X’ with the lowest rank (i.e., the fewest linearly independent columns).
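
One simple way to get a low-rank completion is to fix the rank in advance and fit the observed entries by alternating least squares. The sketch below assumes rank 1 and that every row and column has at least one observed entry. It is an illustration only and not necessarily the algorithm the card has in mind; practical methods usually pick the rank adaptively. The function name is hypothetical.

```python
def complete_rank_one(X, n_iter=100):
    """Fill the None entries of X with u[i] * v[j] from a rank-1 fit."""
    n_rows, n_cols = len(X), len(X[0])
    u = [1.0] * n_rows
    v = [1.0] * n_cols
    for _ in range(n_iter):
        # Update each row factor using only that row's observed entries.
        for i in range(n_rows):
            obs = [j for j in range(n_cols) if X[i][j] is not None]
            u[i] = (sum(X[i][j] * v[j] for j in obs)
                    / sum(v[j] ** 2 for j in obs))
        # Update each column factor using only that column's observed entries.
        for j in range(n_cols):
            obs = [i for i in range(n_rows) if X[i][j] is not None]
            v[j] = (sum(X[i][j] * u[i] for i in obs)
                    / sum(u[i] ** 2 for i in obs))
    # Keep observed entries; use the rank-1 fit only where X was missing.
    return [[X[i][j] if X[i][j] is not None else u[i] * v[j]
             for j in range(n_cols)] for i in range(n_rows)]

# Toy matrix (hypothetical): rows are multiples of [1, 2, 4], two entries hidden.
X = [[1.0, 2.0, None],
     [2.0, 4.0, 8.0],
     [None, 6.0, 12.0]]
for row in complete_rank_one(X):
    print([round(val, 3) for val in row])
```

Because the observed entries of this toy matrix are exactly rank 1, the filled values should land close to 4 and 3, the entries that make every row a multiple of [1, 2, 4].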

7
Q

6 practical considerations/questions

A

1) plot the data to find patterns of missingness (POM)
2) if the POM is informative, include it as a dummy variable
3) if an Xj has too many missing points, is it worth including at all?
4) if your method allows it, weight predictors according to their rate of missingness
5) some predictors are bounded (e.g., only positive), so imputed values should respect those bounds
6) are any variables non-linear functions of others?
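
Points 2, 3 and 5 translate into small helpers. This is a sketch under stated assumptions: the function names, the `income` column, and the pre-filled values are all hypothetical.

```python
def missing_rate(col):
    """Fraction of entries that are missing (helps decide whether to keep Xj)."""
    return sum(v is None for v in col) / len(col)

def missing_indicator(col):
    """Dummy variable: 1 where the value was missing, 0 otherwise."""
    return [1 if v is None else 0 for v in col]

def clamp_imputed(original, filled, lower):
    """Keep observed values; push imputed values up to the lower bound."""
    return [max(f, lower) if o is None else o
            for o, f in zip(original, filled)]

income = [52.0, None, 61.0, None, 48.0]
filled = [52.0, -3.5, 61.0, 40.0, 48.0]   # hypothetical output of an imputation step

print(missing_rate(income))               # 0.4
print(missing_indicator(income))          # [0, 1, 0, 1, 0]
print(clamp_imputed(income, filled, 0.0)) # [52.0, 0.0, 61.0, 40.0, 48.0]
```

The indicator column can be added to the model alongside the imputed predictor, so that an informative pattern of missingness is not lost once the gaps are filled.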
