Week 4: Item Non-Response and Imputation Methods Flashcards
What are the 3 sources of item non-response?
- Respondent = does not know the answer, refuses to answer, or skips the question by accident
- Interviewer = does not ask the question or does not record the response
- Processing = response is rejected at the editing stage
For which variables is item non-response highest?
Financial variables and derived variables, e.g. total household income from all sources (a derived variable is missing whenever any of its components is missing).
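What are the issues with complete-case analysis?
- Complete-case analysis deletes all units with incomplete data (in the variables involved)
- It is inefficient, because the observed values on deleted units are thrown away
- It is problematic when comparing regression models, because models with different variables are fitted to different subsets of cases
- It may give biased estimates and invalid inferences.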
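A minimal sketch of why complete-case analysis is inefficient and awkward for model comparison. The records and variable names are hypothetical, chosen only for illustration; `None` marks item non-response.

```python
# Hypothetical survey records; None marks item non-response.
records = [
    {"age": 34, "income": 28000, "cars": 1},
    {"age": 51, "income": None, "cars": 2},
    {"age": 27, "income": 19000, "cars": None},
    {"age": 45, "income": 41000, "cars": 1},
    {"age": 62, "income": None, "cars": 0},
]

def complete_cases(rows, variables):
    """Keep only rows with no missing value in the listed variables."""
    return [r for r in rows if all(r[v] is not None for v in variables)]

# Two candidate models using different sets of variables.
model_a = complete_cases(records, ["age", "income"])
model_b = complete_cases(records, ["age", "income", "cars"])

print(len(records), len(model_a), len(model_b))  # 5 3 2
```

The two models end up fitted to different subsets (3 cases versus 2), so differences between them may reflect the changed sample as much as the added variable.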
List the 4 imputation methods.
- Mean imputation
- Deterministic imputation
- Hot deck imputation
- Model-based imputation
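What does imputation do?
Fills in each missing value with a plausible substitute so that a complete dataset can be analysed. Done well, it reduces non-response bias.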
How does simple mean imputation work?
If y is continuous, replace every missing value of y with the mean of y among respondents.
How does simple mode imputation work?
If the variable is categorical (e.g. number of cars), replace every missing value with the mode among respondents.
- Or create a separate "missing" category
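Mean and mode imputation can be sketched in a few lines. The function names and the example values are illustrative assumptions, not part of the course material.

```python
from statistics import mean, mode

def impute_mean(values):
    """Replace None with the mean of the observed (respondent) values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def impute_mode(values):
    """Replace None with the most common observed category."""
    observed = [v for v in values if v is not None]
    fill = mode(observed)
    return [fill if v is None else v for v in values]

def impute_missing_category(values, label="missing"):
    """Alternative for categorical data: keep non-response as its own category."""
    return [label if v is None else v for v in values]

income = [20.0, None, 30.0, 40.0, None]   # continuous
cars = [1, 2, None, 1, 0]                 # categorical

print(impute_mean(income))            # [20.0, 30.0, 30.0, 40.0, 30.0]
print(impute_mode(cars))              # [1, 2, 1, 1, 0]
print(impute_missing_category(cars))  # [1, 2, 'missing', 1, 0]
```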
List the issues with mean imputation.
- Associations are diluted: estimated correlations are pulled toward zero
- It distorts the distribution by creating a spike at the mean
- Variance will be wrongly estimated (typically underestimated) if the imputed values are treated as real
- Inferences based on that variance will therefore be wrong too.
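A small numerical illustration of the dilution and variance problems. The data are made up so that x and y are perfectly correlated among respondents; the `pearson` helper is written out by hand to keep the sketch self-contained.

```python
from math import sqrt
from statistics import mean, variance

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [None, 4.0, 6.0, 8.0, 10.0, None]   # y = 2x, with two values missing

# Respondents only
pairs = [(a, b) for a, b in zip(x, y) if b is not None]
x_obs = [a for a, _ in pairs]
y_obs = [b for _, b in pairs]

# Mean imputation, then treat the imputed values as if they were real
fill = mean(y_obs)
y_imp = [fill if b is None else b for b in y]

var_obs, var_imp = variance(y_obs), variance(y_imp)
r_obs, r_imp = pearson(x_obs, y_obs), pearson(x, y_imp)

print(round(var_obs, 2), round(var_imp, 2))
print(round(r_obs, 2), round(r_imp, 2))
```

In this toy example the sample variance of y falls from about 6.67 to 4.0 and the correlation with x falls from 1.0 to about 0.53, even though nothing new was learned about the missing values.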
How do deterministic and logic-based imputation work?
- Impute using logical rules:
e.g. Age = 9, so deduce marital status = single
Y1 = no. of dependent children, Y2 = no. of non-dependent children, Y3 = no. of children
Y1 + Y2 = Y3
If one of Y1 or Y2 is missing, deduce its value from Y3 and the other
- Last observation carried forward
> Specific to longitudinal data
> Replace a missing value with the unit's last observed value, which is problematic for variables that can change (e.g. income)
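Both ideas can be sketched as small functions. Note that the identity Y1 + Y2 = Y3 pins down a value only when exactly one of the three counts is missing, or when Y3 = 0 (non-negative counts that sum to zero must both be zero). The function names are illustrative assumptions.

```python
def impute_children(y1, y2, y3):
    """Fill a missing count using the identity y1 + y2 = y3 where possible."""
    if y3 == 0:
        # Non-negative counts summing to zero must both be zero.
        y1 = 0 if y1 is None else y1
        y2 = 0 if y2 is None else y2
        return y1, y2, y3
    n_missing = sum(v is None for v in (y1, y2, y3))
    if n_missing != 1:
        return y1, y2, y3   # nothing missing, or too little information
    if y1 is None:
        y1 = y3 - y2
    elif y2 is None:
        y2 = y3 - y1
    else:
        y3 = y1 + y2
    return y1, y2, y3

def locf(series):
    """Last observation carried forward across one unit's waves."""
    filled, last = [], None
    for v in series:
        if v is not None:
            last = v
        filled.append(last)   # stays None until a first value is observed
    return filled

print(impute_children(2, None, 3))      # (2, 1, 3)
print(impute_children(None, None, 0))   # (0, 0, 0)
print(impute_children(None, None, 3))   # (None, None, 3)
print(locf([None, 5, None, None, 7, None]))  # [None, 5, 5, 5, 7, 7]
```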
How does hot-deck imputation work?
Replace each missing value with an observed value taken from a similar respondent (the donor) in the same dataset.
What is sequential hot deck?
- Records are ordered
- Impute value from previous record in the same class (needs some starting values)
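A minimal sketch of sequential hot deck, assuming the records are already in file order and that one starting value is supplied per class. The field names (`region`, `income`) and the starting values are hypothetical.

```python
def sequential_hot_deck(records, class_key, target, start_values):
    """Impute `target` from the previous record in the same class."""
    last_seen = dict(start_values)   # current donor value for each class
    out = []
    for rec in records:              # records are processed in file order
        rec = dict(rec)              # copy so the input is left unchanged
        cls = rec[class_key]
        if rec[target] is None:
            rec[target] = last_seen[cls]
            rec["imputed"] = True
        else:
            last_seen[cls] = rec[target]
            rec["imputed"] = False
        out.append(rec)
    return out

records = [
    {"region": "N", "income": None},   # no earlier donor: uses the starting value
    {"region": "S", "income": 30},
    {"region": "N", "income": 25},
    {"region": "S", "income": None},
    {"region": "S", "income": None},   # same donor used a second time
    {"region": "N", "income": None},
]
start = {"N": 20, "S": 28}

filled = sequential_hot_deck(records, "region", "income", start)
print([r["income"] for r in filled])   # [20, 30, 25, 30, 30, 25]
```

The first record shows why starting values matter, and the two consecutive "S" records show one donor being reused.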
List the 3 issues with sequential hot deck.
- With few classes, there is limited control over how similar donor and recipient are
- With many classes, the same donor is often used multiple times
- The choice of starting values is important
What is an alternative to sequential hot deck?
Hierarchical hot deck
How does hierarchical hot deck work?
- Sort the file of respondents hierarchically by the matching variables
- Then match on as many variables as possible, choosing at random among the donors that remain
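One way to sketch the matching step: try to match on all variables, and if no donor qualifies, drop the lowest-priority variable and try again, picking at random among the donors that match. This fallback order is an assumption about how "as many variables as possible" is applied; the variable names and data are hypothetical.

```python
import random

def hierarchical_hot_deck(recipient, donors, match_vars, target, rng):
    """Return (imputed value, number of variables matched).

    match_vars is ordered from most to least important; the least
    important variables are dropped first when no donor matches.
    """
    for k in range(len(match_vars), -1, -1):
        keys = match_vars[:k]
        pool = [d for d in donors if all(d[v] == recipient[v] for v in keys)]
        if pool:
            donor = rng.choice(pool)   # final choice is random
            return donor[target], k
    raise ValueError("no donors available")

donors = [
    {"region": "N", "sex": "F", "age_group": "30-44", "income": 30},
    {"region": "N", "sex": "F", "age_group": "45-59", "income": 35},
    {"region": "N", "sex": "M", "age_group": "30-44", "income": 28},
    {"region": "S", "sex": "F", "age_group": "30-44", "income": 40},
]
recipient = {"region": "N", "sex": "F", "age_group": "60+", "income": None}

rng = random.Random(1)
value, matched = hierarchical_hot_deck(
    recipient, donors, ["region", "sex", "age_group"], "income", rng
)
print(value, matched)
```

Here no donor matches on all three variables, so the match falls back to region and sex, and the value is drawn at random from the two donors that agree on both.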
What is the problem with most imputation methods?
They do not reflect sampling variation or uncertainty about the regression coefficients, so the imputed values look more certain than they are.
How does multiple imputation reflect sampling variation?
It creates several (e.g. five) imputed values for each missing value, each of which is predicted from a slightly different model and each of which also reflects sampling variability.
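A rough sketch of the idea, under stated assumptions: the "slightly different model" is obtained here by refitting a simple linear regression on a bootstrap resample of the respondents, and sampling variability is added by drawing a random residual for each imputed value. Both choices are illustrative, not necessarily the method used in the course, and the data are made up.

```python
import random
from statistics import mean

def fit_line(xs, ys):
    """Least-squares intercept, slope and residuals for y = a + b*x."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
    return a, b, residuals

def multiply_impute(xs, ys, m, rng):
    """Return m completed copies of ys (None marks a missing value)."""
    observed = [(x, y) for x, y in zip(xs, ys) if y is not None]
    completed = []
    for _ in range(m):
        # Bootstrap the respondents so each fitted model differs slightly;
        # redraw if the resample has a single x value (slope undefined).
        while True:
            sample = [rng.choice(observed) for _ in observed]
            if len({x for x, _ in sample}) > 1:
                break
        a, b, residuals = fit_line([x for x, _ in sample],
                                   [y for _, y in sample])
        # Prediction plus a randomly drawn residual for each missing value.
        filled = [y if y is not None else a + b * x + rng.choice(residuals)
                  for x, y in zip(xs, ys)]
        completed.append(filled)
    return completed

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
ys = [2.1, 3.9, None, 8.2, 9.8, 12.1, None, 16.0, 18.1, 19.9]

completed = multiply_impute(xs, ys, m=5, rng=random.Random(0))
estimates = [mean(d) for d in completed]   # one estimate per imputed dataset
pooled = mean(estimates)                   # combined point estimate
print([round(e, 2) for e in estimates], round(pooled, 2))
```

The spread of the five estimates around the pooled value gives a sense of how much uncertainty the imputation itself contributes, which a single imputed dataset would hide.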
When should we use hot deck imputation?
For categorical variables with few missing cases.
When should we use random regression imputation?
For variables with substantial rates of missingness (>10%), especially continuous variables.