Week 4: Item Non-Response and Imputation Methods Flashcards
What are the 3 sources of item non-response?
- Respondent = does not know the answer, refuses, or accidentally skips the question
- Interviewer = does not ask the question or does not record the response
- Processing = response is rejected at editing
In which variables is item non-response the highest?
Financial variables and derived variables, e.g. total household income from all sources.
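What are the issues with complete-case analysis?
- Complete-case analysis deletes all units with incomplete data (in the variables involved)
- It is inefficient, since the observed data on deleted units are thrown away
- It is problematic when comparing regression models, since models with different variables are fitted to different subsets of cases
- It may give biased estimates and invalid inferences.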
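A minimal sketch of the model-comparison problem, assuming records are stored as dictionaries with `None` marking a missing item. The helper `complete_cases` and the toy records are hypothetical, for illustration only.

```python
def complete_cases(records, variables):
    """Keep only units with no missing value in the variables a model uses."""
    return [r for r in records if all(r[v] is not None for v in variables)]

# Hypothetical survey records; None marks item non-response.
records = [
    {"age": 34, "income": 30, "savings": 5},
    {"age": 51, "income": None, "savings": 12},
    {"age": 27, "income": 22, "savings": None},
    {"age": 45, "income": None, "savings": None},
]

# Two models using different variables end up fitted to different units.
model_a_cases = complete_cases(records, ["age", "income"])
model_b_cases = complete_cases(records, ["age", "income", "savings"])
print(len(model_a_cases), len(model_b_cases))
```

Adding `savings` to the model shrinks the analysis sample, so the two models are not compared on the same units.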
List the 4 imputation methods.
- Mean imputation
- Deterministic imputation
- Hot deck imputation
- Model-based imputation
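What does imputation do?
Fills in each missing value with a plausible substitute to give a complete dataset, with the aim of reducing non-response bias.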
How does simple mean imputation work?
If y is continuous, impute all missing values of y with the respondent sample mean of y.
How does simple mode imputation work?
If the variable is categorical (e.g. number of cars), impute with the respondent sample mode.
- Or create a missing value category
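Simple mean and mode imputation can be sketched as follows, assuming missing values are coded as `None`. The function names and the toy data are hypothetical.

```python
from statistics import mean, mode

def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def impute_mode(values):
    """Replace None entries with the most frequent observed category."""
    observed = [v for v in values if v is not None]
    fill = mode(observed)
    return [fill if v is None else v for v in values]

income = [20.0, None, 30.0, 40.0, None]  # hypothetical continuous variable
cars = [1, 2, None, 1, 0]                # hypothetical categorical variable

income_imputed = impute_mean(income)  # respondent mean is 30.0
cars_imputed = impute_mode(cars)      # respondent mode is 1
```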
List the issues with mean imputation.
- Associations tend to be diluted: estimated correlations are pulled toward zero
- Distorts the distribution (a spike of imputed values at the mean)
- Variance will be wrongly estimated (typically underestimated) if the imputed values are treated as real
- Thus inferences will be wrong too.
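The dilution and variance issues above can be seen on a small toy example. This is only an illustration under assumed data: `x` is fully observed, `y` has two missing values, and the hypothetical helper `pearson` computes the sample correlation.

```python
from math import sqrt
from statistics import mean, variance

def pearson(xs, ys):
    """Sample Pearson correlation between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

x = [1, 2, 3, 4, 5, 6]
y = [None, 4, 6, 8, 10, None]  # hypothetical data, None = missing

# Respondents only (complete cases on y).
pairs = [(a, b) for a, b in zip(x, y) if b is not None]
x_obs = [a for a, _ in pairs]
y_obs = [b for _, b in pairs]

# Mean imputation: every missing y gets the respondent mean.
fill = mean(y_obs)
y_imp = [fill if b is None else b for b in y]

var_observed = variance(y_obs)   # variance among respondents
var_imputed = variance(y_imp)    # smaller: imputed values add no spread
r_observed = pearson(x_obs, y_obs)
r_imputed = pearson(x, y_imp)    # closer to zero than r_observed
```

Treating `y_imp` as real data would understate the variance and the strength of the x-y association.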
How does deterministic (logic-based) imputation work?
- Impute using logical rules:
> e.g. Age = 9, so deduce marital status = single
> e.g. Y1 = no. of dependent children, Y2 = no. of non-dependent children, Y3 = no. of children, so Y1 + Y2 = Y3. If Y1 or Y2 is missing, its value can be deduced from Y3 and the other count
- Last observation carried forward
> Specific to longitudinal data
> Replace by the last observed value, but this is problematic for variables that can change (e.g. income)
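A sketch of both ideas, assuming `None` marks a missing value. The function names are hypothetical. The rule-based function only fills a count when exactly one of the three is missing, since that is the only case the identity Y1 + Y2 = Y3 determines.

```python
def impute_children(y1, y2, y3):
    """Fill one missing count using the identity Y1 + Y2 = Y3."""
    n_missing = sum(v is None for v in (y1, y2, y3))
    if n_missing != 1:
        return y1, y2, y3  # nothing missing, or too little information
    if y1 is None:
        y1 = y3 - y2
    elif y2 is None:
        y2 = y3 - y1
    else:
        y3 = y1 + y2
    return y1, y2, y3

def locf(series):
    """Last observation carried forward for one unit's repeated measures."""
    filled, last = [], None
    for v in series:
        if v is not None:
            last = v
        filled.append(last)  # stays None until a first value is observed
    return filled

deduced = impute_children(None, 1, 3)             # Y1 deduced as 3 - 1
carried = locf([None, 5, None, None, 7, None])    # gaps take the prior wave
```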
How does hot-deck imputation work?
Replace each missing value with an observed value taken from a similar responding unit (a donor) in the same survey.
What is sequential hot deck?
- Records are ordered
- Impute value from previous record in the same class (needs some starting values)
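A minimal sketch of a sequential hot deck, assuming records are dictionaries already in file order, `None` marks a missing value, and every imputation class has a starting value. The function and variable names are hypothetical.

```python
def sequential_hot_deck(records, class_key, target, start_values):
    """Impute `target` from the previous record in the same class.

    start_values maps each class to the value used before any
    respondent in that class has been seen.
    """
    last_seen = dict(start_values)
    imputed = []
    for rec in records:
        rec = dict(rec)  # copy so the input is left unchanged
        cls = rec[class_key]
        if rec[target] is None:
            rec[target] = last_seen[cls]   # take the current donor value
        else:
            last_seen[cls] = rec[target]   # respondent becomes the donor
        imputed.append(rec)
    return imputed

records = [  # hypothetical ordered file
    {"region": "N", "income": None},
    {"region": "N", "income": 25},
    {"region": "S", "income": 40},
    {"region": "N", "income": None},
    {"region": "S", "income": None},
    {"region": "S", "income": None},
]
result = sequential_hot_deck(records, "region", "income", {"N": 20, "S": 30})
incomes = [r["income"] for r in result]
```

In this toy file the first record falls back on the starting value for its class, and the last two records both reuse the same donor.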
List the 3 issues with sequential hot deck.
- If few classes, limited control
- If many classes, often multiple use of donors
- Choice of starting values is important
What is the alternative hot deck method?
Hierarchical hot deck
How does hierarchical hot deck work?
- Sort file of respondents hierarchically by variables
- Then match as many variables as possible, making final choice random
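One possible reading of the hierarchical matching step, sketched under assumptions: the matching variables are listed from most to least important, the least important variable is dropped first when no donor matches, and the final donor is drawn at random from the matched pool. The function name, variable names and data are hypothetical.

```python
import random

def hierarchical_hot_deck(recipient, donors, match_vars, target, rng=None):
    """Return (imputed value, number of variables matched).

    match_vars is ordered from most to least important; variables are
    dropped from the end until at least one donor matches.
    """
    rng = rng or random.Random()
    for k in range(len(match_vars), -1, -1):
        keys = match_vars[:k]
        pool = [d for d in donors if all(d[v] == recipient[v] for v in keys)]
        if pool:
            return rng.choice(pool)[target], k  # final choice is random
    raise ValueError("no donors available")

donors = [  # hypothetical respondents
    {"region": "N", "sex": "F", "age_group": "30-44", "income": 30},
    {"region": "N", "sex": "F", "age_group": "45-59", "income": 35},
    {"region": "N", "sex": "M", "age_group": "16-29", "income": 22},
    {"region": "S", "sex": "F", "age_group": "16-29", "income": 28},
]
recipient = {"region": "N", "sex": "F", "age_group": "16-29", "income": None}

value, n_matched = hierarchical_hot_deck(
    recipient, donors, ["region", "sex", "age_group"], "income",
    rng=random.Random(0),
)
```

Here no donor matches on all three variables, so the match falls back to region and sex, and one of the two matching donors is chosen at random.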
What is the problem with most imputation methods?
They do not reflect sampling variation and uncertainty about regression coefficients.