Week 3 - Basic Data Cleaning/Missing Data Flashcards
3 sources of missing data
- From some participants
- From some variables
- From a subset of people/measures (only some participants didn’t provide a response on a particular variable)
What question is asked when there is data missing from some participants?
Are people who didn’t provide data somehow different from those who did provide data?
Question that is asked when there is data missing from some variables?
Why would people not provide data here? Is the study affected?
Question that is asked when data is missing from a subset of people/measures?
Why would only some people withhold a response to some items?
Why care about missing data?
Can influence how representative the sample is of the population we wish to generalise to
1. Undermines validity
- Estimated parameters might not be equivalent to population parameters
- If estimated parameters are biased, it undermines validity in the study
- No longer a true reflection of what’s happening in the population
2. Can compromise statistical power
- Reducing sample size
- Influencing ability to correctly detect an effect if it exists
What checks are required to determine how problematic missing data might be?
Checking whether participants and variables have been adequately assessed, and determining the pattern of missing data
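One way to run the first of these checks is to tally the proportion of missing responses per participant and per variable. The sketch below is illustrative only: the dataset, the variable names (age, stress, sleep) and both helper functions are hypothetical, and None is assumed to mark a missing response.

```python
# Hypothetical dataset: one dict per participant; None marks a missing response.
data = [
    {"id": 1, "age": 23, "stress": 4, "sleep": 7},
    {"id": 2, "age": None, "stress": None, "sleep": None},
    {"id": 3, "age": 31, "stress": None, "sleep": 6},
    {"id": 4, "age": 27, "stress": 2, "sleep": 8},
]
variables = ["age", "stress", "sleep"]


def missing_by_participant(rows, variables):
    """Proportion of variables each participant left unanswered."""
    return {
        row["id"]: sum(row[v] is None for v in variables) / len(variables)
        for row in rows
    }


def missing_by_variable(rows, variables):
    """Proportion of participants with no response on each variable."""
    return {
        v: sum(row[v] is None for row in rows) / len(rows)
        for v in variables
    }


per_participant = missing_by_participant(data, variables)
per_variable = missing_by_variable(data, variables)
print(per_participant)
print(per_variable)
```

A participant with a proportion near 1 (here, participant 2) has effectively not been sampled; a variable with a high proportion (here, stress) has not been adequately assessed.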
Why is it a problem if some participants haven’t been adequately sampled?
It’s as if they haven’t participated in the study at all:
- Reduces power
- If due to systematic reasons, the validity of study is undermined
Why is it a problem if some variables haven’t been adequately assessed?
It’s as if we haven’t assessed the item at all:
- Can compromise ability to address research questions (especially if on variables of interest)
- If due to systematic reasons, the validity of our study is undermined and can lead to inaccurate conclusions
What is meant by biased estimated parameters?
e.g., absence of high scorers underestimates the mean. In addition, relationships between that variable and other variables are likely to be weakened because of restriction of range (majority of scores are low)
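The underestimated mean can be seen with a few made-up numbers. This is a minimal sketch, assuming a hypothetical set of scores and a hypothetical cutoff above which responses go missing; neither comes from the course material.

```python
from statistics import mean

# Hypothetical "true" scores for nine participants.
full_scores = [1, 2, 2, 3, 3, 4, 5, 6, 7]

# Assumption for illustration: everyone scoring at or above the cutoff
# fails to provide data, so only the low scorers are observed.
cutoff = 5
observed = [s for s in full_scores if s < cutoff]

full_mean = mean(full_scores)
observed_mean = mean(observed)

# Restriction of range: the spread of observed scores shrinks as well.
full_range = max(full_scores) - min(full_scores)
observed_range = max(observed) - min(observed)

print(f"mean:  full={full_mean:.2f}  observed={observed_mean:.2f}")
print(f"range: full={full_range}  observed={observed_range}")
```

The observed mean is lower than the full mean, and the observed range is narrower, which is the restriction of range that tends to weaken relationships with other variables.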
Pattern of missing data:
- can’t predict when score will be missing from dataset
- can’t predict what the value of datapoint would be, given that it is missing
DATA ARE NOT MISSING SYSTEMATICALLY
Missing Completely At Random (MCAR)
Pattern of missing data:
- can predict when a score will be missing from dataset
- cannot predict what the value of datapoint would have been
DATA ARE MISSING SYSTEMATICALLY BUT DO NOT INTRODUCE BIAS
Missing At Random (MAR)
Results are still not fully generalisable (for example, conclusions may be based mostly on people with low endorsement of masculinity), but this does not introduce a large amount of bias, so the findings are not completely wrong
Pattern of missing data:
- may (or may not) be able to predict when a score will be missing from dataset
- Can predict what the value of the datapoint is likely to be
DATA ARE MISSING SYSTEMATICALLY AND DO INTRODUCE BIAS
Missing Not At Random (MNAR)
e.g., participants who don’t provide data are likely to be high scorers
How to deal with inadequately sampled participants?
Can delete them, as it’s basically as if they didn’t participate anyway (provided the missingness is not attributable to systematic factors and is unlikely to bias parameter estimates)
2 types of deletion strategies when dealing with MAR and MCAR data
Listwise deletion
Pairwise deletion
Deletion strategy that removes any participants with any missing data from all analyses
Listwise
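Listwise deletion as described above can be sketched in a few lines. The dataset, variable names and the listwise_delete helper are hypothetical, and None is assumed to mark a missing response.

```python
# Hypothetical dataset: one dict per participant; None marks a missing response.
rows = [
    {"id": 1, "stress": 4, "sleep": 7},
    {"id": 2, "stress": None, "sleep": 5},
    {"id": 3, "stress": 3, "sleep": None},
    {"id": 4, "stress": 2, "sleep": 8},
]


def listwise_delete(rows, variables):
    """Keep only participants with a response on every listed variable."""
    return [
        row for row in rows
        if all(row[v] is not None for v in variables)
    ]


complete_cases = listwise_delete(rows, ["stress", "sleep"])
kept_ids = [row["id"] for row in complete_cases]
print(kept_ids)
```

Participants 2 and 3 are each missing one value, so both are dropped from all analyses, which halves the sample in this small example and shows how listwise deletion can cost statistical power.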