Week 3 Data Cleaning Flashcards
Covers material from Slide 9 onwards (decision trees are covered in separate slides).
What are some of the issues with data collection?
- Data Entry Errors
- Missing Data
- Outliers
- Non-normal distribution
- Linearity
- Homoscedasticity
- Multicollinearity and Singularity
Why is it necessary to be fastidious in cleaning your data?
- Conclusions drawn from inferential statistics are dependent upon not having violated the assumptions underpinning the technique used.
- It supports applying the principles that allow the findings to be generalised.
- Data cleaning helps you understand your data to draw conclusions.
- Using small samples makes it even more important to ensure the data is representative and generalisable.
- We infer from our sample data what the population may do or think – we are generalising to the broader community when using parametric statistics.
Remembering that my data is precious and tells a story, what is the best method for checking its accuracy?
- physically comparing the original data with the onscreen entry
- However, this is only practical with a small data set
- Stats packages offer alternatives to help screen my data for accuracy
- Descriptive statistics and graphical representations are a good start, e.g. unusual data entries can be identified visually or by assessing outliers.
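A minimal sketch of this kind of screening in Python with pandas (the slides use SPSS; the file survey.csv and the 1-7 item q1 are hypothetical names used for illustration):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical survey file and item name, for illustration only
df = pd.read_csv("survey.csv")

# Descriptive statistics: min/max quickly expose impossible values
print(df.describe())

# Frequency table for a single 1-7 Likert item
print(df["q1"].value_counts(dropna=False).sort_index())

# Boxplot to eyeball outliers and unusual entries
df["q1"].plot(kind="box")
plt.show()
```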
Why should I look for out-of-range values?
- Out-of-range values may distort my results and are not representative.
- Data entry may be incorrect: e.g. a score of 12 on a 1-7 scale is invalid.
- If one entry is mis-keyed, the surrounding sequence of entries may also be inaccurate.
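A minimal sketch of an out-of-range check, again assuming a hypothetical survey.csv with a 1-7 item q1:

```python
import pandas as pd

df = pd.read_csv("survey.csv")   # hypothetical file name

# Flag scores outside the valid 1-7 range (e.g. a mis-keyed 12)
out_of_range = df[(df["q1"] < 1) | (df["q1"] > 7)]
print(out_of_range)

# Inspect the rows entered just before and after each bad entry,
# since the same keying run may also be inaccurate
for idx in out_of_range.index:
    print(df.loc[max(idx - 1, 0):idx + 1])
```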
What is missing data, and why does it matter?
*Missing Data is information not available for a subject (or case) about whom other information is available.
*Usually occurs when the participant fails to answer one or more questions in a survey.
I need to consider: is it Systematic or Random?
- I need to look for patterns and relationships underlying the missing data so that, when a remedy is applied, the values stay as close as possible to the original distribution.
*Impact can reduce sample size available for analysis and can also distort results.
What are the principles of missing data screening?
- Work out if it’s spread randomly across the dataset or is systematically missing – check patterns of missingness.
- The number (amount) of missing data points is secondary to the pattern of missingness.
- How you deal with your missing data needs to be directed by stringent criteria.
- Regardless of my decision, any transformations, deletions or adjustments need to be referenced succinctly in my results section.
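One way to look for those patterns of missingness, sketched in Python/pandas rather than SPSS (survey.csv is a hypothetical file name):

```python
import pandas as pd

df = pd.read_csv("survey.csv")   # hypothetical file name

# Proportion missing per variable (the amount is secondary to the pattern)
print(df.isna().mean().sort_values(ascending=False))

# Do missing-value indicators co-occur? Strong correlations between the
# 0/1 indicators suggest systematic rather than random missingness
indicators = df.isna().astype(int)
print(indicators.corr())
```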
I need to deal with missing data prior to Cleaning & Analysis - but how do I do this?
I can test for levels of missingness by creating a dummy-coded variable (0 = value present, 1 = value missing) and then using a t-test to assess whether there is a mean difference between the two groups on the DV of interest, as sketched below.
*If the groups do not differ significantly, how I deal with the missing data is less critical, although the degree of missingness is still important to look at next (more than 5% is not great).
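A minimal sketch of that dummy-variable t-test, assuming hypothetical columns income (the variable with missing values) and wellbeing (the DV of interest):

```python
import pandas as pd
from scipy import stats

df = pd.read_csv("survey.csv")   # hypothetical file and column names

# 0/1 dummy: is this case missing a value on 'income'?
df["income_missing"] = df["income"].isna().astype(int)

# Compare the two groups on the DV of interest
present = df.loc[df["income_missing"] == 0, "wellbeing"].dropna()
missing = df.loc[df["income_missing"] == 1, "wellbeing"].dropna()
t, p = stats.ttest_ind(present, missing, equal_var=False)
print(f"t = {t:.2f}, p = {p:.3f}")

# Degree of missingness (more than 5% is flagged in the slides)
print(f"{df['income'].isna().mean():.1%} missing on income")
```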
There are 3 kinds of missing data, what are they?
The data may be:
*Missing Completely at Random – unpredictable (MCAR)
*Missing At Random but ignorable nonresponse (MAR)
*Missing not at random or non-ignorable (MNAR)
MNAR is the worst kind :-(
What are the 3 methods that were traditionally employed to handle missing data, and why are they not the most appropriate?
Historically, missing data has been handled by listwise deletion of the entire case, pairwise deletion (excluding the case only from calculations that involve the missing item), or replacement with the mean value of the item in question.
Listwise deletion eliminates other very relevant information, and although pairwise deletion lessens this effect, it still discards substantial, valuable data. Instead, replacing (imputing) the missing values appears to be the optimal way of dealing with the issue, rather than eliminating the whole pattern of responses from the individual being assessed. The three historical approaches are contrasted below.
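For contrast, the three historical approaches sketched in pandas (survey.csv is a hypothetical file name):

```python
import pandas as pd

df = pd.read_csv("survey.csv")   # hypothetical file name

# 1. Listwise deletion: drops the whole case, losing relevant information
listwise = df.dropna()

# 2. Pairwise deletion: each statistic uses whatever cases it can
#    (pandas .corr() already does this by default)
pairwise_corr = df.corr(numeric_only=True)

# 3. Mean replacement: substitutes the item mean for every missing value
mean_replaced = df.fillna(df.mean(numeric_only=True))

# How many cases does listwise deletion cost?
print(len(df), "cases originally,", len(listwise), "after listwise deletion")
```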
SPSS will use listwise or pairwise deletion as the default way of dealing with missing data during most analyses. Why is this not a great idea?
Because these defaults discard valuable data. A better method, which avoids the loss of valuable data and inappropriate imputation, is Regression or EM (Expectation Maximisation, a model-based method) replacement.
Systematic replacement of missing values may be used in your research with SPSS; however, remember to report any changes made to the actual raw data set in your results section.
Why is Regression Replacement (RP) a better option for handling missing data?
The regression option provides an assigned value that takes into account the pattern of responses from each individual case on all the other variables: in effect, an adjusted mean for each participant, supplied through a regression analysis.
What other things do I need to remember when considering missing data?
*Missing data under 5% for an individual case or observation can generally be ignored, except when it occurs in a specific non-random fashion (e.g. concentration in a specific set of questions, attrition at the end of the questionnaire). This is particularly relevant in a large dataset.
*If replacement values are not being substituted (imputed) for the missing data, the number of cases with no missing data must be sufficient for the selected analysis technique.
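A quick way to check both rules of thumb, again sketched with pandas on a hypothetical survey.csv:

```python
import pandas as pd

df = pd.read_csv("survey.csv")   # hypothetical file name

# Percentage missing per case: under ~5% can generally be ignored
case_missing = df.isna().mean(axis=1)
print((case_missing > 0.05).sum(), "cases exceed 5% missing data")

# How many complete cases remain if nothing is imputed?
print(len(df.dropna()), "complete cases available for the analysis")
```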
So what if I have more than 5% data missing?
Under 10% – Any of the data replacement methods can be applied when missing data is this low, although the complete case method has been shown to be the least preferred.
10 to 20% – The increased presence of missing data makes the all-available (pairwise) and regression methods most preferred for MCAR data, and model-based methods are necessary with MAR missing-data processes.
Why is Expectation Maximisation (EM) so highly regarded when dealing with missing data?
Expectation Maximisation (EM):
- Uses data to generate the shape of the distribution
- Uses the observed values and the estimate of the parameters to generate substitutions for the missing values
- Based on the likelihood of those assigned values occurring under that distribution
- Unlike regression, expectation maximisation produces realistic estimates of the variance but it is much more difficult to undertake.
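A bare-bones sketch of the EM idea for multivariate-normal data (not SPSS's implementation, just the logic of iterating the E- and M-steps), assuming missing values are coded as NaN:

```python
import numpy as np

def em_impute(X, n_iter=50):
    """EM imputation for multivariate-normal data with NaN for missing values.

    Illustrative only: each iteration fills missing entries with their
    conditional expectation given the observed entries (E-step), then
    re-estimates the mean and covariance (M-step). Assumes every row
    has at least one observed value.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    miss = np.isnan(X)
    mu = np.nanmean(X, axis=0)            # start from available-case means
    Xf = np.where(miss, mu, X)            # mean-filled working copy
    Sigma = np.cov(Xf, rowvar=False)
    for _ in range(n_iter):
        C = np.zeros((p, p))              # accumulated conditional covariance
        for i in range(n):
            m, o = miss[i], ~miss[i]
            if not m.any():
                continue
            Soo_inv = np.linalg.inv(Sigma[np.ix_(o, o)])
            Smo = Sigma[np.ix_(m, o)]
            # E-step: expected value of the missing block given the observed block
            Xf[i, m] = mu[m] + Smo @ Soo_inv @ (Xf[i, o] - mu[o])
            C[np.ix_(m, m)] += Sigma[np.ix_(m, m)] - Smo @ Soo_inv @ Smo.T
        # M-step: maximum-likelihood update of the parameters
        mu = Xf.mean(axis=0)
        diff = Xf - mu
        Sigma = (diff.T @ diff + C) / n
    return Xf, mu, Sigma
```

Because the update uses the conditional covariance as well as the conditional mean, the estimated variances stay realistic rather than being shrunk, which is the advantage noted above.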
So, what does the Regression Method of Data replacement have to offer?
- Other variables are used as IVs and the variable with missing data as the DV. The complete cases generate the regression equation, which is then used to predict the missing values.
- It requires the IVs to be good predictors of the variable with the missing data. Because the other scores are used to generate the prediction, the replacement is more consistent with them than a "real" score would be, which reduces variance.
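A minimal sketch of regression replacement with scikit-learn rather than SPSS (column names are hypothetical; complete cases supply the regression equation, which then predicts the missing values):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

def regression_replace(df, target, predictors):
    """Fill missing values of `target` by regressing it on `predictors`."""
    out = df.copy()
    # Complete cases generate the regression equation
    complete = out.dropna(subset=predictors + [target])
    model = LinearRegression().fit(complete[predictors], complete[target])
    # Predict replacements for cases missing the target but not the IVs
    to_fill = out[target].isna() & out[predictors].notna().all(axis=1)
    out.loc[to_fill, target] = model.predict(out.loc[to_fill, predictors])
    return out

# Hypothetical usage: impute 'income' from two related variables
# df = regression_replace(df, "income", ["education", "job_level"])
```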
What is the difference between EM and Regression Replacement?
EM gives you a maximum-likelihood estimate of the replacement value, whereas the regression model takes each case's pattern of scores on the other variables into account when generating the replacement.