Week 3 Data Cleaning (textbooks) Flashcards
to help us learn week 3 lecture and supplementary reading content :-) including: Review research design and decision tree for choosing your analytical technique. Cleaning your data; Data Entry Errors; Missing Data; Outliers; Non-normal distribution; Linearity; Homoscedasticity; Multicollinearity and Singularity so no problem then :-}
What is the basic logic used to interpret all inferential statistics?
The basic logic used to interpret all inferential statistics is the logic of significance and of the null and alternative hypotheses
What do parametric statistics depend on?
Parametric statistics depend on the assumption of normality (i.e. that scores are normally distributed)
What is the first step in the analysis process?
The first step in the analysis process is always to screen the data
Why is data screening the first step in the analysis process?
- To ensure accuracy
- To check assumptions
- To inform the decision, made after screening, about which statistical test to use
What are the 4 steps of data screening / data analysis?
Screening the data involves:
- Checking the accuracy of the data entry
- Examining any missing data to assess whether there is evidence that they might be systematic (i.e. caused by something specific, as opposed to being due to random errors)
- Looking for outliers
- Checking parametric test assumptions (e.g. normality)
Why is it important to inspect my data file for missing data?
I need to check whether any variable has a lot of missing data and, if so, ask why: is it random or is there a systematic pattern?
How do I inspect my data file for missing data?
o Run “Descriptives” & find out what percentage of values is missing from each of my variables
o SPSS has a Missing Value Analysis (MVA) procedure to help find patterns in missing data
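Outside SPSS, the same "percentage missing per variable" check can be sketched in a few lines of Python. This is only an illustration: the variable names and values below are made up, and `None` is assumed to mark a missing response.

```python
# Hypothetical survey data (invented for illustration); None marks a missing value.
data = {
    "age":    [23, 35, None, 41, 29, None, 52, 38],
    "stress": [12, None, 15, 9, 11, 14, 10, 13],
    "sleep":  [7, 6, 8, 5, 7, 6, 8, 7],
}

def percent_missing(values):
    """Return the percentage of entries in `values` that are None."""
    missing = sum(1 for v in values if v is None)
    return 100.0 * missing / len(values)

# One percentage per variable, like the "missing" count in a Descriptives table.
missing_report = {name: percent_missing(col) for name, col in data.items()}

for name, pct in missing_report.items():
    print(f"{name}: {pct:.1f}% missing")
```

A variable with a high percentage is the cue to ask whether the missingness is random or systematic.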
Why is caution recommended when managing missing values in my statistical analysis?
I should choose the missing-values option carefully, as the choice can have dramatic effects on my results
This is particularly important if I am including a list of variables & repeating the same analysis for all variables (e.g. correlations among a group of variables, t-tests for a series of dependent variables)
What is the best way to manage missing values when I come to do my statistical analysis?
*Pallant strongly recommends the ‘pairwise’ exclusion of missing data
• ‘exclude cases listwise’ includes a case only if it has full data on all the variables.
• ‘exclude cases pairwise’ is better as it excludes a participant only if they are missing data required for a specific analysis. They will still be included in any analysis for which they have the necessary information
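The difference between the two options can be seen by counting how many cases each one keeps. The sketch below uses invented data and hypothetical helper names (`listwise`, `pairwise`); it is not how SPSS implements the options, only an illustration of the rule each one applies.

```python
from itertools import combinations

# Hypothetical cases (invented for illustration); None marks a missing value.
rows = [
    {"age": 23,   "stress": 12,   "sleep": 7},
    {"age": 35,   "stress": None, "sleep": 6},
    {"age": None, "stress": 15,   "sleep": 8},
    {"age": 41,   "stress": 9,    "sleep": 5},
    {"age": 29,   "stress": 11,   "sleep": None},
    {"age": 52,   "stress": 10,   "sleep": 8},
]
variables = ["age", "stress", "sleep"]

def listwise(rows, variables):
    """Keep only cases with complete data on every listed variable."""
    return [r for r in rows if all(r[v] is not None for v in variables)]

def pairwise(rows, x, y):
    """Keep cases that have data on the two variables in this one analysis."""
    return [r for r in rows if r[x] is not None and r[y] is not None]

listwise_n = len(listwise(rows, variables))
pairwise_n = {
    (x, y): len(pairwise(rows, x, y)) for x, y in combinations(variables, 2)
}

print("listwise N for every analysis:", listwise_n)
for pair, n in pairwise_n.items():
    print("pairwise N for", pair, "=", n)
```

With these made-up cases, listwise exclusion leaves 3 participants for every correlation, while pairwise exclusion keeps 4 for each pair of variables.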
What should I never do when managing missing data using SPSS?
*I should NEVER run the ‘replace with mean’ option as it can severely distort the results of my analysis, particularly if I have a lot of missing values
• ‘replace with mean’ calculates the mean value for the variable & gives every missing case this value
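One way to see the distortion is to compare the spread of a variable before and after mean replacement. The scores below are invented, and the sketch only shows one kind of distortion (shrinking variability); it is an illustration, not a reproduction of the SPSS option.

```python
from statistics import mean, stdev

# Hypothetical scores (invented for illustration); None marks a missing value.
scores = [4, 8, None, 12, None, 16, None, 20]

# Scores that were actually observed.
observed = [s for s in scores if s is not None]
observed_mean = mean(observed)

# "Replace with mean": every missing case gets the mean of the observed scores.
imputed = [observed_mean if s is None else s for s in scores]

sd_observed = stdev(observed)
sd_imputed = stdev(imputed)

print(f"mean before: {observed_mean}, mean after: {mean(imputed)}")
print(f"SD before: {sd_observed:.2f}, SD after: {sd_imputed:.2f}")
```

The mean is unchanged, but the standard deviation drops from about 6.32 to about 4.78, because three cases now sit exactly on the mean. The more values are missing, the stronger this artificial shrinkage becomes.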
What do Tabachnick & Fidell advise when dealing with missing data?
The pattern of missing data is more important than the amount of data missing
So at what percentage of missing data do we start to grow concerned?
o If less than 5% of data are missing in a random pattern from a large data set, the problems are less serious & almost any procedure for handling missing values yields similar results
o If, however, a lot of data are missing from a small to moderately sized data set, problems can be very serious
o Unfortunately, there are as yet no firm guidelines for how much missing data can be tolerated for a sample of a given size
How many types of missing data are there?
There are 3 types of missing data:
MCAR – missing completely at random
MAR – missing at random (ignorable responses)
MNAR – missing not at random or not-ignorable
What can you tell me about MCAR?
MCAR – missing completely at random
The distribution of missing data is unpredictable in MCAR
This is the best of all possible worlds if data must be missing
What can you tell me about MAR?
MAR – missing at random (ignorable responses)
The pattern of missing data is predictable from other variables in the data set