Week 3 Data Cleaning (textbooks) Flashcards by Andrea Jones

What is the basic logic used to interpret all inferential statistics?

The basic logic used to interpret all inferential statistics is the logic of significance testing, framed in terms of the null & alternative hypotheses

What do parametric statistics depend on?

Parametric statistics depend on the assumption of normality

What is the first step in the analysis process?

The first step in the analysis process is always to screen the data

Why is data screening the first step in the analysis process?

To ensure accuracy
To check assumptions
After the screening process a decision needs to be made as to which statistical test to use

What are the 4 steps of data screening / data analysis?

Screening the data involves

Checking the accuracy of the data entry
Examining any missing data to assess whether there is evidence that they might be systematic (i.e. caused by something specific, as opposed to being due to random errors)
Looking for outliers
Checking parametric test assumptions (e.g. normality)

Why is it important to inspect my data file for missing data?

I need to check whether any variable has a lot of missing data and, if so, ask why: is it random, or is there a systematic pattern?

How do I inspect my data file for missing data?

o Run “Descriptives” & find out what percentage of values is missing from each of my variables
o SPSS has a Missing Value Analysis (MVA) procedure to help find patterns in missing data
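
Outside SPSS, the "percentage missing per variable" check can be sketched in a few lines of Python. This is only an illustration: the data set, the variable names (age, stress, sleep) and the helper percent_missing are made up, and None stands in for a missing value.

```python
# Hypothetical data set: one dict per participant, None marks a missing value.
rows = [
    {"age": 34, "stress": 12, "sleep": 7},
    {"age": 29, "stress": None, "sleep": 6},
    {"age": None, "stress": 15, "sleep": None},
    {"age": 41, "stress": None, "sleep": 8},
]

def percent_missing(rows):
    """Return the percentage of missing (None) values for each variable."""
    n = len(rows)
    return {
        var: 100.0 * sum(1 for row in rows if row[var] is None) / n
        for var in rows[0]
    }

missing = percent_missing(rows)
for var, pct in missing.items():
    print(f"{var}: {pct:.1f}% missing")
```

A variable that stands out here (stress, in this made-up example) is the one to look at for a systematic pattern.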

Why is caution recommended when managing missing values in my statistical analysis?

I should choose carefully, as the option I pick can have dramatic effects on my results
This is particularly important if I am including a list of variables & repeating the same analysis for all variables (e.g. correlations among a group of variables, t-tests for a series of dependent variables)

What is the best way to manage missing values when I come to do my statistical analysis?

*Pallant strongly recommends the ‘pairwise’ exclusion of missing data
• ‘exclude cases listwise’ includes a case only if it has full data on all the variables.
• ‘exclude cases pairwise’ is better as it excludes the participant only if they are missing data required for a specific analysis. They will still be included in any analyses for which they have the necessary information
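
The difference between the two options shows up in how many cases each analysis keeps. The sketch below uses made-up data (variables a, b, c) and hypothetical helper names; it is not SPSS's implementation, just the counting logic behind it.

```python
from itertools import combinations

# Hypothetical data: None marks a missing value.
rows = [
    {"a": 1.0, "b": 2.0, "c": 3.0},
    {"a": 2.0, "b": None, "c": 1.0},
    {"a": None, "b": 4.0, "c": 2.0},
    {"a": 4.0, "b": 5.0, "c": None},
    {"a": 5.0, "b": 6.0, "c": 6.0},
]
variables = ["a", "b", "c"]

def listwise_n(rows, variables):
    """Count cases with complete data on every variable in the list."""
    return sum(1 for r in rows if all(r[v] is not None for v in variables))

def pairwise_n(rows, variables):
    """Count, for each pair of variables, the cases with data on both."""
    return {
        (x, y): sum(1 for r in rows if r[x] is not None and r[y] is not None)
        for x, y in combinations(variables, 2)
    }

n_listwise = listwise_n(rows, variables)
n_pairwise = pairwise_n(rows, variables)
print("listwise N:", n_listwise)
print("pairwise N:", n_pairwise)
```

With these data, listwise exclusion keeps 2 of the 5 cases for every correlation, while pairwise exclusion keeps 3 cases for each pair.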

What should I never do when managing missing data using SPSS?

*I should NEVER run the ‘replace with mean option’ as it can severely distort the results of my analysis, particularly if I have a lot of missing values
• ‘replace with mean’ calculates the mean value for the variable & gives every missing case this value
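
One way to see the distortion is to compare the spread of a variable before and after mean replacement. The scores below are invented for illustration, with None marking a missing value.

```python
from statistics import mean, stdev

# Hypothetical scores with 3 of 8 values missing.
scores = [10, 12, None, 18, None, 20, None, 14]

observed = [s for s in scores if s is not None]
fill = mean(observed)

# 'Replace with mean': every missing case gets the observed mean.
imputed = [fill if s is None else s for s in scores]

sd_observed = stdev(observed)
sd_imputed = stdev(imputed)
print(f"SD of observed scores: {sd_observed:.2f}")
print(f"SD after mean replacement: {sd_imputed:.2f}")
```

The mean is unchanged, but the standard deviation shrinks because the filled-in cases add no variability. That artificially small spread is what can distort later tests, and the more values are missing, the worse it gets.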

What do Tabachnick & Fidell advise when dealing with missing data?

The pattern of missing data is more important than the amount of data missing

So at what percentage of missing data do we start to grow concerned?

o If less than 5% of data are missing in a random pattern from a large data set, the problems are less serious & almost any procedure for handling missing values yields similar results
o If, however, a lot of data are missing from a small to moderately sized data set, problems can be very serious
o Unfortunately, there are as yet no firm guidelines for how much missing data can be tolerated for a sample of a given size

How many types of missing data are there?

There are 3 types of missing data:

MCAR – missing completely at random
MAR – missing at random (ignorable nonresponse)
MNAR – missing not at random or not-ignorable

What can you tell me about MCAR?

MCAR – missing completely at random
The distribution of missing data is unpredictable in MCAR
This is the best of all possible worlds if data must be missing

What can you tell me about MAR?

MAR – missing at random (ignorable nonresponse)

The pattern of missing data is predictable from other variables in the data set

What can you tell me about MNAR?

MNAR – missing not at random or not-ignorable

The missingness is related to the variable itself & therefore cannot be ignored

What are outliers & how can you identify them?

Outliers are extreme scores; identify them via box plots or by eyeballing histograms
o If possible outliers are spotted: standardise the variable by converting the raw scores to z scores
o Z scores in excess of +/- 3.29 (p < .001) may be regarded as outliers
o When N is large (large sample size) a few extreme scores are expected
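
The z-score check can be sketched as follows. The scores are invented, the cut-off of 3.29 comes from the card above, and the sample standard deviation (n - 1 in the denominator) is an assumption about how the z scores are computed.

```python
from statistics import mean, stdev

# Hypothetical scores: 19 values near 50 plus one extreme score.
scores = [48, 52, 50, 49, 51, 47, 53, 50, 49, 51,
          50, 48, 52, 50, 49, 51, 50, 47, 53, 120]
CUTOFF = 3.29  # |z| above this corresponds to p < .001

m = mean(scores)
sd = stdev(scores)  # sample SD; an assumption, not necessarily what SPSS reports

# Convert every raw score to a z score and flag the extreme ones.
z_scores = [(x - m) / sd for x in scores]
outliers = [x for x, z in zip(scores, z_scores) if abs(z) > CUTOFF]
print("flagged as outliers:", outliers)
```

The example uses 20 scores on purpose: with the sample SD, a z score cannot exceed (n - 1) / sqrt(n), so with about a dozen scores or fewer nothing can reach 3.29 and box plots or histograms are the more useful check.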

Why are outliers a problem?

Outliers are a particular problem for parametric tests
Outliers are extreme scores which distort the results
Outliers drag the mean towards themselves and exert ‘more than their fair share’ of influence over the mean & the results of statistical tests
Outliers can be the cause of non-normality
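
A small made-up example of the "drag" on the mean: adding a single extreme score of 45 to five scores between 19 and 23 shifts the mean far more than the median.

```python
from statistics import mean, median

# Hypothetical scores, without and with one extreme value.
typical = [19, 20, 21, 22, 23]
with_outlier = typical + [45]

mean_shift = mean(with_outlier) - mean(typical)
median_shift = median(with_outlier) - median(typical)
print(f"mean moves by {mean_shift}, median moves by {median_shift}")
```

Because parametric tests are built on means and variances, this shift carries through to their results.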

What is the temptation when dealing with missing data?

• It is tempting to assume data are missing at random; however, we can test this & we should
o It is possible to create a dummy variable with 2 groups (cases with missing values & cases with non-missing values) & perform a test of mean differences between the 2 groups
o If there are no differences, decisions about how to handle missing data are not so critical
o If there are significant differences, & η² is substantial, care is needed to preserve the cases with missing values for other analyses
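
The dummy-variable check can be sketched like this. Everything here is illustrative: the variables (income, anxiety) and the data are invented, the test is a pooled-variance independent-samples t test, and η² is computed as t² / (t² + df). A p value would additionally need a t distribution (e.g. from a statistics package), which this sketch leaves out.

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical cases: (income, anxiety); None marks a missing income value.
cases = [
    (52, 10), (48, 12), (None, 19), (60, 9), (None, 21),
    (55, 11), (None, 18), (47, 13), (58, 10), (None, 20),
]

# Dummy grouping: anxiety scores for cases with and without missing income.
missing_group = [anx for inc, anx in cases if inc is None]
present_group = [anx for inc, anx in cases if inc is not None]

def pooled_t(a, b):
    """Independent-samples t statistic with pooled variance, plus its df."""
    n1, n2 = len(a), len(b)
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / sqrt(pooled_var * (1 / n1 + 1 / n2))
    return t, df

t, df = pooled_t(missing_group, present_group)
eta_squared = t**2 / (t**2 + df)
print(f"t({df}) = {t:.2f}, eta squared = {eta_squared:.2f}")
```

In this made-up data set the cases with missing income have much higher anxiety scores, so the missingness is clearly not random and the handling decision matters.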

What options does SPSS offer us to deal with Missing Data?

SPSS MVA (Missing Values Analysis) is specifically designed to highlight patterns of missing data as well as to replace them in a data set
The resulting Little's MCAR test tells us what to do next

What does Little's MCAR test establish?

• Little’s MCAR test establishes whether the data are missing completely at random
o A statistically nonsignificant result is desired
o MCAR can be inferred where p > .05

What are the results we do not wish to see from a Little's MCAR test?

o MAR can be inferred if the MCAR test is statistically significant but missingness is predictable from variables (other than the DV); this is established using the ‘Separate Variances t Tests’
o MNAR is inferred if the t test shows that missingness is related to the DV

What are some of the options available to us when dealing with outliers?

o Delete them, but this changes the nature of the sample (only delete if it’s believed the score is not really from the population of interest, e.g. below average IQ when intended population is average IQ)
o Change outliers so they are only one unit more extreme than the next most extreme score (e.g. change from 45 to 24 where next most extreme score is 23). This ensures the outlier still exerts influence, but no more than its fair share
o Can transform the data or use nonparametric statistics (as these are based on ranks, and thus are not affected by extreme scores, and do not require normality)
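
The second option (pulling an outlier in to one unit beyond the next most extreme score) can be sketched as below, using the 45 to 24 example from the card. The helper name pull_in_high_outlier is hypothetical, and it only handles a single high outlier that has already been identified.

```python
def pull_in_high_outlier(scores, outlier):
    """Replace a high outlier with one unit above the next most extreme score."""
    others = [s for s in scores if s != outlier]
    next_most_extreme = max(others)
    return [next_most_extreme + 1 if s == outlier else s for s in scores]

# Hypothetical scores: 45 is the outlier, 23 the next most extreme score.
scores = [19, 20, 21, 22, 23, 45]
adjusted = pull_in_high_outlier(scores, 45)
print(adjusted)
```

The adjusted score is still the highest in the sample, so the case keeps its rank but no longer dominates the mean.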

Once we have dealt with our outliers, what does that allow us?

Once outliers are dealt with, the data may meet the normality assumption, so parametric tests become possible

If we have treated our data, what should we do when writing our results?

It is essential to report succinctly any treatment of outliers in the results section

What is the second step in the analysis process (yep, all that was just the first!)?

The second step in the analysis process is to decide which statistical test to use

In what circumstances would you use non-parametric tests?

Use non-parametric tests when:
• The data (scores on the DV) are ordinal rather than interval or ratio
• Sample sizes are small (<10), where it can be virtually impossible to conduct accurate assumption checks
• Initial data screening indicates that the parametric assumptions, particularly normality, are violated
• Sample sizes are small & unequal
Non-parametric tests are an alternative to dealing with outliers as they are not distorted by them; however, non-parametric tests are generally less powerful than parametric tests