Normality, Bias & Transformations Flashcards
Assumptions of parametric tests
Parametric tests based on the normal distribution assume: additivity & linearity, normality, homogeneity of variance & independence
Normal distribution
Relevant to parameter estimates, confidence intervals around a parameter & null hypothesis significance testing
This assumption tends to get incorrectly translated as 'your data need to be normally distributed', but that isn't the full story
When does the assumption of normality really matter
Small samples (the central limit theorem lets us stop worrying about this assumption in larger samples)
As long as the sample is fairly large, outliers are a greater concern than normality
Once you have around 50 participants, the sampling distribution tends to look normal even if the raw data aren't
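A minimal Python sketch of the central limit theorem point above (the exponential population and sample size of 50 are illustrative assumptions, not from the notes): even when raw scores are heavily skewed, the means of repeated samples pile up in a roughly normal shape.

```python
import numpy as np

rng = np.random.default_rng(42)

# Draw many samples of n = 50 from a heavily skewed (exponential) population
sample_means = [rng.exponential(scale=2.0, size=50).mean() for _ in range(10_000)]

# The raw scores are skewed, but the distribution of the sample means is
# close to normal, which is why normality matters less in larger samples
print(f"mean of sample means: {np.mean(sample_means):.2f}")  # close to the population mean of 2.0
print(f"SD of sample means:   {np.std(sample_means):.2f}")   # close to 2 / sqrt(50), about 0.28
```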
Parametric testing & outliers
Most parametric tests estimate parameters using statistics like the mean & SD, which means they are heavily biased by outliers
EDA takes account of outliers by using robust methods (see the sketch below), and emphasises visualising & studying the data on its own terms to see what's actually going on
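A minimal sketch of the robust-methods idea (hypothetical scores, assuming NumPy and SciPy are available): the median and trimmed mean barely move when one outlier drags the ordinary mean upwards.

```python
import numpy as np
from scipy import stats

# Hypothetical reaction-time scores with one extreme outlier
scores = np.array([320, 335, 340, 355, 360, 365, 370, 2400])

print(f"mean:             {scores.mean():.1f}")                 # pulled up by the outlier
print(f"median:           {np.median(scores):.1f}")             # barely affected
print(f"20% trimmed mean: {stats.trim_mean(scores, 0.2):.1f}")  # drops the extreme 20% from each tail
```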
Homogeneity of variance/homoscedasticity
When testing several groups of participants, the samples should come from populations with the same variance
Can affect parameters & null hypothesis significance testing
In correlational designs, the variance of the outcome variable should be stable at all levels of the predictor variable
Assessing homoscedasticity/ homogeneity of variance
1) Levene's test = tests whether the variances in the different groups are equal (a significant result means the variances are not equal)
2) Variance ratio = for 2 or more groups, VR = largest variance / smallest variance; if VR < 2, homogeneity can be assumed
3) Graphs (a sketch of checks 1 & 2 follows below)
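A minimal sketch of checks 1 & 2 (hypothetical group scores; assumes NumPy and SciPy):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical scores for three groups of participants
group_a = rng.normal(50, 10, size=30)
group_b = rng.normal(52, 12, size=30)
group_c = rng.normal(55, 11, size=30)

# 1) Levene's test: a significant p (< .05) suggests the variances differ
stat, p = stats.levene(group_a, group_b, group_c, center="median")
print(f"Levene's test: W = {stat:.2f}, p = {p:.3f}")

# 2) Variance ratio: largest variance / smallest variance; < 2 -> assume homogeneity
variances = [np.var(g, ddof=1) for g in (group_a, group_b, group_c)]
print(f"Variance ratio = {max(variances) / min(variances):.2f}")
```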
Ways of spotting normality
1) Kolmogorov-Smirnov / Shapiro-Wilk tests = test whether the data differ from a normal distribution (a significant result means non-normal data); see the sketch after this list
2) Graphical displays = P-P plot (normal if the points fall close to the diagonal line), histogram, stem & leaf plot
3) Values of skew/kurtosis = both will be 0 in a normal distribution
Don't need to worry about these checks in large samples (hundreds) because, thanks to the central limit theorem, analyses are more robust to violations of normality
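A minimal sketch of check 1 (hypothetical test scores, assuming SciPy; the P-P plot pointer to statsmodels is included only as a comment):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
data = rng.normal(loc=100, scale=15, size=80)   # hypothetical test scores

# 1) Shapiro-Wilk: a significant p (< .05) suggests the data deviate from normality
w, p = stats.shapiro(data)
print(f"Shapiro-Wilk: W = {w:.3f}, p = {p:.3f}")

# Kolmogorov-Smirnov against a normal distribution fitted to the data
# (estimating the parameters from the data makes this p-value only approximate)
d, p_ks = stats.kstest(data, "norm", args=(data.mean(), data.std(ddof=1)))
print(f"K-S: D = {d:.3f}, p = {p_ks:.3f}")

# 2) A P-P plot can be drawn with statsmodels, e.g.
#    statsmodels.graphics.gofplots.ProbPlot(data).ppplot(line="45")
```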
Kurtosis
A measure of how heavy the tails of the distribution are (how much scores cluster in the tails relative to a normal distribution)
You want your skewness/kurtosis values to lie between -1 and 1
A value of 0 suggests the data are normally distributed
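A minimal sketch (hypothetical skewed scores, assuming SciPy) of computing skew and kurtosis and applying the rule of thumb from this card:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
data = rng.exponential(scale=2.0, size=200)   # hypothetical positively skewed scores

skew = stats.skew(data)
kurt = stats.kurtosis(data)   # Fisher's definition: 0 for a normal distribution

print(f"skew = {skew:.2f}, kurtosis = {kurt:.2f}")
print("roughly normal" if abs(skew) < 1 and abs(kurt) < 1 else "deviates from normal")
```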
Transforming data
1) Log transformation = reduces positive skew; usually the natural log (ln), so scores must be greater than 0
2) Square-root transformation = reduces positive skew; useful for stabilising variance
3) Reciprocal transformation = dividing 1 by each score (1/x) reduces the impact of large scores, but it also reverses the order of the scores; this can be avoided by reverse-scoring before transforming
Worth trying a few different transformations and choosing the one that looks best (see the sketch below)
Make sure the transformation is worthwhile
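A minimal sketch (hypothetical positively skewed scores, assuming NumPy and SciPy) comparing how the three transformations change skew, so the best-looking one can be chosen:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
scores = rng.exponential(scale=3.0, size=100) + 1   # hypothetical positively skewed scores (> 0)

transformed = {
    "raw": scores,
    "log (ln)": np.log(scores),      # 1) log transformation (scores must be > 0)
    "square root": np.sqrt(scores),  # 2) square-root transformation (scores must be >= 0)
    "reciprocal": 1 / scores,        # 3) reciprocal transformation (reverses score order)
}

for name, x in transformed.items():
    print(f"{name:12s} skew = {stats.skew(x):.2f}")
```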
Cautions against transforming
Transformation can hinder the accuracy of the F statistic
Transforming the data changes the hypothesis being tested
In small samples it is difficult to determine normality one way or the other
The consequences for the statistical model of applying the 'wrong' transformation could be worse than the consequences of analysing the untransformed scores