lecture 12 - between-subjects t-tests: assumptions, effect sizes, confidence intervals
Assumptions of a within-participants t-test: the short version
- Random and independent samples
- Normally distributed “something or other” (Field)
- Formally it’s that the sampling distribution of the means (the mean difference scores) is approximately normal….
- If n is large (n>30 or so) this is very likely to be reasonably true (thanks to the central limit theorem)
If n is small, then look at distribution of the data themselves (e.g. a histogram). If it looks fairly normal, you’re probably ok (unless people’s lives are at stake…). But if not (e.g. it’s asymmetric or not very “mound-shaped”) … worry…. Worry more for really small n….
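For small n, the eyeballing above can be supplemented with a formal normality check. A minimal sketch in Python, using randomly generated data purely for illustration, with scipy's Shapiro-Wilk test (one standard option; Q-Q plots, covered below, are another):

```python
# Sketch: checking normality for a small sample before a t-test.
# The data below are randomly generated, purely for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
scores = rng.normal(loc=5.0, scale=2.0, size=12)  # small n, so check the data

# Shapiro-Wilk test: a small p-value (< .05, say) suggests non-normality.
w, p = stats.shapiro(scores)
print(f"Shapiro-Wilk W = {w:.3f}, p = {p:.3f}")
```

A histogram of `scores` alongside this test is still worth looking at; a single p-value shouldn't replace visual inspection.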
Additional assumption of a between-participants t-test
- Equal population variances
- In practice, look to see if the variances of the two samples are about equal…
If you are uncertain if you have satisfied these assumptions …worry … think hard, check a textbook e.g. Field, etc.
If you haven’t satisfied these assumptions, don’t pretend you have….
find an alternative!!!! (Field gives lots).
A final worry/”assumption”: Check your data for outliers, e.g. extreme data points that are a long way from most of the data. Think hard if you’ve got extreme outliers…. worry….
normality assumption
Q-Q plots tell you how much data you should have at a particular value vs. how much data you do have
Field gives alternatives for when the normality assumption is suspect, particularly in terms of what are called “robust” tests. Essentially what these tests do is relax the assumption of normality by doing a lot of computations based on the assumption that the population distributions look like the (presumably nonnormal-looking) sample distributions
equal variances assumption
if the n’s of the two groups are about the same, you don’t need to worry
if the ratio of variances is less than two, probably ok
Levene’s test is relevant, e.g. if the research question is fundamentally about assessing a potential difference in the variances of two conditions. For example, a researcher might be interested in comparing two different antidepressant medications to see if the therapeutic effect of the one medication is more consistent (less variable) than the other (even though their mean therapeutic effects are the same).
can report the value from SPSS for “equal variances not assumed”
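The two checks above can be sketched in Python with scipy: Levene's test for equal variances, and Welch's t-test (the SPSS "equal variances not assumed" row) as the fallback. The group data are invented for illustration:

```python
# Sketch: checking equal variances with Levene's test, and using
# Welch's t-test ("equal variances not assumed") when in doubt.
# The two groups below are made-up numbers.
import numpy as np
from scipy import stats

group1 = np.array([7.0, 5.0, 8.0, 6.0, 9.0, 7.0])
group2 = np.array([4.0, 3.0, 6.0, 5.0, 4.0, 2.0])

# Levene's test: a small p-value suggests the population variances differ.
stat, p_lev = stats.levene(group1, group2)

# Welch's t-test does not assume equal variances.
t_welch, p_welch = stats.ttest_ind(group1, group2, equal_var=False)
print(f"Levene p = {p_lev:.3f}; Welch t = {t_welch:.3f}, p = {p_welch:.3f}")
```

With equal n's the Welch result is close to the ordinary pooled t-test anyway, consistent with the "equal n's, don't worry" rule above.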
between-subjects effect size - how large is the effect relative to the variability?
cohen’s d̂ = (Y-bar1 - Y-bar2) / √(sp²)
cohen’s rule of thumb:
- small effect: d = 0.2
- medium effect: d = 0.5
- large effect: d = 0.8
Significance and effect size are different: between-subjects t-test and effect size
t = (Y-bar1 - Y-bar2) / √(sp²/N1 + sp²/N2)
cohen’s d̂ = (Y-bar1 - Y-bar2) / √(sp²)
The top part, the difference between the means, is the same for both.
But the bottom part is different: t involves dividing by the number of participants in each condition, while d̂ doesn’t.
So significance is related to sample size.
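A minimal sketch of this point, computing both statistics from summary values (the means, pooled variance, and n's below are made up): d stays the same while t grows with sample size.

```python
import math

def t_and_d(mean1, mean2, sp2, n1, n2):
    """Between-subjects t and Cohen's d from summary statistics.

    sp2 is the pooled variance; formulas follow the two equations above.
    """
    diff = mean1 - mean2
    t = diff / math.sqrt(sp2 / n1 + sp2 / n2)
    d = diff / math.sqrt(sp2)
    return t, d

# Same means and pooled variance, two different sample sizes (made-up numbers):
t_small, d_small = t_and_d(7.0, 5.0, 4.0, 10, 10)
t_big, d_big = t_and_d(7.0, 5.0, 4.0, 100, 100)
# d is identical in both cases; only t grows with n.
print(f"n=10:  t = {t_small:.3f}, d = {d_small:.3f}")
print(f"n=100: t = {t_big:.3f}, d = {d_big:.3f}")
```

With these numbers d = 1.0 either way, but t goes from about 2.24 to about 7.07: the larger study is "more significant" without the effect being any bigger.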
two-condition between-subjects confidence intervals: seeing statistical significance
if the overlap between the two sets of error bars is more than half a bar, the results won’t be significant at p < 0.05, if n’s are equal….
But if the overlap is less than half a bar, the results will be significant at p < 0.05, if n’s are equal
It happens in this particular example that the error bars based on the standard deviation are very similar to those based on the 95% confidence interval, in that neither quite includes zero. That happens to be the case here, but it is definitely not true in general. A standard deviation and a confidence interval are specified differently and imply different things.
Note: A small-sample 95% confidence interval, x-bar ± t(n-1)·SE, is similar to but not the same as the interval based on the actual calculated t value, because t = 2.825 corresponds to p = 0.037, not to p = 0.05. So plugging t = 2.825 into the formula for the 95% confidence interval doesn’t exactly duplicate the confidence interval that SPSS reports (0.4476 to 9.55025), because that interval is based on the value of t that exactly corresponds to p = 0.05, two-tailed.
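A sketch of that distinction using scipy, with the t = 2.825 and df = 5 values from the example above (the confidence limits themselves also depend on the SE, which isn't recomputed here):

```python
# Sketch: the critical t used by a 95% CI vs. the observed t.
# t_obs = 2.825 and df = 5 come from the example in the notes.
from scipy import stats

df = 5          # n - 1 for n = 6 paired differences
t_obs = 2.825   # observed t from the example

t_crit = stats.t.ppf(0.975, df)    # critical t for a 95% CI, two-tailed
p_obs = 2 * stats.t.sf(t_obs, df)  # two-tailed p for the observed t

print(f"t_crit = {t_crit:.3f}, p for observed t = {p_obs:.3f}")
# t_crit (about 2.571) differs from t_obs (2.825), which is why plugging
# the observed t into x-bar +/- t*SE doesn't reproduce SPSS's 95% CI.
```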
null hypothesis significance testing
Null hypothesis: e.g. no difference between conditions
Significance testing: Assess the probability of a statistic like the t statistic from the perspective of assuming the null hypothesis is true.
Outcome of significance testing:
Not significant
fail to reject the null hypothesis
It might be true, or there is a real effect but the study missed it.
Significant
reject the null hypothesis
The observed statistic is unlikely according to the null hypothesis (in the tail of the distribution).
Limitations of Null Hypothesis Significance Testing (NHST) 1
Possible reasons for the nonsignificant result:
There is no difference in reality
(on average people like tulips and roses the same)
OR
There is a difference in reality
(people on average really do like tulips better than roses),
but your experiment didn’t detect it, e.g. because it had low “power” (or you were unlucky….).
Possible solutions: formally assess power, the probability of detecting an effect of a given size if it exists (a future lecture), or do Bayesian statistics.
Limitations of Null Hypothesis Significance Testing (NHST) 2
- Categorical thinking (Field’s example): Why should p = 0.0499 lead to a substantially different conclusion (“significant!”) than p = 0.0501 (“not significant ”)?!
Original conclusion: “Sig O happiness was significantly affected by flower type in a within-subjects design, that is, happiness was significantly higher for tulips than roses, t(5) = 2.825, p = 0.037”, two-tailed.
But suppose we had: “Sig O happiness was not significantly affected by flower type in a within-subjects design, that is, happiness was not significantly higher for tulips than roses, t(5) = 2.344, p = 0.066”, two-tailed.
Had the first tulips-condition number for the first participant been slightly smaller (18 rather than the actual 23), then p = 0.066. But would that data really support a completely different research conclusion?
This is one reason why visualising data and forming an intuitive conclusion is so important.
Data visualisation helps avoid this NHST problem with categorical thinking
Limitations of Null Hypothesis Significance Testing (NHST) 3
- A “significant” effect does not tell you how big or important the effect is:
A large sample can make a tiny difference “significant” even though it’s practically uninteresting/not useful.
A small sample (that is significant) might indicate a large and/or important effect.
Reporting effect sizes helps to address this problem, but even they need to be interpreted in terms of the practical implications of the research.
Bayesian statistics
There are “Bayesian” alternatives for most statistical tests (in SPSS, etc.), e.g. Bayesian between-subjects t-tests.
A key potential advantage of these tests is that they can sometimes “support the null hypothesis” in a way NHST can’t.
They are also less susceptible to NHST’s categorical thinking problem.
How do they work?
They require the researcher to provide more information in terms of their prior beliefs about the ways the world might be, particularly for the alternative hypothesis…..
assumptions of the t-test
both the independent t-test and paired samples t-test are parametric tests, so are prone to sources of bias
for paired samples t-test the assumption of normality relates to the sampling distribution of the differences between scores, not the scores themselves.
there are variants of these tests that overcome all of the potential problems
effect size for two independent means
a t-statistic that is not statistically significant doesn’t mean the effect is unimportant, so we quantify the effect with an effect size.
we can convert a t-value into an r-value using the equation
r = √(t^2 / (t^2 + df))
the effect can be non-significant but represent a fairly substantial effect
we could also compute cohen’s d using the two means and the standard deviation of the control group
the answer is the difference between the two group means expressed in standard deviations
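A minimal sketch of the t-to-r conversion above (the t and df values are made up; note the resulting r is sizeable even though t = 1.9 with df = 10 is not significant at .05, illustrating the "non-significant but substantial" point):

```python
import math

def r_from_t(t, df):
    """Convert a t statistic into an effect size r: r = sqrt(t^2 / (t^2 + df))."""
    return math.sqrt(t**2 / (t**2 + df))

# Made-up example: t = 1.9 with df = 10 is not significant at .05,
# yet the corresponding r is around .5 (a large effect by Cohen's rule of thumb).
r = r_from_t(1.9, 10)
print(f"r = {r:.3f}")
```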
the independent t-test
it compares two means, when those means have come from different groups of entities
if the value in the Sig. column is less than 0.05, the means of the two groups are significantly different
ignore the columns for Levene’s test for equality of variances and look at the row in the table labelled “equal variances not assumed”
overview of assumptions
assumptions = a condition that ensures that what you’re attempting to do works
if the assumptions of a test statistic are true, we can take the test statistic at face value; but if they are not true, the test statistic and p-value will be inaccurate and lead us to the wrong conclusion
statistical procedures look like unique tests with idiosyncratic assumptions, but most of these procedures are variations of the linear model, so they share a common set of assumptions.
these assumptions relate to the quality of the model itself and the test statistics used to assess it (usually parametric tests based on normal distribution). main assumptions are:
- additivity and linearity
- normality of something or other
- homoscedasticity/homogeneity of variance
- independence