Lecture 11 - Significance and Power Flashcards
Basic principles of Null Hypothesis Significance Testing
Assume H0 is true -> fit a model to the data and get a test statistic -> calculate the probability (p) of getting a test statistic at least that extreme, assuming H0 is true
Get test statistic by comparing amount of ‘signal’ to ‘noise’, or ‘systematic variation’ to ‘unsystematic variation’, or ‘effect’ to ‘error’
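As a minimal sketch of this signal-to-noise idea (assuming Python with SciPy is available; the scores are made up), an independent-samples t-test gives a t statistic (systematic vs unsystematic variation) and a p-value (probability of a statistic at least that extreme if H0 were true):

```python
# Minimal sketch of NHST with made-up data (assumes scipy is installed).
from scipy import stats

group_a = [5.1, 6.2, 5.8, 7.0, 6.5, 5.9]   # hypothetical scores, condition A
group_b = [4.2, 5.0, 4.8, 5.5, 4.9, 5.1]   # hypothetical scores, condition B

# t is the 'signal to noise' ratio; p is the probability of a t at least
# this extreme if the null hypothesis (no difference) were true.
t, p = stats.ttest_ind(group_a, group_b)
print(f"t = {t:.2f}, p = {p:.3f}")
```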
The misuse of NHST
Sometimes misused intentionally, sometimes unintentionally
The American Statistical Association (2016) outlined principles on the misuse of p values in significance testing, including:
(1) p-values do not measure the probability that the results occurred by chance, or the probability that a specific hypothesis is true; a p-value is the probability of obtaining the observed statistic (or one more extreme) assuming the null hypothesis is true
(2) Statistical significance is not the same as practical importance
(3) The p-value alone is not a good measure of evidence regarding a model or hypothesis
Type I and Type II errors
Null hypothesis true in reality (population) and supported by the experimental result (sample) = correct decision (true negative)
Null hypothesis true in reality (population) but alternative hypothesis supported by the experimental result = type I error, α (false positive)
Alternative hypothesis true in the population, but null hypothesis supported by the experimental result = type II error, β (false negative)
Alternative hypothesis true in the population and supported by the sample = correct decision (true positive)
Power
A p-value doesn’t tell you how likely it is that you have found a genuine effect (it is just the probability of obtaining a particular test statistic under H0)
The probability of finding an effect assuming one exists in the population
Calculated as 1-β
β is the probability of not finding the effect, and is conventionally set at 0.2 (Cohen, 1992)
We have an 80% chance of detecting an effect assuming it genuinely exists
Factors affecting power
Knowing any three of power, alpha level, effect size, and sample size means you can calculate the fourth
E.g. if you know the level of power you want, the alpha level and the effect size, you can work out how many participants are needed
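As a sketch of this calculation, statsmodels’ power routines (an alternative to the G*Power program mentioned later; the effect sizes below are illustrative) can solve for the required sample size per group, which also shows that smaller effects need more participants:

```python
# A priori sample-size calculation for an independent-samples t-test
# (assumes statsmodels is installed; effect sizes are illustrative).
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for d in (0.2, 0.5, 0.8):   # small, medium, large effects (Cohen's d)
    n = analysis.solve_power(effect_size=d, alpha=0.05, power=0.80)
    print(f"d = {d}: about {n:.0f} participants per group")
```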
Effect size
An objective and standardised measure of the magnitude of an effect
Larger value = bigger effect
The measure used depends on the test conducted (Cohen’s d for t-tests, Pearson’s r for correlation, partial eta squared for ANOVA)
The American Statistical Association (2016) recommends reporting this in results sections of reports
Look at effect sizes reported in previous research to estimate how many participants are needed
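As a small sketch (made-up data), Cohen’s d for two independent groups can be computed by hand as the mean difference divided by the pooled standard deviation:

```python
# Computing Cohen's d by hand for two independent groups (made-up data).
import numpy as np

group_a = np.array([5.1, 6.2, 5.8, 7.0, 6.5, 5.9])
group_b = np.array([4.2, 5.0, 4.8, 5.5, 4.9, 5.1])

# Pooled standard deviation (using sample variances, ddof=1)
n1, n2 = len(group_a), len(group_b)
pooled_sd = np.sqrt(((n1 - 1) * group_a.var(ddof=1) +
                     (n2 - 1) * group_b.var(ddof=1)) / (n1 + n2 - 2))

d = (group_a.mean() - group_b.mean()) / pooled_sd
print(f"Cohen's d = {d:.2f}")
```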
Number of participants
Rule of thumb: more participants = more ‘signal’ and less ‘noise’, because there is less room for sampling error
Should choose sample size depending on expected effect size
Larger effect size = fewer participants needed to detect a ‘real’ effect
Smaller effect size = more participants needed to detect a ‘real’ effect
Alpha level
Size of alpha = probability of obtaining a type I error
Compare the p-value to this alpha criterion when testing significance
If the p-value is less than the alpha criterion, the result is significant
E.g. if you set alpha at .05 and run the study 100 times when H0 is true, you would expect to make a type I error about 5 times (see the simulation sketch below)
Trade-off: decreasing the type I error rate (lowering alpha) increases the type II error rate, and vice versa
Choice of alpha depends on specific research area/previous research
Many studies/research use .05
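A quick simulation sketch of the long-run type I error rate (illustrative parameters; both groups are drawn from the same population so H0 is true, and many repetitions are used so the rate is visible):

```python
# Simulating the type I error rate when the null hypothesis is true
# (illustrative parameters; assumes numpy and scipy are installed).
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
alpha = 0.05
n_studies = 10_000
false_positives = 0

for _ in range(n_studies):
    # Both groups drawn from the SAME population, so H0 is true.
    a = rng.normal(loc=0, scale=1, size=30)
    b = rng.normal(loc=0, scale=1, size=30)
    if stats.ttest_ind(a, b).pvalue < alpha:
        false_positives += 1

print(f"Type I error rate ≈ {false_positives / n_studies:.3f}")  # close to alpha
```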
Other factors
Variability, design, test choice
Problems with alpha testing
If we run multiple tests, the rate at which we make a type I error increases; this is known as the familywise error rate
Can account for this by limiting the number of tests, or by using corrections such as the Bonferroni correction
The Bonferroni correction divides the original alpha criterion by the number of comparisons we want to make
But this reduces statistical power
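A small sketch of how the familywise error rate grows with the number of tests and how the Bonferroni correction shrinks the per-test alpha (the formula 1 − (1 − α)^k assumes independent tests and is a standard result, not stated in the lecture):

```python
# Familywise error rate for k independent tests, and the Bonferroni-corrected
# per-test alpha (pure Python; alpha and k values are illustrative).
alpha = 0.05

for k in (1, 3, 5, 10):
    familywise = 1 - (1 - alpha) ** k       # chance of at least one type I error
    bonferroni_alpha = alpha / k            # corrected criterion per test
    print(f"{k} tests: familywise error = {familywise:.3f}, "
          f"Bonferroni alpha per test = {bonferroni_alpha:.4f}")
```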
One-tailed test
We hypothesise there will be a difference in scores, and we’re specific about which score will be higher (α=.05 at one end). Directional hypothesis.
Two-tailed test
We hypothesise there will be a difference in scores, but this could be in either direction (α = .05 split into .025 in each tail). Non-directional hypothesis.
This impacts data interpretation because, for a one-tailed hypothesis/test, our p-value is half of the two-tailed p-value (provided the effect is in the predicted direction; see the sketch below)
It is argued that a one-tailed test is more powerful, as it is more likely to detect a significant effect in the predicted direction
Why does our p-value change?
Problem with one-tailed tests: if you obtain a significant test statistic but the effect is in the opposite direction to the one predicted, you must not reject the null hypothesis (this can encourage cheating)
Two-tailed hypothesis = assesses likelihood in both directions
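A minimal sketch (made-up data, assuming SciPy) showing that when the effect is in the predicted direction the one-tailed p-value is half the two-tailed p-value:

```python
# One-tailed vs two-tailed p-values on the same made-up data (assumes scipy).
from scipy import stats

group_a = [5.1, 6.2, 5.8, 7.0, 6.5, 5.9]   # predicted to score higher
group_b = [4.2, 5.0, 4.8, 5.5, 4.9, 5.1]

two_tailed = stats.ttest_ind(group_a, group_b, alternative='two-sided').pvalue
one_tailed = stats.ttest_ind(group_a, group_b, alternative='greater').pvalue

# When the difference lies in the predicted direction,
# the one-tailed p is half the two-tailed p.
print(f"two-tailed p = {two_tailed:.4f}, one-tailed p = {one_tailed:.4f}")
```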
Which type of test do I run?
One-tailed tests are more powerful because the whole of α (.05) sits in the predicted tail, so the critical value in that direction is easier to exceed
This means a lower likelihood of making a type II error for effects in the predicted direction
However, there are several caveats and considerations to this
Can allow people to cheat in research and analysis (p-hacking)
In most cases, it is recommended that you run a two-tailed test so that results can be interpreted whichever direction they go in
Power and study design
Within-subjects (repeated-measures) designs are generally more powerful than between-subjects designs
But this depends on the type of study being conducted
Why is statistical power an important concept in inferential statistics?
Might want to do two things:
Calculate the power we have obtained in a study post hoc (using the number of participants, effect size and alpha level)
Calculate how many participants we need to collect for a study a priori (this can be done using statistics programs such as G*Power; see the sketch below)
Cannot calculate power using SPSS
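Both calculations can also be sketched with statsmodels’ power routines (an alternative to G*Power; the effect size, sample size and alpha below are illustrative):

```python
# Post-hoc power and a priori sample size for an independent-samples t-test
# (assumes statsmodels is installed; effect size, n and alpha are illustrative).
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Post hoc: power achieved with 30 participants per group, d = 0.5, alpha = .05
achieved_power = analysis.solve_power(effect_size=0.5, nobs1=30, alpha=0.05)
print(f"Achieved power: {achieved_power:.2f}")

# A priori: participants per group needed for 80% power at d = 0.5
required_n = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.80)
print(f"Required n per group: {required_n:.0f}")
```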