lecture 1 - effect size and power Flashcards
what is null hypothesis significance testing
a method of statistical inference by which an experimental factor is tested against a hypothesis of no effect or no relationship based on a given observation
NHST is a statistical method for testing whether there is enough evidence in a data sample to infer that a particular condition or effect exists in the larger population. It is a way to decide between two competing hypotheses: the null and the alternative.
what is
-null hypothesis
-alternative hypothesis
Null Hypothesis (H0): This is the default assumption or claim that there is no effect, difference, or relationship in the population. For example:
“There is no difference in mean scores between two groups.”
Alternative Hypothesis (H1): This is the competing claim that there is an effect, difference, or relationship. For example:
“The mean score of group A is greater than that of group B.”
what is the rationale for null hypothesis significance testing
Researcher has a research question
Formulates a null hypothesis (there is no effect) and an alternative
hypothesis (there is an effect).
Collects data (sample from population)
type 2 error
-there is a real difference, but you fail to detect it
if the data:
-provides
-does not provide
evidence against the null hypothesis
If the data provide sufficient evidence against the null hypothesis:
◼ Rejects the null hypothesis
◼ Adopts the alternative hypothesis instead
If the data do not provide sufficient evidence against the null hypothesis:
◼ Fails to reject the null hypothesis
◼ But this does not necessarily mean that the null hypothesis is true.
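A minimal sketch of this decision rule, assuming made-up group data and the conventional α = 0.05 (scipy's independent-samples t-test stands in for whatever test is appropriate):

```python
# Minimal sketch of the NHST decision rule with an independent-samples t-test.
# The group data and alpha = 0.05 are illustrative, not from the lecture.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
group_a = rng.normal(loc=102, scale=15, size=30)  # hypothetical sample A
group_b = rng.normal(loc=100, scale=15, size=30)  # hypothetical sample B

t_stat, p_value = stats.ttest_ind(group_a, group_b)

alpha = 0.05
if p_value < alpha:
    print(f"p = {p_value:.3f} < {alpha}: reject H0 and adopt the alternative")
else:
    print(f"p = {p_value:.3f} >= {alpha}: fail to reject H0 (this does not prove H0 is true)")
```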
problem with the null hypothesis
-why the null hypothesis is unrealistic in the real world: it is practically impossible for two groups to have exactly the same score
-A null hypothesis of H0: μa − μb = 0 is a hypothetical construct
in the real world, it’s almost impossible for two groups to have exactly the same score. There will always be some tiny differences because of random chance or natural variation.
A non-significant result should never be interpreted as ‘no difference
between means’ or ‘no relationship between variables’.
If the test result isn’t significant, it doesn’t mean there’s absolutely no difference between the groups. It just means the difference is so small that, with the data we collected, we couldn’t be sure it wasn’t just random noise.
A non-significant result only tells us that the effect is not large enough to be detected with the given sample size. If we had a bigger sample (more data), we might be able to detect even small differences. A small sample might miss these subtle effects.
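A hedged illustration of that point: the simulation below assumes a real but small difference (3 points, SD 15) and a small sample of 20 per group, so most simulated studies fail to reach p < .05 even though the difference exists.

```python
# Illustration (assumed numbers): a real but small effect is often missed with a
# small sample, so "not significant" is not the same as "no difference".
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_diff, sd, n, alpha = 3.0, 15.0, 20, 0.05   # assumed values for the simulation

n_sims = 2000
significant = 0
for _ in range(n_sims):
    a = rng.normal(100 + true_diff, sd, n)
    b = rng.normal(100, sd, n)
    if stats.ttest_ind(a, b).pvalue < alpha:
        significant += 1

# With this small n, only a minority of the simulated studies detect the (real) effect.
print(f"Proportion of simulated studies reaching p < .05: {significant / n_sims:.2f}")
```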
problems with NHST
-not possible to demonstrate the null hypothesis
Not possible to demonstrate the null hypothesis
a non-significant result could be due to the null-hypothesis being true OR a
failure to gather sufficient evidence
→ Researchers must set up their research so that the ‘desired’ outcome is to reject the null hypothesis
problems with NHST
-statistical significance is not practical significance
Statistical significance is not practical significance
with a sufficiently large sample, very small effects can become statistically
significant, although they may be unimportant for any practical purpose.
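A sketch of the same issue from the other direction, with assumed numbers: a trivially small true difference (0.1 points) becomes statistically significant once the sample is very large.

```python
# Illustration (assumed numbers): with a huge sample, a tiny difference
# can reach p < .05 even though it is practically unimportant.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 1_000_000                        # very large hypothetical sample per group
a = rng.normal(100.1, 15, n)         # true difference of only 0.1 points
b = rng.normal(100.0, 15, n)

res = stats.ttest_ind(a, b)
print(f"p = {res.pvalue:.2g}")                         # typically far below .05
print(f"mean difference = {a.mean() - b.mean():.2f}")  # still only ~0.1 points
```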
practical significance : a fictitious example
IQ is measured in >1000 participants
Statistical tests indicate that one gender has a higher IQ than the other
(p<0.05).
The actual difference in group means is 0.8 IQ points
Although the difference is statistically significant, it is practically irrelevant:
it is not informative of the IQ of any individual person, because the variance
within groups is much larger than the difference between groups
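A quick worked check of this example, assuming the usual IQ standard deviation of about 15 points (the card itself does not state the SD):

```python
# Worked check of the fictitious IQ example: a 0.8-point difference against a
# within-group SD of ~15 IQ points gives a tiny standardised effect.
mean_difference = 0.8     # IQ points (from the example)
within_group_sd = 15.0    # typical IQ SD; an assumption, not stated on the card

cohens_d = mean_difference / within_group_sd
print(f"Cohen's d ≈ {cohens_d:.3f}")   # ≈ 0.05, far below the 0.2 'small' threshold
```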
problems with NHST
All-or-nothing thinking
If p < .05 then an effect is significant, but if p > .05, it is not.
One would reach completely opposite conclusions depending on whether p
= .0499 or p = .0501.
However, these p-values only differ by 0.0002.
They would reflect basically the same-sized effect.
→ Alpha level is arbitrary (result: many published papers with p-values just below 0.05)
what does significant mean
In statistics, ‘significance’ implies that something is unlikely to have
occurred by chance (and may therefore have a systematic cause)
What is considered to be ‘unlikely’ depends on an arbitrarily defined
significance threshold
Psychology: α=0.05 (= a 1 in 20 chance)
Physics: 5σ criterion (α=0.000000286), a 1 in 3.5 million chance
A critical perspective: significance at a 5% threshold indicates limited
evidence that the data is not entirely random
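A quick numerical check of the physics threshold (my addition, using the one-sided normal tail probability):

```python
# The one-sided normal tail beyond 5 standard deviations is ~1 in 3.5 million.
from scipy import stats

alpha_physics = stats.norm.sf(5)          # one-sided tail probability beyond z = 5
print(f"alpha = {alpha_physics:.9f}")     # ~0.000000287
print(f"about 1 in {1 / alpha_physics:,.0f}")
```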
what are alternatives to NHST
-no clear replacement is currently available
-proposed: effect size
effect size
Provides an estimate of the size of group differences or the effect of
treatment
Ideally independent of the size of the sample
Effect size is a measure of the magnitude or strength of a difference or relationship in a study, beyond just whether it is statistically significant. While statistical significance tells us if an effect exists, effect size tells us how big or meaningful that effect is.
what are the uses of effect size
- Measure of how large an effect is (a p-, t-, or F-value will not tell you this)
-used in estimating the sample size needed for sufficient statistical power
-used when combining data across studies (meta-analysis)
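A hedged sketch of the power-analysis use: statsmodels' TTestIndPower can solve for the per-group sample size needed to detect an assumed effect size (the d = 0.5, power = 0.80 values below are illustrative, not from the lecture).

```python
# Sketch: how an assumed effect size feeds a power analysis.
# Values (d = 0.5, alpha = 0.05, power = 0.80) are illustrative assumptions.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.80)
print(f"~{n_per_group:.0f} participants per group needed")  # roughly 64
```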
types of effect size
Group difference indices (e.g., Cohen’s d)
Strength of association (‘variance explained’, e.g., eta squared, R squared)
Risk estimates (e.g., relative risk)
effect size
-group differences
Examples:
Males versus females
Treatment versus control group
Young versus older participants
difference between population mean and sample means
the population mean is normally unknown, so the sample mean can be used as a good approximation
how to use sample means to get an effect size
difference in sample means: m1 − m2
e.g. effect size = 180 − 165 = 15 (in the units of the measurement scale)
what is a disadvantage of using the difference in means as an effect size
Disadvantage: the measure is dependent on the measurement scale
standardised mean difference
δ = (μa − μb) / σ
-we don't know the population means, but we can use the sample means
-what about σ? Various methods are used to estimate σ, leading to different effect size measures
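A small sketch tying the raw example above (180 − 165 = 15) to the standardised version; the σ estimate of 10 is an assumption for illustration only.

```python
# Raw vs standardised mean difference (values are illustrative assumptions).
m1, m2 = 180.0, 165.0        # sample means from the example (e.g. height in cm)
sigma_estimate = 10.0        # some estimate of the population SD (assumed)

raw_effect = m1 - m2                                  # 15, in the units of the scale
standardised_effect = raw_effect / sigma_estimate     # 1.5, scale-free
print(raw_effect, standardised_effect)
```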
group difference indices
-Cohen's d
-Glass' delta
-Hedges' g
Measures differ in how the population variance is estimated from the data
Cohen's d
-most commonly reported
d = (m1 − m2) / SDpooled
SDpooled = √((SD1² + SD2²) / 2), i.e. the root of the average of the two group variances
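A minimal sketch of Cohen's d using the simple pooled SD above (average of the two group variances, which assumes roughly equal group sizes); the example data are made up.

```python
# Cohen's d with the simple pooled SD (average of the two group variances).
import numpy as np

def cohens_d(x, y):
    m1, m2 = np.mean(x), np.mean(y)
    sd1, sd2 = np.std(x, ddof=1), np.std(y, ddof=1)
    sd_pooled = np.sqrt((sd1**2 + sd2**2) / 2)
    return (m1 - m2) / sd_pooled

# Made-up example data:
group_a = [23, 25, 28, 30, 27, 26]
group_b = [20, 22, 24, 25, 21, 23]
print(f"d = {cohens_d(group_a, group_b):.2f}")
```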
hedge’s g
-very similar to cohens d
-Measures differ on how the population variance is estimated from the data
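A sketch of Hedges' g under that definition (pooled SD weighted by n − 1, plus the common small-sample bias correction); the correction formula is the standard approximation, not something stated on the card.

```python
# Hedges' g: df-weighted pooled SD plus the usual small-sample correction.
import numpy as np

def hedges_g(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    n1, n2 = len(x), len(y)
    s_pooled = np.sqrt(((n1 - 1) * x.var(ddof=1) + (n2 - 1) * y.var(ddof=1))
                       / (n1 + n2 - 2))
    g = (x.mean() - y.mean()) / s_pooled
    correction = 1 - 3 / (4 * (n1 + n2) - 9)   # common bias-correction approximation
    return g * correction

print(f"g = {hedges_g([23, 25, 28, 30, 27, 26], [20, 22, 24, 25, 21, 23]):.2f}")
```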
Glass’ delta
Glass’ delta uses the standard deviation from the control group rather than the pooled standard deviation from both groups.
Glass’ delta is often used when several treatments are compared to
the control group.
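A short sketch of Glass' delta with hypothetical treatment and control scores:

```python
# Glass' delta: standardise by the control group's SD only.
import numpy as np

def glass_delta(treatment, control):
    return (np.mean(treatment) - np.mean(control)) / np.std(control, ddof=1)

# Hypothetical scores:
treatment = [34, 36, 40, 38, 35]
control = [30, 31, 33, 29, 32]
print(f"delta = {glass_delta(treatment, control):.2f}")
```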
paired samples t test
A paired samples t-test (also called a dependent samples t-test) is a statistical test used to compare the means of two related groups to see if there is a significant difference between them. The groups are “paired” because the same individuals or entities are measured twice under different conditions or at different times.
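A minimal sketch of a paired-samples t-test with hypothetical before/after scores for the same participants:

```python
# Paired-samples t-test: the same (hypothetical) participants measured twice.
from scipy import stats

before = [72, 68, 75, 80, 66, 71, 77, 73]
after  = [70, 65, 74, 76, 66, 69, 75, 70]

res = stats.ttest_rel(before, after)
print(f"t = {res.statistic:.2f}, p = {res.pvalue:.3f}")
```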
classification of effect size - Cohen's d
Classification of effect size:
d between 0.2 and 0.49 = small
d between 0.5 and 0.79 = medium
d of 0.8 and higher = large
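A small helper reflecting this classification; the label for values below 0.2 is my addition, not part of the card.

```python
# Classify a Cohen's d value using the thresholds listed above.
def classify_d(d):
    d = abs(d)
    if d >= 0.8:
        return "large"
    if d >= 0.5:
        return "medium"
    if d >= 0.2:
        return "small"
    return "below the 'small' threshold"

for d in (0.1, 0.3, 0.6, 1.1):
    print(d, classify_d(d))
```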