NHST Flashcards
sampling variation
every sample drawn at random from a population will be composed of different individuals and therefore have different means
- difference between means is termed sampling variation or sampling error
two samples drawn from a single population may occasionally have quite different means
- therefore possible to obtain a statistically significant t result, even though the two samples are from the same population
- false positive result, type 1 error
two samples drawn from quite different populations may occasionally have quite similar means
- possible to obtain a non-sig result, even though there’s a real difference between the two pops
- false negative result, or type 2 error
type 1 error
false positive results
type 2 error
false negative results
error is unavoidable
NHST generates a p value: the probability that we would have obtained data at least as extreme as ours if H0 is true
if p = 0.02, there's only a 2% probability of obtaining data this extreme if H0 is true, so we reject H0
if p = 0.1, there's a 10% probability of obtaining data this extreme if H0 is true, so we fail to reject H0
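the conversion from a test statistic to a p value can be sketched with Python's standard library; this uses the normal distribution as a large-sample stand-in for the t distribution, and the t value 2.33 is just an illustrative number:

```python
from statistics import NormalDist

# two-sided p value: the area in both tails beyond |t|,
# approximating the t distribution by the standard normal (large n)
t_stat = 2.33
p = 2 * (1 - NormalDist().cdf(abs(t_stat)))
print(round(p, 3))  # ≈ 0.02, so we would reject H0 at alpha = 0.05
```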
alpha and beta
alpha (false positive): acceptable probability of a type one error
- usually alpha= 0.05
- accept we will make a type 1 error up to 5% of the time
beta (false negative): acceptable probability of a type 2 error
- usually b= 0.20
- make a type 2 error up to 20% of the time
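the claim that alpha = 0.05 means a false positive up to ~5% of the time can be checked by simulation; this sketch (standard library only; the seed, group size, and the 1.96 cutoff standing in for t_crit at large df are illustrative choices) draws both samples from the same population and counts how often the t test is nonetheless significant:

```python
import random
from statistics import mean, stdev
from math import sqrt

random.seed(1)  # illustrative seed, for reproducibility
n, reps, false_positives = 50, 2000, 0

for _ in range(reps):
    # both samples come from the SAME population, so H0 is true
    a = [random.gauss(0, 1) for _ in range(n)]
    b = [random.gauss(0, 1) for _ in range(n)]
    sp = sqrt((stdev(a) ** 2 + stdev(b) ** 2) / 2)  # pooled SD (equal n)
    t = (mean(a) - mean(b)) / (sp * sqrt(2 / n))
    if abs(t) > 1.96:  # significant despite H0 being true: type 1 error
        false_positives += 1

print(false_positives / reps)  # close to alpha = 0.05
```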
significant result
likely to lead to follow-up studies and the investment of considerable time and money, which is wasted if the result was a type 1 error
false negative result
- interesting results get missed
- less serious
visualizing alpha
NHST: directly tests the null hypothesis, not the alternate
H0: a single testable numerical prediction
H1: does not make a single testable numerical prediction
- infinite values for which ≠ 0 is true
t statistics in the tails are unlikely to be obtained if H0 is true
- if H0 is true and the observed data falls in the tail = type 1 error
visualizing beta
generate a distribution of t statistics that would be obtained if H1 is true
have to pick a specific size of the effect that we are expecting
effect size
Cohen’s d: continuous data consisting of two groups (T TESTS)
eta squared: continuous data consisting of >2 groups (ANOVA)
partial eta squared: continuous data with >1 predictor variable (factorial ANOVA or multiple regression)
Pearson’s r: relationship between two continuous variables (correlation or regression)
R^2: continuous data with a continuous or categorical predictor (correlation, regression or ANOVA)
odds ratio (OR): categorical data (uses X^2 or logistic regression)
general properties of effect size
quantifies the size of the effect of the predictor variable on the outcome variable (effect of x on y)
effect size is generally not affected by sample size
- large sample sizes do increase the probability of a statistically significant result
- large sample sizes do not systematically affect the effect size
the larger the effect size associated with the predictor variable, the easier it will be to obtain a statistically significant result
- probability of a false negative error will be reduced
Cohen’s d
simpler measure of effect size, used for continuous data consisting of two groups
- appropriate for data analyzed using paired and independent t-tests
expresses the difference between group means as the number of standard deviations between the means
Cohen’s d: repeated measures
expresses the difference between condition means (D-bar) as the number of standard deviations (Sd) between the means
d = D-bar / Sd
Sd: standard deviation of the difference scores, i.e. the average deviation of each Di from D-bar
- average residual or error from GLM
- if d=2, the difference between the conditions is twice the average error or residual
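with made-up before/after scores (hypothetical data), repeated-measures d is a one-liner in Python's standard library:

```python
from statistics import mean, stdev

# hypothetical before/after scores for 6 participants
pre  = [10, 12, 9, 11, 13, 10]
post = [12, 14, 10, 14, 15, 12]
diffs = [b - a for a, b in zip(pre, post)]  # difference scores D_i

d = mean(diffs) / stdev(diffs)  # d = D-bar / Sd
print(round(d, 2))  # 3.16: a very large effect
```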
Cohen’s d: independent groups
difference between group means (y-bar1 - y-bar0) as the number of standard deviations (Sp) between the means
d= (y-bar1 - y-bar0)/ Sp
Sp: pooled standard deviation and is the average difference between each score (y1 or y0) and the group mean
- average residual or error from GLM
- if d=2, the difference between the group means is twice as large as the average error/residual
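the independent-groups version, with hypothetical data, for the equal-n case (where the pooled variance is simply the average of the two group variances):

```python
from statistics import mean, stdev
from math import sqrt

# hypothetical scores for two independent groups of equal size
g0 = [4, 5, 6, 5, 4, 6]
g1 = [6, 7, 8, 7, 6, 8]

# pooled SD for equal group sizes: average of the two variances
sp = sqrt((stdev(g0) ** 2 + stdev(g1) ** 2) / 2)
d = (mean(g1) - mean(g0)) / sp  # d = (y-bar1 - y-bar0) / Sp
print(round(d, 2))  # 2.24
```

with unequal group sizes, the pooled variance weights each group's variance by its degrees of freedom instead.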
impacts on Cohen’s d
the greater the difference between the means, the greater Cohen’s d
the smaller Sp gets, the larger Cohen’s d gets
interpreting Cohen’s d
small effect: 0.2 ≤ d < 0.5
medium effect: 0.5 ≤ d < 0.8
large effect: d ≥ 0.8
Cohen’s d vs t
d: divides the mean difference by the average deviation of each score from its mean (unaffected by n)
t: divides the mean difference by the standard error (the standard deviation of the sampling distribution of the mean), which shrinks as n grows (affected by n)
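the d-vs-t contrast can be made concrete: for equal-n independent groups, t = d × sqrt(n/2), so holding d fixed and quadrupling n doubles t (the values of d and n below are illustrative):

```python
from math import sqrt

d = 0.5  # effect size held constant while n grows
for n in (10, 40, 160):
    # for equal-n independent groups: t = d * sqrt(n / 2)
    t = d * sqrt(n / 2)
    print(n, round(t, 2))  # 10 -> 1.12, 40 -> 2.24, 160 -> 4.47
```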
visualizing beta
calculate the t distribution based on H0 (shows alpha)
- gives range of t stats that we would expect if H0 was true
randomly sample n scores per group from two normally-distributed populations
then calculate the t statistic
repeat a million times to generate the distribution of t statistics that would be expected if H1 was true with d = +0.8
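that simulation can be sketched with the standard library (a few thousand repeats rather than a million; the seed, n = 26 per group, and the 1.96 cutoff standing in for t_crit are illustrative assumptions):

```python
import random
from statistics import mean, stdev
from math import sqrt

random.seed(2)  # illustrative seed, for reproducibility
n, reps, hits = 26, 2000, 0

for _ in range(reps):
    a = [random.gauss(0.0, 1) for _ in range(n)]
    b = [random.gauss(0.8, 1) for _ in range(n)]  # true effect: d = +0.8
    sp = sqrt((stdev(a) ** 2 + stdev(b) ** 2) / 2)
    t = (mean(b) - mean(a)) / (sp * sqrt(2 / n))
    if t > 1.96:  # beyond ~t_crit: significant, a true positive
        hits += 1

print(hits / reps)  # estimated power, roughly 0.8 for this n and d
```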
power
beta: probability of obtaining a false negative result when H1 is true
- statistical analyses focus on power rather than beta
- power = 1 - beta: probability of a true positive result
aim for beta < 0.2, i.e. power > 80%
3 ways to increase power to 0.8
- relax alpha (e.g. 0.05 → 0.10): makes it easier to obtain a sig result; reduces the probability of a false negative, but increases the probability of a false positive
- increase sample size: reduces the standard error, and therefore the standard deviation of our probability distributions
- change H1 by increasing the expected effect size
changing n impact
increasing n reduces standard error
- results in larger values of t (if the mean of the distribution isn’t 0)
- mean of beta distribution will be shifted away from alpha distribution
changing n changes df and alters t_crit
mainly, though, it shifts the mean of the beta distribution away from the alpha distribution
- both effects contribute to increased power
2 explanations for a non-significant result
- no effect of the manipulation
- there is an effect of the manipulation, but it was not detected due to a weak effect size, low power, or bad luck
power calculations in R
library(pwr)
pwr.t.test(n=12, d=0.8, sig.level=0.05, power=NULL, type="two.sample")
n: number of observations per group
d: effect size
sig.level: significance level (type 1 error probability)
type: type of t test ("two.sample", "one.sample", or "paired")
power: power of the test (set the argument you want computed to NULL and pwr.t.test solves for it)
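as a rough cross-check of pwr.t.test without R, two-sample t-test power can be approximated with Python's standard library; the function name and the normal approximation are my own sketch, not part of the pwr package, and ignoring the df correction makes it slightly optimistic for small n:

```python
from math import sqrt
from statistics import NormalDist

def approx_power(n, d, alpha=0.05):
    """Normal approximation to two-sample t-test power (equal n per group).

    Assumes large df, so t_crit is approximated by z_crit; runs a
    little high for small n compared with the exact t-based answer.
    """
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    ncp = d * sqrt(n / 2)  # expected value of t under H1
    return 1 - NormalDist().cdf(z_crit - ncp)

print(round(approx_power(12, 0.8), 2))  # ≈ 0.5; pwr.t.test gives a bit less
```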
thresholds of statistical significance (alpha)
without a threshold, there is no type 1 or 2 error
p hacking due to thresholds
any form of data manipulation in order to get results where p<0.05
- conduct multiple statistical analyses and only admit to performing the analyses that produced sig results
- deciding to remove an outlier to generate sig result
- remove an entire group to generate sig result
- select a different statistical test to generate sig result
publication bias due to thresholds
studies with entirely negative results are less likely to be accepted for publication
how did we adopt a threshold of significance?
using alpha to convert a continuous probability value into a binary decision results in type 1 and 2 error, leading to p-hacking and publication bias
threshold of sig does make it easier to communicate scientific findings, especially to a lay audience
Karl Pearson
founder of mathematical statistics
Pearson’s r
Pearson’s X^2 test
p value
William Sealy Gosset
developed t distributions (Student’s t distribution)
statistical work was developed to improve methodologies for brewing Guinness
company policy prevented him from publishing under his own name, so he adopted the pseudonym "Student"
Ronald Fisher
- developed ANOVA (the F test is named after Fisher)
- formalized concept of null hypothesis, and stat test of H0
- formalized use of p values to evaluate H0
- argued the end point of NHST was the p value itself
- argued against using a threshold for statistical significance
Jerzy Neyman and Egon Pearson
developed concept of alternative hypothesis
calculated the probability of the observed data under H0 and under H1
comparing the two probabilities, selected the hypothesis under which the data were more likely
argued their approach was better because it evaluated two competing hypotheses rather than H0 alone
fisher argued that applying binary decision would lead to confusion
best practices in NHST
know how to interpret results
- sig results may be false positives, especially if p is close to 0.05
- non-sig results may be false negatives, especially if the sample size or effect size is small
always report effect sizes to contextualize both sig and non-sig results
plan analysis in advance to avoid p-hacking
replicate findings esp if findings are critically important to future research but are only marginally sig
apply meta analyses to research questions
- reanalyze data from multiple related publications in an attempt to resolve apparent contradictions
consider alternatives to NHST
- Bayesian stats