Reproducibility Crisis Flashcards
Brian Wansink
- experiments on eating behaviours
- misused statistical procedures to make his research look successful
- p-hacking and HARKing
What is the ‘Crisis of Reproducibility’?
published research can’t be replicated/reproduced
- misuse of statistics
- not just about statistics
Why do researchers use statistics?
- to find relationships between variables if they think they are linked
- to ignore noise and try to find relationships that hold true ‘in general’
What hypothesis do researchers test?
the null hypothesis: statistical tests estimate how compatible the data are with the null hypothesis
- they rarely test directly whether a relationship exists; they test whether the data are consistent with no relationship
What is the null hypothesis?
the hypothesis that no relationship (no effect) exists between the variables
What is the p-value?
- probability of getting results at least as extreme as those observed, assuming the null hypothesis is true (it is NOT the probability that the null hypothesis is true)
- lower p-value –> data are less compatible with the null hypothesis; more reasonable to reject the null hypothesis
- higher p-value –> data are consistent with the null hypothesis (no relationship); fail to reject the null hypothesis, which does not prove it true
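A minimal Python sketch of how a p-value is computed, assuming a coin-flip experiment that is not part of these notes: the null hypothesis is "the coin is fair", and the p-value adds up the probability of every outcome at least as extreme as the one observed. The function name p_value_fair_coin is made up for illustration.

```python
from math import comb


def p_value_fair_coin(heads: int, flips: int) -> float:
    """Two-sided exact p-value for the null hypothesis 'the coin is fair'.

    Adds up the probability of every outcome that is no more likely
    than the observed one, assuming the null hypothesis is true.
    """
    observed_ways = comb(flips, heads)
    # Under a fair coin every sequence of flips is equally likely, so an
    # outcome's probability is proportional to the number of sequences
    # that produce it.
    extreme_ways = sum(
        comb(flips, k) for k in range(flips + 1) if comb(flips, k) <= observed_ways
    )
    return extreme_ways / 2**flips


if __name__ == "__main__":
    print(p_value_fair_coin(60, 100))  # a little above 0.05
    print(p_value_fair_coin(50, 100))  # 1.0: fully consistent with the null
```

Note that the calculation starts by assuming the null hypothesis is true, so the result cannot be read as the probability that the null hypothesis is true.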
What would be the null hypothesis for a study testing whether each year of schooling adds an extra $1000 to annual income?
- there is no relationship between years of schooling and income
What does a p-value of 0.55 mean?
- if there were truly no relationship between years of school and income, a result at least this extreme would be expected 55% of the time
- the data are consistent with the null hypothesis; it does NOT mean a 55% chance that the null hypothesis is true
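What does a p-value of 0.01 mean?
- if the null hypothesis were true, a result at least this extreme would occur only 1% of the time
- the data are hard to reconcile with there being no relationship between the 2 variables
- strong evidence against the null hypothesis, but NOT a 1% chance that the null hypothesis is true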
What is the most common cut-off for statistical significance?
p < 0.05
- when the null hypothesis is true, researchers incorrectly reject it only 5% of the time (1 in 20)
What is the issue with the p-value cut-off being < 0.05?
- even when the null hypothesis is true, there is a 1 in 20 (5%) chance of getting p < 0.05
- so for every 20 tests run where the null hypothesis is true, you expect to incorrectly reject it about 1 time
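The 1 in 20 figure can be checked with a small simulation. This is a sketch under stated assumptions, not a method from these notes: the data are drawn with a true mean of 0, so the null hypothesis is true in every simulated experiment, and a simple z-test with known standard deviation is used. The function names are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist


def z_test_p_value(sample: list[float]) -> float:
    """Two-sided p-value for the null 'true mean is 0', with known SD of 1."""
    z = sum(sample) / sqrt(len(sample))
    return 2 * (1 - NormalDist().cdf(abs(z)))


def false_positive_rate(experiments: int, sample_size: int,
                        alpha: float, seed: int = 0) -> float:
    """Fraction of experiments that reject a null hypothesis that is true."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(experiments):
        # The null hypothesis is true by construction: the real mean is 0.
        sample = [rng.gauss(0, 1) for _ in range(sample_size)]
        if z_test_p_value(sample) < alpha:
            hits += 1
    return hits / experiments


if __name__ == "__main__":
    print(false_positive_rate(4000, 30, alpha=0.05))  # close to 0.05
    print(false_positive_rate(4000, 30, alpha=0.01))  # close to 0.01
```

Every rejection counted here is a false positive, because no real effect exists in the simulated data.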
What is p-hacking?
- Repeating a statistical test, or trying many different analyses, until one gives p < 0.05, then reporting only that one (false positives / false statistically significant results)
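Why repeated testing produces false positives can be shown with a short calculation. A sketch, assuming the tests are independent and that the null hypothesis is true in every one of them; the function name is made up for illustration.

```python
def chance_of_any_false_positive(tests: int, alpha: float = 0.05) -> float:
    """Probability that at least one of `tests` independent tests on true
    null hypotheses comes out below the cut-off `alpha`."""
    # Each test avoids a false positive with probability (1 - alpha);
    # all of them avoid it with probability (1 - alpha) ** tests.
    return 1 - (1 - alpha) ** tests


for k in (1, 5, 20, 100):
    print(k, round(chance_of_any_false_positive(k), 3))
```

With 20 tests the chance of at least one "significant" result is roughly 64%, even though nothing real is being measured.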
What is HARKing?
Hypothesizing After the Results are Known
- shouldn’t collect data and then frame the hypothesis around it as if it had been predicted in advance
What is the risk of a false positive with the accepted p-value cut-off for a statistically significant result?
- 1 in 20 chance of a false positive when the null hypothesis is true (rejecting the null hypothesis when there’s actually no relationship)
Examples of why research can be wrong
- small sample size
- publishing studies with small effects
- relying on a small number of studies
- generating new hypotheses to fit data
- flexibility in research design
- intellectual bias
- conflict of interest
- competition to produce positive results
Is flexibility in research design good?
NO - shouldn’t change research design to fit data
Why does so much research fail to replicate?
- bad method
- researchers not constrained
- culture in research to get new and exciting results
Examples of how researchers manipulate their data and analyses
- exclude outliers from analysis
- p-hacking
- HARKing
- stopping data collection as soon as they achieve their desired results
- looking for an effect in subgroups instead of the whole population
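The effect of stopping data collection as soon as the result looks good can be sketched with a simulation. The assumptions here are not from these notes: data arrive in 10 batches of 10 observations, no real effect exists (true mean 0, known SD 1), and a z-test is used. All names and default values are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist


def significant(total: float, n: int, alpha: float) -> bool:
    """Two-sided z-test of 'true mean is 0' (SD assumed known and equal to 1)."""
    z = total / sqrt(n)
    return 2 * (1 - NormalDist().cdf(abs(z))) < alpha


def false_positive_rate(peek: bool, experiments: int = 2000, batches: int = 10,
                        batch_size: int = 10, alpha: float = 0.05,
                        seed: int = 1) -> float:
    rng = random.Random(seed)
    hits = 0
    for _ in range(experiments):
        total, n = 0.0, 0
        for batch in range(batches):
            # No real effect exists: every observation has a true mean of 0.
            total += sum(rng.gauss(0, 1) for _ in range(batch_size))
            n += batch_size
            last_batch = batch == batches - 1
            # A peeking researcher tests after every batch and stops at the
            # first "significant" result; otherwise test once, at the end.
            if (peek or last_batch) and significant(total, n, alpha):
                hits += 1
                break
    return hits / experiments


if __name__ == "__main__":
    print("test once at the end:", false_positive_rate(peek=False))
    print("peek after each batch:", false_positive_rate(peek=True))
```

Testing once keeps the false positive rate near the 5% cut-off, while peeking after every batch pushes it well above that, because each extra look is another chance to cross the threshold by luck.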
How do researchers get away with this sloppy science?
- they have freedom in research
- they justify altering methods midway as flexibility
Can researchers justify altering research design before it’s complete?
generally not good, but sometimes there are good reasons, e.g.:
- study causes harm
- treatment in a trial is clearly working (can’t deny it to the control group)
Why is it bad that researchers don’t share their data/methods?
can’t be checked and critiqued
- reasons given for not sharing: privacy restrictions / archive failures
What effect does researchers wanting positive results have on fellow research?
publication bias towards positive results:
- studies aren’t replicated
- negative results are turned into positive results
- negative results aren’t published
What can be done to improve statistical use in research?
- better training
- stricter cut-off of p < 0.01 (a true null hypothesis is incorrectly rejected 1 in 100 times)
- confidence intervals instead of significance testing
- make raw data available
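A minimal sketch of what reporting a confidence interval looks like, assuming a normal approximation (for small samples a t-distribution would be more appropriate). The data values are invented purely for illustration and the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_confidence_interval(values: list[float],
                             level: float = 0.95) -> tuple[float, float]:
    """Approximate confidence interval for the mean (normal approximation)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% interval
    margin = z * stdev(values) / sqrt(len(values))
    centre = mean(values)
    return centre - margin, centre + margin


# Hypothetical income differences (in $) between pairs of people who differ
# by one year of schooling; the numbers are invented for illustration.
differences = [1200, 400, -300, 1500, 900, 2100, -100, 800, 1300, 600]
low, high = mean_confidence_interval(differences)
print(f"mean = {mean(differences):.0f}, 95% CI = ({low:.0f}, {high:.0f})")
```

Unlike a bare "significant / not significant" verdict, the interval shows both the estimated size of the effect and how uncertain that estimate is.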