8 The Replication Crisis and the Open Science Movement (Flashcards)
Where did the idea of a replication crisis come from?
large-scale replication projects found that:
The mean effect size (r) of the replication effects (M = 0.197, SD = 0.257) was half the magnitude of the mean effect size of the original effects (M = 0.403, SD = 0.188), representing a substantial decline.
Ninety-seven percent of original studies had significant results (p < .05), but only thirty-six percent of replications had significant results.
Why do we have a replication crisis?
problematic practices: selective reporting, selective analysis, insufficient specification of the conditions necessary or sufficient to obtain the results
publication bias, …
⇒ understanding is achieved through multiple, diverse investigations
replication by itself only provides evidence for the reliability of a result
alternative explanations, … can account for diminished reproducibility
⇒ cultural practices in scientific communication
low-power research designs
publication bias
⇒ Reproducibility is not well understood because the incentives for individual scientists prioritize novelty over replication
What is predictive of replication success?
the strength of the initial evidence, rather than characteristics of the teams conducting the original or replication research
What is “evaluating replication effect against null hypothesis of no effect”?
does the replication show a statistically significant effect in the same direction as the original study? (sketch below)
treating the 0.05 threshold as a bright-line criterion between replication success and failure is a key weakness of this method
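A minimal sketch in Python of this significance-plus-direction criterion. The function name and the correlation and sample-size values are illustrative choices, not values from the cards.

```python
# "Replication vs. null hypothesis" criterion: the replication counts as a
# success if it is significant at alpha and in the same direction as the
# original effect. Effect sizes here are Pearson correlations.
import numpy as np
from scipy import stats

def replicates_against_null(r_original, r_replication, n_replication, alpha=0.05):
    # t-test for a correlation: t = r * sqrt((n - 2) / (1 - r^2))
    t = r_replication * np.sqrt((n_replication - 2) / (1 - r_replication**2))
    p = 2 * stats.t.sf(abs(t), df=n_replication - 2)
    same_direction = np.sign(r_replication) == np.sign(r_original)
    return bool(p < alpha and same_direction)

# Illustrative numbers only: a weaker replication of an r = .40 original.
print(replicates_against_null(r_original=0.40, r_replication=0.20, n_replication=80))
```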
What is done if you evaluate the replication effect against the original effect size?
is the original effect size within the 95% CI of the effect-size estimate from the replication?
-> considers the precision and size of the effect, not only its direction (sketch below)
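A minimal sketch of this criterion for correlation effect sizes, using the Fisher z transformation to build the replication's 95% CI; all numeric inputs are illustrative.

```python
# Is the original effect size inside the 95% CI of the replication estimate?
import numpy as np
from scipy import stats

def original_within_replication_ci(r_original, r_replication, n_replication, level=0.95):
    z = np.arctanh(r_replication)            # Fisher z of the replication estimate
    se = 1 / np.sqrt(n_replication - 3)      # standard error on the z scale
    z_crit = stats.norm.ppf(1 - (1 - level) / 2)
    lo, hi = np.tanh(z - z_crit * se), np.tanh(z + z_crit * se)
    return lo <= r_original <= hi, (lo, hi)

print(original_within_replication_ci(r_original=0.40, r_replication=0.20, n_replication=80))
```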
What is done if you compare original and replication effect sizes for cumulative evidence?
a descriptive comparison of effect sizes; it does not provide information about the precision of either estimate or the resolution of the cumulative evidence for the effect
→ computing a meta-analytic estimate that combines the original and replication effects (sketch below)
One qualification about this result is the possibility that the original studies have inflated effect sizes due to publication, selection, reporting, or other biases
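A minimal sketch of a fixed-effect meta-analytic combination of the original and replication correlations (inverse-variance weighting on the Fisher z scale). The study values below are illustrative, not real data.

```python
# Pool original and replication correlations into one meta-analytic estimate.
import numpy as np

def pooled_correlation(rs, ns):
    zs = np.arctanh(np.asarray(rs, dtype=float))   # Fisher z per study
    weights = np.asarray(ns, dtype=float) - 3      # inverse variance of z is 1/(n - 3)
    z_pooled = np.sum(weights * zs) / np.sum(weights)
    se_pooled = 1 / np.sqrt(np.sum(weights))
    ci = np.tanh([z_pooled - 1.96 * se_pooled, z_pooled + 1.96 * se_pooled])
    return float(np.tanh(z_pooled)), tuple(ci)

# Illustrative: original r = .40 (n = 50) pooled with replication r = .20 (n = 80).
print(pooled_correlation(rs=[0.40, 0.20], ns=[50, 80]))
```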
Is replication the real problem?
meta-analyses show that most findings do replicate
The real problem is not a lack of replication; it is the distortion of our research literatures caused by publication bias and questionable research practices.
What do the researchers argue for? What is the real problem in psychological research?
(a) studies in most areas are replicated;
(b) failure to replicate a study is usually not evidence against the initial study’s conclusions;
(c) an initial study with a nonsignificant finding requires replication;
(d) a single study can never answer a scientific question;
(e) the widely used sequential study research program model does not work;
(f) randomization does not work when sample sizes are small.
What different types of replication exist?
(a) literal replication—the same researcher conducts a new study in exactly the same way as in the original study;
(b) operational replication—a different researcher attempts to duplicate the original study using exactly the same procedures (also called direct replication); and
(c) systematic replication—a different researcher conducts a study in which many features of the original study are maintained but some aspects (e.g., type of subjects or measures used) are changed (also called conceptual replication)
What are common errors in thinking about replication?
- interpreting a replication in a stand-alone manner
ignores statistical power
average statistical power in psychological literatures ranges from .40 to .50
(the likelihood that a test will detect an effect of a certain size if there is one)
Note that if confidence intervals (CIs) were used instead of significance tests, there would be far fewer "failures to replicate": the CIs would often overlap, indicating no conflict between the two studies (see the simulation sketch below). Research in meta-analysis has shown that no single study can answer any question.
sampling error = the difference between a sample's estimate of a population parameter and the actual value of that parameter; other study artifacts include measurement error, range variation, imperfect construct validity of measures, artificial dichotomization of continuous measures, and others
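A minimal simulation sketch of the two points above, under assumed values taken from these cards (a real effect of d = .40 and, as one reading of the typical sample size, n = 40 per group): how often a significant original study is followed by a nonsignificant replication through sampling error alone, and how often the two studies' 95% CIs nonetheless overlap.

```python
# Simulate pairs of studies of the same true effect and count "failures to
# replicate" by the p < .05 rule versus conflicts by the CI-overlap rule.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
d, n, reps = 0.40, 40, 10_000          # true effect, per-group n, simulated pairs

def one_study():
    a = rng.normal(d, 1, n)            # treatment group (true effect d, SD = 1)
    b = rng.normal(0, 1, n)            # control group
    _, p = stats.ttest_ind(a, b)
    diff = a.mean() - b.mean()
    se = np.sqrt(a.var(ddof=1) / n + b.var(ddof=1) / n)
    return p, (diff - 1.96 * se, diff + 1.96 * se)   # normal-approximation 95% CI

flips, overlaps = 0, 0
for _ in range(reps):
    p1, ci1 = one_study()
    p2, ci2 = one_study()
    if p1 < .05 and p2 >= .05:                       # "failure to replicate" by significance
        flips += 1
        if ci1[0] <= ci2[1] and ci2[0] <= ci1[1]:    # ...but the CIs overlap
            overlaps += 1

print(f"significant-then-nonsignificant pairs: {flips / reps:.0%}; "
      f"of those, CIs overlap: {overlaps / max(flips, 1):.0%}")
```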
What about replicability of non-significant findings?
= usually interpreted as showing the absence of a relationship
→ this interpretation is unjustified
→ so do non-significant findings not need replication?
⇒ they should be followed up with additional studies
In fact, given typical levels of statistical power, a relation that shows consistent nonsignificant findings may be real.
Richard et al. (2003) - the average effect size in social psychology is d = .40
(based on >300 meta-analyses)
median sample size in psychology is only 40
-> with such power, roughly half of studies should report significant and half non-significant findings (see the power sketch below)
-> that is not the pattern we see in the published literature
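A minimal sketch checking the power claim above, assuming a two-sample t-test with d = .40 and, as one reading of "sample size of 40", 40 participants per group at a two-sided alpha of .05.

```python
# Statistical power of a typical two-group study under the values cited above.
from statsmodels.stats.power import TTestIndPower

power = TTestIndPower().power(effect_size=0.40, nobs1=40, alpha=0.05)
print(f"power ≈ {power:.2f}")   # roughly .4: many studies of a real effect stay nonsignificant
```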
What are biases in the published literature?
- research fraud
- publication bias, source bias
- biasing effects of questionable research practices -> QRPs (most severe in laboratory experimental studies)
the highest admission rate is in social psychology (about 40%)
(a) adding subjects one by one until the result is significant, then stopping;
(b) dropping studies or measures that are not significant;
(c) conducting multiple significance tests on a relation and reporting only those that show significance (cherry picking);
(d) deciding whether to include data after looking to see the effect on statistical significance;
(e) hypothesizing after the results are known (HARKing); and
(f) rerunning a lab experiment until you get the "right" results.
- limitations of random assignment
(the claimed superiority of experimental studies)
randomization does not work unless samples are large, which is extremely rare in practice
small randomized sample sizes produce neither equivalent groups nor groups representative of the population of interest
What approach should be taken to detect QRPs?
The frequency of statistical significance in some literatures is suspiciously high given the level of statistical power in the component studies
statistical power has not increased since Cohen first pointed it out in 1962
low power → nonsignificant findings → difficult to publish
researchers avoid this consequence by using QRPs
→ the result is an upward bias in mean effect sizes and a downward bias in the variability across effect sizes, because low-effect-size studies remain unavailable (see the sketch below)
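One way to operationalize the "suspiciously high" check is an excess-significance comparison: contrast the observed count of significant results in a literature with the count expected given the studies' power. This is a minimal sketch under made-up counts and an assumed mean power of .45; it is not a procedure spelled out in these cards.

```python
# Excess-significance check: is the number of significant results higher than
# the studies' statistical power can plausibly produce?
from scipy import stats

n_studies, n_significant, mean_power = 50, 45, 0.45   # illustrative values only

result = stats.binomtest(n_significant, n_studies, mean_power, alternative="greater")
print(f"P(this many significant results | mean power = {mean_power}) = {result.pvalue:.2g}")
```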
What is false-positive psychology research practice?
despite empirical psychologists’ nominal endorsement of a low rate of false-positive findings (≤ .05),
flexibility in data collection, analysis, and reporting dramatically increases actual false-positive rates
false positive = incorrect rejection of a null hypothesis (concluding there is an effect when there is none)
What are researcher degrees of freedom?
- it is common for researchers to explore various analytic alternatives and report only “what worked”
ambiguity about how best to make analytic decisions, combined with the desire to find statistically significant results
→ self-serving justifications (highly subjective and variable across replications)
- flexibility in choosing among dependent variables
- choosing sample size
- using covariates
- reporting subsets of experimental conditions
What can be said about the influence of this flexibility on false-positive rates?
⇒ flexibility in analyzing two dependent variables (correlated at r = .50) nearly doubles the probability of obtaining a false-positive finding
⇒ adding 10 more observations when the findings are not yet significant roughly doubles the probability as well (simulated in the sketch after this list)
⇒ controlling for gender or for the interaction of gender with treatment produces a false-positive rate of 11.7%
⇒ a combination of all these practices would lead to a false-positive rate of 61%
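A minimal Monte Carlo sketch of the first two practices under a true null effect: testing either of two dependent variables correlated at r = .50, and peeking once, then adding 10 more observations per group if the result is not yet significant. The per-group sample size of 20 and the number of simulations are illustrative choices, not the values used in the original paper.

```python
# Estimate the false-positive rate produced by DV flexibility plus optional
# stopping when the null hypothesis is actually true.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n0, n_extra, reps = 20, 10, 20_000
cov = [[1.0, 0.5], [0.5, 1.0]]        # two DVs correlated at r = .50

def significant(a, b):
    return stats.ttest_ind(a, b).pvalue < 0.05

false_pos = 0
for _ in range(reps):
    g1 = rng.multivariate_normal([0, 0], cov, n0 + n_extra)   # "treatment" (no true effect)
    g2 = rng.multivariate_normal([0, 0], cov, n0 + n_extra)   # "control"
    hit = False
    for n in (n0, n0 + n_extra):      # peek at n0, then again after adding 10 per group
        for dv in (0, 1):             # flexibility in choosing between two DVs
            hit = hit or significant(g1[:n, dv], g2[:n, dv])
    false_pos += hit

print(f"false-positive rate ≈ {false_pos / reps:.1%} (nominal rate: 5%)")
```

Because a researcher would report a result as soon as any of these analyses looks significant, the relevant rate is the chance that at least one of the four tests crosses p < .05, which is what the simulation counts.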