Open Science Flashcards
Many Labs 1 (2014) - Investigating Variation in Replicability:
A ‘Many Labs’ Replication Project
• Took 13 classic and contemporary studies. Many of these were ‘known’ to replicate, so this was really a look at the conditions that allow replicability.
• 36 independent samples attempted the replications
• 10 of the 13 findings replicated consistently – a decent proportion
• A narrow-and-deep approach – look at a few findings and try to replicate each of them many times.
• A useful approach, but it is not clear how generalisable the results are across the field…
Open Science Collaboration (2015) –
Estimating the reproducibility of psychological science
• Replications of 100 experimental and correlational studies (by 270 people)
• Designed a protocol that needed to be followed
o Contacting original authors for materials
o Registering the protocol of the design, participant numbers and analysis
• Studies from 3 journals - Psychological Science, Journal of Personality and Social Psychology, and Journal of Experimental Psychology: Learning, Memory, and Cognition
• Used high-powered designs
• Focused on the size of the replication effects relative to the published original effects
• Also focused on replication in terms of how many effects were statistically significant
Effect size
Effect size is a measurement of the magnitude of an effect.
It is important to know about the size of an effect, e.g., how much an experimental manipulation influences the data.
• It might not matter whether your effect is large or small. Both can be interesting (and a small effect spread over millions of people can have large impacts)
- But if an effect is small, you need to make sure you have designed an experiment well enough to stand a chance of showing it
- Publication bias (only publishing studies that have significant effects) might inflate reported effect sizes…
effect size equation
effect size = (mean of condition 1 − mean of condition 2) / pooled SD
(where the pooled SD here is the SD of all the condition 1 and condition 2 scores combined)
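As a rough illustration of this formula, here is a minimal Python sketch that computes a standardised effect size from two sets of scores. The data and the cohens_d helper name are made up for illustration, and the sketch follows the simplified ‘pooled SD of all scores combined’ definition on this card rather than any particular textbook variant.

```python
import numpy as np

def cohens_d(cond1, cond2):
    """Effect size as on the card: mean difference divided by a pooled SD.
    Here the 'pooled SD' is taken as the SD of all scores from both
    conditions combined, matching the simplified formula above."""
    cond1, cond2 = np.asarray(cond1, float), np.asarray(cond2, float)
    mean_diff = cond1.mean() - cond2.mean()
    pooled_sd = np.concatenate([cond1, cond2]).std(ddof=1)
    return mean_diff / pooled_sd

# Illustrative (made-up) reaction-time scores for two conditions
cond1 = [512, 498, 530, 505, 521, 490]
cond2 = [478, 485, 470, 492, 480, 475]
print(round(cohens_d(cond1, cond2), 2))  # standardised difference between the conditions
```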
Effect size
Why can’t we use p values?
- p values indicate the probability of the observed data given that the null hypothesis is true
- p values are heavily influenced by sample size
- So we need a measure not influenced by sample size
- An example measure is shown in the equation above (there are different ways of calculating this)
- It uses means and variances, so it is not driven by the sample size
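To make the sample-size point concrete, here is a small simulated sketch (made-up data, assuming numpy and scipy are available): the same underlying effect is tested at different sample sizes, and the p value typically shrinks as the sample grows while the effect size estimate stays roughly constant.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def effect_size(a, b):
    # Simple standardised mean difference, as in the formula above
    return (a.mean() - b.mean()) / np.concatenate([a, b]).std(ddof=1)

# Same underlying effect (a true mean difference of 0.2 SD), different sample sizes
for n in (20, 200, 2000):
    a = rng.normal(0.2, 1.0, n)   # condition 1
    b = rng.normal(0.0, 1.0, n)   # condition 2
    t, p = stats.ttest_ind(a, b)
    print(f"n={n:4d}  p={p:.4f}  effect size={effect_size(a, b):.2f}")

# Typical pattern: p falls as n grows, while the effect size hovers around 0.2
```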
power
Once we have
a) A design (e.g., number of participants)
b) Knowledge of the Effect Size (from previous literature)
We can calculate ‘Power’ – which refers to the likelihood of getting a statistically significant result in this study given a) and b).
Typically in the behavioural sciences we might use a power of 80%.
So in the literature surrounding replication, you will hear a lot about Power
a) Replication studies should have the appropriate power to find the effect.
b) Issues with replication often arise because the initial studies were underpowered. Combined with a publication bias towards exciting findings, this means things get published which might not be as reliable as they claim.
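As a hedged sketch of what a power calculation looks like in practice, the snippet below uses statsmodels’ power routines for a simple two-group t-test design. The effect size of d = 0.5, the group size of 30, and alpha = .05 are illustrative values, not figures from the lecture.

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# a) A design: 30 participants per group in a two-sample t-test (alpha = .05)
# b) An expected effect size from previous literature (here an illustrative d = 0.5)
power = analysis.power(effect_size=0.5, nobs1=30, alpha=0.05)
print(f"Power with 30 per group: {power:.2f}")

# Or turn it around: how many participants per group for the conventional 80% power?
n_needed = analysis.solve_power(effect_size=0.5, power=0.80, alpha=0.05)
print(f"Participants per group needed for 80% power: {n_needed:.0f}")
```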
REPLICATION CRISIS: Open Science Collaboration (2015) –
Estimating the reproducibility of psychological science
- Replication effect sizes were half the magnitude of the original effect sizes, representing a substantial decline
- 97% of original studies had statistically significant results. 36% of replications had statistically significant results
- 39% of effects were subjectively rated to have replicated the original result
- Correlational tests suggest that replication success was better predicted by the strength of original evidence than by characteristics of the original and replication teams
REPLICATION CRISIS
Differences between social psychology and cognitive psychology (Open Science Collaboration, 2015)
- Cognition – Approximately 50% of findings were replicated (statistically significant)
- Social – Approximately 25% of findings were replicated
- Possibly due to:
- Nature of the study – in social psychology, the concepts under study are influenced by a wide range of things
- Control – in cognitive psychology, it is much easier to control some of the variables
• First study of its kind – unclear what the replication rate should be. There is always a chance that things don’t replicate.
• Tried to maximise the chance of replication – but things might not have gone as they should.
• The finding that effect sizes seemed substantially inflated might be due to systematic biases in publication:
• A liking for “significant” and “interesting” results
• This means authors do not publish when they fail to find significant effects
• Low-powered research designs
There are lots of reasons why you might not replicate findings
- Sample might differ in important ways
- The context (situation) might have an influence on the findings
- One of the studies made a mistake in the design
- One is a false positive or the other a false negative
Many Labs 2 (2018)
- More effects (28)
- Used effects that might be expected to change across samples and settings
- More labs were involved - Each protocol was administered to approximately half of 125 samples that comprised 15,305 participants from 36 countries and territories
- More diverse samples were used.
- Rationale – to understand how sample type and situation might influence replication; if these introduce a lot of variability, then failures to replicate could be put down to the replication attempts themselves (their samples and settings) rather than to the original findings.
• On average, 14 out of 28 effects were significant in the same direction as the original
• 8 of these found the effect in 89% to 100% of the different samples
• 6 of these found the effect in 11% to 46% of the different samples
• For those that did not replicate, on average 90% of the samples showed non-significant effects
How did situation influence the findings?
• 11 of the 28 showed significant heterogeneity (differences) across samples – but only one of these was an effect that did not replicate
• Only 1 effect showed a difference between online and in the lab versions of the test
• Identified samples in WEIRD cultures - Western, educated, industrialized,
rich and democratic- and compared to samples from less WEIRD cultures
• 13 out of the 14 that replicated showed no difference between WEIRD and less WEIRD samples. The one that did show a difference found the effect in the WEIRD samples but not the less WEIRD samples
• In total there were only 3 differences found between WEIRD and less WEIRD
• Explored task order (as each lab did multiple tests) and found no real effect of this
• Overall 7 of the replications had larger effect sizes than originally reported, 21 had smaller
effect sizes
• Conclusion – although the situation does have an effect, situational differences are not large enough to explain failures to replicate
Replication crisis
Forsell et al. (2019)
Interestingly – academics are quite good at predicting replication success …
• In this study academics had to predict the success of replication from Many Labs 2
• Via a questionnaire – 0.731 correlation with replication outcome
• Via a prediction market (where psychologists trade on outcome) = 0.755 correlation with replication outcome
• Via a questionnaire on predictions of effect sizes – 0.614 correlation
• So academics have a decent understanding of when a study may or may not replicate. How and why? Potentially a result being “surprising” or very novel is a cue…
• There are studies I have read and been quite sceptical of, either because the result seemed a bit surprising or unlikely, the sample looked small, or the design didn’t seem great, etc.
the situation
- Many fields (not just Psychology) are having issues with replication, for many of the reasons outlined:
- Statistical procedures
- Publication process
- Underpowered studies
a potential solution is open science
open science
- This is a movement to make parts (and hopefully all) of the research openly available to all for scrutiny
- Until recently, all we ever saw was the end result – the journal article
- Being ‘Open’ should reduce some of the systematic issues that have led to the replication crisis
open science - why share
Sharing the study rationale, hypotheses, and planned analyses before data collection, somewhere openly accessible (and time-stamped), fixes the authors to one ‘story’ and one set of analyses.
• This prevents people from inventing a theory after the fact to explain the data
• It stops people analysing the data in lots of different ways until they find something ‘significant’. If we run lots of different analyses, chance alone will lead to one of them being ‘significant’ (a quick simulation of this is sketched below). If that significant result is reported and written up as if it were the only analysis run, it looks convincing in the paper, but it might not be replicable.
• This style of statistical testing is called ‘null hypothesis significance testing’.
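To see why running many analyses almost guarantees a ‘significant’ result somewhere, here is a small simulation sketch (the numbers of tests and participants are purely illustrative): each simulated ‘paper’ runs 20 independent t-tests on pure noise, and we count how often at least one comes out below p = .05.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

n_experiments = 1000   # simulated 'papers'
n_tests = 20           # different analyses tried per paper
n_per_group = 30

at_least_one_hit = 0
for _ in range(n_experiments):
    # Pure noise: there is no real effect in any of these analyses
    pvals = [stats.ttest_ind(rng.normal(size=n_per_group),
                             rng.normal(size=n_per_group)).pvalue
             for _ in range(n_tests)]
    if min(pvals) < 0.05:
        at_least_one_hit += 1

# With 20 independent tests on null data, theory says about 64% of 'papers'
# will contain at least one p < .05 (1 - 0.95**20)
print(f"Proportion with a 'significant' result: {at_least_one_hit / n_experiments:.2f}")
```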