8 The Replication Crisis and the Open Science Movement Flashcards

1
Q

Where did the idea of a replication crisis come from?

A

Large-scale replication efforts, most prominently the Reproducibility Project (Open Science Collaboration, 2015), found that:

The mean effect size (r) of the replication effects (M = 0.197, SD = 0.257) was half the magnitude of the mean effect size of the original effects (M = 0.403, SD = 0.188), representing a substantial decline.

Ninety-seven percent of original studies had significant results (p < .05); only thirty-six percent of replications did.

2
Q

Why do we have a replication crisis?

A

problematic practices: selective reporting, selective analysis, insufficient specification of the conditions necessary or sufficient to obtain the results

publication bias, …

⇒ understanding is achieved through multiple, diverse investigations
replication simply provides evidence for the reliability of a result
alternative explanations, … can account for diminished reproducibility

⇒ cultural practices in scientific communication
low-power research designs
publication bias

⇒ Reproducibility is not well understood because the incentives for individual scientists prioritize novelty over replication

3
Q

What is predictive of replication success?

A

the strength of the initial evidence, rather than characteristics of the teams conducting the research

4
Q

What is “evaluating replication effect against null hypothesis of no effect”?

A

does the replication show a statistically significant effect in the same direction as the original study?

treating the 0.05 threshold as a bright-line criterion between replication success and failure is a key weakness of this method

5
Q

What is done if you evaluate the replication effect against the original effect size?

A

is the original effect size within the 95% CI of the effect size estimate from the replication?

-> considers the precision of the effect, not only its direction

-> considers the size, not only the direction

6
Q

What is done if you compare original and replication effect sizes for cumulative evidence?

A

a descriptive comparison of effect sizes alone does not provide information about the precision of either estimate, nor about the cumulative evidence for the effect

→ solution: compute a meta-analytic estimate that combines the original and replication effects

One qualification about this result is the possibility that the original studies have inflated effect sizes due to publication, selection, reporting, or other biases
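
A minimal sketch of such a meta-analytic estimate (fixed-effect combination of two correlations via the Fisher z transform; the r and n values are hypothetical, not taken from the studies above):

```python
import numpy as np

def meta_estimate(rs, ns):
    """Fixed-effect meta-analytic estimate of a correlation.

    Studies are combined by inverse-variance weighting in Fisher-z
    space, where var(z) is approximately 1 / (n - 3).
    """
    zs = np.arctanh(np.asarray(rs, dtype=float))   # Fisher z transform
    w = np.asarray(ns, dtype=float) - 3.0          # weights = 1 / var(z)
    z_pooled = np.sum(w * zs) / np.sum(w)
    se = 1.0 / np.sqrt(np.sum(w))
    lo, hi = np.tanh(z_pooled - 1.96 * se), np.tanh(z_pooled + 1.96 * se)
    return np.tanh(z_pooled), (lo, hi)

# Hypothetical original study (r = .40, n = 50) and replication (r = .20, n = 200):
r_meta, (lo, hi) = meta_estimate([0.40, 0.20], [50, 200])
print(f"meta-analytic r = {r_meta:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

Note that the larger replication dominates the weighted estimate; and per the qualification above, if the original effect was inflated by bias, the pooled estimate inherits some of that inflation.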

7
Q

Is replication the real problem?

A

meta-analyses show that most findings do replicate

The real problem is not a lack of replication; it is the distortion of our research literatures caused by publication bias and questionable research practices.

8
Q

What do the researchers argue for? What is the real problem in psychological research?

A

(a) studies in most areas are replicated;
(b) failure to replicate a study is usually not evidence against the initial study’s conclusions;
(c) an initial study with a nonsignificant finding requires replication;
(d) a single study can never answer a scientific question;
(e) the widely used sequential study research program model does not work;
(f) randomization does not work when sample sizes are small.

9
Q

What different types of replication exist?

A

(a) literal replication—the same researcher conducts a new study in exactly the same way as in the original study;

(b) operational replication—a different researcher attempts to duplicate the original study using exactly the same procedures (also called direct replication); and

(c) systematic replication—a different researcher conducts a study in which many features of the original study are maintained but some aspects (e.g., type of subjects or measures used) are changed (also called conceptual replication)

10
Q

What are common errors in thinking about replication?

A
  • replication should be interpreted in a stand-alone manner
    ignores statistical power
    average statistical power in psychological literatures ranges from .40 to .50
    (the likelihood that a test will detect an effect of a certain size if there is one)
    Note that if confidence intervals (CIs) were used instead of significance tests, there would be far fewer “failures to replicate”, because the CIs would often overlap, indicating no conflict between the two studies
  • meta-analytic research has shown that no single study can answer a question by itself
    sampling error = the difference between an estimate of a population parameter and the actual value of the population parameter that the sample is intended to estimate
  • measurement error, range variation, imperfect construct validity of measures, and artificial dichotomization of continuous measures (among others) also distort single-study results
11
Q

What about replicability of non-significant findings?

A

= usually interpreted as the absence of a relationship

→ this interpretation is unjustified
→ do NSFs not need replication too?

⇒ should be followed up with additional studies

In fact, given typical levels of statistical power, a relation that shows consistent nonsignificant findings may be real.

Richard et al. (2003): the average effect size in social psychology is d = .40
(based on >300 meta-analyses)

median sample size in psychology is only 40

-> with power in the .40 to .50 range, roughly half of studies of true effects should report significant and half nonsignificant findings
-> that is not the pattern we see (a quick check of the power arithmetic follows below)
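
A sketch of that power arithmetic using scipy's noncentral t distribution (it assumes "sample size 40" means 40 subjects per group in a two-sided, two-sample t-test, with a true d = .40):

```python
import numpy as np
from scipy import stats

def power_two_sample_t(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample t-test for Cohen's d."""
    df = 2 * n_per_group - 2
    ncp = d * np.sqrt(n_per_group / 2)        # noncentrality parameter
    t_crit = stats.t.ppf(1 - alpha / 2, df)   # two-sided critical value
    # probability of rejecting under the noncentral-t alternative:
    return stats.nct.sf(t_crit, df, ncp) + stats.nct.cdf(-t_crit, df, ncp)

print(f"power ≈ {power_two_sample_t(0.40, 40):.2f}")   # ≈ .42, in the .40-.50 range
```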

12
Q

What are biases in the published literature?

A
  • research fraud
  • publication bias, source bias
  • biasing effects of questionable research practices -> QRPs (most severe in laboratory experimental studies)

the highest admission rate of QRP use is in social psychology: 40%

(a) adding subjects one by one until the result is significant, then stopping;

(b) dropping studies or measures that are not significant;

(c) conducting multiple significance tests on a relation and reporting only those that show significance (cherry picking);

(d) deciding whether to include data after looking to see the effect on statistical significance;

(e) hypothesizing after the results are known (HARKing); and

(f) rerunning a lab experiment until you get the “right” results.

  • limitations of random assignment

(claimed superiority of experimental studies)

randomization does not work unless samples are large, which is extremely rare

small randomized samples produce neither equivalent groups nor groups representative of the population of interest

13
Q

What approach should be taken to detect QRPs?

A

The frequency of statistical significance in some literatures is suspiciously high given the level of statistical power in the component studies

statistical power has not increased since Cohen first pointed out the problem in 1962

low power → nonsignificant findings → difficult to publish

researchers avoid this consequence by using QRPs

→ the result: an upward bias in mean effect sizes and a downward bias in the variability across effect sizes, due to the unavailability of low-effect-size studies

14
Q

What is false-positive psychology research practice?

A

despite empirical psychologists’ nominal endorsement of a low rate of false-positive findings (≤ .05),
flexibility in data collection, analysis, and reporting dramatically increases actual false-positive rates

false positive = incorrect rejection of a null hypothesis (detecting a difference when in fact there is none)

15
Q

What are researcher degrees of freedom?

A
  • it is common for researchers to explore various analytic alternatives and report only “what worked”

ambiguity in how to best make a decision
desire to find statistically significant results

→ self-serving justifications
(highly subjective, variable across replications)

  • flexibility in choosing among dependent variables
  • choosing sample size
  • using covariates
  • reporting subsets of experimental conditions
16
Q

What can be said about the influence of this flexibility on false-positive rates?

A

⇒ flexibility in analyzing two dependent variables (correlated at r = .50) nearly doubles the probability of obtaining a false-positive finding (5% → 9.5%)

⇒ collecting 10 additional observations per cell after an initial analysis raises it to 7.7%

⇒ controlling for gender, or for the interaction of gender with treatment, produces a false-positive rate of 11.7%

⇒ the combination of all these practices leads to a false-positive rate of 61%
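
A minimal Monte Carlo sketch of the two-dependent-variables case (assumptions: two-sample t-tests, n = 20 per cell, and reporting whichever of DV1, DV2, or their average “works”, roughly following the Simmons et al. simulations):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def two_dv_false_positive(n=20, r=0.5, alpha=0.05, sims=10_000):
    """False-positive rate when a researcher tests two DVs correlated
    at r (plus their average) and reports any test with p < alpha.
    Both groups are pure noise, so every rejection is a false positive."""
    cov = [[1.0, r], [r, 1.0]]
    hits = 0
    for _ in range(sims):
        a = rng.multivariate_normal([0, 0], cov, size=n)   # group 1, two DVs
        b = rng.multivariate_normal([0, 0], cov, size=n)   # group 2, two DVs
        ps = [stats.ttest_ind(a[:, j], b[:, j]).pvalue for j in range(2)]
        ps.append(stats.ttest_ind(a.mean(axis=1), b.mean(axis=1)).pvalue)
        hits += min(ps) < alpha
    return hits / sims

print(f"nominal alpha = .05, actual ≈ {two_dv_false_positive():.3f}")  # ≈ .095
```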

17
Q

What is the main problem of this flexibility?

A

researchers often decide when to stop data collection on the basis of interim data analyses

effects that reach significance in a small interim sample will not necessarily remain significant in a larger one (see the sketch below)
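
A sketch of that interim-analysis problem (optional stopping; the starting size, step, and maximum are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def peeking_false_positive(n_start=20, n_max=100, step=10,
                           alpha=0.05, sims=5_000):
    """False-positive rate when a researcher tests after every `step`
    observations per group and stops as soon as p < alpha.
    The null is true throughout, so each early stop is a false positive."""
    hits = 0
    for _ in range(sims):
        a, b = rng.normal(size=n_max), rng.normal(size=n_max)
        for n in range(n_start, n_max + 1, step):
            if stats.ttest_ind(a[:n], b[:n]).pvalue < alpha:
                hits += 1
                break
    return hits / sims

print(f"nominal alpha = .05, with peeking ≈ {peeking_false_positive():.3f}")
```

The rate climbs well above .05 because each interim look is another chance for noise to cross the threshold.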

18
Q

What requirements for authors do the researchers suggest?

A
  1. must decide the rule for terminating data collection before it begins
  2. at least 20 observations per cell
  3. list all variables collected in a study
  4. report all experimental conditions, including failed manipulations
  5. if observations are eliminated, authors must report what the statistical results would be if those observations were included
  6. if an analysis includes a covariate, report the results both with and without it
19
Q

What guidelines for reviewers do the researchers suggest?

A
  • ensure the authors follow the requirements
  • be more tolerant of imperfections in results
  • require authors to demonstrate that results do not hinge on arbitrary analytic decisions
  • require an exact replication if the evidence is not compelling
20
Q

What is the open science movement?

A

If all elements of an experiment are completely accessible and clearly documented, then this:
(1) increases the degree to which exact replications can be conducted, and
(2) reduces the likelihood of researchers using questionable practices in their research; for example, it reduces the likelihood of p-hacking.

The open science movement is a collection of research practices promoting openness, transparency, rigor, reproducibility, replicability, and the accumulation of knowledge.

21
Q

What is p-hacking?

A

Data dredging (also known as data snooping or p-hacking) is the misuse of data analysis to find patterns in data that can be presented as statistically significant, dramatically increasing the actual risk of false positives while understating it

e.g., selective reporting of significant results from a series of hypothesis tests with different dependent variables

22
Q

What are some guidelines to ensure open science practice?

A

follow “good enough practices in scientific computing” (Wilson et al., 2017)

preregistration - arising from the need to promote purely confirmatory research and transparently demarcate exploratory research

→ preregistration counters cognitive biases, particularly confirmation and hindsight bias, and the pressure to publish large quantities of predominantly positive results

“almost no psychological research is conducted in a purely confirmatory fashion”

The solution lies in preregistration: researchers committing to the hypotheses, study design, and analyses before the data are accessible. In their paper, Wagenmakers et al. present an exemplary preregistered replication as an illustration of this practice.

making replication mainstream - a finding needs to be repeatable to count as a scientific discovery

teaching open science

23
Q

What else is important to ensure good research practice in psychology?

A

correct statistical knowledge and reporting:

models, hypotheses, and tests

24
Q

What is the current understanding of a statistical model?

A

A statistical model is a complex web of assumptions.

a mathematical representation of data variability
- often built on unrealistic or unjustified assumptions

defining the scope of a model: it should be a good representation of both the observed data and hypothetical alternative data that might have been observed

model is usually presented in highly compressed and abstract form

one assumption in the model is the hypothesis that a particular effect has a specific size; this assumption is the one targeted for statistical analysis
→ the study hypothesis

Much statistical teaching and practice has developed a strong (and unhealthy) focus on the idea that the main aim of a study should be to test null hypotheses

25
Q

What is the current understanding of probability and statistical significance?

A

probabilities are treated as hypothetical frequencies of data patterns under an assumed statistical model
→ frequentist methods

p-value = observed significance level

probability that the chosen test statistic would have
been at least as large as its observed value if every model assumption were correct, including the test hypothesis

the P value tests all the assumptions about how the data were generated (the entire model), not just the targeted hypothesis it is supposed to test (such as a null hypothesis)

-> so it says nothing specific about that hypothesis alone

-> number computed from the data, unknown before computation

continuous measure of the compatibility between the data and the entire model used to compute it, ranging from 0 for complete incompatibility to 1 for perfect compatibility

26
Q

Is the p-value a hypothesis probability predictor?

A

NO

the p-value is computed assuming the test hypothesis is true, so it cannot also be the probability that this assumption holds
it indicates the degree to which the data conform to the pattern predicted by the model

it is NOT “p = 0.01 means the null hypothesis has a 1% chance of being true”

p = 0.01 means that the data are not very compatible with the statistical model used to compute it

27
Q

How does the p-value allow for inference about the effect being due to chance?

A

the p-value does not tell us the probability that chance alone produced the effect

it is the likelihood of seeing such an effect if the null hypothesis were true, e.g.:
p = 0.08
“if the medicine had no real effect, there would still be an 8% chance of seeing these improvements just by random chance”

how surprising are the data if the null hypothesis were true?

the difference:
chance is not operating alone
it is NOT “there is an 8% chance of the effect being due to chance”
it IS “if there is truly no effect, there is still an 8% probability of data like these occurring”
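
That reading can be made concrete with a simulation (the observed test statistic here is hypothetical): under a true null, count how often data at least as extreme arise by chance.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

observed_t = 1.80     # hypothetical observed two-sample t statistic
n, sims = 30, 50_000  # 30 subjects per group, many null "studies"
extreme = 0
for _ in range(sims):
    # both groups drawn from the same distribution: the null is true
    t = stats.ttest_ind(rng.normal(size=n), rng.normal(size=n)).statistic
    extreme += abs(t) >= observed_t
print(f"simulated two-sided p ≈ {extreme / sims:.3f}")
# agrees with the analytic value 2 * stats.t.sf(observed_t, df=2*n-2) ≈ .077
```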

28
Q

What does a significant test result of p < 0.05 mean?

A

a small p-value does not tell us whether the test hypothesis is true or false

it means that the data would be quite unusual if all the assumptions were correct:

if our starting assumptions were true (no real differences), there is only a 5% chance (or less) that we would see results as extreme as these just by random chance

29
Q

What does a large p-value mean?

A

NOT evidence in favour of the test hypothesis
NOT evidence of no effect

if p is not exactly 1, some other hypothesis would be even more compatible with the data

indicates only that the data are incapable of discriminating among many competing hypotheses

30
Q

What is the detection of scientifically important relations?

A

not really detectable from the p-value
-> it only comments on the likelihood of the data occurring under the present assumptions

more from the CI
-> it shows whether the plausible effect sizes are substantive

31
Q

What can be said about the inference from p-values to effect sizes?

A

The same p-value can arise under many different models, hypotheses, and effect sizes; therefore, the p-value does not allow inference of a small or large effect size.

→ always use CIs

32
Q

How should p-values be compared across studies?

A

it is a fallacy to assume that if the majority of studies have p > 0.05, the overall evidence supports the null hypothesis of no effect

-> all studies could individually fail to reach significance, yet their statistical combination could still demonstrate the effect (see the sketch below)

-> individual studies do not allow for inference about an effect!
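
A sketch of such a combination (Stouffer's Z method on three hypothetical one-sided p-values, each nonsignificant on its own):

```python
import numpy as np
from scipy import stats

def stouffer(pvals):
    """Combine one-sided p-values with Stouffer's Z method."""
    z = stats.norm.isf(np.asarray(pvals, dtype=float))  # p -> z per study
    return stats.norm.sf(z.sum() / np.sqrt(len(pvals))) # combined p

# three studies, each p > .05, yet the pooled evidence is significant:
print(f"combined p ≈ {stouffer([0.08, 0.10, 0.12]):.3f}")   # ≈ .013
```

(scipy also ships this as stats.combine_pvalues(..., method="stouffer").)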

33
Q

How sensitive are p-values to experimental conditions?

A

very

differences in populations and the resulting standard errors (SEs) make p-values across studies not directly comparable

therefore, if two studies have the same p-value, this does not mean their results are in agreement
-> the equal p-values might reflect quite different underlying differences

34
Q

What does a CI of 95% mean?

A

range between two numbers

it estimates the frequency with which an observed interval contains the true effect size if all assumptions are correct

It’s about the long-run accuracy of the method if you repeated the process many times under similar conditions.

The “95%” in a 95% confidence interval means that if we were to repeat the study many times, 95% of the confidence intervals calculated from those studies would contain the true effect size, assuming all the assumptions hold
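
A sketch of that long-run reading (the true mean is hypothetical; repeat the study many times and count how often the computed 95% CI contains it):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

true_mean, n, sims = 0.4, 30, 10_000
covered = 0
for _ in range(sims):
    x = rng.normal(true_mean, 1.0, size=n)          # one simulated study
    half = stats.t.ppf(0.975, n - 1) * x.std(ddof=1) / np.sqrt(n)
    covered += x.mean() - half <= true_mean <= x.mean() + half
print(f"coverage ≈ {covered / sims:.3f}")   # ≈ 0.95 when the assumptions hold
```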

35
Q

What are false assumptions of CIs?

A

Not a Probability Statement About the Specific Interval: For a given study, the 95% CI doesn’t mean there’s a 95% chance that this specific interval contains the true value. It’s about the long-term accuracy of the interval if the experiment were repeated many times.

Not a Refutation or Confirmation: Just because an effect size lies outside a 95% CI doesn’t mean it’s refuted or excluded. It signals that, under the assumptions made, such a result would be unusual, but it doesn’t provide absolute proof.

Not About Overlapping Intervals: Overlapping 95% CIs from two groups or studies do not necessarily mean there’s no significant difference between the groups. Nor does non-overlapping imply a significant difference.

36
Q

What is statistical power?

A

Power is about the capability of a test to detect an effect when it’s there, not about the probability of making an error in one specific test.

High power doesn’t validate the null hypothesis; it just means the test is likely to detect an effect if it exists.

The power of a statistical test is the probability that it will correctly reject a false null hypothesis. For example, if a test has 90% power, it means there’s a 90% chance the test will detect an effect if there is an effect to be detected.

37
Q

What should not be assumed about statistical power?

A

the error percentage does not apply to one single study; it applies to the long run of repeated applications of the same model with the same power

you cannot say “I have a 10% probability of being wrong because my power is 90%”

if p > 0.05
-> this does not mean the data support the null over the alternative
-> it just means that in this test, even though the test has high power (is very good at detecting a true difference), it failed to detect one

38
Q

What is statistical power?

A

The pre-study probability that the test will reject the test hypothesis when a specified alternative is true; usually, the probability that p will not exceed .05 under that alternative.

39
Q

What should guidelines be to ensure statistical accuracy?

A
  • examining the sizes of effect estimates and confidence limits, as well as precise p-values
  • critical examination of assumptions and conventions used
  • if results are nonsignificant → also evaluate them against alternative hypotheses, not only the null
  • interval estimates
    consider the CI first, then the p-value
    both depend on an uncertain statistical model
  • pooled analyses or meta-analyses to reduce the influence of individual-study biases
  • Any opinion offered about the probability, likelihood, certainty, or similar property for a hypothesis cannot be derived from statistical methods alone
  • research reports (including meta-analyses) should describe in detail the full sequence of events that led to the statistics presented, including the motivation for the study, its design, the original analysis plan, the criteria used to include and exclude subjects (or studies) and data, and a thorough description of all the analyses that were conducted.
40
Q

what are the main inferences possible from CIs compared to only p-values?

A

CI Perspective: Provides a range of plausible values for the true effect and can help assess the precision of the estimate (wider intervals indicate less precision).
-> reliability over many administrations
-> possibility to discriminate between hypotheses

P-value Perspective: Provides a metric for judging whether the observed data would be surprising if the null hypothesis were true, but doesn’t quantify the probability of the hypothesis itself being true.
-> how extreme/likely are the data under the assumption that there is no effect
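
The two perspectives, side by side on the same simulated (hypothetical) data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
a = rng.normal(0.3, 1.0, size=40)   # hypothetical treatment group
b = rng.normal(0.0, 1.0, size=40)   # hypothetical control group

# p-value perspective: how surprising are these data under "no effect"?
t_stat, p = stats.ttest_ind(a, b)
print(f"p = {p:.3f}")

# CI perspective: a range of plausible values for the mean difference,
# whose width also conveys the precision of the estimate.
diff = a.mean() - b.mean()
se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
half = stats.t.ppf(0.975, len(a) + len(b) - 2) * se
print(f"difference = {diff:.2f}, 95% CI [{diff - half:.2f}, {diff + half:.2f}]")
```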