Trust Economics Week 9 Flashcards
This topic explores how much we should trust empirical results published in journals.
FACTS: replicability vs non-replicability
Replicability exists in the natural sciences, which is why we have laws of physics, etc.
Non-replicability is common in economics because it is like studying animal behaviour: subjects are heterogeneous in how they respond to stimuli.
What is the standard economic method of testing?
Hypothesis testing, i.e. start with a null hypothesis.
E.g. the null: the drug doesn't cure cancer (no relationship with cancer).
Rejecting the null means concluding the drug does cure cancer (there is a relationship).
P-value
How unlikely the pattern in your data would be if the null were true. (E.g. the null is that the drug does not cure cancer; how surprised am I to see data suggesting it does?) (The level of confidence with which we can reject the null; smaller = better.)
(A small p-value = greater evidence that the null hypothesis is false, i.e. that the drug does cure cancer.)
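A minimal sketch of what this means in practice (my own illustration with made-up drug-trial numbers, not from the cards): simulate a world where the null is true and count how often chance alone produces a difference at least as large as the one observed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical observed trial: 40/100 cured with the drug vs 30/100 with placebo.
observed_diff = 0.40 - 0.30

# Simulate many trials in which the null is true: both groups share the same
# underlying cure rate (0.35 here), so any difference is pure chance.
n_sims, n_per_arm, null_rate = 100_000, 100, 0.35
drug = rng.binomial(n_per_arm, null_rate, n_sims) / n_per_arm
placebo = rng.binomial(n_per_arm, null_rate, n_sims) / n_per_arm

# The p-value is (roughly) the share of null-world trials that look at least
# as extreme as the data we actually saw.
p_value = np.mean(drug - placebo >= observed_diff)
print(f"Approximate one-sided p-value: {p_value:.3f}")
# A small p-value = the observed pattern would be surprising if the drug did nothing.
```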
2 problems with research practices that harm the credibility/trustworthiness of results
Publication bias
P-hacking
Publication bias
A study is more likely to be published if it has a smaller p-value.
I.e. smaller p-values = the studies that find a significant effect and reject the null.
P-hacking
Researchers make methodological choices in conducting their studies that tend to deliver lower p-values, i.e. significant effects. (Again, no one wants boring news with no relationship.)
Remember: p-hacked results can still be accurate, but they are biased.
What is publication bias an example of?
A selection effect
Selection effect and example
What we see in newspapers/journals is not everything, but a filtered subset. The selection is not usually random, so it needs to be considered!
e.g. the WW2 RAF study of returning bombers and the pattern of their bullet holes. TEACHES YOU TO ACKNOWLEDGE HOW THE SELECTION HAPPENED, AND HENCE THE REASON FOR THE RESULTS
Why does publication bias arise?
People want to see that X is related to Y, not that X isn't related to Y!
E.g. people want to see "Coke makes you bald", not "Coke doesn't make you bald".
How does the nature of statistics work, as opposed to the nature of science? And examples
Run enough studies enough times and you can create a purely spurious result.
Jelly bean colour example, Coke example.
Unlike in science, where applying heat to water always turns it into steam.
Coke example explained
Men are asked about their consumption of 15 different drinks, and then whether they are bald or not.
Most drinks had nothing to do with hair loss, represented by blue dots.
The journalist only writes up the red dot (the one significant result).
No causation, but a positive correlation between Coke drinkers and being bald. (Maybe because older men drink more fizzy drinks!)
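A rough simulation of the Coke card (my own sketch with invented data): test 15 drinks against baldness when none of them truly matters, and about 1 test in 20 still comes out "significant" at the 5% level by chance alone.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

n_men, n_drinks, alpha = 500, 15, 0.05
bald = rng.binomial(1, 0.3, n_men)                     # baldness, unrelated to any drink
consumption = rng.poisson(3, size=(n_men, n_drinks))   # drinks per week, pure noise

significant = []
for d in range(n_drinks):
    # Correlation test between drink d and baldness; no true effect exists.
    r, p = stats.pearsonr(consumption[:, d], bald)
    if p < alpha:
        significant.append(d)

print(f"Drinks 'significant' by chance alone: {significant}")
# With 15 independent tests at the 5% level we expect ~0.75 false positives per
# survey; run enough surveys and a "Coke causes baldness" red dot is guaranteed.
```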
Examples of p-hacking decisions the researcher can make (3)
What data to collect
What sample to use
How to define variables
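A hedged sketch of how such choices inflate false positives (my own simulation, not from the lecture): the "treatment" has no effect at all, but a researcher who tries several sample definitions and keeps the most significant one rejects the null far more often than 5% of the time.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def pval(outcome, treated, mask):
    """p-value of a treatment/control comparison of means on a subsample."""
    return stats.ttest_ind(outcome[mask & (treated == 1)],
                           outcome[mask & (treated == 0)]).pvalue

n, n_datasets = 400, 2000
honest_hits = hacked_hits = 0

for _ in range(n_datasets):
    # A dataset with NO true effect: the outcome ignores the treatment entirely.
    treated = rng.binomial(1, 0.5, n)
    outcome = rng.normal(0, 1, n)
    age = rng.integers(18, 70, n)
    female = rng.binomial(1, 0.5, n)

    # Honest researcher: one pre-specified test on the full sample.
    full = np.ones(n, dtype=bool)
    honest_hits += pval(outcome, treated, full) < 0.05

    # P-hacker: also tries several sample definitions and reports the best one.
    cuts = [full, age < 40, age >= 40, female == 1, female == 0]
    hacked_hits += min(pval(outcome, treated, m) for m in cuts) < 0.05

print(f"False-positive rate, one pre-specified test:   {honest_hits / n_datasets:.1%}")
print(f"False-positive rate, best of 5 sample choices: {hacked_hits / n_datasets:.1%}")
```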
Example of p-hacking in hotel reviews
Hotels only request feedback for an online review when they know their customers enjoyed their stay, hence why reviews are generally good on TripAdvisor.
Results can be probabilistic rather than deterministic - example
E.g. giving an "employee of the month" award might not motivate every worker, but it can change effort across a large sample/portion of workers.
We need to consider the scale of these problems.
Stylised example without publication bias or p-hacking
Consider 1,000 hypotheses, of which 50 are true. We don't know which ones, so an experiment is done on all 1,000.
The experiment finds a significant result when the hypothesis is actually true with probability 0.8 (the "power" of the test).
The experiment finds a significant result when the hypothesis is actually false with probability 0.05 (the arbitrary p-value cut-off).
On seeing a positive result, the chance of the underlying hypothesis being true is:
(0.8 × 50) / {(0.8 × 50) + (0.05 × 950)} = 0.46
Meaning of this
If we see a paper with a result significant at the 5% level, the probability of it actually being true is 0.46.
LESS THAN HALF!
This example shows the extent of the problem. REMEMBER: THIS EXAMPLE HASN'T EVEN INCLUDED PUBLICATION BIAS OR P-HACKING, SO IN THE REAL WORLD THE EFFECT IS EVEN WORSE.
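The same arithmetic as the card above, written out as a tiny script so the 0.46 is explicit (1,000 hypotheses, 50 true, power 0.8, 5% false-positive rate).

```python
# Stylised example: how likely is a "significant" finding to be true?
n_hypotheses = 1000
n_true = 50
n_false = n_hypotheses - n_true
power = 0.80   # P(significant result | hypothesis actually true)
alpha = 0.05   # P(significant result | hypothesis actually false)

true_positives = power * n_true      # 40 real effects detected
false_positives = alpha * n_false    # 47.5 spurious "effects"

p_true_given_significant = true_positives / (true_positives + false_positives)
print(f"P(hypothesis true | significant result) = {p_true_given_significant:.2f}")  # ≈ 0.46
```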
So what should we value more? (2)
Theory alongside the results, to support the findings.
High-powered studies; results are more compelling with a bigger sample.
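A small sketch of the "high power" point (my own numbers, purely illustrative): for a modest true effect of 0.2 standard deviations, the chance a study detects it at the 5% level rises sharply with sample size, which is why bigger studies are more compelling.

```python
from scipy.stats import norm

# Approximate power of a two-sample test for a true effect of 0.2 SD at the 5% level.
effect, alpha = 0.2, 0.05
z_crit = norm.ppf(1 - alpha / 2)

for n_per_arm in (50, 200, 800):
    se = (2 / n_per_arm) ** 0.5                 # SE of the difference in means (sigma = 1)
    power = 1 - norm.cdf(z_crit - effect / se)
    print(f"n per arm = {n_per_arm:4d}: power ≈ {power:.2f}")
# Roughly 0.17, 0.52 and 0.98: small studies usually miss real effects,
# so their occasional "hits" are more likely to be noise.
```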
Marathon analogy
We would expect a small number of people finishing at the fastest times, a lot in the middle range, and few at the slowest - roughly a normal distribution.
In reality there are multiple spikes: people push to finish in 2:59 rather than 3:01, etc., so more runners finish just under these round-number times, causing the spikes.
Brodeur et al.
Collected 13,440 p-values to see how the distribution of their z-statistics compares with what we would expect if results were unmanipulated.
Spikes appear around the critical values (like in the marathon example).
This shows a major filtration process, meaning we don't see certain results! (The ones where we fail to reject, i.e. the boring ones.)
So the first way to collect evidence is to collect p-values and look at their distribution (marathon analogy and Brodeur).
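A hedged sketch of why this diagnostic works (my own simulation, not Brodeur et al.'s data or method): generate z-statistics from a mix of null and real effects, publish significant results much more often than insignificant ones, and the published distribution jumps at the critical value 1.96, just like marathon finishing times jump at 3:00.

```python
import numpy as np

rng = np.random.default_rng(7)

# Simulated z-statistics: most studies test null effects (z ~ N(0,1)),
# a minority test real effects (z ~ N(2.5,1)).  Purely illustrative numbers.
n_studies = 50_000
is_real = rng.random(n_studies) < 0.10
z = np.abs(np.where(is_real,
                    rng.normal(2.5, 1, n_studies),
                    rng.normal(0.0, 1, n_studies)))

# Publication filter: significant results are almost always published,
# insignificant ones rarely are.
significant = z > 1.96
published = rng.random(n_studies) < np.where(significant, 0.9, 0.1)
z_pub = z[published]

just_below = np.sum((z_pub > 1.76) & (z_pub <= 1.96))
just_above = np.sum((z_pub > 1.96) & (z_pub <= 2.16))
print(f"Published z-statistics in (1.76, 1.96]: {just_below}")
print(f"Published z-statistics in (1.96, 2.16]: {just_above}")
# The jump at 1.96 reveals the filtration: the "boring" results are missing.
```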
Second way to collect more evidence, and example
Try to replicate studies.
If we collect a bunch of studies with p < 0.05 and take them at face value, more than 95% should replicate!
Nosek: attempted to replicate 100 research findings; only 39 could be reproduced. (39 IS NOT 95!!!)
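A back-of-the-envelope check (my own connection between the cards, not necessarily made in the lecture): if only about 46% of significant findings are true (the stylised example above) and a replication has the same 0.8 power and 0.05 false-positive rate, the expected replication rate is nowhere near 95%.

```python
# Combining the stylised example with the replication idea.
p_true_given_sig = 0.46   # chance a published significant finding is actually true
power = 0.80              # chance a replication detects a real effect
alpha = 0.05              # chance a replication "detects" a non-existent effect

expected_replication = p_true_given_sig * power + (1 - p_true_given_sig) * alpha
print(f"Expected replication rate: {expected_replication:.3f}")
# Roughly 0.4 - much closer to Nosek's 39/100 than to the naive 95%.
```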
So, 2 ways to gather evidence and assess accuracy
Collect p-values and see their distribution
Replicate studies
Solutions
Change publication practices to allow null results to be more easily published.
Encourage replications via data sharing, journals publishing replications, university recognition etc.
Make replications easier, e.g. via data-sharing requirements and open data: the "open science" movement.
Require “pre-registration” or “pre-analysis plans” to address p-hacking