Replication and good practice Flashcards
Lecture 14
p values
A p value of less than .05 is typically considered significant in psychology.
More rarely, a p of < .01 is required.
There is a movement towards evaluating the data as a whole rather than relying on individual p values within a single study.
The p = .05 cut-off is fairly arbitrary despite its wide acceptance -> in simplified terms, it means roughly a 1 in 20 chance of getting results like these by chance alone if there were no real effect.
The p-value is determined by:
1. Sample size (the larger the sample, the greater the chance of a significant effect if the finding is real).
2. Size of your effect (the larger the effect, the greater the chance of significance at any given sample size).
Power is the probability that your study will find a significant effect if a real effect exists (good practice is around 90% power) - see the simulation sketch below.
- Lots of studies have low power, around 50% or even lower.
- This is bad because it increases the chance of a false conclusion: real effects get missed (Type II errors), and the significant results that do appear are more likely to be false positives.
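A minimal simulation sketch of those two points, in Python (assuming numpy and scipy are available; the sample sizes, effect sizes, and the helper name estimated_power are illustrative, not from the lecture):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def estimated_power(n_per_group, effect_size, n_sims=2000, alpha=0.05):
    """Fraction of simulated two-group experiments that reach p < alpha."""
    hits = 0
    for _ in range(n_sims):
        control = rng.normal(0.0, 1.0, n_per_group)
        treatment = rng.normal(effect_size, 1.0, n_per_group)  # true effect, in SD units
        _, p = stats.ttest_ind(treatment, control)
        hits += p < alpha
    return hits / n_sims

# Both a larger sample and a larger effect raise the chance of a significant result.
for n in (20, 50, 200):
    for d in (0.2, 0.5):
        print(f"n={n:>3} per group, d={d}: chance of p < .05 ~ {estimated_power(n, d):.2f}")
```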
The difficulty of social psychology
We’ve reviewed loads of work in this module on how people influence other people’s beliefs, values, etc.
We’ve reviewed loads of work on how culture and society (e.g. norms) do the same.
These things are constantly changing and shifting.
This is the case within cultures, but also between cultures, across years, and even across days or times within a single year or day (rainy or not, etc.).
Everything in social psychology studies is influenced by loads and loads of things. It is impossible to squeeze them all into a study, and especially into a single experiment.
The very knowledge of the effects spreading (through teaching, media, etc.) can react back onto the results of future studies. Teaching quite literally changes the subject area being taught.
People are not static: they react against a study, they try to guess what is happening, or they might try to be nice. It’s a minefield of complications.
Lab experiments - blank slate
Lab experiments often ignore everything happening outside the lab.
This is okay in some cases as random assignment wipes these differences out between the groups.
But it is also possible those things outside the lab can contribute to a null replication (whatever is happening to all/most participants overpowers the effect).
It is also possible that those things outside the lab can explain an original effect (did the study for instance take place around a political event, or a holiday, or whatever else, and this might not be reported in the original study).
The importance of replication
Replication is an absolute cornerstone for anything to be considered a science.
We need to know if our results can be reproduced. We need to know if we can predict things. We need to know what context is required for a specific effect to occur as well.
This is pretty much the aim of science.
Things need to be studied in different groups to see if they apply - not just within a culture but across cultures as well.
Things need to be studied across time. What applies at one time point might not apply at another. This is across years, but possibly even across different days and times of the year.
Things need to be studied across situations.
This allows us to know the boundary conditions of effects, which is a benefit of replication efforts.
The replication crisis
A few years ago, Nosek and colleagues set out to replicate 100 social psychology and cognitive studies. They recruited labs all over the world to help.
How a study was chosen:
- All studies chosen came from leading journals in the field.
- All studies had to be able to be completed in a reasonable timeline.
- All studies had to not rely on some sort of expertise in a specific apparatus (like the BCI).
All researchers pre-registered their sample size and had to have at least 90% power based on the study they were trying to replicate.
25% of social psychology studies replicated using the p < .05 criterion.
36% of studies overall replicated.
If you combine the original and replication data into a pooled analysis, 68% of effects would still be significant.
Things to consider…
Not all studies were exact replications.
Some differed on details like whether a camera was present.
Some were done in different cultures/languages.
That presents not just language and culture issues but also relevance issues (one study was about reactions to a housing situation; in the original US study the students were actually living in that situation, whereas the replication, run in a different culture, only asked them to think about it).
Other replication attempts
Other large-scale replication attempts have found somewhat better replication rates.
A lot of this has to do with how the studies are chosen:
- If researchers pick a result they think is dubious in order to challenge it, the rate of replication is far lower than if they pick things they think will replicate.
- Some studies have been run ten times and replicated ten times.
Reasons for replication crisis
- Social psychology is complicated. Different cultures, time points (within and between years), different samples within countries, etc. -> norms change, values change, political leaders change.
- Outright fraud (this is quite uncommon as far as we know. Only a few social psychologists have been caught outright cheating).
- “The grey areas” of doing research. Often, the decisions made can greatly impact the results. This isn’t even always conscious.
Recall post-hoc moral reasoning
We are really good at justifying our moral judgements after making them.
We come up with all sorts of good reasons.
The same is true of research. We can justify anything if we think hard enough.
- But those participants might not understand the material, they are ESL.
- But those participants are X, or Y … they didn’t entirely follow the directions, etc.
- But those participants didn’t answer all the items … maybe they were in a rush, not paying attention, etc.
We can have very legitimate reasons for taking people out, but if we do it after looking at the results (to get significance) it is potentially very biased.
And there is no way to tell at that point.
Poor practice
Grey areas:
P-hacking:
- When people look at their data and try to get it to become significant by doing something that is potentially justified, but also potentially very biased and dubious (e.g. only including part of a dependent variable, excluding people only after looking at the data, etc.).
- Recruiting people only until you have significance, then stopping.
- Ignoring interactions/not reporting effects you found in order to focus on main effects (making the effect seem applicable to all people / glossing over context).
Poor practice:
- Not being transparent in the decisions you made when you try and publish your work.
- Not reporting studies that don’t find the results you want.
- Psychology has an implausibly high proportion of significant results - it can’t all be real.
Deciding in advance: focus
Researchers ran studies where they recruited people and then tried to prove a ridiculous hypothesis (that music can make you older).
They found that, by p-hacking enough after getting the data, they could find this effect.
With enough work, grey-area practices can make almost anything significant.
Statisticians in the field have found that false-positive rates can inflate to 40-50% with even moderate levels of p-hacking and dodgy practice.
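A minimal sketch of one of those grey-area practices - "recruit until you have significance, then stop" - simulated on data with no real effect at all (Python, assuming numpy and scipy; the batch sizes, recruitment cap, and helper name optional_stopping_significant are made-up illustrations). Even this single practice pushes the nominal 5% false-positive rate well above 5%; stacking several such practices is how rates climb towards the region mentioned above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def optional_stopping_significant(start_n=20, step=10, max_n=100, alpha=0.05):
    """Simulate one null study (no real effect), peeking at p after every new batch."""
    a = list(rng.normal(0, 1, start_n))
    b = list(rng.normal(0, 1, start_n))
    while True:
        _, p = stats.ttest_ind(a, b)
        if p < alpha:
            return True   # stop as soon as it "works"
        if len(a) >= max_n:
            return False  # give up at the recruitment cap
        a.extend(rng.normal(0, 1, step))
        b.extend(rng.normal(0, 1, step))

n_sims = 2000
false_positives = sum(optional_stopping_significant() for _ in range(n_sims))
print(f"False-positive rate with optional stopping: {false_positives / n_sims:.1%} (nominal 5%)")
```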
Future
Open Science Framework:
- Register all of your intentions online ahead of time. It gets time stamped.
- Your data analysis plan, your hypotheses, your variables, your planned stopping point for recruiting - it is all there and you can’t hide or change it.
- Some journals are giving extra merit to studies like this or are having special issues only devoted to these studies.
Sharing your data with other researchers.
Making decisions ahead of time not once you have seen the data:
- Pre-determined sample size stopping point (ideally based on a power analysis, as sketched below, though not strictly required).
- Pre-determined variables and your plan for them.
- Pre-determined analysis plan (what stats will I run).
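A minimal sketch of that power-analysis step for fixing the sample size in advance, using statsmodels (the assumed effect size of d = 0.5 and the 90% power target are illustrative choices, not values from the lecture):

```python
from statsmodels.stats.power import TTestIndPower

# Solve for the per-group n needed to detect an assumed effect of d = 0.5
# with 90% power at alpha = .05 in a two-group design (roughly 85 per group).
n_per_group = TTestIndPower().solve_power(effect_size=0.5, power=0.90, alpha=0.05)
print(f"Recruit at least {int(round(n_per_group))} participants per group")
```

You would then pre-register that number as the stopping point before collecting any data.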
Increased acceptance of pooled data (even if each individual study isn’t significant).
More publishing available for null results.
Much bigger emphasis on having an adequate sample size.
Red flags
If two sets of researchers each publish findings supporting their own ideas and opposing the other’s, this might be a red flag.
- Data might be getting suppressed by either or both groups.
- Could be cultural or differences in design they aren’t noticing.
Too many underpowered studies that are all significant.
- Lots of p = .04 across studies for the same effect.
- Sign of p hacking.
- Research shows these studies are less likely to replicate.
Loads of studies are all significant, but all of them are underpowered/have a low number of participants (see the arithmetic sketch after this list).
- The chance of finding the effect in all 5 studies, if it exists, even at 80% power, is only around a third (0.8^5 ≈ 33%).
- Most studies are underpowered, so this number can drop into the low single digits (e.g. 0.5^5 ≈ 3%).
- If you have 90% power, you should only detect a real effect about 9 out of 10 times. An effect should not be found every single time.
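A quick sketch of the arithmetic behind this red flag (plain Python; it only assumes the studies are independent): the chance that all of k studies detect a real effect is power^k, so an unbroken run of significant results from underpowered studies is suspicious.

```python
def prob_all_significant(power, n_studies):
    """Chance that every one of n_studies independent studies finds a real effect."""
    return power ** n_studies

for power in (0.9, 0.8, 0.5):
    print(f"power={power:.0%}: P(5 out of 5 significant) = {prob_all_significant(power, 5):.1%}")
# 90% power -> ~59%, 80% -> ~33%, 50% -> ~3%
```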
If a researcher has way too many academic publications in a year.
What does a failed replication mean?
- It could mean the original effect was spurious/never existed.
- It could mean that the original effect was real, but it doesn’t apply today.
- It could mean that the original effect exists, but it is moderated by another variable (one that was perhaps changed in the replication attempt without the researchers thinking it would matter).
- This could be a sample issue -> was your sample a different age? A different gender? Different culture?
- Might apply to some groups and not others, or the moderator could be a situational thing.
- It could just be due to random chance. Unless there is 100% power given the sample size, some failed replications are to be expected and are not a sign of any concern at all.
Obstacles to better science
Stats illiteracy: not every academic who does research understands how bad these poor practices are/the massive effect they can have.
Overconfidence in your own ideas (the stats are wrong, I’ll tweak them to get what I want since my idea is right).
You got promoted, hired, etc. based on your publication record.
Thinking no one reads your work anyway so what difference does it make.
You get grants, etc., based on your publication record - it becomes a bit easier to justify your choices when you need a job or want a promotion.
Replication
There will always be some differences between replication attempts.
Conceptual replication: the study finds a similar effect with a different design and expands the original finding.
Direct replication: attempt to do the study exactly as it was originally as much as possible.