Lec 2 Flashcards
Sources of replication crisis
- Priming elderly concepts affects walking speed
- Semantic vs behavior priming
- Define Stat power
Replication crisis: what & why?
- Replication crisis: Difficulty/inability to replicate the results of many scientific studies in subsequent investigations
- Need independent replications: the replicating lab has no ties to the original lab, OR is even a rival of it
- Example 1: Priming elderly concepts affects walking speed
- Semantic priming: see word nurse; you may think about doctor
- Behavior priming: see words associated w/ older people (ex. Florida, Bingo, Cadillac); ppl who were primed with these “elderly primes” walked slower
- No one really bothered to try replicating these studies and publishing the results
- Until 15 yrs later, a Belgian group of scholars tried and failed to replicate the study 15 times
- Hints that something was awry
- Bem published paper in JPSP, best journal in social psychology, documenting ESP
- ESP = extra sensory perception; ability to predict the outcome before it happens better than chance level
- As scientists, we have to be skeptical of all claims
- Hardly anyone conducted replications
- Even when replications conducted, they were hard to publish
- Changed somewhat with open access journals; but authors need to pay and these journals are not as reputable
- Many “significant” findings despite low statistical power to detect effects
- Ex. men tend to be taller than women
- If the sample is small, power is low
- Stat power: ability to detect an effect assuming an effect is there
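The definition of power above can be made concrete with a small simulation. This is a minimal sketch, not something from the lecture: the standardized effect size of 0.5, the known SD of 1, and the z-test are all assumptions chosen to keep the example short. It runs many simulated studies in which the effect is real and counts how often p < .05.

```python
import math
import random

def z_test_p(sample_a, sample_b, sd=1.0):
    """Two-sided p-value for a difference in means, assuming a known SD."""
    n_a, n_b = len(sample_a), len(sample_b)
    diff = sum(sample_a) / n_a - sum(sample_b) / n_b
    se = sd * math.sqrt(1 / n_a + 1 / n_b)
    return math.erfc(abs(diff / se) / math.sqrt(2))

def simulated_power(effect_size, n_per_group, n_studies=2000, seed=1):
    """Share of simulated studies with p < .05 when the effect is real."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        group_a = [rng.gauss(effect_size, 1.0) for _ in range(n_per_group)]
        group_b = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        if z_test_p(group_a, group_b) < 0.05:
            hits += 1
    return hits / n_studies

# Hypothetical effect size (0.5 SD); only the sample size changes.
power_small = simulated_power(0.5, 20)
power_large = simulated_power(0.5, 200)
print(f"n=20 per group:  power ~ {power_small:.2f}")
print(f"n=200 per group: power ~ {power_large:.2f}")
```

Under these assumptions the small-sample design misses the real effect most of the time, while the large-sample design almost always detects it.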
Medicine is also experiencing replication crisis
Once you see it, you can’t unsee it
- p < .05 is meaningless; it is an arbitrary benchmark
Sources of replication crisis
- 7 data analysis choices to make p < .05 (p-hacking)
- Simmons, Nelson, & Simonsohn 2011
- Study on listening to Beatles (When I’m 64) vs Microsoft sounds
- Methods
- Results
P-hacking - Abusing experimenter degrees of freedom: “normal” research practices make the impossible possible
- there are many data analysis choices we can make to have p < .05
- Under-powered designs
* N=20 per cell was something we aspired to
- Optional stopping
* Collect a sample, then ask: do I have p < .05?
* No: keep collecting data
* Yes: stop sampling and analyze the data
- Optional starting
* Have a pilot study
* If the data are significant, you don’t call it a pilot study
* If the data are not significant, restart the study until you have significant data
- Dropping conditions
* Ex. you originally have 3 conditions and no sig data; drop to 2 conditions so the data are significant
* If you drop a condition, you change your hypothesis
- Dropping dependent variables
* Selective reporting of DVs
* Measure a construct in 3-4 ways, find an effect for 1 measure but not the other 3; only report the 1 measure that works and don’t mention the other 3
- Dropping participants
* Sometimes we have to drop people
* Shouldn’t drop participants in a biased manner
* Ex. dropping ppl just b/c they work against the hypothesis
* Ex. keeping outliers that work for our hypothesis
- Use of exploratory moderator
* Ex. look for effects using specific participants like Christians, Women/Men only, Teens only
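To see why just one of these choices matters, optional stopping can be simulated. This is a sketch under stated assumptions, not the procedure from any cited paper: there is no true effect at all, the SD is known to be 1, and the hypothetical researcher starts with 20 per group and adds 10 per group after every non-significant test, up to 100 per group.

```python
import math
import random

def p_value(group_a, group_b):
    """Two-sided z-test p-value, assuming a known SD of 1 in both groups."""
    n_a, n_b = len(group_a), len(group_b)
    diff = sum(group_a) / n_a - sum(group_b) / n_b
    se = math.sqrt(1 / n_a + 1 / n_b)
    return math.erfc(abs(diff / se) / math.sqrt(2))

def false_positive_rate(peek, n_start=20, n_step=10, n_max=100,
                        n_studies=2000, seed=7):
    """Share of null studies (no true effect) that end with p < .05."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(n_studies):
        a = [rng.gauss(0, 1) for _ in range(n_start)]
        b = [rng.gauss(0, 1) for _ in range(n_start)]
        if peek:
            # Optional stopping: test, and add participants only if p >= .05.
            while p_value(a, b) >= 0.05 and len(a) < n_max:
                a += [rng.gauss(0, 1) for _ in range(n_step)]
                b += [rng.gauss(0, 1) for _ in range(n_step)]
        if p_value(a, b) < 0.05:
            false_positives += 1
    return false_positives / n_studies

rate_fixed = false_positive_rate(peek=False)   # sample size decided in advance
rate_peeking = false_positive_rate(peek=True)  # stop as soon as p < .05
print(f"fixed n: {rate_fixed:.3f}, optional stopping: {rate_peeking:.3f}")
```

With a fixed sample size the false-positive rate stays near the nominal 5%; with repeated peeking it climbs well above that, even though nothing else about the study changed.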
Simmons, Nelson, & Simonsohn 2011
- Apart from discussing how p-hacking works theoretically (prev slide), they did an experiment
- Study
- Gp 1: 20 ppl listen to Beatles (when I’m 64)
- Gp 2: (control) listen to Microsoft sounds
- Asked them – how old are you?
- Results: after listening to the Beatles’ When I’m 64, participants “became” younger by 1.5 years, which is impossible
- Based on the p value (p < .05), we would think it is true
- But the researchers here used all the p-hacking techniques (prev slide) to come to these results
- IOW: if this “bullshit” finding can be presented as “true”, then published “reasonable” findings presented as “true” may in fact not be true
Sources of replication crisis
- Publication bias
- File drawer problem
- Example
Publication bias
- Published literature is not the same as the complete literature (b/c many studies don’t make it into the journals)
- Ex. Results don’t support H, results are boring, results don’t work out
- So meta-analysis may not actually represent all findings
- File drawer problem: positive results bias – publication bias where authors are more likely to submit, and editors more likely to accept, positive results over -ve or inconclusive ones
- Ex. In lit, we know there are 9 +ve studies, 0 -ve studies
- In reality, there were 40 studies that were run
- Lit shows 9/9 are +ve; reality = 9/40 are +ve
- IOW we just know the numerator
Consequences of the replication crisis
- Open science collaboration 2015
- Methods
- Results
- 1 Issue about this study
- Why is 100% replicability not desirable?
Consequences of the replication crisis
What happens when we try to replicate?
- Open science collaboration 2015
- Coordinated attempt (with over 200 authors around the world) to replicate 100 studies from 3 high-impact psychology journals
- Only 39% of studies were successfully replicated; only 25% of studies in social psychology
- Note:
- This is not a representative sample – authors did a convenience sample; and may have chosen studies that are less likely to be replicable
- 100% replicability might not be desirable either
- We don’t want 100% b/c we want studies to foster creativity
- Ex. we know this H has a low probability of being sig; if it does turn out significant, that’s cool
Consequences of the replication crisis
- Why does publication bias make meta-analysis meaningless?
- Funnel plot
Failures to replicate studies - Who cares about a few non-replications?
- Replications only test robustness of one study
- Hundreds of studies support stereotype threat & ego depletion
- Meta-analysis to the rescue!
- Publication bias makes meta-analyses (practically) meaningless
- You only meta-analyze published data; you have no idea about the non-published
- Funnel plots can spot problems
- Funnel plots: look at sample size (iow more power) vs effect size
- Ex. Avg effect = .2
- Each dot = 1 study
- Some studies show -ve effects, some show large +ve effects
- As samples get larger, each estimate gets more precise; the overall estimate converges on the real effect size
- Red = p > .05 = non-significant results; these studies don’t make it into the literature
- This causes the average effect to be overestimated (at 0.3)
- NOTE: the quality of info coming out from a meta-analysis cannot be better than the quality of info that went in (Garbage in, garbage out)
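The funnel-plot argument above can be sketched numerically. This is an illustrative simulation with assumed settings (5000 studies, 10 to 200 participants per group, a standardized effect whose standard error is approximated as sqrt(2/n)); only the true effect of 0.2 comes from the notes. Each simulated study is one "dot"; only the significant ones are treated as published.

```python
import math
import random

TRUE_EFFECT = 0.2  # the "avg effect = .2" from the notes

def run_study(rng, n_per_group):
    """Return (observed effect, is_significant) for one simulated study."""
    se = math.sqrt(2 / n_per_group)          # approx. SE of a standardized difference
    observed = rng.gauss(TRUE_EFFECT, se)
    significant = abs(observed / se) > 1.96  # two-sided p < .05
    return observed, significant

rng = random.Random(42)
studies = [run_study(rng, rng.randint(10, 200)) for _ in range(5000)]

all_effects = [d for d, _ in studies]
published = [d for d, sig in studies if sig]  # the file drawer keeps the rest

mean_all = sum(all_effects) / len(all_effects)
mean_published = sum(published) / len(published)
print(f"mean of all studies:       {mean_all:.2f}")
print(f"mean of published studies: {mean_published:.2f}")
```

Averaging every study recovers roughly the true effect, while averaging only the significant ones overestimates it, which is the garbage-in, garbage-out problem for a meta-analysis of published data.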
Consequences of the replication crisis
- 5 reasons why replications fail
- 2 ways to improve studies
- Define power
- 2 ways to increase Power
- Confirmatory vs exploratory studies
Psychology’s renaissance
- How is science self correcting?
- 4 ways we are showing improvement
Some argue that our field is fucked
Some argue that all is well
- Replications fail for many reasons
- Original result was false positive
- Different implementation of paradigm
* Perhaps unskilled experimenters
- Different experimenters
- Different populations (ex. sample in Canada vs China)
* Effect is context-sensitive, not general (Ex. only apply to NYU students not the overall population)
- Conceptual, not direct replication (ex. priming ppl w/ different methods to support priming)
* Theory might not generalize, but original finding might stand
- If ego depletion (a topic w/ 600 studies) has a problem, the field has a problem
How to improve? Consider power & confirmatory studies
- Power
- Probability of finding effect, when effect is real
- Previously we ignored power
- Run more high-powered designs
- Increase sample sizes
* Replace n=20 with N=200 rule of thumb?
- Within-subject designs: increasing amount of observations you collect w/in a person
- Understand the difference between confirmatory & exploratory studies
- Exploration results in a high rate of false positives
- Need to run a confirmatory study afterwards
- Exploration is fine, but don’t frame exploration as confirmation
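The two routes to higher power listed above (bigger samples, within-subject designs) can be compared with a rough normal-approximation formula. This is a sketch under assumptions that are not in the notes: a standardized effect of 0.5, a two-sided alpha of .05, and, for the within-subject case, each person measured once in each of two conditions with an assumed correlation `r` between the two measurements.

```python
from statistics import NormalDist

Z = NormalDist()
Z_CRIT = Z.inv_cdf(0.975)  # two-sided alpha = .05

def power_between(d, n_per_group):
    """Approximate power of a two-group (between-subject) comparison."""
    shift = d * (n_per_group / 2) ** 0.5
    return Z.cdf(shift - Z_CRIT) + Z.cdf(-shift - Z_CRIT)

def power_within(d, n_people, r):
    """Approximate power when each person is measured in both conditions.

    r is the assumed correlation between a person's two measurements.
    """
    shift = d * (n_people / (2 * (1 - r))) ** 0.5
    return Z.cdf(shift - Z_CRIT) + Z.cdf(-shift - Z_CRIT)

# Hypothetical numbers: d = 0.5, r = 0.5.
print(f"between, n=20 per group:  {power_between(0.5, 20):.2f}")
print(f"between, n=200 per group: {power_between(0.5, 200):.2f}")
print(f"within, 20 people:        {power_within(0.5, 20, 0.5):.2f}")
```

Under these assumptions, moving from 20 to 200 per group pushes power close to 1, and the within-subject design gets more power out of 20 people than the between-subject design gets out of 40.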
But: Psychology’s renaissance
- Science is self-correcting
- It is self-correcting only if scientists are correcting other scientists; can be painful
- b/c no one likes to be told they are wrong
- We are showing signs of improvement
- More powerful studies (ex. can’t publish a study w/ a sample of 20 ppl)
- More awareness of problems
- Many changes at journals
- More replications
- More null results
- Pre-registration of hypotheses
- Badges for open science
- Open data
- Open materials
- Pre-register
Self-control predicts the good life
- Walter Mischel’s Marshmallow Test
- methods
- DV
- 6 things delay of gratification predicts
- fMRI study results
- SC and brain activity
- 2 strategies to delay gratification
- 3 Criticism of the Marshmallow Test
Self-control predicts the good life
Walter Mischel’s Marshmallow Test
- 4-6 yo Stanford Uni Daycare kids (from 1960s) given choice between 1 marshmallow now vs 2 marshmallows in 15 minutes
- DV: how long could children wait before eating the one marshmallow?
Delay of gratification in kids predicts the “good life” in adulthood
- Time to delay as child predicts adult:
- SAT scores
- Educational attainment
- Body Mass Index (BMI)
- Drug use
- Rates of divorce
- Activity in Human brain
- fMRI study results: ppl who could wait longer as kids showed more activity in the right inferior frontal gyrus (rIFG) when inhibiting impulses
- For kids w/ low SC: as adults, when they see rewarding stimuli (ex. money, sex, food), their ventral striatum is more active and their rIFG is less active when tempted
- Delay is increased by strategies that shift:
- Attention: Distract yourself
- Appraisal: “See marshmallows as puffy clouds”
Criticism of the Marshmallow Test
- Small samples: Ex. only 27 participants in brain study
- Relies on a trustworthy experimenter (Kidd et al.,
- Is this test of trust for authority?
- IOW: the kids may ask: is this experimenter trustworthy? If he is, it makes sense to wait; if not, it’s better to eat the marshmallow right away
- Poor kids might do poorly on the test, yet this says nothing about self-control
- Ex. they encounter more scenarios where parents promise one thing but, lacking money, can’t keep the promise; so they trust ppl less, including the experimenter
- For them, delaying gratification
- Replication (Watts et al., 2018):
- Half the effect size of original
- Effect disappears when controlling for SES & early cognitive ability
- IOW: the marshmallow study could reflect difference in SC, beliefs about the stability of the world, SES & IQ
Self control predicts good life
- delay discounting
- 4 things delay discounting predicts
Adult marshmallow test
- Discounting curve
- You only discount so much
Delay discounting is important
- Delay = delay you are willing to make
- Discount = the discount you are willing to take
- Delay discounting (AKA temporal discounting)
- How much time/delay takes away (discounts) from present value of smth
- Predicts the good life
- Savings for retirement
- Credit card debt
- Procrastination
- Addiction
- Discounting large future pleasure for small current pleasure
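One common way to draw a discounting curve is the hyperbolic form V = A / (1 + kD). The notes do not say which model the lecture used, so this is an assumption, and the dollar amounts, the 30-day delay, and the two discount rates `k` below are hypothetical values for illustration.

```python
def discounted_value(amount, delay, k):
    """Present (subjective) value of `amount` received after `delay`.

    Hyperbolic form V = A / (1 + k * D); a larger k means steeper discounting.
    """
    return amount / (1 + k * delay)

def prefers_smaller_sooner(small_now, large_later, delay, k):
    """True if the immediate reward beats the discounted delayed reward."""
    return small_now > discounted_value(large_later, delay, k)

# Adult "marshmallow test": $50 now vs $100 in 30 days (hypothetical numbers).
for k in (0.01, 0.1):
    value = discounted_value(100, 30, k)
    choice = "takes $50 now" if prefers_smaller_sooner(50, 100, 30, k) else "waits for $100"
    print(f"k={k}: $100 in 30 days feels like ${value:.2f} -> {choice}")
```

A shallow discounter (small `k`) still values the delayed $100 above $50 and waits; a steep discounter (large `k`) values it below $50 and takes the smaller reward now.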
Course Reader: The replication crisis is my crisis
- results on replication crisis
- Simmons et al’s view about stat sig studies
- Gelman’s view on honest rs
- 3 ways we are chasing smoke rather than results
- Ego depletion
- Problem with ego depletion studies
- Stereotype threat
- 3 things replication crisis calls for
The replication crisis is my crisis
- Project attempted to replicate 100 psych experiments
- Only 1/3 are replicated
- In social psych, only ¼ are replicated
- Key findings in social psych could not be replicated
- Ex. power posing influences hormones and boosts confidence
- Ex. Reminding ppl of money influences opinions or b
- Ex. Administering oxytocin increases trust
- Ex. moral misdeeds increase hand washing behavior
- This problem is systemic, and comes from how we conduct science
- Simmons et al
- State that small data-analysis decisions can allow anything to be presented as statistically sig
- Flexible data collection and analysis practices make impossible effects look possible and sig
- Gelman
- We don’t need to actively hack our data for it to lead to erroneous conclusions
- These biases in data analysis may not be conscious, and rs may not be aware that their data-related decisions skew their conclusions
- IOW: honest researchers may be reaching erroneous conclusions frequently
- Publication bias: Most serious problem; only publish sig results
- Journals force researchers to focus on these and ignore null results
- Aka file drawer effect
- We do not know whether the rs that gets published is well supported
- These 3 ideas suggest we may be chasing smoke rather than results that are real
- Data flexibility can lead to a raft of false +ve
- This process occurs w/o rs being aware
- The size of the -ve results file drawer is unknown
- Inzlicht: studied ego depletion
- Ego depletion: we have a limited reservoir of energy to exercise self-control and other mental capacities
- If we use up that energy supply suppressing our desire to smoke now, we are more likely to break down and eat the last piece of pie later
- This idea was super influential
- Inzlicht’s work was critical of the model
- A study using a pre-registered replication attempt w/ 2000+ participants
- Results – nothing
- 3/24 labs found no effect
- 1 lab found a sig opposite effect
- If a mass study showed ego depletion is bogus, why did so many labs replicate the result previously?
- Stereotype threat: ppl are at risk of conforming to -ve stereotypes about the social gp they belong to
- This impacts tests and job performances
- Explains the gender gap in science and math achievement
- But the rs may not be as robust as it seems; a mass study is currently trying to replicate this result
- Many other sub-fields in psych have this issue, even cancer medicine and economics
- This crisis calls for more stat power, transparency in null results, and confirmatory studies
Course Reader: How reliable are psychology studies
- 3 reasons why cog psych is 2 times more likely to be replicated compared to studies from social psych?
- 3 reasons why 2 attempts of the same exp produce diff results
- Mitchell’s POV on rs who replicate others’ studies
How reliable are psychology studies – Ed Yong
- Nosek and 270 peers repeated 100 studies to see if they could get the same results a 2nd time
- Many classic TB experiments can’t be replicated
- Causes
- Publication bias: journals only publish +ve results (those that confirm the rs hypothesis)
- -ve results are left in “file drawer”
- Ex. check to see if they hv stat sig result b4 collecting more data
- Ex. only report “successful” experiments
- Aka p-hacking: trying to get +ve results from ambiguous data
- IOW: literature is filled by false discoveries
- Since the “reproducibility crisis” threatens credibility of the field, some argue this crisis DNE
- Result:
- 97/100 studies reported stat sig originally
- 36% of the replications were sig
- This doesn’t mean only 1/3 of psych results are true
- p < .05 = sig
- IOW if there were no real effect, the chance of getting results at least this extreme by fluke would be < 1/20
- This threshold is meaningless b/c it suggests that results which just cross the .05 line are magically more “successful”
- For effect size (strength of a phenom): effect sizes in the replications were half those of the originals
- Ex. if red lights make ppl angry; effect size = how much angrier they get
- Nosek says: results aren’t great; this means psychologists are the first to tackle these problems
- This replication project shows that science is self-critical: it questions its assumptions, methods, and findings
- The findings are still challenging to interpret
- Most controversial finding: cog psych is 2 times more likely to be replicated compared to studies from social psych
- The effect sizes from both disciplines declined; cog experiments have larger effects to begin w/ b/c social psychology deals w/ issues that depend on the context
- Ex. how the eye works is more consistent across ppl than how ppl react to self-esteem threat
- Cog experiments use w/in subject designs; social exp use b/w subject designs, and ppl vary way more in social psych experiments
- Failed replications don’t discredit the original studies; successful replications don’t “enshrine” them
- Reasons why 2 attempts of the same exp produce diff results
- Random chance
- Original/replication exp flawed
- Different participants/methods
Mitchell
- Want to know if stroop effect or endowment effect (ppl place more value on things they own) can be replicated
- Suggests researchers involved in this project may be biased toward “disproving” original findings
Nosek
- Clarifies most replicators worked w/ rs from the original studies
- Only 3/100 refused to help
- They pre-registered their plans
- Decided on every detail of methods and analysis beforehand to prevent p-hacking
- Didn’t allow researchers to choose studies so they can take revenge on the original rs
- Those who failed to replicate studies were surprised
How to do better?
- rs should do public pre-registration of rs plans; specify H and methods in advance and in detail so they can’t cherry pick results
- Run larger studies by collaborating with other centers to get more participants
- Upload materials or code to open databases so it’s easy for others to check their work
There is change
- Rs pay more attention to replication, stat power, p-hacking
- Some journals started to publish results of pre-registered studies
- Rs work w/ other labs to replicate controversial early studies
- Center for Open Science awards $1000 to each of the first 1000 teams who pre-register and publish their studies
- Efforts extend to other fields
Course Reader: A gradient of childhood self-control predicts health, wealth, and public safety – Moffitt
- Background
- Opt out policy
- Crime reduction policy
- 4 hypotheses
- Since self-control is malleable and low self control is influential, policy makers use “opt out” schemes to have ppl eat healthy food, save money, and obey laws
- The default options require no effortful SC
- If ppl have to opt out of default health-enhancing programs or payroll deduction retirement savings schemes, those w/ low SC tend to take the easy option to stay in programs as opting out requires unappealing effort
- Crime reduction policy: discourage offenders by making law breaking require effortful planning (ex. antitheft device in cars –> more effort to steal car)
- Looked at 4 policy-relevant hypotheses
- Looked at whether kid’s self-control predicted later health, wealth, and crime across a low to high self-control gradient
* If self-control effects follow a gradient, interventions that achieve small improvements in SC for individuals can shift the distribution of outcomes in a good direction
- Since some ppl moved up the SC rank over the yrs in the study, rs can test the hypothesis that improving SC is associated w/ better health, wealth, and public safety
- Since the study looked at whether study members smoked as teens, left secondary school early, or became teen parents; rs can test the hypothesis that kids w/ low SC make these mistakes as teens, and this closes opportunities and puts them in lifestyles harmful to health, wealth, and public safety
* If SC’s influence is mediated by teens’ mistakes, teens can be a good window for intervention policy
- Since the study assessed SC as early as 3, rs tested if indiv differences in preschoolers’ SC predict outcomes in adulthood
* Suggests early childhood can be an intervention window
Course Reader: A gradient of childhood self-control predicts health, wealth, and public safety – Moffitt
- Study 1: Dunedin
- Study 2: Siblings
Methods
Dunedin study sample
- Track 1040 ppl born in 1972-1973
Childhood SC
- For the 1st decade of life: used 9 measures of SC
- The 9 measures are +vely correlated
Adult outcomes:
- Health, wealth, and crime outcomes were assessed at age 32
Sample for sibling-comparison analysis
- E-risk study
- Track 2230 twins born in England and Wales in 1994-95
Childhood SC at age of 5Y
- Same SC measure in Dunedin study
Children’s outcomes at Age 12 Y
- Children report delinquent b and smoking
- Teachers rated their educational performance in Eng and math
Course Reader: A gradient of childhood self-control predicts health, wealth, and public safety – Moffitt
- Results
- After controlling for SES and IQ, what does childhood SC predict?
- Sibling comparisons
Results
- Rs looked at kid’s SC in 1st decade of life
- Collected reports from rs-observers, teachers, parents, and kids at age 3,4,7,9,11 yo
- SC avg were higher among girls than boys, but the health, wealth, and public safety implications of SC were similar for both sexes
- Results showed those w/ greater SC were more likely to come from high SES families and to have higher IQ
- So, rs looked at whether childhood SC predicted adults’ health, wealth, and crime independent of social class origins and IQ
Predicting health
- When the kids became 32 yo, rs looked at their CV, respiratory, dental, and sexual health; and inflammatory status
- Merged the 5 clinical measures into a physical health index for each member
- 43% had none of the biomarkers
- 37% had 1
- 20% had 2+
- Childhood SC predicted adult health problems after controlling for IQ and SES
- Clinical interview w/ ppl 32 yo to assess depression and substance dependence based on DSM
- As adults, kids w/ low SC were not at elevated risk for MDD
- They did hv elevated risk for substance dependence, even when controlling for SES and IQ
- Ppl who observed kids w/ low SC also rated them w/ substance use problems
Predicting wealth
- Study members’ social class origin and IQ were strong predictors of SES status and income
- Poor SC has incremental validity in predicting SES status and income
- At age 32yo, 50% of members were parents
- Childhood SC predicted whether these ppl’s bb were being reared by 1 vs 2 parents (SES and IQ were controlled)
- At age 32 yo, kids w/ poor SC were less financially planful
- Less likely to save and hv fewer financial building blocks
- Struggling financially
- Poor SC was a stronger predictor of financial difficulties than SES and IQ
- Verified by observers
Predicting Crime
- Hv records of participants’ court conviction
- 24% of participants were convicted of crime by age 32 yo
- Kids w/ poor SC were more likely tb convicted of criminal offence after controlling for SES and IQ
SC gradient
- Low SC –> worse health, less wealth, more crime
- High SC –> opp
- Removed 61 ppl w/ ADHD –> same results
- Also looked at whether SC effects operate throughout the gradient or only affect low SC kids
- Results = same
- What would happen if SC improved?
- Looked at kids w/ increased SC from childhood to young adulthood
- Results: those w/ increased SC hv better outcomes after controlling for original SC
- The results may be applied to interventions w/ caution
SC and adolescent mistakes
- Data collected at age 13,15,18, and 21 showed kids w/ poor SC are more likely to make mistakes as teens –> snares that trap them in harmful lifestyles
- Kids w/ low SC began smoking at age 15, left school early w/ no education qualifications, and became unplanned teenage parents
- The lower the SC –> the more snares encountered –> worse health, less wealth, more crime
- Looked at whether snares explain the LT predictive power of SC
- Used stats control
* The snares weaken the effect of SC on health, substance dependence, SES, income, single parent child rearing, financial plans, financial struggles, and crime
* Direct effect of SC is still sig
- Association b/w childhood SC and adult outcomes among teens who did not hv snares (utopian control gp) is sig
How early can SC predict health, wealth, and crime?
- SC assessments from age 3-11 yo
- Preschooler’s SC sig predicted health, wealth, and convictions at age 32, mod effect sizes
Sibling comparisons
- Quasi exp rs design: one way to isolate the influence of SC is to track and compare siblings
- Qs: does the sibling w/ poorer SC hv worse outcomes than his/her more self-controlled sibling?
- Used the Environmental-Risk Longitudinal Twin Study (E-Risk), which tracked a birth cohort of British twins
- When twins were 5yo, rs rated each child’s SC; twins were followed up until 12 yo
- SC predict adult outcomes as seen in Dunedin study
- Result: 5 yo sibling w/ poorer SC were more likely to begin smoking at 12 yo (precursor of poor adult health), engage in antisocial b (precursor of adult crime)
- Sig even controlling for sibling diff in IQ
Course Reader: A gradient of childhood self-control predicts health, wealth, and public safety – Moffitt
- Comments
- What does SC predict?
- “one-two punch” scheduling of intervention
- Benefit of Universal intervention for SC
- Low SC and generation effect
Comment
- Diff levels SC as kids predict indicators of health, wealth, and crime
- Rs can isolate effects of child’s SC from effects of variation in kid’s IQ, SES, and home lives
- Should target SC for intervention policy
- Differences b/w kids in SC predict adult outcomes about as well as low IQ and SES
- But low IQ and SES are difficult to change via intervention
- Low SC –> poor outcomes
- This supports why we should hv opt-out programs b/c adults avoid the effortful planning needed to opt out of default programs
- Opt out programs work best for those who are low in SC
- For timing of programs to enhance SC
- Findings suggest “one-two punch” scheduling of intervention in early childhood and teens
- Low SC in childhood –> adolescent mistakes (ex. smoking, leave school, hv unplanned bb) –> lifelong effects on health, wealth, and crime outcomes
- Interventions in adolescence that prevent the consequences of teenagers’ mistakes can improve wealth, health, and public safety of the population
- The fact that childhood SC predicts adolescent mistakes implies that early childhood intervention can prevent mistakes
- Among teens who finished HS as nonsmokers and nonparents, the SC they had as kids still explains variation in their health, finances, and crime when they are in their 30s
- Early childhood intervention that enhances SC –> more return on investment than harm reduction programs targeting teens alone
- Should early intervention to enhance SC take a targeted approach vs a universal approach?
- Health, wealth, and crime outcomes follow a SC gradient
- This suggests that intervention can help those w/ high SC as well
- Universal interventions that help all can avoid stigmatizing anyone and hv more support
- SC can change
- Kid programs that increase SC are +vely evaluated
- Studies of ppl from diff countries and eras support that one’s SC influences health, wealth, and public safety, and that SC is a policy target
- Dunedin participants w/ low SC had unplanned bb, who are now growing up in low-income single-parent households
- Shows that one generation’s low SC disadvantage pass on to next generation
- Modern society demands SC for survival
- Ex. stress health and wealth to avoid disability and poverty
- Ex. Imprison law breakers, ease of divorce, access to addictive substances, etc