W5 Peer Review, Replication, Empathy Flashcards
What are the steps of peer review?
Step 1 = read by an editor: does it suit the journal, does it have an impact on an area of study, is it exciting/influential?
Step 2 = sent to 2-3 expert anonymous reviewers (most journals let reviewers see who the authors are), who make one of the following decisions
REJECT it for the journal
INVITE A REVISION, with changes requested by the reviewers, sometimes to major parts of the experiment
ACCEPT it for publication (rare at this stage)
Step 3 = authors might revise the paper in line with the reviewers' requests, or argue why the changes are not necessary
Step 4 = if changes are made, another round of peer review is done; reviewers can again reject, revise or accept the paper, and this cycle sometimes repeats
What are the challenges of going through peer review?
- Often multiple iterations of peer review
- Each round of peer review can take several months, eg. 2-3 months
- Often given deadlines by which the revised manuscript must be resubmitted to be accepted
Why is peer review important? (3)
what is the counter-argument to this?
- Papers of poor quality are excluded
- Made up of multiple experts and editors to be as objective as possible
- Helpful and constructive to improve the manuscript
Other alternatives to peer review
Some argue EVERYTHING in raw form should be published, no gatekeeping or quality control, it will ‘self-correct’
What are the cons of only being able to access things as students/researchers vs. open access?
(Scientific journals typically offer a choice between subscription access, where readers pay, and open access, where authors pay)
- The general public gets excluded from the scientific literature, which excludes people from learning about psych/medical knowledge, and creates pressure to only publish in mainstream journals rather than start your own
OR - authors have to pay large fees, which may limit publications from less well-funded institutions and teams, biasing the scientific literature towards rich/Western institutions
What’s a solution for researcher vs. open access?
- Hybrid approach where some papers are paywalled and some are openly accessible OR
- Better science communication, eg. findings translated into a blog
What were the findings of the Reproducibility project?
Known as the replication crisis in psychology
- studied how well effects replicate when replicated DIRECTLY
While 97% of the original studies were statistically significant, ONLY 36% of the replications were statistically significant
The MEAN EFFECT SIZE in the replications was about HALF that of the originals
What is direct vs. conceptual replication?
- Direct replication = method is repeated as closely as possible
- Conceptual replication = different methods are used to test the same hypothesis
What are the 4 reasons why a replication study might fail to obtain the same results as the original?
- Unidentified contextual effects
- Unidentified individual differences = eg. participants in the first study might have all had higher levels of something that the replication sample did not, but these individual differences weren't controlled for initially
- Original could be TYPE 1 ERROR, “false positive”, 5% of the time, 1 / 20 tests
- Poor research practices, eg. no Power calculations
Type 1 vs Type 2 error, which is more serious?
- Type 1 Error = the null hypothesis is true but we reject it, a "false positive"/false alarm: we say there's an effect even though there's no real effect. Usually considered the more serious error
- Type 2 Error = the null hypothesis is not true BUT we retain it, a "false negative"/miss: there's an effect but we say there's no effect
How to avoid the ‘fishing expedition’ in research?
Decide a-priori about what you want to look for, otherwise you go on “fishing expedition” for statistically significant results
Eg. with uncorrected multiple comparisons, you can attribute scanner noise to 'neural activity' in a dead salmon
If we changed alpha level from .05 to .01 what errors would increase/decrease?
Type 1 error would decrease from 5% chance to 1% chance
Type 2 error would increase (saying there is no effect when there is an effect)
What would be the consequences of using a lower alpha level (eg. .01)?
If the alpha level is decreased, findings that might be valid would be rejected and not published, and more participants would be needed to produce the same results - you need to increase SENSITIVITY (Power) to really determine whether an effect is there or not
What is power?
What are the 2 things Power is affected by?
- the likelihood of correctly rejecting the null hypothesis, ie. of statistically detecting a significant effect in a sample when the effect really is there in the population
Power is affected by number of PEOPLE and number of TRIALS
How does low Power increase Type 1 and 2 errors?
With a small sample there is a greater chance that random error variance from any one person strongly affects the sample mean
Small sample sizes lead to both type 1 and type 2 statistical errors - commonly driven by outliers
In the last 5-10 years it has become the norm to conduct power analyses, and many older studies are harder to replicate because they were 'underpowered', eg. had low sample sizes
What do you need to consider in high sample sizes?
- Larger sample sizes minimise the impact of outliers and increase generalisability, BUT you NEED to consider effect size, since with very HIGH Power a minute difference between groups can produce a significant result even though the magnitude of the effect is trivial
Is it justified to use the same amount of participants as previous studies?
No - better to do a Power analysis to figure out the minimum sample size needed to get decent power (see the sketch below)
This can be limited by logistics and by the availability of previous/replication studies to estimate the effect size
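A minimal sketch of what such an a-priori power analysis might look like in Python using statsmodels; the effect size, alpha, and target power below are illustrative assumptions, not values from the lecture.

```python
# Hedged sketch: estimate the minimum sample size per group for a two-sample t-test.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_group = analysis.solve_power(
    effect_size=0.5,          # assumed medium effect size (Cohen's d)
    alpha=0.05,               # conventional significance level
    power=0.80,               # conventional minimum power
    alternative="two-sided",
)
print(f"Minimum participants per group: {n_per_group:.1f}")  # roughly 64 per group
```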
After Power analysis, what do we do?
Have a firm ‘stopping rule’ for recruiting participants regardless of statistical significance
If you check for significance after every participant or small subset and stop once you've reached it, this can inflate the type 1 error rate ("false positives") - see the simulation sketch below
Also check whether the task/performance was appropriately difficult, eg. avoiding ceiling and floor effects
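An illustrative simulation sketch (not from the course materials) of why such "peeking" inflates the Type 1 error rate: even with no true effect, checking for significance every few participants and stopping at the first p < .05 yields far more than 5% false positives.

```python
# Hedged sketch: simulate optional stopping under a true null hypothesis.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_simulations, max_n, alpha = 2000, 100, 0.05
false_positives = 0

for _ in range(n_simulations):
    data = rng.normal(0, 1, max_n)       # no true effect in the population
    for n in range(10, max_n + 1, 5):    # check significance every 5 participants
        if stats.ttest_1samp(data[:n], 0).pvalue < alpha:
            false_positives += 1         # stopped early and claimed an effect
            break

# Typically well above the nominal 5%, illustrating the inflated Type 1 error rate.
print(f"False positive rate with optional stopping: {false_positives / n_simulations:.2f}")
```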
How do you determine a formal stopping rule? and can we always test this?
- run a formal power analysis (can inform how many people you should study)
- No, it's unfeasible to do a power analysis for some complex designs; otherwise you need to justify why you didn't do one
Why are multiple experiments in a study good?
Try to replicate differences that you have already observed, because there is a 5% chance that any single significant difference is JUST DUE TO CHANCE
Multiple experiments within a manuscript, eg. 2 experiments that find the same result = 5% * 5% chance of false positive = 0.25% chance of a false positive
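A quick arithmetic check of the 0.25% figure; note this assumes the two experiments are independent tests.

```python
# Hedged sketch: chance that BOTH independent experiments are false positives.
p_single = 0.05                  # Type 1 error rate of one experiment at alpha = .05
p_both = p_single * p_single     # 0.0025
print(f"{p_both:.2%}")           # 0.25%
```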
What 3 things are needed for transparency and what happens if we don’t include it?
- Describe all the decisions made
- Include both non-significant and significant variables/results
- Share de-identified raw research data ‘open science framework’ for others to replicate or check your data
If we don't:
- Distorted sense of the scientific literature: published papers are more likely to contain Type 1 errors, and there might be many findings showing no effect while only the one showing an effect gets published
- Valuable knowledge is lost
What is Pre-Registration? / helps reproducibility
- Formally committing to your predictions, your rule for stopping participant recruitment, and how you will treat the data
Submitted to public repositories PRIOR TO THE STUDY
Pros / Cons of pre-registration?
PRO - Promotes transparency and reduces researcher degrees of freedom
PRO - More useful for voluminous and complex variables like fmri voxel activations
PRO - better for non controversial studies, exploratory studies where the effect is not predicted
CON - can make researcher decisions appear immune to critique; the plan is made in advance but not peer reviewed, which gives a bit of immunity to decisions that might not be well justified
CON - Limited utility for simple designs, eg. don’t need to screen for accuracy
Why is Reaction time screening used and what are the subtypes?
- identifying potential outlier reaction times, eg. very long reaction times driven by error variance
Absolute = screening against an absolute value, eg. exclude any RTs longer than 2000 ms
Relative = relative to each participant's mean reaction time, eg. screen out anything more than 3 s.d. above their mean reaction time
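A minimal sketch of the two screening rules on hypothetical reaction-time data; the 2000 ms and 3 SD cut-offs come from the card above, the data are made up.

```python
# Hedged sketch: absolute vs. relative reaction-time (RT) screening.
import numpy as np

rng = np.random.default_rng(1)
rts = np.append(rng.normal(600, 80, 40), [2500, 3200])  # hypothetical RTs in ms

# Absolute screening: drop any trial slower than a fixed cut-off (2000 ms here).
absolute_kept = rts[rts < 2000]

# Relative screening: drop trials more than 3 SDs above this participant's mean RT.
cutoff = rts.mean() + 3 * rts.std()
relative_kept = rts[rts < cutoff]

print(len(rts), len(absolute_kept), len(relative_kept))
```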
Pre-reg alternatives
What is the registered report format and what is it used for?
- it's a peer-reviewed plan for a study, which might be modified during review; ONCE IT'S ACCEPTED the journal guarantees to publish the paper, REGARDLESS OF THE SIGNIFICANCE OF THE RESULTS
- Used for hotly debated topics - gives protection for you as a researcher AND helps build a more established literature
Pre-reg alternatives
What is replicating Experiment 1 with Experiment 2, and its pros and cons?
- uses an existing framework and decreases the type 1 error rate from 5% to 0.25%, but any methodological flaws may simply be repeated
Pre-reg alternatives
What is the Multiverse approach?
run the analysis under all of the reasonable data decisions and report the results so readers can see how each decision affects the outcome; reveals potential confounding effects and explores ALL researcher degrees of freedom
What is the counterargument to Pre-reg alternatives?
If psychological science had a stronger theoretical basis, we wouldn't need pre-registration!
Why can you get a null result? and what is the significance of a null result?
- There isn’t an effect/relationship
- There is an effect but you've failed to detect it (Type 2 Error)
Null effects are still meaningful, and they can be published - you still get a green tick for correctly concluding there is no relationship when there really isn't one
If you get a null result, what 5 steps should you take next?
- need to figure out whether there really isn’t an effect or just type 2 error
- Do multiple experiments showing the same null result
- Show alternative divergent evidence, eg. A and B show no interaction but A and C show an interaction under the same conditions as A and B
- Show the 2 main effects to rule out the alternative explanation that the variables weren't manipulated properly; this supports a genuine absence of an interaction rather than alternative explanations
- Have high Power and reliability
Reliability =
Test retest =
Internal consistency =
Inter-rater reliability =
- consistency of a measurement
- reliability over time
- reliability across items
- reliability across researchers
What is Validity?
- Whether a particular operationalization truly captures the intended psychological process
Relationship between reliability and validity in sufficient/necessary conditions?
Reliability is often said to be a necessary but NOT sufficient condition for validity; it is necessary to the extent that the underlying psych process is stable over time
For dynamic processes, it is harder to determine reliability
A = consistent dots but not near the target?
B = dots are all over the place
C = dispersion of points away from target
D = dots consistently in the target circle
A = reliable, not valid
B = broadly valid dots where average would be close to psychological target construct, not reliable
C = not reliable but now average does not align with psych target construct, not valid
D = valid and reliable
Example of something reliable but not valid for attentional control?
IV Measurement: height
Is IV measurement: difference in RT between Stroop trials a RELIABLE MEASURE?
How does this affect validity?
Not reliable, since the Stroop effect is so similar between participants that their rank order naturally shuffles around from test to test
Only partially valid measure, as reliability is an important prerequisite for a valid measure
Is IV measurement: BOLD response in frontoparietal region for attention task a RELIABLE MEASURE?
Reliable measure that stays the same over time
Valid measure, as the brain region is related to attentional control
A highly concentrated graph with most scores the same is?
Rank ordering spread across x-axis is?
- showing an effect at the group level
- showing rank order reliability; more spread across individuals, but a weaker group-level effect
What is rank order reliability?
Pros / cons
- how well a measure is able to rank-order individuals within the sample, eg. person A scores lowest at T1 ALSO scores the lowest at T2
pros = Shows the robustness and replicability of an effect
pros = Reliable measurement at the INDIVIDUAL level
cons = does not guarantee a robust or replicable effect at the GROUP level
cons = Individual and group level reliability can conflict
Why is there Tension between group level effects and reliable individual differences/consistent rank orders?
- Because group level effects assume most individuals experience the effect to a similar (and large) extent
Eg. the Stroop effect: a strong group-level effect assumes minimal between-participant variation, ie. barely any participant variability in the Stroop effect
Problematic = many tasks that produce strong group-level effects are not reliable at the individual level, precisely because there is so little variation between individuals
Need to care about reliability in individual differences regardless of the research type
How to get good rank order individual reliability?
People need to experience the effect at distinctly different levels, consistently over time
What are the FOUR MEASURES to quantify reliability?
- Cronbach’s alpha = internal consistency between items, items need to measure the same construct, are responses to items consistent in measuring a construct?
- Split half correlation = eg. 100 trials split into 2 x 50 trials; calculate the correlation between the halves, which should be consistent/correlated if the measure is reliable. Error variability occurs depending on which trials end up in each half, so do multiple split-half correlations (see the sketch below)
- Test re-test correlation = how consistent scores are on the same test across 2 time points
- Intraclass correlation coefficient = used to assess consistency of measurement in clustered data
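A minimal sketch of a split-half reliability check with a Spearman-Brown correction; the simulated trial data and the odd/even split are illustrative assumptions.

```python
# Hedged sketch: split-half reliability of per-participant mean RTs.
import numpy as np

rng = np.random.default_rng(2)
# 30 hypothetical participants x 100 trials; each row gets a stable person-level offset.
trials = rng.normal(600, 80, size=(30, 100)) + rng.normal(0, 40, size=(30, 1))

half_a = trials[:, 0::2].mean(axis=1)   # mean of odd-numbered trials per person
half_b = trials[:, 1::2].mean(axis=1)   # mean of even-numbered trials per person

r_half = np.corrcoef(half_a, half_b)[0, 1]
r_full = (2 * r_half) / (1 + r_half)    # Spearman-Brown correction to full test length
print(f"Split-half r = {r_half:.2f}, corrected = {r_full:.2f}")
```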
Why is reliability important for measurement?
What is it most used for?
- Correlation between A and B = the maximum correlation between A and B that you can detect is fundamentally constrained by the measurement reliability of variable A and variable B (see the sketch below)
low reliability = might show insignificant result when there really is an effect (Type 2 error / false negative)
- Reliability coefficients are most commonly reported for questionnaire-based measurement
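A minimal sketch of this attenuation idea; the true correlation and reliability values below are illustrative assumptions.

```python
# Hedged sketch: observed correlations are capped by the reliabilities of both measures.
import math

true_r = 0.60          # hypothetical true correlation between constructs A and B
reliability_a = 0.70   # reliability of measure A
reliability_b = 0.50   # reliability of measure B

observed_r_max = true_r * math.sqrt(reliability_a * reliability_b)
print(f"Largest correlation you could expect to observe: {observed_r_max:.2f}")  # ~0.35
```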
How much power needed to adequately detect an effect?
What is issue for participants?
minimum of 80% power
Reliability of both measures has a HUGE impact on the number of participants NEEDED to sufficiently detect an effect
Most studies have not had hundreds of participants and thus probably don't have much power; they are 'underpowered' and will struggle to find an effect
What are the 2 consequences of having tension between group and individual effects?
- If one study has high reliability, but replication has low reliability, it can lead to a failure to replicate the findings
- Opposite: if the original study had poor reliability and found no effect, this makes it hard for follow-up studies to replicate that result
What are the 3 ways to improve measurement reliability?
- More measurement / trials = greater measurement reliability
- Consider different dependent variables
- Reduce the intensity of the manipulations, since there is an inherent tension between group and individual effects: strong group-level effects tend to have lower reliability at the individual level / in rank ordering
Reducing the manipulation might help reveal individual differences in the data, eg. in emotion-induced blindness, dialling down the emotional stimuli can reveal more distinct differences in emotional response between individuals
Recap: affective vs. cognitive empathy?
- Affective empathy - feeling what someone else is feeling, aware of source of emotion, feel distressed when someone is upset
- Cognitive empathy - being able to understand someone's thoughts, feelings and beliefs, especially when they differ from your own; also called mentalizing / theory of mind
Both interact for social-emotional functioning
What are the findings for Neural correlates in affective vs. cog empathy?
Patients with damage to the inferior frontal gyrus have impaired AFFECTIVE empathy but intact COGNITIVE empathy
Patients with damage to the ventromedial prefrontal cortex have impaired COGNITIVE empathy but intact AFFECTIVE empathy
= a double dissociation
What is an alternative explanation if they only found that the ventromedial prefrontal cortex has impaired COGNITIVE empathy but intact AFFECTIVE empathy?
CE is more effortful for most people, so the single dissociation could be because CE is easier to damage as it is a more complex and cognitively demanding process
What 2 contrasting things activates the Temporoparietal junction (TPJ) and why?
cognitive empathy/TOM tasks AND invalid trials in Posner cueing paradigm
The TPJ is important in the VENTRAL attentional network, acting as a "circuit breaker" between what we were doing and engaging with something else (BOTTOM-UP ATTENTION)
The link between CE/TOM ability and doing well on invalid trials might reflect a higher ability to disengage from a salient stimulus (the self / an invalid cue) and shift mentality/attention towards something different
RECAP: 2 main pathways for attention dorsal vs. ventral pathways
- Dorsal (top): frontal eye fields, intraparietal sulcus
Involved in voluntary / top-down / goal-directed attention, for tasks and concentration
- Ventral (bottom): ventral frontal cortex and temporoparietal junction
Involved in bottom-up attention to salient and unexpected stimuli, acting as a circuit breaker to notice exogenous things in the environment