Week 6 - From Crisis to Crisis: Reliability and Validity, Part 2 Flashcards
What is the replication crisis (the failure of studies to replicate) a product of?
– Actual fraud
– Questionable research practices (p-hacking, tweaking the hypothesis slightly)
– Mistakes/lack of understanding of research methods and statistics
– Could it be a problem with how we measure things, i.e., uncertainty over how to measure them? (If so, it would be a measurement crisis!)
What is 'Measurement Schmeasurement'?
Flake and Fried (2020)
-Looked at questionable measurement practices
■ Education behaviour - Between 40% and 93% of measures reported in studies lacked reporting of validity evidence (Barry et al., 2014).
■ Emotion research - 356 measurement instances were coded; 69% made no mention of the development process (e.g., whether a measure was suitable for a specific population) (Weidman, Steckler, & Tracy, 2017).
■ This lack of consideration of measures is hugely problematic as the reader simply does not know if the scales are valid!
■ The ultimate conclusions made from a statistical analysis are highly dependent on the extent to which the measures are valid
■ If the validity of the tools is not demonstrated, how can any conclusions from the study be valid? (The flexibility this creates is known as the garden of forking paths.)
■ Does this produce an opportunity for p-hacking?
What is the Garden of Forking Paths?
■ Flexibility in how measures are used
■ A ten-item questionnaire can technically be summarised in 1023 different ways (This means there are lots of paths to explore to find one that leads you to the answer you want!)
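■ A minimal sketch in Python of where the 1023 comes from, assuming "ways" means combining any non-empty subset of the ten items into a score (2^10 - 1 = 1023 such subsets):

    import math

    # Count the non-empty subsets of 10 items: sum of C(10, k) for k = 1..10
    n_items = 10
    n_ways = sum(math.comb(n_items, k) for k in range(1, n_items + 1))
    print(n_ways)  # 1023, i.e. 2**10 - 1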
Measurement Schmeasurement: How do questionnaires vary?
■ The Hamilton Rating Scale for Depression has multiple versions (all with different psychometric properties):
– 6-item
– 17-item
– 21-item
– 24-item
– 29-item
■ It is important to be clear which version was used, so people can replicate the research and assess the validity of your measure (i.e., be transparent!)
How is the Garden of forking paths relevant to cognitive tasks?
■ Analytic flexibility is still present
■ The addiction Stroop
BEER WINE GIN CIDER (alcohol-related words)
BRIDGE TREE ROAD POND (matched neutral words)
– Participants name the ink colour of each word while ignoring its meaning; slower colour-naming for alcohol-related words than for neutral words indicates attentional bias (a sketch of the usual outcome score is below)
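■ A hedged illustration (hypothetical reaction times, not real data) of one common outcome score: mean colour-naming RT for alcohol words minus mean RT for neutral words:

    from statistics import mean

    # Hypothetical reaction times (ms) for one participant
    rt_alcohol = [612, 655, 640, 598]  # trials with BEER, WINE, GIN, CIDER
    rt_neutral = [570, 585, 560, 575]  # trials with BRIDGE, TREE, ROAD, POND

    # A positive score suggests attentional bias: alcohol words
    # slow colour-naming relative to matched neutral words
    interference = mean(rt_alcohol) - mean(rt_neutral)
    print(interference)  # 53.75 ms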
Garden of forking paths: What did Jones et al. (2021) find about analytic flexibility in the computerised alcohol Stroop?
Method decisions:
* response (key press vs. voice),
* number of drug-related stimuli used,
* number of stimulus repetitions,
* design (block vs. mixed)
Analysis decisions:
* upper- and lower-bound reaction time cut-offs
* removal of individual reaction times based on standard error cut-offs
* removal of participants based on overall performance
* type of outcome used
* removal of errors
■ 1,451,520 different possible designs of the computerised alcohol Stroop
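■ A hedged sketch of how a number like this arises: the multiverse size is the product of the number of options at each decision point. The option counts below are made up for illustration; the real counts are in Jones et al. (2021):

    import math

    # Hypothetical option counts per design/analysis decision
    options = {
        "response mode": 2,        # key press vs. voice
        "n drug stimuli": 4,
        "n repetitions": 3,
        "design": 2,               # block vs. mixed
        "RT cut-offs": 5,
        "outlier rule": 3,
        "participant removal": 2,
        "outcome type": 3,
        "error handling": 2,
    }
    print(math.prod(options.values()))  # 8640 with these made-up counts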
How does the Garden of forking paths affect us?
■ When selecting a scale, you need to consider its psychometric properties and whether it is valid in your population of interest.
When writing your method section, you should be explicit about (i.e., make clear to the reader):
■ the version used
■ number of items
■ response scales (e.g. 1-7 Likert scale)
■ its reliability, from past research or even in your own sample!
How do you choose a scale?
■ Find validation papers.
– Check whether the scale has been validated for use with your population of interest.
■ Has the scale got good reliability?
– Check whether the scale is reliable in your population (check Cronbach's alpha)
■ Look at the number of citations.
– Is the scale still being cited today?
Why does low reliability matter?
■ Low reliability limits your ability to find significant associations
■ Even with two measures with 'excellent' reliability of .9 each, a true correlation of .5 is attenuated to an observed correlation of .45 (see the sketch below).
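■ This follows from Spearman's classic correction for attenuation: the observed correlation equals the true correlation multiplied by the square root of the product of the two reliabilities. A quick check in Python:

    import math

    # Attenuation: r_observed = r_true * sqrt(reliability_x * reliability_y)
    r_true = 0.5
    rel_x, rel_y = 0.9, 0.9
    r_observed = r_true * math.sqrt(rel_x * rel_y)
    print(round(r_observed, 2))  # 0.45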
True or false: Internal consistency relates only to questionnaires
FALSE. You can also report internal consistency for cognitive tasks such as the Visual Probe Task and the Stroop task.
Internal consistency and cognitive tasks: Why is this important?
■ Spanakis, Jones, Field, and Christiansen (2018) contrasted the psychometric properties of a basic (general alcohol words) and an upgraded (personalised pictures of beer) alcohol Stroop task, administered either on a standard computer in a neutral university room or on a smartphone in participants' homes.
■ The Stroop task had acceptable internal reliability only when administered on a smartphone in a naturalistic environment, not when completed on a computer in the neutral university room, regardless of whether the task used general alcohol words (basic) or personalised pictures of beer (upgraded).
■ This illustrates that the setting in which a cognitive task is completed can determine whether its internal reliability is acceptable (a sketch of how such reliability is estimated follows below).
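■ Internal consistency for a cognitive task is often estimated by split-half reliability: correlate each participant's score from odd-numbered trials with their score from even-numbered trials, then apply the Spearman-Brown correction. A minimal sketch with made-up scores (Python 3.10+ for statistics.correlation):

    from statistics import correlation

    # Hypothetical per-participant interference scores computed
    # separately from odd- and even-numbered trials
    odd_half = [52, 31, 60, 44, 38, 55]
    even_half = [48, 35, 57, 40, 42, 50]

    r_half = correlation(odd_half, even_half)  # Pearson r between halves
    r_full = (2 * r_half) / (1 + r_half)       # Spearman-Brown correction
    print(round(r_full, 2))  # 0.98 with these made-up scores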
How do you report the internal consistency for subscales? (example)
■ Internal reliability for the Fear of Negative Evaluation (α = .93) and Social Avoidance and Distress – New (α = .91) subscales was excellent, and internal reliability for the Social Avoidance and Distress – General (α = .83) subscale was good.
What can Cronbach's alpha be useful for?
■ Cronbach's alpha can be useful for detecting scoring errors: for example, if negatively worded items were not reverse-scored, alpha will come out suspiciously low or even negative.
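■ For reference, a minimal sketch of how Cronbach's alpha is computed from hypothetical responses, using alpha = k/(k-1) * (1 - sum of item variances / variance of total scores):

    from statistics import variance

    # Hypothetical responses: rows = participants, columns = items
    data = [
        [4, 5, 4, 5],
        [2, 3, 2, 2],
        [5, 5, 4, 4],
        [3, 2, 3, 3],
        [4, 4, 5, 4],
    ]
    k = len(data[0])  # number of items
    item_vars = [variance([row[i] for row in data]) for i in range(k)]
    total_var = variance([sum(row) for row in data])
    alpha = (k / (k - 1)) * (1 - sum(item_vars) / total_var)
    print(round(alpha, 2))  # 0.93 with these made-up responses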
What is the consequence of reverse scoring?
■ Developing scales with a mix of positively and negatively worded items is a very bad idea!
■ “But it ensures people read the question!”
■ Perhaps, but it adds systematic error, i.e. variance not caused by the construct we want to measure
– Extra variance comes from the slightly different ways people interpret negatively vs. positively worded items
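■ If a scale does include negatively worded items, they must be reverse-scored before totals or alpha are computed. A hedged sketch of the standard arithmetic on a 1-7 Likert scale:

    # On a scale from scale_min to scale_max, a reverse-scored item
    # is recoded as (scale_min + scale_max) - raw_response
    scale_min, scale_max = 1, 7

    def reverse_score(raw):
        return (scale_min + scale_max) - raw

    print(reverse_score(7))  # 1
    print(reverse_score(2))  # 6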
What are smarter ways to ensure people are paying attention?
■ Add an attention check question instead
■ “Respond with Strongly agree to this question”
■ Exclude anyone who doesn't respond as instructed!
■ Avoids adding systematic error.
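■ A minimal sketch of that exclusion step (assuming pandas and a hypothetical column name):

    import pandas as pd

    # Hypothetical dataset with an attention-check column
    df = pd.DataFrame({
        "participant": [1, 2, 3],
        "attention_check": ["Strongly agree", "Agree", "Strongly agree"],
    })

    # Keep only participants who followed the instruction
    df = df[df["attention_check"] == "Strongly agree"]
    print(df)  # participant 2 has been excluded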