Week 3 - Reliability Flashcards
What is reliability?
Reliability
• The precision or repeatability of a measurement
• Not a ‘yes or no’ thing but a continuum
• Based on criteria for what constitutes being unreliable or highly reliable
• Can depend on what you intend to use the measure for
• Origins in quantifying the error in a measure
What is an example of reliability?
If someone is very satisfied with life, and they consistently score highly on each item in the Satisfaction With Life Scale, it would be a sign that the measure is reliable. If someone is scoring high on some items, and low or in the middle on others because some of the items are unclear or poorly worded (e.g., are double-barrelled or use complex language – remember back to last week), the measure would be unreliable because of the error variance due to the poorly constructed items.
How do we interpret correlational strength?
- Pearson’s r: takes a value between −1 and +1. The sign (positive or negative) gives the direction of the relationship; the absolute value gives its strength.
- No single agreed way of interpreting strength, but one view is:
- .75 or more: strong linear relationships
- .45 to .74: moderate linear relationships
- Less than .45: weak linear relationships
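To make this concrete, here is a minimal Python sketch (the scores are made up for illustration) that computes Pearson’s r with NumPy:

```python
# Minimal sketch: Pearson's r between two sets of hypothetical scores.
import numpy as np

# Made-up scores for two variables measured on the same eight respondents
life_satisfaction = np.array([24, 30, 18, 27, 21, 33, 15, 29])
positive_affect = np.array([19, 26, 14, 22, 17, 28, 12, 25])

# np.corrcoef returns a 2x2 correlation matrix; the off-diagonal entry is r
r = np.corrcoef(life_satisfaction, positive_affect)[0, 1]
print(f"Pearson's r = {r:.2f}")  # sign = direction, absolute value = strength
```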
Explain Reliability and Error
X = T + e
Observed score = True score + Error
Error – things that can affect the accuracy of a scale/measure/test. These can be things that you might not be able to control (e.g., how tired a respondent is feeling), or could be due to poorly constructed items. If we minimise these sources of error as much as we can, the score someone gets on a scale/measure/test should hopefully be a true reflection of whatever it is we’re trying to measure about that person.
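A small simulation (illustrative only; all numbers are made up) shows the X = T + e idea: the larger the error variance, the more weakly the observed scores reflect the true scores:

```python
# Simulate X = T + e: as error variance grows, observed scores
# become a noisier reflection of the true scores.
import numpy as np

rng = np.random.default_rng(42)
true_scores = rng.normal(loc=50, scale=10, size=1000)  # T: the construct itself

for error_sd in (2, 10, 25):  # small, moderate, and large error
    observed = true_scores + rng.normal(0, error_sd, size=1000)  # X = T + e
    r = np.corrcoef(true_scores, observed)[0, 1]
    print(f"error SD = {error_sd:2d} -> correlation of X with T = {r:.2f}")
```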
Explain sources of error
Unsystematic
• Random influences that the researcher cannot (and may never be able to) control for.
Systematic
• An in-built problem that causes a test to measure something it shouldn’t.
• Items not assessing the appropriate concept
• Items that are unclear or double-barrelled
•Could also be due to the manner in which the test is administered.
Unsystematic errors are things you may not be able to control (e.g., a respondent being tired), whereas systematic errors are due to things that you can control (e.g., poorly worded items).
What is Cronbach’s alpha?
• Correlates the score for each item with the total score (i.e., for individual respondents), and then compares that to the variance for all individual items.
• Index of internal consistency
• Tendency of items to correlate positively with each other.
• Ranges between 0 and 1
• Alphas tend to increase with the number of items
• Strictly speaking, applies to unidimensional tests
• But is often used to assess the reliability of multifactorial measures
Cronbach’s alpha is also based on correlations. In short, the more strongly correlated the items are with each other, the more reliable the measure will appear to be (i.e., a Cronbach’s alpha value closer to 1). However, Cronbach’s alpha isn’t perfect, and it can be influenced by the number of items in a scale/measure/test.
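As a sketch of how the calculation works (hypothetical item responses; in practice a stats package computes this for you):

```python
# Cronbach's alpha from first principles:
# alpha = k/(k-1) * (1 - sum of item variances / variance of total score)
import numpy as np

scores = np.array([
    [4, 5, 4, 4],  # each row = one respondent, each column = one item
    [2, 2, 3, 2],
    [5, 5, 4, 5],
    [3, 2, 3, 3],
    [4, 4, 5, 4],
])

k = scores.shape[1]                         # number of items
item_vars = scores.var(axis=0, ddof=1)      # variance of each item
total_var = scores.sum(axis=1).var(ddof=1)  # variance of the total score
alpha = (k / (k - 1)) * (1 - item_vars.sum() / total_var)
print(f"Cronbach's alpha = {alpha:.2f}")
```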
Interpreting Cronbach’s alpha. What is considered to be good reliability?
.70 for the early stages of research
.80 for basic research tools
At least .90 for clinical purposes
Alphas over .90 could indicate redundancy of items
It could depend on what you’re trying to measure. If you have a measure that could be used to diagnose a child with a learning disability, for example, you’d want it to be very reliable. If it isn’t, there could be a risk of failing to diagnose a child who actually does have a learning disability, or diagnosing a child with a learning disability that doesn’t have one.
For measures like the Satisfaction With Life Scale that are used for research purposes (e.g., to find out what life satisfaction is correlated with) rather than diagnostic purposes, reliability somewhere in the .80 to .90 range is fine. Reliability in the .70s is generally okay although not ideal, anything below that could be a sign of a problem.
What are the important aspects of Cronbach’s alpha?
- Item-total correlation
- The correlation of the item with an overall score made up of the other items, without the item in question.
- Values of less than .4–.5 generally indicate that the item doesn’t really contribute much to the scale.
- Anything .3 or less is definitely a sign of a potential issue.
- Alpha if item deleted
- What the overall Cronbach’s alpha would be if the item was removed from the scale.
- A substantial increase in the alpha indicates a poor item. A minor increase can be ignored.
- This result does not tell us anything about the reliability of individual items!
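Here is a sketch of both diagnostics on made-up data (item 4 is deliberately written to run against the other items, so it should stand out in the output):

```python
# Corrected item-total correlation and alpha-if-item-deleted for each item.
import numpy as np

def cronbach_alpha(scores):
    k = scores.shape[1]
    return (k / (k - 1)) * (1 - scores.var(axis=0, ddof=1).sum()
                            / scores.sum(axis=1).var(ddof=1))

scores = np.array([[4, 5, 4, 1],
                   [2, 2, 3, 5],
                   [5, 5, 4, 2],
                   [3, 2, 3, 4],
                   [4, 4, 5, 1]])  # item 4 runs against the rest

for i in range(scores.shape[1]):
    rest = np.delete(scores, i, axis=1)  # the scale without this item
    r_it = np.corrcoef(scores[:, i], rest.sum(axis=1))[0, 1]
    print(f"item {i + 1}: item-total r = {r_it:+.2f}, "
          f"alpha if deleted = {cronbach_alpha(rest):.2f}")
```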
Why is alpha important?
• Observed correlations shrink as reliability decreases (attenuation)
• Reliability, therefore, limits the accuracy of findings
• Low reliability may result in not detecting correlations that exist
What is Test-retest reliability?
The same test is given to the same sample on two occasions.
Useful for trait measures that theoretically shouldn’t change much over time.
Should be strongly correlated at both time points.
Another form of reliability that is common to see is test-retest reliability. If you’re developing a measure of a personality trait, and the theory suggests that the trait shouldn’t change much over time, it should be the case that people score similarly on the measure at two separate time points. If scores are vastly different, it either suggests that change has occurred or that the measure isn’t very reliable.
What is split-half reliability?
•This was one of the first approaches to reliability and pre-dates Cronbach’s alpha
1. Divide the test into equivalent halves
• E.g., odd and even items. Never split down the middle.
2. Compute Pearson’s r for the two halves
3. Apply the Spearman-Brown formula: 2r / (1 + r)
• Useful for longer tests where fatigue or practice effects could affect results.
Reliability tends to increase with the number of items.
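A minimal sketch of the three steps on hypothetical data (odd vs even items, then the Spearman-Brown step-up):

```python
# Split-half reliability: correlate odd-item and even-item half scores,
# then apply the Spearman-Brown correction: 2r / (1 + r).
import numpy as np

scores = np.array([[4, 5, 4, 4, 3, 4],
                   [2, 2, 3, 2, 2, 1],
                   [5, 4, 4, 5, 5, 4],
                   [3, 2, 3, 2, 2, 3],
                   [4, 4, 5, 4, 4, 5]])  # respondents x items (made up)

odd_half = scores[:, 0::2].sum(axis=1)   # items 1, 3, 5
even_half = scores[:, 1::2].sum(axis=1)  # items 2, 4, 6

r = np.corrcoef(odd_half, even_half)[0, 1]
reliability = (2 * r) / (1 + r)  # Spearman-Brown step-up
print(f"half-half r = {r:.2f}, corrected reliability = {reliability:.2f}")
```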
Describe Parallel/Alternate forms
- Develop alternate versions of a test and administer both to the same group.
- Useful for examining items and controlling for learning/practice effects
- Participants are the same
- Items are different
- Interaction between person, item, and occasion?
Another strategy, which is much more time-consuming, is to develop two versions of a test and give both to the same group. For example, half of the group completes Test A and the other half completes Test B, and then the groups switch to complete the other version. If people score similarly on the counterbalanced tests, it’s a fairly good indication that the two tests are of similar difficulty.