Week 5-Reliability and Validity Flashcards
What are the 2 general dimensions considered when evaluating new measures?
- Reliability
- Validity
Define reliability
Reliability refers to the consistency of a measure: a reliable measure produces the same scores when the same thing is measured repeatedly.
Reliability is most commonly quantified with correlation coefficients, though several different methods are available.
What 3 types of reliability do psychologists consider?
- Over time (test-retest reliability)
- Across items (internal consistency)
- Across different researchers (inter-rater reliability)
What is Test-Retest Reliability?
-When researchers measure a construct that they assume to be consistent across time, the scores they obtain should also be consistent across time. For example, an intelligent person should score highly on an IQ test today and score similarly highly if tested again a month later.
-Test-retest reliability is the extent to which this is actually the case (does the tool give the same measurement each time it is administered to an individual?).
-Assessing test-retest reliability requires using the measure on a group of people at one time and using it again on the same group of people at a later time, and then looking at the test-retest correlation between the two sets of scores.
-This is typically done by computing Pearson’s r.
-In general, a test-retest correlation of +.80 or greater is considered to indicate good reliability.
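As a sketch of how a test-retest correlation might be computed in practice (the scores and the pure-Python Pearson's r helper below are illustrative, not from any real dataset):

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical IQ scores for five people, tested twice a month apart
time1 = [100, 110, 95, 120, 105]
time2 = [102, 108, 96, 118, 107]

r = pearson_r(time1, time2)
print(round(r, 3))  # → 0.987, above the +.80 rule of thumb
```

In a real analysis this would usually be done with `scipy.stats.pearsonr`, which also returns a p-value; the helper above just makes the computation explicit.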
What is the problem with Test-Retest Reliability?
For many tools, the second administration is not effectively under the same conditions as the first: practice effects, memory of earlier answers, or genuine change in the person between sessions can all affect the second set of scores.
What is Intraclass correlations (ICC)?
Intraclass correlations (ICCs) assess the absolute agreement between sets of measurements (e.g., between two administrations of a test, or between raters), not just how strongly they correlate.
ICC values:
- < 0.50 indicate poor reliability
- 0.50 to 0.75 indicate moderate reliability
- 0.75 to 0.90 indicate good reliability
- > 0.90 indicate excellent reliability
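The ICC can be computed from a two-way ANOVA decomposition of the scores. A minimal sketch of one common form, ICC(2,1) (two-way random effects, absolute agreement, single measurement), on made-up test-retest data; the data and the implementation are illustrative:

```python
def icc_2_1(data):
    """ICC(2,1): two-way random effects, absolute agreement, single measurement.
    `data` is a list of rows (subjects), each a list of k measurements."""
    n = len(data)      # number of subjects
    k = len(data[0])   # measurements per subject (occasions or raters)
    grand = sum(sum(row) for row in data) / (n * k)
    row_means = [sum(row) / k for row in data]
    col_means = [sum(data[i][j] for i in range(n)) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((data[i][j] - grand) ** 2
                   for i in range(n) for j in range(k))
    ss_err = ss_total - ss_rows - ss_cols
    msr = ss_rows / (n - 1)              # mean square: subjects
    msc = ss_cols / (k - 1)              # mean square: occasions/raters
    mse = ss_err / ((n - 1) * (k - 1))   # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Hypothetical IQ scores: each row is one person at time 1 and time 2
scores = [[100, 102], [110, 108], [95, 96], [120, 118], [105, 107]]
print(round(icc_2_1(scores), 3))  # → 0.978, "excellent" by the thresholds above
```

Libraries such as pingouin provide several ICC variants; which variant is appropriate depends on the study design.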
When do high test-retest correlations make sense?
-High test-retest correlations make sense when the construct being measured is assumed to be consistent over time, which is the case for constructs such as intelligence and self-esteem.
-However, some constructs are not assumed to be stable over time. For example, as mood changes over time, a measure of mood that produced a low test-retest correlation over a period of a month would not be an issue.
What is Internal Consistency?
-Internal consistency is the consistency of people’s responses across the items on a multiple-item measure.
-In general, all the items on such measures are supposed to reflect the same underlying construct.
-Thus, people’s scores on a set of items should be correlated with each other.
-For example, on the Rosenberg Self-Esteem Scale, people who agree that they are satisfied with themselves should also agree that they have a positive attitude toward themselves (i.e., scores should be consistently high across the items).
Internal Consistency: What is the Split-Half Method?
-This method involves splitting the items on a questionnaire into two halves with each half measuring the same elements but in slightly different ways.
-For example, the items could be split into two sets such as the first and second halves of the items or the even- and odd-numbered items.
-Then a score is computed for each set of items and the relationship between the two sets of scores is examined.
-If a scale is very reliable, a person’s score on one half of the scale should be the same as (or similar to) their score on the other half. Thus, across several participants, scores from the two halves of the questionnaire should correlate very highly.
Internal Consistency: What is the correlation between the two halves in the Split-Half Method?
-The correlation between the two halves is the statistic computed in the split-half method, with larger correlations being a sign of reliability.
-A split-half correlation of +.80 or greater is generally considered good internal consistency.
-The problem with this method is that there are several ways a set of items can be split into two halves, so the result can be a product of the particular split chosen (i.e., different splits can give different reliability estimates).
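A sketch of the split-half method on hypothetical questionnaire responses, splitting into odd- and even-numbered items, scoring each half, and correlating the two half-scores (the data and the Pearson helper are illustrative):

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical data: each row is one participant's answers to 6 items (1-5)
responses = [
    [4, 5, 4, 4, 5, 4],
    [2, 1, 2, 2, 1, 2],
    [3, 3, 4, 3, 3, 4],
    [5, 4, 5, 5, 5, 4],
    [1, 2, 1, 2, 1, 1],
]

odd  = [sum(row[0::2]) for row in responses]  # items 1, 3, 5
even = [sum(row[1::2]) for row in responses]  # items 2, 4, 6

print(round(pearson_r(odd, even), 3))  # → 0.98
```

Running this with a different split (e.g., first half vs. second half) would generally give a somewhat different correlation, which is exactly the limitation noted above.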
Internal consistency: What is Cronbach’s α?
■ The most common measure of internal consistency is a statistic called Cronbach’s α.
■ Cronbach’s alpha refers to how closely related a set of items are as a group.
■ Extent to which different items on the same test (or the same subscale on a larger test) correlate with each other.
■ Alpha coefficient ranges from 0 to 1: the higher the score, the more reliable the scale is.
■ A value of +.70 or greater is generally taken to indicate good internal consistency (Kline, 1999)
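Cronbach's α can be computed from the item variances and the variance of the total scores, using the standard formula α = k/(k−1) · (1 − Σ item variances / total-score variance). A minimal sketch on made-up responses (the data are illustrative):

```python
def cronbach_alpha(responses):
    """Cronbach's alpha. `responses` is a list of participants,
    each a list of k item scores."""
    k = len(responses[0])

    def variance(xs):
        # Sample variance (n - 1 denominator)
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

    item_vars = [variance([row[j] for row in responses]) for j in range(k)]
    totals = [sum(row) for row in responses]
    return k / (k - 1) * (1 - sum(item_vars) / variance(totals))

# Hypothetical data: five participants, six items scored 1-5
responses = [
    [4, 5, 4, 4, 5, 4],
    [2, 1, 2, 2, 1, 2],
    [3, 3, 4, 3, 3, 4],
    [5, 4, 5, 5, 5, 4],
    [1, 2, 1, 2, 1, 1],
]

print(round(cronbach_alpha(responses), 3))  # → 0.977
```

Unlike the split-half method, α does not depend on any particular split; it can be read as the average of all possible split-half estimates.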
What are the threshold categories for reliability?
- 1: perfect reliability
- ≥ 0.9: excellent reliability
- ≥ 0.8 and < 0.9: good reliability
- ≥ 0.7 and < 0.8: acceptable reliability
- ≥ 0.6 and < 0.7: questionable reliability
- ≥ 0.5 and < 0.6: poor reliability
- < 0.5: unacceptable reliability
- 0: no reliability
What are the problems with Cronbach’s alpha?
-It is a lower-bound estimate: it gives the lowest estimate of reliability (i.e., it is pessimistic).
-It assumes tau-equivalence, i.e., the same true score for all test items (all items have the same factor or component loadings); this is unlikely in practice, and violating the assumption can reduce alpha estimates by up to 11%.
-More items produce a higher alpha, so a long questionnaire can appear reliable when it is not (i.e., a false positive).
What is inter-rater reliability?
■ Inter-rater reliability is the extent to which different observers are consistent in their judgments, e.g., observers coding aggressive acts in Bandura’s Bobo Doll Study.
■ Inter-rater reliability is assessed using Cronbach’s α or ICCs when the judgments are quantitative, or Cohen’s κ when the judgments are categorical, e.g., behaviour categorised as good or bad.
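A sketch of Cohen's κ for two observers' categorical judgments, using the standard formula κ = (pₒ − pₑ)/(1 − pₑ), where pₒ is the observed agreement and pₑ is the agreement expected by chance (the codings below are made up):

```python
def cohen_kappa(rater1, rater2):
    """Cohen's kappa for two raters' categorical judgments."""
    n = len(rater1)
    categories = set(rater1) | set(rater2)
    # Observed agreement: proportion of items the raters coded identically
    p_o = sum(a == b for a, b in zip(rater1, rater2)) / n
    # Chance agreement: from each rater's marginal category proportions
    p_e = sum((rater1.count(c) / n) * (rater2.count(c) / n)
              for c in categories)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical codings of 10 behaviours by two observers
r1 = ["good", "good", "bad", "good", "bad",
      "bad", "good", "good", "bad", "good"]
r2 = ["good", "good", "bad", "bad", "bad",
      "bad", "good", "good", "bad", "good"]

print(round(cohen_kappa(r1, r2), 3))  # → 0.8
```

Note that κ corrects raw percentage agreement (here 90%) for the agreement the two raters would reach by chance alone, which is why it comes out lower than 0.9.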
What is Validity?
■ Validity is the extent to which the scores from a measure represent the variable they are intended to measure.
■ Essentially, validity is concerned with whether a measure does what it is supposed to do.
■ It therefore represents the truthfulness of a measure.
■ There are three basic kinds:
1. Face validity
2. Content validity
3. Criterion validity