Lecture 10: Scales and Reliability Flashcards
Reliability – consistency of measurement
Validity – accuracy of measurement (Is the test or tool measuring what it is meant to be measuring?)
________ is often assessed using a PCC.
For example, assessing _______ _______ often involves calculating the correlation between the measure and some other criterion for the same concept (e.g. the correlation between intelligence test score and school grades; sketched below).
validity
Pearson's correlation coefficient
criterion validity
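For instance, criterion validity in the intelligence-test example could be checked with a few lines of Python. This is only a sketch: the scores are invented for illustration, and scipy.stats.pearsonr simply returns Pearson's r together with a p-value.

```python
# Hypothetical data: intelligence test scores (the measure) and school
# grades (the criterion) for eight students (invented for illustration).
import numpy as np
from scipy.stats import pearsonr

iq_scores = np.array([95, 100, 105, 110, 115, 120, 125, 130])
school_grades = np.array([55, 60, 58, 65, 70, 72, 78, 80])

r, p = pearsonr(iq_scores, school_grades)
print(f"criterion validity (Pearson's r) = {r:.2f}, p = {p:.3f}")
```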
Reliability
In the development of multi-item scales, the internal consistency of such scales may be measured using a function of the Pearson correlations between the items that make up the scale (illustrated in the sketch after this card).
However, there are cases in which Pearson correlations are not an appropriate measure of reliability (e.g.).
the measurement of agreement between raters
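The lecture does not name the function of the inter-item correlations; one widely used choice is the standardized Cronbach's alpha, which depends only on the number of items and their average inter-item Pearson correlation. A minimal sketch, assuming made-up scores for 6 respondents on a 3-item scale:

```python
# Standardized Cronbach's alpha: k * mean_r / (1 + (k - 1) * mean_r),
# where mean_r is the average inter-item Pearson correlation and k is the
# number of items. The item scores are invented for illustration.
import numpy as np

items = np.array([
    [4, 5, 4],
    [2, 3, 2],
    [5, 5, 4],
    [3, 3, 3],
    [1, 2, 2],
    [4, 4, 5],
])

k = items.shape[1]                             # number of items
corr = np.corrcoef(items, rowvar=False)        # k x k inter-item correlations
mean_r = corr[np.triu_indices(k, k=1)].mean()  # average off-diagonal correlation

alpha = k * mean_r / (1 + (k - 1) * mean_r)
print(f"mean inter-item r = {mean_r:.2f}, standardized alpha = {alpha:.2f}")
```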
Agreement between raters:
one criterion for reliability of a coding system for observations is __________________________.
the agreement between two (or more) raters.
• In psychological research, we are often interested in obtaining reliable ratings of observations.
• If only a single person provides ratings, reliability cannot be assessed: we have no way of knowing whether the rater’s assessments are merely subjective impressions or assessments that can be agreed on intersubjectively.
• Yet, if two (or more) people rate the same observations independently, we can assess whether raters are able to apply a given coding system consistently.
With one exception, Rater A rates consistently lower than the standard. The line in this plot represents perfect agreement: if the rater and the standard always agreed, then all points would be on the line.
Bias: Rater A versus Standard
Rater B sometimes underestimates and sometimes overestimates RF.
There is no evidence of bias, but Rater B’s ratings often differ considerably from the standard ratings. So Rater B’s ratings are unbiased, but imprecise.
Lack of Precision: Rater B versus Standard
Rater C’s ratings are pretty precise. Although not all agree with the standard perfectly, all are ‘quite close’ to the standard. There is no indication of bias or of any other systematic mistake.
Good Agreement: Rater C versus Standard
Rater D tends to overestimate scores when RF is low, and underestimate scores when RF is high. Rater D tends to use values towards the middle of the scale (points 3 to 6), in effect “shrinking” the scale. This might happen if a rater is not very confident of their understanding of RF, and reluctant to make clear judgements (i.e. low RF or high RF).
Scale Shift: Rater D versus Standard
In general, Pearson’s correlation coefficient ___ a good measure of inter-rater agreement – why?
is not
It does not detect bias (such as seen in Rater A), nor does it detect systematic differences in use of a scale (such as seen in Rater D).
A better alternative for measuring agreement between raters:
intraclass correlation coefficient (ICC).
According to Pearson’s r, Rater A’s ratings agree best with the standard, and Rater C is only in “third place”. ICC shows, more usefully, that Rater C’s ratings are most reliable. Note that both Pearson’s r and ICC are good at reflecting the imprecision of Rater B.
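To make the contrast concrete, here is a sketch in Python with invented ratings whose error patterns mimic Raters A–D above (bias, imprecision, good agreement, scale shrinkage). The numbers are not the lecture’s; only the qualitative pattern is meant to match. The ICC computed is ICC(A,1), the two-way, single-rater, absolute-agreement coefficient (what SPSS labels an “Absolute Agreement” ICC, single measures).

```python
import numpy as np


def icc_a1(ratings):
    """ICC(A,1) for an (n targets x k raters) matrix of ratings."""
    ratings = np.asarray(ratings, dtype=float)
    n, k = ratings.shape
    grand = ratings.mean()
    # Two-way ANOVA decomposition: targets (rows) x raters (columns).
    ss_rows = k * ((ratings.mean(axis=1) - grand) ** 2).sum()
    ss_cols = n * ((ratings.mean(axis=0) - grand) ** 2).sum()
    ss_error = ((ratings - grand) ** 2).sum() - ss_rows - ss_cols
    msr = ss_rows / (n - 1)               # between-target mean square
    msc = ss_cols / (k - 1)               # between-rater mean square (bias)
    mse = ss_error / ((n - 1) * (k - 1))  # residual mean square (imprecision)
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


standard = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
raters = {
    "A (biased)":    np.array([1, 1, 2, 3, 4, 5, 6, 7]),  # one point low, bar one exception
    "B (imprecise)": np.array([3, 1, 5, 2, 6, 5, 9, 6]),  # unbiased but scattered
    "C (good)":      np.array([1, 2, 4, 4, 5, 7, 7, 8]),  # close to the standard
    "D (shrunken)":  np.array([2, 3, 3, 4, 5, 5, 6, 7]),  # pulled towards the middle
}

for name, ratings in raters.items():
    r = np.corrcoef(standard, ratings)[0, 1]
    icc = icc_a1(np.column_stack([standard, ratings]))
    print(f"Rater {name}: Pearson r = {r:.3f}, ICC(A,1) = {icc:.3f}")

# With these data, Pearson's r ranks the biased Rater A highest, whereas the
# absolute-agreement ICC ranks Rater C highest; both coefficients are clearly
# lower for the imprecise Rater B.
```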
Intraclass correlation coefficient (ICC)
The ICC is a number that theoretically varies between 0 and 1:
• 1 would indicate perfect agreement
• 0 would indicate absence of any relationship between ratings (which would be expected if a rater picked their numbers at random).
• … when estimating the ICC, it can sometimes happen that you obtain a negative ICC estimate. This would also indicate poor agreement.
Inter-rater reliability
You can use the ICC to assess:
You can also assess:
There are no generally accepted standards to say what constitutes a ‘high enough’ ICC to judge a rater to be reliable. In psychology, often ICC > __ is judged to be ‘good enough’.
the agreement of one or more raters with a set of ‘standard’ ratings.
the agreement of two or more raters with one another, even in the absence of a ‘standard’.
.7
Pearson’s r only measures ________.
It does not detect bias, and does not necessarily detect all types of imprecision.
the linear relationship between ratings.
There are different types of ICCs:
- “Absolute Agreement” ICCs (in SPSS) take both bias and imprecision into account.
- “Consistency” ICCs (in SPSS) do not take bias into account.
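As a sketch of the difference, consider a hypothetical rater who scores every target exactly one point below the standard (invented data; the formulas are the two-way, single-rater consistency and absolute-agreement ICCs). The consistency ICC ignores the constant bias and comes out at 1, while the absolute-agreement ICC is pulled down by it.

```python
# Consistency vs absolute-agreement ICC for a rater with a constant bias
# of one scale point (invented data for illustration).
import numpy as np

standard = np.array([2, 3, 4, 5, 6, 7], dtype=float)
biased = standard - 1                    # always exactly one point too low
ratings = np.column_stack([standard, biased])
n, k = ratings.shape

grand = ratings.mean()
msr = k * ((ratings.mean(axis=1) - grand) ** 2).sum() / (n - 1)  # targets
msc = n * ((ratings.mean(axis=0) - grand) ** 2).sum() / (k - 1)  # raters (bias)
mse = (((ratings - grand) ** 2).sum()
       - msr * (n - 1) - msc * (k - 1)) / ((n - 1) * (k - 1))    # residual

icc_consistency = (msr - mse) / (msr + (k - 1) * mse)
icc_agreement = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
print(f"consistency ICC        = {icc_consistency:.3f}")  # 1.000: bias ignored
print(f"absolute agreement ICC = {icc_agreement:.3f}")    # < 1: bias penalized
```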