reliability pt 3 Flashcards
inter rater reliability
-Applies when judgment must be exercised in scoring responses (e.g., WAIS-IV VCI subtests)
- Item level (agreement on item scores)
- How much agreement on each individual item?
- Correlation between raters on assigned item scores
- Scale level (total score on scale)
- How much agreement on the total score?
- Correlation between raters on total scores
If there are more than two raters, take the mean of the correlations for each pair of raters (A & B, B & C, A & C)
inter rater reliability categorical decisions
When there is a finite number of categories to which each person being rated can be assigned
- Items: Pass/Fail (0,1)
- Items: 0, 1, 2
- Diagnosis: Present/Absent
Two methods for assessing:
- Percent Agreement
- Kappa
inter rater reliability: percent agreement
- Percentage of all cases for which both raters make the same decision (i.e., both assign a score of 0 or both assign a score of 1)
- Problem: Raters could agree simply by chance
- Percent agreement can OVERESTIMATE inter-rater reliability
- Kappa (κ) takes chance agreement into account and is the preferred method for assessing inter-rater reliability
how to calculate percent agreement
Two raters independently decide whether an item score should be 0 or 1 for N individuals who complete the item
A = number of times item was scored 0 by both #1 and #2
B = number of times item was scored 0 by #1 and 1 by #2
C = number of times item was scored 1 by #1 and 0 by #2
D = number of times item was scored 1 by both #1 and #2
Percent agreement = percentage of cases for which both raters gave the same score (either both 0 or both 1) = (A+D)/N
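A minimal sketch of the calculation; the A, B, C, D counts below are hypothetical, for illustration only.
```python
# Percent agreement from the 2x2 table of two raters' 0/1 scores.
A = 40   # both raters scored 0
B = 5    # Rater #1 scored 0, Rater #2 scored 1
C = 7    # Rater #1 scored 1, Rater #2 scored 0
D = 48   # both raters scored 1

N = A + B + C + D
percent_agreement = (A + D) / N
print(f"Percent agreement = {percent_agreement:.2f}")   # 0.88
```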
calculating chance agreement for a score = 0
Total scores of 0 given for Rater #1 = A + B
Total scores of 0 given for Rater #2 = A + C
Proportion of cases given a score of 0 by Rater #1 = (A+B)/N
Proportion of cases given a score of 0 by Rater #2 = (A+C)/N
Chance agreement for a score of 0 =
(A+B)/N times (A+C)/N
calculating chance agreement for score = 1
Total scores of 1 given for Rater #1 = C + D
Total scores of 1 given for Rater #2 = B + D
Proportion of cases given a score of 1 by Rater #1 = (C+D)/N
Proportion of cases given a score of 1 by Rater #2 = (B+D)/N
Chance agreement for a score of 1 =
(C+D)/N times (B+D)/N
calculating total chance agreement
Add the chance agreement for a score of 0 to the chance agreement for a score of 1
(A+B)/N times (A+C)/N
PLUS
(C+D)/N times (B+D)/N
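A sketch of total chance agreement using the same hypothetical counts as above; kappa, noted earlier as the preferred index, is then (percent agreement − chance agreement) / (1 − chance agreement).
```python
# Chance agreement and kappa, continuing the hypothetical counts above.
A, B, C, D = 40, 5, 7, 48
N = A + B + C + D

p_observed = (A + D) / N                      # percent agreement
chance_0 = ((A + B) / N) * ((A + C) / N)      # chance agreement on a score of 0
chance_1 = ((C + D) / N) * ((B + D) / N)      # chance agreement on a score of 1
p_chance = chance_0 + chance_1                # total chance agreement

kappa = (p_observed - p_chance) / (1 - p_chance)
print(f"chance = {p_chance:.3f}, kappa = {kappa:.3f}")   # chance = 0.503, kappa = 0.759
```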
implications of reliability
- There is no single value that represents the reliability of a test … we must specify which type of reliability we are estimating
- The methods we have considered all permit us to estimate a specific type or source of error
- To estimate multiple sources of error simultaneously, we use Generalizability Theory
- Test manuals will report all relevant types of reliability (test/retest; split-half; internal consistency; inter-rater)
standard error of measurement
- Reliability coefficients apply to the test itself
- The SEM permits us to estimate how much error is likely to be present in an individual examinee’s score
SEM in words
- Step 1. Subtract the reliability of the test from 1.
- Step 2. Take the square root of Step 1.
- Step 3. Multiply the standard deviation of the test by Step 2.
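A minimal sketch of the three steps, using a hypothetical SD of 15 and reliability of .90 (not values from a specific test).
```python
import math

# SEM = SD * sqrt(1 - reliability); the values below are hypothetical.
sd = 15       # standard deviation of the test
rtt = 0.90    # reliability coefficient

sem = sd * math.sqrt(1 - rtt)   # Steps 1-3 combined
print(f"SEM = {sem:.2f}")       # about 4.74
```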
SEM and reliability
- The SEM is INVERSELY related to the reliability of the test
- If reliability is high, SEM is low
- If reliability is low, SEM is high
standard error of measurement according to classical reliability theory
According to Classical Reliability Theory:
- Error is normally distributed around a mean of 0
- SEM = the standard deviation of the distribution of error scores
Using the probabilities associated with the normal curve
- The probability is 68% that the amount of error is within 1 SEM
- The probability is 95% that the amount of error is within 2 SEM
estimating error
We can use the SEM to make probability statements about the amount of error associated with an observed score
NOTE: To do this accurately, we have to use the exact values rather than the “approximate” values we used in Chapter 1 of the Manual
- The probability is 68% that the amount of error associated with an observed score is no more than +/- 1 SEM
- The probability is 90% that the amount of error associated with an observed score is no more than +/- 1.65 times SEM.
- The probability is 95% that the amount of error associated with an observed score is no more than +/- 1.96 times SEM.
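A sketch of these probability statements, reusing the hypothetical SEM from the earlier example.
```python
# Probability bands for the error in an observed score, using a hypothetical SEM.
sem = 4.74
for prob, z in [(68, 1.00), (90, 1.65), (95, 1.96)]:
    print(f"{prob}% chance the error is within +/- {z * sem:.2f} points")
```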
confidence intervals for estimated true score
We can also construct confidence intervals around the estimated true score
-We can’t know the actual true score, but we can estimate it.
These confidence intervals tell us the range in which the person’s true score is likely to fall with a specified degree of certainty (probability)
These are the CI’s that are given in the table in the WAIS-IV Manual
Step 1. Calculate the estimated true score
Step 2. Calculate the standard error of estimate
Step 3. Calculate the desired confidence interval
estimating the true score formula in words
Step 1. Subtract the Mean (M) from the observed score (Xo)
Step 2. Multiply Step 1 by the reliability of the test (rtt)
Step 3. Add the Mean to Step 2.
standard error of estimate
Standard Error of Estimate (SEE) = SEM times the square root of the reliability
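Putting the three steps together in a sketch; the observed score, mean, SD, and reliability below are hypothetical, not values taken from the WAIS-IV tables.
```python
import math

# Confidence interval around the estimated true score (hypothetical values).
x_obs = 120    # observed score
mean = 100     # test mean
sd = 15        # test standard deviation
rtt = 0.90     # reliability

# Step 1: estimated true score (regressed toward the mean)
true_est = mean + rtt * (x_obs - mean)          # 118.0

# Step 2: standard error of estimate = SEM * sqrt(reliability)
sem = sd * math.sqrt(1 - rtt)
see = sem * math.sqrt(rtt)                      # about 4.50

# Step 3: 95% confidence interval around the estimated true score
lower, upper = true_est - 1.96 * see, true_est + 1.96 * see
print(f"Estimated true score = {true_est:.1f}, 95% CI = {lower:.1f} to {upper:.1f}")
```
Note that the interval is centered on the estimated true score (118), not the observed score (120), which is why these CIs can be asymmetrical around the obtained score, as described below.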
CI’s around estimated score….
…..will sometimes be asymmetrical around the obtained score
Reason: regression towards the mean
- The estimated true score will always be closer to the mean than the observed score
- Est True Score > Observed Score when observed score is below the mean
- Est True Score < Observed Score when observed score is above the mean
difference between estimated true scores and observed scores
GREATER when
- Reliability is LOWER
- Observed Score is farther from Mean
LESS when
- Reliability is HIGHER
- Observed Score is closer to Mean
standard error of difference
- Used to decide if two scores are “significantly different” from one another
- i.e., the observed difference between them is NOT just due to measurement error
how to find SED in words
- Step 1. Square the SEM of the first score
- Step 2. Square the SEM of the second score.
- Step 3. Add Steps 1 and 2
- Step 4. Take the square root of Step 3.
The SED will always be larger than the larger of the two SEMs
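A sketch of the four steps, using hypothetical SEMs for the two scores (chosen so the result lands near the 4.50 used in the next card; these are not the published WAIS-IV SEMs).
```python
import math

# Standard error of difference from the SEMs of two scores (hypothetical values).
sem_1 = 3.0    # SEM of the first score
sem_2 = 3.35   # SEM of the second score

sed = math.sqrt(sem_1 ** 2 + sem_2 ** 2)   # Steps 1-4
print(f"SED = {sed:.2f}")                  # about 4.50, larger than either SEM
```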
using SED
- Multiplying the SED by 1.96 gives the amount of difference required for the scores to be considered significantly different at p < .05.
- For the VCI and PRI, this difference is 1.96 times 4.50, or 8.82
- The VCI and PRI must differ by at least 8.82 points (rounded to 9 points) in order for the difference to be considered statistically significant (not just due to measurement error) at p < .05.
example of using SED
- In other words, differences less than 9 points could be due entirely to measurement error and therefore cannot be considered “true” differences
- VCI = 109 vs. PRI = 115. The difference is NOT statistically significant because the difference is only 6 points which is less than 9 pts.
- A difference that is less than 9 points could be due entirely to measurement error, i.e., the true scores actually might not differ from one another.
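The VCI vs. PRI example as a sketch, using the SED of 4.50 given above.
```python
# Is an observed VCI/PRI difference significant at p < .05?
sed = 4.50
critical_diff = 1.96 * sed          # 8.82 points

vci, pri = 109, 115
observed_diff = abs(vci - pri)      # 6 points

significant = observed_diff >= critical_diff
print(f"Difference of {observed_diff} points; need {critical_diff:.2f} -> significant: {significant}")
```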
SED and WAIS-IV
- To get more precise values for the minimum differences required for statistical significance at p < .05, we can use Table B.1 on p. 230 of the Administration Manual.
- This table calculates values using the reliability of the indices within each specific age range.
- Reliability of the indices varies slightly with age
For most purposes, the differences given in the WAIS-IV Interpretation Manual are sufficient