Reliability Flashcards
TEST-RETEST
We consider the consistency of the test results when the test is administered on different occasions
only applies to stable traits
Sources of difference between test and retest?
Systematic carryover - everyone's score improves by the same number of points - does not harm reliability
Random carryover - changes are not predictable from earlier scores, or something affects some but not all test takers - harms reliability
Practice effects - skills improve with practice
e.g., taking the same midterm exam twice - you would be expected to do better the second time
Time before re-administration must be carefully evaluated
Short interval: carryover and practice effects are a concern
Long interval: a low correlation may reflect poor reliability, real change in the characteristic over time, or a combination of the two
Well-evaluated test
A well-evaluated test reports many retest correlations associated with different time intervals between testing sessions - also consider what events occurred in between
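A minimal sketch of how a test-retest coefficient is computed, assuming we have two sets of scores from the same people on two occasions (the scores below are invented for illustration):
```python
import numpy as np

# Hypothetical scores for the same five people on two occasions
test = np.array([12, 18, 25, 31, 40], dtype=float)
retest = np.array([14, 17, 27, 30, 42], dtype=float)

# Test-retest reliability is the Pearson correlation between the two administrations
r_test_retest = np.corrcoef(test, retest)[0, 1]
print(f"test-retest r = {r_test_retest:.2f}")
```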
PARALLEL FORMS
We evaluate consistency across different forms of the same test
use different items; however, the rules used to select items of a particular difficulty level are the same.
Give two different forms to the same person (same day), calculate the correlation
Reduces learning effect
CON: not always practical - hard to come up with two forms that you expect to behave identically
SPLIT HALF/Internal Consistency
Administer the whole test - split it in half and calculate the correlation between halves
If items get progressively more difficult, use an odd-even split
CON: how do you decide which halves? On a midterm, you would not expect all questions to measure the same thing
SPLIT HALF: Spearman-Brown Correction
allows you to estimate what the correlation between the two halves would have been if each half had been the length of the whole test:
corrected r = 2r / (1 + r)
Corrected r = the estimated correlation between the two halves of the test if each had the total number of items
increases the estimate of reliability
r = the correlation between the two halves of the test
Assumes the variances of the two halves are similar
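A small sketch of the Spearman-Brown correction applied to a half-test correlation; the half-test r of .70 is just an example value:
```python
def spearman_brown(r_half: float) -> float:
    """Estimate full-length reliability from the correlation between two halves."""
    return 2 * r_half / (1 + r_half)

print(round(spearman_brown(0.70), 2))  # ~0.82: the corrected estimate is higher than r
```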
SPLIT HALF: Cronbach’s Alpha
The coefficient alpha for estimating split-half reliability
Provides a lower-bound (conservative) estimate of reliability
Can be used when the two halves have unequal variances
α = the coefficient alpha for estimating split-half reliability
σ²_x = the variance for scores on the whole test
σ²_y1 and σ²_y2 = the variances for the two separate halves of the test
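A sketch of the split-half form of coefficient alpha, α = 2[σ²_x - (σ²_y1 + σ²_y2)] / σ²_x, which does not require equal half variances; the half scores are invented:
```python
import numpy as np

# Hypothetical half-test scores for six test takers
half1 = np.array([10, 12, 9, 15, 11, 14], dtype=float)
half2 = np.array([11, 14, 8, 16, 10, 15], dtype=float)
total = half1 + half2

var_x = total.var(ddof=1)    # variance of the whole test
var_y1 = half1.var(ddof=1)   # variance of half 1
var_y2 = half2.var(ddof=1)   # variance of half 2

alpha_split = 2 * (var_x - (var_y1 + var_y2)) / var_x
print(f"split-half alpha = {alpha_split:.2f}")
```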
SPLIT HALF: KR20 formula
A reliability estimate that uses math to solve the problem of considering all possible split halves
KR20 = [N / (N - 1)] × [(S² - Σpq) / S²], where N = the number of items on the test
S² = the variance of the total test score
p = the proportion of people getting each item correct (found separately for each item)
q = the proportion of people getting each item incorrect; for each item, q = 1 - p
Σpq = the sum of the products p × q for each item on the test
to have nonzero reliability, the variance for the total test score must be greater than the sum of the variances for the individual items.
This will happen only when the items are measuring the same trait.
The total test score variance is the sum of the item variances and the covariances between items
The only situation that will make the sum of the item variances less than the total test score variance is when there is covariance between the items
The greater the covariance, the smaller the Σpq term will be relative to the total test score variance
When the items covary, they can be assumed to measure the same general trait, and the reliability for the test will be high.
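A sketch of the KR-20 computation on a small made-up matrix of dichotomous (0/1) item responses:
```python
import numpy as np

# Rows = people, columns = items; 1 = correct, 0 = incorrect (invented data)
X = np.array([
    [1, 1, 1, 0, 1],
    [1, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [1, 1, 0, 1, 1],
], dtype=float)

n_items = X.shape[1]
p = X.mean(axis=0)        # proportion passing each item
q = 1 - p                 # proportion failing each item
total = X.sum(axis=1)     # each person's total score
s2 = total.var(ddof=1)    # variance of total scores

kr20 = (n_items / (n_items - 1)) * (s2 - np.sum(p * q)) / s2
print(f"KR-20 = {kr20:.2f}")
```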
SPLIT HALF: KR20 formula - CON: requires calculating p and q separately for every item
SPLIT HALF: KR21 Formula
Similar to KR20 but a different version
Does not require the calculation of the p's and q's for every item; instead, KR21 uses an approximation of the sum of the pq products based on the mean test score
Assumptions need to be met:
most important is that all the items are of equal difficulty, or that the average difficulty level is 50%.
Difficulty is defined as the percentage of test takers who pass the item. In practice, these assumptions are rarely met, and it is usually found that the KR21 formula underestimates the split-half reliability
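A sketch of the KR-21 approximation, which needs only the number of items, the mean total score, and the total score variance; it assumes roughly equal item difficulties, and the input values are illustrative:
```python
def kr21(n_items: int, mean_total: float, var_total: float) -> float:
    """KR-21: approximates KR-20 without per-item p and q values."""
    return (n_items / (n_items - 1)) * (
        1 - (mean_total * (n_items - mean_total)) / (n_items * var_total)
    )

# e.g., a 20-item test with mean score 14 and total-score variance 9
print(round(kr21(n_items=20, mean_total=14.0, var_total=9.0), 2))
```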
SPLIT HALF: Coefficient Alpha
Variance of all individual items compared to variance of test score
Used for tests where there is no correct answer - e.g., Likert scale items
Similar to KR20, but Σpq is replaced by Σs²_i, the sum of the variances of the individual items
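A sketch of coefficient alpha, α = [N / (N - 1)] × [(S² - Σs²_i) / S²], computed directly from a person-by-item score matrix; the Likert-style responses are invented:
```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """items: rows = people, columns = items (any numeric scale)."""
    n_items = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)        # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)    # variance of total scores
    return (n_items / (n_items - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical 1-5 Likert responses for five people on four items
ratings = np.array([
    [4, 5, 4, 4],
    [2, 2, 3, 2],
    [5, 4, 5, 5],
    [3, 3, 2, 3],
    [1, 2, 1, 2],
], dtype=float)
print(f"alpha = {cronbach_alpha(ratings):.2f}")
```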
Factor Analysis
Can be used to divide the items into subgroups, each internally consistent - subgroups of items will not be related to one another
Helps a test constructor build a test that has submeasures for several different traits
Classical test theory - researchers are turning away from it because:
- Requires that exactly the same test be administered to each person
- Some items are too easy and some are too hard, so few items concentrate on a person's exact ability level
- Assumes behavioral dispositions are constant over time
Item Response Theory
Basis of computer adaptive tests
Focuses on the range of item difficulty that best assesses an individual's ability level
With a computer adaptive test, if the person gets several easy items correct, the computer quickly moves to more difficult items (see the sketch below)
A more reliable estimate of ability is obtained using a shorter test with fewer items
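A toy sketch of the adaptive idea: after each response, pick the unused item whose difficulty is closest to the current ability estimate. This is a simplification of real IRT scoring (which uses maximum likelihood or Bayesian ability estimates), and all item names and numbers are invented:
```python
# Toy item bank: difficulty on a standardized (theta-like) scale
item_bank = {"i1": -2.0, "i2": -1.0, "i3": 0.0, "i4": 1.0, "i5": 2.0}

def next_item(ability: float, remaining: dict) -> str:
    # Choose the unused item whose difficulty is closest to the current ability estimate
    return min(remaining, key=lambda k: abs(remaining[k] - ability))

ability = 0.0                  # start at an average ability estimate
remaining = dict(item_bank)
for _ in range(3):
    item = next_item(ability, remaining)
    difficulty = remaining.pop(item)
    correct = True             # pretend the examinee answers correctly
    # Crude update: move the estimate up after a correct answer, down after an error
    ability += 0.5 if correct else -0.5
    print(item, difficulty, round(ability, 2))
```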
Item Response Theory - Difficulties
1 - method requires a bank of items that have been systematically evaluated for level of difficulty
2- Considerable effort must go into test development, and complex computer software is required.
Reliability of a Difference Score
When might we want a difference score? E.g., the difference between performance at two points in time, such as before and after a training program
In a difference score, E is expected to be larger than either the observed score or T because E absorbs error from both of the scores used to create the difference score.
T might be expected to be smaller than E because whatever is common to both measures is canceled out when the difference score is created
The low reliability of a difference score should concern the practicing psychologist and education researcher. Because of their poor reliabilities, difference scores cannot be depended on for interpreting patterns.
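A sketch of the standard classical-test-theory formula for the reliability of a difference score, r_dd = [(r11 + r22)/2 - r12] / (1 - r12); the formula and the input values are supplied here for illustration rather than taken from the cards above:
```python
def difference_score_reliability(r11: float, r22: float, r12: float) -> float:
    """Reliability of (score1 - score2) given each test's reliability and their correlation."""
    return ((r11 + r22) / 2 - r12) / (1 - r12)

# Two fairly reliable tests (.85 each) that correlate .70 yield a poorly reliable difference score
print(round(difference_score_reliability(0.85, 0.85, 0.70), 2))  # 0.5
```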
Interrater Reliability
Kappa statistic
introduced by J. Cohen (1960) as a measure of agreement between two judges who each rate a set of objects using nominal scales.
Kappa indicates the actual agreement as a proportion of the potential agreement following correction for chance agreement.
Values of kappa may vary between 1 (perfect agreement) and -1 (less agreement than can be expected on the basis of chance alone).
Greater than .75 = excellent
Interrater Reliability - Nominal scores
-1 less than chance
1 perfect agreement
.75 excellent
.40-.75 - fair to good
Less than .40 is poor
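A minimal sketch of Cohen's kappa for two raters using nominal categories; the ratings are made up:
```python
from collections import Counter

def cohens_kappa(rater1, rater2):
    """Kappa = (observed agreement - chance agreement) / (1 - chance agreement)."""
    n = len(rater1)
    observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    counts1, counts2 = Counter(rater1), Counter(rater2)
    categories = set(rater1) | set(rater2)
    chance = sum((counts1[c] / n) * (counts2[c] / n) for c in categories)
    return (observed - chance) / (1 - chance)

r1 = ["yes", "yes", "no", "no", "yes", "no"]
r2 = ["yes", "no", "no", "no", "yes", "yes"]
print(round(cohens_kappa(r1, r2), 2))
```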
Sources of Error
Time Sampling issues
e.g., a state anxiety measure given a week later may yield different scores simply because the person's state has changed
This source of error is typically assessed using the test-retest method
Sources of Error - Item sampling
some items may behave strangely
The same construct or attribute may be assessed using a wide pool of items.
Typically, the correlation between two forms of a test is created by randomly sampling from a large pool of items believed to assess a particular construct.
This correlation is used as an estimate of this type of reliability
Sources of Error - Internal Consistency
we examine how people perform on similar subsets of items selected from the same form of the measure
intercorrelations among items within the same test
If the test is designed to measure a single construct and all items are equally good candidates to measure that attribute, then there should be a high correspondence among the items.
The extent of internal-consistency error is determined using split-half reliability, the KR20 method, or coefficient alpha
Observer Differences - sources of error
e.g., scoring by an untrained person, or independent observations that must be reconciled
Even though they have the same instructions, different judges observing the same event may record different numbers.
To determine the extent of this type of error, researchers can use an adjusted index of agreement such as the kappa statistic.
Standard error of measurement
Because we usually assume that the distribution of random errors will be the same for all people, classical test theory uses the standard deviation of errors as the basic measure of error.
tells us, on average, how much a score varies from the true score.
In practice, the standard deviation of the observed score and the reliability of the test is used to estimate the standard error of measurement
SEM = SD × sqrt(1 - r)
Not the standard error of the mean
Includes info about the reliability of the test
Use this measure to construct a confidence interval around a specific score
Upper and lower bounds = observed score +/- 1.96 × SEM
Bounds are around the observed score
1.96 multiplied by standard error of measurement - 95% CI
If X is my observed score, then with 95% confidence the true score will fall within those boundaries
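A sketch of the standard error of measurement and a 95% confidence interval around an observed score; the SD, reliability, and observed score are example values:
```python
import math

def sem(sd: float, reliability: float) -> float:
    """Standard error of measurement: SD * sqrt(1 - r)."""
    return sd * math.sqrt(1 - reliability)

sd, r, observed = 15.0, 0.90, 110.0
s_m = sem(sd, r)
lower, upper = observed - 1.96 * s_m, observed + 1.96 * s_m
print(f"SEM = {s_m:.2f}, 95% CI = [{lower:.1f}, {upper:.1f}]")
```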
How much reliability is good enough?
Depends on what you are using it for
High-stakes consequences - need to have a good idea of how reliable it is
Reliability estimates in range of .7-.8 are good enough for most purposes in basic research - Some people have argued that it would be a waste of time and effort to refine research instruments beyond a reliability of .90.
In fact, it has even been suggested that reliabilities greater than .95 are not very useful because they suggest that all of the items are testing essentially the same thing and that the measure could easily be shortened.
For a test used to make a decision that affects some person’s future, evaluators should attempt to find a test with a reliability greater than .95.
What to do about low reliability?
Add items
A longer test is generally more reliable - a single multiple-choice question would be far less reliable than a 40-item test
The larger the sample of items, the more likely the test will represent the true characteristic
Item analysis
- examine how all the individual items are performing and identify which ones are doing well
Each item in a test is an independent sample of the trait or ability being measured
Length Needed for any Desired Level of Reliability
N = r_d(1 - r_o) / [r_o(1 - r_d)]
N = the number of tests of the length of the current version that would be needed to reach the desired reliability
r_d = the desired reliability
r_o = the observed reliability based on the current version of the test
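A sketch of this prophecy calculation - how many times longer the test must be to reach a desired reliability; the .65 observed and .80 desired values are examples:
```python
def length_needed(r_observed: float, r_desired: float) -> float:
    """Number of tests of the current length needed to reach the desired reliability."""
    return (r_desired * (1 - r_observed)) / (r_observed * (1 - r_desired))

# e.g., going from an observed reliability of .65 to a desired .80
print(round(length_needed(0.65, 0.80), 2))  # ~2.15 times the current length
```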
Correction for Attenuation
If a test is unreliable, information obtained with it is of little or no value. Thus, we say that potential correlations are attenuated, or diminished, by measurement error.
Estimates the true correlation between tests 1 and 2 - the correlation we would obtain if we could get everyone's true scores
The observed correlation will be an underestimate of the true correlation
r̂12 = r12 / sqrt(r11 × r22)
r̂12 = the estimated true correlation between tests 1 and 2
r12 = the observed correlation between tests 1 and 2
r11 = the reliability of test 1
r22 = the reliability of test 2
Another option if you are concerned that one of your tests is unreliable
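A sketch of the correction for attenuation, r̂12 = r12 / sqrt(r11 × r22), with example values:
```python
import math

def correct_for_attenuation(r12: float, r11: float, r22: float) -> float:
    """Estimated true correlation between two measures, given their reliabilities."""
    return r12 / math.sqrt(r11 * r22)

# Observed correlation of .40 between two tests with reliabilities .70 and .80
print(round(correct_for_attenuation(0.40, 0.70, 0.80), 2))  # ~0.53
```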
Domain Sampling Model
considers the problems created by using a limited number of items to represent a larger and more complicated construct
We use a sample of items to represent the domain
The task in reliability analysis is to estimate how much error we would make by using the score from the shorter test as an estimate of true ability
Conceptualizes reliability as the ratio of the variance of the observed score on the shorter test to the variance of the long-run true score
the greater the number of items, the higher the reliability.
Because true scores are not available, our only alternative is to estimate what they would be. Given that items are randomly drawn from a given domain, each test or group of items should yield an unbiased estimate of the true score.
Different random samples of items might give different estimates of the true score
To estimate reliability, we can create many randomly parallel tests by drawing repeated random samples of items from the same domain
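A toy simulation of the domain-sampling idea: draw repeated random samples of items from a large domain, score each randomly parallel test, and note that longer samples track the long-run domain ("true") score more closely. All quantities are invented:
```python
import numpy as np

rng = np.random.default_rng(0)
n_people, n_domain_items = 200, 500

# Each person has a true ability; each domain item is a noisy indicator of it
ability = rng.normal(size=n_people)
domain = ability[:, None] + rng.normal(scale=1.5, size=(n_people, n_domain_items))
true_score = domain.mean(axis=1)   # long-run score over the whole domain

for n_items in (5, 20, 80):
    items = rng.choice(n_domain_items, size=n_items, replace=False)
    short_test = domain[:, items].mean(axis=1)
    r = np.corrcoef(short_test, true_score)[0, 1]
    print(f"{n_items:3d} items: correlation with domain score = {r:.2f}")
```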
Variance of Scores
Imagine a bunch of people taking the same test
Everyone has their own TRUE score (theoretical)
Everyone has their own observed scores
Variance = square of SD
We can calculate the variance of the observed scores
Theoretically, we could also imagine the variance of the true scores
Which would be bigger - observed or true score variance?
Observed score variance is larger, because error variance is added to the true score variance
Random vs. Systematic error
No error - bullseye
Random error - scattered spots in middle - we have accuracy but not precision
Systematic error - not a lot of variance but error is not randomly distributed - cluster somewhere else - precision but not accuracy
Practice effect - we expect the score to be different (better) the second time - this error is not random
Test that underestimates the ability of women - gave a systematically lower score
Observed score tends to be lower than true score
Reliability Coefficient
Ratio of the variance of the true scores on a test to the variance of the observed scores
r = σ²_T / σ²_X
r = the theoretical reliability of the test
σ²_T = the variance of the true scores
σ²_X = the variance of the observed scores
We use σ because these are theoretical values in a population rather than values actually obtained from a sample
r = the proportion of observed score variation attributable to variation in true scores; 1 - r = the proportion attributable to random error
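A toy simulation of the reliability coefficient as the ratio of true score variance to observed score variance; the variances are invented:
```python
import numpy as np

rng = np.random.default_rng(42)
true_scores = rng.normal(loc=50, scale=10, size=10_000)   # sigma^2_T ~ 100
errors = rng.normal(loc=0, scale=5, size=10_000)          # error variance ~ 25
observed = true_scores + errors                           # observed variance ~ 125

reliability = true_scores.var() / observed.var()
print(f"reliability ~ {reliability:.2f}")   # ~ 100 / 125 = 0.80
```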
Behavioral Observation
frequently unreliable because of discrepancies between true scores and the scores recorded by the observer
problem of error associated with different observers presents unique difficulties
estimate the reliability of the observers - interrater
record the percentage of times that two or more observers agree.
Percentage agreement is not the best index, for two reasons
1 - The percentage does not consider the level of agreement expected by chance alone: if two observers record whether a particular behavior occurred or did not occur, they have a 50% likelihood of agreeing by chance alone
A method for assessing such reliability should include an adjustment for chance agreement
2 - Percentages should not be mathematically manipulated.
For example, it is not technically appropriate to average percentages. Indexes such as Z scores are manipulable and thus better suited to the task of reliability assessment.
To ensure that items measure the same thing:
Factor analysis - tests are most reliable if they are unidimensional - one factor should account for considerably more of the variance than any other factor
Discriminability analysis - when the correlation between performance on a single item and the total test score is low, the item is probably measuring something different from the other items on the test
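A sketch of a simple discriminability (item-total) analysis: correlate each item with the total of the remaining items, and flag items with low correlations as possibly measuring something different. The response matrix is invented:
```python
import numpy as np

# Rows = people, columns = items (invented 0/1 responses)
X = np.array([
    [1, 1, 1, 0],
    [1, 0, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 1, 0],
    [0, 1, 0, 1],
    [1, 1, 1, 0],
], dtype=float)

for j in range(X.shape[1]):
    rest = np.delete(X, j, axis=1).sum(axis=1)   # total score without item j
    r = np.corrcoef(X[:, j], rest)[0, 1]
    print(f"item {j + 1}: corrected item-total r = {r:.2f}")
```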