Lecture 4 Flashcards
What is Reliability?
-can we reproduce the results? are they trustworthy?
-reliability is only about random measurement error (not systematic error)
-observed score = true score + random measurement error
-looking at inter-individual differences
What are 3 factors that influence reliability?
-any event that creates inconsistencies in performance
-content sampling
-statistical factors
What does the formula rxx = σT²/σX² mean?
-correlation between a test and itself
-this correlation corresponds to the proportion of the variance in the total score that is due to true score variance (illustrated in the sketch after this card)
-the correlation varies from 0 to 1
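Not from the lecture: a minimal Python sketch of this variance ratio, simulating observed scores as true score plus random error; all numbers are made up for illustration.
```python
import numpy as np

# Simulate observed score = true score + random measurement error,
# then check that the correlation between two parallel administrations
# matches the true-score variance ratio σT²/σX².
rng = np.random.default_rng(0)
n = 100_000
true = rng.normal(50, 10, n)       # true scores: σT² = 100
x1 = true + rng.normal(0, 5, n)    # administration 1: σE² = 25, σX² = 125
x2 = true + rng.normal(0, 5, n)    # administration 2 (parallel form)

variance_ratio = true.var() / x1.var()       # ≈ 100/125 = .80
empirical_rxx = np.corrcoef(x1, x2)[0, 1]    # ≈ .80 as well
print(round(variance_ratio, 3), round(empirical_rxx, 3))
```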
What are some examples of events that create inconsistencies in performance?
-true changes in results (over time only)
-random changes in the persons being assessed (tiredness, sickness, etc.)
-random changes in the test administration process (location, instructions, noise, etc.)
-random changes in the scoring procedures
-random changes in the item response process (e.g., guessing)
What does the formula 1 − rxx = σE²/σX² = % error mean?
-gives us the proportion of the total score variance that is due to random measurement error (e.g., if rxx = .80, then 20% of the observed variance is error)
What are some examples of content sampling?
-representativeness of the items (luck)
-clarity of the items
-test length (longer tests = higher reliability)
What are some examples of statistical factors?
-regression toward the mean
-range restriction/extension
*extension: more variability in responses
*restriction: less variability, which makes it harder to get a strong correlation (see the sketch after this card)
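Not from the lecture: a minimal sketch of range restriction, with made-up data, showing how cutting variability in one variable weakens the observed correlation.
```python
import numpy as np

# The same underlying relationship looks weaker once only a narrow
# slice of the score range is kept (e.g., only high scorers retained).
rng = np.random.default_rng(1)
n = 50_000
x = rng.normal(0, 1, n)
y = 0.6 * x + rng.normal(0, 0.8, n)              # population r ≈ .60

full_r = np.corrcoef(x, y)[0, 1]
keep = x > 1.0                                   # range restriction on x
restricted_r = np.corrcoef(x[keep], y[keep])[0, 1]
print(round(full_r, 2), round(restricted_r, 2))  # ~.60 vs noticeably lower
```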
What are some sources of error and their corresponding reliability assessment?
-stability/time sampling –> test-retest reliability
-inter-item consistency/content sampling –> alternate forms reliability OR split half reliability
-scale score reliability/content sampling –> Cronbach's alpha
-differences between raters –> inter-rater agreement and inter-rater reliability
What is the test-retest reliability (stability/time sampling)?
-test administered twice at two separate time points to assess how stable the scores will be over time.
-the time interval should be carefully selected to: (a) limit the possibility of true score changes; (b) limit the effects of recall [which artificially inflates stability] (typically 1-2 weeks to 1 month)
-we want a correlation of about .80-.90 between the scores obtained on the two administrations (depends on the time interval and on what is known about the construct); see the sketch after this card
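Not from the lecture: a minimal sketch, with hypothetical scores, showing that test-retest reliability is simply the correlation between the two administrations.
```python
import numpy as np

# Hypothetical scores for 8 people tested at two time points.
time1 = np.array([12, 15, 9, 20, 17, 11, 14, 18])
time2 = np.array([13, 14, 10, 19, 18, 10, 15, 17])

test_retest_r = np.corrcoef(time1, time2)[0, 1]
print(round(test_retest_r, 3))   # ≈ .96 here; the lecture's target is .80-.90
```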
What is alternate forms reliability (inter-item consistency/content sampling)?
-2 equivalent alternative forms of the same test are administered either simultaneously (assess reliability due to content sampling) or at 2 separate time points (assess reliability due to content + time sampling)
-assesses item similarity across forms through correlation
-differences reflect random errors related to item sampling: difficulty, representativeness, clarity, guessing, etc.
-it is important to counterbalance the administration order across 2 separate subsamples (e.g., half take form A first, half take form B first)
What is a pro and con of alternate forms reliability?
-pro: no recall effect (not concerned about memory bias)
-con: not the same items
What is split half reliability (inter-item consistency/content sampling)?
-1 test is simply split in two (then correlate the 2 halves)
-important to ensure content similarity in this process:
*if all items are equivalent, the split can be done randomly
*fatigue effects can be controlled by splitting as a function of the order of appearance of the items (e.g., odd vs. even items)
-differences in scores will reflect random error due to item sampling (difficulty, representativeness, clarity, guessing, etc.)
-underestimation: since longer tests are more reliable, correlating two half-length tests underestimates the reliability of the full test
-the Spearman-Brown prophecy formula corrects for this (k = 2; the full test is 2 times longer) [k is the factor by which the test is lengthened or shortened]
-not as accurate as the alternatives because many different splits are possible, each giving a somewhat different estimate (see the sketch after this card)
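Not from the lecture: a minimal sketch of the split-half procedure with the Spearman-Brown correction, using simulated item scores (a persons x items matrix); the odd-even split stands in for splitting by order of appearance.
```python
import numpy as np

def spearman_brown(r, k):
    """Prophecy formula: projected reliability of a test k times as long."""
    return k * r / (1 + (k - 1) * r)

# Simulate 1000 people answering 10 items that all tap one true score.
rng = np.random.default_rng(2)
true = rng.normal(0, 1, (1000, 1))
items = true + rng.normal(0, 1.5, (1000, 10))

half1 = items[:, 0::2].sum(axis=1)    # items 1, 3, 5, ... (odd positions)
half2 = items[:, 1::2].sum(axis=1)    # items 2, 4, 6, ... (even positions)
r_halves = np.corrcoef(half1, half2)[0, 1]

# Correct the underestimation: the full test is k = 2 times longer than a half.
full_test_r = spearman_brown(r_halves, k=2)
print(round(r_halves, 2), round(full_test_r, 2))
```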
What is the Cronbach alpha (scale score reliability/content sampling)?
-coefficient alpha (α) or KR-20 (Kuder-Richardson formula 20, for binary items)
-roughly, this corresponds to computing all possible split-half correlations and combining them into a single estimate (without actually dividing the test) [most precise way to assess reliability]; see the sketch after this card
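Not from the lecture: a minimal sketch of coefficient alpha computed from a persons x items score matrix, using the standard variance form of the formula; the data are simulated.
```python
import numpy as np

def cronbach_alpha(items):
    """alpha = k/(k-1) * (1 - sum of item variances / variance of total score)."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Simulated example: 500 people, 8 items tapping one true score.
rng = np.random.default_rng(3)
true = rng.normal(0, 1, (500, 1))
items = true + rng.normal(0, 1.5, (500, 8))
print(round(cronbach_alpha(items), 2))
```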
Are there cut-off scores to determine a good alpha?
-no, it depends:
-items in index/survey and tests of speed/power have no reason to be consistent/correlated with one another
-test length is positively related to reliability so we should expect higher alphas for longer tests (& vice versa)
-very short tests aiming to assess very broad constructs (content heterogeneity) will tend to have lower alphas
-the Spearman-Brown prophecy formula can be used to estimate what alpha would be with a different number of equivalent items (see the sketch after this card)
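Not from the lecture: a minimal sketch of that use of the prophecy formula, with a made-up alpha; k is the ratio of the new test length to the current one (k < 1 shortens the test).
```python
def spearman_brown(r, k):
    return k * r / (1 + (k - 1) * r)

alpha_10_items = 0.70                                   # hypothetical alpha
print(round(spearman_brown(alpha_10_items, k=2), 2))    # 20 items -> ≈ .82
print(round(spearman_brown(alpha_10_items, k=0.5), 2))  # 5 items  -> ≈ .54
```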
What is the order of what we look at on reliability statistics table?
- Cronbach’s alpha [scale score reliability]
- alpha if item deleted
- corrected item-total correlation (how much each item measures the same thing as the test)
[scale mean and scale variance if item deleted are not important, but the variance should drop dramatically when a contributing item is removed]
(we look at this table to see whether some items are irrelevant and worth removing without affecting the alpha; see the sketch after this card)
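Not from the lecture: a minimal sketch reproducing the two key columns of that table (corrected item-total correlation and alpha if item deleted) on simulated data, with one deliberately irrelevant item.
```python
import numpy as np

def cronbach_alpha(items):
    k = items.shape[1]
    return (k / (k - 1)) * (1 - items.var(axis=0, ddof=1).sum()
                            / items.sum(axis=1).var(ddof=1))

def item_analysis(items):
    for i in range(items.shape[1]):
        rest = np.delete(items, i, axis=1)
        # "corrected": correlate the item with the total of the OTHER items
        r_it = np.corrcoef(items[:, i], rest.sum(axis=1))[0, 1]
        print(f"item {i}: corrected item-total r = {r_it:.2f}, "
              f"alpha if deleted = {cronbach_alpha(rest):.2f}")

rng = np.random.default_rng(4)
true = rng.normal(0, 1, (500, 1))
items = true + rng.normal(0, 1.5, (500, 6))
items[:, 0] = rng.normal(0, 1.5, 500)   # item 0 is pure noise (irrelevant)
item_analysis(items)   # item 0 should show low r and a higher alpha if deleted
```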