Wk 4 - Reliability Flashcards
What is classical test theory?
What equation does it give us for the relationship between scores? (x3)
It’s the conceptual basis for psychometrics
The observed score = True score + Error of measurement
X = T + E
What is true score theory? (x1)
Another name for Classical Test theory
What is reliability in terms of the relationship between true and total variance? (x4)
According to Classical Test theory
r = true variance (hypothetical variation of test scores in a sample if no measurement error) divided by total variance (actual variation in data - including error)
r = σ²(T) / σ²(X), i.e. true variance over total variance
Therefore measurement error is inversely related to reliability – lower measurement error = higher reliability
Why do we describe classical test theory in terms of variance rather than standard deviations? (x2)
Because variance is additive and can be broken up into its components
Whereas SD can’t
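A quick simulated demonstration of this point (a sketch using made-up normal data for T and E, not anything from the lecture):

```python
import numpy as np

rng = np.random.default_rng(1)
T = rng.normal(100, 10, 100_000)  # hypothetical true scores
E = rng.normal(0, 5, 100_000)     # independent measurement error
X = T + E                         # observed scores

# Variances add: var(X) ≈ var(T) + var(E) ≈ 125
print(round(X.var(), 1), round(T.var() + E.var(), 1))
# SDs don't: sd(X) ≈ 11.2, but sd(T) + sd(E) = 15
print(round(X.std(), 1), round(T.std() + E.std(), 1))
```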
What are four sources of measurement error?
Test construction
Test administration
Test scoring
Other influences
What is item sampling/content sampling? (x2 plus egs)
Content sampling would be testing every aspect of the trait/skill
eg all the content of the course over 24 hours
Item sampling is testing a representative proportion of it
eg the 2 hour exam
Why can we only estimate the reliability of a test and not measure it directly? (x2)
Because true variance is a hypothetical/theoretical construct -
We can’t measure everyone on the planet and work it out
Name and describe four methods available to us to help estimate the reliability of a test.
Test-retest – how do scores correlate if people sit the same test twice?
Alternate-forms – how do scores correlate if people do two different versions of same test
Internal consistency – how much do the items in a test correlate with each other, on average? (Cronbach’s alpha, KR-20)
Inter-rater reliability – check the correlation on two/more different examiner ratings
Describe the steps involved in calculating Cronbach’s alpha by hand.
Split questionnaire in half
Calculate total score from items in each half
Work out correlation between those totals (the two halves)
Repeat steps 1-3 for all possible two-way splits of the total number of items
Work out the average of all the possible split-half correlations
Adjust the correlation to account for the fact that you’ve shortened (halved) the test, using a special version of the Spearman-Brown formula (the fewer the items, the lower the reliability correlation; so cutting the test in half artificially lowers the reliability)
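A minimal Python sketch of these by-hand steps, using a made-up people-by-items data matrix (illustrative only):

```python
from itertools import combinations

import numpy as np

rng = np.random.default_rng(0)
data = rng.integers(1, 6, size=(50, 6))  # 50 people, 6 Likert-style items

k = data.shape[1]
items = range(k)
seen, corrs = set(), []

# Steps 1-4: correlate the two half-totals for every possible two-way split
for combo in combinations(items, k // 2):
    other = tuple(i for i in items if i not in combo)
    key = frozenset((combo, other))  # count each split only once
    if key in seen:
        continue
    seen.add(key)
    a = data[:, list(combo)].sum(axis=1)  # total score, first half
    b = data[:, list(other)].sum(axis=1)  # total score, second half
    corrs.append(np.corrcoef(a, b)[0, 1])

# Step 5: average the split-half correlations
mean_r = float(np.mean(corrs))
# Step 6: Spearman-Brown correction for having halved the test (n = 2)
alpha_estimate = 2 * mean_r / (1 + mean_r)
print(round(alpha_estimate, 3))
```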
How is the KR-20 calculated?
As with Cronbach’s alpha, this formula gives you the estimate you would get if you worked out the mean of the correlations between all possible halves of your questionnaire (then corrected for halving) – but for dichotomous items
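The standard KR-20 formula itself, as a short sketch with hypothetical right/wrong data (note that textbooks differ on whether they use population or sample variances):

```python
import numpy as np

# Hypothetical data: 6 people x 4 dichotomous items (1 = right, 0 = wrong)
scores = np.array([
    [1, 1, 0, 1],
    [0, 1, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 0, 1, 1],
    [1, 1, 1, 0],
])

def kr20(x):
    k = x.shape[1]                   # number of items
    p = x.mean(axis=0)               # proportion correct per item
    pq_sum = (p * (1 - p)).sum()     # sum of item variances (p times q)
    total_var = x.sum(axis=1).var()  # variance of total scores (population form)
    return (k / (k - 1)) * (1 - pq_sum / total_var)

print(round(kr20(scores), 3))
```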
What’s the difference between parallel forms and alternate forms? (x3)
Both give correlation between scores on 2 versions of same test by same people at same time, but
Alternate forms just needs high Coefficient of Equivalence, whereas
Parallel also requires that Mean, SD and correlations with other tests must be the same
What is the coefficient of equivalence in the context of a test with parallel forms? (x3)
The correlation between two versions of the same test
Applies to parallel and alternate
Also another term for the reliability coefficient used in these methods
List five considerations that might affect which reliability estimate you can use
- Homogeneity/heterogeneity of the test
- Static vs dynamic characteristics
- Restriction of range/variance
- Speed tests versus power tests
- Criterion-referenced tests
What is a homogeneous test? (x1)
One whose items all measure the same thing.
What is a heterogeneous test? (x2)
One where more than one independent thing is being measured
i.e. there are subscales that don’t intercorrelate highly
Describe in detail exactly what the standard error of measurement is supposed to represent (x4)
The reliability of a test will never be 100%
So knowing the margin of error is critical in interpreting the meaning of an individual’s scores
Assuming a normal distribution of those test scores, with the true score at the centre, the SEM is the SD of that distribution
Mostly hypothetical – we use the one time they did the test to estimate what the SD would be for one person taking the same test many times
What is the CI? (x1)
Why do we have to add and subtract DOUBLE the SEM from an individual’s score in order to get the 95% confidence interval? (x4)
Plus eg calculation
The range of scores that is likely to contain a person’s true score
Because under a normal distribution (assumed to be the case)
68% of scores are +/- 1 SD/SEM from mean, while
95% are within +/- 1.96
(99.7% are within 3)
WAIS IQ score of 105, SD of all IQ tests is 15, reliability is .98
SEM = 15 x sqrt(1-.98) = 2.12
CI = [105 +/- (2*2.12)] from 101 to 109
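Checking the worked example in code (the 1.96 is rounded to 2, as above):

```python
import math

sd, reliability, observed = 15, 0.98, 105
sem = sd * math.sqrt(1 - reliability)          # ≈ 2.12
low, high = observed - 2 * sem, observed + 2 * sem
print(round(sem, 2), round(low), round(high))  # 2.12, 101, 109
```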
What is the reliable change index? (x1)
How do you calculate? (x1)
And how do you apply? (x1)
In clinical practice, a variation on the SEdiff approach that is mathematically equivalent to it
Work out the diff between two scores (eg change during intervention), and divide by the SEdiff
If the RCI is greater than 1.96 (ie 2 standard errors of the difference) you have statistically significant change
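A minimal sketch of the RCI calculation (the pre/post scores and SEdiff here are made up):

```python
def rci(pre, post, se_diff):
    # Reliable change index: difference score divided by the SEdiff
    return (post - pre) / se_diff

# Hypothetical client: 40 before intervention, 48 after, SEdiff of 3
print(rci(40, 48, 3.0))  # 2.67 > 1.96, so statistically significant change
```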
True or false?
When calculating the Cronbach’s alpha, you have to multiply the correlations derived from all the possible split-half correlations of the items
And why? (x1)
False, you average them
True or false?
As part of the process of calculating Cronbach’s alpha, you need to adjust for the halving of the number of items by applying a special version of the Spearman Brown formula.
And why? (x2)
True, because the reliability correlation will be lower the fewer items you have
So halving the test artificially lowers it
True or false?
Imagine we revised a test and as a result the true variance became a greater proportion of the total variance. This would mean that the standard deviation of scores of a single person taking the test multiple times would become smaller
And why? (x2)
True
Because:
Total variation = True (systematic) variation + Error (unsystematic) variation
So for the Total to stay the same while True increases, Error must be shrinking – and with it the SD of one person’s repeated scores (the SD being the square root of that error variance)
True or false?
The fact that I cannot ask students everything about the course in the exam will decrease the proportion of the observed score that can be accounted for by the true score (assuming the exam mark is supposed to reflect students’ PSYC3020 knowledge)
And why? (x2)
True
Because:
Score obtained = True measurement + Error
So the inability to ask about ‘everything’ increases the amount of error in the test; therefore, for the same observed score, the proportion accounted for by the True score must go down.
How can we use variance to explain the relationship between true scores and measurement error, according to Classical Test theory? (x2)
How would this relate to SD of scores? (x1)
X = T + E
Total Variance = True variance + Error variance
SD is the root of variance, so the SD of score obtained would be influenced by SD of True and Error too
How does test construction affect measurement error? (x2)
Give an example (x1)
Whether you can access all of the trait, or only some of it (content vs item sampling)
ie error increases the fewer items you have that tap into what you’re measuring
A two-hour exam vs a 24-hour one that asked about absolutely everything
How could test administration affect measurement error? (x1) Give examples (x3)
Administering the test under different conditions could affect performance and therefore error
Distractions, fatigue, experimenter attitude
How could test scoring affect measurement error? Give examples (x3)
If scoring methods aren’t standardised, error would increase
Biased examiners, ambiguous scoring methods, technical errors
Give examples of Other Influences on measurement error (x2)
Self-efficacy
Motivational factors
What is the relationship of variance to reliability? (x2)
Higher error variance means a wider spread of observed scores,
Indicating lower reliability of the test (more measurement error)
What are two other names for internal consistency?
What does it tell us? (x2)
Inter-item consistency, internal coherence
The average correlation between items on your scale
Are responses consistent to items that supposedly measure the same thing?
What are two measures of internal consistency?
When should you apply each one?
Cronbach’s alpha, when there are more than two possible outcomes for an item (eg likert scales)
Kuder-Richardson-20 (KR-20), when there are only two possible outcomes (yes/no, in/correct, true/false)
If you had four possible answers on an exam question, would you use Cronbach’s or KR-20 to calculate internal consistency? (x1)
And why? (x1)
KR-20
Because there are still only two possible outcomes - wrong or right
What are the steps to using SPSS to calculate Cronbach’s alpha for internal consistency? (x6)
How does this differ for KR-20?
Select ANALYZE; SCALE; RELIABILITY ANALYSIS; select model ‘Alpha’.
Select all the items in your scale.
Click OK.
It doesn’t - if you have dichotomous variables, it automatically chooses the right formula
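If you wanted to check the output by hand, this is a sketch of the standard variance-based formula for Cronbach’s alpha, with a made-up data matrix:

```python
import numpy as np

def cronbach_alpha(x):
    # x: people-by-items matrix of scores
    k = x.shape[1]
    item_vars = x.var(axis=0, ddof=1)      # variance of each item
    total_var = x.sum(axis=1).var(ddof=1)  # variance of the total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical example: 5 people x 3 Likert items
data = np.array([
    [4, 5, 4],
    [2, 3, 3],
    [5, 5, 4],
    [1, 2, 1],
    [3, 3, 4],
])
print(round(cronbach_alpha(data), 3))
```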
How do you calculate test-retest reliability? (x1)
What are some considerations to using this reliability measure? (x3)
Work out the correlation between scores on the same test by the same people done at two different times
Assumes stable trait, no practice effects/developmental changes
How can you get around practice effects in order to use test-retest type reliability? (x3)
Use alternate forms – equivalent difficulty and content, or
Parallel forms – the Mean and SD of each test must be the same, and both versions also need equal correlations with entirely different tests
Both rely on high Coefficient of Equivalence - the correlation between two versions of the same test
How do you calculate inter-rater reliability? (x1)
Give an example of this in practice
The correlation between scores given by two or more different examiners for the same test performance by the same person
As part of tutor training, one essay given to all markers, then argue over the different marks until they reach agreement/consistency
Describe how test hetero-/homogeneity might influence your choice of reliability test (x3)
Give an example
Homogeneous – test items all measure the same thing, use internal consistency
Heterogeneous – more than one independent thing being measured (i.e. there are subscales that don’t intercorrelate highly), use test-retest instead (though you could instead look at the internal consistency of each subscale separately).
Big Five scale – Cronbach’s alpha would be very low, because you’re measuring five traits that aren’t supposed to relate to each other
Describe how static vs dynamic characteristics might influence your choice of reliability test (x2) Give examples (x3)
Are you measuring something that is supposed to be stable, or change, over time? State or trait?
If it’s dynamic, test-retest reliability would be problematic
Eg, data shows that intelligence doesn’t vary much over time, while fatigue does
And, trait anxiety is a background level, longer term trait, whereas state is moment by moment – which you would expect to change
Describe how restriction of range/variance might influence your choice of reliability test (x2) Give examples (x2)
Inappropriate restriction of the amount that the scores in our sample can vary will affect (likely reduce) the correlation (and therefore ALL our reliability estimates – as they are all based on correlation)
Sensation-seeking varies with age – testing people who are all the same age risks restricting the range and artificially reducing the correlation
Testing the reliability of an IQ test using only clever people will reduce the range
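A quick simulation of the effect (made-up bivariate normal data, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(0, 1, 100_000)
y = 0.7 * x + rng.normal(0, (1 - 0.7 ** 2) ** 0.5, 100_000)  # true r ≈ .70

full = np.corrcoef(x, y)[0, 1]
keep = x > 1  # restrict the range: only high scorers sit the test
restricted = np.corrcoef(x[keep], y[keep])[0, 1]
print(round(full, 2), round(restricted, 2))  # ≈ 0.70 vs ≈ 0.40
```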
Describe how using speed vs power tests might influence your choice of reliability test (x3)
Give an example
Speed tests measure how quickly you respond; power tests measure the level of difficulty you can handle
Internal consistency is inappropriate for speed tests – it assumes a fair go at all items, which you don’t get (you just stop when the clock runs out)
Use alternate-forms or test-retest reliability instead, or split-half (where you administer the two halves of the test, timed separately)
Eg: for each row, determine whether the first number is the same as or different from the second, writing D if the numbers are different. The final score is how many rows you get through in the given time
Describe how using criterion referenced tests might influence your choice of reliability test (x2)
Give an example
May be very little variation in responses
This is restriction of range - No variation? No correlation, so reliability tests don’t apply
Eg pass/fail tests where virtually everyone passes, ie diving cert exam, first-aid
What is the relationship between the number of items in a test, and the reliability coefficient? (x2)
Cronbach’s alpha gets bigger the more items in the test – a measure is more reliable with more items measuring the same thing
So much so that you can estimate/predict by how much, using the Spearman-Brown Adjusted Reliability formula
BUT, diminishing returns – doubling the test doesn’t double the reliability
How do you calculate the Spearman Brown Adjusted Reliability (rSB)?
Plus eg
rSB = (n × rxx) / (1 + (n − 1) × rxx), where
n = number of items in new test divided by number of items in old test, and
rxx = reliability (correlation) of the original test before adjusting
Reliability of original test: alpha = .67
Increase the number of items in test from 10 to 15
• n = 15/10 = 1.5
• New estimated r = (1.5 × .67)/(1 + (0.5 × .67))
• = 1.005/1.335
• = 0.75
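The same worked example, checked in code:

```python
n, r_old = 15 / 10, 0.67
r_new = (n * r_old) / (1 + (n - 1) * r_old)
print(round(r_new, 2))  # 0.75
```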
What does the Neale Analysis of Reading Ability measure? (x1)
How is it scored/referenced? (x4)
Oral reading, comprehension, and fluency of students aged 6 to 12 years.
Children read stories out loud, then complete a comprehension test.
Administrator notes down errors and time taken, to give measures of reading accuracy, rate, and comprehension.
Norm-referenced aptitude test - because scores are standardised against other children of the same age.
Raw score is converted into percentiles and stanines using tables that they provide with the test.
What is the standard error of the difference used for? (x5)
Plus eg application (x2)
To work out whether someone’s score is significantly different from:
Their own score on the same test at a different time
Their score on another test of the same thing
Someone else’s score on the same test
Someone else’s score on another test
Client with depression, tested at the beginning and then post-intervention – a significant difference demonstrates change unlikely to be due to chance/natural variation
How do you calculate the Standard Error of Measurement? (x3)
Plus eg
sx = standard deviation of test-takers’ (lots of people not one individual) scores
rxx = reliability of test
SEM = sx times sqrt(1 - rxx)
WAIS IQ score of 105, SD of all IQ tests is 15, reliability is .98
• 15 x root(1-.98) = 2.12
Give an example of a real-life situation where the CI is used (x4)
IQ scores:
Less than 70 counts as an intellectual disability
But if they score 71, their real IQ could lie anywhere in the range 67 – 75
Which makes people want access to the extra support that’s offered
What are two formula methods for calculating the SEdiff?
SEdiff = sqrt(SEM1² + SEM2²)
Where SEM1 and SEM2 are the standard errors of measurement for tests 1 and 2
SEdiff = sd × sqrt(2 − r1 − r2)
Where sd = standard deviation of test 1 (= standard deviation of test 2, because they’ve been standardised); and r1 & r2 = reliabilities of tests 1 & 2
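Both formulas as a short sketch; note they agree when each SEM comes from sd × sqrt(1 − r), shown here with the WAIS-style numbers used elsewhere in these cards:

```python
import math

def se_diff_from_sems(sem1, sem2):
    return math.sqrt(sem1 ** 2 + sem2 ** 2)

def se_diff_from_reliabilities(sd, r1, r2):
    return sd * math.sqrt(2 - r1 - r2)

sd, r = 15, 0.98
sem = sd * math.sqrt(1 - r)
print(round(se_diff_from_sems(sem, sem), 2))           # 3.0
print(round(se_diff_from_reliabilities(sd, r, r), 2))  # 3.0
```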
How do you apply the SEdiff? (x3) Plus eg (x1)
95% CI means there need to be at least 2 x SEdiff between the two individual scores (because of normal distribution)
If so, you can say they are significantly different at the 95% level (95% confident the difference is not due to measurement error)
ie change in client’s happiness score needs to be more than 2 SEdiffs for it to be significant
If you had a dataset of test scores where the range of scores was substantially restricted, which estimate of reliability would be affected (assuming the data could in principle yield statistics on all reliability estimates)? (x4)
Internal consistency
Alternate forms
Test-retest
Inter-rater reliability.
A man has a motorcycle crash that involves a closed-head injury.
Before his crash, he completed an intelligence test where he scored 40 (mean 50, standard deviation 10).
After his crash, he completed the same intelligence test and scored 35.
If the standard error of the difference is 3 points for the test, then what is the best way to describe the data?
Why? (x3)
His intelligence has not significantly changed after the crash (95% confidence)
If the SEdiff is 3, then the man’s intelligence score needs to decrease by at least twice this to represent a statistically significant difference (95% confidence)
i.e. it needs to have decreased by 6 or more.
In fact, it’s only decreased by 5 (40 to 35) - so this change is not significant.
True or false, and why? (x1)
As part of the process of calculating Cronbach’s alpha, you have to split the questionnaire into two halves, calculate the total score for each half, and then multiply the total scores together.
False
You average the correlations – you don’t multiply them
True or false, and why? (x3)
As part of the process of calculating Cronbach’s alpha, you have to adjust for the homogeneity of the test by applying a special version of the Spearman-Brown formula.
False
You do indeed have to apply a version of the Spearman Brown formula,
But not because of adjusting for the homogeneity of the test. It’s because a half length test is likely to be less reliable than a full length test.
According to Classical Test Theory, if a test has very high reliability then… (x1)
Because…(x2)
Virtually all of the total variance must be accounted for by true variance
If virtually all of the total variance is accounted for by true variance, this means measurement error must be very low,
So reliability is high.
Two students score 89 and 94 in a multiple-choice IQ test, which has been shown to have a standard error of measurement of 3. The mean of the test is 85 and the standard deviation is 15. Are their scores significantly different (95% level of confidence)? (x1)
Why? (x4)
No
Using the formula for SEdiff, we put in the SEM of 3: square root of (3 squared + 3 squared) = square root of 18 = 4.24.
To be significant, the difference between the students’ scores must be more than twice the SEdiff (i.e. 2 x 4.24 = 8.48).
The actual difference (94 - 89 = 5) is less than this.
Therefore their scores are not significantly different
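Checked in code:

```python
import math

se_diff = math.sqrt(3 ** 2 + 3 ** 2)  # ≈ 4.24
print(abs(94 - 89) > 2 * se_diff)     # False: 5 < 8.48, not significant
```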
If you had a heterogeneous test, which estimate of reliability should you AVOID?
Why? (x2)
Internal consistency
If your measure is heterogeneous then internal consistency might be an inappropriate estimate of reliability
Because it assumes that all the items in your test are measuring the same thing.
What are alternate forms tests? (x2)
When there are two or more versions of a test that are equivalent in content and difficulty,
But may not have exactly the same means and standard deviation.
What are parallel forms tests? (x2)
When there are two or more versions of a test that are equivalent in content and difficulty,
And also have exactly the same means and standard deviations.
True or false and why? (x4)
When we calculate the confidence interval of an individual’s test score, we are assuming that their observed score must be their true score.
True
Key words that solve this are “individual’s score” and “calculation” -
To CALCULATE the confidence interval of an INDIVIDUAL’S test score (when we add and subtract twice the standard error of measurement),
We have to assume the true score is in the middle of the distribution
(we’re doing this whether we like it or not when we assume the confidence interval is symmetrical around their observed score)
True or false and why? (x3)
If a revised version of a test is found to be more unreliable, then this will increase the Standard Error of Measurement (assuming that the standard deviation of the revised test is the same as the original).
True
Increased test unreliability will be associated with an increased Standard Error of Measurement
(remembering that the formula for estimating SEM uses test reliability and test standard deviation - where the latter remains unchanged in this question).
True or false and why? (x2)
The Neale Analysis of Reading involves children being told a word and then pointing to the picture (in an array of four pictures) that corresponds to that word.
False
Because the statement describes the Peabody Picture Vocabulary Test,
Not the Neale Analysis of Reading (where the description doesn’t actually involve testing reading).
True or false?
The Neale Analysis of Reading includes scores based on accuracy and speed (amongst other things).
True
Imagine you had a questionnaire with 10 items and you were disappointed that its internal consistency was .69. What effect would adding another 10 items be predicted to have on the reliability? (x3)
Use the Spearman-Brown prediction formula shown in Lecture 3,
Where n = 20/10 = 2 (the 20-question-long new test is double the length of the 10-question-long existing test) and
rxx = .69
So the new reliability will be: (2 x .69)/(1 + ((2-1) x .69)) = .82
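And checked in code:

```python
n, r_old = 20 / 10, 0.69
print(round((n * r_old) / (1 + (n - 1) * r_old), 2))  # 0.82
```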