Wk 4 - Reliability Flashcards

1
Q

What is classical test theory?

What equation does it give us for the relationship between scores? (x3)

A

It’s the conceptual basis for psychometrics
The observed score = True score + Error of measurement
X = T + E

2
Q

What is true score theory? (x1)

A

Another name for Classical Test theory

3
Q

What is reliability in terms of the relationship between true and total variance? (x4)

A

According to Classical Test Theory,
r = true variance (the hypothetical variation of test scores in a sample if there were no measurement error) divided by total variance (the actual variation in the data, including error)
r = sigma²(T) over sigma²(X)
Therefore measurement error is inversely related to reliability – lower measurement error = higher reliability

4
Q

Why do we describe classical test theory in terms of variance rather than standard deviations? (x2)

A

Because variance is additive and can be broken up into its components
Whereas SD can’t

5
Q

What are four sources of measurement error?

A

Test construction
Test administration
Test scoring
Other influences

6
Q

What is item sampling/content sampling? (x2 plus egs)

A

Content sampling would be testing every aspect of the trait/skill
eg all the content of the course over 24 hours
Item sampling is testing a representative proportion of it
eg the 2-hour exam

7
Q

Why can we only estimate the reliability of a test and not measure it directly? (x2)

A

Because true variance is a hypothetical/theoretical construct -
We can’t measure everyone on the planet and work it out

8
Q

Name and describe four methods available to us to help estimate the reliability of a test.

A

Test-retest – how do scores correlate if people sit the same test twice?
Alternate-forms – how do scores correlate if people do two different versions of same test
Internal consistency – how much do the items in a test correlate with each other, on average? (Cronbach’s alpha, KR-20)
Inter-rater reliability – check the correlation on two/more different examiner ratings

9
Q

Describe the steps involved in calculating Cronbach’s alpha by hand.

A

Split questionnaire in half
Calculate total score from items in each half
Work out correlation between those totals (the two halves)
Repeat steps 1-3 for all possible two-way splits of the total number of items
Work out the average of all the possible split-half correlations
Adjust the correlation to account for the fact that you’ve shortened (halved) the test – a special version of the Spearman-Brown formula (the fewer the items, the lower the reliability correlation; so when you cut the test in half you’re artificially lowering the reliability)
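The steps above can be sketched in Python (a minimal sketch using NumPy; the function name is my own, and it assumes an even number of items so the test splits into equal halves):

```python
from itertools import combinations

import numpy as np

def splithalf_alpha(scores):
    """Estimate Cronbach's alpha as the Spearman-Brown-corrected
    mean of all possible split-half correlations.

    scores: (n_people, n_items) array-like of item scores."""
    scores = np.asarray(scores, dtype=float)
    n_items = scores.shape[1]
    half = n_items // 2                          # assumes an even item count
    rs = []
    for first in combinations(range(n_items), half):
        second = [i for i in range(n_items) if i not in first]
        a = scores[:, list(first)].sum(axis=1)   # total score, half 1
        b = scores[:, second].sum(axis=1)        # total score, half 2
        rs.append(np.corrcoef(a, b)[0, 1])       # correlate the two halves
    mean_r = np.mean(rs)                         # average over all splits
    return 2 * mean_r / (1 + mean_r)             # Spearman-Brown correction
```

Perfectly consistent items (every person scoring the same pattern on all items) give an alpha of 1.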

10
Q

How is the KR-20 calculated?

A

As with Cronbach’s alpha, this formula gives you the same estimate as if you’d worked out the mean of the correlations between all possible halves of your questionnaire (then corrected for halving)
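As a sketch, the standard direct KR-20 formula – (k/(k-1)) x (1 - sum of item variances p x q over total-score variance) – can be written as follows (the function name is my own):

```python
import numpy as np

def kr20(scores):
    """KR-20 for dichotomous (0/1) items.

    scores: (n_people, n_items) array-like of 0s and 1s."""
    X = np.asarray(scores, dtype=float)
    k = X.shape[1]                       # number of items
    p = X.mean(axis=0)                   # proportion scoring 1 on each item
    q = 1 - p                            # proportion scoring 0
    var_total = X.sum(axis=1).var()      # variance of total scores
    return (k / (k - 1)) * (1 - (p * q).sum() / var_total)
```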

11
Q

What’s the difference between parallel forms and alternate forms? (x3)

A

Both give correlation between scores on 2 versions of same test by same people at same time, but
Alternate forms just needs high Coefficient of Equivalence, whereas
Parallel also requires that Mean, SD and correlations with other tests must be the same

12
Q

What is the coefficient of equivalence in the context of a test with parallel forms? (x3)

A

The correlation between two versions of the same test
Applies to parallel and alternate
Also another term for the reliability coefficient used in these methods

13
Q

List five considerations that might affect which reliability estimate you can use

A
  1. Homogeneity/heterogeneity of the test
  2. Static vs dynamic characteristics
  3. Restriction of range/variance
  4. Speed tests versus power tests
  5. Criterion-referenced tests
14
Q

What is a homogeneous test? (x1)

A

One in which the test items all measure the same thing.

15
Q

What is a heterogeneous test? (x2)

A

One in which more than one independent thing is being measured

i.e. there are subscales that don’t intercorrelate highly

16
Q

Describe in detail exactly what the standard error of measurement is supposed to represent (x4)

A

The reliability of a test will never be 100%
So knowing the margin of error is critical in interpreting the meaning of an individual’s scores
Assuming a normal distribution of those test scores, with the person’s true score at the centre, the SEM is the SD of that distribution
Mostly hypothetical – we use the one time they did the test to estimate what the spread of scores would be if one person took the same test many times

17
Q

What is the CI? (x1)
Why do we have to add and subtract DOUBLE the SEM from an individual’s score in order to get the 95% confidence interval? (x4)
Plus eg calculation

A

The range of scores that is likely to contain a person’s true score
Because under a normal distribution (assumed to be the case):
68% of scores are within +/- 1 SD/SEM of the mean, while
95% are within +/- 1.96
(99.7% are within 3)
WAIS IQ score of 105, SD of all IQ tests is 15, reliability is .98
SEM = 15 x sqrt(1-.98) = 2.12
CI = [105 +/- (2*2.12)] from 101 to 109
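The worked example above, checked in Python:

```python
import math

sd, rxx, score = 15, 0.98, 105              # WAIS example from this card
sem = sd * math.sqrt(1 - rxx)               # standard error of measurement
lo, hi = score - 2 * sem, score + 2 * sem   # ~95% CI: 2 SEMs either side
print(round(sem, 2), round(lo), round(hi))  # 2.12 101 109
```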

18
Q

What is the reliable change index? (x1)
How do you calculate? (x1)
And how do you apply? (x1)

A

In clinical practice, a variation on the SEdiff approach which is mathematically equivalent to it
Work out the diff between two scores (eg change during intervention), and divide by the SEdiff
If the RCI is greater than 1.96 (ie 2 standard errors of the difference) you have statistically significant change
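A minimal sketch of the calculation (assuming the same test at both time points, so SEdiff = sqrt(2 x SEM²); the function name is my own):

```python
import math

def rci(score_before, score_after, sem):
    """Reliable Change Index: the score difference divided by the
    standard error of the difference (same test twice, so
    SEdiff = sqrt(SEM^2 + SEM^2))."""
    se_diff = math.sqrt(2 * sem ** 2)
    return (score_after - score_before) / se_diff

# |RCI| > 1.96 means statistically significant change (95% level)
print(abs(rci(40, 35, 3)) > 1.96)  # False: a 5-point drop is not enough
```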

19
Q

True or false?
When calculating the Cronbach’s alpha, you have to multiply the correlations derived from all the possible split-half correlations of the items
And why? (x1)

A

False, you average them

20
Q

True or false?
As part of the process of calculating Cronbach’s alpha, you need to adjust for the halving of the number of items by applying a special version of the Spearman Brown formula.
And why? (x2)

A

True, because the reliability correlation will be lower the fewer items you have
So halving the test is artificially lowering it

21
Q

True or false?
Imagine we revised a test and as a result the true variance became a greater proportion of the total variance. This would mean that the standard deviation of scores of a single person taking the test multiple times would become smaller
And why? (x2)

A

True
Because:
Total variation = True (systematic) variation + Error (unsystematic) variation
So if True becomes a greater proportion of the same Total, Error must be shrinking (and with it the SD – the root of the variance)

22
Q

True or false?
The fact that I cannot ask students everything about the course in the exam will decrease the proportion of the observed score that can be accounted for by the true score (assuming the exam mark is supposed to reflect students’ PSYC3020 knowledge)
And why? (x2)

A

True
Because:
Score obtained = True measurement + Error
So, the inability to ask about ‘everything’ will increase the amount of error in the test; therefore, to retain the same observed score, the true score must go down.

23
Q

How can we use variance to explain the relationship between true scores and measurement error, according to Classical Test theory? (x2)
How would this relate to SD of scores? (x1)

A

X = T + E
Total Variance = True variance + Error variance
SD is the root of variance, so the SD of score obtained would be influenced by SD of True and Error too

24
Q

How does test construction affect measurement error? (x2)

Give an example (x1)

A

Whether you can access all of the trait, or only some of it (content vs item sampling)
ie error increases the fewer items you have that tap into what you’re measuring
A two-hour exam vs a 24-hour one that asked about absolutely everything

25
Q
How could test administration affect measurement error? (x1)
Give examples (x3)
A

Administering the test under different conditions could affect performance and therefore error
Distractions, fatigue, experimenter attitude

26
Q
How could test scoring affect measurement error?
Give examples (x3)
A

If scoring methods aren’t standardised, error would increase

Biased examiners, ambiguous scoring methods, technical errors

27
Q

Give examples of Other Influences on measurement error (x2)

A

Self-efficacy

Motivational factors

28
Q

What is the relationship of variance to reliability? (x2)

A

Higher error variance means a wider spread of observed scores,

Indicating lower reliability of the test (more measurement error)

29
Q

What are two other names for internal consistency?

What does it tell us? (x2)

A

Inter-item consistency, internal coherence
The average correlation between items on your scale
Are responses consistent to items that supposedly measure the same thing?

30
Q

What are two measures of internal consistency?

When should you apply each one?

A

Cronbach’s alpha, when there are more than two possible outcomes for an item (eg likert scales)
Kuder-Richardson-20 (KR-20), when there are only two possible outcomes (yes/no, correct/incorrect, true/false)

31
Q

If you had four possible answers on an exam question, would you use Cronbach’s or KR-20 to calculate internal consistency? (x1)
And why? (x1)

A

KR-20

Because there are still only two possible outcomes - wrong or right

32
Q

What are the steps to using SPSS to calculate Cronbach’s alpha for internal consistency? (x6)
How does this differ for KR-20?

A

Select ANALYZE; SCALE; RELIABILITY ANALYSIS; select model ‘Alpha’.
Select all the items in your scale.
Click OK.
It doesn’t - if you have dichotomous variables, it automatically chooses the right formula

33
Q

How do you calculate test-retest reliability? (x1)

What are some considerations to using this reliability measure? (x3)

A

Work out the correlation between scores on the same test by the same people done at two different times
Assumes stable trait, no practice effects/developmental changes

34
Q

How can you get around practice effects in order to use test-retest type reliability? (x3)

A

Use alternate forms – equivalent difficulty and content, or
Parallel forms - Mean, SD of each test must be the same, but both versions also need equal correlation with entirely different tests
Both rely on high Coefficient of Equivalence - the correlation between two versions of the same test

35
Q

How do you calculate inter-rater reliability? (x1)

Give an example of this in practice

A

The correlation between scores on the same test by the same person provided by 2 different examiners
As part of tutor training, one essay given to all markers, then argue over the different marks until they reach agreement/consistency

36
Q

Describe how test hetero-/homogeneity might influence your choice of reliability test (x3)
Give an example

A

Homogeneous – test items all measure the same thing, use internal consistency
Heterogeneous – more than one independent thing being measured (i.e. there are subscales that don’t intercorrelate highly), use test-retest instead (though you could instead look at the internal consistency of each subscale separately).
Big Five scale – Cronbach’s alpha would be very low, because you’re measuring five traits that aren’t supposed to relate to each other

37
Q
Describe how static vs dynamic characteristics might influence your choice of reliability test (x2)
Give examples (x3)
A

Are you measuring something that is supposed to be stable, or change, over time? State or trait?
If it’s dynamic, test-retest reliability would be problematic
Eg, data shows that intelligence doesn’t vary much over time, while fatigue does
And, trait anxiety is a background level, longer term trait, whereas state is moment by moment – which you would expect to change

38
Q
Describe how restriction of range/variance might influence your choice of reliability test (x2)
Give examples (x2)
A

Inappropriate restriction of the amount that the scores in our sample can vary will affect (likely reduce) the correlation (and therefore ALL our reliability estimates – as they are all based on correlation)
Sensation-seeking increases with age – testing people who are all the same age potentially restricts the range and artificially reduces the correlation
Testing the reliability of an IQ test using only clever people will also reduce the range

39
Q

Describe how using speed vs power tests might influence your choice of reliability test (x3)
Give an example

A

Speed tests measure how quickly you respond; power tests measure the level of difficulty you can handle
Internal consistency is inappropriate for speed tests – it assumes a fair go at all items, which you don’t get (you just stop when the clock runs out)
Use alternate-forms or test-retest reliability instead, or split-half (where you administer the two halves of the test, timed separately)
Eg: for each row, determine whether the first number is the same as or different from the second number; write D if the numbers are different. The final score is how many rows you get through in the given time

40
Q

Describe how using criterion referenced tests might influence your choice of reliability test (x2)
Give an example

A

May be very little variation in responses
This is restriction of range - No variation? No correlation, so reliability tests don’t apply
Eg pass/fail tests where virtually everyone passes, like a diving certification exam or a first-aid test

41
Q

What is the relationship between the number of items in a test, and the reliability coefficient? (x2)

A

Cronbach’s alpha gets bigger the more items are in the test – a more reliable measure with more items measuring the same thing
So much so, that you can estimate/predict by how much, using the Spearman-Brown Adjusted Reliability formula
BUT, diminishing returns - doubling test doesn’t double reliability

42
Q

How do you calculate the Spearman Brown Adjusted Reliability (rSB)?
Plus eg

A

rSB = (n x rxx) / (1 + (n - 1) x rxx), where
n = number of items in new test divided by number of items in old test, and
rxx = reliability of the original test (correlation) before adjusting
Eg: reliability of original test: alpha = .67
Increase the number of items in the test from 10 to 15
• n = 15/10 = 1.5
• New estimated r = (1.5 x .67)/(1 + (0.5 x .67))
• = 1.005/1.335
• = 0.75
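The formula and worked example above, in Python (the function name is my own):

```python
def spearman_brown(rxx, n):
    """Predicted reliability after changing test length by a factor of n."""
    return (n * rxx) / (1 + (n - 1) * rxx)

# From above: a 10-item test with alpha = .67 extended to 15 items
print(round(spearman_brown(0.67, 15 / 10), 2))  # 0.75
```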

43
Q

What does the Neale Analysis of Reading Ability measure? (x1)
How is it scored/referenced? (x4)

A

Oral reading, comprehension, and fluency of students aged 6 to 12 years.
Children read stories out loud, then complete a comprehension test.
Administrator notes down errors and time taken, to give measures of reading accuracy, rate, and comprehension.
Norm-referenced aptitude test - because scores are standardised against other children of the same age.
Raw score is converted into percentiles and stanines using tables that they provide with the test.

44
Q

What is the standard error of the difference used for? (x5)

Plus eg application (x2)

A

To work out whether someone’s score is significantly different from:
Their own score on the same test at a different time
Their score on another test of the same thing
Someone else’s score on the same test
Someone else’s score on another test
Eg a client with depression: test at the beginning, then post-intervention – significance demonstrates the change is unlikely to be due to chance/natural variation

45
Q

How do you calculate the Standard Error of Measurement? (x3)

Plus eg

A

sx = standard deviation of test-takers’ (lots of people not one individual) scores
rxx = reliability of test
SEM = sx times sqrt(1 - rxx)
WAIS IQ score of 105, SD of all IQ tests is 15, reliability is .98
• 15 x root(1-.98) = 2.12

46
Q

Give an example of RL situation where the CI is used (x4)

A

IQ scores:
Less than 70 counts as an intellectual disability
But if they score 71, the CI for their real IQ spans the range 67 to 75
So they may still be eligible for the extra support that’s offered

47
Q

What are two formula methods for calculating the SEdiff?

A

SEdiff = sqrt(sq[SEM1] + sq[SEM2])
Where SEM1 and 2 are the standard error of measurement for tests 1 and 2
SEdiff = sd times sqrt(2 - r1 - r2)
Where sd = standard deviation of test 1 (= standard deviation of test 2 because they’ve been standardized); and r1 & r2 = reliability of tests 1 & 2
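A sketch showing that the two formulas agree when both tests have the same (standardized) SD – function names are my own:

```python
import math

def se_diff_from_sems(sem1, sem2):
    """SEdiff from the two tests' standard errors of measurement."""
    return math.sqrt(sem1 ** 2 + sem2 ** 2)

def se_diff_from_reliabilities(sd, r1, r2):
    """SEdiff from the shared SD and the two reliabilities."""
    return sd * math.sqrt(2 - r1 - r2)

sd, r1, r2 = 15, 0.98, 0.98
sem = sd * math.sqrt(1 - r1)     # SEM is the same for both tests here
print(round(se_diff_from_sems(sem, sem), 2))             # 3.0
print(round(se_diff_from_reliabilities(sd, r1, r2), 2))  # 3.0
```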

48
Q
How do you apply the SEdiff? (x3)
Plus eg (x1)
A

95% CI means there need to be at least 2 x SEdiff between the two individual scores (because of normal distribution)
If so, you can say they are significantly different at 95% CI (95% not likely due to measurement error)
ie change in client’s happiness score needs to be more than 2 SEdiffs for it to be significant

49
Q

If you had a dataset of test scores where the range of scores was substantially restricted, which estimate of reliability would be affected (assuming the data could in principle yield statistics on all reliability estimates)? (x4)

A

All of them:
Internal consistency
Alternate forms
Test-retest
Inter-rater reliability.

50
Q

A man has a motorcycle crash that involves a closed-head injury.
Before his crash, he completed an intelligence test where he scored 40 (mean 50, standard deviation 10).
After his crash, he completed the same intelligence test and scored 35.
If the standard error of the difference is 3 points for the test, then what is the best way to describe the data?
Why? (x3)

A

His intelligence has not significantly changed after the crash (95% confidence)
If the SEdiff is 3, then the man’s intelligence score needs to decrease by at least twice this to represent a statistically significant difference (95% confidence)
i.e. it needs to have decreased by 6 or more.
In fact, it’s only decreased by 5 (40 to 35) - so this change is not significant.

51
Q

True or false, and why? (x1)
As part of the process of calculating Cronbach’s alpha, you have to split the questionnaire into two halves, calculate the total score for each half, and then multiply the total scores together.

A

False

You average the correlations – you don’t multiply them

52
Q

True or false, and why? (x3)
As part of the process of calculating Cronbach’s alpha, you have to adjust for the homogeneity of the test by applying a special version of the Spearman-Brown formula.

A

False
You do indeed have to apply a version of the Spearman Brown formula,
But not because of adjusting for the homogeneity of the test. It’s because a half length test is likely to be less reliable than a full length test.

53
Q

According to Classical Test Theory, if a test has very high reliability then… (x1)
Because…(x2)

A

Virtually all of the total variance must be accounted for by true variance
If virtually all of the total variance is accounted for by true variance, this means measurement error must be very low,
So reliability is high.

54
Q

Two students score 89 and 94 in a multiple-choice IQ test, which has been shown to have a standard error of measurement of 3. The mean of the test is 85 and the standard deviation is 15. Are their scores significantly different (95% level of confidence)? (x1)
Why? (x4)

A

No
Using the formula for SEdiff, we put in the SEM of 3: square root of (3 squared + 3 squared) = square root of 18 = 4.24.
To be significant, the difference between the students’ scores must be more than twice the SEdiff (i.e. 2 x 4.24 = 8.48).
The actual difference (94 - 89 = 5) is less than this.
Therefore their scores are not significantly different

55
Q

If you had a heterogeneous test, which estimate of reliability should you AVOID?
Why? (x2)

A

Internal consistency
If your measure is heterogeneous then internal consistency might be an inappropriate estimate of reliability
Because it assumes that all the items in your test are measuring the same thing.

56
Q

What are alternate forms tests? (x2)

A

When there are two or more versions of a test that are equivalent in content and difficulty,
But may not have exactly the same means and standard deviation.

57
Q

What are parallel forms tests? (x2)

A

When there are two or more versions of a test that are equivalent in content and difficulty,
And also have exactly the same means and standard deviation.

58
Q

True or false and why? (x4)
When we calculate the confidence interval of an individual’s test score, we are assuming that their observed score must be their true score.

A

True
Key words that solve this are “individual’s score” and “calculation” -
To CALCULATE the confidence interval of an INDIVIDUAL’S test score (when we add and subtract twice the standard error of measurement),
We have to assume the true score is in the middle of the distribution
(we’re doing this whether we like it or not when we assume the confidence interval is symmetrical around their observed score)

59
Q

True or false and why? (x3)
If a revised version of a test is found to be more unreliable, then this will increase the Standard Error of Measurement (assuming that the standard deviation of the revised test is the same as the original).

A

True
Increased test unreliability will be associated with an increased Standard Error of Measurement
(remembering that the formula for estimating SEM uses test reliability and test standard deviation - where the latter remains unchanged in this question).

60
Q

True or false and why? (x2)
The Neale Analysis of Reading involves children being told a word and then pointing to the picture (in an array of four pictures) that corresponds to that word.

A

False
Because the statement describes the Peabody Picture Vocabulary Test,
Not the Neale Analysis of Reading (where the description doesn’t actually involve testing reading).

61
Q

True or false?

The Neale Analysis of Reading includes scores based on accuracy and speed (amongst other things).

A

True

62
Q

Imagine you had a questionnaire with 10 items and you were disappointed that its internal consistency was .69. What effect would adding another 10 items be predicted to have on the reliability? (x3)

A

Use the Spearman-Brown prediction formula shown in Lecture 3,
Where n = 20/10 = 2 (the 20-question-long new test is double the length of the 10-question-long existing test) and
rxx = .69
So the new reliability will be: (2 x .69)/(1 + ((2-1) x .69)) = .82