Lecture 3 - Reliability Flashcards

1
Q

Define reliability.

A

The extent to which a measurement tool gives consistent measurements.

2
Q

What is reliability?

A

Consistency in measurement.

3
Q

What is Classical Test Theory?

A

The concept that any actual/observed score is a combination of an individual’s true score and measurement error.

4
Q

Classical Test Theory is the traditional conceptual basis of psychometrics. T/F

A

True

5
Q

What is True Score Theory?

A

Another name for classical test theory.

6
Q

What is a true score?

A

The aspect of what we want to measure: the underlying behaviour or trait captured by our measurement (e.g. real intelligence, or real level of extroversion).

7
Q

What is measurement error?

A

Everything captured within our observed score that isn’t what we wanted to measure.

8
Q

If a whole egg was your observed score, what is the true score and measurement error?

A

Egg yolk - true score (the core of what we want to measure: intelligence, ability, etc.)
Egg white - measurement error

9
Q

Observed score is not fallible. T/F

A

False. Observed scores are fallible; they contain measurement error.

10
Q

True score is an ideal measurement (perfect and consistent) and constant for an individual. T/F

A

True.

11
Q

Errors of measurement are random and unrelated to the true score. T/F
They can easily be eliminated. T/F

A

True.

False. Cannot be eliminated entirely.

12
Q

How can the classical test theory (X=T+E) be described in terms of variation between people or more specifically, with variance?

A

Total Variation (X) = True Variation (T) (systematic) + Error Variation (E) (unsystematic)

13
Q

What is reliability in terms of the relationship between true and total variance?

A

Reliability is the proportion of the true score variance divided by the total score variance.
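
In symbols (standard classical test theory notation; the card gives this only in words):

```latex
r_{xx} = \frac{\sigma^2_T}{\sigma^2_X} = \frac{\sigma^2_T}{\sigma^2_T + \sigma^2_E}
```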

14
Q

Variance is a way of measuring variation, and it is standard deviation squared. T/F

A

True.

15
Q

Why do we describe classical test theory in terms of variance rather than standard deviations?

A

Variance is additive, standard deviation is not.

16
Q

Give an example of X=T+E in terms of driving.

A

Total variation - variation in scores from a questionnaire measuring low and high speeders.
True variation - people's actual speeding behaviour.
Error variation - errors due to the questionnaire not reflecting people's actual choices.

17
Q

True variance.

A

Hypothetical variation of test scores in a sample if there is no measurement error.

18
Q

Total variance.

A

Actual variation in data, including error variation.

19
Q

Lower measurement error = Higher reliability

Higher measurement error = Lower reliability

T/F

A

True.

20
Q

If a person took the same test multiple times and their scores were spread out, would this be considered low or high reliability? Why?

A

Low reliability, because their scores were inconsistent.

21
Q

Describe the various sources of measurement error.

A

Test construction (e.g. item sampling/content sampling: a test cannot ask about every piece of content, so some people may know the particular subset of questions included in an exam by luck)

Test administration (e.g. whether there were any distracting noises when the test was administered)

Test scoring (e.g. stringent or lenient markers, biased examiners)

Other influences: motivation, self-efficacy, etc.

22
Q

What is item sampling/content sampling?

A

The sample of items, drawn from the whole content domain of the construct being assessed, that is included in a particular test or measure.

23
Q

Why can we only estimate the reliability of a test and not measure it directly?

A

Because true variance is hypothetical and cannot be measured directly - therefore we can only infer reliability.

24
Q

Four methods available to help ESTIMATE reliability of a test.

A

Internal consistency; test-retest; alternate/parallel forms; inter-rater reliability.

25
Q

How much the item scores in a test correlate with one another on average (e.g. Cronbach’s alpha, KR-20).

A

Internal consistency.

26
Q

If a test involves an examiner making a rating - get two of them to do the rating independently and see how much their ratings correlate.

A

Inter-rater reliability.

27
Q

If people sit the same test twice, how much do their scores correlate between the two sittings?

A

Test-retest reliability.

28
Q

If people do two different versions of the same test, how much do their scores on the two versions correlate?

A

Alternate-forms reliability.

29
Q

Internal consistency.

A

Conceptually, this is the average correlation between the items on your scale. If all the items on a questionnaire are measuring the same thing, do individuals give consistent responses?

30
Q

What are alternative names for internal consistency?

A

Inter-item consistency

Internal coherence

31
Q

High internal consistency.

A

Scores from the items in a questionnaire that measure the same thing are consistent.

32
Q

Low internal consistency.

A

Scores are inconsistent. Unreliable test.

33
Q

What is Cronbach’s alpha a measure of?

A

Internal consistency.

34
Q

When should you use Cronbach’s alpha?

A

When there are more than two possible outcomes to a question (i.e. when items are not dichotomous).

35
Q

How do you calculate Cronbach’s alpha in SPSS?

A

Select Analyze > Scale > Reliability Analysis; set the model to ‘Alpha’.
Select all the items in your scale.
Click OK.

36
Q

Describe the steps involved in calculating Cronbach’s alpha by hand.

A
  1. Split questionnaire in half.
  2. Calculate total score for each half.
  3. Compute bivariate correlation between total scores for each half
  4. Repeat with every possible split-halves of the questionnaire
  5. Work out the average of all split-half correlations
  6. Adjust the correlation using the Spearman-Brown formula.
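
The six steps above can be sketched in code. This is an illustrative sketch with hypothetical data and function names of my own; statistical packages compute alpha directly from item variances rather than by brute-force splitting:

```python
from itertools import combinations
from statistics import mean, stdev

def pearson(xs, ys):
    # Step 3: bivariate (Pearson) correlation between two lists of totals
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)
    return cov / (stdev(xs) * stdev(ys))

def spearman_brown(r):
    # Step 6: adjust a half-length correlation up to full-test length
    return 2 * r / (1 + r)

def split_half_alpha(responses):
    """Steps 1-5: correlate the totals of every possible split-half,
    average the correlations, then Spearman-Brown adjust the average.
    `responses` is one list of item scores per person (hypothetical data)."""
    n_items = len(responses[0])
    rs = []
    for half in combinations(range(n_items), n_items // 2):  # step 4: every split
        other = [i for i in range(n_items) if i not in half]
        a = [sum(p[i] for i in half) for p in responses]     # step 2: half totals
        b = [sum(p[i] for i in other) for p in responses]
        rs.append(pearson(a, b))
    return spearman_brown(mean(rs))                          # steps 5-6
```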
37
Q

What does the Spearman-Brown formula achieve in regards to measuring Cronbach’s alpha?

A

Splitting the questionnaire reduces its reliability, because you are reducing the number of items, and there is a relationship between the number of items and reliability. The Spearman-Brown formula corrects for this.

38
Q

What is KR-20?

A

Kuder-Richardson 20. A measure of internal consistency used when the answers are dichotomous (e.g. true/false, yes/no, correct/incorrect).

39
Q

An examination is multiple choice, with four possible responses per question. To work out the internal consistency of the examination, should you use Cronbach’s alpha or the Kuder-Richardson 20 formula?

A

KR-20. Even though you’ve got four options to choose between, there are only two ways an item can go: you either get it right or wrong (two outcomes).

40
Q

Test-retest reliability

A

Correlation between scores on the same test by the same people done at two different times.

41
Q

Give an example of when it is not appropriate to use the same test twice on the same person, and how you can fix this.

A

Hazard perception test - video clip - people might recognise the hazard within the video when re-watching. Can use alternate forms of the same test (different but equivalent stimuli), and counterbalance the two versions (e.g. half the participants see version 1 then version 2, and vice versa).

42
Q

Parallel/Alternate-Forms reliability

A

The correlation between scores on 2 versions of the same test by the same people done at the same time.

43
Q

What is the difference between parallel forms and alternate forms?

A

Parallel forms have the same mean, standard deviation and correlation with other measures.

Alternate forms are only similar in content and difficulty.

44
Q

What is the coefficient of equivalence in the context of a test with parallel forms?

A

Correlation between two versions of the same test.

45
Q

Inter-rater reliability.

A

The correlation between scores on the same test by the same people provided by two different examiners.

46
Q

Which situations affect which reliability estimates you can use?

A
  1. Homogeneity/heterogeneity of the test
  2. Static vs dynamic characteristics
  3. Restriction of range/variance
  4. Speed tests versus power tests
  5. Criterion-referenced tests
47
Q

What is a homogenous test and a heterogeneous test?

A

A homogeneous test measures one variable (everything measures the same thing), whereas a heterogeneous test measures different variables, e.g. the DASS measures depression, anxiety and stress through three different subscales.

48
Q

Example of a homogenous test

A

Test of extroversion - all questions measuring extroversion, unless you had some underlying theory where you wanted to split extroversion into different subscales.

49
Q

Example of heterogenous test

A

Personality Inventory. Extroversion and Neuroticism. Measuring different personality traits.

50
Q

If your measure is heterogeneous then internal consistency might be an inappropriate estimate of reliability (though you could look at the internal consistency of each subscale separately). T/F

A

True.

51
Q

Give an example of a test measuring something that is static (remains the same over time).

A

Intelligence (relatively static)

52
Q

Give an example of a test measuring something that is dynamic (expected to change over time).

A

State of being - anxiety, fatigue etc.

53
Q

If a test is measuring something that is Dynamic, why would retest reliability be a problem?

A

Because test-retest reliability assumes that the thing being measured remains the same.

54
Q

What happens when there is a restriction of range/variance?

A

If the scores in our sample are inappropriately restricted in the amount they can vary, this will affect the correlation, and ALL our reliability estimates are based on correlations.

55
Q

What is the difference between a speed test and power test?

A

Speed tests focus on speed of response rather than level of difficulty, whereas power tests focus more on the difficulty of the items.

56
Q

Why wouldn’t it be appropriate to measure internal consistency for a speed test?

A

Because people tend to get all the questions they attempt correct, but they just don’t have the time to attempt all the questions. This gives an invalid correlation between the items.

57
Q

What estimate reliability test should you use for speed tests?

A

Use alternate forms or test-retest reliability

58
Q

What special considerations do we need to take into account when analysing the psychometric properties of a speed test compared with a power test?

A

The internal consistency is not a good estimate of reliability.

Test-retest or alternate form reliability measures are better.

59
Q

In some criterion-referenced tests, there may be very little variation in people’s responses. Is this an example of restriction of range?

A

Yes. In some pass/fail tests virtually everyone might pass

60
Q

Why would it be a problem in using reliability estimate calculations in some criterion-referenced tests?

A

Because there might be a restriction in range; with little or no variation in scores, any of the reliability estimates will be problematic, as they are all derived from assessing score variance.

61
Q

Reliability tends to increase when you have more items and decrease when you have fewer items. True/False

A

True

62
Q

How can we estimate how the reliability would change if our test is shortened or lengthened?

A

Spearman-Brown formula

63
Q

The Spearman-Brown formula is used to estimate reliability when a test is shortened or lengthened. What are rsb, n and rxx?

A
rsb = spearman brown adjusted reliability
n = number of times new test is longer than original (number of items in new test divided by number of items in old test)
rxx = reliability of original test (correlation) before adjusting
64
Q

What would the adjusted reliability be if you added another 15 items to an original test with 15 items and a reliability of 0.6?

A

rsb = 0.75
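
Cards 63–66 can be checked with a few lines of code (a sketch; the function name is mine):

```python
def spearman_brown(r_xx, n):
    # r_sb = n * r_xx / (1 + (n - 1) * r_xx)
    # n = new length / old length; r_xx = original reliability
    return n * r_xx / (1 + (n - 1) * r_xx)

# Adding 15 items to a 15-item test doubles its length, so n = 30 / 15 = 2
r_sb = spearman_brown(0.6, 2)   # -> 0.75
# Diminishing returns: quadrupling the length (n = 4) gives ~0.857, not 2 x 0.75
```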

65
Q

Doubling the length of a test is doubling the reliability. T/F

A

False

66
Q

Does increasing the number of items in a test have diminishing returns?

A

Yes

67
Q

What does the Neale Analysis of Reading Ability measure (aptitude test)?

A

Measures oral reading, comprehension, and fluency of students aged 6-12 years.
May also be used to diagnose reading difficulties in older readers.

68
Q

How is the Neale Analysis of Reading administered?

A

Participants are asked to read a selection of stories, then complete a comprehension test on the story.

Examiners note the number of errors and the time taken for participants to read the whole story. (The three performance measures: reading accuracy, reading rate and comprehension.)

69
Q

No test is perfectly reliable (it is an inexact measure of an individual’s underlying trait), so it is critical to know HOW INACCURATE the scores are likely to be in order to make sensible judgements. T/F

A

True.

70
Q

Imagine a client takes a test a number of times (assuming no practice effects). To work out the margin of error in the individual’s scores, what do we need to assume?

A

We need to assume that the distribution will be approximately normal (i.e. a bell shape).

71
Q

What does the standard error of measurement represent? And what does it tell us?

A

The standard deviation of the distribution of scores that would result if an individual took the test multiple times (i.e. how spread out the distribution is around the mean). It tells us the likely margin of error in the individual’s test scores.

72
Q

What does it mean if the standard error of measurement (SEM) is a big number?

A

The scores are really spread out. The more spread out your scores are, the less certain you are about what the real value of what you’re measuring is.

73
Q

If a client has only taken the test once, then the best guess of where the middle of the distribution is would be the actual score they obtained on the test. T/F

A

True.

74
Q

If the middle of the distribution is the actual score, how can we estimate the margin of error (confidence interval) in someone’s score?

A

By adding and subtracting the SEM (standard error of measurement) from their actual score.

75
Q

What percentage of an individual’s scores will be within 1 SEM of the true score (+/- 1 standard deviation)?

A

68%

76
Q

What percentage of an individual’s scores will be within 2 SEM of the true score (+/- 2 standard deviations)?

A

95%

77
Q

What percentage of an individual’s scores will be within 3 SEM of the true score (+/- 3 standard deviations)?

A

99.7%

78
Q

What is the margin of error that we usually report?

A

95%, +/- 2 standard deviations

79
Q

We can directly measure SEM, since we can test every client multiple times. T/F

A

False. We can only ever estimate. Hard to test one client thousands of times.

80
Q

How do you calculate the estimation of SEM (Standard Error of Measurement)?

A

The standard deviation of scores from a large group of test-takers, multiplied by the square root of (1 minus the reliability of the test).
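
In symbols (standard notation; the card gives the formula only in words):

```latex
SEM = SD \times \sqrt{1 - r_{xx}}
```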

81
Q

What assumptions do we make when estimating the SEM?

A

We assume that the SEM will be the same for everyone who takes the test (in real tests, they sometimes give a different SEM for different age groups etc. in manual)

82
Q

Why do we have to add and subtract double the SEM from an individual’s score in order to get the 95% confidence interval?

A

95% confidence interval is around 2 SEM from the individual’s score or the mean.

83
Q

Confidence Interval

A

Range of scores that is likely to contain a person’s true score (margin of error)

84
Q

When assuming a normal distribution, 95% confidence interval is +/- 2 SD (1.96 to be precise). T/F

A

True

85
Q

What does 95% confidence interval mean?

A

95% of scores fall within 2 SD of the ‘mean’, which is their actual score. Therefore the confidence interval is the actual score +/- (2 x SEM).

86
Q

The WAIS IQ test reliability is 0.98 and SD is 15. What is the SEM? And if someone gets an IQ score of 105, their confidence interval is?

A

SEM is 2.12. (15 x square root of 1-0.98)

105 +/- (2x2.12). Therefore scores range from 101 to 109 (i.e. their true IQ score is 95% likely to be in that range).
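
The WAIS example above can be reproduced with a short sketch (the function name is mine):

```python
import math

def sem(sd, reliability):
    # Standard error of measurement: SD x sqrt(1 - reliability)
    return sd * math.sqrt(1 - reliability)

wais_sem = sem(15, 0.98)   # ~2.12
score = 105
low, high = score - 2 * wais_sem, score + 2 * wais_sem
# 95% confidence interval is roughly 101 to 109, matching the card
```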

87
Q

What is the standard error of the difference used for and how do you use it?

A

A measure used to calculate whether there is a statistically significant difference between 2 scores.

At 95% confidence interval, scores need to differ by 2 SEdiff.

88
Q

Standard error of the difference (SEdiff) is used to work out whether someone’s score is significantly different from:

A
  1. Their own score on the same test at a different time (e.g. clinical psychologist - intervention had significant effect on client’s happiness?)
  2. Their score on another test of the same thing.
  3. Someone else’s score on the same test
  4. Someone else’s score on another test
89
Q

When calculating SEdiff for two tests, we need to first transform them to the same scale. T/F

A

True. e.g. a z score

90
Q

If you’re comparing scores on the same test, does SEM1 = SEM2 and r1 = r2?

A

Yes (i.e. just put the same values in both places in the formula).

91
Q

To be 95% confident that two individual scores are different, they would have to differ by at least 2 standard errors of the difference (1.96 to be precise). Why?

A

It is 2 standard errors of the difference because of the normal distribution.

92
Q

Two scores differ by more than 2 SEdiff (double the SEdiff)

A

We can say that the two scores are significantly different from one another at a 95% confidence level.

93
Q

Two scores differ by less than 2 SEdiff

A

Scores are NOT significantly different from one another at a 95% level of confidence.

94
Q

Give a real world example of how SEdiff can be used.

A

A clinical psychologist measures a client’s happiness before and after an intervention. The change in the client’s happiness score needs to be more than 2 SEdiff for it to be significant (i.e. for there to have been a real change).

95
Q

If two individuals’ scores on a test were 125 and 134, and the reliability of the test is 0.92 and the SD is 14, are the scores significantly different?

A

Calculate SEdiff = 5.6.
The difference between the two scores is 9 units.
The scores need to differ by 2 times SEdiff (11.2) to be significantly different.

So you can’t tell them apart with this test.
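
This card’s arithmetic follows from the usual SEdiff formula, SD x sqrt(2 - r1 - r2), which is equivalent to sqrt(SEM1^2 + SEM2^2); the function name below is mine:

```python
import math

def se_diff(sd, r1, r2):
    # Standard error of the difference for two scores on the same scale
    return sd * math.sqrt(2 - r1 - r2)

sed = se_diff(14, 0.92, 0.92)   # 5.6
difference = 134 - 125          # 9 units
# Must exceed 2 x SEdiff = 11.2 to be significant at the 95% level
significant = difference > 2 * sed
```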

96
Q

What is the reliable change index?

A

Essentially the same thing as the SEdiff approach, except the formula is arranged differently; it is more commonly used in clinical settings.

97
Q

If two individuals’ scores on a test were 125 and 134, and the reliability of the test is 0.92 and the SD is 14, how would you calculate this using the reliable change index (RCI)?

A

Work out the difference between the two scores (9 units) and divide by the standard error of the difference. If the Reliable Change Index is greater than 1.96 (i.e. 2 standard errors of the difference), then you have a statistically significant change.
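
As a sketch of the card’s calculation (the function name is mine; the RCI here is the score difference over the SEdiff for the same-test case):

```python
import math

def rci(score1, score2, sd, reliability):
    # Reliable Change Index: difference between two scores divided by SEdiff
    sem = sd * math.sqrt(1 - reliability)
    se_diff = math.sqrt(2) * sem   # same test twice, so SEM1 == SEM2
    return (score2 - score1) / se_diff

value = rci(125, 134, 14, 0.92)   # 9 / 5.6, about 1.61, below 1.96
# so the change is not statistically significant
```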

98
Q

What is internal consistency?

A

The average correlation between the items on your scale.

In other words, to what extent do individuals give consistent responses across items if all the items in your scale are measuring the same thing.

99
Q

What are the performance measures used in the Neale Analysis of Reading?

A

Reading accuracy, reading rate and comprehension.