Reliability Flashcards

1
Q

TEST-RETEST

A

We consider the consistency of the test results when the test is administered on different occasions

only applies to stable traits

2
Q

Sources of difference between test and retest?

A

Systematic carryover - everyone's score improves by the same number of points - does not harm reliability

Random carryover - changes are not predictable from earlier scores, or something affects some but not all test takers

Practice effects - skills improve with practice (e.g., taking the same midterm exam twice - you would expect to do better the second time)

3
Q

Time before re-administration must be carefully evaluated

A

Short interval: carryover and practice effects are a problem

Long interval: a low correlation may reflect poor reliability, a real change in the characteristic (e.g., with age), or some combination of the two

4
Q

Well-evaluated test: test-retest

A

A well-evaluated test reports many retest correlations for different time intervals between testing sessions; events that occur between sessions should also be considered.

5
Q

PARALLEL FORMS

A

We evaluate consistency across different forms of the test.

The forms use different items; however, the rules used to select items of a particular difficulty level are the same.

Give the two forms to the same person (on the same day) and calculate the correlation between them.
Reduces the learning effect.

CON: not always practical - it is hard to come up with two forms that you expect to behave identically.

6
Q

SPLIT HALF/Internal Consistency

A

Administer the whole test, split it in half, and calculate the correlation between the halves.

If the items get progressively more difficult, use an odd-even split.

CON: how do you decide which halves? On a midterm, for example, you would not expect all questions to behave the same.

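A minimal sketch of an odd-even split-half estimate, assuming Python with NumPy; the score matrix is invented toy data, not from the text.

import numpy as np

# Toy data: rows = test takers, columns = items (1 = correct, 0 = incorrect).
scores = np.array([
    [1, 1, 0, 1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 0],
    [0, 1, 0, 0, 1, 0, 0, 1],
    [1, 1, 1, 0, 1, 1, 1, 1],
])

odd_half = scores[:, 0::2].sum(axis=1)   # items 1, 3, 5, ... for each person
even_half = scores[:, 1::2].sum(axis=1)  # items 2, 4, 6, ... for each person
split_half_r = np.corrcoef(odd_half, even_half)[0, 1]
print(split_half_r)  # correlation between the two halves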
7
Q

SPLIT HALF: Spearman-Brown Correction

A

Allows you to estimate what the correlation between the two halves would have been if each half had been the length of the whole test:

Corrected r = 2r / (1 + r)

Corrected r = the estimated correlation between the two halves of the test if each had the total number of items

(The correction increases the estimate of reliability.)

r = the correlation between the two halves of the test

Assumes the variances of the two halves are similar.

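A minimal sketch of the correction in Python; the half-test correlation of .65 is an invented example value.

def spearman_brown(r_half):
    # Estimated full-length reliability from the correlation between the halves.
    return (2 * r_half) / (1 + r_half)

print(spearman_brown(0.65))  # ~0.79; the correction raises the estimate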
8
Q

SPLIT HALF: Cronbach’s Alpha

A

A coefficient alpha used for estimating split-half reliability when the two halves have unequal variances.

Gives the LOWEST boundary (a lower-bound estimate) of reliability.

α = 2[σ²x - (σ²y1 + σ²y2)] / σ²x

α = the coefficient alpha for estimating split-half reliability
σ²x = the variance for scores on the whole test
σ²y1 and σ²y2 = the variances for the two separate halves of the test

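A minimal sketch of the computation above, assuming Python with NumPy; the half-test scores are invented toy data.

import numpy as np

half1 = np.array([3, 5, 2, 4, 5, 1])   # toy scores on one half of the test
half2 = np.array([4, 5, 1, 3, 5, 2])   # toy scores on the other half
total = half1 + half2                  # whole-test scores

var_total = total.var(ddof=1)
alpha = 2 * (var_total - (half1.var(ddof=1) + half2.var(ddof=1))) / var_total
print(alpha)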
9
Q

SPLIT HALF: KR20 formula

A

A reliability estimate that uses math to solve the problem for all possible split halves:

KR20 = [N / (N - 1)] × [(s² - Σpq) / s²]

N = the number of items on the test
s² = the variance of the total test scores
p = the proportion of people getting each item correct (found separately for each item)
q = the proportion of people getting each item incorrect; for each item, q = 1 - p
Σpq = the sum of the products of p times q for each item on the test

To have nonzero reliability, the variance of the total test scores must be greater than the sum of the variances of the individual items. This will happen only when the items are measuring the same trait.

The total test score variance is the sum of the item variances and the covariances between items; the only situation that makes the sum of the item variances less than the total test score variance is when there is covariance between the items.

The greater the covariance among items, the smaller the Σpq term will be relative to the total variance. When the items covary, they can be assumed to measure the same general trait, and the reliability of the test will be high.

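A minimal sketch of the KR20 computation above, assuming Python with NumPy; the item matrix is invented toy data.

import numpy as np

# Toy data: rows = test takers, columns = items (1 = correct, 0 = incorrect).
items = np.array([
    [1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1],
    [1, 1, 1, 0, 1],
    [0, 0, 0, 1, 0],
])

n_items = items.shape[1]
p = items.mean(axis=0)                     # proportion passing each item
q = 1 - p                                  # proportion failing each item
total_var = items.sum(axis=1).var(ddof=1)  # variance of the total scores
kr20 = (n_items / (n_items - 1)) * (1 - (p * q).sum() / total_var)
print(kr20)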
10
Q

SPLIT HALF: KR20 formula - cons

A

Applies only to items scored dichotomously (right/wrong, 0 or 1); for items with no single correct answer (e.g., Likert ratings), coefficient alpha is used instead.
11
Q

SPLIT HALF: KR21 formula

A

Similar to the KR20, but a different version.

Does not require calculating p and q for every item; instead, the KR21 uses an approximation of the sum of the pq products based on the mean test score.

Assumptions need to be met: the most important is that all items are of equal difficulty, or that the average difficulty level is 50%. Difficulty is defined as the percentage of test takers who pass the item.

In practice, these assumptions are rarely met, and the KR21 formula usually underestimates the split-half reliability.

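A minimal sketch assuming the usual KR21 form, r = [k / (k - 1)] × [1 - M(k - M) / (k × s²)], where k = number of items, M = mean total score, and s² = total-score variance; the example values are invented.

def kr21(n_items, mean_score, total_var):
    # Approximates the sum of pq from the mean score instead of item-level p and q.
    return (n_items / (n_items - 1)) * (
        1 - mean_score * (n_items - mean_score) / (n_items * total_var)
    )

print(kr21(n_items=20, mean_score=12.0, total_var=16.0))  # ~0.74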
12
Q

SPLIT HALF: Coefficient Alpha

A

Compares the variance of all individual items to the variance of the total test score.

Used for tests where there is no single correct answer (e.g., Likert items).

Similar to the KR20, except Σpq is replaced by Σs²i, the sum of the variances of the individual items:

α = [N / (N - 1)] × [(s² - Σs²i) / s²]

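A minimal sketch of coefficient alpha as given above, assuming Python with NumPy; the Likert rating matrix is invented toy data.

import numpy as np

# Toy data: rows = respondents, columns = Likert items (ratings 1-5).
items = np.array([
    [4, 5, 3, 4],
    [2, 2, 3, 2],
    [5, 4, 4, 5],
    [3, 3, 2, 3],
    [4, 4, 5, 4],
])

n_items = items.shape[1]
sum_item_vars = items.var(axis=0, ddof=1).sum()  # sum of individual item variances
total_var = items.sum(axis=1).var(ddof=1)        # variance of the total scores
alpha = (n_items / (n_items - 1)) * (1 - sum_item_vars / total_var)
print(alpha)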
13
Q

Factor Analysis

A

Can be used to divide the items into subgroups, each internally consistent; the subgroups of items will not be related to one another.

Helps a test constructor build a test that has submeasures for several different traits.

14
Q

Classical test theory - why the field is turning away from it

A
  1. Requires that exactly the same test be administered to each person
  2. Some items are too easy and some are too hard, so few of the items concentrate on a person's exact ability level
  3. Assumes behavioral dispositions are constant over time
15
Q

Item Response Theory

A

The basis of computer-adaptive tests; the field is turning away from classical test theory for a variety of reasons.

With IRT, the computer is used to focus on the range of item difficulty that helps assess an individual's ability level. For example, if the person gets several easy items correct, the computer might quickly move to more difficult items.

A more reliable estimate of ability is obtained using a shorter test with fewer items.

16
Q

Item Response Theory - Difficulties

A

1. The method requires a bank of items that have been systematically evaluated for level of difficulty.

2. Considerable effort must go into test development, and complex computer software is required.

17
Q

Reliability of a Difference Score

A

When might we want a difference score? For example, the difference between performance at two points in time, such as before and after a training program.

In a difference score, E is expected to be larger than either the observed score or T because E absorbs error from both of the scores used to create the difference score.

T might be expected to be smaller than E because whatever is common to both measures is canceled out when the difference score is created

18
Q

The low reliability of a difference score should concern

A

The low reliability of a difference score should concern the practicing psychologist and education researcher. Because of their poor reliabilities, difference scores cannot be depended on for interpreting patterns.

19
Q

Interrater Reliability
Kappa statistic

A

introduced by J. Cohen (1960) as a measure of agreement between two judges who each rate a set of objects using nominal scales.

Kappa indicates the actual agreement as a proportion of the potential agreement following correction for chance agreement.

Values of kappa may vary between 1 (perfect agreement) and -1 (less agreement than can be expected on the basis of chance alone).
Greater than .75 = excellent
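A minimal sketch of Cohen's kappa for two raters in Python, computing kappa = (observed agreement - chance agreement) / (1 - chance agreement); the two rating lists are invented.

from collections import Counter

rater_a = ["yes", "no", "yes", "yes", "no", "yes", "no", "no", "yes", "yes"]
rater_b = ["yes", "no", "no", "yes", "no", "yes", "no", "yes", "yes", "yes"]

n = len(rater_a)
p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

# Chance agreement: sum over categories of the product of each rater's marginals.
counts_a, counts_b = Counter(rater_a), Counter(rater_b)
p_chance = sum((counts_a[c] / n) * (counts_b[c] / n)
               for c in set(rater_a) | set(rater_b))

kappa = (p_observed - p_chance) / (1 - p_chance)
print(kappa)  # ~0.58 for this toy example (fair to good)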

20
Q

Interrater Reliability - Nominal scores

A

-1 = less agreement than expected by chance
1 = perfect agreement
Greater than .75 = excellent
.40 to .75 = fair to good
Less than .40 = poor

21
Q

Sources of Error
Time Sampling issues

A

Example: a measure of state anxiety given a week later may differ simply because the person's state has changed in the meantime.

This source of error is typically assessed using the test-retest method.

22
Q

Sources of Error - Item sampling

A

some items may behave strangely
The same construct or attribute may be assessed using a wide pool of items.

Typically, two forms of a test are created by randomly sampling items from a large pool of items believed to assess a particular construct, and the correlation between the two forms is used as an estimate of this type of reliability.

23
Q

Sources of Error Internal Consistency -

A

we examine how people perform on similar subsets of items selected from the same form of the measure
intercorrelations among items within the same test

If the test is designed to measure a single construct and all items are equally good candidates to measure that attribute, then there should be a high correspondence among the items.

24
Q

determine extent of internal consistency error by

A

evaluated using split-half reliability, the KR20 method, or coefficient alpha

25
Q

Observer Differences - sources of error

A

untrained person, independent observations reconciled

Even though they have the same instructions, different judges observing the same event may record different numbers.

To determine the extent of this type of error, researchers can use an adjusted index of agreement such as the kappa statistic.

26
Q

Standard error of measurement

A

Because we usually assume that the distribution of random errors will be the same for all people, classical test theory uses the standard deviation of errors as the basic measure of error.

It tells us, on average, how much a score varies from the true score.

In practice, the standard deviation of the observed scores and the reliability of the test are used to estimate the standard error of measurement:

SEM = SD × √(1 - r)

This is not the standard error of the mean; it includes information about the reliability of the test.

Use this measure to construct a confidence interval around a specific score. The bounds are around the observed score:

95% CI = observed score ± 1.96 × SEM

If X is the observed score, the true score will fall within those boundaries with 95% confidence.

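A minimal sketch in Python of the SEM and the 95% confidence interval around one observed score; the SD, reliability, and score are invented example values.

import math

sd = 15.0           # standard deviation of observed scores
reliability = 0.90  # reliability of the test
observed = 110.0    # one person's observed score

sem = sd * math.sqrt(1 - reliability)   # SEM = SD * sqrt(1 - r)
lower = observed - 1.96 * sem
upper = observed + 1.96 * sem
print(sem, (lower, upper))  # ~4.74 and roughly (100.7, 119.3)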
27
Q

How much reliability is good enough?

A

Depends on what you are using it for

High-stakes consequences - need to have a good idea of how reliable it is

Reliability estimates in the range of .70 to .80 are good enough for most purposes in basic research. Some people have argued that it would be a waste of time and effort to refine research instruments beyond a reliability of .90.
In fact, it has even been suggested that reliabilities greater than .95 are not very useful because they suggest that all of the items are testing essentially the same thing and that the measure could easily be shortened.

For a test used to make a decision that affects some person’s future, evaluators should attempt to find a test with a reliability greater than .95.

28
Q

What to do about low reliability?

A

Add items:
Adding items always makes the test more reliable. A single multiple-choice question would not be reliable on its own, whereas a 40-item test would be; the larger the sample of items, the more likely the test represents the true characteristic.

Item analysis:
  • Go in and test how all of the individual items are doing, and identify which ones are doing well.
  • Each item in a test is an independent sample of the trait or ability being measured.
29
Q

Length Needed for any Desired Level of Reliability

A

N = rd(1 - r0) / [r0(1 - rd)]

N = the number of tests of the length of the current version that would be needed
rd = the desired reliability
r0 = the observed reliability based on the current version of the test

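A minimal sketch of the formula above in Python; the reliabilities .70 and .90 are invented example values.

def length_multiplier(r_desired, r_observed):
    # N = rd(1 - r0) / [r0(1 - rd)]: how many times longer the test must be.
    return (r_desired * (1 - r_observed)) / (r_observed * (1 - r_desired))

# A test with observed reliability .70 would need to be about 3.9 times its
# current length to reach a reliability of .90.
print(length_multiplier(r_desired=0.90, r_observed=0.70))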
30
Q

Correction for Attenuation

A

If a test is unreliable, information obtained with it is of little or no value. Thus, we say that potential correlations are attenuated, or diminished, by measurement error.

The correction estimates the true correlation between tests 1 and 2 - the correlation we would obtain if we could get everyone's true scores. The observed correlation will be an underestimate.

r12hat = r12 / √(r11 × r22)

r12hat = the estimated true correlation between tests 1 and 2
r12 = the observed correlation between tests 1 and 2
r11 = the reliability of test 1
r22 = the reliability of test 2

Another option if you are concerned about one of your tests being unreliable.

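A minimal sketch of the correction above in Python; the observed correlation and reliabilities are invented example values.

import math

def corrected_correlation(r12, r11, r22):
    # r12hat = r12 / sqrt(r11 * r22)
    return r12 / math.sqrt(r11 * r22)

# An observed correlation of .40 between tests with reliabilities .70 and .80
# corresponds to an estimated true correlation of about .53.
print(corrected_correlation(r12=0.40, r11=0.70, r22=0.80))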
31
Q

Domain Sampling Model

A

considers the problems created by using a limited number of items to represent a larger and more complicated construct
Use a sample

task in reliability analysis is to estimate how much error we would make by using the score from the shorter test as an estimate of your true ability.

reliability as the ratio of the variance of the observed score on the shorter test and the variance of the long-run true score.
the greater the number of items, the higher the reliability.

Because true scores are not available, our only alternative is to estimate what they would be. Given that items are randomly drawn from a given domain, each test or group of items should yield an unbiased estimate of the true score.

Different random samples of items might give different estimates of the true score
To estimate reliability, we can create many randomly parallel tests by drawing repeated random samples of items from the same domain

32
Q

Variance of Scores

A

Imagine a group of people taking the same test.
Everyone has their own TRUE score (theoretical).
Everyone has their own observed score.

Variance = the square of the SD.

We can calculate the variance of the observed scores; theoretically, we could also imagine the variance of the true scores.

Which would be bigger, observed or true? The observed score variance is larger, because error variance is added to the true score variance.

33
Q

Which would be bigger? - observed or true variance

A

The observed score variance is larger, because error variance is added to the true score variance.

34
Q

Random vs. Systematic error

A

No error - every shot hits the bullseye.

Random error - scattered spots around the middle - we have accuracy but not precision.

Systematic error - not a lot of variance, but the error is not randomly distributed - the shots cluster somewhere other than the bullseye - precision but not accuracy.

Practice effect - we expect the score to be different (better) the second time - a non-random, systematic change.

A test that underestimates the ability of women gives a systematically lower score: the observed score tends to be lower than the true score.

35
Q

Reliability Coefficient

A

The ratio of the variance of the true scores on a test to the variance of the observed scores:

r = σ²T / σ²X

r = the theoretical reliability of the test
σ²T = the variance of the true scores
σ²X = the variance of the observed scores
(σ is used because these are theoretical values in a population rather than values actually obtained from a sample)

r = the percentage of the observed variation that is attributable to variation in the true score; 1 - r = the proportion of variance attributable to random error.

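A minimal numeric sketch of the ratio above, with invented variance values.

true_var = 80.0                       # theoretical true-score variance
error_var = 20.0                      # random error variance
observed_var = true_var + error_var   # observed variance = true + error
reliability = true_var / observed_var
print(reliability)  # 0.8: 80% of observed variance reflects true-score differences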
36
Q

Behavioral Observation

A

frequently unreliable because of discrepancies between true scores and the scores recorded by the observer

problem of error associated with different observers presents unique difficulties
estimate the reliability of the observers - interrater

37
Q

record the percentage of times that two or more observers agree.
Not the best for 2 reasons

A

1. The percentage does not consider the level of agreement that would be expected by chance alone. For example, if two observers are recording whether a particular behavior either occurred or did not occur, they would have a 50% likelihood of agreeing by chance alone. A method for assessing such reliability should include an adjustment for chance agreement.

2. Percentages should not be mathematically manipulated. For example, it is not technically appropriate to average percentages. Indexes such as Z scores are manipulable and thus better suited to the task of reliability assessment.

38
Q

TO ensure that items measure the same thing:

A

Factor analysis - tests are most reliable if they are unidimensional: one factor should account for considerably more of the variance than any other factor.

Discriminability analysis - when the correlation between performance on a single item and the total test score is low, the item is probably measuring something different from the other items on the test.