Chapter 4 - Reliability Flashcards

1
Q

Classical Test Theory (CTT): Assumptions (4)

A

(1) Each person has a true score that would be obtained if there were no errors in measurement. Observed test score (X) = True test score (T) + Error (E)
(2) Measurement errors are random
(3) Measurement error is normally distributed
(4) Variance of OBSERVED scores = Variance of true scores + Error variance
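In symbols (standard CTT notation; the cards give only the verbal form): X = T + E, and σ²(X) = σ²(T) + σ²(E).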

2
Q

Reliable test

A

One we can trust to give us the same score for a person every time it is used.

3
Q

Can a measurement instrument be perfectly reliable?

A

No. No measurement instrument is perfectly reliable

4
Q

A person’s true score def

A

The hypothetical or ideal measure of a person’s attribute we aim to capture with a psychological test.
=> Free from error
Expected score over an infinite number of independent administrations of the test

5
Q

Independent administration def

A

Each time the test is taken is unrelated to previous or future administrations
-> The person’s performance on one occasion doesn’t influence their performance on another.

6
Q

Mean error of measurement = ____
Errors are _______ with each other
True scores and errors are _______

A

0; uncorrelated; uncorrelated

7
Q

Two tests are parallel if: (3)

A

(1) Equal observed score means
-> Follows from the assumption that the true scores are the same
(2) Equal error variance
(3) Same correlations with other tests

8
Q

Random error characteristics (3)

A

(1) Random in nature
(2) Cancels itself out
(3) Lowers reliability of the test

9
Q

Systematic error characteristics (2)

A

(1) Occurs when a source of error consistently increases or decreases scores by the same amount
(2) Does not lower the reliability of a test, since the test is reliably inaccurate by the same amount each time

10
Q

Sources of Measurement Error (3)

A

(1) Content Sampling Error
(2) Time Sampling Error
(3) Other Sources of Error (e.g. observer differences)

11
Q

Content Sampling Error characteristics (3)

A

(1) Results from differences between the sample of items (i.e., the test) and the domain of items (i.e., all the possible items)
(2) Arises when test items are not representative of the domain from which they are drawn
(3) Low when test items are representative of the domain

12
Q

Time Sampling Error characteristics (2)

A

(1) Results from the choice of a particular time to administer the test
(2) Random fluctuations in performance from one situation or time to another

13
Q

Other Sources of Error characteristics (2)

A

(1) Scoring or administrative errors
E.g., errors when adding up scores
(2) Tests scored or graded by different scorers

14
Q

Reliability Coefficient

A

Proportion of variability in observed test scores accounted for by variability in true scores.
=> Ratio of the variance of the true scores on a test to the variance of the observed scores
=> Measure of the accuracy of a test or measuring instrument obtained by measuring the same individuals twice and computing the correlation of the two sets of measures
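In symbols (standard form, using the CTT notation above): reliability r_xx = σ²(T) / σ²(X), which ranges from 0 to 1.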

15
Q

Standard Error of Measurement (SEM) def

A

Indicates the amount of uncertainty or error expected in an individual’s observed test score.
=> Corresponds to the SD of the distribution of scores one would obtain by repeatedly testing a person.
=> SD of the distribution of random errors around the true score
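Standard formula (not spelled out on the card): SEM = SD(X) × √(1 − r_xx), where SD(X) is the standard deviation of observed scores and r_xx is the reliability coefficient.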

16
Q

Standard Error of Measurement (SEM) allows us to quantify the _______.

A

Amount of variation in a person’s observed score that measurement error would most likely cause

17
Q

High Reliability = ___ SEM
Low Reliability = ___ SEM

A

High reliability = Low SEM
Low reliability = High SEM
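Worked example (illustrative numbers only): with SD = 15, a reliability of .96 gives SEM = 15 × √(1 − .96) = 3, whereas a reliability of .75 gives SEM = 15 × √(1 − .75) = 7.5.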

18
Q

Confidence Interval (CI) def

A

Confidence interval (CI) is a range of scores that we feel confident will include the true score.
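Standard construction (building on the SEM formula above): 95% CI ≈ observed score ± 1.96 × SEM (e.g., X = 100 and SEM = 3 give roughly 94 to 106).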

19
Q

CI is used to compare scores to avoid ______________.

A

over-emphasizing differences

20
Q

Reliability of test can be increased by _______.

A

adding items

21
Q

Spearman-Brown formula def

A

Predicts the effect of lengthening or shortening a test on reliability.
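Standard formula, where n is the factor by which the test is lengthened and r is the current reliability: corrected r = (n × r) / (1 + (n − 1) × r). E.g., doubling a test with r = .60 predicts (2 × .60) / (1 + .60) = .75.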

22
Q

Test reliability is usually estimated with what methods? (4)

A

(1) Test-retest
(2) Alternate (Parallel) Forms
(3) Internal consistency
(4) Interrater/Raters

23
Q

Test-Retest method

A

Administer the same test to the same group of examinees on two different occasions, then correlate the first set of scores with the second set.
-> The expected level of reliability should match the construct
-> Higher when the construct being measured is expected to be stable than when it is expected to change
Assesses time sampling error.

24
Q

Alternate (Parallel) Forms method

A

Evaluates consistency across different forms of the test.
Construct two similar forms of a test and administer both forms to the same group of examinees within a very short period of time.
-> Correlate the 2 sets of scores
-> The correlation coefficient serves as an index of the reliability of either form
Assesses item sampling error.

25
Q

Internal Consistency method

A

We examine how consistently people perform on similar subsets of items selected from the same form of the measure.
One test administration: a single form of the test is administered only once to a group of examinees.
=> How consistently the examinees performed across items or subsets of items on this single test form.

26
Q

Internal Consistency - If scores are consistent across items on the same test form, we can conclude that: (2)

A

(1) The items came from the same content domain and were constructed in the same way
(2) Performance would generalize to other items from the same content domain

27
Q

How High Should Internal Consistency Coefficients Be? (*do not confuse with other coefficients)

A

Higher for “narrow” constructs
Lower for “broader” constructs
-> Very high coefficients may indicate insufficient sampling of the domain
E.g. Medium internal consistency is bad for a narrow construct (panic disorder), but not so bad for a broad construct (Neuroticism)

28
Q

What’s the older approach used to estimate the internal consistency of a test?

A

Split-half: correlate scores based on the first half of the items with scores based on the second half.
Or correlate scores based on the odd-numbered items with scores based on the even-numbered items.
=> If the items get progressively more difficult, then you might be better advised to use the odd-even system, whereby one subscore is obtained for the odd-numbered items in the test and another for the even-numbered items.

29
Q

What’s the contemporary approach used to estimate the internal consistency of a test?

A

Cronbach’s alpha. Contemporary approach to estimate internal consistency.
-> Most general method of finding estimates of reliability through internal consistency.
Cronbach’s alpha = Average of all possible split-half reliabilities
Unaffected by how items are arranged in the test
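Standard formula, where k = number of items, σ²(i) = variance of item i, and σ²(X) = variance of total scores: α = [k / (k − 1)] × (1 − Σσ²(i) / σ²(X)).
A minimal Python sketch of this formula (illustrative only; the function and variable names are my own, not from the cards):

def cronbach_alpha(scores):
    # scores: one list per examinee, each containing k item scores
    k = len(scores[0])
    def var(xs):
        # population variance (ddof = 0)
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)
    item_vars = [var([person[i] for person in scores]) for i in range(k)]
    total_var = var([sum(person) for person in scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

E.g., cronbach_alpha([[1, 0, 1], [1, 1, 1], [0, 0, 0]]) computes alpha for three examinees on three dichotomous items.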

30
Q

Kappa formula

A

Interrater Agreement
Proportion of the potential agreement following correction for chance agreement.
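Standard formula: κ = (p_o − p_e) / (1 − p_e), where p_o is the observed proportion of agreement and p_e is the proportion of agreement expected by chance.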

31
Q

Interrater reliability

A

Degree of agreement among independent observers who rate/assess the same phenomenon.
Two or more people rate/score the same tests

32
Q

Domain Sampling Model

A

Model that holds that the TRUE score of a characteristic is obtained when ALL of the ITEMS in the domain are used to capture it.
Considers the problems created by using a limited number of items to represent a larger and more complicated construct.
Conceptualizes reliability as the ratio of the variance of the observed score on the shorter test to the variance of the long-run true score.

33
Q

Item Response Theory vs Classical Test Theory

A

Classical test theory: Requires that exactly the same test items be administered to each person (but some are too easy, some are too hard).
Item response theory (IRT): Using IRT, the computer is used to focus on the range of item difficulty that helps assess an individual’s ability level. -> The overall result is that a more reliable estimate of ability is obtained using a shorter test with fewer items.

34
Q

Difficulties with applications of IRT

A

Requires a bank of items that have been systematically evaluated for level of difficulty.
-> Considerable effort must go into test development, and complex computer software is required.

35
Q

Test-Retest Method: Problems

A

Carryover effects: Occurs when the first testing session influences scores from the second session.

36
Q

When there are carryover effects, the test-retest correlation usually ________ the true reliability.

A

overestimates
-> This can happen because the participant remembers items or patterns from the first test, so their performance on the second test is less independent than it should be.

37
Q

In cases where the changes are ___, carryover effects do not harm the reliability.

A

systematic
E.g., when everyone’s score improves by exactly 5 points. In this case, no new variability is introduced.

38
Q

Important type of carryover effect

A

Practice effects: improvement in performance with practice
-> Because of these problems, the time interval between testing sessions must be selected and evaluated carefully.

39
Q

What method provides one of the most rigorous assessments of reliability commonly in use?

A

Parallel Forms Method
-> However: Test developers find it burdensome to develop two forms of the same test, and practical constraints make it difficult to retest the same group of individuals.
-> Many test developers prefer to base their estimate of reliability on a single form of a test.

40
Q

Problems with Split-Half method (2)

A

(1) The two halves may have different variances.
(2) The split-half method also requires that each half be scored separately, possibly creating additional work.

41
Q

What technique avoids the problems of split-half method and how?

A

The Kuder-Richardson technique avoids these problems because it simultaneously considers all possible ways of splitting the items.

42
Q

KR20 Formula

A

Simultaneously considers all possible ways of splitting the items. Used when the test items are dichotomous (right or wrong).
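Standard formula, where k = number of items, p = proportion passing each item, q = 1 − p, and σ² = variance of total scores: KR20 = [k / (k − 1)] × (1 − Σpq / σ²). For dichotomous items this equals Cronbach’s alpha.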

43
Q

Factor Analysis

A

Factor analysis is one popular method for dealing with the situation in which a test apparently measures several different characteristics.
-> Used to divide the items into subgroups, each internally consistent;
E.g. test that has submeasures for several different traits.

44
Q

Sources of measurement error: (3)

A

(1) Time sampling: The same test given at different points in time may produce different scores, even if given to the same test takers.
(2) Item sampling: The same construct or attribute may be assessed using a wide pool of items.
(3) When different observers record the same behavior: Different judges observing the same event may record different numbers.

45
Q

How do we assess measurement error associated with time sampling?

A

Test-retest method (coefficient of stability)

46
Q

How do we assess measurement error associated with item sampling?

A

Parallel forms reliability

47
Q

How do we assess measurement error associated with “diff people judge same behavior”?

A

Adjusted index of agreement such as the kappa statistic.

48
Q

What to Do about Low Reliability? (3)

A

(1) Increase the Number of Items
(2) Throw out items that run down the reliability (by running a factor/discriminability analysis)
(3) Estimate what the true correlation would have been (correction for attenuation)
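Standard correction-for-attenuation formula, where r_xy is the observed correlation and r_xx, r_yy are the reliabilities of the two measures: estimated true correlation = r_xy / √(r_xx × r_yy).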

49
Q

How to interpret the Kappa stat?

A

Kappa = 0 is considered poor -> agreement is no better than chance.
Kappa = 1 represents perfect, complete agreement.

50
Q

When random error is high on both tests, the correlation between the scores will be _____ compared to when the random error is ___.

A

lower; small

51
Q

Formulas for diff methods of reliability:
- Test-retest (time sampling error)
- Alternate forms (item sampling error)
- Internal consistency (item sampling)
- Interrater (observer difference)

A

Test-retest = Pearson’s r (coefficient of stability)
Alternate forms = Pearson’s r (coefficient of equivalence)
Internal consistency = Spearman-Brown formula, KR20 formula, or Cronbach’s alpha
Interrater = Kappa formula

52
Q

Analysis for diff methods of reliability:

A

Test-retest = Scores on first and second administration correlated
Alternate forms = Scores on both tests are correlated
Internal consistency = Scores on the two halves are correlated, OR the average correlation across all possible split halves
Interrater = Scores by both observers are correlated

53
Q

Difference Score def

A

Subtracting one test score from another
-> The two scores typically measure two different attributes

54
Q

Why are difference score unreliable?

A

Difference scores are unreliable because the random errors from both scores are compounded, while the true-score variance shared by the two measures cancels out.
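For reference, a standard CTT formula for the reliability of a difference score (assuming the two tests have equal variances; r_11 and r_22 are their reliabilities, r_12 their correlation): r_diff = [½(r_11 + r_22) − r_12] / (1 − r_12). The more strongly the two tests correlate, the less reliable their difference.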