Chapter 5: Reliability Flashcards

1
Q

_____ is a synonym for dependability or consistency.

A

Reliability

2
Q

A _____ is an index of reliability, a proportion that indicates the ratio between the true score variance on a test and the total variance.

A

reliability coefficient

3
Q

Recall from our discussion of _____ that a score on an ability test is presumed to reflect not only the testtaker’s true score on the ability being measured but also error.

A

classical test theory

4
Q

A statistic useful in describing sources of test score variability is the _____ (σ²), the standard deviation squared.

A

variance

5
Q

Variance from true differences is true variance, and variance from irrelevant, random sources is _____.

A

error variance

6
Q

The term _____ refers to the proportion of the total variance attributed to true variance. The greater the proportion of the total variance attributed to true variance, the more _____ the test.

A

reliability/reliable

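The ratio described in cards 2 and 6 can be illustrated with a minimal sketch; the variance figures below are made-up numbers, not from the text.

```python
# Hypothetical illustration: reliability as the proportion of total
# variance attributable to true variance. All numbers are made up.
true_variance = 80.0    # variance due to real differences among testtakers
error_variance = 20.0   # variance from irrelevant, random sources
total_variance = true_variance + error_variance

reliability = true_variance / total_variance
print(reliability)  # 0.8 -> 80% of score variance reflects true differences
```

The closer this proportion is to 1, the more reliable the test.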
7
Q

1) Test construction
2) Test administration
3) Test scoring and interpretation
4) Other sources of error: underreporting and overreporting

A

Sources of Error Variance (4)

8
Q

Sources of Error Variance:

One source of variance during test construction is item sampling or content sampling, terms that refer to variation among items within a test as well as to variation among items between tests.

A

Test construction

9
Q

Sources of Error Variance:

test environment: the room temperature, the level of lighting, and the amount of ventilation and noise, for instance.

testtaker variables. Pressing emotional problems, physical discomfort, lack of sleep, and the effects of drugs or medication can all be sources of error variance.

Examiner-related variables: the examiner's physical appearance, demeanor, and presence or absence during the test; in oral examinations, vocal emphasis on key words or nonverbal cues signaling the correctness of a response.

A

Test administration

10
Q

Sources of Error Variance:

The advent of computer scoring and a growing reliance on objective, computer-scorable items have virtually eliminated error variance caused by scorer differences in many tests. If subjectivity is involved in scoring, then the scorer (or rater) can be a source of error variance.

A

Test scoring and interpretation

11
Q

Reliability Estimates (4)

A

1) Test-Retest Reliability Estimates
2) Parallel-Forms and Alternate-Forms Reliability Estimates
3) Split-Half Reliability Estimates
4) Other Methods of Estimating Internal Consistency:
a) Inter-item consistency
b) The Kuder-Richardson formulas
c) Coefficient alpha

12
Q

Reliability Estimates:

using the same instrument to measure the same thing at two points in time.

is an estimate of reliability obtained by correlating pairs of scores from the same people on two different administrations of the same test.

**The passage of time can be a source of error variance. The longer the time that passes, the greater the likelihood that the reliability coefficient will be lower.

**even when the time period between the two administrations of the test is relatively small, various factors (such as experience, practice, memory, fatigue, and motivation) may intervene and confound an obtained measure of reliability

A

Test-Retest Reliability Estimates

13
Q

Reliability Estimates:

it is referred to as an internal consistency estimate of reliability or as an estimate of inter-item consistency.

Ex. Both groups take both tests: group A takes test A first, and group B takes test B first. The results of the two tests are compared, and the results are almost identical, indicating high parallel forms reliability.

Put simply, you’re trying to find out if test A measures the same thing as test B.

source of error variance: item sampling
cons: time-consuming and expensive.

A

Parallel-Forms and Alternate-Forms Reliability Estimates

14
Q

Reliability Estimates:

is obtained by correlating two pairs of scores obtained from equivalent halves of a single test administered once.

One acceptable way to _____ is to randomly assign items to one or the other half of the test.

odd-even reliability

**The Spearman-Brown formula

A

Split-Half Reliability Estimates

15
Q

Reliability Estimates: Other Methods of Estimating Internal Consistency

refers to the degree of correlation among all the items on a scale. A measure of inter-item consistency is calculated from a single administration of a single form of a test. An index of interitem consistency, in turn, is useful in assessing the homogeneity of the test.

Tests are said to be homogeneous if they contain items that measure a single trait.

The more homogeneous a test is, the more _____ it can be expected to have.

A

Inter-item consistency

16
Q

Reliability Estimates: Other Methods of Estimating Internal Consistency

Dissatisfaction with existing split-half methods of estimating reliability compelled Kuder and Richardson to develop their own measures for estimating reliability.

a measure of internal consistency reliability for measures with dichotomous choices

**coefficient alpha or coefficient α-20.

A

The Kuder-Richardson formulas (Kuder-Richardson formula 20, or KR-20)
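A minimal sketch of KR-20 for a short dichotomously scored test; the response matrix below is invented for illustration.

```python
# Hypothetical item responses (1 = correct, 0 = incorrect);
# rows are testtakers, columns are items.
data = [
    [1, 1, 1, 0, 1],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 1, 1],
    [0, 1, 1, 1, 1],
    [1, 0, 0, 0, 0],
]

k = len(data[0])                     # number of items
totals = [sum(row) for row in data]  # each testtaker's total score

# Sample variance of the total scores.
mean_total = sum(totals) / len(totals)
var_total = sum((t - mean_total) ** 2 for t in totals) / (len(totals) - 1)

# Sum over items of p*q, where p = proportion passing the item.
pq_sum = 0.0
for j in range(k):
    p = sum(row[j] for row in data) / len(data)
    pq_sum += p * (1 - p)

# KR-20 = (k / (k - 1)) * (1 - sum(pq) / total-score variance)
kr20 = (k / (k - 1)) * (1 - pq_sum / var_total)
print(round(kr20, 3))
```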

17
Q

Reliability Estimates: Other Methods of Estimating Internal Consistency

In contrast to KR-20, which is appropriately used only on tests with dichotomous items, _____ is appropriate for use on tests containing “nondichotomous items”.

is the preferred statistic for obtaining an estimate of internal consistency reliability. Coefficient alpha is widely used as a measure of reliability, in part because it requires only one administration of the test.

A

Coefficient alpha
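Coefficient alpha generalizes KR-20 by replacing the sum of p*q terms with the sum of item variances, which is what makes it usable with nondichotomous items. A minimal sketch with invented Likert-type ratings:

```python
# Hypothetical Likert-type (nondichotomous) responses; rows are
# testtakers, columns are items rated 1-5.
data = [
    [3, 4, 3, 4],
    [5, 5, 4, 5],
    [2, 2, 3, 2],
    [4, 3, 4, 4],
    [1, 2, 2, 1],
]

def sample_var(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

k = len(data[0])
item_vars = [sample_var([row[j] for row in data]) for j in range(k)]
total_var = sample_var([sum(row) for row in data])

# alpha = (k / (k - 1)) * (1 - sum(item variances) / total-score variance)
alpha = (k / (k - 1)) * (1 - sum(item_vars) / total_var)
print(round(alpha, 3))
```

As the card notes, only one administration of the test is needed, which is a large part of alpha's popularity.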

18
Q

Reliability Estimates: Other Methods of Estimating Internal Consistency

Unlike a Pearson r, which may range in value from -1 to +1, coefficient alpha typically ranges in value from _____.

A

0 to 1

19
Q

Variously referred to as scorer reliability, judge reliability, observer reliability, and inter-rater reliability, inter-scorer reliability is the degree of agreement or consistency between two or more scorers (or judges or raters) with regard to a particular measure.

A common problem is a lack of clarity in scoring criteria; the solution is to rewrite the scoring criteria section of the manual to include clearly written scoring rules, supplemented by group discussion, practice exercises, and information on rater accuracy.

A

Measures of Inter-Scorer Reliability

20
Q

Perhaps the simplest way of determining the degree of consistency among scorers in the scoring of a test is to calculate a coefficient of correlation. This correlation coefficient is referred to as a _____.

A

coefficient of inter-scorer reliability

21
Q

Using and Interpreting a Coefficient of Reliability

Three approaches to the estimation
of reliability: (3)

A

1) test-retest
2) alternate or parallel forms, and
3) internal or inter-item consistency

22
Q

Another question that is linked in no trivial way to the purpose of the test is, “How high should the coefficient of reliability be?” Perhaps the best “short answer” to this question is: “On a continuum relative to the purpose and importance of the decisions to be made on the basis of _____ on the test”.

A

scores

23
Q

The Nature of the Test

considerations concerning the purpose and use of a reliability coefficient are those concerning the nature of the test itself: (5)

A

1) test items are homogeneous or heterogeneous in nature;
2) the characteristic, ability, or trait being measured is presumed to be dynamic or static;
3) the range of test scores is or is not restricted;
4) the test is a speed or a power test; and
5) the test is or is not criterion-referenced

24
Q
1. Test-retest: # of testing sessions: 2; # of test forms: 1; source of error variance: administration; statistical procedure: Pearson r or Spearman rho
2. Alternate-forms: # of testing sessions: 1 or 2; # of test forms: 2; source of error variance: test construction or administration; statistical procedure: Pearson r or Spearman rho
3. Internal consistency: # of testing sessions: 1; # of test forms: 1; source of error variance: test construction; statistical procedure: Pearson r
4. Inter-scorer: # of testing sessions: 1; # of test forms: 1; source of error variance: scoring and interpretation; statistical procedure: Pearson r or Spearman rho
A

Type of reliability, # of testing sessions, # of test forms, source of error variance, statistical procedure

25
Q

Homogeneity vs heterogeneity of test items

A
homogeneous = high internal consistency
heterogeneous = low internal consistency
26
Q

Dynamic vs static characteristics

A
Dynamic = internal consistency (the characteristic changes over time, so test-retest is of little value)
Static = test-retest or alternate forms
27
Q

Restriction or inflation of range

A
Restricted range = spuriously low correlation coefficient
Inflated range = spuriously high correlation coefficient
28
Q

Speed tests vs power tests

A
Power = long time limit, with some items so difficult that no testtaker can answer them all
Speed = items of uniform, fairly low difficulty with a strict time limit; reliability estimated by test-retest, alternate-forms, or a split-half estimate based on two separately timed half tests
29
Q

A _____ is designed to provide an indication of where a testtaker stands with respect to some variable or criterion, such as an educational or a vocational objective.

A

criterion-referenced test

30
Q

Unlike norm-referenced tests, criterion-referenced tests tend to contain material that has been mastered in a hierarchical fashion.

Scores on criterion-referenced tests tend to be interpreted in _____ (or, perhaps more accurately, “master-failed-to-master”) terms, and any scrutiny of performance on individual items tends to be for diagnostic and remedial purposes.

A

pass–fail

31
Q

Alternatives to the True Score Model (2)

A

1) Generalizability theory

2) Item response theory

32
Q

Alternatives to the True Score Model:

According to _____, given the exact same conditions of all the facets in the universe, the exact same test score should be obtained. This test score is the universe score, and it is, as Cronbach noted, analogous to a true score in the true score model.

examines how generalizable scores from a particular test are if the test is administered in different situations.

A

generalizability theory (G theory)

33
Q

Alternatives to the True Score Model:

_____ procedures provide a way to model the probability that a person with X ability will be able to perform at a level of Y.

because the construct being measured may be a trait (it could also be something else, such as an ability), a synonym for _____ in the academic literature is latent-trait theory.

Examples of two characteristics of items within an IRT framework are the difficulty level of an item and the item’s level of discrimination;

A

Item response theory

34
Q

two characteristics of items within an IRT framework

A

“Difficulty” in this sense refers to the attribute of not being easily accomplished, solved, or comprehended.

In the context of IRT, discrimination signifies the degree to which an item differentiates among people with higher or lower levels of the trait, ability, or whatever it is that is being measured.

35
Q

Reliability and Individual Scores (2)

A

1) The Standard Error of Measurement (SEM)

2) The Standard Error of the Difference between Two Scores

36
Q

Reliability and Individual Scores:

provides a measure of the precision of an observed test score. Stated another way, it provides an estimate of the amount of error inherent in an observed score or measurement.

The higher the reliability of a test (or individual subtest within a test), the lower the _____.

is the tool used to estimate or infer the extent to which an observed score deviates from a true score. We may define the _____ as the standard deviation of a theoretically normal distribution of test scores obtained by one person on equivalent tests.

Also known as the standard error of a score and denoted by the symbol σmeas , the standard error of measurement is an index of the extent to which one individual’s scores vary over tests presumed to be parallel.

A

Standard Error of Measurement
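The relationship described above (higher reliability means a lower SEM) follows from the standard formula SEM = SD * sqrt(1 - r). A minimal sketch with hypothetical values:

```python
import math

# SEM = SD * sqrt(1 - r), where SD is the standard deviation of test
# scores and r is the reliability coefficient. Hypothetical values:
sd = 15.0          # e.g. an IQ-style standard score scale
reliability = 0.91

sem = sd * math.sqrt(1 - reliability)
print(round(sem, 2))  # higher reliability -> smaller SEM

# A common use: build a roughly 95% confidence band of about
# +/- 2 SEM around an observed score.
observed = 100
low, high = observed - 2 * sem, observed + 2 * sem
```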

37
Q

Reliability and Individual Scores:

Comparisons between scores are made using the standard error of the difference, a statistical measure that can aid a test user in determining how large a difference should be before it is considered statistically significant.

A

Standard Error of the Difference between Two Scores
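A minimal sketch of this statistic, combining the SEMs of the two tests being compared; the SD and reliabilities below are hypothetical.

```python
import math

# Standard error of the difference between two scores:
#   SE_diff = sqrt(SEM1**2 + SEM2**2)
# which equals SD * sqrt(2 - r1 - r2) when both tests share the same SD.
sd = 15.0
r1, r2 = 0.91, 0.84   # hypothetical reliabilities of the two tests

sem1 = sd * math.sqrt(1 - r1)
sem2 = sd * math.sqrt(1 - r2)
se_diff = math.sqrt(sem1 ** 2 + sem2 ** 2)
print(round(se_diff, 2))

# A difference of roughly 2 * SE_diff or more between two scores is a
# common rule of thumb for statistical significance.
```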