Chapter 5: Reliability Flashcards
_____ is a synonym for dependability or consistency.
Reliability
A _____ is an index of reliability, a proportion that indicates the ratio between the true score variance on a test and the total variance.
reliability coefficient
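In symbols, with r_xx the reliability coefficient, σ²_tr the true variance, and σ² the total variance:

```latex
r_{xx} = \frac{\sigma^2_{tr}}{\sigma^2}
       = \frac{\text{true variance}}{\text{true variance} + \text{error variance}}
```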
Recall from our discussion of _____ that a score on an ability test is presumed to reflect not only the testtaker’s true score on the ability being measured but also error.
classical test theory
A statistic useful in describing sources of test score variability is the _____ (σ²), the standard deviation squared.
variance
Variance from true differences is true variance, and variance from irrelevant, random sources is _____.
error variance
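Putting the last two cards together, classical test theory partitions the total variance into these two components:

```latex
\sigma^2 = \sigma^2_{tr} + \sigma^2_{e}
```

where σ²_tr is true variance and σ²_e is error variance.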
The term _____ refers to the proportion of the total variance attributed to true variance. The greater the proportion of the total variance attributed to true variance, the more _____ the test.
reliability/reliable
1) Test construction
2) Test administration
3) Test scoring and interpretation
4) Other sources of error: underreporting and overreporting
Sources of Error Variance (4)
Sources of Error Variance:
One source of variance during test construction is item sampling or content sampling, terms that refer to variation among items within a test as well as to variation among items between tests.
Test construction
Sources of Error Variance:
Test environment: the room temperature, the level of lighting, and the amount of ventilation and noise, for instance.
Testtaker variables: pressing emotional problems, physical discomfort, lack of sleep, and the effects of drugs or medication can all be sources of error variance.
Examiner-related variables: the examiner's physical appearance and demeanor, presence or absence during the test, emphasis on key words when reading items aloud, and nonverbal cues that hint at the correctness of a response.
Test administration
Sources of Error Variance:
The advent of computer scoring and a growing reliance on objective, computer-scorable items have virtually eliminated error variance caused by scorer differences in many tests. If subjectivity is involved in scoring, then the scorer (or rater) can be a source of error variance.
Test scoring and interpretation
Reliability Estimates (4)
1) Test-Retest Reliability Estimates
2) Parallel-Forms and Alternate-Forms Reliability Estimates
3) Split-Half Reliability Estimates
4) Other Methods of Estimating Internal Consistency:
a) Inter-item consistency
b) The Kuder-Richardson formulas
c) Coefficient alpha
Reliability Estimates:
using the same instrument to measure the same thing at two points in time.
is an estimate of reliability obtained by correlating pairs of scores from the same people on two different administrations of the same test.
**The passage of time can be a source of error variance. The longer the time that passes, the greater the likelihood that the reliability coefficient will be lower.
**Even when the time period between the two administrations of the test is relatively small, various factors (such as experience, practice, memory, fatigue, and motivation) may intervene and confound an obtained measure of reliability.
Test-Retest Reliability Estimates
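A minimal sketch of how a test-retest estimate is computed in practice, using hypothetical scores for five testtakers (the data are invented for illustration; scipy's pearsonr is one standard way to get the correlation):

```python
from scipy.stats import pearsonr

# Hypothetical scores for the same five testtakers on two
# administrations of the same test, a few weeks apart.
time_1 = [85, 92, 78, 88, 95]
time_2 = [83, 95, 75, 90, 93]

# The test-retest reliability estimate is the Pearson r between
# the two administrations.
r_tt, _ = pearsonr(time_1, time_2)
print(f"Test-retest reliability estimate: {r_tt:.2f}")
```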
Reliability Estimates:
evaluates the degree of the relationship between various forms of a test; the resulting reliability estimate is also known as the coefficient of equivalence.
Ex. Two groups each take both forms: group A takes form A first, and group B takes form B first. The results of the two forms are compared, and nearly identical results indicate high parallel-forms reliability.
Put simply, you’re trying to find out if test A measures the same thing as test B.
source of error variance: item sampling
cons: time-consuming and expensive.
Parallel-Forms and Alternate-Forms Reliability Estimates
Reliability Estimates:
is obtained by correlating two pairs of scores obtained from equivalent halves of a single test administered once.
One acceptable way to split a test is to randomly assign items to one or the other half of the test. Another way is to assign odd-numbered items to one half and even-numbered items to the other; the resulting estimate of split-half reliability is also referred to as _____.
odd-even reliability
**The Spearman-Brown formula is used to estimate the reliability of the full-length test from the half-test correlation (see the formula after this card).
Split-Half Reliability Estimates
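The general Spearman-Brown formula estimates the reliability of a test lengthened (or shortened) by a factor of n; setting n = 2 gives the usual correction of a half-test correlation r_hh up to full test length:

```latex
r_{SB} = \frac{n\,r_{xy}}{1 + (n - 1)\,r_{xy}},
\qquad \text{split-half case } (n = 2):\quad
r_{SB} = \frac{2\,r_{hh}}{1 + r_{hh}}
```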
Reliability Estimates: Other Methods of Estimating Internal Consistency
refers to the degree of correlation among all the items on a scale. A measure of inter-item consistency is calculated from a single administration of a single form of a test. An index of inter-item consistency, in turn, is useful in assessing the homogeneity of the test.
Tests are said to be homogeneous if they contain items that measure a single trait.
The more homogeneous a test is, the more _____ it can be expected to have.
Inter-item consistency
Reliability Estimates: Other Methods of Estimating Internal Consistency
Dissatisfaction with existing split-half methods of estimating reliability compelled Kuder and Richardson to develop their own measures for estimating reliability.
a measure of internal consistency reliability for measures with dichotomous choices
**Related: coefficient alpha (sometimes written coefficient α-20), which generalizes this formula to nondichotomous items.
The Kuder-Richardson formulas (Kuder-Richardson formula 20, or KR-20)
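For reference, the KR-20 formula, where k is the number of items, p_j the proportion of testtakers answering item j correctly, q_j = 1 − p_j, and σ² the variance of total test scores:

```latex
r_{KR20} = \frac{k}{k - 1}\left(1 - \frac{\sum_{j=1}^{k} p_j q_j}{\sigma^2}\right)
```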
Reliability Estimates: Other Methods of Estimating Internal Consistency
In contrast to KR-20, which is appropriately used only on tests with dichotomous items, _____ is appropriate for use on tests containing “nondichotomous items”.
is the preferred statistic for obtaining an estimate of internal consistency reliability. Coefficient alpha is widely used as a measure of reliability, in part because it requires only one administration of the test.
Coefficient alpha
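Coefficient alpha has the same form as KR-20 but replaces the Σpq term with the sum of the individual item variances, which is what allows it to handle nondichotomous items:

```latex
\alpha = \frac{k}{k - 1}\left(1 - \frac{\sum_{j=1}^{k} \sigma^2_j}{\sigma^2}\right)
```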
Reliability Estimates: Other Methods of Estimating Internal Consistency
Unlike a Pearson r, which may range in value from -1 to +1, coefficient alpha typically ranges in value from _____.
0 to 1
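A minimal sketch of the alpha computation on a hypothetical item-score matrix (rows are testtakers, columns are Likert-type items; the data are invented for illustration):

```python
import numpy as np

# Hypothetical responses: 6 testtakers x 4 nondichotomous items.
scores = np.array([
    [4, 5, 4, 5],
    [2, 3, 2, 2],
    [5, 5, 4, 5],
    [3, 3, 3, 4],
    [1, 2, 2, 1],
    [4, 4, 5, 4],
])

k = scores.shape[1]                          # number of items
item_vars = scores.var(axis=0, ddof=1)       # variance of each item
total_var = scores.sum(axis=1).var(ddof=1)   # variance of total scores

# alpha = (k / (k - 1)) * (1 - sum of item variances / total variance)
alpha = (k / (k - 1)) * (1 - item_vars.sum() / total_var)
print(f"Coefficient alpha: {alpha:.2f}")
```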
Variously referred to as scorer reliability, judge reliability, observer reliability, and inter-rater reliability, inter-scorer reliability is the degree of agreement or consistency between two or more scorers (or judges or raters) with regard to a particular measure.
If the problem is a lack of clarity in scoring criteria, the solution may be to rewrite the scoring criteria section of the manual to include clearly written scoring rules; group discussion, practice exercises, and information on rater accuracy can also help.
Measures of Inter-Scorer Reliability
Perhaps the simplest way of determining the degree of consistency among scorers in the scoring of a test is to calculate a coefficient of correlation. This correlation coefficient is referred to as a _____.
coefficient of inter-scorer reliability
Using and Interpreting a Coefficient of Reliability
Three approaches to the estimation
of reliability: (3)
1) test-retest
2) alternate or parallel forms, and
3) internal or inter-item consistency
Another question that is linked in no trivial way to the purpose of the test is, “How high should the coefficient of reliability be?” Perhaps the best “short answer” to this question is: “On a continuum relative to the purpose and importance of the decisions to be made on the basis of _____ on the test”.
scores
The Nature of the Test
Closely related to considerations concerning the purpose and use of a reliability coefficient are those concerning the nature of the test itself: (5)
1) test items are homogeneous or heterogeneous in nature;
2) the characteristic, ability, or trait being measured is presumed to be dynamic or static;
3) the range of test scores is or is not restricted;
4) the test is a speed or a power test; and
5) the test is or is not criterion-referenced
- Test-retest: 2 sessions, 1 form; error variance from administration; Pearson r or Spearman rho
- Alternate-forms: 1 or 2 sessions, 2 forms; error variance from test construction or administration; Pearson r or Spearman rho
- Internal consistency: 1 session, 1 form; error variance from test construction; Pearson r
- Inter-scorer: 1 session, 1 form; error variance from scoring and interpretation; Pearson r or Spearman rho
Type of reliability: number of testing sessions, number of test forms, source of error variance, statistical procedure
Homogeneity vs heterogeneity of test items
Homogeneous = high internal consistency; heterogeneous = low internal consistency
Dynamic vs static characteristics
Dynamic = internal consistency estimates from a single administration (the trait itself fluctuates between sessions, so test-retest is inappropriate); static = test-retest or alternate forms
Restriction or inflation of range
Restricted range = lower correlation coefficient; inflated range = higher correlation coefficient
Speed tests vs power tests
Power = long time limit but items so difficult that no testtaker gets a perfect score; speed = items of uniformly low difficulty with a short time limit. Reliability of a speed test should be based on test-retest, alternate-forms, or split-half estimates from two separately timed half tests.
A _____ is designed to provide an indication of where a testtaker stands with respect to some variable or criterion, such as an educational or a vocational objective.
criterion-referenced test
Unlike norm-referenced tests, criterion-referenced tests tend to contain material that has been mastered in a hierarchical fashion.
Scores on criterion-referenced tests tend to be interpreted in _____ (or, perhaps more accurately, “master-failed-to-master”) terms, and any scrutiny of performance on individual items tends to be for diagnostic and remedial purposes.
pass–fail
Alternatives to the True Score Model (2)
1) Generalizability theory
2) Item response theory
Alternatives to the True Score Model:
According to _____, given the exact same conditions of all the facets in the universe, the exact same test score should be obtained. This test score is the universe score, and it is, as Cronbach noted, analogous to a true score in the true score model.
examines how generalizable scores from a particular test are if the test is administered in different situations.
generalizability theory (G theory)
Alternatives to the True Score Model:
_____ procedures provide a way to model the probability that a person with X ability will be able to perform at a level of Y.
because the construct being measured may be a trait (it could also be something else, such as an ability), a synonym for _____ in the academic literature is latent-trait theory.
Examples of two characteristics of items within an IRT framework are the difficulty level of an item and the item’s level of discrimination.
Item response theory
two characteristics of items within an IRT framework
“Difficulty” in this sense refers to the attribute of not being easily accomplished, solved, or comprehended.
In the context of IRT, discrimination signifies the degree to which an item differentiates among people with higher or lower levels of the trait, ability, or whatever it is that is being measured.
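Both characteristics appear as item parameters in common IRT models. In the two-parameter logistic (2PL) model, for example, the probability that a person with trait level θ answers item i correctly depends on the item's difficulty b_i and discrimination a_i:

```latex
P_i(\theta) = \frac{1}{1 + e^{-a_i(\theta - b_i)}}
```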
Reliability and Individual Scores (2)
1) The Standard Error of Measurement (SEM)
2) The Standard Error of the Difference between Two Scores
Reliability and Individual Scores:
provides a measure of the precision of an observed test score. Stated another way, it provides an estimate of the amount of error inherent in an observed score or measurement.
The higher the reliability of a test (or individual subtest within a test), the lower the _____.
is the tool used to estimate or infer the extent to which an observed score deviates from a true score. We may define the _____ as the standard deviation of a theoretically normal distribution of test scores obtained by one person on equivalent tests.
Also known as the standard error of a score and denoted by the symbol σ_meas, the standard error of measurement is an index of the extent to which one individual’s scores vary over tests presumed to be parallel.
Standard Error of Measurement
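The standard error of measurement is computed from the test's standard deviation σ and its reliability coefficient r_xx:

```latex
\sigma_{meas} = \sigma\sqrt{1 - r_{xx}}
```

For instance, a test with σ = 15 and r_xx = .91 would have σ_meas = 15·√.09 = 4.5, so an observed score locates the true score only within a band of that width per standard error.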
Reliability and Individual Scores:
Comparisons between scores are made using the standard error of the difference, a statistical measure that can aid a test user in determining how large a difference should be before it is considered statistically significant.
Standard Error of the Difference between Two Scores
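The standard error of the difference can be obtained from the standard errors of measurement of the two scores; when both scores come from tests on the same scale with standard deviation σ and reliabilities r_1 and r_2, the two expressions below are equivalent:

```latex
\sigma_{diff} = \sqrt{\sigma^2_{meas_1} + \sigma^2_{meas_2}}
             = \sigma\sqrt{2 - r_1 - r_2}
```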