Test Reliability Flashcards
is an index of reliability, a proportion that indicates the ratio between the
true score variance on a test and the total variance
Reliability coefficient
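A minimal numeric sketch of the ratio above, using made-up variance components (the values are illustrative only):

    # Hypothetical variance components for a test
    true_score_variance = 40.0   # variance due to real differences in the attribute measured
    error_variance = 10.0        # variance due to measurement error
    total_variance = true_score_variance + error_variance

    # Reliability coefficient = true score variance / total variance
    reliability = true_score_variance / total_variance   # 0.8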
- a score on an ability test reflects not only the testtaker’s true score on the ability being
measured but also error
Classical Test Theory (True Score Theory)
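Classical test theory treats an observed score as the sum of a true score and error (X = T + E); a tiny sketch with made-up numbers:

    # Classical test theory: observed score (X) = true score (T) + error (E)
    true_score = 102                        # hypothetical error-free score on the ability
    error = -4                              # random error on this particular administration
    observed_score = true_score + error     # 98, the score the testtaker actually obtains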
3 Sources of Error Variance
Test Construction
Test Administration
Test Scoring and Interpretation
error variance is attributed to item/content sampling
Test Construction
test environment, testtaker variables, and examiner-related variables are factors that may
influence the testtaker's attention or motivation
Test Administration
technical glitches, scorer subjectivity, human error, etc.
Test Scoring and Interpretation
§ obtained by correlating pairs of scores from the same people on two different administrations of the same test
§ appropriate when evaluating a test measuring a construct that is relatively stable over time (e.g. personality)
§ coefficient of stability
§ source of error variance: the passage of time between administrations
Reliability Estimates (STABILITY)
TEST-RETEST RELIABILITY ESTIMATE
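A sketch of how a stability coefficient is obtained in practice, assuming hypothetical scores from the same ten testtakers on two administrations of the same test:

    import statistics

    # Scores for the same ten testtakers on two administrations (made-up data)
    first_administration = [12, 15, 18, 20, 22, 25, 27, 30, 31, 34]
    second_administration = [13, 14, 19, 19, 23, 24, 28, 29, 33, 33]

    # Coefficient of stability: the correlation between the two sets of scores
    # (statistics.correlation requires Python 3.10+)
    coefficient_of_stability = statistics.correlation(first_administration, second_administration)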
§ two test administrations with the same group of test takers
§ coefficient of equivalence
Reliability Estimates (EQUIVALENCE)
PARALLEL-FORMS and ALTERNATE-FORMS RELIABILITY ESTIMATES
of a test exist when, for each form of the test, the means and
variances of observed test scores are equal.
Parallel-forms
of a test are typically designed to be equivalent with
respect to variables such as content and level of difficulty
Alternate-forms
§obtained by correlating two pairs of scores obtained from
equivalent halves of a single test administered once
SPLIT-HALF RELIABILITY ESTIMATE
◦ used to estimate internal consistency reliability from the correlation between two
halves of a test; also used to estimate the reliability of a test that is lengthened or shortened
Spearman-Brown formula
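A sketch of the Spearman-Brown formula, assuming a hypothetical half-test correlation; n is the factor by which test length changes (n = 2 corrects a split-half correlation to full length, n < 1 models a shortened test):

    def spearman_brown(r, n=2.0):
        # Estimated reliability of a test whose length is changed by factor n,
        # given the correlation r between the existing (e.g. half-test) forms
        return (n * r) / (1 + (n - 1) * r)

    half_test_r = 0.70                                    # hypothetical correlation between two halves
    full_length_estimate = spearman_brown(half_test_r)    # about 0.82 for the full-length test
    shortened_estimate = spearman_brown(0.90, n=0.5)      # estimate if a reliable test were cut in half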
Full meaning of KR 20 & 21
KUDER-RICHARDSON FORMULA 20 & 21
used to determine the inter-item consistency of
dichotomous items - items that can be scored right or wrong (e.g.
Multiple-choice, Yes/No, True/False, Agree/Disagree)
KR-20
items that can be scored right or wrong (e.g.
Multiple-choice, Yes/No, True/False, Agree/Disagree)
Dichotomous items
may be used if all the test items have approximately the
same degree of difficulty
KR-21
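A sketch of both formulas on a small made-up matrix of dichotomous (0/1) item scores; the function names and data are illustrative only:

    import statistics

    def kr20(score_matrix):
        # score_matrix: one row of 0/1 item scores per testtaker
        k = len(score_matrix[0])                        # number of dichotomous items
        totals = [sum(row) for row in score_matrix]     # total score per testtaker
        total_variance = statistics.pvariance(totals)
        pq_sum = 0.0                                    # sum of p*q across items
        for item in range(k):
            p = sum(row[item] for row in score_matrix) / len(score_matrix)   # proportion correct
            pq_sum += p * (1 - p)
        return (k / (k - 1)) * (1 - pq_sum / total_variance)

    def kr21(k, mean_score, variance):
        # Simplification of KR-20 that assumes all items have roughly the same difficulty
        return (k / (k - 1)) * (1 - (mean_score * (k - mean_score)) / (k * variance))

    scores = [[1, 1, 0, 1], [1, 0, 0, 1], [1, 1, 1, 1], [0, 0, 0, 1]]   # 4 testtakers x 4 items
    totals = [sum(row) for row in scores]
    print(kr20(scores))
    print(kr21(k=4, mean_score=statistics.mean(totals), variance=statistics.pvariance(totals)))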
§most accepted and widely used reliability estimate
§Provides a measure of reliability from a single test administration
§developed by Lee Joseph Cronbach, which is why it is also called
Cronbach's alpha
appropriate for use on tests containing nondichotomous items
COEFFICIENT ALPHA
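A sketch of coefficient alpha on made-up ratings for nondichotomous (e.g. Likert-type) items; this is a plain-Python illustration, not any particular library's implementation:

    import statistics

    def coefficient_alpha(score_matrix):
        # score_matrix: one row of item ratings per respondent (nondichotomous items allowed)
        k = len(score_matrix[0])                        # number of items
        totals = [sum(row) for row in score_matrix]     # total score per respondent
        item_variances = [
            statistics.pvariance([row[i] for row in score_matrix]) for i in range(k)
        ]
        # alpha = (k / (k - 1)) * (1 - sum of item variances / variance of total scores)
        return (k / (k - 1)) * (1 - sum(item_variances) / statistics.pvariance(totals))

    ratings = [[4, 5, 4], [2, 3, 2], [5, 5, 4], [3, 3, 3]]   # 4 respondents x 3 Likert items
    print(coefficient_alpha(ratings))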
Coefficient alpha was developed by __________, which is why it is also called _________
Lee Joseph Cronbach
Cronbach’s alpha
appropriate for use on tests containing ________ items
(e.g. items rated from Strongly Disagree to Strongly Agree)
nondichotomous
§degree of agreement or consistency between two or more scorers
with regard to a particular measure
§scorers must have sufficient training in standardized scoring
§source of error: scoring criteria
§coefficient of inter-scorer reliability
INTER-SCORER RELIABILITY ESTIMATE
Using and Interpreting a Reliability Coefficient
When purchasing tests:
§ Never buy any form of assessment/measurement where there is no reliability
coefficient or where it is below 0.7
Using and Interpreting a Reliability Coefficient
When purchasing tests:
§ Personality and similar measures: ___________ is often
recommended as a minimum
0.6 to 0.8, although above 0.7
Using and Interpreting a Reliability Coefficient
When purchasing tests:
§ Ability, aptitude, IQ, and other forms of reasoning tests should have coefficients
___________. ___________ has been recommended as an excellent value. Where the
intention is to compare people's scores, such as when selecting people for a job,
values ___________ should be the aim.
above 0.8; above 0.85
above 0.85
Using and Interpreting a Reliability Coefficient
When purchasing tests:
§ The sample size used for the calculation of reliability should never be _____
below 100
5 Reliability and Nature of the Test
Homogeneity vs. Heterogeneity of test items
Dynamic vs. Static characteristics
Restriction or Inflation of range
Speed tests vs Power tests
Criterion-referenced tests
- uniformity of test items
Homogeneous items
- various items measuring multiple
constructs
Heterogeneous items
- changing trait, state, or ability (e.g. anxiety)
Dynamic
- stable/enduring trait, state, or ability
Static
The variability of test scores in a sample is directly related to the size of the correlation coefficient
Restriction or Inflation of range
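A small illustration of the point above: restricting the range of scores shrinks the correlation, and with it the reliability estimate (made-up data; only the direction of the effect matters):

    import statistics

    # Hypothetical paired scores (e.g. test and retest) across a wide range of ability
    test = [50, 55, 60, 65, 70, 75, 80, 85, 90, 95]
    retest = [54, 52, 65, 61, 73, 70, 84, 82, 95, 91]

    full_range_r = statistics.correlation(test, retest)            # computed on the full range
    restricted_r = statistics.correlation(test[3:7], retest[3:7])  # only the middle of the range
    # restricted_r comes out lower than full_range_r, so a reliability estimate based on a
    # range-restricted sample will understate the test's reliability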
- reliability estimate of speed tests should be based on performance
from two independent testing periods
Speed tests vs Power tests
- traditional procedures of estimating reliability are usually not
appropriate for use with __________, though there may be instances in which traditional estimates can be adopted
Criterion-referenced tests
3 Alternatives to the True Score Theory or Classical Test Theory
DOMAIN SAMPLING THEORY
GENERALIZABILITY THEORY
ITEM RESPONSE THEORY
§ seek to estimate the extent to which specific sources of variation under
defined conditions are contributing to the test score
§ posits that a test score is a sample from a larger, theoretical “domain” of possible items, and the reliability of a test increases with the number of items sampled from that domain
DOMAIN SAMPLING THEORY
GENERALIZABILITY THEORY
§ originally referred to as the ________; ________ is a modified form of DST
§ developed by _______
Domain Sampling Theory; GT
Cronbach and colleagues
§ a person’s test scores vary from testing to testing because of variables in
the testing situation
§ given the exact same conditions of all the facets in the universe, the exact
same test score should be obtained
§ test reliability does not reside within the test itself, rather, it is a function of
the circumstances under which the test is developed, administered, and
interpreted
GENERALIZABILITY THEORY
§ a theory of testing based on the relationship between an individual’s performance on a test
item and the test taker’s level of performance on an overall measure of the ability the item
was designed to measure.
§ Persons with lower ability have less of a chance, while persons with high ability are very likely
to answer correctly; for example, students with higher math ability are more likely to get a
math item correct.
ITEM RESPONSE THEORY
ITEM RESPONSE THEORY
IRT models are often referred to as ____________. The term latent is used to emphasize
that discrete item responses are taken to be observable manifestations of hypothesized traits,
constructs, or attributes, not directly observed, but which must be inferred from the manifest
responses
latent trait models
2 Reliability and Individual Scores
STANDARD ERROR OF MEASUREMENT
STANDARD ERROR OF THE DIFFERENCE
- a range or band of test scores that is likely to contain
the true score
confidence interval
often abbreviated as SEM, provides a measure of the precision of an
observed score; an estimate of the amount of error inherent in an observed score
§ SEM and the reliability of a test have an inverse relationship; that is, the higher the
reliability of a test (or individual subtest within a test), the lower the SEM
§ it can be used to set the confidence interval for a particular score or to
determine whether a score is significantly different from a criterion.
STANDARD ERROR OF MEASUREMENT
SEM
STANDARD ERROR OF MEASUREMENT
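A sketch tying SEM to the confidence interval card above, using the commonly cited formula SEM = SD * sqrt(1 - reliability) and made-up numbers:

    import math

    sd = 15.0            # standard deviation of the test's scores (hypothetical)
    reliability = 0.91   # reliability coefficient of the test (hypothetical)

    # Standard error of measurement: the higher the reliability, the lower the SEM
    sem = sd * math.sqrt(1 - reliability)    # 15 * sqrt(0.09) = 4.5

    # 95% confidence interval around an observed score of 100
    observed = 100
    lower, upper = observed - 1.96 * sem, observed + 1.96 * sem   # roughly 91.2 to 108.8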
used to determine how large a difference should be before it is considered
statistically significant
§ in cases such as recruitment and selection, ________
can be used to compare the test scores of applicants, which can help
personnel officers make hiring decisions
STANDARD ERROR OF THE DIFFERENCE
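A sketch of the standard error of the difference computed from two SEMs (SED = sqrt(SEM1^2 + SEM2^2)); the applicants and numbers are hypothetical:

    import math

    sem_applicant_a = 4.5   # SEM of the test used for applicant A (hypothetical)
    sem_applicant_b = 4.5   # SEM of the test used for applicant B (same test here)

    # Standard error of the difference between the two observed scores
    sed = math.sqrt(sem_applicant_a ** 2 + sem_applicant_b ** 2)   # about 6.36

    # A difference smaller than about 1.96 * SED (roughly 12.5 points here) would not be
    # considered statistically significant at the .05 level
    score_a, score_b = 110, 104
    significant = abs(score_a - score_b) > 1.96 * sed   # False for a 6-point difference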