Chapter 5 Flashcards
- Is a synonym for dependability or consistency.
- Refers to consistency in measurement.
Reliability
Is an index of reliability, a proportion that indicates the ratio between the true score variance on a test and the total variance.
Reliability coefficient
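The ratio can be sketched in a few lines of Python (hypothetical variance values, for illustration only):

```python
def reliability_coefficient(true_variance: float, total_variance: float) -> float:
    """Proportion of total observed score variance that is true score variance."""
    return true_variance / total_variance

# Hypothetical example: 40 of 50 variance units reflect true differences
print(reliability_coefficient(40.0, 50.0))  # 0.8
```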
Holds that a score on an ability test is presumed to reflect not only the testtaker's true score on the ability being measured but also error.
Classical test theory
- Variance from true differences.
True variance
- A statistic useful in describing sources of test score variability.
- This statistic is useful because it can be broken into components.
- The standard deviation squared.
Variance
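A quick check of the relationship between variance and standard deviation, using Python's standard library and hypothetical scores:

```python
from statistics import pstdev, pvariance

scores = [10, 12, 14, 16, 18]  # hypothetical test scores

sd = pstdev(scores)      # population standard deviation
var = pvariance(scores)  # population variance

# The variance equals the standard deviation squared
print(var, round(sd ** 2, 9))  # 8.0 8.0
```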
Refers, collectively, to all of the factors associated with the process of measuring some variable, other than the variable being measured.
Measurement error
Variance from irrelevant, random sources
Error variance
Is a source of error in measuring a targeted variable caused by unpredictable fluctuations and inconsistencies of other variables in the measurement process.
- Sometimes referred to as “noise,” this source of error fluctuates from one testing situation to
another with no discernible pattern that would systematically raise or lower scores.
Random error
Refers to a source of error in measuring a
variable that is typically constant or proportionate to what is presumed to be the true value of the variable being measured.
Systematic error
Sources of Error Variance:
Sources of error variance include test construction, administration, scoring, and/or
interpretation.
These terms refer to variation among items within a test as well as to variation among items between tests.
- Under test construction
Item sampling or content sampling
Sources of error variance that occur during test administration may influence the testtaker’s attention or motivation. The testtaker’s reactions to those influences are the source of one kind of error variance.
- Examples of untoward influences during
administration of a test include factors related to the: room temperature, level of lighting, and amount of ventilation and noise, for instance.
Test environment
- Other potential sources of error variance during test administration are: pressing emotional problems, physical discomfort, lack of sleep, and the effects of drugs or medication can all be sources of error variance.
Test-taker variables
The examiner’s physical appearance and demeanor—even the presence or absence of an examiner—are some factors for consideration here.
Examiner-related variables
- In many tests, the advent of computer scoring and a growing reliance on objective, computer-scorable items have virtually eliminated error variance caused by scorer differences.
- However, not all tests can be scored from grids blackened by no. 2 pencils. Individually administered intelligence tests, some tests of personality, tests of creativity, various behavioral measures, essay tests, portfolio assessment, situational behavior tests, and countless other tools of assessment still require scoring by trained personnel.
Test scoring and interpretation
- Surveys and polls are two tools of assessment commonly used by researchers who study public opinion.
- Certain types of assessment situations lend themselves to particular varieties of systematic
and nonsystematic error.
Other sources of error
Reliability Estimates:
- Test-Retest Reliability Estimates
- Parallel-Forms and Alternate-Forms Reliability Estimates
- Split-Half Reliability Estimates
Is an estimate of reliability obtained by correlating pairs of scores from the same people on two different administrations of the same test.
- Is appropriate when evaluating the reliability of a test that purports to measure something
that is relatively stable over time, such as a personality trait.
Test-retest reliability
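A test-retest estimate is simply the Pearson correlation between the two administrations. A minimal sketch with hypothetical scores for five testtakers (no external libraries):

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation between two score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical scores on two administrations of the same test
time1 = [10, 12, 14, 16, 18]
time2 = [11, 13, 13, 17, 19]
print(round(pearson_r(time1, time2), 3))  # 0.962
```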
When the interval between testing is greater than six months, the estimate of test-retest reliability is often referred to as
Coefficient of stability
The degree of the relationship between various forms of a test can be evaluated by means of an alternate-forms or parallel-forms coefficient of reliability, which is often termed the
Coefficient of equivalence
- Exist when, for each form of the test, the means and the variances of observed test scores are equal.
Parallel forms
Refers to an estimate of the extent to which item sampling and other errors have affected test scores on versions of the same test when, for each form of the test, the means and variances of observed test scores are equal.
Parallel forms reliability
Are simply different versions of a test that
have been constructed so as to be parallel.
Alternate forms
Refers to an estimate of the extent to which these different forms of the same test have been affected by item sampling error, or other error.
Alternate forms reliability
Deriving this type of estimate entails an evaluation of the internal consistency of the test items. Logically enough, it is referred to as an
Internal consistency estimate of reliability or as an estimate of inter-item consistency
There are different methods of obtaining internal consistency estimates of reliability. One such method is the
Split-half estimate
Is obtained by correlating two pairs of scores obtained from equivalent halves of a single test administered once.
- It is a useful measure of reliability when it is impractical or undesirable to assess reliability with two tests or to administer a test
twice (because of factors such as time or expense).
Split-half reliability
This method yields an estimate of split-half
reliability that is also referred to as
Odd-even reliability
Allows a test developer or user to estimate internal consistency reliability from a correlation of two halves of a test.
Spearman–Brown formula
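A sketch of the whole split-half procedure, assuming hypothetical odd-item and even-item half scores: correlate the two halves, then step the half-test correlation up to full-test length with the Spearman–Brown formula, r_SB = 2r / (1 + r).

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def spearman_brown(r_half: float) -> float:
    """Estimate full-test reliability from a half-test correlation: 2r / (1 + r)."""
    return 2 * r_half / (1 + r_half)

# Hypothetical odd-item and even-item half scores for five testtakers
odd_half  = [5, 7, 9, 11, 13]
even_half = [6, 7, 10, 10, 14]
r_half = pearson_r(odd_half, even_half)
print(round(spearman_brown(r_half), 3))  # 0.979
```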
Refers to the degree of correlation among all the
items on a scale.
Inter-item consistency
- Is the degree to which a test measures a single factor; in other words, the extent to which items in a scale are unifactorial.
- (derived from the Greek words homos, meaning “same,” and genos, meaning “kind”)
Homogeneity
- Describes the degree to which a test measures different factors.
- A heterogeneous test is composed of items that measure more than one trait.
Heterogeneity, heterogeneous
Dissatisfaction with existing split-half methods of estimating reliability compelled G. Frederic Kuder and M. W. Richardson (1937; Richardson & Kuder, 1939) to develop their own measures for estimating reliability.
- The 20th formula developed in a series.
Kuder–Richardson formula 20
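For dichotomously scored (0/1) items, KR-20 can be computed as (k / (k − 1)) · (1 − Σpq / σ²_total). A sketch with a hypothetical response matrix, assuming population variance:

```python
from statistics import pvariance

def kr20(responses):
    """responses: one list of 0/1 item scores per testtaker."""
    k = len(responses[0])
    n = len(responses)
    totals = [sum(person) for person in responses]
    pq = 0.0
    for j in range(k):
        p = sum(person[j] for person in responses) / n  # proportion passing item j
        pq += p * (1 - p)
    return (k / (k - 1)) * (1 - pq / pvariance(totals))

# Hypothetical 5 testtakers x 4 dichotomous items
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
print(round(kr20(responses), 3))  # 0.8
```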
A selected assortment of tests and assessment procedures—in the process of evaluation.
- typically composed of tests designed to measure different variables
Test battery
- Developed by Cronbach (1951) and subsequently elaborated on by others (such as Kaiser & Michael, 1975; Novick & Lewis, 1967)
- May be thought of as the mean of all possible split-half correlations, corrected by the Spearman–Brown formula.
Coefficient alpha
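Coefficient alpha can be sketched as (k / (k − 1)) · (1 − Σσ²_item / σ²_total); for dichotomous items it reduces to KR-20. Hypothetical data, population variances assumed:

```python
from statistics import pvariance

def cronbach_alpha(responses):
    """responses: one list of item scores per testtaker."""
    k = len(responses[0])
    totals = [sum(person) for person in responses]
    item_vars = [pvariance([person[j] for person in responses]) for j in range(k)]
    return (k / (k - 1)) * (1 - sum(item_vars) / pvariance(totals))

# Hypothetical 5 testtakers x 4 items
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
print(round(cronbach_alpha(responses), 3))  # 0.8
```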
- A relatively new measure for evaluating the internal consistency of a test.
- Focuses on the degree of difference that exists between item scores.
Average proportional distance (APD) method
- Is the degree of agreement or consistency between two or more scorers (or judges or raters) with regard to a particular measure.
- often used when coding nonverbal behavior
Inter-scorer reliability
The simplest way of determining the degree of
consistency among scorers in the scoring of a test is to calculate a coefficient of correlation.
Coefficient of inter-scorer reliability
A source of error attributable to variations in the test-taker’s feelings, moods, or mental state over time.
Transient error
Recall that a test is said to be homogeneous
in items if it is functionally uniform throughout. Tests designed to measure one factor, such
as one ability or one trait, are expected to be homogeneous in items. For such tests, it is
reasonable to expect a high degree of internal consistency. By contrast, if the test is
heterogeneous in items, an estimate of internal consistency might be low relative to a more
appropriate estimate of test-retest reliability.
Homogeneity versus heterogeneity of test items
Is a trait, state, or ability presumed to be ever-changing as a function of situational and cognitive experiences.
Dynamic characteristic
In using and interpreting a coefficient of reliability, the issue variously referred to as restriction of range or restriction of variance (or, conversely, inflation of range or inflation of variance) is important. If the variance of either variable in a correlational analysis is restricted by the sampling procedure used, then the resulting correlation coefficient tends to be lower. If the variance of either variable in a correlational analysis is inflated by the sampling procedure, then the resulting correlation coefficient tends to be higher.
Restriction or inflation of range
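The effect can be demonstrated directly: correlate hypothetical paired scores over the full range of one variable, then over a restricted slice of that range.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

# Hypothetical paired scores for ten testtakers
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [1.5, 0.5, 3.5, 2.5, 5.5, 4.5, 7.5, 6.5, 9.5, 8.5]

r_full = pearson_r(x, y)

# Restrict the sample to the middle of the x range
pairs = [(a, b) for a, b in zip(x, y) if 4 <= a <= 7]
r_restricted = pearson_r([a for a, _ in pairs], [b for _, b in pairs])

print(round(r_full, 3), round(r_restricted, 3))  # restriction lowers r
```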
When a time limit is long enough to allow testtakers to attempt all items, and if some items are so difficult that no testtaker is able to obtain a perfect score, then the test is a
Power test
Generally contains items of uniform level of difficulty (typically uniformly low) so that, when given generous time limits, all testtakers should be able to complete all the test items correctly.
Speed test
Is designed to provide an indication of where a testtaker stands with respect to some variable or criterion, such as an educational or a vocational objective.
- tend to contain material that has been mastered in hierarchical fashion
Criterion-referenced tests
Referred to as the true score (or classical) model of measurement.
Classical test theory (CTT)
A value that, according to classical test theory, genuinely reflects an individual's ability (or trait) level as measured by a particular test. Note that this value is very much test dependent.
True score
Seeks to estimate the extent to which specific sources of variation under defined conditions are contributing to the test score.
Domain sampling theory
Is based on the idea that a person’s test scores vary from testing to testing because of
variables in the testing situation.
Generalizability theory
Cronbach encouraged test developers and researchers to describe the details of
the particular test situation or universe leading to a specific test score. This universe is
described in terms of its _____, which include things like the number of items in the test,
the amount of training the test scorers have had, and the purpose of the test administration.
Facets
According to generalizability theory, given the exact same conditions of all the facets in
the universe, the exact same test score should be obtained. This test score is the _____ ________.
Universe score
Examines how generalizable scores from a particular test are if the test is administered in different situations.
Generalizability study
The influence of particular facets on the test score is represented by
Coefficients of generalizability
Examines the usefulness of test scores in helping the test user make decisions.
Decision study
- Another alternative to the true score model
- A synonym for IRT in the academic literature is latent-trait theory
Item response theory (IRT)
In the context of IRT, it signifies the degree to which an item differentiates among people with higher or lower levels of the trait, ability, or whatever it is that is being measured.
Discrimination
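A sketch of discrimination in the two-parameter logistic (2PL) IRT model, where a is the discrimination parameter and b the difficulty (hypothetical values; the Rasch model is the special case a = 1):

```python
from math import exp

def p_correct(theta: float, a: float, b: float) -> float:
    """2PL IRT model: probability of a correct response at ability theta."""
    return 1 / (1 + exp(-a * (theta - b)))

# A more discriminating item separates testtakers one unit below and one unit
# above the item's difficulty more sharply
low_a  = p_correct(1.0, a=0.5, b=0.0) - p_correct(-1.0, a=0.5, b=0.0)
high_a = p_correct(1.0, a=2.0, b=0.0) - p_correct(-1.0, a=2.0, b=0.0)
print(low_a < high_a)  # True
```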
Test items or questions that can be answered with only one of two alternative responses, such
as true–false, yes–no, or correct–incorrect questions
Dichotomous test items
Test items or questions with three or more alternative responses, where only one is scored correct or scored as being consistent with a targeted trait or other construct
Polytomous test items
Refers to an IRT model with very specific assumptions about the underlying distribution.
Rasch model
- Provides a measure of the precision of an observed test score; provides an estimate of the
amount of error inherent in an observed score or measurement. - Is the tool used to estimate or infer the extent to which an observed score deviates from a true score.
- the relationship between the SEM and the reliability of a test is inverse; the higher the reliability of a test (or individual subtest within a test), the lower the SEM.
- denoted by the symbol σmeas, the standard error of measurement is an index of the extent to which one individual’s scores vary over tests presumed to be parallel. In accordance with the
true score model, an obtained test score represents one point in the theoretical distribution of scores the testtaker could have obtained.
Standard Error of Measurement
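Under the classical model, SEM = SD · √(1 − r_xx). A sketch with hypothetical values (SD = 15, reliability = .91):

```python
from math import sqrt

def standard_error_of_measurement(sd: float, reliability: float) -> float:
    """SEM = SD * sqrt(1 - r_xx); shrinks as reliability rises."""
    return sd * sqrt(1 - reliability)

print(round(standard_error_of_measurement(15.0, 0.91), 3))  # 4.5
```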
A range or band of test scores that is likely to contain the true score
Confidence interval
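A sketch of a confidence band built from the SEM: the observed score plus or minus z standard errors (hypothetical observed score of 100 and SEM of 4.5):

```python
def confidence_interval(observed: float, sem: float, z: float = 1.96):
    """Band of +/- z standard errors of measurement around the observed score."""
    return observed - z * sem, observed + z * sem

lo, hi = confidence_interval(100.0, 4.5)
print(round(lo, 2), round(hi, 2))  # 91.18 108.82
```

With z = 1.96, the band covers roughly 95% of the theoretical distribution of scores the testtaker could have obtained.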
A statistical measure that can aid a test user in determining how large a difference should be
before it is considered statistically significant
Standard error of the difference
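When the two scores share the same scale standard deviation, the standard error of the difference can be sketched as SD · √(2 − r₁ − r₂), which equals √(SEM₁² + SEM₂²). Hypothetical values for illustration:

```python
from math import sqrt

def se_difference(sd: float, r1: float, r2: float) -> float:
    """SE_diff = SD * sqrt(2 - r1 - r2), assuming both scores are on a scale
    with the same standard deviation (illustrative sketch)."""
    return sd * sqrt(2 - r1 - r2)

# Hypothetical: SD = 15, reliabilities .91 and .84
print(round(se_difference(15.0, 0.91, 0.84), 3))  # 7.5
```

A difference between two scores must exceed a multiple of this value before it is considered statistically significant.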