Test Construction Flashcards
Some test experts use the term _____________ to refer to the extent to which test items contribute to achieving the stated goals of testing.
relevance
A determination of relevance is based on a qualitative judgement that takes into account which factors?
Content appropriateness (Does the item actually assess the content or behavior domain that the test is designed to evaluate?) Taxonomic level (Does the item reflect the appropriate cognitive or ability level?) Extraneous abilities (To what extent does the item require knowledge, skills, or abilities outside the domain of interest?)
An item’s difficulty is measured by calculating an item difficulty index (p). What equation defines it?
p = (number of examinees who answered the item correctly) / (total number of examinees). The value of p ranges from 0 to 1.0, with larger values indicating easier items. When p is equal to 1.0, the item was answered correctly by all examinees; when p is 0, none of the examinees answered the item correctly.
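
As a quick illustration of the difficulty index, here is a minimal Python sketch. The function name `item_difficulty` and the sample scores are hypothetical; the only assumption is that each examinee's response is coded 1 for correct and 0 for incorrect.

```python
def item_difficulty(responses):
    """Return p: the proportion of examinees who answered the item correctly.

    `responses` is a list of 0/1 scores (1 = correct), one per examinee.
    """
    if not responses:
        raise ValueError("need at least one response")
    return sum(responses) / len(responses)


# Hypothetical data: 8 of 10 examinees answered the item correctly.
scores = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]
print(item_difficulty(scores))  # 0.8
```
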
In most situations, a p value of _____ is optimal. One exception is the case of a true/false test, for which the optimal p value is _____.
.50; .75
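
The .75 value for true/false items follows from a common rule of thumb: the optimal p lies halfway between the chance success rate and 1.0. The sketch below applies that rule; the function name `optimal_difficulty` is hypothetical, and the four-option example is an added illustration, not part of the card.

```python
def optimal_difficulty(chance_level):
    """Midpoint between the chance success rate and 1.0."""
    return (1.0 + chance_level) / 2


print(optimal_difficulty(0.0))   # guessing cannot help: 0.5
print(optimal_difficulty(0.5))   # true/false item: 0.75
print(optimal_difficulty(0.25))  # four-option multiple choice: 0.625
```
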
______________________ refers to the extent to which a test item differentiates between examinees who obtain high versus low scores on the entire test or on an external criterion.
Item discrimination
The item discrimination index ranges from _____ to _____.
-1.0; +1.0
For most tests, an item with a discrimination index of _____ or higher is considered acceptable.
.35
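
One common way to compute a discrimination index, usually written D, is to subtract the proportion of low scorers who answered the item correctly from the proportion of high scorers who did. The sketch below assumes that definition. The function name, the 27% cutoff for the extreme groups, and the sample data are illustrative assumptions rather than something the cards specify.

```python
def discrimination_index(item_scores, total_scores, fraction=0.27):
    """D = p(upper group) - p(lower group) for one item.

    item_scores: 0/1 score on the item for each examinee.
    total_scores: total test score for each examinee, in the same order.
    fraction: share of examinees placed in each extreme group (assumed 27%).
    """
    n = len(item_scores)
    k = max(1, round(n * fraction))
    # Rank examinee positions from lowest to highest total score.
    order = sorted(range(n), key=lambda i: total_scores[i])
    lower = [item_scores[i] for i in order[:k]]
    upper = [item_scores[i] for i in order[-k:]]
    return sum(upper) / k - sum(lower) / k


# Hypothetical tryout sample of 10 examinees.
totals = [10, 12, 15, 18, 20, 22, 25, 27, 29, 30]
item = [0, 0, 1, 0, 1, 0, 1, 1, 1, 1]
d = discrimination_index(item, totals)
print(round(d, 2))  # 0.67, above the .35 threshold on the card
```
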
When using item response theory, an ______________________ is constructed for each item by plotting the proportion of examinees in the tryout sample who answered the item correctly against the total test score, performance on an external criterion, or a mathematically derived estimate of a latent ability or trait.
Item characteristic curve (ICC)
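
The plotting step can be sketched by tabulating the points of such a curve. This is a rough empirical version that uses total test score on the horizontal axis, one of the options the card lists; it does not fit a latent-trait model. The function name `empirical_icc` and the data are hypothetical.

```python
from collections import defaultdict


def empirical_icc(item_scores, total_scores):
    """Proportion answering the item correctly at each total-score level."""
    counts = defaultdict(lambda: [0, 0])  # score -> [n_correct, n_examinees]
    for item, total in zip(item_scores, total_scores):
        counts[total][0] += item
        counts[total][1] += 1
    return {score: c / n for score, (c, n) in sorted(counts.items())}


# Hypothetical data: 8 examinees with total scores from 1 to 4.
item = [0, 0, 1, 0, 1, 1, 1, 1]
totals = [1, 1, 2, 2, 3, 3, 4, 4]
print(empirical_icc(item, totals))  # {1: 0.0, 2: 0.5, 3: 1.0, 4: 1.0}
```

Each key/value pair is one point on the curve: total score on the x-axis, proportion correct on the y-axis.
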
The theory of measurement that regards observed variability in test scores as reflecting two components: true differences between examinees on the attributes measured by the test and the effects of measurement (random) error
Classical test theory
______________ is a measure of true score variability. It refers to the consistency of test scores; i.e., the extent to which a test measures an attribute without being affected by random fluctuations (measurement error) that produce inconsistencies over time, across items, or over different forms.
Reliability
When a test is ____________, it provides dependable, consistent results and, for this reason, the term consistency is often given as a synonym.
Reliable
What are some methods for establishing reliability?
Test-retest, alternate forms, split-half, coefficient alpha, and inter-rater
Most methods for estimating reliability produce a ______________________, which is a correlation coefficient that ranges in value from 0.0 to 1.0.
Reliability coefficient
What does it mean if a test’s reliability coefficient is 0.0?
All variability in obtained test scores is due to measurement error.
When a test’s reliability coefficient is 1.0, this indicates that all variability reflects what?
True score variability
If a test has a reliability coefficient of .91, this means that ____% of variability in obtained test scores is due to ______________ variability, while the remaining 9% reflects _____________.
91; true score; measurement error
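
The arithmetic behind this card can be made concrete. In classical test theory the reliability coefficient is the ratio of true score variance to observed score variance, so multiplying it by the observed variance splits that variance into two parts. The function name and the observed variance of 100 are hypothetical values chosen to make the percentages easy to read.

```python
def partition_variance(observed_variance, reliability):
    """Split observed score variance into true-score and error components."""
    true_variance = reliability * observed_variance
    error_variance = observed_variance - true_variance
    return true_variance, error_variance


# Hypothetical test with an observed score variance of 100 and r = .91.
true_var, error_var = partition_variance(100.0, 0.91)
print(f"{true_var:.1f}, {error_var:.1f}")  # 91.0, 9.0
```
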
Match each method for estimating reliability to the correct definition:
a. Test-Retest Reliability
b. Alternate (Equivalent, Parallel) Forms Reliability
c. Internal Consistency Reliability
d. Inter-Rater (Inter-Scorer, Inter-Observer) Reliability
1. ___ To assess this, two equivalent forms of the test are administered to the same group of examinees and the two sets of scores are correlated. Indicates the consistency of responding to different item samples and, when the forms are administered at different times, the consistency of responding over time.
2. ___ Involves administering the same test to the same group of examinees on two different occasions and then correlating the two sets of scores. It is used for determining the reliability of tests designed to measure attributes that are relatively stable over time and that are not affected by repeated measurement (e.g., aptitude). Most thorough.
3. ___ Split-half reliability and coefficient alpha are two methods for evaluating this. Both involve administering the test once to a single group of examinees and are useful when a test is designed to measure a single characteristic, when the characteristic measured by the test fluctuates over time, or when scores are likely to be affected by repeated exposure to the test.
4. ___ Is of concern whenever test scores depend on a rater’s judgement. It is assessed either by calculating a correlation coefficient or by determining the percent of agreement between two or more raters.
1. b 2. a 3. c 4. d
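
Two of the computations described in this card are easy to sketch: correlating two sets of scores (as in test-retest or alternate forms reliability) and computing the percent of agreement between two raters. The function names and all data below are hypothetical, and a Pearson correlation is assumed as the correlation coefficient.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)


def percent_agreement(rater_1, rater_2):
    """Share of cases on which two raters gave the same rating."""
    matches = sum(a == b for a, b in zip(rater_1, rater_2))
    return matches / len(rater_1)


# Hypothetical scores for the same five examinees on two occasions.
first = [12, 15, 9, 20, 17]
second = [13, 14, 10, 19, 18]
print(round(pearson_r(first, second), 2))

# Hypothetical pass/fail ratings from two raters.
print(percent_agreement(["p", "f", "p", "f"], ["p", "f", "f", "f"]))  # 0.75
```
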
Link the terms that belong together:
a. Spearman-Brown formula
b. KR-20
c. Kappa statistic
1. Inter-rater reliability
2. Split-half reliability
3. Coefficient alpha
1. c 2. a 3. b
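
The three statistics named in this card can be sketched in a few lines each. The function names and data are hypothetical. The KR-20 sketch assumes the population variance of total scores (some texts use the sample variance), and the kappa sketch is Cohen's kappa for two raters.

```python
def spearman_brown(half_test_r):
    """Estimate full-length reliability from a split-half correlation."""
    return 2 * half_test_r / (1 + half_test_r)


def kr20(score_matrix):
    """KR-20 for dichotomous (0/1) items; rows are examinees, columns items."""
    n = len(score_matrix)
    k = len(score_matrix[0])
    # Sum of p * q over items, where p is the item's difficulty index.
    pq_sum = 0.0
    for j in range(k):
        p = sum(row[j] for row in score_matrix) / n
        pq_sum += p * (1 - p)
    totals = [sum(row) for row in score_matrix]
    mean_total = sum(totals) / n
    # Assumption: population variance (divide by n) of the total scores.
    var_total = sum((t - mean_total) ** 2 for t in totals) / n
    return (k / (k - 1)) * (1 - pq_sum / var_total)


def cohens_kappa(rater_1, rater_2):
    """Agreement between two raters, corrected for chance agreement."""
    n = len(rater_1)
    observed = sum(a == b for a, b in zip(rater_1, rater_2)) / n
    categories = set(rater_1) | set(rater_2)
    expected = sum(
        (rater_1.count(c) / n) * (rater_2.count(c) / n) for c in categories
    )
    return (observed - expected) / (1 - expected)


print(round(spearman_brown(0.6), 2))                      # 0.75
print(kr20([[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]))  # 0.75
print(cohens_kappa(["y", "y", "n", "n"], ["y", "y", "n", "y"]))  # 0.5
```

KR-20 is the special case of coefficient alpha for items scored as right or wrong, which is why the two are paired.
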