Test Construction Flashcards
What is Classical Test Theory?
A theory of measurement used for developing and evaluating tests, also known as true score test theory
What is the formula representing the relationship between obtained test scores, true score variability, and measurement error?
X = T + E
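A minimal simulation of this model (the distributions, means, and standard deviations below are illustrative assumptions, not part of classical test theory itself):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical values: true scores differ across examinees;
# error is random noise with mean 0.
true_scores = rng.normal(loc=100, scale=15, size=10_000)  # T
error = rng.normal(loc=0, scale=5, size=10_000)           # E
obtained = true_scores + error                            # X = T + E

# Reliability is the proportion of obtained score variance that is
# true score variance: 15^2 / (15^2 + 5^2) = .90.
print(f"Reliability ~ {true_scores.var() / obtained.var():.2f}")
```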
What does true score variability (T) represent?
Actual differences among examinees regarding what the test measures
What is measurement error (E)?
Random factors affecting test performance in unpredictable ways
What are some examples of measurement error?
- Distractions during testing
- Ambiguously worded test items
- Examinee fatigue
What does test reliability refer to?
The extent to which a test provides consistent information
What is a reliability coefficient?
A type of correlation coefficient that ranges from 0 to 1.0
How is a reliability coefficient interpreted?
As the proportion of variability in obtained test scores that is due to true score variability
What reliability coefficient is considered minimally acceptable for many tests?
0.70 or higher
What reliability coefficient is usually required for high-stakes tests?
0.90 or higher
What are the four main methods for assessing a test’s reliability?
- Test-retest
- Alternate forms
- Internal consistency
- Inter-rater
What does test-retest reliability measure?
The consistency of scores over time
How is alternate forms reliability assessed?
By correlating scores from different forms of the test administered to the same examinees
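Both test-retest and alternate forms reliability are ordinary Pearson correlations between two sets of scores from the same examinees. A minimal sketch with hypothetical scores:

```python
import numpy as np

# Hypothetical scores for the same five examinees on two occasions
# (test-retest) or on two forms of the test (alternate forms).
scores_1 = np.array([85, 92, 78, 60, 71])
scores_2 = np.array([88, 90, 75, 65, 70])

# The Pearson r between the two score sets is the reliability coefficient.
r = np.corrcoef(scores_1, scores_2)[0, 1]
print(f"Reliability coefficient: {r:.2f}")
```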
What does internal consistency reliability measure?
The consistency of scores over different test items
Why is internal consistency reliability not useful for speed tests?
It tends to overestimate their reliability
What is coefficient alpha also known as?
Cronbach’s alpha
What is Kuder-Richardson 20 (KR-20) used for?
Evaluating internal consistency reliability for dichotomously scored items
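A sketch of the KR-20 calculation, assuming a small hypothetical matrix of dichotomously scored (0/1) responses:

```python
import numpy as np

# Rows = examinees, columns = items; 1 = correct, 0 = incorrect (hypothetical data).
responses = np.array([
    [1, 1, 1, 0, 1],
    [1, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [1, 1, 0, 1, 1],
])

k = responses.shape[1]                   # number of items
p = responses.mean(axis=0)               # proportion passing each item
q = 1 - p                                # proportion failing each item
total_var = responses.sum(axis=1).var()  # variance of total test scores

kr20 = (k / (k - 1)) * (1 - (p * q).sum() / total_var)
print(f"KR-20: {kr20:.2f}")
```

Coefficient alpha generalizes this formula to items scored on any scale by replacing the sum of p × q with the sum of the item variances.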
What is the split-half reliability method?
Correlating scores from two halves of a test
What is a drawback of split-half reliability?
It underestimates the full test's reliability because each half is only half the length of the full test, and reliability decreases as test length decreases
What formula is used to correct split-half reliability?
Spearman-Brown prophecy formula
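A sketch of the correction, assuming a hypothetical correlation of .60 between the two halves:

```python
def spearman_brown(r: float, n: float = 2.0) -> float:
    """Predicted reliability of a test n times as long as the test
    that produced reliability coefficient r. For split-half
    reliability, n = 2, since the full test is twice each half."""
    return n * r / (1 + (n - 1) * r)

# A half-test correlation of .60 corrected to full-test length:
print(f"{spearman_brown(0.60):.2f}")  # 0.75
```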
What does inter-rater reliability assess?
The consistency of scores or ratings assigned by different raters
What methods are used to evaluate inter-rater reliability?
- Percent agreement
- Cohen’s kappa coefficient
What is a limitation of percent agreement in inter-rater reliability?
It does not account for chance agreement
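Cohen's kappa corrects for this by subtracting the agreement expected by chance alone. A sketch with hypothetical ratings from two raters:

```python
import numpy as np

# Hypothetical categorical ratings assigned by two raters to eight cases.
rater_a = np.array(["yes", "yes", "no", "no", "yes", "no", "yes", "no"])
rater_b = np.array(["yes", "no", "no", "no", "yes", "yes", "yes", "no"])

p_observed = (rater_a == rater_b).mean()  # percent agreement

# Chance agreement: probability that both raters independently
# assign the same category, summed over categories.
categories = np.union1d(rater_a, rater_b)
p_chance = sum((rater_a == c).mean() * (rater_b == c).mean() for c in categories)

kappa = (p_observed - p_chance) / (1 - p_chance)
print(f"Percent agreement: {p_observed:.2f}, kappa: {kappa:.2f}")  # .75, .50
```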
What is consensual observer drift?
Increased consistency (but often decreased accuracy) in ratings due to raters communicating
How can consensual observer drift be reduced?
- Not having raters work together
- Providing adequate training
- Regularly monitoring accuracy
What factor affects the size of the reliability coefficient related to content?
Content homogeneity
Tests that are homogeneous regarding content tend to have larger reliability coefficients than heterogeneous tests, especially for internal consistency reliability.
How does the range of scores influence reliability coefficients?
Larger reliability coefficients occur when test scores are unrestricted in range
This happens when the sample includes examinees with high, moderate, and low levels of the characteristics measured.
What impact does guessing have on reliability coefficients?
The easier it is for examinees to answer items correctly by guessing, the lower the reliability coefficient
True/false tests are likely less reliable than multiple-choice tests with three or more answer choices.
What is the reliability index?
Theoretical correlation between observed test scores and true test scores
Calculated by taking the square root of the reliability coefficient.
What does an item analysis determine in test development?
Which items to include based on difficulty level and discrimination ability
It is a process used in classical test theory.
How is item difficulty (p) calculated?
p = number of correct answers / total number of examinees
Ranges from 0 to 1.0, with smaller values indicating more difficult items.
What is the preferred range of item difficulty for most tests?
p = .30 to .70
Moderately difficult items are preferred, but optimal values may vary based on the test purpose.
When are lower p values (more difficult items) optimal?
When a test is used to select a limited proportion of examinees
For example, if only the top 20% of examinees will be selected, an optimal average item difficulty of .20 might be used.
How is the optimal difficulty level for guessing calculated?
Optimal p = (1.0 + probability of guessing) / 2
For a four-answer multiple-choice question, this would be (1.0 + .25) / 2 = .625.
What does the item discrimination index (D) measure?
Difference in correct responses between high and low total test score groups
Ranges from -1.0 to +1.0, with higher D values indicating better discrimination.
What is an acceptable D value for most tests?
D value of .30 or higher
Items of moderate difficulty typically have higher discrimination levels.
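A sketch computing p and D from a hypothetical dichotomous response matrix; splitting examinees into top and bottom halves by total score is one common convention (others use the top and bottom 27%):

```python
import numpy as np

# Rows = examinees, columns = items; 1 = correct (hypothetical data).
responses = np.array([
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 1],
    [0, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
])

# Item difficulty: proportion of examinees answering each item correctly.
p = responses.mean(axis=0)

# Split examinees into low- and high-scoring groups by total score.
order = responses.sum(axis=1).argsort()
low, high = order[:3], order[-3:]

# D: proportion correct in the high group minus the low group, per item.
D = responses[high].mean(axis=0) - responses[low].mean(axis=0)
print("p:", p.round(2))
print("D:", D.round(2))
```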
What does a reliability coefficient less than 1.0 indicate about a test score?
An examinee’s obtained test score may or may not be their true score.
What is a confidence interval in the context of test scores?
It indicates the range within which an examinee's true score is likely to fall, given their obtained score.
How is the standard error of measurement calculated?
SEM = SD × √(1 − r), i.e., the test's standard deviation multiplied by the square root of 1 minus the reliability coefficient.
What is the standard error of measurement if the standard deviation is 5 and the reliability coefficient is .84?
2 (SEM = 5 × √(1 − .84) = 5 × .4 = 2).
How do you construct a 68% confidence interval around an obtained test score?
Add and subtract one standard error of measurement to and from the obtained score.
How do you construct a 95% confidence interval around an obtained test score?
Add and subtract two standard errors of measurement to and from the obtained score.
How do you construct a 99% confidence interval around an obtained test score?
Add and subtract three standard errors of measurement to and from the obtained score.
What is the 95% confidence interval for an examinee who scored 90 with a standard error of measurement of 5?
80 to 100 (90 ± 2 × 5 = 90 ± 10).
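A sketch tying the SEM and confidence interval calculations together, using the values from the examples above:

```python
import math

def sem(sd: float, reliability: float) -> float:
    """Standard error of measurement: SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - reliability)

def confidence_interval(score: float, sem_value: float, n_sems: int) -> tuple[float, float]:
    """n_sems = 1, 2, or 3 for a ~68%, ~95%, or ~99% interval."""
    return (score - n_sems * sem_value, score + n_sems * sem_value)

print(sem(5, 0.84))                   # 2.0
print(confidence_interval(90, 5, 2))  # (80, 100)
```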
What does Item Response Theory (IRT) focus on?
Examinees’ responses to individual test items.
How does IRT differ from Classical Test Theory (CTT)?
CTT is test-based and focuses on total test scores, while IRT is item-based.
What advantage does IRT have over CTT regarding item parameters?
IRT derives sample-invariant item parameters using mathematical techniques and a large sample, so parameter estimates do not depend on the particular group of examinees tested.
What is a computerized adaptive test?
A test that tailors items to each examinee by presenting items appropriate for their level of the trait.
What is another name for Item Response Theory?
Latent trait theory.
What does the item characteristic curve (ICC) represent?
The relationship between each item and the latent trait measured by the test.
What are the two axes of the ICC graph?
Levels of the latent trait, often estimated from total test scores (horizontal/x-axis), and probabilities of endorsing or answering the item correctly (vertical/y-axis).
What does the difficulty parameter in IRT indicate?
The level of the trait required for a 50% probability of endorsing or answering the item correctly.
What does the discrimination parameter in IRT indicate?
How well the item can discriminate between individuals with high and low levels of the trait.
What does the slope of the ICC indicate?
The steeper the slope, the better the discrimination of the item.
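A sketch of a two-parameter logistic ICC; the parameter values are illustrative:

```python
import numpy as np

def icc_2pl(theta: np.ndarray, a: float, b: float) -> np.ndarray:
    """Two-parameter logistic model: probability of endorsing or
    answering an item correctly as a function of trait level theta.
    a = discrimination (slope of the curve), b = difficulty
    (the trait level at which the probability is exactly .50)."""
    return 1 / (1 + np.exp(-a * (theta - b)))

theta = np.linspace(-3, 3, 7)
# A discriminating item (a = 1.5) of moderate difficulty (b = 0):
print(icc_2pl(theta, a=1.5, b=0.0).round(2))
# At theta = b, the probability is exactly .50:
print(icc_2pl(np.array([0.0]), a=1.5, b=0.0))  # [0.5]
```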