Test Construction Flashcards
Item difficulty
- Measured using an item difficulty index ranging from 0 - 1
- p = number of examinees who answered the item correctly divided by the total number of examinees (see the code sketch below)
- Optimal difficulty level depends on likelihood of answering correctly by chance, the goal of the testing, etc.
Item discrimination
- The extent to which an item differentiates between examinees who obtain high versus low scores on the entire test
- D = U - L, where U and L are the proportions of examinees in the upper- and lower-scoring groups who answered the item correctly
- Ranges from -1 to +1
- Items with a discrimination index of .35 or higher are typically considered acceptable
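A minimal code sketch of how these two indexes can be computed from scored (0/1) item responses; the function names, group sizes, and example data below are illustrative only (the upper and lower groups would typically be the highest- and lowest-scoring examinees, often the top and bottom 27%, on the total test):

```python
def item_difficulty(responses):
    """Item difficulty index p: proportion of examinees answering the item correctly."""
    return sum(responses) / len(responses)

def item_discrimination(upper_responses, lower_responses):
    """Discrimination index D = U - L: difference between the proportions of the
    upper- and lower-scoring groups who answered the item correctly."""
    return item_difficulty(upper_responses) - item_difficulty(lower_responses)

# Hypothetical responses to one item (1 = correct, 0 = incorrect)
upper = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]   # 8 of 10 high scorers pass the item
lower = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]   # 4 of 10 low scorers pass the item

print(item_difficulty(upper + lower))     # 0.6
print(item_discrimination(upper, lower))  # 0.4 (>= .35, so acceptable)
```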
Item characteristic curve
- Constructed for each item
- Plot the proportion of examinees in the sample who answered correctly against the total test score, performance on an external criterion, or an estimate of the latent ability or trait measured by the item
Item response theory
Item parameters are sample invariant (they do not depend on the particular sample used to derive them)
Classical test theory
Uses 2 methods of item analysis: item difficulty and item discrimination
Limitations of CTT
- Item and test parameters are sample dependent
- Difficult to equate scores obtained on different tests or test forms
Item’s level of difficulty
Ability level at which 50% of the examinees provide a correct response
Item’s ability to discriminate
Indicated by the slope of the curve
The steeper the slope, the greater the discrimination
Probability of guessing correctly
Indicated by the point at which the ICC intercepts the vertical axis
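The three features above map onto the parameters of the three-parameter logistic (3PL) model commonly used in IRT: b (difficulty), a (discrimination), and c (guessing). A small illustrative sketch with made-up parameter values; when c = 0, b is the ability level at which 50% of examinees answer correctly, and when c > 0, it is the level at which the probability is halfway between c and 1.0:

```python
import math

def icc_3pl(theta, a, b, c):
    """3PL item characteristic curve:
    P(theta) = c + (1 - c) / (1 + exp(-a * (theta - b)))
    a = discrimination (slope), b = difficulty (location), c = guessing (lower asymptote)."""
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

# Hypothetical item: moderate discrimination, average difficulty, 20% chance of a correct guess
print(icc_3pl(theta=0.0, a=1.5, b=0.0, c=0.20))  # ~0.60, midway between c = 0.20 and 1.0
```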
Test score in Classical Test Theory
X = T + E
X = Obtained test score
T = True score component
E = Error component (measurement error)
Reliability coefficient
Ranges from 0 to 1
Correlation coefficient
Unlike most correlation coefficients, it is interpreted directly and is never squared
Ex. a reliability coefficient of .89 means that 89% of variability in obtained scores is true score variability
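This follows from how the coefficient is defined in classical test theory: reliability is the ratio of true score variance to obtained score variance (r = true score variance / total score variance), so it already expresses a proportion of variance and is interpreted directly.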
Test-retest reliability
Same test to same group of examinees on two different occasions
The coefficient (coefficient of stability) indicates the consistency of scores over time
May be impacted by a PRACTICE EFFECT
Alternate forms reliability
Two equivalent forms of the test are administered to the same group of examinees
Yields a coefficient of equivalence (or a coefficient of equivalence and stability when administration of the second form is delayed)
May be impacted by CONTENT SAMPLING
Good for speeded tests
Split-half reliability
Scores on two halves of the test are correlated (e.g., odd versus even numbered items)
Usually underestimates a test’s true reliability
Spearman-Brown Prophecy Formula
Used to correct the split-half reliability coefficient
Provides an estimate of what the reliability coefficient would be for a test of a given (e.g., full) length
Tends to overestimate a test's true reliability
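For reference, the standard formula is r_new = (n x r) / (1 + (n - 1) x r), where r is the obtained reliability and n is the factor by which the test length changes. The split-half correction uses n = 2; e.g., a split-half coefficient of .80 corrects to (2 x .80) / (1 + .80), or about .89.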
Cronbach’s coefficient Alpha
Calculates the average reliability from all possible splits of the test
Considered a conservative (lower-bound) estimate of reliability
Kuder-Richardson Formula 20
Used when test items are scored dichotomously
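For reference, coefficient alpha = (k / (k - 1)) x (1 - sum of item variances / total test score variance), where k is the number of items; KR-20 is the special case for dichotomously scored items, where each item's variance equals p x q (p = proportion passing the item, q = 1 - p).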
Kappa statistic/Cohen’s Kappa
Used when scores or ratings represent a nominal or ordinal scale of measurement
Test length
The longer the test, the less the relative effects of measurement error and the larger the reliability coefficient
The Spearman-Brown prophecy formula can also be used to estimate the impact of lengthening a test
Range of scores
The reliability coefficient is maximized when the range of scores is unrestricted
Standard error of measurement
Index of the amount of error that can be expected in obtained scores due to the unreliability of the test
SEM = SD x sqrt(1 - reliability coefficient), where SD is the standard deviation of the obtained test scores
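A worked example with made-up numbers: for a test with SD = 15 and a reliability coefficient of .91, SEM = 15 x sqrt(1 - .91) = 15 x .30 = 4.5, so a 68% confidence interval is the obtained score +/- 4.5 points and a 95% confidence interval is the obtained score +/- 1.96 x 4.5, or about +/- 8.8 points.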
Internal consistency reliability
How well items within a test correlate with other items on the same test
This includes split-half reliability and Cronbach’s coefficient alpha
Content sampling
Impacts split-half reliability and coefficient alpha
Inter-rater reliability
Measured using a kappa or a percent agreement
Kendall’s coefficient of concordance
Used to assess inter-rater reliability when there are three or more raters and ratings are reported as ranks
Consensual observer drift
When two or more observers working together influence each other’s ratings and both assign ratings in a similarly idiosyncratic way
Impact of guessing on reliability
As the probability of guessing correctly increases, the reliability coefficient decreases
Content validity
Items on the test adequately represent the domain being measured
Construct validity
The extent to which the test measures the construct it is intended to measure, as shown by expected relationships with other variables
Convergent and Discriminant Validity
Convergent validity = high correlations with measures of the same and related traits
Discriminant validity = low correlations with measures of unrelated characteristics
Multitrait-Multimethod Matrix
Systematically organizes data collected when assessing a test’s convergent and discriminant validity
Includes coefficients that are: monotrait-monomethod, monotrait-heteromethod, heterotrait-monomethod, heterotrait-heteromethod
Criterion-related Validity
Test scores are correlated with (or predict) an examinee's performance on an external criterion
Standard error of estimate
SEE = standard deviation of criterion scores x sqrt(1 - validity coefficient squared)
Used to construct a confidence interval around an estimated score
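A worked example with made-up numbers: if the standard deviation of criterion scores is 10 and the validity coefficient is .60, SEE = 10 x sqrt(1 - .36) = 10 x .80 = 8.0, so a 95% confidence interval is the predicted criterion score +/- 1.96 x 8.0, or about +/- 15.7 points.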
Incremental validity
The increase in correct decisions that can be expected if the predictor is used as a decision making tool
Calculated as the positive hit rate minus the base rate
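A worked example with made-up numbers: if 60% of people selected without the predictor turn out to be successful (base rate = .60) and 75% of those selected using the predictor are successful (positive hit rate = .75), incremental validity = .75 - .60 = .15, i.e., using the predictor yields 15 more correct decisions per 100.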
Specificity and sensitivity
Provide information about a predictor's accuracy when administered to a group of individuals whose status on the disorder of interest is already known
Sensitivity = percentage of people who have the disorder who are accurately identified as having it: true positives / (true positives + false negatives)
Specificity = percentage of people who do not have the disorder who are accurately identified as not having it: true negatives / (true negatives + false positives)
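A minimal sketch using a hypothetical 2 x 2 classification table (all counts are made up for illustration):

```python
# Hypothetical 2 x 2 table:
#                  Disorder present   Disorder absent
# Test positive        TP = 80            FP = 30
# Test negative        FN = 20            TN = 70
TP, FP, FN, TN = 80, 30, 20, 70

sensitivity = TP / (TP + FN)  # proportion of people WITH the disorder who test positive
specificity = TN / (TN + FP)  # proportion of people WITHOUT the disorder who test negative

print(sensitivity)  # 0.8
print(specificity)  # 0.7
```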
Criterion Contamination
Occurs when the person who rates the criterion knows examinees' predictor scores; tends to artificially inflate the criterion-related validity coefficient
Cross-validation
Re-assessing a predictor's criterion-related validity on a new, independent sample
Shrinkage
The decrease in the validity coefficient that typically occurs when a predictor is cross-validated
Concurrent validity
A form of criterion-related validity.
When criterion data are collected prior to or at the same time as data on the predictor
Predictive validity
When the criterion is measured at some point after the predictor has been administered
Criterion-related validity coefficient
Ranges from -1 to 1
Positive predictive value
Probability that people who test positive actually have the disorder
Positive likelihood ratio
The extent to which a positive result affects the probability that the person has a disorder
A useful predictor has an LR+ greater than 1.0 (an LR+ of exactly 1.0 means a positive result does not change the likelihood that the person has the disorder)
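For reference: PPV = true positives / (true positives + false positives), and LR+ = sensitivity / (1 - specificity). Using made-up counts of 80 true positives, 30 false positives, 20 false negatives, and 70 true negatives: PPV = 80 / 110, or about .73; sensitivity = .80 and specificity = .70, so LR+ = .80 / .30, or about 2.67.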
Relationship between reliability and validity
A test’s reliability always places a ceiling on its validity
High reliability however does not guarantee validity
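A standard corollary: the validity coefficient cannot exceed the square root of the product of the predictor's and criterion's reliability coefficients (validity <= sqrt(predictor reliability x criterion reliability)). For example, a predictor with a reliability of .64 cannot have a validity coefficient greater than .80, even with a perfectly reliable criterion.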