Test Construction Flashcards
Item Difficulty
• The proportion of examinees in the tryout sample who answer the item correctly
• Used to measure examinees' knowledge or skill level
• Symbolized "p"; range: 0 to 1
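The difficulty index can be computed directly from scored item responses; a minimal Python sketch (the 0/1 response list is hypothetical tryout data):

```python
# Item difficulty (p): proportion of examinees answering the item correctly.
# Responses coded 1 = correct, 0 = incorrect (hypothetical tryout sample).
responses = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]

p = sum(responses) / len(responses)
print(p)  # 0.7 -> a moderately easy item
```

A p near .50 provides maximum differentiation among examinees; values near 0 or 1 mean nearly everyone misses or passes the item.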
Item Discrimination
• Refers to the extent to which an item discriminates between examinees who obtain low versus high scores on the test or an external criterion
• Calculated by subtracting the percent of examinees in the lower scoring group who answered the item correctly from the percent of examinees in the upper-scoring group who answered the item correctly
"D"; range: -1 to +1; .35 or higher is generally considered acceptable
Item Discrimination:
-1, 0, +1
+1: all examinees in the upper-scoring group answered the item correctly and all examinees in the lower-scoring group answered it incorrectly
0: the same proportion of examinees in both groups answered the item correctly
-1: all examinees in the lower-scoring group answered the item correctly and all examinees in the upper-scoring group answered it incorrectly
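The D calculation described above, sketched in Python with hypothetical upper- and lower-group response lists:

```python
# Item discrimination (D): proportion correct in the upper-scoring group
# minus proportion correct in the lower-scoring group (hypothetical data).
upper = [1, 1, 1, 0, 1]  # item responses for the upper-scoring group
lower = [0, 1, 0, 0, 1]  # item responses for the lower-scoring group

D = sum(upper) / len(upper) - sum(lower) / len(lower)
print(round(D, 2))  # 0.4 -> above the .35 guideline
```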
Classical Test Theory
• Variability in test scores reflects a combination of true score variability and variability due to measurement (random) error: X = T + E
• Total Variability (X) = True Score Variability (T) + Measurement Error Variability (E)
True Score Variability
Variability due to actual differences in the knowledge or skill the test measures
Measurement Error
Variability due to random factors such as the testing environment or guessing
Reliability (4 things)
Ensures that an examinee's obtained score reflects his/her true score; symbolized rxx; range: 0 to 1; .80 or higher is generally considered acceptable
Reliability Coefficient
Measure of true score variability (the proportion of observed score variability that is attributable to true score variability)
Reliability 4 methods (4)
- test-retest
- alternate forms
- internal consistency
- inter-rater
Test-Retest Reliability (3)
Consistency of scores over time; also known as the coefficient of stability; NOT appropriate for fluctuating characteristics or attributes easily affected by random factors
Alternate Forms Reliability (3)
Consistency of scores across two forms of a test; aka parallel forms reliability; appropriate for characteristics that are stable over time; NOT appropriate for fluctuating characteristics or attributes easily affected by random factors
Internal Consistency (3)
Degree of consistency across different test items; appropriate when the test measures a single content or behavior domain; measured by split-half reliability and Cronbach's coefficient alpha
Split-Half Reliability (6)
Associated with internal consistency; the test is split in half and scores on the two halves are correlated; the resulting coefficient is corrected with the Spearman-Brown prophecy formula; NOT appropriate for speeded tests; related methods are Cronbach's coefficient alpha and the Kuder-Richardson Formula 20
Spearman-Brown Formula
Associated with internal consistency; used along with split-half reliability and, more generally, to estimate the effect of shortening or lengthening a test on its reliability coefficient
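A sketch of the Spearman-Brown formula in Python (the reliability values below are hypothetical); n is the factor by which the test is lengthened, so correcting a split-half coefficient uses n = 2 because each half is half the test's length:

```python
# Spearman-Brown prophecy formula:
#   r_new = n * r_old / (1 + (n - 1) * r_old)
def spearman_brown(r_old, n):
    return n * r_old / (1 + (n - 1) * r_old)

# Correcting a hypothetical split-half coefficient of .70 (n = 2):
print(round(spearman_brown(0.70, 2), 3))  # 0.824

# Predicted reliability if a test with rxx = .60 is doubled in length:
print(round(spearman_brown(0.60, 2), 3))  # 0.75
```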
Cronbach’s Coefficient Alpha
Associated with split-half reliability; equal to the "mean of all possible split-half correlation coefficients"; used when items are scored on a continuous scale; cannot be used with "forced-choice" items
Kuder-Richardson Formula 20 (KR-20) (3)
Associated with split-half reliability; can be substituted for coefficient alpha when test items are scored dichotomously; used for true/false and multiple-choice items scored as right or wrong
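KR-20 from its standard formula, KR-20 = k/(k−1) × (1 − Σ pq / total score variance), using a hypothetical matrix of dichotomous (right/wrong) responses:

```python
# KR-20: coefficient alpha for items scored 0/1.
# Hypothetical responses: 5 examinees x 4 items (1 = right, 0 = wrong).
data = [
    [1, 1, 1, 0],
    [1, 0, 1, 1],
    [0, 1, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 1, 0],
]
k = len(data[0])
n = len(data)

pq = 0.0
for i in range(k):
    p = sum(row[i] for row in data) / n  # proportion passing item i
    pq += p * (1 - p)                    # item variance for a 0/1 item

totals = [sum(row) for row in data]
mean = sum(totals) / n
var_total = sum((t - mean) ** 2 for t in totals) / n  # population variance
kr20 = k / (k - 1) * (1 - pq / var_total)
print(round(kr20, 3))  # 0.519
```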
Inter-rater Reliability (5)
Important for measures that are subjectively scored (e.g., essays); ensures that an examinee obtains the same score no matter who does the scoring; measured using percent agreement (which can overestimate inter-rater reliability), Cohen's kappa statistic, or Kendall's coefficient of concordance
Cohen’s kappa statistic
Associated with inter-rater reliability; used to measure agreement between two raters when scores represent a nominal scale
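A sketch of Cohen's kappa for two raters assigning hypothetical nominal (pass/fail) ratings; kappa corrects the observed agreement for the agreement expected by chance, κ = (p_o − p_e) / (1 − p_e):

```python
from collections import Counter

# Hypothetical nominal ratings from two raters
rater1 = ["pass", "pass", "fail", "pass", "fail", "pass", "fail", "pass"]
rater2 = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass"]
n = len(rater1)

# Observed proportion of agreement
p_o = sum(a == b for a, b in zip(rater1, rater2)) / n

# Chance agreement: product of each rater's marginal proportions per category
c1, c2 = Counter(rater1), Counter(rater2)
p_e = sum((c1[c] / n) * (c2[c] / n) for c in set(rater1) | set(rater2))

kappa = (p_o - p_e) / (1 - p_e)
print(round(kappa, 3))  # 0.467
```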
Kendall’s coefficient of concordance
Associated with inter-rater reliability; used to measure agreement between three or more raters when scores are reported as ranks
Factors that Affect Reliability (4)
Test length: longer tests are more reliable
Range of scores: an unrestricted (heterogeneous) range of scores increases the size of the reliability coefficient
Content of the test: the more homogeneous the content, the more reliable the test
Likelihood that items can be answered correctly by guessing: the lower the likelihood of guessing correctly, the more reliable the test
Confidence Interval
- indicates the range within which an examinee’s true score is likely to fall given his/her obtained score
- derived using the standard error of measurement (SEM)
Standard Error of Measurement (SEM)
o used to obtain a confidence interval around obtained test score
• 68% confidence interval: one SEM is added to and subtracted from the obtained score
• 95% confidence interval: two SEMs are added to and subtracted from the obtained score
• 99% confidence interval: three SEMs are added to and subtracted from the obtained score
SEM = standard deviation of test scores × √(1 − rxx)
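Putting the SEM formula and the confidence-interval rules together in a short Python sketch (the SD, reliability, and obtained score are hypothetical):

```python
# SEM = SD * sqrt(1 - rxx), then CI = obtained score +/- z SEMs.
sd = 15       # standard deviation of test scores (hypothetical)
rxx = 0.91    # reliability coefficient (hypothetical)
score = 100   # examinee's obtained score (hypothetical)

sem = sd * (1 - rxx) ** 0.5
print(round(sem, 2))  # 4.5

# 68% CI uses 1 SEM, 95% uses 2 SEMs, 99% uses 3 SEMs
for z, level in [(1, 68), (2, 95), (3, 99)]:
    low, high = score - z * sem, score + z * sem
    print(f"{level}% CI: {round(low, 1)} to {round(high, 1)}")
```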
Consensual observer drift
Associated with inter-rater reliability; occurs when two or more observers working together influence each other's ratings on a behavioral rating scale so that they assign ratings in a similarly idiosyncratic way