Test Construction Flashcards
Item difficulty
- Measured using an item difficulty index (p) that ranges from 0 to 1
- Equation: p = number of examinees answering the item correctly divided by the total number of examinees
- Optimal difficulty level depends on likelihood of answering correctly by chance, the goal of the testing, etc.
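A minimal sketch of the difficulty index in Python, assuming dichotomously scored (0/1) responses; the function and variable names are illustrative:

```python
# Item difficulty index p: proportion of examinees answering the item correctly.
def item_difficulty(responses):  # responses: 0/1 per examinee, 1 = correct
    return sum(responses) / len(responses)

print(item_difficulty([1, 1, 0, 1, 1, 0, 1, 1, 0, 1]))  # 7/10 correct -> p = 0.7
```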
Item discrimination
- The extent to which an item differentiates between examinees who obtain high versus low scores on the entire test
- D = U - L, where U and L are the proportions of the upper- and lower-scoring groups who answered the item correctly
- Ranges from -1 to +1
- Items with a discrimination index of .35 or higher are typically considered acceptable
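A sketch of the discrimination index using the common top-and-bottom-27% convention; the cutoff fraction and all names here are illustrative:

```python
# D = proportion correct in the upper group minus proportion correct in the
# lower group, with groups defined by total test score.
def item_discrimination(item_scores, total_scores, fraction=0.27):
    n = len(total_scores)
    k = max(1, round(n * fraction))                          # group size
    order = sorted(range(n), key=lambda i: total_scores[i])  # rank by total score
    lower, upper = order[:k], order[-k:]
    p_upper = sum(item_scores[i] for i in upper) / k
    p_lower = sum(item_scores[i] for i in lower) / k
    return p_upper - p_lower  # ranges from -1 to +1
```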
Item characteristic curve
- Constructed for each item
- Plots the proportion of examinees in the sample who answered the item correctly against total test score, performance on an external criterion, or an estimate of the latent ability or trait measured by the item
Item response theory
Item and ability parameters are sample invariant
Classical test theory
Uses 2 methods of item analysis: item difficulty and item discrimination
Limitations of CTT
- Item and test parameters are sample dependent
- Difficult to equate scores obtained on different tests or test forms
Item’s level of difficulty
Ability level at which 50% of the examinees provide a correct response
Item’s ability to discriminate
Indicated by the slope of the curve
The steeper the slope, the greater the discrimination
Probability of guessing correctly
Indicated by the point at which the ICC intercepts the vertical axis
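The three curve features above map onto the parameters of a three-parameter logistic (3PL) IRT model. A sketch with illustrative parameter values (note: with a nonzero guessing parameter c, the difficulty b marks the ability where the probability is halfway between c and 1; the plain 50% reading holds when c = 0):

```python
import math

# 3PL item characteristic curve: P(theta) = c + (1 - c) / (1 + exp(-a * (theta - b)))
# a = discrimination (slope), b = difficulty (location), c = guessing (lower asymptote)
def icc(theta, a=1.0, b=0.0, c=0.20):
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

for theta in (-3, -1, 0, 1, 3):          # ability levels on a z-score-like scale
    print(theta, round(icc(theta), 3))   # probability of a correct response
```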
Test score in Classical Test Theory
X = T + E
T = True score component
E = Error component (measurement error)
Reliability coefficient
Ranges from 0 to 1
Correlation coefficient
Unlike most correlation coefficients, the reliability coefficient is interpreted directly and is never squared
Ex. a reliability coefficient of .89 means that 89% of variability in obtained scores is true score variability
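A quick simulation sketch of these two ideas, X = T + E and reliability as the proportion of observed-score variance that is true-score variability; all numbers are illustrative:

```python
import random

# Simulate X = T + E and check that true-score variance / observed-score
# variance matches the intended reliability.
random.seed(0)

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

n = 100_000
true_scores = [random.gauss(100, 15) for _ in range(n)]     # T: variance = 225
errors = [random.gauss(0, 5) for _ in range(n)]             # E: variance = 25
observed = [t + e for t, e in zip(true_scores, errors)]     # X = T + E

# Expected reliability: 225 / (225 + 25) = .90
print(variance(true_scores) / variance(observed))  # ~0.90
```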
Test-retest reliability
Same test to same group of examinees on two different occasions
Yields a coefficient of stability, indicating the consistency of scores over time
May be impacted by a PRACTICE EFFECT
Alternate forms reliability
Two forms of the same test are administered at the same time point
Coefficient of equivalence
May be impacted by CONTENT SAMPLING
Good for speeded tests
Split-half reliability
Scores on two halves of the test are correlated (e.g., odd versus even numbered items)
Usually underestimates a test’s true reliability because each half is only half the full test’s length, and reliability increases with test length
Spearman-Brown Prophecy Formula
Used to correct the split-half reliability coefficient
Estimates the reliability coefficient for a test of a given length (e.g., the full-length test from a split-half coefficient)
Tends to overestimate a test’s true reliability
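A sketch of the prophecy formula, r_new = (n * r) / (1 + (n - 1) * r), where n is the factor by which test length changes; n = 2 corrects a split-half coefficient up to full length:

```python
# Spearman-Brown prophecy formula. n is the length factor: n = 2 doubles
# the (half) test; n = 1.5 lengthens a test by 50%.
def spearman_brown(r, n=2.0):
    return (n * r) / (1 + (n - 1) * r)

print(spearman_brown(0.75))         # split-half r of .75 -> full-length ~.857
print(spearman_brown(0.70, n=1.5))  # reliability if the test is lengthened by 50% -> ~.778
```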
Cronbach’s coefficient Alpha
Calculates the average reliability from all possible splits of the test
This is a conservative measurement of reliability
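A sketch of the usual computational form, alpha = (k / (k - 1)) * (1 - sum of item variances / total-score variance), assuming an examinee-by-item score matrix; names and data are illustrative:

```python
def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def cronbach_alpha(scores):  # scores[examinee][item]
    k = len(scores[0])       # number of items
    item_vars = [variance([row[i] for row in scores]) for i in range(k)]
    total_var = variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Example: 4 examinees x 3 items
print(cronbach_alpha([[2, 3, 3], [4, 4, 5], [1, 2, 2], [3, 3, 4]]))
```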
Kuder-Richardson Formula 20
Used when test items are scored dichotomously
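KR-20 is the special case of coefficient alpha for 0/1 items, where each item’s variance reduces to p * q (p = proportion correct, q = 1 - p). A sketch with illustrative data:

```python
def kr20(scores):  # scores[examinee][item], each entry 0 or 1
    n, k = len(scores), len(scores[0])
    p = [sum(row[i] for row in scores) / n for i in range(k)]  # item difficulties
    pq_sum = sum(pi * (1 - pi) for pi in p)                    # sum of item variances
    totals = [sum(row) for row in scores]
    mean = sum(totals) / n
    total_var = sum((t - mean) ** 2 for t in totals) / n
    return (k / (k - 1)) * (1 - pq_sum / total_var)

print(kr20([[1, 1, 0], [1, 0, 0], [1, 1, 1], [0, 0, 0]]))  # 0.75
```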
Kappa statistic/Cohen’s Kappa
Used when scores or ratings represent a nominal or ordinal scale of measurement
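A sketch of kappa for two raters, kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and p_e is the agreement expected by chance from each rater’s marginal proportions; data are illustrative:

```python
from collections import Counter

def cohens_kappa(rater1, rater2):
    n = len(rater1)
    p_o = sum(a == b for a, b in zip(rater1, rater2)) / n  # observed agreement
    c1, c2 = Counter(rater1), Counter(rater2)
    # chance agreement from the raters' marginal category proportions
    p_e = sum((c1[cat] / n) * (c2[cat] / n) for cat in set(c1) | set(c2))
    return (p_o - p_e) / (1 - p_e)

print(cohens_kappa(list("AABBC"), list("AABCC")))  # ~0.71
```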
Test length
The longer the test, the smaller the relative effect of measurement error and the larger the reliability coefficient
The Spearman-Brown prophecy formula can also be used to estimate the impact of lengthening (or shortening) a test; see the sketch above
Range of scores
The reliability coefficient is maximized when the range of scores is unrestricted
Standard error of measurement
Index of the amount of error that can be expected in obtained scores due to the unreliability of the test
SEM = SD × sqrt(1 - rxx), where SD is the standard deviation of test scores and rxx is the reliability coefficient
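A sketch of the SEM and its typical use, building a confidence interval around an obtained score (roughly 68% confidence within ±1 SEM, roughly 95% within ±1.96 SEM, assuming normally distributed error); values are illustrative:

```python
import math

# SEM = SD * sqrt(1 - reliability)
def sem(sd, reliability):
    return sd * math.sqrt(1 - reliability)

s = sem(15, 0.91)                      # SD = 15, rxx = .91 -> SEM = 4.5
print(s)
print(100 - 1.96 * s, 100 + 1.96 * s)  # ~95% CI around an obtained score of 100
```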
Internal consistency reliability
How well items within a test correlate with other items on the same test
This includes split-half reliability and Cronbach’s coefficient alpha
Content sampling
Impacts split-half reliability and coefficient alpha
Inter-rater reliability
Measured using the kappa statistic or percent agreement (see the kappa sketch above)