Test Construction Flashcards
Refers to the extent to which a test measures the hypothetical trait (construct) it is intended to measure. Methods for establishing construct validity include correlating test scores with scores on measures that do and do not measure the same trait (convergent and discriminant validity), conducting a factor analysis to assess the test’s factorial validity, determining whether changes in test scores reflect expected developmental changes, and determining whether experimental manipulations have the expected impact on test scores.
Construct Validity
The extent to which a test adequately samples the domain of information, knowledge, or skill that it purports to measure. Determined primarily by “expert judgment.” Most important for achievement and job sample tests.
Content Validity
Refers to bias introduced into a person’s criterion score as a result of the knowledge of the scorer about his/her performance on the predictor. Tends to artificially inflate the relationship between the predictor and criterion.
Criterion Contamination
Interpretation of a test score in terms of a prespecified standard; i.e., in terms of percent of content correct (percentage score) or of predicted performance on an external criterion (e.g., regression equation, expectancy table).
Criterion-Referenced Interpretation
The type of validity that involves determining the relationship (correlation) between the predictor and the criterion. The correlation coefficient is referred to as the criterion-related validity coefficient. Criterion-related validity can be either concurrent (predictor and criterion scores obtained at about the same time) or predictive (predictor scores obtained before criterion scores).
Criterion-Related Validity/Concurrent And Predictive
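As a minimal sketch of the card above, the criterion-related validity coefficient can be computed as a Pearson correlation between predictor and criterion scores. The function name `pearson_r` and the scores below are hypothetical and chosen only for illustration.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares of deviations from the means
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: selection test scores (predictor) and
# supervisor ratings (criterion) for the same six people
predictor = [52, 61, 70, 75, 83, 90]
criterion = [2.1, 2.8, 3.0, 3.6, 3.9, 4.5]

validity_coefficient = pearson_r(predictor, criterion)
print(round(validity_coefficient, 2))
```

Whether the coefficient counts as concurrent or predictive depends only on when the criterion scores were collected, not on the calculation.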
Process of re-assessing a test’s criterion-related validity on a new sample to check the generalizability of the original validity coefficient. Ordinarily, the validity coefficient “shrinks” (becomes smaller) on cross-validation because the chance factors operating in the original sample are not all present in the cross-validation sample.
Cross-Validation And Shrinkage
A multivariate statistical technique used to determine how many factors (constructs) are needed to account for the intercorrelations among a set of tests, subtests, or test items. Factor analysis can be used to assess a test’s construct validity by indicating the extent to which the test correlates with factors that it would and would not be expected to correlate with. From the perspective of factor analysis, true score variability consists of communality and specificity. Factors identified in a factor analysis can be either orthogonal or oblique.
Factor Analysis
In a factor matrix, a factor loading is the correlation between a test (or other variable included in the analysis) and a factor. Squaring the loading gives the proportion of variability in the test that is accounted for by the factor. The communality is the total proportion of variability in scores on the test that is accounted for by the factor analysis, i.e., by all of the identified factors.
Factor Loadings and Communality
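The arithmetic on the card above can be sketched as follows. The sketch assumes orthogonal (uncorrelated) factors, in which case the communality equals the sum of the squared loadings. The loadings are made-up values and the function names are hypothetical.

```python
def variance_explained(loading):
    """Proportion of a test's variability accounted for by one factor."""
    return loading ** 2

def communality(loadings):
    """Sum of squared loadings across factors (assumes orthogonal factors)."""
    return sum(variance_explained(l) for l in loadings)

# Hypothetical loadings of one test on Factor I and Factor II
loadings = [0.70, 0.20]

print(round(variance_explained(loadings[0]), 2))  # variability explained by Factor I
print(round(communality(loadings), 2))            # variability explained by both factors
```

With oblique (correlated) factors the squared loadings overlap, so this simple sum would not apply.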
The extent to which a predictor increases decision-making accuracy. Calculated by subtracting the base rate from the positive hit rate. Related terms are predictor and criterion cutoff scores, true and false positives, and true and false negatives. True positives scored high on both the predictor and the criterion; false positives scored high on the predictor but low on the criterion; true negatives scored low on both the predictor and the criterion; and false negatives scored low on the predictor but high on the criterion.
Incremental Validity/True Positives, False Positives, True Negatives, False Negatives
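The calculation can be sketched from the four cell counts. The card does not define the two rates, so this sketch assumes the usual definitions: the base rate is the proportion of all people who are successful on the criterion (true positives plus false negatives), and the positive hit rate is the proportion of people selected by the predictor who are successful (true positives divided by all positives). The counts are hypothetical.

```python
def base_rate(tp, fp, tn, fn):
    """Proportion successful on the criterion without using the predictor."""
    return (tp + fn) / (tp + fp + tn + fn)

def positive_hit_rate(tp, fp):
    """Proportion of those scoring high on the predictor who are successful."""
    return tp / (tp + fp)

def incremental_validity(tp, fp, tn, fn):
    """Positive hit rate minus base rate."""
    return positive_hit_rate(tp, fp) - base_rate(tp, fp, tn, fn)

# Hypothetical counts for 100 people
tp, fp, tn, fn = 30, 10, 40, 20

print(round(base_rate(tp, fp, tn, fn), 2))
print(round(positive_hit_rate(tp, fp), 2))
print(round(incremental_validity(tp, fp, tn, fn), 2))
```

A positive result means that selecting with the predictor yields a higher proportion of successful people than selecting without it.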
When using item response theory, an item characteristic curve (ICC) is constructed for each item by plotting the proportion of examinees in the tryout sample who answered the item correctly against either the total test score, performance on an external criterion, or a mathematically-derived estimate of a latent ability or trait. The curve provides information on the relationship between an examinee’s level on the ability or trait measured by the test and the probability that he/she will respond to the item correctly.
Item Characteristic Curve
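One way to picture a curve of this kind is with a logistic function. The sketch below uses a two-parameter logistic model, which is an assumption here and not something the card specifies. In this sketch `theta` is the examinee's trait level, `b` is the point on the trait scale where the probability of a correct response is .50, and `a` controls how steep the curve is. All names and values are hypothetical.

```python
import math

def icc_2pl(theta, a, b):
    """Probability of a correct response under a two-parameter logistic model."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Probability of a correct response at several trait levels
# for one hypothetical item (a = 1.5, b = 0.0)
for theta in (-2, -1, 0, 1, 2):
    print(theta, round(icc_2pl(theta, a=1.5, b=0.0), 2))
```

The probability rises as the trait level rises, which is the relationship the curve is meant to display.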
An item’s difficulty level is calculated by dividing the number of individuals who answered the item correctly by the total number of individuals. The index ranges in value from 0 (very difficult item) to 1.0 (very easy item). In general, an item difficulty index of .50 is preferred because it maximizes differentiation between individuals with high and low ability and helps ensure a high reliability coefficient.
Item Difficulty
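The division described on the card can be sketched in a few lines. Responses are coded 1 for correct and 0 for incorrect; the data and the function name are hypothetical.

```python
def item_difficulty(responses):
    """Proportion of examinees who answered the item correctly (1 = correct)."""
    return sum(responses) / len(responses)

# Hypothetical responses of ten examinees to one item
responses = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]

print(item_difficulty(responses))
```

Because a higher value means more examinees answered correctly, a higher index indicates an easier item.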
Refers to the extent to which a test item discriminates (differentiates) between examinees who obtain high versus low scores on the entire test or on an external criterion. The item discrimination index (D) ranges from -1.0 to +1.0. If all examinees in the upper group and none in the lower group answered the item correctly, D is +1.0; if none of the examinees in the upper group and all examinees in the lower group answered the item correctly, D is -1.0.
Item Discrimination
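A minimal sketch of D, assuming it is computed as the proportion of the upper-scoring group who answered the item correctly minus the proportion of the lower-scoring group who did so. That formula is consistent with the +1.0 and -1.0 cases on the card. The groups and function names are hypothetical.

```python
def proportion_correct(responses):
    """Proportion of a group answering the item correctly (1 = correct)."""
    return sum(responses) / len(responses)

def item_discrimination(upper_group, lower_group):
    """D = proportion correct in upper group minus proportion correct in lower group."""
    return proportion_correct(upper_group) - proportion_correct(lower_group)

# Hypothetical item responses for the highest and lowest scorers on the test
upper = [1, 1, 1, 0, 1]
lower = [0, 1, 0, 0, 0]

print(round(item_discrimination(upper, lower), 2))
```

A negative D would mean that low scorers did better on the item than high scorers.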
A coefficient of agreement, corrected for chance agreement, used to assess inter-rater reliability when ratings are made on a nominal (categorical) scale.
Kappa Statistic
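As an illustration, the sketch below computes Cohen's kappa for two raters, one common form of the statistic; the card does not say which variant is meant, so that choice is an assumption. Here `p_o` is the observed proportion of agreement and `p_e` is the agreement expected by chance from each rater's category proportions. The ratings are hypothetical.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters assigning categories to the same cases."""
    n = len(rater_a)
    # Observed agreement: proportion of cases where the raters match
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal proportions per category
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    categories = set(count_a) | set(count_b)
    p_e = sum(count_a[c] * count_b[c] for c in categories) / n ** 2
    return (p_o - p_e) / (1 - p_e)

# Hypothetical yes/no judgments by two raters on ten cases
rater_a = ["y", "y", "n", "n", "y", "n", "y", "n", "y", "y"]
rater_b = ["y", "y", "n", "y", "y", "n", "n", "n", "y", "y"]

print(round(cohens_kappa(rater_a, rater_b), 2))
```

In this example the raters agree on 8 of 10 cases, but kappa is lower than .80 because some of that agreement would be expected by chance.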
A systematic way to organize the correlation coefficients obtained when assessing a measure’s convergent and discriminant validity (which, in turn, provides evidence of construct validity). Requires measuring at least two different traits using at least two different methods for each trait. Related terms are the monotrait-monomethod, monotrait-heteromethod, heterotrait-monomethod, and heterotrait-heteromethod coefficients.
Multitrait-Multimethod Matrix
Interpretation of an examinee’s test performance relative to the performance of examinees in a normative (standardization) sample. Percentile ranks and standard scores (e.g., z-scores and T scores) are types of norm-referenced scores.
Norm-Referenced Interpretation
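The score types named on the card can be sketched as follows. The sketch assumes the usual conventions: a z-score is the distance from the normative mean in standard deviation units, a T score rescales z to a mean of 50 and a standard deviation of 10, and the percentile rank is taken here as the percentage of the normative sample scoring below the examinee (other conventions exist). The normative sample is hypothetical.

```python
from statistics import mean, pstdev

def z_score(x, norm_sample):
    """Distance of x from the normative mean in standard deviation units."""
    return (x - mean(norm_sample)) / pstdev(norm_sample)

def t_score(x, norm_sample):
    """T score: z-score rescaled to mean 50 and standard deviation 10."""
    return 50 + 10 * z_score(x, norm_sample)

def percentile_rank(x, norm_sample):
    """Percentage of the normative sample scoring below x (one convention)."""
    return 100 * sum(s < x for s in norm_sample) / len(norm_sample)

# Hypothetical normative (standardization) sample of raw scores
norm_sample = [40, 45, 50, 55, 60]

print(round(z_score(60, norm_sample), 2))
print(round(t_score(60, norm_sample), 1))
print(percentile_rank(60, norm_sample))
```

All three scores describe the same raw score relative to the normative sample; none of them says how much of the content the examinee answered correctly.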