Test Construction Flashcards
Alternate Forms Reliability
Alternate forms reliability is evaluated by administering two forms of the test to the same group of examinees and correlating the two sets of scores. When the alternate forms are administered at about the same time, this method produces a coefficient of equivalence; when the forms are administered at different times, it produces a coefficient of equivalence and stability. It is considered by some experts to be the best (most thorough) method for assessing reliability.
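As a minimal sketch of the computation (the scores below are hypothetical), the coefficient of equivalence is simply the Pearson correlation between examinees' scores on the two forms:

```python
import numpy as np

# Hypothetical scores for the same five examinees on two alternate forms
form_a = np.array([82, 75, 90, 68, 88])
form_b = np.array([80, 78, 92, 65, 85])

# With the forms administered at about the same time, the correlation
# between the two sets of scores is the coefficient of equivalence
coefficient_of_equivalence = np.corrcoef(form_a, form_b)[0, 1]
print(round(coefficient_of_equivalence, 2))
```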
Classical Test Theory
Classical test theory describes observed variability in test scores as consisting of two components: true differences between examinees on the attribute(s) measured by the test and the effects of measurement (random) error. Reliability is the proportion of observed score variability that reflects true score variability.
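A brief simulation, using entirely made-up parameter values, illustrates the decomposition of observed scores into true score plus error and reliability as the proportion of observed score variance due to true score variance:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical true scores and random measurement error for 1,000 examinees
true_scores = rng.normal(loc=100, scale=15, size=1000)
error = rng.normal(loc=0, scale=5, size=1000)

# Classical test theory: observed score = true score + random error
observed = true_scores + error

# Reliability = true score variance / observed score variance
reliability = true_scores.var() / observed.var()
print(round(reliability, 2))  # close to 15**2 / (15**2 + 5**2) = 0.90
```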
Coefficient Alpha/KR-20
Coefficient alpha and the Kuder-Richardson Formula 20 (KR-20) are used to assess internal consistency reliability and provide an index of average inter-item consistency. KR-20 can be used as a substitute for coefficient alpha when test items are scored dichotomously.
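A sketch of the coefficient alpha calculation, assuming a small hypothetical examinees-by-items score matrix (because the items here are scored dichotomously, the same calculation yields KR-20):

```python
import numpy as np

def coefficient_alpha(item_scores):
    """Cronbach's alpha for an examinees-by-items score matrix."""
    item_scores = np.asarray(item_scores, dtype=float)
    k = item_scores.shape[1]                               # number of items
    item_variances = item_scores.var(axis=0, ddof=1)       # variance of each item
    total_variance = item_scores.sum(axis=1).var(ddof=1)   # variance of total scores
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Hypothetical dichotomously scored items for five examinees
scores = np.array([[1, 1, 1, 0],
                   [1, 0, 1, 1],
                   [0, 0, 1, 0],
                   [1, 1, 1, 1],
                   [0, 1, 0, 0]])
print(round(coefficient_alpha(scores), 2))
```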
Construct Validity
Construct validity refers to the extent to which a test measures the hypothetical trait (construct) it is intended to measure. Methods for establishing construct validity include correlating test scores with scores on measures that do and do not measure the same trait (convergent and discriminant validity), conducting a factor analysis to assess the test's factorial validity, determining whether changes in test scores reflect expected developmental changes, and determining whether experimental manipulations have the expected impact on test scores.
Content Validity
Content validity refers to the extent to which a test adequately samples the domain of information, knowledge, or skill that it purports to measure. It is determined primarily by the judgment of subject matter experts and is important, for example, for achievement tests and job sample tests.
Convergent and discriminant validity
Convergent and discriminant validity are types of construct validity: When a test correlates highly with measures of the same and related constructs, this provides evidence of the test's convergent validity. When a test has low correlations with measures of unrelated constructs, this provides evidence of its discriminant validity.
Criterion contamination
Criterion contamination refers to bias introduced into a person's criterion rating as a result of the rater's knowledge about the person's performance on the predictor. It tends to artificially inflate the correlation between scores on the predictor and criterion.
Criterion-referenced interpretations
Criterion-referenced interpretation involves interpreting a test score in terms of a prespecified standard, either as the percentage of content answered correctly (a percentage score) or by using a regression equation or expectancy table to predict performance on an external criterion based on the examinee's score or status on a predictor.
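As an illustration of the regression-equation approach (all scores below are hypothetical), a criterion score can be predicted from a new examinee's predictor score:

```python
import numpy as np

# Hypothetical predictor and criterion scores from a norming sample
predictor = np.array([45, 52, 38, 60, 50, 41, 57])
criterion = np.array([70, 78, 62, 88, 75, 66, 84])

# Fit a simple regression equation: predicted criterion = a + b * predictor
b, a = np.polyfit(predictor, criterion, 1)

# Predict criterion performance for a new examinee with a predictor score of 55
print(round(a + b * 55, 1))
```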
Criterion-related validity/concurrent and predictive
Criterion-related validity is important when predictor scores will be used to predict or estimate scores on a criterion (e.g., when selection test scores will be used to predict or estimate job performance ratings). It is evaluated by administering the predictor and the criterion to a sample and correlating the two sets of scores to obtain a criterion-related validity coefficient. Criterion-related validity can be either concurrent (predictor and criterion scores are obtained at about the same time) or predictive (predictor scores are obtained before criterion scores).
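A minimal sketch, using hypothetical selection test scores and job performance ratings, of how the criterion-related validity coefficient is obtained:

```python
import numpy as np

# Hypothetical selection test scores and later job performance ratings
# for the same group of employees
selection_test = np.array([72, 85, 60, 90, 78, 65, 88, 70])
job_performance = np.array([3.1, 4.2, 2.8, 4.5, 3.6, 3.0, 4.0, 3.3])

# The criterion-related validity coefficient is their correlation
validity_coefficient = np.corrcoef(selection_test, job_performance)[0, 1]
print(round(validity_coefficient, 2))
```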
Cross-validation and shrinkage
Cross-validation is the process of re-assessing a test's criterion-related validity on a new sample to check the generalizability of the original validity coefficient. Ordinarily, the validity coefficient "shrinks" (becomes smaller) on cross-validation because the chance factors operating in the original sample are not all present in the cross-validation sample.
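A small simulation (with made-up data and arbitrary sample sizes) can illustrate shrinkage: regression weights derived on the original sample capitalize on chance, so the validity coefficient they produce in a new sample is usually smaller:

```python
import numpy as np

rng = np.random.default_rng(1)

def multiple_r(X, y, weights):
    """Correlation between actual criterion scores and scores predicted from the weights."""
    return np.corrcoef(X @ weights, y)[0, 1]

# Made-up data: five predictors only weakly related to the criterion, small samples
n, k = 25, 5
X_orig, X_new = rng.normal(size=(n, k)), rng.normal(size=(n, k))
y_orig = 0.3 * X_orig[:, 0] + rng.normal(size=n)
y_new = 0.3 * X_new[:, 0] + rng.normal(size=n)

# Least-squares weights derived on the original sample capitalize on chance factors
weights = np.linalg.lstsq(X_orig, y_orig, rcond=None)[0]

print(round(multiple_r(X_orig, y_orig, weights), 2))  # validity in the original sample
print(round(multiple_r(X_new, y_new, weights), 2))    # usually smaller: shrinkage
```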
Factor analysis
Factor analysis is a multivariate statistical technique used to determine how many factors are needed to account for the intercorrelations among a set of tests, subtests, or test items. It can be used to assess a test's construct validity by indicating the extent to which the test correlates with factors that it would and would not be expected to correlate with. Factors identified in a factor analysis can be either orthogonal (uncorrelated) or oblique (correlated).
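An illustrative sketch using scikit-learn's FactorAnalysis on simulated subtest scores; the data and the two-factor structure are assumptions made for the example, not part of any particular test:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(2)

# Simulated scores on six subtests: the first three share one latent factor,
# the last three share another
n = 200
factor_1 = rng.normal(size=n)
factor_2 = rng.normal(size=n)
subtests = np.column_stack(
    [factor_1 + rng.normal(scale=0.5, size=n) for _ in range(3)] +
    [factor_2 + rng.normal(scale=0.5, size=n) for _ in range(3)]
)

# Extract two factors; components_ holds each subtest's loading on each factor
fa = FactorAnalysis(n_components=2).fit(subtests)
print(np.round(fa.components_, 2))
```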
Factor loadings and communality
In a factor matrix, a factor loading is the correlation between a test (or other variable included in the analysis) and a factor and can be squared to determine the amount of variability in the test that is accounted for by the factor. The communality is the total amount of variability in scores on the test that is accounted for by the factor analysis (i.e., by all of the identified factors).
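A worked example with hypothetical loadings on two orthogonal factors:

```python
# Hypothetical loadings of a single test on two orthogonal factors
loading_on_factor_1 = 0.60
loading_on_factor_2 = 0.50

# A squared loading is the proportion of test variance accounted for by that factor
variance_from_factor_1 = loading_on_factor_1 ** 2   # .36
variance_from_factor_2 = loading_on_factor_2 ** 2   # .25

# The communality is the total variance accounted for by all identified factors
communality = variance_from_factor_1 + variance_from_factor_2
print(round(communality, 2))  # .61
```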
Incremental validity/true positives, false positives, true negatives, false negative
Incremental validity refers to the extent to which a predictor increases decision-making accuracy. It is calculated by subtracting the base rate from the positive hit rate. Terms linked with incremental validity are predictor and criterion cutoff scores, true and false positives, and true and false negatives; a worked example follows the definitions below.
True positives are people who scored high on the predictor (IV) and criterion (DV)
False positives scored high on the predictor (IV) but low on the criterion (DV)
True negatives scored low on the predictor (IV) and the criterion (DV)
False negatives scored low on the predictor (IV) but high on the criterion (DV)
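Carrying the counts through (all numbers below are hypothetical):

```python
# Hypothetical decision counts from crossing a predictor cutoff with a criterion cutoff
true_positives = 30    # high on predictor, high on criterion
false_positives = 10   # high on predictor, low on criterion
true_negatives = 40    # low on predictor, low on criterion
false_negatives = 20   # low on predictor, high on criterion

total = true_positives + false_positives + true_negatives + false_negatives

# Base rate: proportion of successes when everyone is selected without the predictor
base_rate = (true_positives + false_negatives) / total                    # 50/100 = .50

# Positive hit rate: proportion of successes among those selected by the predictor
positive_hit_rate = true_positives / (true_positives + false_positives)   # 30/40 = .75

# Incremental validity = positive hit rate - base rate
print(positive_hit_rate - base_rate)  # .25
```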
Item Characteristic Curve
When using item response theory to construct a test, an item characteristic curve (ICC) is derived for each item by plotting the proportion of examinees in the tryout sample who answered the item correctly against the total test score, performance on an external criterion, or a mathematically derived estimate of a latent ability or trait.
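As an illustration only, one common IRT model (the two-parameter logistic) expresses an ICC as the probability of a correct response given the latent trait; the discrimination and difficulty values below are arbitrary:

```python
import numpy as np

# Two-parameter logistic ICC: probability of a correct response as a
# function of the latent trait theta; a = discrimination, b = difficulty
def icc(theta, a=1.2, b=0.0):
    return 1 / (1 + np.exp(-a * (theta - b)))

# The probability of a correct answer rises with the level of the latent trait
for theta in (-2, -1, 0, 1, 2):
    print(theta, round(icc(theta), 2))
```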
Item Difficulty
An item's difficulty level is calculated by dividing the number of individuals in the tryout sample who answered the item correctly by the total number of individuals. The item difficulty index (p) ranges in value from 0 (very difficult item) to 1.0 (very easy item). In general, an index of .50 is preferred because it maximizes differentiation between individuals with high and low ability and helps ensure a high reliability coefficient.
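A worked example with hypothetical tryout data:

```python
# Hypothetical tryout data: 60 of 80 examinees answered the item correctly
number_correct = 60
number_of_examinees = 80

# Item difficulty index p = proportion of examinees answering the item correctly
p = number_correct / number_of_examinees
print(p)  # 0.75, a relatively easy item; p = .50 maximizes differentiation
```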