Test Construction Flashcards

1
Q

Concurrent validity

A

Concurrent validity is a type of criterion-related validity and is of concern when the purpose of the test is to estimate an individual’s current status on some external criterion. Screening tests are typically used to estimate the results of a more thorough evaluation (the criterion). For example, an educational psychologist who designs a screening test to identify underachieving first- and second-grade children with a learning disability should be concerned with this type of validity, since the criterion in this case is a more thorough diagnostic procedure.

2
Q

predictive validity

A

Predictive validity is a type of criterion-related validity and is of concern when the purpose of the test is to estimate or predict an examinee’s future status on some external criterion.

3
Q

content validity

A

Content validity is of most concern when test items are expected to be a representative sample of a particular content domain.

4
Q

ALTERNATE FORMS RELIABILITY

A

Alternative forms reliability is evaluated by administering two forms of the test to the same group of examinees at the same time and correlating the two sets of scores. This form of reliability produces a coefficient of equivalence. It is considered by some experts to be the best (most thorough) method for assessing reliability.

5
Q

CLASSICAL TEST THEORY

A

Classical test theory describes observed variability in test scores as consisting of two components: true differences between examinees on the attribute(s) measured by the test and the effects of measurement (random) error. Reliability is a measure of true score variability. This is summarized by the formula X=T+E, which indicates that an examinee’s observed score on a test is equal to their true score plus error.
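As a quick illustration of X = T + E (a hypothetical simulation, not part of the original card), reliability can be viewed as the proportion of observed score variance that is true score variance:

import numpy as np

rng = np.random.default_rng(0)
true_scores = rng.normal(100, 15, size=1000)   # T: examinees' true scores
error = rng.normal(0, 5, size=1000)            # E: random measurement error
observed = true_scores + error                 # X = T + E

# Reliability = true score variance / observed score variance
print(round(true_scores.var() / observed.var(), 2))   # close to 225 / 250 = .90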

6
Q

COEFFICIENT ALPHA/KR-20:

A

Coefficient alpha and the Kuder-Richardson Formula 20 (KR-20) are used to assess internal consistency reliability and provide an index of average inter-item consistency. KR-20 can be used as a substitute for coefficient alpha when test items are scored dichotomously.
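A minimal sketch (hypothetical item scores) of computing coefficient alpha from an examinee-by-item score matrix; with 0/1 items, the same calculation yields KR-20:

import numpy as np

def cronbach_alpha(scores):
    # scores: rows = examinees, columns = items
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                          # number of items
    item_vars = scores.var(axis=0, ddof=1)       # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical dichotomously scored items, so alpha here equals KR-20
items = [[1, 1, 1, 0], [1, 0, 1, 1], [0, 0, 1, 0], [1, 1, 1, 1], [0, 1, 0, 0]]
print(round(cronbach_alpha(items), 2))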

7
Q

CONSTRUCT VALIDITY:

A

Construct validity refers to the extent to which a test “behaves” in the expected way in relation to other variables. Methods for establishing construct validity include correlating test scores with scores on measures that do and do not measure the same trait (convergent and discriminant validity), conducting a factor analysis to assess the test’s factor structure, determining if changes in test scores reflect expected developmental changes, and seeing if experimental manipulations have the expected impact on test scores.

8
Q

CONTENT VALIDITY:

A

Content validity refers to the extent to which a test adequately samples the content domain of the construct it purports to measure. It is determined primarily by the judgment of subject matter experts.

9
Q

CONVERGENT AND DISCRIMINANT VALIDITY:

A

Convergent and discriminant validity are types of construct validity: When a test correlates highly with measures of the same and related constructs, this provides evidence of the test’s convergent validity. When a test has low correlations with measures of unrelated constructs, this provides evidence of its discriminant validity.

10
Q

CRITERION CONTAMINATION:

A

Criterion contamination refers to bias introduced into a person’s criterion rating as a result of the rater’s knowledge about the person’s performance on the predictor. It tends to artificially inflate the correlation between scores on the predictor and criterion.

11
Q

CRITERION-REFERENCED INTERPRETATION:

A

Criterion-referenced interpretation involves interpreting a test score in terms of a prespecified standard: for example, in terms of the percent of content answered correctly (a percentage score), or by using a regression equation or expectancy table to predict performance on an external criterion from the examinee’s score on the predictor.

12
Q

CRITERION-RELATED VALIDITY/CONCURRENT AND PREDICTIVE VALIDITY:

A

Criterion-related validity measures how a test correlates with, or predicts, scores on some external criterion. It is evaluated by administering the predictor and criterion to a sample and correlating their scores to obtain a criterion-related validity coefficient. Criterion-related validity can be either concurrent (predictor and criterion scores obtained at about the same time) or predictive (predictor scores obtained before criterion scores).
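For example (hypothetical scores), the criterion-related validity coefficient is simply the correlation between predictor and criterion scores for the same sample:

import numpy as np

predictor = np.array([10, 12, 15, 9, 14, 11, 13])    # e.g., screening test scores
criterion = np.array([55, 60, 72, 50, 68, 58, 63])   # e.g., scores on the external criterion
print(round(np.corrcoef(predictor, criterion)[0, 1], 2))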

13
Q

CROSS-VALIDATION AND SHRINKAGE:

A

Cross-validation is the process of re-assessing a test’s criterion-related validity on a new sample to check the generalizability of the original validity coefficient. Ordinarily, the validity coefficient “shrinks” (becomes smaller) on cross-validation because the chance factors operating in the original sample are not all present in the cross-validation sample.

14
Q

FACTOR ANALYSIS:

A

Factor analysis is a multivariate statistical technique used to determine how many factors are needed to account for the intercorrelations among a set of test items or subtests. It can be used to assess a test’s construct validity by indicating the extent to which the test conforms to the expected factor structure. Factors identified in a factor analysis can be either orthogonal (uncorrelated) or oblique (correlated).

15
Q

FACTOR LOADINGS AND COMMUNALITY

A

In a factor matrix, a factor loading is the correlation between an item and a factor and can be squared to determine the amount of variability in the item scores that is accounted for by the factor. The communality is the total amount of variability in scores on the item that is accounted for by the factor analysis – i.e., by all of the identified factors.
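A worked example with hypothetical loadings (assuming orthogonal factors, in which case the communality is the sum of the item's squared loadings):

loading_f1, loading_f2 = 0.60, 0.40   # hypothetical loadings of one item on two factors
variance_from_f1 = loading_f1 ** 2    # .36 of item variance accounted for by Factor 1
variance_from_f2 = loading_f2 ** 2    # .16 accounted for by Factor 2
communality = variance_from_f1 + variance_from_f2
print(round(communality, 2))          # 0.52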

16
Q

INCREMENTAL VALIDITY/TRUE POSITIVES, FALSE POSITIVES, TRUE NEGATIVES, FALSE NEGATIVES:

A

Incremental validity refers to the extent to which a predictor increases decision-making accuracy. It is calculated by subtracting the base rate from the positive hit rate. Terms to have linked with incremental validity are predictor and criterion cutoff scores, true and false positives, and true and false negatives. True positives are people who scored high on the predictor and criterion; false positives scored high on the predictor but low on the criterion; true negatives scored low on the predictor and the criterion; and false negatives scored low on the predictor but high on the criterion.
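A worked example with hypothetical counts for 100 applicants:

true_positives = 30    # high on predictor, high on criterion
false_positives = 10   # high on predictor, low on criterion
true_negatives = 40    # low on predictor, low on criterion
false_negatives = 20   # low on predictor, high on criterion

base_rate = (true_positives + false_negatives) / 100                      # .50 succeed without the predictor
positive_hit_rate = true_positives / (true_positives + false_positives)   # .75 of those selected succeed
print(positive_hit_rate - base_rate)                                      # incremental validity = .25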

17
Q

RELATIONSHIP BETWEEN RELIABILITY AND VALIDITY:

A

Reliability is a necessary but not sufficient condition for validity. In terms of criterion-related validity, the validity coefficient can be no greater than the square root of the product of the reliabilities of the predictor and criterion.
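A quick numeric check of that ceiling, using hypothetical reliabilities:

rxx, ryy = 0.81, 0.64               # hypothetical reliabilities of predictor and criterion
max_validity = (rxx * ryy) ** 0.5   # validity coefficient cannot exceed this value
print(round(max_validity, 2))       # 0.72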

18
Q

ITEM DIFFICULTY:

A

An item’s difficulty level is calculated by dividing the number of individuals in the tryout sample who answered the item correctly by the total number of individuals. The item difficulty index (p) ranges in value from 0 (very difficult item) to 1.0 (very easy item). In general, a difficulty index of .50 is preferred because it maximizes differentiation between individuals with high and low ability and helps ensure a high reliability coefficient.
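For example (hypothetical tryout data):

correct, total = 40, 50   # 40 of 50 examinees answered the item correctly
p = correct / total
print(p)                  # 0.8, a relatively easy item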

19
Q

SENSITIVITY AND SPECIFICITY:

A

Sensitivity and specificity provide information about a predictor’s accuracy when administered to a group of individuals who are known to have or not have the disorder (or other characteristic) of interest. Sensitivity is the percent of people in the tryout sample who have the disorder and were accurately identified by the predictor as having the disorder. Specificity is the percent of people in the tryout sample who do not have the disorder and were accurately identified by the predictor as not having the disorder.
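A worked example with hypothetical counts from a tryout sample of 200 people, 50 of whom have the disorder:

true_positives, false_negatives = 40, 10    # the 50 people who have the disorder
true_negatives, false_positives = 135, 15   # the 150 people who do not

sensitivity = true_positives / (true_positives + false_negatives)   # 40/50 = .80
specificity = true_negatives / (true_negatives + false_positives)   # 135/150 = .90
print(sensitivity, specificity)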

20
Q

NORM-REFERENCED INTERPRETATION:

A

Norm-referenced interpretation involves interpreting an examinee’s test score in terms of the scores obtained by examinees in a normative (standardization) sample. Percentile ranks and standard scores (e.g., z-scores and T-scores) are types of norm-referenced scores.
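For instance (hypothetical norms), a raw score can be converted to norm-referenced z- and T-scores:

mean, sd = 100, 15   # hypothetical normative mean and standard deviation
raw_score = 115

z_score = (raw_score - mean) / sd   # 1.0, i.e., one SD above the normative mean
t_score = 50 + 10 * z_score         # 60 on the T-score scale
print(z_score, t_score)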

21
Q

ITEM CHARACTERISTIC CURVE

A

When using item response theory to construct a test, an item characteristic curve (ICC) is derived for each item by plotting the proportion of examinees in the tryout sample who answered the item correctly against either the total test score, performance on an external criterion, or a mathematically derived estimate of a latent ability or trait. Depending on which model is used, the curve provides information on one, two, or three parameters – difficulty, discrimination, and probability of guessing correctly.
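A sketch of the three-parameter logistic function often used to draw such a curve; a, b, and c are the discrimination, difficulty, and guessing parameters, and the values below are hypothetical:

import math

def icc_3pl(theta, a=1.2, b=0.0, c=0.20):
    # probability of a correct response at ability level theta
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

for theta in (-2, -1, 0, 1, 2):
    print(theta, round(icc_3pl(theta), 2))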

22
Q

MULTITRAIT-MULTIMETHOD MATRIX:

A

The multitrait-multimethod matrix is a table that is used to organize the correlation coefficients obtained when assessing a measure’s convergent and discriminant validity. Use of the matrix requires measuring at least two different traits using at least two different methods for each trait. Terms to have linked with the multitrait-multimethod matrix are monotrait-monomethod, monotrait-heteromethod, heterotrait-monomethod, and heterotrait-heteromethod coefficients.

23
Q

ORTHOGONAL AND OBLIQUE ROTATION

A

When conducting a factor analysis, the initial factors are rotated to simplify their interpretation. An orthogonal rotation produces uncorrelated factors, while an oblique rotation produces correlated factors.

24
Q

ITEM DISCRIMINATION

A

Item discrimination refers to the extent to which a test item discriminates between examinees who obtain high versus low scores on the entire test. The item discrimination index (D) ranges from -1.0 to +1.0: When all examinees in the upper-scoring group and none in the lower-scoring group answered the item correctly, D is +1.0; when none of the examinees in the upper-scoring group and all examinees in the lower-scoring group answered the item correctly, D equals -1.0.
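A worked example with hypothetical upper and lower scoring groups of 20 examinees each:

upper_correct, lower_correct = 16, 6   # number answering the item correctly in each group
group_size = 20

D = upper_correct / group_size - lower_correct / group_size   # .80 - .30
print(round(D, 2))                                            # 0.5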

25
Q

KAPPA STATISTIC:

A

The kappa statistic is also known as the kappa coefficient and is used to evaluate inter-rater reliability when ratings represent a nominal or ordinal scale of measurement.

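A brief worked example (hypothetical agreement proportions) showing how kappa corrects observed agreement for chance agreement:

observed_agreement = 0.80   # proportion of cases on which the two raters agreed
chance_agreement = 0.50     # agreement expected by chance from the marginal totals

kappa = (observed_agreement - chance_agreement) / (1 - chance_agreement)
print(round(kappa, 2))      # 0.6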
26
Q

RELIABILITY/RELIABILITY COEFFICIENT:

A

Reliability refers to the consistency of test scores – i.e., the extent to which a test measures an attribute without being affected by random fluctuations (measurement error) that produce inconsistencies over time, across items, or over different forms. Methods for establishing reliability include test-retest, alternative forms, split-half, coefficient alpha, and inter-rater. Most produce a reliability coefficient, which is interpreted directly as a measure of true score variability – e.g., a reliability coefficient of .80 indicates that 80% of variability in test scores is true score variability.

27
Q

SPLIT-HALF RELIABILITY/SPEARMAN-BROWN FORMULA:

A

Split-half reliability is a method for assessing internal consistency reliability and involves “splitting” the test in half (e.g., odd- versus even-numbered items) and correlating examinees’ scores on the two halves of the test. The split-half reliability coefficient tends to underestimate a test’s actual reliability and is usually corrected with the Spearman-Brown formula, which estimates what the test’s reliability would be if it were based on the full length of the test.

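A sketch of the Spearman-Brown correction for a split-half coefficient (hypothetical half-test correlation):

half_test_r = 0.70                                    # correlation between the two half-tests
full_test_r = (2 * half_test_r) / (1 + half_test_r)   # Spearman-Brown corrected estimate
print(round(full_test_r, 2))                          # 0.82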
28
Q

STANDARD ERROR OF ESTIMATE/CONFIDENCE INTERVAL:

A

The standard error of estimate (SEE) provides an index of error when predicting criterion scores from predictor scores and is used to construct a confidence interval around an examinee’s predicted criterion score. Its magnitude depends on two factors: the criterion’s standard deviation and the predictor’s criterion-related validity coefficient.

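A worked example with hypothetical values, including a 95% confidence interval around a predicted criterion score:

import math

criterion_sd = 10.0   # standard deviation of criterion scores
validity_r = 0.60     # predictor's criterion-related validity coefficient

see = criterion_sd * math.sqrt(1 - validity_r ** 2)   # 10 * sqrt(1 - .36) = 8.0
predicted_score = 70
lower, upper = predicted_score - 1.96 * see, predicted_score + 1.96 * see
print(round(see, 1), round(lower, 1), round(upper, 1))   # 8.0 54.3 85.7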
29
Q

STANDARD ERROR OF MEASUREMENT/CONFIDENCE INTERVAL:

A

The standard error of measurement (SEM) is an index of measurement error and is used to construct a confidence interval around an examinee’s obtained test score. Its magnitude depends on two factors: the test’s standard deviation and its reliability coefficient.

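A parallel worked example with hypothetical values for the standard error of measurement:

import math

test_sd = 15.0       # standard deviation of test scores
reliability = 0.84   # test's reliability coefficient

sem = test_sd * math.sqrt(1 - reliability)   # 15 * sqrt(.16) = 6.0
obtained_score = 110
lower, upper = obtained_score - 1.96 * sem, obtained_score + 1.96 * sem
print(round(sem, 1), round(lower, 1), round(upper, 1))   # 6.0 98.2 121.8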
30
Q

TEST LENGTH/RANGE OF SCORES:

A

Two ways to increase a test’s reliability are to increase the test’s length by adding items of similar content and quality and to increase the range of scores. The range of scores can be increased by increasing the heterogeneity of the sample in terms of the attribute(s) measured by the test and/or choosing items to include in the test so that the average difficulty level is in the mid-range (p = .50).

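The general Spearman-Brown prophecy formula gives a sense of the first strategy (hypothetical values):

original_r = 0.75   # reliability of the original 20-item test
n = 3               # test is lengthened by a factor of 3 (to 60 comparable items)

new_r = (n * original_r) / (1 + (n - 1) * original_r)
print(round(new_r, 2))   # 0.9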
31
Q

TEST-RETEST RELIABILITY:

A

Test-retest reliability is evaluated by administering the same test to the same group of examinees on two different occasions and correlating the two sets of scores. It yields a coefficient of stability.