Test Construction Flashcards

1
Q

Construct Validity

A

Construct validity refers to the extent to which a test measures the hypothetical trait (construct) it is intended to measure. Methods for establishing construct validity include correlating test scores with scores on measures that do and do not measure the same trait (convergent and discriminant validity), conducting a factor analysis to assess the test’s factorial validity, determining if changes in test scores reflect expected developmental changes, and seeing if experimental manipulations have the expected impact on test scores.

2
Q

Content Validity

A

The extent to which a test adequately samples the domain of information, knowledge, or skill that it purports to measure. Determined primarily by expert judgment. Most important for achievement and job sample tests.

3
Q

Criterion Contamination

A

Refers to bias introduced into a person’s criterion score when the individual scoring the criterion knows how that person performed on the predictor. Tends to artificially inflate the relationship between the predictor and criterion.

4
Q

Criterion–Referenced Interpretation

A

Interpretation of a test score in terms of a prespecified standard; i.e., in terms of percent of content correct (percentage score) or of predicted performance on an external criterion (e.g., regression equation, expectancy table).

5
Q

Criterion–Related Validity/Concurrent
And Predictive

A

The type of validity that involves determining the relationship (correlation) between the predictor and the criterion. The correlation coefficient is referred to as the criterion–related validity coefficient. Criterion–related validity can be either concurrent (predictor and criterion scores obtained at about the same time) or predictive (predictor scores obtained before criterion scores).

6
Q

Cross–Validation And Shrinkage

A

Process of re–assessing a test’s criterion–related validity on a new sample to check the generalizability of the original validity coefficient. Ordinarily, the validity coefficient “shrinks” (becomes smaller) on cross–validation because the chance factors operating in the original sample are not all present in the cross–validation sample.

7
Q

Factor Analysis

A

A multivariate statistical technique used to determine how many factors (constructs) are needed to account for the intercorrelations among a set of tests, subtests, or test items. Factor analysis can be used to assess a test’s construct validity by indicating the extent to which the test correlates with factors that it would and would not be expected to correlate with. From the perspective of factor analysis, true score variability consists of communality and specificity. Factors identified in a factor analysis can be either orthogonal or oblique.

8
Q

Factor Loadings and Communality

A

In a factor matrix, a factor loading is the correlation between a test (or other variable included in the analysis) and a factor and can be squared to determine the amount of variability in the test that is accounted for by the factor. The communality is the total amount of variability in scores on the test that is accounted for by the factor analysis – i.e., by all of the identified factors.
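
As a worked example with hypothetical loadings: if a test loads .60 on Factor 1 and .50 on Factor 2, the factors account for (.60)² = .36 and (.50)² = .25 of the variability in test scores, and the communality is .36 + .25 = .61.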

9
Q

Incremental Validity/True Positives,
False Positives, True Negatives,
False Negatives

A

The extent to which a predictor increases decision–making accuracy. Calculated by subtracting the base rate from the positive hit rate. Terms linked with incremental validity are predictor and criterion cutoff scores and true and false positives and negatives. True positives are those who scored high on both the predictor and the criterion; false positives scored high on the predictor but low on the criterion; true negatives scored low on both the predictor and the criterion; and false negatives scored low on the predictor but high on the criterion.
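
As a hypothetical example: the positive hit rate is the proportion of those selected by the predictor who succeed on the criterion, or TP/(TP + FP), and the base rate is the proportion who would succeed if the predictor were not used. If the positive hit rate is .75 and the base rate is .60, incremental validity = .75 − .60 = .15; that is, using the predictor increases decision–making accuracy by 15 percentage points.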

10
Q

Item Characteristic Curve

A

When using item response theory, an item characteristic curve (ICC) is constructed for each item by plotting the proportion of examinees in the tryout sample who answered the item correctly against either the total test score, performance on an external criterion, or a mathematically–derived estimate of a latent ability or trait. The curve provides information on the relationship between an examinee’s level on the ability or trait measured by the test and the probability that he/she will respond to the item correctly.
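
When a latent–trait (IRT) model is used, the ICC is typically an S–shaped (logistic) curve. As one common example, the two–parameter logistic model gives the probability of a correct response as P(θ) = 1/(1 + e^(−a(θ − b))), where θ is the examinee’s trait level, b is the item’s difficulty, and a is its discrimination.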

11
Q

Item Difficulty

A

An item’s difficulty level is calculated by dividing the number of individuals who answered the item correctly by the total number of individuals; ranges in value from 0 (very difficult item) to 1.0 (very easy item). In general, an item difficulty index of .50 is preferred because it maximizes differentiation between individuals with high and low ability and helps ensure a high reliability coefficient.
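
In formula form: p = (number answering the item correctly)/(total number of examinees). For example, if 40 of 80 examinees answer an item correctly, p = 40/80 = .50.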

12
Q

Item Discrimination

A

Item discrimination refers to the extent to which a test item discriminates (differentiates) between examinees who obtain high versus low scores on the entire test or on an external criterion. The item discrimination index (D) ranges from –1.0 to +1.0. If all examinees in the upper group and none in the lower group answered the item correctly, D is +1.0; if none of the examinees in the upper group and all examinees in the lower group answered the item correctly, D equals –1.0.
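
In formula form: D = (proportion of the upper group answering correctly) − (proportion of the lower group answering correctly). For example, if 80% of the upper group and 30% of the lower group answer an item correctly, D = .80 − .30 = .50.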

13
Q

Kappa Statistic

A

A chance–corrected measure of agreement, interpreted like a correlation coefficient, used to assess inter–rater reliability when ratings represent discrete categories.
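
Kappa is commonly computed as κ = (p_o − p_e)/(1 − p_e), where p_o is the observed proportion of agreement between raters and p_e is the proportion of agreement expected by chance. For example, if raters agree on 80% of cases and 50% agreement would be expected by chance, κ = (.80 − .50)/(1 − .50) = .60.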

14
Q

Multitrait–Multimethod Matrix

A

A systematic way to organize the correlation coefficients obtained when assessing a measure’s convergent and discriminant validity (which, in turn, provides evidence of construct validity). Requires measuring at least two different traits using at least two different methods for each trait. Terms linked with the multitrait–multimethod matrix are monotrait–monomethod, monotrait–heteromethod, heterotrait–monomethod, and heterotrait–heteromethod coefficients: large monotrait–heteromethod coefficients provide evidence of convergent validity, while small heterotrait–monomethod and heterotrait–heteromethod coefficients provide evidence of discriminant validity.

15
Q

Norm–Referenced Interpretation

A

Interpretation of an examinee’s test performance relative to the performance of examinees in a normative (standardization) sample. Percentile ranks and standard scores (e.g., z–scores and T scores) are types of norm–referenced scores.

16
Q

Orthogonal And Oblique Rotation

A

In factor analysis, an orthogonal rotation of the identified factors produces uncorrelated factors, while an oblique rotation produces correlated factors. Rotation is done to simplify the interpretation of the identified factors.

17
Q

Relationship Between Reliability and Validity

A

Reliability is a necessary but not sufficient condition for validity. In terms of criterion–related validity, the validity coefficient can be no greater than the square root of the product of the reliabilities of the predictor and criterion.
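
In formula form: r_xy ≤ √(r_xx × r_yy), where r_xx and r_yy are the reliability coefficients of the predictor and criterion. For example, if the predictor’s reliability is .81 and the criterion’s reliability is .64, the maximum possible validity coefficient is √(.81 × .64) = √.5184 = .72.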

18
Q

Relevance

A

In test construction, relevance refers to the extent to which test items contribute to achieving the stated goals of testing.

19
Q

Reliability/Reliability Coefficient

A

Reliability refers to the consistency of test scores; i.e., the extent to which a test measures an attribute without being affected by random fluctuations (measurement error) that produce inconsistencies over time, across items, or over different forms. Methods for establishing reliability include test–retest, alternative forms, split–half, coefficient alpha, and inter–rater. Most produce a reliability coefficient, which is interpreted directly as a measure of true score variability – e.g., a reliability of .80 indicates that 80% of variability in test scores is true score variability.

20
Q

Sensitivity and Specificity

A

Sensitivity and specificity provide information about a predictor’s accuracy when administered to a group of individuals who are known to have or not have the disorder (or other characteristic) of interest. Sensitivity is the percent of people in the tryout sample who have the disorder and were accurately identified by the predictor as having the disorder. Specificity is the percent of people in the tryout sample who do not have the disorder and were accurately identified by the predictor as not having the disorder.
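
In formula form: sensitivity = TP/(TP + FN) and specificity = TN/(TN + FP). As a hypothetical example: if 90 of 100 people who have the disorder are identified by the predictor as having it, sensitivity = 90/100 = .90; if 160 of 200 people who do not have the disorder are identified as not having it, specificity = 160/200 = .80.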

21
Q

Split–Half Reliability/ Spearman–Brown Formula

A

Split–half reliability is a method for assessing internal consistency reliability and involves “splitting” the test in half (e.g., odd– versus even–numbered items) and correlating examinees’ scores on the two halves. Because the resulting coefficient is based on scores from only half of the test’s items, it tends to underestimate the full test’s reliability, so the Spearman–Brown formula is used to estimate the reliability of the full–length test.
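
The Spearman–Brown formula estimates the full–test reliability from the correlation between the two halves (r_hh): r = 2 × r_hh/(1 + r_hh). For example, if the correlation between halves is .60, the estimated full–test reliability is (2 × .60)/(1 + .60) = 1.20/1.60 = .75.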

22
Q

Standard Error Of Estimate/
Confidence Interval

A

An index of error when predicting criterion scores from predictor scores. Used to construct a confidence interval around an examinee’s predicted criterion score. Its magnitude depends on two factors: the criterion’s standard deviation and the predictor’s validity coefficient.
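
In formula form: SEest = SDy × √(1 − r_xy²), where SDy is the criterion’s standard deviation and r_xy is the validity coefficient. For example, with a criterion standard deviation of 10 and a validity coefficient of .60, SEest = 10 × √(1 − .36) = 10 × .80 = 8.0.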

23
Q

Standard Error of Measurement/
Confidence Interval

A

An index of measurement error. Used to construct a confidence interval around an examinee’s obtained test score. Its magnitude depends on two factors: the test’s standard deviation and reliability coefficient.
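
In formula form: SEM = SD × √(1 − r_xx), where SD is the test’s standard deviation and r_xx is its reliability coefficient. For example, with SD = 15 and a reliability of .91, SEM = 15 × √.09 = 15 × .30 = 4.5, and the 68% confidence interval is the examinee’s obtained score plus and minus 4.5 points.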

24
Q

Test Length/Range Of Scores

A

A test’s reliability can be increased in several ways. One way is to increase the test length by adding items of similar content and quality. Another is to increase the heterogeneity of the sample in terms of the attribute(s) measured by the test, which will increase the range of scores.
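
The effect of lengthening a test can be estimated with the general Spearman–Brown formula: r_new = (k × r)/(1 + (k − 1) × r), where r is the original reliability coefficient and k is the factor by which the test is lengthened. For example, doubling (k = 2) a test with a reliability of .70 yields an estimated reliability of (2 × .70)/(1 + .70) = 1.40/1.70 = .82.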

25
Q

Test–Retest Reliability

A

A method for assessing reliability that involves administering the same test to the same group of examinees on two different occasions and correlating the two sets of scores. Yields a coefficient of stability.

26
Q

Raw Score

A

The raw test score obtained by an examinee has only limited meaning. It becomes meaningful only when it is tied to one of two “anchors”: the performance of other examinees (norm-referenced interpretation) or an established standard of performance (criterion-referenced interpretation).

27
Q

Standard Score

A

Standard Scores: When an examinee’s raw test score is converted to a standard score, the transformed score indicates the examinee’s position in the normative sample in terms of standard deviations from the mean. A primary advantage of standard scores is that they permit comparisons of scores obtained from different tests.

28
Q

Z-Score

A

The most commonly used standard score is the z-score. The z-score equivalent of an examinee’s raw score is calculated by subtracting the mean of the distribution from the raw score to obtain a deviation score and then dividing the deviation score by the distribution’s standard deviation:

Z = (X − M)/SD

That is, the z-score is obtained by subtracting the mean from the raw score and dividing the result by the standard deviation.

For example, assume that the assertiveness test has a mean of 50 and a standard deviation of 10, and that a job applicant receives a score of 60. The applicant’s z-score is +1.0: (60 - 50)/10 = +1.0. This score indicates that the applicant received a score that is one standard deviation above the mean achieved by people in the normative sample.

29
Q

Norm-Referenced Interpretation

A

Norm-referenced interpretation involves comparing an examinee’s test score to scores obtained by people included in a normative (standardization) sample and is useful for identifying individual differences. To interpret scores in terms of norms, an examinee’s raw test score is converted to another score that indicates his/her relative standing in the norm group.

The adequacy of norm-referenced interpretation relies on the extent to which the examinee’s characteristics match those of the people in the norm sample. If an examinee’s characteristics do not match, the interpretation of his or her score may be misleading. As an example, an inexperienced sales applicant’s assertiveness test score might be misinterpreted if it is compared to the distribution of scores obtained by a sample consisting of only experienced salespeople.

A major difficulty with norm-referenced interpretation is finding norms derived from people whose characteristics are similar to those of the examinee, a problem compounded by the fact that, for many tests, normative data become obsolete rather quickly.

30
Q

Percentile Ranks

A

Percentile Ranks: A percentile rank (PR) expresses an examinee’s raw score in terms of the percentage of examinees in the norm sample who achieved lower scores. The primary advantage of percentile ranks is that they are easy to interpret: If a sales applicant’s raw score on the assertiveness test is equivalent to a percentile rank of 88, this means that 88% of the people in the norm sample obtained scores lower than the applicant’s score.

31
Q

Percentile Ranks Distribution is always _______ in shape regardless of the shape of the raw score distribution.

A

A distinguishing characteristic of percentile ranks is that their distribution is always flat (rectangular) in shape regardless of the shape of the raw score distribution. This is because percentile ranks are evenly distributed throughout the range of scores. In other words, an equal number of scores falls between the 10th and 20th percentile, between the 20th and 30th percentile, and so on. Whenever a distribution of transformed scores differs in shape from the distribution of raw scores, as it does with percentile ranks, the score transformation is referred to as a nonlinear transformation.

32
Q

Z-Scores Distribution Properties

A

The distribution of z-scores has the following properties:
(1) The mean of the z-score distribution is equal to 0.
(2) The standard deviation is equal to 1.
(3) All raw scores below the mean of the distribution convert to negative z-scores, scores above the mean convert to positive z-scores, and scores equal to the mean convert to a z-score of 0.
(4) Unless it is “normalized,” the z-score distribution has the same shape as the raw score distribution. In other words, the transformation of raw scores to z-scores is a linear transformation.