Chapter 6: Validity Flashcards
_____, as applied to a test, is a judgment or estimate of how well a test measures what it purports to measure in a particular context. More specifically, it is a judgment based on evidence about the appropriateness of inferences drawn from test scores.
Validity
_____ is the process of gathering and evaluating evidence about validity. Both the test developer and the test user may play a role in the validation of a test for a specific purpose.
Validation
One way measurement specialists have traditionally conceptualized validity is according to three categories (trinitarian view):
1) content validity
2) criterion-related validity
3) construct validity (the “umbrella validity,” because every other variety of validity falls under it)
Three approaches to assessing validity—associated, respectively, with content validity, criterion-related validity, and construct validity—are:
- scrutinizing the test’s content
- relating scores obtained on the test to other test scores or other measures
- executing a comprehensive analysis of
a. how scores on the test relate to other test scores and measures
b. how scores on the test can be understood within some theoretical framework for understanding the construct that the test was designed to measure
Another term you may come across in the literature is _____. This variety of validity has been described as the “Rodney Dangerfield of psychometric variables” because it has “received little attention—and even less respect—from researchers examining the construct validity of psychological tests and measures”.
face validity
_____ relates more to what a test appears to measure to the person being tested than to what the test actually measures. Face validity is a judgment concerning how relevant the test items appear to be.
**from the perspective of the testtaker, not the test user.
A test’s lack of _____ could contribute to a lack of confidence in the perceived effectiveness of the test—with a consequential decrease in the testtaker’s cooperation or motivation to do his or her best.
Face validity
_____ describes a judgment of how adequately a test samples behavior representative of the universe of behavior that the test was designed to sample.
Content validity
The quantification of content validity
The measurement of _____ is important in employment settings, where tests used to hire and promote people are carefully scrutinized for their relevance to the job, among other factors.
content validity
One method of measuring content validity, developed by _____, is essentially a method for gauging agreement among raters or judges regarding how essential a particular item is: the _____.
Lawshe
**content validity ratio (CVR)
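Lawshe’s index can be written as a formula (the numeric example below is invented for illustration):

$$\text{CVR} = \frac{n_e - N/2}{N/2}$$

where $n_e$ is the number of panelists rating the item “essential” and $N$ is the total number of panelists. For instance, if 8 of 10 judges rate an item essential, CVR = (8 − 5)/5 = .60; if exactly half do, CVR = 0; if fewer than half do, CVR is negative.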
Criterion-Related Validity (3)
1) Criterion-related validity
2) Concurrent validity
3) Predictive validity
Criterion-Related Validity:
is a judgment of how adequately a test score can be used to infer an individual’s most probable standing on some measure of interest—the measure of interest being the criterion.
Ex. a company could administer a sales personality test to its sales staff to see if there is an overall correlation between their test scores and a measure of their productivity.
Criterion-Related Validity
_____ is an index of the degree to which a test score is related to some criterion measure obtained at the same time (concurrently).
Ex. let’s say a group of nursing students take two final exams to assess their knowledge. One exam is a practical test and the second exam is a paper test. If the students who score well on the practical test also score well on the paper test, then ______ has been demonstrated.
Concurrent validity
_____ is an index of the degree to which a test score predicts some criterion measure.
Ex. the SAT test is taken by high school students to predict their future performance in college (namely, their college GPA).
Predictive validity
Characteristics of a criterion (3)
1) An adequate criterion is relevant
2) An adequate criterion measure must also be valid
3) Ideally, a criterion is also uncontaminated.
Judgments of criterion-related validity, whether concurrent or predictive, are based on two types of statistical evidence: (2)
Validity Coefficient and Expectancy Data.
Type of statistical evidence:
The _____ is a correlation coefficient that provides a measure of the relationship between test scores and scores on the criterion measure.
Typically, the Pearson correlation coefficient is used to determine the validity between the two measures.
validity coefficient
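A minimal sketch of how a validity coefficient might be computed; the scores, variable names, and sample size below are invented for illustration:

```python
# A validity coefficient is the correlation between test scores and scores on
# the criterion measure; here, the Pearson r. All data are invented.
from scipy.stats import pearsonr

test_scores = [52, 61, 47, 70, 66, 58, 73, 49, 64, 55]   # e.g., sales personality test
criterion   = [12, 15, 10, 19, 17, 13, 20, 11, 16, 12]   # e.g., units sold per month

r, p_value = pearsonr(test_scores, criterion)
print(f"validity coefficient r = {r:.2f} (p = {p_value:.3f})")
```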
_____ is the degree to which an additional predictor explains something about the criterion measure that is not explained by predictors already in use. Let’s say a medical practitioner is more likely to correctly diagnose a kidney infection if a urine test is ordered, rather than relying on a physical examination and discussion of symptoms alone. We can say that the urine test has _____.
Incremental validity
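A rough sketch of the logic behind incremental validity: compare how well the criterion is predicted with and without the new measure. The data, effect sizes, and variable names are invented:

```python
# Incremental validity sketch: the gain in explained criterion variance (R^2)
# when a new predictor is added to one already in use. Invented data.
import numpy as np

rng = np.random.default_rng(0)
n = 200
existing  = rng.normal(size=n)                       # predictor already in use
new_test  = 0.5 * existing + rng.normal(size=n)      # the additional measure
criterion = existing + 0.8 * new_test + rng.normal(size=n)

def r_squared(predictors, y):
    X = np.column_stack([np.ones(len(y)), predictors])   # add intercept
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid.var() / y.var()

r2_old = r_squared(existing, criterion)
r2_new = r_squared(np.column_stack([existing, new_test]), criterion)
print(f"R^2 without new test: {r2_old:.2f}")
print(f"R^2 with new test:    {r2_new:.2f}  (incremental gain = {r2_new - r2_old:.2f})")
```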
Type of statistical evidence:
_____ provide information that can be used in evaluating the criterion-related validity of a test. Using a score obtained on some test(s) or measure(s), expectancy tables illustrate the likelihood that the testtaker will score within some interval of scores on a criterion measure—an interval that may be seen as “passing”, “acceptable”, and so on.
Expectancy data
An _____ shows the percentage of people within specified test-score intervals who subsequently were placed in various categories of the criterion (for example, placed in “passed” category or “failed” category).
expectancy table
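A small sketch of how an expectancy table might be tabulated from raw data, assuming pandas is available; the score intervals, sample, and pass criterion are all invented:

```python
# An expectancy table cross-tabulates test-score intervals against criterion
# categories, showing the percentage in each interval who "passed" or "failed".
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
test = rng.integers(40, 100, size=300)             # invented test scores
passed = rng.random(300) < (test - 40) / 60        # higher scores -> more likely to pass

intervals = pd.cut(test, bins=[39, 59, 74, 89, 100],
                   labels=["40-59", "60-74", "75-89", "90-100"])
outcome = np.where(passed, "passed", "failed")

expectancy = pd.crosstab(intervals, outcome, normalize="index") * 100
print(expectancy.round(1))    # % within each score interval placed in each category
```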
_____ provide an estimate of the extent to which inclusion of a particular test in the selection system will actually improve selection. More specifically, the tables provide an estimate of the percentage of employees hired by the use of a particular test who will be successful at their jobs, given different combinations of three variables: the test’s validity, the selection ratio used, and the base rate.
Taylor-Russell tables
_____ limitations:
- Relationship between the predictor (the test) and the criterion (rating of performance on the job) must be linear
- Potential difficulty of identifying a criterion score that separates “successful” from “unsuccessful” employees.
Taylor-Russell tables
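The published Taylor-Russell values are looked up in the tables rather than computed by the user, but their logic can be approximated by simulation: given a validity coefficient, a selection ratio, and a base rate, estimate the proportion of those selected who turn out to be successful. The sketch below is a Monte Carlo approximation with invented parameter values, not the actual tables:

```python
# Monte Carlo approximation of the Taylor-Russell logic (not the published tables).
import numpy as np

def approx_taylor_russell(validity, selection_ratio, base_rate, n=200_000, seed=0):
    rng = np.random.default_rng(seed)
    cov = [[1.0, validity], [validity, 1.0]]
    test, performance = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
    hire_cut = np.quantile(test, 1 - selection_ratio)       # hire the top of the test distribution
    success_cut = np.quantile(performance, 1 - base_rate)   # criterion level counted as "successful"
    hired = test >= hire_cut
    return np.mean(performance[hired] >= success_cut)

# With validity .40, a 30% selection ratio, and a 50% base rate, the success
# rate among those hired rises above the 50% base rate.
print(round(approx_taylor_russell(0.40, 0.30, 0.50), 2))
```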
_____ entails obtaining the difference between the means of the selected and unselected groups to derive an index of what the test (or some other tool of assessment) is adding to already established procedures.
Naylor-Shine tables
Perhaps the most oft-cited application of statistical decision theory to the field of psychological testing is _____ Psychological Tests and Personnel Decisions.
Cronbach and Gleser’s
Stated generally, _____ presented
- a classification of decision problems;
- various selection strategies ranging from single-stage processes to sequential analyses;
- a quantitative analysis of the relationship between test utility, the selection ratio, cost of the testing program, and expected value of the outcome; and
- a recommendation that in some instances job requirements be tailored to the applicant’s ability instead of the other way around (a concept they refer to as adaptive treatment).
Cronbach and Gleser
Generally, a _____ is the extent to which a particular trait, behavior, characteristic, or attribute exists in the population (expressed as a proportion).
base rate
In psychometric parlance, a _____ may be defined as the proportion of people a test accurately identifies as possessing or exhibiting a particular trait, behavior, characteristic, or attribute.
For example, _____ could refer to the proportion of people accurately predicted to be able to perform work at the graduate school level or to the proportion of neurological patients accurately identified as having a brain tumor.
hit rate
In like fashion, a _____ may be defined as the proportion of people the test fails to identify as having, or not having, a particular characteristic or attribute. Here, a miss amounts to an inaccurate prediction.
miss rate
The category of misses may be further subdivided (2)
false positive (type 1 error) and false negative (type 2 error)
The category of misses may be further subdivided (2)
A _____ is a miss wherein the test predicted that the testtaker did possess the particular characteristic or attribute being measured when in fact the testtaker did not.
Ex. A pregnancy test is positive, when in fact you aren’t pregnant
false positive (Type 1 error)
The category of misses may be further subdivided (2)
A _____ is a miss wherein the test predicted that the testtaker did not possess the particular characteristic or attribute being measured when the testtaker actually did.
Ex. a pregnancy test indicates a woman is not pregnant, but she is
false negative (Type 2 error)
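These quantities can be illustrated with a small confusion-matrix tally; the screening-test outcomes below are invented:

```python
# Hit rate, miss rate, false positives, and false negatives from a hypothetical
# screening test. 1 = has the attribute / test decision is positive. Invented data.
actual    = [1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0]   # true status
predicted = [1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0]   # test decision

pairs = list(zip(actual, predicted))
true_pos  = sum(a == 1 and p == 1 for a, p in pairs)
true_neg  = sum(a == 0 and p == 0 for a, p in pairs)
false_pos = sum(a == 0 and p == 1 for a, p in pairs)   # Type 1 miss
false_neg = sum(a == 1 and p == 0 for a, p in pairs)   # Type 2 miss

total = len(pairs)
print("hit rate  =", (true_pos + true_neg) / total)    # accurate identifications
print("miss rate =", (false_pos + false_neg) / total)  # inaccurate predictions
print("false positives:", false_pos, " false negatives:", false_neg)
```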
_____ provides guidelines for setting optimal cutoff scores. In setting such scores, the relative seriousness of making false-positive or false-negative selection decisions is frequently taken into account.
It is concerned with how real decision-makers make decisions, and with how optimal decisions can be reached.
Decision theory
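A toy sketch of the cutoff-setting idea from decision theory: sweep candidate cutoff scores and keep the one that minimizes the combined cost of false positives and false negatives, given costs supplied by the decision maker. All numbers are invented:

```python
# Choosing a cutoff score by weighing the relative seriousness (cost) of
# false-positive vs. false-negative decisions. Data and costs are invented.
import numpy as np

rng = np.random.default_rng(4)
scores_pos = rng.normal(70, 10, size=300)     # testtakers who truly have the attribute
scores_neg = rng.normal(55, 10, size=700)     # testtakers who truly do not

cost_fp, cost_fn = 1.0, 3.0                   # here a false negative is treated as 3x as serious

best_cut, best_cost = None, float("inf")
for cut in range(30, 100):
    fp = np.sum(scores_neg >= cut)            # flagged despite lacking the attribute
    fn = np.sum(scores_pos < cut)             # missed despite having the attribute
    cost = cost_fp * fp + cost_fn * fn
    if cost < best_cost:
        best_cut, best_cost = cut, cost

print(f"cutoff minimizing weighted cost: {best_cut} (total cost = {best_cost:.0f})")
```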
Evidence of Construct Validity (7)
1) Evidence of homogeneity
2) Evidence of changes with age
3) Evidence of pretest–posttest changes
4) Evidence from distinct groups
5) Convergent evidence
6) Discriminant evidence
7) Factor analysis
Evidence of Construct Validity (7)
When describing a test and its items, homogeneity refers to how uniform a test is in measuring a single concept.
Correlations between subtest scores and total test score are generally reported in the test manual as _____.
One way a test developer can improve the homogeneity of a test containing items that are scored dichotomously (for example, true–false) is by eliminating items that do not show significant correlation coefficients with total test scores.
Coefficient alpha may also be used in estimating the homogeneity of a test composed of multiple-choice items.
Evidence of homogeneity
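A compact sketch of the two homogeneity checks mentioned above, item-total correlations and coefficient alpha, computed on an invented item-response matrix:

```python
# Evidence of homogeneity: item-total correlations and coefficient alpha.
# Rows = testtakers, columns = dichotomously scored items. Invented responses.
import numpy as np

items = np.array([
    [1, 1, 0, 1, 1],
    [0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0],
    [1, 0, 1, 1, 1],
    [1, 1, 1, 0, 1],
    [0, 0, 0, 0, 1],
    [1, 1, 1, 1, 0],
])
total = items.sum(axis=1)

# Item-total correlations: items with weak correlations are candidates for removal.
for j in range(items.shape[1]):
    r = np.corrcoef(items[:, j], total)[0, 1]
    print(f"item {j + 1}: item-total r = {r:.2f}")

# Coefficient alpha: k/(k-1) * (1 - sum of item variances / variance of total score)
k = items.shape[1]
alpha = k / (k - 1) * (1 - items.var(axis=0, ddof=1).sum() / total.var(ddof=1))
print(f"coefficient alpha = {alpha:.2f}")
```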
Evidence of Construct Validity (7)
_____, like evidence of test homogeneity, does not in itself provide information about how the construct relates to other constructs.
If a test score purports to be a measure of a construct that could be expected to change over time, then the test score, too, should show the same progressive changes with age to be considered a valid measure of the construct
Evidence of changes with age
Evidence of Construct Validity (7)
Evidence that test scores change as a result of some experience between a _____ can be evidence of construct validity.
Evidence of pretest–posttest changes
Evidence of Construct Validity (7)
Also referred to as the method of contrasted groups, one way of providing evidence for the validity of a test is to demonstrate that scores on the test vary in a predictable way as a function of membership in some group.
**i.e., the test results differ across the groups
Evidence from distinct groups
Evidence of Construct Validity (7)
Evidence for the construct validity of a particular test may converge from a number of sources, such as other tests or measures designed to assess the same (or a similar) construct.
Thus, if scores on the test undergoing construct validation tend to correlate highly in the predicted direction with scores on older, more established, and already validated tests designed to measure the same (or a similar) construct, this would be an example of _____.
similar constructs
Convergent evidence
Evidence of Construct Validity (7)
A validity coefficient showing little (that is, a statistically insignificant) relationship between test scores and/or other variables with which scores on the test being construct-validated should not theoretically be correlated provides _____ of construct validity (also known as _____).
Ex. in developing the Marital Satisfaction Scale (MSS), its authors correlated scores on that instrument with scores on the Marlowe-Crowne Social Desirability Scale. Roach et al. hypothesized that high correlations between these two instruments would suggest that respondents were probably not answering items on the MSS entirely honestly but instead were responding in socially desirable ways.
The multitrait-multimethod matrix is the matrix or table that results from correlating variables (traits) within and between methods.
different constructs
discriminant evidence/validity
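A small sketch of the convergent/discriminant pattern: a new scale should correlate highly with an established measure of the same construct and weakly with a measure of a theoretically unrelated construct. The scale names and data are invented stand-ins (e.g., for the MSS and a social desirability measure):

```python
# Convergent vs. discriminant evidence as a pattern of correlations. Simulated scores.
import numpy as np

rng = np.random.default_rng(5)
n = 300
construct = rng.normal(size=n)                               # the latent construct

new_scale         = construct + 0.5 * rng.normal(size=n)     # scale being validated
established_scale = construct + 0.5 * rng.normal(size=n)     # same construct, older measure
unrelated_scale   = rng.normal(size=n)                        # theoretically unrelated construct

r_convergent   = np.corrcoef(new_scale, established_scale)[0, 1]
r_discriminant = np.corrcoef(new_scale, unrelated_scale)[0, 1]
print(f"convergent r   = {r_convergent:.2f}  (expected to be high)")
print(f"discriminant r = {r_discriminant:.2f}  (expected to be near zero)")
```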
Evidence of Construct Validity (7)
_____ is a shorthand term for a class of mathematical procedures designed to identify factors or specific variables that are typically attributes, characteristics, or dimensions on which people may differ.
In psychometric research, _____ is frequently employed as a data reduction method in which several sets of scores and the correlations between them are analyzed.
Factor analysis
Evidence of Construct Validity (7)
_____ typically entails “estimating, or extracting factors; deciding how many factors to retain; and rotating factors to an interpretable orientation”.
Ex. exploring whether measures such as the physical height, weight, and pulse rate of a human being can be summarized by a smaller number of underlying factors.
Exploratory factor analysis
Evidence of Construct Validity (7)
By contrast, in _____, “a factor structure is explicitly hypothesized and is tested for its fit with the observed covariance structure of the measured variables”
confirmatory factor analysis
Evidence of Construct Validity (7)
_____ in a test conveys information about the extent to which the factor determines the test score or scores. A new test purporting to measure bulimia, for example, can be factor-analyzed with other known measures of bulimia, as well as with other kinds of measures (such as measures of intelligence, self-esteem, general anxiety, anorexia, or perfectionism).
High _____ by the new test on a “bulimia factor” would provide convergent evidence of construct validity. Moderate to low _____ by the new test with respect to measures of other eating disorders such as anorexia would provide discriminant evidence of construct validity.
Factor loading
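A simplified sketch of how factor loadings might be inspected: simulate six measures driven by two underlying factors and examine the loading matrix from an exploratory factor analysis. scikit-learn's FactorAnalysis is used here as an assumed tool, and rotation and model-fit steps are omitted:

```python
# Two latent factors, six observed measures; inspect which measures load on which
# factor. Simulated data; rotation and fit evaluation are omitted for brevity.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(2)
n = 500
factor_a = rng.normal(size=n)          # e.g., a "bulimia" factor
factor_b = rng.normal(size=n)          # e.g., an unrelated factor

X = np.column_stack([
    factor_a + 0.4 * rng.normal(size=n),   # measures expected to load on factor A
    factor_a + 0.4 * rng.normal(size=n),
    factor_a + 0.4 * rng.normal(size=n),
    factor_b + 0.4 * rng.normal(size=n),   # measures expected to load on factor B
    factor_b + 0.4 * rng.normal(size=n),
    factor_b + 0.4 * rng.normal(size=n),
])

fa = FactorAnalysis(n_components=2, random_state=0).fit(X)
print(np.round(fa.components_, 2))     # rows = factors, columns = measures (the loadings)
```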
For the general public, the term _____ as applied to psychological and educational tests may conjure up many meanings having to do with prejudice and preferential treatment.
For federal judges, the term _____ as it relates to items on children’s intelligence tests is synonymous with “too difficult for one group as compared to another”.
For psychometricians, _____ is a factor inherent in a test that systematically prevents accurate, impartial measurement.
_____ implies systematic variation in test scores.
Ex. if idiomatic cultural expressions—such as “an old flame” or “an apples-and-oranges comparison”—are used that may be unfamiliar to recently arrived immigrant students who may not yet be proficient in the English language or in American culture, the items may systematically disadvantage those students.
Test Bias
If, for example, a test systematically underpredicts or overpredicts the performance of members of a particular group (such as people with green eyes) with respect to a criterion (such as supervisory rating), then it exhibits what is known as _____. _____ is a term derived from the point where the regression line intersects the Y-axis.
Ex. validity coefficients and criterion performance for different groups are the same, but their mean scores on the predictor differ.
intercept bias
If a test systematically yields significantly different validity coefficients for members of different groups, then it has what is known as _____ —so named because the slope of one group’s regression line is different in a statistically significant way from the slope of another group’s regression line.
Ex. the GIA predicting math achievement in Grades 1 to 4: females’ scores are underpredicted and males’ scores are overpredicted at the lower end of the score range, but beyond the point where the two groups’ regression lines cross, males’ scores are underpredicted and females’ scores are overpredicted.
slope bias
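A rough sketch of how intercept and slope bias might be checked: fit a separate test-criterion regression line for each group and compare the fitted intercepts and slopes. Group labels and data are invented:

```python
# Compare the regression of criterion on test score across two groups.
# Markedly different intercepts suggest intercept bias; markedly different
# slopes suggest slope bias. All data are invented.
import numpy as np

rng = np.random.default_rng(3)

def fit_line(test, criterion):
    slope, intercept = np.polyfit(test, criterion, deg=1)
    return slope, intercept

test_a = rng.normal(50, 10, size=150)
crit_a = 0.6 * test_a + 5 + rng.normal(0, 5, size=150)

test_b = rng.normal(50, 10, size=150)
crit_b = 0.6 * test_b + 12 + rng.normal(0, 5, size=150)   # same slope, higher intercept

slope_a, int_a = fit_line(test_a, crit_a)
slope_b, int_b = fit_line(test_b, crit_b)
print(f"group A: slope {slope_a:.2f}, intercept {int_a:.2f}")
print(f"group B: slope {slope_b:.2f}, intercept {int_b:.2f}")
# Similar slopes with different intercepts would point toward intercept bias;
# a significant difference in slopes would point toward slope bias.
```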
Simply stated, a _____ is a judgment resulting from the intentional or unintentional misuse of a rating scale.
Rating error
At the other extreme from a leniency error is a _____. Movie critics who pan just about everything they review may be guilty of _____. Of course, that is only true if they review a wide range of movies that might consensually be viewed as good and bad.
Ex. the rater is excessively negative toward every ratee, regardless of actual performance.
severity error
A _____ is, as its name implies, an error in rating that arises from the tendency on the part of the rater to be lenient in scoring, marking, and/or grading.
Ex. indicated by a tendency to rate all individuals at the high end of the scale even though the ratees differ in actual efficiency.
leniency error (also known as a generosity error)
Another type of error might be termed a _____. Here the rater, for whatever reason, exhibits a general and systematic reluctance to giving ratings at either the positive or the negative extreme. Consequently, all of this rater’s ratings would tend to cluster in the middle of the rating continuum.
Ex. the ratings get stuck in the middle of the scale.
central tendency error
One way to overcome what might be termed restriction-of-range rating errors (central tendency, leniency, severity errors) is to use _____, a procedure that requires the rater to measure individuals against one another instead of against an absolute scale. By using rankings instead of ratings, the rater (now the “ranker”) is forced to select first, second, third choices, and so forth.
rankings
_____ describes the fact that, for some raters, some ratees can do no wrong. More specifically, a _____ may also be defined as a tendency to give a particular ratee a higher rating than he or she objectively deserves because of the rater’s failure to discriminate among conceptually distinct and potentially independent aspects of a ratee’s behavior.
Halo effect
With that caveat in mind, and with exceptions most certainly in the offing, we will define _____ in a psychometric context as the extent to which a test is used in an impartial, just, and equitable way.
Test Fairness