Lecture 9: Validity Flashcards
Reliability
degree to which differences in test scores reflect differences in the psychological attribute that affects test scores (does the test's score reflect something with precision?)
Validity
“the degree to which evidence and theory support the interpretations of test scores for proposed uses”
3 Points about validity
1. Validity is about the interpretation of test scores in terms of a specific psychological construct; it is not about the test itself 2. Validity is a matter of degree; it is not all or none 3. Validity is based on solid empirical evidence and theory (use information from research, not just personal choice)
What is the importance of theory for validity
questionnaires and tests require a theory that defines what the construct is; otherwise it is not possible to understand what the test scores mean (if you can't define it, you don't know if you're measuring it)
Why is it important to measure validity? (what could be detrimental if you don’t)
if the psychological meaning of test scores is misinterpreted, then 1. Psychological conclusions based on those scores may be mistaken 2. Decisions made about people (based even partly on those scores) may be misguided and potentially unfair or dangerous
What are the dimensions of construct validity?
1. Test content (what are we measuring?) 2. Internal structure (factor analysis; what are the dimensions and how do they relate?) 3. Response processes (how someone thinks about an item, what they are thinking about, and how that relates to what we're trying to ask) 4. Consequences of use (the impact—what we want or what we don't) 5. Associations with other variables (how we've been thinking about it in the past; empirical evidence)
What are the three types of validity? (from book reading)
- Content
- Criterion—degree to which test scores can predict a specific criterion; de-emphasizes the conceptual meaning or interpretation of test scores
  a. Concurrent and predictive validity (the criterion is now thought of as falling under the construct)
- Construct (now more focused)
a. Test content—does the content of the test match what should be on the test? (expert ratings)
i. Construct-irrelevant content and construct underrepresentation
ii. Face validity is close to content validity (but relies on laypeople's judgment rather than experts')
b. Internal structure—does the internal structure match what it should be (factor analysis)
c. Response processes—do the psychological processes that respondents use when completing a measure match what they should (interviewing and eye tracking)
d. Associations with other variables—do the associations with other variables match what they should be (convergent, discriminant, concurrent, and predictive correlations)
e. Consequences of use—do the actual consequences of using a measure match the consequences that should be seen? (evaluation of consequences, differential impact, and systematic changes [MCAT content reflected how premeds were taught])
What are the three types of validity from Cronbach and Meehl?
- Criterion
  a. Predictive—criterion obtained after the test is given
  b. Concurrent—test score and criterion obtained at the same time
- Content—showing the test items are a sample of a universe in which the investigator is interested; deductive
  a. Acceptance of the universe of content as defining the variable to be measured is essential
- Construct—when a test is interpreted as a measure of some attribute or quality that is not "operationally defined"
What is content validity?
"the systematic examination of the test content to determine whether it covers a representative sample of the behavior domain to be measured"; the relationship between the content of a test and some well-defined domain of knowledge or behavior (the relationship between what is on the test and what you think should be on the test based on the theory); a good match between content and domain = high content validity; does the actual content of a test match the content that should be included on the test?
Steps for content validity
1. Reflect the full scope of the construct (vs. construct underrepresentation) 2. Systematically exclude anything besides the construct (vs. construct-irrelevant variance); evidence via expert ratings
Construct underrepresentation
The degree to which a test fails to capture important aspects of the given construct
Irrelevant variance
the degree to which test scores are affected by processes extraneous to the intended construct
Steps to assess content validity
1. Describe the content domain (in terms of boundaries/limits and structure) 2. Determine the areas of the content domain that are measured by each item (go item by item: what am I measuring?) 3. Compare the structure of the test with the structure of the content domain (do you even want two domains?)
Concerns with content validity
short forms of a test (e.g., a shortened depression screening tool with only two items may miss people or overdiagnose); content overlap (e.g., agitation or tension: if you're measuring depression but include items like tense or agitated, that's more anxiety; overlap between questionnaires of two different constructs)
Final thoughts on content validity
content validation is a process rather than a statistical analysis (really a process of going through each item and domain); content validity differs from internal-consistency reliability (in internal consistency we want the items to hang together and mean the same thing; content validity is similar but concerns the content of the construct, not just whether the items measure the same thing)
Face validity
whether a test looks like it measures its target construct ("How sad are you?" for depression); don't confuse this with the empirical approach (face validity does not mean the test actually measures what you're interested in); concern: malingering (all questions are subject to response bias, but high face validity makes a test more subject to it; low face validity might be better for catching malingering)
Internal Structure (structural validity)
does the actual internal structure of a test match the structure that the test should possess?; dimensionality and factor analysis (if a test has different dimensions, how is it measuring these and how do they relate to one another?)
Internal structure validity issues evaluated through factor analysis
1. Number of factors 2. Meaning of factors 3. Associations between factors (if more than one factor) 4. Order of factors (do they relate to the construct overall or are they separate? Top: depression; second level: 1. cognitive 2. somatic; third level: what relates to those)
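The "number of factors" question above is often answered with an eigenvalue check on the item correlation matrix. A minimal sketch, with made-up correlations for six hypothetical depression items (three cognitive, three somatic; the values are illustrative, not from the lecture):

```python
import numpy as np

# Hypothetical correlation matrix: items 1-3 (cognitive) correlate 0.6
# with each other, items 4-6 (somatic) likewise, and 0.2 across groups.
R = np.array([
    [1.0, 0.6, 0.6, 0.2, 0.2, 0.2],
    [0.6, 1.0, 0.6, 0.2, 0.2, 0.2],
    [0.6, 0.6, 1.0, 0.2, 0.2, 0.2],
    [0.2, 0.2, 0.2, 1.0, 0.6, 0.6],
    [0.2, 0.2, 0.2, 0.6, 1.0, 0.6],
    [0.2, 0.2, 0.2, 0.6, 0.6, 1.0],
])

# Kaiser rule: retain factors with eigenvalues > 1 as a rough
# guide to dimensionality.
eigvals = np.linalg.eigvalsh(R)[::-1]   # sorted descending
n_factors = int(np.sum(eigvals > 1))
print(n_factors)  # 2 — a cognitive factor and a somatic factor
```

Here the two large eigenvalues mirror the two-dimensional structure built into the matrix; real data would require a full factor analysis, but the logic is the same.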
Response Process Validity
do the psychological processes that actually shape people's responses to the test match the processes that should shape their responses? (understand what the question means to the individual taking the test; people have different processes; interviewing after a questionnaire helps get a general sense of an individual's response)
Two types of evidence for response process validity
Evidence: 1. Direct evidence: interviews with respondents, "think alouds" 2. Indirect evidence: eye-tracking, response times, statistical analysis, and experimental studies of process (use these if you can't do an interview or if you think respondents are lying)
Consequences of Testing Validity
do the actual consequences of using a test match the consequences that should be seen?; Controversial as a facet of validity
Evidence for consequences of testing validity
evidence: 1. Evidence of intended effects 2. Evidence regarding unintended differential impact on groups (an unbiased test could have biased results; women might have higher emotional intelligence than men, but if you use it as a screening tool for grad school, you'll choose more women) 3. Evidence regarding unintended systemic effects (choosing more women for your program over time leads to more women professors and faculty members, with long-term effects on the school and even the profession)
Associations with other measures validity
do the actual associations with other measures match the associations it should have with those measures? (are the scores obtained a good measure of the construct; the extent to which scores on the measure are consistent with the construct and with measures related to this construct)
Evidence for associations with other measures validity
evidence: 1. Convergent validity (concurrent/predictive validity; expect significant correlations between our test and measures of behaviors related to our construct; different measures of the construct should also be correlated) 2. Discriminant (divergent) validity (behaviors not associated with our construct of interest should be uncorrelated with our test)
How do you evaluate association with other measures correlations?
A correlation of .45 could be a low level of convergent validity OR a poor level of divergent validity; to decide which, compare it against what you expect (the nomological network)
Associations with other measures, key questions: 1. How do you know which scales to measure or examine? 2. How do you know what pattern of convergent or discriminant associations to expect?
Constructs Nomological Network
network of associated constructs, behaviors, and properties
Assessing criterion-related validity
(a criterion is an outcome that you want to be associated with your measure—what you hope your measure will predict) correlation of test scores with the outcomes of decisions that are made 1. Predictive validation strategies (the "ideal" approach: obtain test scores, then obtain performance measures and correlate them with the test scores; gives you a validity coefficient) 2. Concurrent validation strategies (a practical alternative; test scores and criterion scores from a preselected population are obtained at the same time)
Predictive validation strategies
the "ideal" approach: 1. Obtain test scores 2. Obtain performance measures and correlate these with the test scores; gives you a validity coefficient
Concurrent validation strategies
a practical alternative; test scores and criterion scores from a preselected population are obtained at the same time
Advantages to concurrent validation
1. Much more practical than predictive validation 2. Easier than predictive validation 3. Concurrent validity coefficients are often similar to those from predictive validation
Statistical and conceptual problems with concurrent validation
quality of the predictor (often there are many factors that relate to the criterion); range restriction (reduces the correlation between the predictor [test score] and the criterion [outcome measure]; if you look at GRE scores and success in grad school, you miss those who didn't get into grad school; restriction can be direct, through selection by the predictor, or indirect, through decisions based on the criterion); the goal is often to screen out failures (allowed for in predictive validation but not concurrent validation; predictive validation is much better suited to screening)
Incremental Validity
a test must demonstrate that it has better predictive ability than existing assessment data; i.e., demonstrate validity beyond existing assessments; concern: who decides the gold standard? Researchers are supposed to compare against the gold standard, but some pick weak comparisons just so they can demonstrate incremental validity
Interpreting validity coefficients
validity coefficients are correlations, so they range from −1 to +1 (in practice usually reported between 0 and 1); don't expect huge validity coefficients (usually not larger than 0.5); squaring the validity coefficient tells you the amount of variance in the criterion that can be explained by the predictor (e.g., using a test of cognitive ability to predict job performance yields a validity coefficient of 0.5; we can say 25% of the variance in job performance is accounted for by cognitive ability)
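The squaring step from the card above, worked as a two-line sketch (the 0.5 value is the card's own example):

```python
# Squaring a validity coefficient gives the proportion of criterion
# variance explained by the predictor (coefficient of determination).
validity_coefficient = 0.5  # cognitive ability predicting job performance

variance_explained = validity_coefficient ** 2
print(f"{variance_explained:.0%}")  # prints "25%"
```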
Methods of Construct Validation
1. Correlational study (correlations between our test and measures of behaviors that are either related or unrelated to our construct of interest; meta-analysis: gathering many validity studies and combining them) 2. Factor analysis (analyzing which groups of items "hang" together) 3. Experimental manipulation (manipulate the construct of interest [e.g., induce fear] and see whether it produces different scores on our test) 4. Multitrait-multimethod matrix (MTMM)
Multitrait-Multimethod Matrix (MTMM):
based on convergent and discriminant validity; three elements: 1. Multitrait—measures different traits 2. Multimethod—uses various methods to measure those traits 3. Matrix—a table composed of correlations between the trait-method combinations
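The core MTMM check is that same-trait/different-method (convergent) correlations should exceed different-trait (discriminant) correlations. A minimal sketch with two hypothetical traits and methods; all labels and correlation values are invented for illustration:

```python
# Correlations between trait/method combinations (illustrative numbers).
corr = {
    # monotrait-heteromethod: same trait, different method (convergent)
    ("anxiety/self-report", "anxiety/clinician"): 0.65,
    ("depression/self-report", "depression/clinician"): 0.70,
    # heterotrait-heteromethod: different trait, different method (discriminant)
    ("anxiety/self-report", "depression/clinician"): 0.20,
    ("depression/self-report", "anxiety/clinician"): 0.25,
}

trait = lambda label: label.split("/")[0]
convergent = [r for (a, b), r in corr.items() if trait(a) == trait(b)]
discriminant = [r for (a, b), r in corr.items() if trait(a) != trait(b)]

# Evidence for construct validity: every convergent correlation
# should exceed every discriminant correlation.
print(min(convergent) > max(discriminant))  # prints True
```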
Base rate
level of performance on the criterion in the general population; e.g., if 75% of the population is successful, the base rate is 0.75
Selection ratio
ratio of positions to applicants; e.g., if 30 people apply for 3 jobs, the selection ratio is 10% or 0.10
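Both quantities above are simple proportions; the numbers below are the cards' own examples:

```python
# Base rate: proportion of the population successful on the criterion.
successful, population = 75, 100
base_rate = successful / population        # 0.75

# Selection ratio: positions available per applicant.
positions, applicants = 3, 30
selection_ratio = positions / applicants   # 0.10, i.e., 10%

print(base_rate, selection_ratio)
```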
True Positive (TP)
when a test predicts success and the person actually succeeds
True Negative (TN)
when a test predicts failure and a person actually fails
False Positive (FP):
When a test predicts success and the person actually fails
False Negative (FN):
when a test predicts failure and a person actually succeeds
Effect of base rate on decisions: when the base rate is large
almost everyone would be successful, but there are a limited number of positions, so there will be a high number of true positives and a high number of false negatives
Effect of base rate on decisions: when base rate is small
hardly anyone would be successful, but a set number of positions must be filled, so there will be a high number of true negatives and a high number of false positives
When are tests used as predictors most likely to have an impact on accurate decision making?
when the base rate is moderate (0.5)
Effect of selection ratio on decisions
if the selection ratio is high (number of positions and number of applicants almost equal), it doesn't really matter what method of prediction you use because you're taking almost everyone who applies; validity has the biggest impact on correct decision making when the selection ratio is low
sensitivity
the probability that a test correctly identifies individuals who have the disorder; sensitivity = true positives / (true positives + false negatives), i.e., everyone who actually has the disorder
specificity
the probability that a test correctly identifies individuals who do NOT have the disorder; specificity = true negatives / (true negatives + false positives)
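The two formulas above, applied to the TP/TN/FP/FN counts defined earlier in the deck (the counts here are made up for illustration):

```python
# Hypothetical screening results.
tp, fn = 40, 10   # people WITH the disorder: correctly / incorrectly classified
tn, fp = 80, 20   # people WITHOUT the disorder: correctly / incorrectly classified

sensitivity = tp / (tp + fn)   # fraction of actual cases the test catches
specificity = tn / (tn + fp)   # fraction of non-cases the test clears

print(sensitivity, specificity)  # prints 0.8 0.8
```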
Internal validity
depends on the strength or soundness of the design and analysis of the study in which a scientific question is evaluated, including any study that validates a psychometric test or questionnaire
History
Events, other than the experimental treatments or the true effect among the variables, that have influenced the results
Maturation
During the study, some kind of psychological change occurs within the people completing the test that affects how test scores are related to scores on other tests and questionnaires
Testing
Exposure to some known or unknown pretest or intervening assessment influences performance on your test
Instrumentation
Testing instruments, or the conditions in which people complete the tests, are inconsistent; or the pre- and post-tests are not equivalent in some manner
Statistical Regression
Scores of subjects that are very high or very low tend to regress toward the mean during retesting
Selection
Systematic differences in subjects' characteristics between treatment or comparison groups affect the performance of the test
Mortality
Attrition affects the representativeness of the group you are studying and can lead to an overestimate, underestimate, or unreliable result
Diffusion of Treatments
People in one group are able to communicate with the other group and affect the manner in which the other group completes the questionnaire
Effects of Pretesting/Order Effects
One test in your battery of tests affects how people complete a subsequent test; completing one test may lead to an improvement or increase in skill on later tests
Placebo Effects or Expectancy Effects
An effect occurs because the participant expects something to happen, even though there is no "active" ingredient or agent
Demand Characteristics
People completing the questionnaires respond in the manner that they believe is expected of them
Experimenter Bias
The test developer's ideas about how the construct should be defined and operationalized are in some manner "biased" and, as a result, out of line with how most people would define and measure it, leading to the creation of a test that produces very different results