Ch 5 - Validity Flashcards
Define validity
Validity is the accuracy of the interpretations and inferences made on the basis of test scores
• Reliability is necessary but insufficient for validity
• Necessary: if scores are not precise in the first place, they have no meaningful interpretation
• Insufficient: precise scores may still not measure what they should
Validity is usually established over many studies - no one single study/method can address all the issues of validity at once
Validity, like reliability, is NOT a property of a test itself (avoid the phrase "test validity")
The evidence that we bring to support any inference that is to be made on the basis of test results (textbook definition)
3 ideas in this definition:
• Validity is cumulative - validation is the process of gathering evidence for the validity of a test
• As the evidence accumulates, the validity may be increased or diminished
• The evidence for validity can be gathered through any type of systematic scientific research by any qualified test user, even if the test author did not foresee that application at first
validity: a matter of judgments that pertains to test scores as they are employed for a given purpose and in a given context
Kane’s interpretation-use arguments (AKA IUA perspective)
• Involves:
1. Score interpretation (how the scores should be interpreted)
2. Context of use (what is the test needed for)
3. Type of evidence needed (how are we going to justify the use of this measure?)
• 1 and 2 together will have an impact on 3
• If 1 or 2 changes (how the scores should be interpreted, the goal/rationale for using the test changes), then a new type of evidence may be needed
Types of evidence needed for concrete vs abstract intended interpretations of test scores
If the intended interpretation is very limited (AKA concrete, not leading to further suppositions), then the evidence needed for that interpretation will also be limited (and that's OK - we don't need a lot)
If the intended interpretation is more abstract (AKA used to make inferences about the possible future behaviour of a person), then more evidence will be needed
For example, if you use observed behaviour to make predictions of behaviour in a more distant/different future, then you will need more validity evidence
4 major categories of evidence in validity studies
1. Test content (does the test content match the target domain? The goal / intended use?)
2. Internal structure (most relevant for test batteries) (are the correlations among subtest scores consistent with theoretical expectations?)
i. Ex: a cognitive ability test with 16 subtests - we might say that 4 of them measure motor skills, another 4 measure vision, etc. (the pattern of subtest correlations should match these theoretical expectations)
ii. Theoretical expectations - based on previous theories on the subject AND on previous studies
3. Covariance - with future scores or with other scores collected at the same time (for situations where we want to establish the predictive validity of test scores)
4. Response process (much rarer than the other 3) - ex: when someone wants to know how people solve a problem, they give them a problem and ask them to think out loud - this corresponds to observing the response process
Content validity evidence
concerns whether the items on a test are representative of the target domain (type, relative frequency, proper sampling from the domain, etc)
External standard / gold standard = expert opinion
We want the test to demonstrate the level of mastery that an expert judges adequate for that domain, no less
• Especially important for mastery tests - ex: determining if someone has an acceptable degree of mastery in a certain area - there should be evidence that the test indeed assesses mastery of the domain
• Good example: adequate evidence of content validity is especially relevant for a driving test - we want it to represent reality and pose some challenge, so we know whether people are really ready to be drivers
○ If we shorten the test to save time, is it still representative of the skills needed to drive in real life?
Test specifications
Test specifications - before writing the first item, one should plan the test out (come up with a blueprint of the test) in terms of:
• Target population (who is it intended for, and who is it not?)
• Context (goal of the test)
• Use (how are we going to use it? What for?)
○ These features of the test will indicate
§ What the input of the test will be (what the content will be)
§ Cognitive operations that should be performed with the basic input (ex: rote memorization)
§ Output of the operations (ex: a written document, a list of steps, etc. - the form of the answers produced after the operations on the input)
What is the main goal in the evaluation of content validity
The evaluation of content validity is far more rational than statistical - even though there are numbers involved
• Main goal - define and recruit the appropriate content experts in the area to determine the content of the test; the experts might use numbers to support their suggestions, but the final recommendations are mostly qualitative
Internal Structure Evidence
Example: we are constructing a test and we want the content to be representative
The experts tell us: the items should be representative of the domain, but that domain is ONE dimension only (unidimensional)
• Where does the alpha coefficient come into play?
• Internal consistency coefficient concerns internal structure
• A requirement of the alpha coefficient (or an assumption) is that the item set it represents is unidimensional
• Therefore, an alpha coefficient would be relevant for our situation
• If it's assumed that the content domain is unidimensional and the coefficient assumes the same, it's absolutely relevant to our situation (a high alpha value would indicate that everything belongs to one domain, BUT wouldn't prove that it's the RIGHT domain)
This does not prove validity, but it’s a great start, since reliability is NECESSARY for validity
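A minimal computational sketch of how coefficient alpha is obtained (the item data are invented; numpy is assumed available):

```python
import numpy as np

# Hypothetical responses: rows = respondents, columns = items on one domain
items = np.array([
    [4, 5, 4, 5],
    [2, 3, 2, 2],
    [5, 5, 4, 5],
    [1, 2, 1, 2],
    [3, 3, 4, 3],
])

k = items.shape[1]                          # number of items
item_vars = items.var(axis=0, ddof=1)       # variance of each item
total_var = items.sum(axis=1).var(ddof=1)   # variance of the total scores

# Cronbach's alpha: k/(k-1) * (1 - sum of item variances / total variance)
alpha = (k / (k - 1)) * (1 - item_vars.sum() / total_var)
print(f"alpha = {alpha:.2f}")
```

A high value here is consistent with a unidimensional item set, but, as noted above, it cannot show that the items come from the RIGHT domain.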
How can we evaluate the internal structure evidence for a battery test (multidimensional)
The test as a whole is multidimensional
In this case, an alpha coefficient can be relevant for each INDIVIDUAL area
• A Pearson correlation between the scales can also be relevant
• Are the correlations among the sets of items (subtests) of a battery consistent with theoretical expectations concerning the number of domains that should be measured, and do the subtests correspond to the domains predicted by the test authors?
○ Is the association between the domains and what they are supposed to measure consistent?
Factor Analysis
Factor analysis - developed in 1904-05 by Spearman
• Original context: interested in understanding the structure of human intelligence (what are its domains, how are they arranged, how can we measure them?)
• Input data: correlation matrix and standard deviations
• Goal: is the pattern of correlations consistent with theoretical predictions about what the measures should measure?
• Widely used in assessment research
Statistical technique to address questions similar to those in the KABC
• We have 8 subtests that are supposed to measure 2 domains (divided into 3 and 5)
○ Do the data support those expectations?
○ The data are the 8x8 correlation matrix (the rij) and the subtest SDi
○ Latent variables = simultaneous and sequential processing - NOT directly measurable / observable except through the observation of tasks that are supposed to tap into those domains
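A rough sketch of the idea using simulated data (the two latent variables, the loadings, and the sample size are all invented; sklearn's FactorAnalysis is just one of several tools that could be used):

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
n = 500
sequential = rng.normal(size=n)    # latent domain 1 (not directly observable)
simultaneous = rng.normal(size=n)  # latent domain 2 (not directly observable)

# 3 subtests tap sequential processing, 5 tap simultaneous processing
subtests = np.column_stack(
    [0.8 * sequential + 0.6 * rng.normal(size=n) for _ in range(3)] +
    [0.8 * simultaneous + 0.6 * rng.normal(size=n) for _ in range(5)]
)

# Do two factors reproduce the 8x8 correlation pattern as theory predicts?
fa = FactorAnalysis(n_components=2, rotation="varimax").fit(subtests)
print(np.round(fa.components_.T, 2))  # loadings: rows = subtests, cols = factors
```

If the theoretical expectations hold, the first 3 rows should load mainly on one factor and the last 5 on the other.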
Construct validity
usually means: do scores have any relevant interpretation for the theoretical domain that the test authors intended?
2 methods of factor analysis
There are 2 methods of factor analysis
1. Exploratory (original from Spearman) (AKA EFA)
EFA - analyses unrestricted measurement models
All the observed measures are allowed to correlate with each factor
2. Confirmatory (developed after Spearman) (AKA CFA)
• Part of a larger family called structural equation modelling
• CFA analyses restricted measurement models
• The observed variables are allowed to be associated with only certain factors
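A schematic of the difference, as hypothetical loading matrices for 4 subtests and 2 factors (all numbers invented):

```python
import numpy as np

# EFA (unrestricted): every subtest may load on every factor
efa_loadings = np.array([
    [0.75, 0.12],
    [0.68, 0.05],
    [0.10, 0.71],
    [0.08, 0.80],
])

# CFA (restricted): loadings fixed to 0 where theory says a subtest
# should NOT be associated with that factor
cfa_loadings = np.array([
    [0.74, 0.00],
    [0.69, 0.00],
    [0.00, 0.72],
    [0.00, 0.79],
])
```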
Convergent validity
The hypothesis that multiple measures supposed to tap the same domain will correlate
If that is true, we should observe high intercorrelations among the test scores - for the subtests as well as the whole test
How high should those correlations be? There is no clear answer; it depends on:
Level of measurement of the scores
Scoring metric
Etc
In general, they should be closer to one
Discriminant validity
The hypothesis that measures supposed to measure different domains will not be correlated - if the 2 factors we are measuring together are really distinct, they are not supposed to be correlated
The correlation between the 2 latent variables (1 and 2) is relevant for discriminant validity
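A toy illustration of both ideas with simulated scores (the constructs, tests, and noise levels are all invented):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300
anxiety = rng.normal(size=n)       # latent construct 1
extraversion = rng.normal(size=n)  # latent construct 2 (distinct)

anxiety_test_a = anxiety + 0.4 * rng.normal(size=n)
anxiety_test_b = anxiety + 0.4 * rng.normal(size=n)
extraversion_test = extraversion + 0.4 * rng.normal(size=n)

# Convergent: same construct, correlation should be high
print(np.corrcoef(anxiety_test_a, anxiety_test_b)[0, 1])
# Discriminant: different constructs, correlation should be near 0
print(np.corrcoef(anxiety_test_a, extraversion_test)[0, 1])
```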
Factor loading
The correlation of a task/score with one of the factors
What is the naming fallacy
Suppose that some factor model has been established to be consistent with the data
• It doesn't mean that the names assigned to the factors by the researchers are adequate
Example: sequential processing in the KABC (name of one of the factors, underlying 3 tasks)
• All 3 of those tests involved immediate recall ONLY
A more appropriate name for the factor might then be Short-Term Memory
Covariance evidence
Refers to external validity
Coefficient represented by rXY
Designates a correlation between scores on subtest X and scores on an external variable Y (Y is not just another test)
Y is something that the test SHOULD measure, we expect scores on X to correlate with scores on Y
External validity
How well do scores on test X relate to Y, a real-world variable of interest?
Ŷ
• Ŷ = predicted score (score on Y generated from test X)
Regression
Scores on test X and scores on external variable Y that the test is supposed to predict
The computer will fit the regression line (of best fit)
Multiple regression
2 tests (X1 and X2) used BOTH to explain Y - multiple regression (2 predictors or more)
Generalized regression
could have several Y variables that should be predicted by multiple X scores (generalized regression)
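A sketch of the simple and multiple cases with invented data (numpy only; the coefficients and noise levels are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
x1 = rng.normal(50, 10, size=n)          # scores on test X
y = 0.6 * x1 + rng.normal(0, 8, size=n)  # external criterion Y

# Simple regression: fit the line of best fit, get Y-hat
slope, intercept = np.polyfit(x1, y, 1)
y_hat = intercept + slope * x1           # predicted scores on Y

# Multiple regression: a second predictor X2 added
x2 = rng.normal(0, 1, size=n)
X = np.column_stack([np.ones(n), x1, x2])  # intercept column + 2 predictors
coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Generalized regression would extend this further, with several Y variables each predicted from the same set of X scores.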
Concurrent validity
If scores on X and Y are collected at the same time, that is called concurrent validity
Ex: developing a test with a group of employees while recording their job performance at the same time
Predictive validation study
get scores on X and wait before getting scores on Y
2 components of the equation that generates a predicted score on Y
Slope of the regression line
Intercept of Y (when x=0)
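Putting the two components together, the standard bivariate form:

$$\hat{Y} = bX + a$$

where b is the slope and a is the Y-intercept (the predicted Y when X = 0).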
2 components for the CI around Ŷ
SEest and Z score for the desired CI
What does SEest (standard error of estimate) represent
Describes the variability of the actual scores on Y around the regression line
AKA the st dev of actual scores on Y around the regression line
As the Y scores get closer to the regression line, SEest will get smaller
2 components needed to make SEest
SDY (= standard deviation of scores on Y)
rXY (= validity coefficient)
If rXY = 1, then SEest =?
then SEest would be 0, because ALL the points would fall on the regression line
If rXY = 0, then SEest = ?
then SEest will be = SDY
AKA all variation observed in Y is error
However, if the coefficient is 0 (or 1), something is likely wrong
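A worked example with invented values, SDY = 10 and rXY = .80:

$$SE_{est} = SD_Y\sqrt{1 - r_{XY}^2} = 10\sqrt{1 - 0.80^2} = 10\sqrt{0.36} = 6.0$$

$$95\%\ CI = \hat{Y} \pm 1.96 \times SE_{est} = \hat{Y} \pm 11.76$$

The edge cases above check out: rXY = 1 gives SEest = 0, and rXY = 0 gives SEest = SDY. Here 1 - rXY² = 0.36 is the proportion of Y variance not shared with X.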
What does rXY represent BEFORE and AFTER being squared, when we calculate SEest?
BEFORE: rXY is the slope of the standardized regression line (the predicted amount of change in Y, in standard deviation units, given a change in X of 1 full standard deviation)
AFTER: rXY² is the proportion of variance in Y that is shared with X
(1 - rXY²) is the proportion of variance in Y not shared with X
How do we interpret a 95% CI around a value of Ŷ?
In the population, 95% of CIs constructed this way would include the person's actual score on Y (not the predicted Y); BUT for any single interval, there is no guarantee
Explain why “Score reliability limits predictability”
• If X is to predict some external variable, how well it does so is limited by its precision (reliability)
if scores on X are imprecise, they will be unable to predict anything
What limits the theoretical maximum absolute value of the correlation between X and Y (AKA rXY)?
the correlation between X and Y is limited by (cannot exceed) the square root of the product of their two respective score reliabilities (rXX and rYY)
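For example, with invented reliabilities rXX = .90 and rYY = .80:

$$|r_{XY}| \le \sqrt{r_{XX} \cdot r_{YY}} = \sqrt{0.90 \times 0.80} = \sqrt{0.72} \approx 0.85$$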
In which occasion can we say that a Pearson correlation can range from -1 to 1?
• The fact that a Pearson correlation can range from -1 to 1 is ONLY possible if the scores on X and Y are perfectly reliable (extremely rare)
What can we evaluate when there is no single variable for Y, but many different possible variables?
we can still evaluate convergent validity
• We can also evaluate discriminant validity
• We can't evaluate "divergent validity" - it doesn't exist (the term is discriminant validity)
What is the Jingle Jangle fallacy?
Jingle: false belief that if 2 things have the same name, they are the same thing
• Ex: two tests both named as measuring depression might not measure the exact same thing about depression - they may not even measure depression at all, and may show a very low correlation between their scores
Jangle: just because tests have different names does not mean that they measure different things
• Ex: one test measures self-esteem and another measures interest in gardening - if the correlation between their scores is 0.9, that really high correlation suggests that they measure the same thing despite the different names
Common method variance
AKA method effects
What is the method of a test?
The method for a test is the method of deriving scores
• What is the source of information?
• How does that information get processed?
The method used to collect the data itself can have a systematic influence on the scores
What are methods of measurement?
• Self-report (risk of distortion of answers)
• Observational
• Archival (AKA records - risk of issues in accuracy of the datasets)
Some methods induce some systematic effect - especially self-report, where people tend to subtly change their answers due to demand characteristics
Some tests have validity scales (like the MMPI) to detect any potential voluntary distortion of answers
Multitrait-Multimethod Matrix (MTMM)
A way to separate the variation coming from the method of measurement from the variation that comes from variation in the construct
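A minimal hypothetical layout - two traits (A, B), each measured by two methods (self-report, observation); all correlations are invented for illustration (the diagonal would hold each measure's reliability):

```
          A-self  B-self  A-obs   B-obs
A-self      -
B-self     .30      -
A-obs      .65     .25      -
B-obs      .20     .60     .30      -
```

Same trait / different method (.65, .60) should be high (convergent validity); different trait / same method (.30, .30) reflects common method variance; different trait / different method (.20, .25) should be lowest (discriminant validity).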