Final WA Flashcards
Of what is validity a property?
A psychological test
When do we speak of construct underrepresentation?
When a test does not include the entire range of content of the construct that it intends to measure
Rachel analyzes an anxiety test and finds that the items form two factors. To what type of validity evidence does this finding contribute?
Internal structure
Alex developed a questionnaire for her research about neuroticism. She wants to know if her questionnaire is reasonably valid, that is, if it has sufficient construct validity. To this end, Alex asks researchers at her university – who are experts in this field – to rate each of the items in terms of how essential they are to the test.
What type of validity evidence is Alex evaluating here?
Test content
Is criterion validity a separate category from convergent validity?
No, they are both a part of association with other variables
Can criterion validity be split up into predictive and discriminant validity?
No, predictive and concurrent
According to theory, neuroticism is positively correlated with anxiety. Keeping this association in mind, what evidence do you need to find that will indicate good associative validity of a neuroticism questionnaire?
Evidence on convergent validity
Consequential validity is about…
The uses of test scores
According to Wikipedia, the definition of arithmetic is:
“Arithmetic is an elementary part of mathematics that consists of the study of the properties of the traditional operations on numbers—addition, subtraction, multiplication, division, exponentiation, and extraction of roots.”
John creates an arithmetic test with items about addition, subtraction, multiplication, division, and exponentiation. What does this test suffer from?
The test suffers from construct underrepresentation
In an exam on the history of psychology, students can pass the exam simply by choosing the most elaborate answer option for each item. Which type of validity is being threatened?
Response process validity
In an exam on developmental psychology, students who have children of their own have a higher passing rate (85% of these students pass the exam) than students without children of their own (65% of these students pass the exam).
Which of the causes below for this difference forms a threat to consequential validity?
- This difference is due to the students with children being able to relate the course material to their own children making the exam easier for them
- This difference is due to the students with children studying harder for the exam than the students without children
- This difference arose as the students with children took the exam in the morning as the first exam of the day, while the students without children took the exam in the evening after two other exams
(1 and 2 are both fair. Studying hard and understanding something better because you can relate it to something in real life are both intended consequences, 3 is not.)
In the psychological methods bachelor’s program, there are courses on programming, advanced statistics, and philosophy of science. After this program, students can enroll in the Behavioral Data Science (BDS) master track.
Since 2017, students are only allowed into the track if they showed
- that they attended programming courses
- that they attended advanced statistics courses
Ever since this selection procedure was introduced, fewer students have attended the philosophy of science courses in the psychological methods program.
What kind of consequential validity evidence is this?
Evidence of unintended systematic effects
A researcher wants to study the structural validity of a test measuring social anxiety. The test measures cognitive and affective aspects of social anxiety. The scree plot shows an inflection point at 2. On the basis of that plot, what would you conclude about the validity of this measurement?
If validity is being threatened, which type?
These results show evidence against the structural validity of this measurement instrument
(because the inflection point at 2 suggests only one factor, while theoretically there should be two)
A researcher wants to study the structural validity of a test measuring dyslexia. The test measures reading and writing aspects of dyslexia. In a factor analysis he finds that the factors correlate .901 with each other. On the basis of these results, what would you conclude on the validity of this measurement instrument?
If validity is being threatened, which type?
These results show evidence against the structural validity of this measurement instrument
For a sample of secondary school students, a researcher correlates the scores on a career-choice test to the actual university program that the students choose a year later. What kind of validity is being studied?
Predictive validity
A researcher establishes the correlation between a test on openness to experience and the variable “number of followers on Instagram”.
What kind of validity does the researcher try to establish?
Concurrent validity
Why is it so important to use a reliable criterion measure when evaluating a validity correlation?
Because the validity coefficient will be attenuated when the criterion measure is not reliable
What is the relationship between reliability, validity and measurement error?
High measurement error –> poor reliability –> attenuated validity coefficient
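The chain above can be made concrete with the classical correction-for-attenuation relation, observed r = true r × sqrt(reliability of test × reliability of criterion). The sketch below is only an illustration: the function name and all numbers are hypothetical and do not come from the flashcards.

```python
import math

def attenuated_correlation(true_r, rel_test, rel_criterion):
    """Expected observed correlation under the classical attenuation formula."""
    return true_r * math.sqrt(rel_test * rel_criterion)

# Hypothetical values: true correlation .60, test reliability .80.
# Lower criterion reliability -> lower observed validity coefficient.
for rel_criterion in (1.0, 0.8, 0.5):
    r_obs = attenuated_correlation(0.60, 0.80, rel_criterion)
    print(f"criterion reliability {rel_criterion:.1f} -> observed r = {r_obs:.2f}")
```

With perfectly reliable measures the observed and true correlations coincide; any unreliability in either measure pulls the observed coefficient down.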
We want to calculate the convergent validity correlation between two tests: A and B. On test A, a lot of people obtained the highest score, while on test B the scores are more normally distributed.
How does this difference affect the validity coefficient?
Test A doesn’t reflect the full variability of behavior, as the range is restricted, which weakens the validity coefficient
We are interested in the validity correlation between two tests, test A (scores between 1 and 10) and test B (scores between 1 and 10). However, before calculating the correlation between tests A and B, we select all subjects that score 3 or higher on test A. We then calculate the validity correlation between test A (scores 3 and up) and test B.
What will happen to the validity coefficient?
The validity coefficient will decrease, because the range of test A is restricted
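A small simulation can show the range-restriction effect. This is a sketch under assumptions that are not in the flashcards: standardised normal scores instead of a 1 to 10 scale, an assumed population correlation of .60, and a cut-off of -1 on test A. Only the standard library is used.

```python
import math
import random

def pearson(xs, ys):
    """Plain Pearson correlation between two equally long lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(1)  # fixed seed so the run is repeatable
true_r = 0.6            # assumed population correlation (hypothetical)
n = 20000

# Test A scores, and test B scores that correlate true_r with A.
a_scores = [rng.gauss(0, 1) for _ in range(n)]
b_scores = [true_r * a + math.sqrt(1 - true_r ** 2) * rng.gauss(0, 1)
            for a in a_scores]

r_full = pearson(a_scores, b_scores)

# Keep only respondents at or above the cut-off on test A.
cutoff = -1.0
kept = [(a, b) for a, b in zip(a_scores, b_scores) if a >= cutoff]
r_restricted = pearson([a for a, _ in kept], [b for _, b in kept])

print(f"full range r = {r_full:.2f}, restricted r = {r_restricted:.2f}")
```

The restricted correlation comes out lower than the full-range one because selecting on test A removes part of its true variability.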
Liz wants to investigate the validity of the attachment questionnaire she made for her research. To do that, she looks at the convergent validity correlation between her test and a self-esteem test, since theory suggests that secure attachment is correlated with high self-esteem. The self-esteem test is skewed to the right. What can you conclude about the validity coefficient?
The validity coefficient will be reduced, because the distribution of the self-esteem test is skewed
Liz wants to investigate the validity of the attachment questionnaire she made for her research. To do that, she looks at the convergent validity correlation between her test and a self-esteem test, since theory suggests that secure attachment is correlated with high self-esteem. Below are the probability distributions of the data she collected (i.e., respondents who filled in both questionnaires).
Liz discovers that in her sample, there are many more people with high self-esteem than with low self-esteem. How will this affect the validity?
As the self-esteem group sizes are not equal, the criterion variable is skewed, which leads to a lower validity coefficient
What is a risk of using predictive validity correlations when evaluating validity?
Underestimation of validity coefficients, because the variables are measured at different points in time
Gerry wants to evaluate the validity of his questionnaire about worrying. It was filled in by 150 respondents, who answered questions about how much they usually worry in various situations. Gerry discovered that his measure is affected by quite some measurement error. As a criterion variable, Gerry uses nervousness. From theory he knows that the correlation between worrying and nervousness is high. For his analysis, nervousness was measured through observation: the nervous behavior of the respondents was observed while they were filling in the questionnaire (i.e., an expert rated to what degree the respondents were nervous during the test session).
Which two factors weaken the validity coefficient in the case above?
Low reliability and prediction of single events
(because using a single event as a criterion will weaken the correlation)
Which six factors, not including measurement error, affect the validity coefficient and why/how?
Association between constructs (a stronger association gives a higher correlation)
Restricted range (weakens the correlation because it limits the true variability; this probably also relates to weak discrimination when the range is restricted)
Skew (a difference in skew weakens the correlation: unless the two distributions have the same shape, the correlation is capped and can never reach 1)
> note that skew can also arise when groups are unequal in size
Method variance (dissimilar methods give a weaker correlation, because correlations based on a single method are inflated by shared method bias)
Time (a greater time interval gives a weaker correlation, because other variables can intervene during the interval)
Prediction of single events (single events give a weaker correlation, because a single event or response is measured unreliably)
Using factor analysis, what aspect of validity is the researcher studying?
Internal Structure
Consider an MTMM matrix. We focus on the substantial correlations between OS_self on the one side and WS_self and EM_self on the other side (0.42 and 0.40, respectively).
What do these large correlations indicate?
Look at figure 1.2
These correlations indicate that the test has poor discriminant validity
We focus on the substantial correlations between OS_self on the one side and WS_self and EM_self on the other side (respectively 0.42 and 0.40).
What are these correlations called?
Figure 1.2 might help, but is not needed
Heterotrait-monomethod correlations
In an exam on developmental psychology, students who have children of their own have a higher passing rate (85% of these students pass the exam) than students without children of their own (65% of these students pass the exam).
Which of the causes below for this difference indicates DIF?
- This difference is due to the students with children being able to relate the course material to their own children making the exam easier for them
- This difference is due to the students with children studying harder for the exam than the students without children
- This difference arose as the students with children took the exam in the morning as the first exam of the day, while the students without children took the exam in the evening after two other exams
3
the other two are fair game
Which validity is threatened when DIF is present?
consequential validity
Two arithmetic items, item A and B are compared on their difficulty: Item A has a lower item difficulty parameter value as compared to item B. What can you conclude?
Item B will have a lower proportion of correct responses than item A
In an IRT analysis the item difficulty of item 9 is found to be 0.78. What can you conclude?
People with a trait level of 0.78 have a 50% chance of answering item 9 correctly
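The 50% statement follows directly from the logistic form of the model. The sketch below assumes a 1PL or 2PL item without a guessing parameter (with a 3PL the probability at θ = b would be higher than 50%); the function name and the discrimination values are hypothetical.

```python
import math

def p_correct_2pl(theta, a, b):
    """2PL: probability of a correct response at trait level theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

b_item9 = 0.78  # difficulty of item 9 from the flashcard

# Whatever the discrimination, the probability at theta = b is 0.5.
for a in (0.5, 1.0, 2.0):  # hypothetical discrimination values
    print(a, p_correct_2pl(0.78, a, b_item9))
```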
Karin and Sophia are administered a spelling test. Sophia has a spelling ability of θ = 1.5 and Karin has a spelling ability of θ = 0.6. Their teacher calculates the probability of a correct response on item 5 for Karin and for Sophia. The teacher uses a 2PL model. The probability that Sophia answers item 5 correctly is 0.8 and the probability that Karin answers item 5 correctly is 0.7.
How can it be that these probabilities are so similar, when in fact Karin’s and Sophia’s ability levels are quite different?
Item 5 is not good at differentiating between people with various ability levels
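As a rough check, the two probabilities can be used to back out the item parameters, since under the 2PL the logit of the probability equals a(θ − b). This is only a sketch: it takes the flashcard's numbers at face value and assumes the 2PL holds exactly.

```python
import math

def logit(p):
    """Log-odds of a probability."""
    return math.log(p / (1 - p))

theta_sophia, p_sophia = 1.5, 0.8
theta_karin, p_karin = 0.6, 0.7

# Two equations, logit(p) = a * (theta - b), solved for a and then b.
a = (logit(p_sophia) - logit(p_karin)) / (theta_sophia - theta_karin)
b = theta_sophia - logit(p_sophia) / a

print(f"a = {a:.2f}, b = {b:.2f}")  # roughly a = 0.60, b = -0.81
```

A discrimination of about 0.6 is fairly low, which matches the answer: the curve is flat, so a 0.9 difference in ability changes the probability by only 0.1.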
Say you developed a test to measure risky behavior in adolescents. The test consists of 16 items that are scored on a 7-point Likert scale. The items are not equally good at differentiating between the risky behavior of the test-takers.
What model would you use to estimate the item and person parameters?
GRM
Suppose you have 16 items that are scored on a 5-point Likert scale. How many difficulty parameters does a GRM include for each item? Give a round number
4
In a 2PL with 15 items, how many parameters are there in total? Give a round number
30
In a 1PL with 20 items, how many parameters are there in total? Give a round number
20
In a 3PL with 25 items, how many parameters are there in total? Give a round number
75
In a GRM with 10 items that use a 5-point Likert scale, how many parameters are there in total? Give a round number
50
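The counting rule behind these answers can be written out as a small helper. This is a sketch that follows the convention used in these cards: only item parameters are counted, with one difficulty per 1PL item, difficulty plus discrimination for the 2PL, an added guessing parameter for the 3PL, and (categories − 1) thresholds plus one discrimination per GRM item. The function name is hypothetical.

```python
def n_item_parameters(model, n_items, n_categories=2):
    """Total number of item parameters under the counting used in the cards."""
    per_item = {
        "1PL": 1,                       # difficulty
        "2PL": 2,                       # difficulty + discrimination
        "3PL": 3,                       # difficulty + discrimination + guessing
        "GRM": (n_categories - 1) + 1,  # thresholds + discrimination
    }
    return n_items * per_item[model]

print(n_item_parameters("2PL", 15))                  # 30
print(n_item_parameters("1PL", 20))                  # 20
print(n_item_parameters("3PL", 25))                  # 75
print(n_item_parameters("GRM", 10, n_categories=5))  # 50
```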
Look at figure 1.3, from which model do these curves originate (only 1)?
2PL
the curves differ in their slope
Look at figure 1.3, what is the approximate value for the difficulty of item 1?
-1
it’s roughly the midpoint of the slope
Look at figure 1.4, which model is this?
3PL
guessing value present
Look at figure 1.4, which item has the worst discrimination and which one has the best?
worst = 1 and best = 3
What is the problem with a test that only has easy items, with regard to test information?
The test provides little information at high trait levels, since the items do not discriminate well among people with high trait levels
In a 2PL, what can you conclude about the item information if an item has a difficulty of 0.65?
At a trait level of 0.65, the item provides the most information
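For the 2PL, item information is I(θ) = a² · P(θ) · (1 − P(θ)), which peaks where P(θ) = 0.5, that is, at θ = b. The sketch below checks this on a grid; the discrimination value of 1.2 is hypothetical, only the difficulty of 0.65 comes from the card.

```python
import math

def p_correct_2pl(theta, a, b):
    """2PL: probability of a correct response at trait level theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information_2pl(theta, a, b):
    """2PL item information: a^2 * P * (1 - P)."""
    p = p_correct_2pl(theta, a, b)
    return a ** 2 * p * (1 - p)

a, b = 1.2, 0.65  # a is a hypothetical discrimination; b is from the card

# Search trait levels from -3 to 3 in steps of 0.01 for the maximum.
grid = [i / 100 for i in range(-300, 301)]
best_theta = max(grid, key=lambda t: item_information_2pl(t, a, b))

print(best_theta)  # the grid point with the most information
```

The same reasoning explains the card on easy items: information is concentrated around each item's difficulty, so a test of easy items says little about high trait levels.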
Consider the following situation: for a spatial ability item, older people score lower than younger people.
Which of the below indicates that this item has DIF?
- The older people score lower because the item used a picture from Instagram that most of the older people are unfamiliar with, while most of the younger people have seen that picture before.
- The older people score lower because the test was too difficult for most of them, while for most of the younger people the test was appropriate.
- The older people score lower because they have a lower spatial ability level than the younger people.
1.
2 indicates that the latent trait was just lower for older people and 3 indicates that there is just a mean difference on the latent trait
What is the main benefit of using a CAT instead of a conventional test?
A CAT requires administration of fewer items
How would you select the items for a criterion-referenced test? (difficulty-wise)
You select mostly items with a difficulty close to the cut-off point
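One simple way to picture this selection rule is to rank the items in a bank by how far their difficulty lies from the cut-off and keep the closest ones. The item bank, the cut-off value, and the function name below are all made up for illustration.

```python
def pick_items_near_cutoff(difficulties, cutoff, k):
    """Return the k item names whose difficulty is closest to the cut-off."""
    ranked = sorted(difficulties,
                    key=lambda item: abs(difficulties[item] - cutoff))
    return ranked[:k]

# Hypothetical item bank: item name -> difficulty parameter.
bank = {"item1": -1.5, "item2": -0.2, "item3": 0.4, "item4": 0.6, "item5": 1.9}

print(pick_items_near_cutoff(bank, cutoff=0.5, k=2))
```

Items near the cut-off are the most informative about whether someone is just above or just below it, which is the decision a criterion-referenced test has to make.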