Psychological Testing #2 Flashcards
*Restriction of range
When the sample's scores span only a narrow range, correlation-based reliability estimates such as test-retest come out artificially low, because restricted variability shrinks correlations.
What is standard error of measurement?
Theoretically, if the subject took the test many times, their various scores would form a normal curve. The standard deviation of that hypothetical distribution of scores is the standard error of measurement (SEM).
What is a confidence interval?
A confidence interval is the range around an obtained score within which we are confident the true score falls. It is built from multiples of the SEM and stated as a percentage (e.g., a 95% confidence interval).
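A minimal sketch of both ideas in Python, assuming the standard formulas SEM = SD × √(1 − reliability) and CI = score ± z × SEM; the test values are made up for illustration:

```python
import math

# Standard error of measurement: SEM = SD * sqrt(1 - reliability)
sd = 15             # hypothetical test standard deviation
reliability = 0.91  # hypothetical reliability coefficient
sem = sd * math.sqrt(1 - reliability)  # = 4.5

# 95% confidence interval around an obtained score: score +/- 1.96 * SEM
score = 110
lower, upper = score - 1.96 * sem, score + 1.96 * sem
print(f"SEM = {sem:.1f}; 95% CI = [{lower:.1f}, {upper:.1f}]")
```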
What is the standard error of the difference?
A statistical measure that can help a test user determine whether the difference between scores is significant. It is usually used for sub-scores on a test.
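A short sketch using the standard formula SEdiff = √(SEM₁² + SEM₂²); the subtest reliabilities and observed difference are hypothetical:

```python
import math

# SEdiff = sqrt(SEM1^2 + SEM2^2)
sd = 15
r1, r2 = 0.90, 0.85   # hypothetical reliabilities of two subtests
sem1 = sd * math.sqrt(1 - r1)
sem2 = sd * math.sqrt(1 - r2)
se_diff = math.sqrt(sem1**2 + sem2**2)  # = 7.5 here

# A difference larger than ~1.96 * SEdiff is significant at the .05 level
diff = 12  # observed difference between two sub-scores
print(f"SEdiff = {se_diff:.1f}; significant: {abs(diff) > 1.96 * se_diff}")
```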
Ch4: What is validity?
Does the test measure what it claims to measure?
A test is valid to the extent that inferences made from it are appropriate, meaningful, and useful.
What is the relationship between validity and reliability?
If a test is not reliable, it's not going to be valid. However, a reliable test can still be invalid: something can be consistently bad. (You have to understand that reliability is necessary but not sufficient for validity.)
What do we mean by a continuum of validity?
Validity cannot be captured in a single statistical summary; instead, it falls on a continuum ranging from weak to acceptable to strong, based on the three categories of validity evidence.
What are the three categories of accumulating validity evidence?
Content validity
Criterion-related validity
Construct validity
An ideal validation includes several types of evidence in all three categories.
What is face validity?
Well, for one, it's not actually validity. It's how the test looks to examinees. It's important because it can affect a person's approach to the test. It's loosely related to content validity.
What is content validity?
Content validity is determined by the degree to which the questions, tasks, or items on a test are representative of the universe of behavior the test was designed to sample. Especially useful when a great deal is known about the construct.
Item sampling (behavior) - Do the items on the test fit the content you want to test? If I'm testing 4th-grade math level and I include skills that aren't taught until 5th grade, that's poor content validity.
Types of skills (responses) - Multiple choice or open-ended?
“Expert review” is often the choice of evidence.
What is criterion-related validity?
The test score is compared to an outcome measure (criterion). The criterion can be concurrent, e.g., people take a new IQ test and an established IQ test at the same time. The criterion can also be predictive, as in college readiness tests and employment tests.
What makes a good criterion for criterion-related validity?
RELIABLE - consistency of scores.
APPROPRIATE - Well duh, but actually sometimes this can be tricky. Should the criterion measure of an aptitude test indicate satisfaction, success, or continuance in the activity?
FREE FROM THE CONTAMINATION OF THE TEST - This becomes a problem when your criterion is contaminated by the test score itself: I want to see if the test is useful, but you already used the test to decide whom to hire. The criterion can also be contaminated by overlap between questions; e.g., if both the test and the criterion ask about eating habits and sleeping habits, the shared items will artificially inflate the correlation.
What is decision theory?
The purpose of psychological testing is not measurement for its own sake, but measurement in the service of decision making.
Making decisions based on test scores results in a matrix of outcomes: hits (true positives and true negatives) and misses (false positives and false negatives). You have to determine where you can best afford your mistakes to be.
What is construct validity?
A construct is a theoretical, intangible quality or trait in which individuals differ. Construct validity is theory based: Based on my understanding of this particular construct, what would I expect to see in a test?
No criterion or universe of content is accepted as entirely adequate to define the quality to be measured, so a variety of evidence is required to establish construct validity.
What is test homogeneity?
A measure of construct validity.
Does it measure a single construct?
If my theory says this is a unitary construct, then an internal-consistency analysis should show the items hanging together as a single construct. But beware: a homogeneous test is measuring one thing, and that one thing might not be the right thing.
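A minimal sketch of checking homogeneity with an internal-consistency coefficient (Cronbach's alpha); the item data are fabricated so that all items reflect one underlying trait:

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """items: (n_subjects, n_items) matrix of item scores."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars / total_var)

rng = np.random.default_rng(0)
trait = rng.normal(size=200)                   # one underlying construct
items = trait[:, None] + rng.normal(scale=0.8, size=(200, 5))
print(f"alpha = {cronbach_alpha(items):.2f}")  # high alpha -> homogeneous test
```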
What are appropriate developmental changes?
A measure of construct validity. Is my construct something that changes as people age? Egocentrism, for example: the scores should go down as kids get older.
What are theory-consistent group differences?
A measure of construct validity. Can we predict who will have high and low scores for this construct? Different rates of extroversion in different professions. Nuns are high in social interest; models and criminals are low in social interest.
What are theory-consistent intervention effects?
A measure of construct validity. Does the construct change in the appropriate direction after intervention/treatment? People's spatial-orientation scores should increase after training, more than the scores of those who did not receive training.
What is convergent and discriminant validation?
A measure of construct validity. What should the test correlate with, and what should it be different from? Intelligence and social interest are theoretically unrelated; anxiety and eating disorders overlap.
What is factor analysis?
A measure of construct validity. How many factors are you actually measuring? If you think you're measuring three factors and a factor analysis shows three factors, that's a good sign.
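A rough sketch using scikit-learn's FactorAnalysis (assuming scikit-learn is available); the item data are fabricated from three latent factors, so a three-factor solution should fit well:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(1)
factors = rng.normal(size=(300, 3))      # three latent traits
loadings = rng.normal(size=(3, 12))      # each of 12 items loads on them
items = factors @ loadings + rng.normal(scale=0.5, size=(300, 12))

fa = FactorAnalysis(n_components=3).fit(items)
# If three factors reproduce the item covariances well, that supports
# a theory that predicted three factors.
print(fa.components_.shape)  # (3, 12): loading of each factor on each item
```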
What is classification accuracy?
A measure of construct validity.
How well does it give accurate identification of test takers? Test makers strive for high levels of:
SENSITIVITY: Accurate identification of patients who have a syndrome.
SPECIFICITY: Accurate identification of normal patients.
These are reported as percentages. Sensitivity: 79% (correctly identifies 79% of affected individuals). Specificity: 83% (correctly identifies 83% of unaffected individuals).
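A small sketch computing both rates from the decision matrix of hits and misses; the counts are invented to match the percentages above:

```python
# Decision matrix counts (hypothetical screening test)
true_pos  = 79   # affected, test says affected    (hit)
false_neg = 21   # affected, test says normal      (miss)
true_neg  = 83   # unaffected, test says normal    (hit)
false_pos = 17   # unaffected, test says affected  (miss)

sensitivity = true_pos / (true_pos + false_neg)  # 0.79
specificity = true_neg / (true_neg + false_pos)  # 0.83
print(f"sensitivity = {sensitivity:.0%}, specificity = {specificity:.0%}")
```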
What are extravalidity concerns?
Side effects and unintended consequences of testing.
What are some of the unintended side effects of testing? (AKA extravalidity concerns.)
Children identified may feel unusual or dumb. There can be legal consequences.
How do we prevent extravalidity problems?
Along with traditional validity, the test should also be evaluated for (1) values in interpretation, (2) usefulness in the particular application, and (3) potential and actual social consequences.
What does NOIR stand for?
Nominal
Ordinal
Interval
Ratio
What is a nominal scale?
Where the scores are simply categories, without any inherent order.
Male = 1, Female = 2.
What is an ordinal scale?
A scale with categories following a specific order, but the distance between the categories is variable.
Freshman, Sophomore, Junior, Senior.
Ranking something from most liked to least liked
What is an interval scale?
A scale in which the units have an order and equal distance between each unit. It does not possess an absolute 0. A Likert scale is considered an interval scale for statistical purposes.
What is a ratio scale?
A ratio scale is rare in psychological measurement. A scale with an absolute 0, which also allows for categorization, ranking, and intervals.
What are some scaling methods? Which ones are best?
"No single scaling method is uniformly better than the others." Expert Ranking Likert scales Guttman scales Empirical keying Rational scale construction
What’s an example of expert ranking?
The Glasgow Coma Scale
How would experts rank each of these responses?
What are methods of absolute scaling?
A procedure for obtaining a measure of absolute item difficulty based on different age groups of test takers. You don't want questions to be bunched around certain ages, leaving gaps at others.
What is empirical keying?
You develop a long list of questions, try them out on contrasting groups (depressed/not depressed, delinquents/non-delinquents), and see whether the groups answer the questions differently.
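A minimal sketch of the idea: compare endorsement rates for each candidate item across the contrasting groups and keep the items that separate them, regardless of what the items appear to measure. All data here are fabricated:

```python
import numpy as np

rng = np.random.default_rng(2)
n_items = 20
# Fraction of each group endorsing each item (fabricated tryout data)
depressed = rng.uniform(0.2, 0.9, size=n_items)
controls  = rng.uniform(0.2, 0.9, size=n_items)

# Keep items whose endorsement rates differ substantially between groups
keyed = [i for i in range(n_items) if abs(depressed[i] - controls[i]) > 0.25]
print(f"items retained for the scale: {keyed}")
```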
What is the heart of the method of rational scaling?
That all the scale items correlate positively with each other and also with the total score for the scale. The questions need to correlate with each other, or we won’t keep them.
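A small sketch of that check: correlate each item with the total score (here, the total minus the item itself) and drop items that don't correlate positively. The data are fabricated, with one deliberately unrelated item:

```python
import numpy as np

rng = np.random.default_rng(3)
trait = rng.normal(size=150)
items = trait[:, None] + rng.normal(scale=1.0, size=(150, 8))
items[:, 7] = rng.normal(size=150)   # one item unrelated to the construct

for i in range(items.shape[1]):
    # Corrected item-total correlation: item vs. total excluding the item
    rest = items.sum(axis=1) - items[:, i]
    r = np.corrcoef(items[:, i], rest)[0, 1]
    print(f"item {i}: r = {r:+.2f}  {'keep' if r > 0.2 else 'drop'}")
```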
What are the initial questions of test construction?
Range of difficulty
Item format
Item difficulty
Item-discrimination
How would range of difficulty be different for different types of tests?
Norm-referenced tests would have a greater range of difficulty, because we want to know who the outliers are.
Criterion-referenced tests would be more restricted, because no one cares if you're in the 99th percentile of drivers on your driving test.
What are some examples of item format and what are their strengths and weaknesses?
Multiple-choice questions can capture conceptual as well as factual knowledge and can be easily judged for fairness using statistics. However, they can be difficult to write with good distractors, and weak distractors can cue a half-knowledgeable respondent.
Matching questions are problematic because the responses may not be independent.
True/false questions can be easy to understand, but people may choose the most socially desirable answer.
Forced choice questions can prevent people from picking the most desirable option, but they haven’t been embraced yet by test developers.
What are the best types of items to use?
It depends on the test.
How do we measure item difficulty?
We measure the proportion of people who get the item correct.
An item with a difficulty of .3 is an item that 30% of people got correct, so it's hard. An easier question would be around .8.
Generally, item difficulty hovers around .5 with a range of .3-.7, but this will change depending on the type of test.
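A minimal sketch of the computation: item difficulty is just the column mean of a 0/1 response matrix. The response data are fabricated:

```python
import numpy as np

rng = np.random.default_rng(4)
# responses: (n_examinees, n_items), 1 = correct, 0 = incorrect (fabricated)
responses = (rng.random((100, 5)) < [0.3, 0.5, 0.5, 0.6, 0.8]).astype(int)

difficulty = responses.mean(axis=0)  # proportion correct per item
print(difficulty)                    # roughly [0.3, 0.5, 0.5, 0.6, 0.8]
```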
What are the two types of item-discrimination?
- High vs. low scorers - If most of the high-scoring people get an item right and the low-scoring people get it wrong, it's a good question. What if most of the people earning As and Bs get it wrong while the people earning Cs and Ds get it right? Then there might be a problem with the key, or the question is poorly worded. (See the sketch after this list.)
- Analysis of item choices - What was the variability of the choices? Did everyone guess A or B while no one chose C or D? Then C and D are wastes of space; you want good distractors. Occasionally an option can be too close to the actual answer, so you make the distractor less like it.
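A small sketch of the high-vs.-low-scorers discrimination index: the difference between the item's pass rate in the top and bottom scoring groups (the upper/lower 27% split is a common convention; all data fabricated):

```python
import numpy as np

rng = np.random.default_rng(5)
ability = rng.normal(size=200)
difficulties = np.linspace(-1, 1, 10)
# Fabricated responses: higher ability -> higher chance of a correct answer
p_correct = 1 / (1 + np.exp(-(ability[:, None] - difficulties)))
responses = (rng.random((200, 10)) < p_correct).astype(int)

totals = responses.sum(axis=1)
top = responses[totals >= np.percentile(totals, 73)]     # upper ~27%
bottom = responses[totals <= np.percentile(totals, 27)]  # lower ~27%
d = top.mean(axis=0) - bottom.mean(axis=0)  # discrimination index per item
print(d)  # positive values: high scorers pass the item more often
```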
What is cross-validation and how is it related to validity shrinkage?
Cross-validation means using the original regression equation in a new sample to determine whether the test predicts the criterion as well as it did in the original sample. Because the regression equation was fit to the quirks of the original sample, it almost always predicts less well in the second sample. This drop in predictive power is called validity shrinkage.
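A rough sketch of the procedure on simulated data (the predictors and criterion are invented, so the exact amount of shrinkage will vary from run to run): the regression equation is estimated on the development sample only, then applied unchanged to the new sample.

```python
import numpy as np

rng = np.random.default_rng(6)

def sample(n):
    x = rng.normal(size=(n, 5))                        # five test predictors
    y = x[:, 0] + 0.5 * x[:, 1] + rng.normal(size=n)   # criterion measure
    return x, y

x_dev, y_dev = sample(60)   # original (development) sample
x_new, y_new = sample(60)   # fresh cross-validation sample

# Fit the regression equation on the development sample only
design = np.column_stack([x_dev, np.ones(60)])
coef, *_ = np.linalg.lstsq(design, y_dev, rcond=None)

def validity(x, y):
    pred = np.column_stack([x, np.ones(len(y))]) @ coef
    return np.corrcoef(pred, y)[0, 1]

# The validity coefficient typically drops in the new sample: shrinkage
print(f"original r = {validity(x_dev, y_dev):.2f}, "
      f"cross-validated r = {validity(x_new, y_new):.2f}")
```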
How might you get feedback from examinees, and how will that contribute to test development?
You can give questionnaires to the examinees after the test or you can have them think aloud about it in an open-ended manner.
The Inter-University entrance exam was modified in numerous ways in response to feedback. Time limits on some sections were increased. Perceived culturally unfair items were deleted.
How are testing materials important?
Practical design matters: a tri-fold board instead of just one piece of cardboard, books that stand up on their own. Intelligence tests have a lot of components that need to be manipulated, and on top of those there are manuals, stopwatches, and small children to manage.
What are the two manuals you need for a test and why?
Technical manual and user's manual - a test user needs both. The technical manual gives the background (development, norms, reliability, and validity evidence) and helps you determine whether you want to use the test; the user's manual covers administration, scoring, and interpretation.
What is a real definition, and how is it different from an operational definition?
A real definition is one that seeks to tell us the true nature of the thing being defined. An operational definition is a definition of a concept in terms of the way it is measured.
What are the shortcomings of operational definitions of intelligence?
They are circular: “What the tests test.”
They block further progress in understanding the nature of intelligence.
How does the textbook define intelligence?
Intelligence is:
- The capacity to learn from experience.
- The capacity to adapt to one’s environment.
These two themes occur again and again in definitions of intelligence. Many textbooks also include the ability to engage in abstract reasoning.