Chapter 8 Flashcards
What is an ideal test item for a norm-referenced test?
What about criterion?
Top scorers should get it correct, while low scorers should get it wrong.
This does not matter for a criterion-referenced test: there, an ideal item is judged by how well it assesses mastery.
Scaling definition
The process of setting rules for assigning numbers in measurement.
Stanine scale
Raw scores are transformed into scores ranging from 1 to 9 (mean 5, standard deviation 2).
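A minimal Python sketch of a linear stanine conversion (function name is mine, not from the chapter): standardize each raw score, rescale to mean 5 / SD 2, then round and clip to 1-9.

```python
import statistics

def to_stanines(raw_scores):
    """Convert raw scores to stanines via z-scores rescaled to
    mean 5, SD 2, rounded and clipped to the 1-9 range."""
    mean = statistics.mean(raw_scores)
    sd = statistics.stdev(raw_scores)
    stanines = []
    for x in raw_scores:
        z = (x - mean) / sd
        s = round(5 + 2 * z)
        stanines.append(min(9, max(1, s)))
    return stanines

print(to_stanines([40, 50, 55, 60, 70, 85, 30]))
```

In practice stanines are often assigned by fixed percentile bands (4%, 7%, 12%, 17%, 20%, 17%, 12%, 7%, 4%) rather than this simple linear rescaling; the sketch only illustrates the 1-9, mean-5, SD-2 idea.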
Rating scale
Records judgements of oneself, others, experiences, or objects
Summative scale
The final test score is the sum of the scores on the individual items.
Method of paired comparisons
Testtakers are asked to choose between two options presented as a pair.
Comparative Scaling
Options are sorted or ranked relative to one another on the basis of judgements (e.g., ranking cards).
Categorical scaling
Sort objects into categories (e.g., sorting cards into "never justified," "sometimes justified," and "always justified").
Guttman scale
Items range from weaker to stronger expressions of the attitude or trait measured.
Testtakers who agree with the stronger statements will also agree with the milder ones.
Direct vs Indirect estimation
Direct estimation requires no transformation of testtakers' responses onto another scale.
Indirect estimation (e.g., the method of equal-appearing intervals) does require such a transformation.
Selected-response vs. Constructed-response formats
Both are item formats: selected-response items have the testtaker choose among options provided, while constructed-response items have the testtaker generate their own answer.
3 types of selected-response item formats
Multiple-choice, matching, and true/false.
What are the names of the two columns in matching
Premises and responses
Completion item
Fill in the blank item
Computerized adaptive testing (CAT)
Testing in which the items presented are based on performance on previous items.
What are the advantages of CAT?
It can reduce both the number of items needed and measurement error (each by around 50%).
Floor vs. Ceiling effects
Floor effect: the assessment tool fails to distinguish testtakers at the low end of what is measured (all items are too difficult).
Ceiling effect: the tool fails to distinguish testtakers at the high end (all items are too easy).
Item Branching
Ability to customize content and order on the basis of previous responses.
Class Scoring or Category Scoring
Respondents are placed in a class or category with other respondents on the basis of their responses.
Ipsative scoring
What conclusions can be drawn from it?
A score on one scale is compared with a score on another scale within the same test.
Appropriate only for intraindividual comparisons, not interindividual ones.
What makes a good test item?
Can discriminate testtakers.
If all high scorers get a particular item wrong, that is a bad sign; likewise if all low scorers get it right.
Item analysis
Statistical procedures to analyze and identify good items for a test.
4 possible analyses for test items
Item difficulty, item reliability, item validity, and item discrimination.
How to calculate index of item’s difficulty?
The item-difficulty index (called the item-endorsement index for tests, such as personality tests, with no "correct" answer).
It is simply a proportion: the number of testtakers answering correctly divided by the total number of testtakers.
What’s the ideal item difficulty? What should the range be?
The range should be about .3 to .8; .5 is ideal for discrimination.
When guessing is possible, the optimal difficulty is (chance success proportion + 1) / 2, e.g., (.5 + 1) / 2 = .75 for true/false items.
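The two calculations above can be sketched in Python (function names are mine, not from the chapter):

```python
def item_difficulty(responses):
    """Item-difficulty index p: proportion of testtakers answering correctly.
    `responses` is a list of 0/1 scores for one item."""
    return sum(responses) / len(responses)

def optimal_difficulty(chance):
    """Optimal difficulty when guessing is possible: the midpoint
    between chance-level success and a perfect 1.0."""
    return (chance + 1) / 2

print(item_difficulty([1, 1, 0, 1, 0]))  # 3 of 5 correct -> 0.6
print(optimal_difficulty(0.25))          # 4-option multiple choice -> 0.625
print(optimal_difficulty(0.5))           # true/false -> 0.75
```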
Item-reliability index
The item's standard deviation multiplied by the correlation between the item score and the total test score; an indication of internal consistency.
Item-validity index
The degree to which an item measures what the test purports to measure.
Item-discrimination index
Lowercase "d": the difference between the proportion of high scorers (upper 25-33%) answering an item correctly and the proportion of low scorers (lower 25-33%) answering it correctly.
A negative d is a bad sign: it means low scorers answer the item correctly more often than high scorers do.
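A minimal Python sketch of d (function name and data are mine; 27% is used as the cutoff, a common choice within the 25-33% range):

```python
def discrimination_index(results, fraction=0.27):
    """Item-discrimination index d: proportion of the upper group answering
    the item correctly minus the proportion of the lower group doing so.
    `results` is a list of (total_test_score, item_correct) pairs,
    where item_correct is 1 or 0."""
    ranked = sorted(results, key=lambda r: r[0], reverse=True)
    n = max(1, round(len(ranked) * fraction))
    upper, lower = ranked[:n], ranked[-n:]
    p_upper = sum(correct for _, correct in upper) / n
    p_lower = sum(correct for _, correct in lower) / n
    return p_upper - p_lower

# 10 testtakers: the high scorers all get the item right,
# the low scorers all miss it -- a perfectly discriminating item.
results = [(95, 1), (90, 1), (88, 1), (80, 1), (75, 0),
           (70, 1), (65, 0), (60, 0), (55, 0), (50, 0)]
print(discrimination_index(results, fraction=0.3))  # d = 1.0
```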
Item-characteristic curve
A graphic representation of item difficulty and discrimination.
Plots the probability of a correct response (y-axis) against ability (x-axis).
What are biased test items?
Items that compromise fairness by favoring one group of testtakers over another.
How do ICCs help identify bias?
An item may be biased if its ICC differs across groups even when the groups' total scores are the same.
How should item analysis be handled for speed tests? What is the problem with analyzing speed tests?
Items at the end may be rushed or left unreached, leading to misleading interpretations of the analyses.
One solution: administer the test with generous time limits when collecting data for item analysis.
“think aloud” test administration
The testtaker verbalizes their thought process aloud while responding to items. A qualitative research tool used to see whether testtakers are using the intended line of reasoning.
Sensitivity review
A review of test items for fairness, looking for stereotypes, offensive language, etc.
Cross-validation
Revalidation of a test on a sample other than the one used for the original validation.
Validity shrinkage
The items retained for the final version of a test tend to show lower item validities when cross-validated on a new sample.
Co-validation
Conducting a validation study of two or more tests using the same sample of testtakers.
Co-norming
Creating or revising norms for two or more tests using the same sample.
Anchor Protocol
A test protocol scored by a highly authoritative scorer, used as a model for resolving scoring discrepancies.