Ch. 8 - Test Development Flashcards
5 stages of test development
1. conceptualization 2. construction 3. tryout 4. item analysis 5. test revision
test construction
process of writing possible test items
test tryout
administering a test to a representative sample of testtakers under conditions that simulate those of the final version of the test
some questions to ask when developing a new test
What is the test designed to measure? (what construct)
Is there a need for this test?
Who will use and take the test?
How will the test be administered?
Is there any potential for harm?
How will meaning be attributed to the scores on the test?
on a norm-referenced test, a good item is one that…
high scorers on the whole test get right
on a criterion referenced test, you need to do exploratory/pilot work with…
a group known to have mastered the skill
pilot work/study
Why is it done?
work done surrounding the creation of the prototype of a test
done to determine how to best measure a targeted construct
scaling
setting rules for assigning numbers in measurement; the process by which a measuring device is designed and calibrated and by which numbers (or other indices), AKA scale values, are assigned to different amounts of the thing being measured
stanine scale
all raw scores on the test can be transformed into scores that range from 1 to 9
age and grade-based scales
used when testtakers’ performance as a function of age or grade is of critical interest
Likert scale
very reliable; each item offers a 1-5 or 1-7 response scale (e.g., strongly disagree to strongly agree)
rating scales
provide what kind of data?
grouping of words, statements, or symbols on which judgments of the strength of a particular thing are indicated by the testtaker
ALL rating scales provide ordinal data
method of paired comparisons
testtakers are presented with a pair of stimuli and must choose between them.
provide ordinal data
comparative scaling
testtaker must judge a stimulus in comparison with every other stimulus on the scale
categorical scaling
stimuli are placed into one of two or more alternative categories that differ quantitatively with respect to some continuum. For ex: sort into “never justified” “sometimes justified” “always justified”
Guttman scale
items on it range sequentially from weaker to stronger expressions of the thing. (everyone who agrees with the stronger statement agrees with the weaker ones). used in consumer research
AKA scalogram analysis
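The “everyone who agrees with the stronger statement agrees with the weaker ones” property can be sketched as a quick check (a minimal illustration; the function name is hypothetical, and it assumes each response vector is already ordered weakest → strongest):

```python
# Scalogram sketch: a response vector fits a Guttman pattern if it is
# 1s followed by 0s (endorsing a stronger item implies endorsing all
# weaker ones). Assumes items ordered weakest -> strongest.
def fits_guttman_pattern(responses):
    return all(a >= b for a, b in zip(responses, responses[1:]))
```

E.g., `[1, 1, 0, 0]` fits the pattern, while `[1, 0, 1, 0]` (endorsing a stronger item but not a weaker one) does not.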
direct estimation vs indirect estimation
in direct estimation, you don’t need to transform a testtaker’s responses into some other scale. in indirect, you do need to transform those responses.
equal-appearing intervals method
the only rating scale described that has items that are interval in nature (ex: suicide scale) - there are presumed to be equal distances between the values on the scale (interval scale)
How many test items should an item pool contain for a multiple-choice test?
twice the number of items planned for the final version of the test
item pool
assembly of many items (from brainstorming all possibilities or many possibilities of test items)
selected-response format (item types)
multiple choice
matching
true-false (binary-choice item)
constructed-response format (item types)
completion item
short answer
essay
looking for synthesis of info
item bank
collection of GOOD test items that will continue to be selected, used, or rotated; the finalized version of the item pool
CAT
computerized adaptive testing - test-taking process wherein items presented to the testtaker are based on performance on previous items. may be displayed according to rules (e.g. only after you get 2 hard ones right, show the next level).
floor effect vs. ceiling effect reduced by?
CAT tends to reduce both. floor effect - failure to distinguish among testtakers at the low end of ability
ceiling effect - failure to distinguish among testtakers at the high end of ability
item branching
the ability of a computer to tailor the content and order of presentation of items on the basis of responses to previous items
CAT found to reduce…
the # of test items needed (by about 50%) and measurement error (by about 50%)
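The branching idea can be shown with a toy rule (a deliberate simplification, not an actual CAT algorithm; the function name is hypothetical and it assumes an item bank pre-sorted by difficulty):

```python
# Toy item-branching rule: move to a harder item after a correct
# response, an easier item after an error, staying inside the bank.
def next_item(current_index, was_correct, n_items):
    step = 1 if was_correct else -1
    return max(0, min(n_items - 1, current_index + step))
```

Real CAT systems select items to maximize information about the testtaker's ability estimate, but the tailoring principle is the same.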
class scoring AKA
category scoring; testtakers are placed in a certain group/class with other testtakers whose pattern of responses is similar in some way (e.g., diagnosis)
cumulative scoring model
higher the score, the higher the testtaker is on the thing being measured
ipsative scoring
compares a testtaker’s score on one thing within a test to another thing on the same test (thing = scale). comparing yourself with yourself.
e.g. Jon is cooler than he is smart BUT you can’t say Jon is cooler than Jenny
test tryout - how many people?
no fewer than 5 testtakers for EACH item on the test; the more the better (preferably 10 per item)
How do you tell whether a test item is good?
item analysis
generally, a good test item is answered correctly by high scorers on the test as a whole
item analysis
the statistical scrutiny of test data
what do you scrutinize in item analysis?
item’s: difficulty, validity, reliability, item discrimination (IDDRV)
item-difficulty index
expected values
denoted by lowercase italicized p
p = # of testtakers who answered the item correctly / total # of testtakers
greater the p, the easier the item
if >.90 - is it really needed? other than as a giveaway/warmup for those with test anxiety
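The p formula above is a one-liner in code (a minimal sketch; the function name is hypothetical):

```python
# Item-difficulty index: p = number answering correctly / total testtakers.
# item_scores is a list of 0/1 values, one per testtaker.
def item_difficulty(item_scores):
    return sum(item_scores) / len(item_scores)

# e.g. item_difficulty([1, 1, 1, 0]) -> 0.75 (a fairly easy item)
```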
what is an item-difficulty index called on a personality test?
item-endorsement index
item-reliability index
shows the internal consistency of a test
higher the value, the greater the test’s internal consistency
= s * r (item standard deviation * correlation between item score and total test score)
item-validity index
measures the degree to which a test is measuring what it’s supposed to measure
item-discrimination index
measures how adequately an item separates/discriminates between high and low scorers on the entire test
yields a lowercase italicized d
d compares performance on a particular item with performance on the upper and lower regions of a distribution of continuous test scores
the higher the d, the more high scorers (relative to low scorers) are answering the item correctly
what does a high d value mean
the higher the d, the more high scorers (relative to low scorers) are answering the test item correctly
Bonus: if it’s a -d, that means that more low-scorers answer it correctly than high scorers
Bonus: if d = 0, same # of high and low scorers get it right
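The upper/lower comparison behind d can be sketched as follows (the 27% grouping fraction is a common convention but an assumption here, and the function name is hypothetical):

```python
# Item-discrimination index: d = (correct in upper group - correct in
# lower group) / group size, where groups are the top and bottom
# fraction of testtakers ranked by total test score.
def item_discrimination(item_scores, total_scores, fraction=0.27):
    n = max(1, int(len(total_scores) * fraction))
    order = sorted(range(len(total_scores)), key=lambda i: total_scores[i])
    lower, upper = order[:n], order[-n:]
    u = sum(item_scores[i] for i in upper)
    low = sum(item_scores[i] for i in lower)
    return (u - low) / n
```

d ranges from -1 (only low scorers get it right) through 0 (no discrimination) to +1 (only high scorers get it right).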
what does a high p mean
the test item is easy
analysis of item alternatives
for multiple-choice items, see how many people answered the distractors and evaluate them appropriately (e.g., maybe too distracting/wording needs to be changed)
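Tallying the alternatives and flagging suspect distractors can be sketched like this (an illustration only; the function name and flagging rule are hypothetical):

```python
from collections import Counter

# Analysis of item alternatives: count how often each alternative was
# chosen and flag any distractor picked MORE often than the keyed
# (correct) answer -- a sign it may be too distracting or miskeyed.
def analyze_alternatives(responses, keyed_answer):
    counts = Counter(responses)
    flagged = sorted(alt for alt, n in counts.items()
                     if alt != keyed_answer and n > counts[keyed_answer])
    return counts, flagged
```

Distractors chosen by almost no one also merit review, since they contribute nothing to the item.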
item-characteristic curve
ICC - a graphic representation of item difficulty and discrimination
steep slope in ICC means?
greater item discrimination
“good” item looks like what in ICC
straight line with a slope
“good” item for a cutoff-score test or criterion-based test
looks like the top of an F
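An ICC can be illustrated numerically with a two-parameter logistic model (an assumption for illustration; the chapter describes ICCs graphically, not by this formula):

```python
import math

# Two-parameter logistic ICC: probability of a correct response at
# ability level theta, where a = slope (discrimination) and
# b = difficulty (location of the curve's midpoint).
def icc(theta, a, b):
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))
```

A larger a gives a steeper slope (better discrimination), and at theta = b the probability of a correct response is exactly .5.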
guessing - what issues does it present in item analysis?
- guesses are not made totally randomly
- how do we deal with omitted items?
- some people are luckier guessers than others
item fairness
the degree (if any) to which a test item is biased
biased test item
favors one particular group of examinees even when differences in group ability are controlled for
if an item is fair, its ICC should…
not be significantly different for different groups regardless of ability
item analysis in speed tests
yield misleading or uninterpretable results because items that are closer to the end appear more difficult just because few people were able to finish them
what are methods of qualitative item analysis
interviews, group discussions, “think aloud” test administration (sheds light on thought patterns), and sensitivity reviews
sensitivity reviews
conducted by an expert panel; items on a test are examined for fairness to all prospective testtakers and flagged for offensive language and stereotypes
test revision (as a stage in test development) - strategy
characterize each item according to its strengths and weaknesses
consider the purpose of the test - if for hiring and firing, eliminate biased items
if for culling most skilled performers - get items with the best item discrimination to ID the best of the best
standardization
process used to introduce objectivity and uniformity into test administration, scoring, and interpretation
What do you need to do after item analysis?
administer the revised test under standardized conditions, then cross-validation
When should you revise a test?
- stimulus materials look dated
- dated vocabulary or offensive language
- test norms no longer adequate (group membership changes)
- age-related shifts in abilities over time
- to improve the reliability or validity of the test
- the theory on which the test was based has improved
steps to revising an existing test
- all the steps to make a new one (conceptualization, construction, tryout, item analysis, revision) + determining whether there is equivalence between the old and new versions of the test. scores likely will not mean the same thing (use item analysis to evaluate the stability of items between revisions of the same test)
cross-validation
what is inevitable?
re-validation of a test on a sample of testtakers other than the original group the test was found to be valid on. (aCROSS groups)
validity shrinkage is inevitable
co-validation
test validation process conducted on two or more tests using the same sample of testtakers (economical: test subjects are identified once, saving personnel costs)
co-norming
benefits?
co-validation on two tests and creating norms or revising existing norms
good for test users if the tests are often used together because they are normed on the same population (differences due to sampling error are essentially eliminated)
like co-validation, saves money
quality assurance in test revision
confirming that a test is administered and scored the same way across examiners
anchor protocol
test protocol scored by a highly trained scorer, designed as a model for scoring and mechanism for resolving scoring discrepancies
protocol drift
the discrepancy between an anchor protocol and another scorer’s protocol
differential item functioning
(DIF) - when an item functions differently in one group of testtakers as compared to another group of testtakers known to have the same/similar level of the underlying trait. This means that for some reason respondents from different groups have different probabilities of endorsing an item as a function of their group membership (ex: Asian women fear cultural shame around feeling depressed)