Ch. 8 - Test Development Flashcards
5 stages of test development
1. conceptualization 2. construction 3. tryout 4. item analysis 5. test revision
test construction
process of writing possible test items
test tryout
administering a test to a representative sample of testtakers under conditions that simulate those of the final version of the test
some questions to ask when developing a new test
What is the test designed to measure? (what construct)
Is there a need for this test?
Who will use and take the test?
How will the test be administered?
Is there any potential for harm?
How will meaning be attributed to the scores on the test?
on a norm-referenced test, a good item is one that…
high scorers on the whole test tend to get right (and low scorers tend to get wrong)
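One way to put a number on "high scorers get it right" is to compare the proportion passing the item in the top-scoring and bottom-scoring groups. The sketch below is only an illustration: the function name `discrimination_index`, the `fraction` cutoff, and the sample data are assumptions, not part of the flashcards.

```python
def discrimination_index(item_correct, total_scores, fraction=0.27):
    """Proportion passing the item in the upper group minus the lower group.

    item_correct: 1/0 per testtaker for one item.
    total_scores: whole-test score per testtaker (same order).
    fraction: share of testtakers in each extreme group (illustrative choice).
    """
    n = len(total_scores)
    k = max(1, int(n * fraction))
    # Indices of testtakers sorted from lowest to highest total score.
    order = sorted(range(n), key=lambda i: total_scores[i])
    lower, upper = order[:k], order[-k:]
    p_upper = sum(item_correct[i] for i in upper) / k
    p_lower = sum(item_correct[i] for i in lower) / k
    return p_upper - p_lower


# Hypothetical data: ten testtakers, one item.
totals = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
good_item = [0, 0, 1, 0, 1, 0, 1, 1, 1, 1]
bad_item = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0]
d_good = discrimination_index(good_item, totals)
d_bad = discrimination_index(bad_item, totals)
```

A value near +1 means high scorers pass the item and low scorers fail it; a negative value flags an item that low scorers pass more often than high scorers.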
on a criterion-referenced test, you need to do exploratory/pilot work with…
a group known to have mastered the skill
pilot work/study
Why is it done?
work done surrounding the creation of the prototype of a test
done to determine how to best measure a targeted construct
scaling
setting rules for assigning numbers in measurement; the process by which a measuring device is designed and calibrated and by which numbers (or other indices), AKA scale values, are assigned to different amounts of the thing being measured.
stanine scale
all raw scores on the test can be transformed into scores that range from 1 to 9
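A minimal sketch of the raw-score-to-stanine transformation, assuming the common z-score approach (stanines have a mean of 5 and a standard deviation of about 2). The function name `to_stanines` and the sample scores are illustrative; a published test would use its own norm tables.

```python
import math
from statistics import mean, pstdev


def to_stanines(raw_scores):
    """Map raw scores to stanines (1 to 9) through z-scores."""
    m = mean(raw_scores)
    sd = pstdev(raw_scores)
    if sd == 0:
        # No spread: everyone lands in the middle stanine.
        return [5 for _ in raw_scores]
    stanines = []
    for x in raw_scores:
        z = (x - m) / sd
        s = math.floor(2 * z + 5 + 0.5)  # 2z + 5, rounded half up
        stanines.append(max(1, min(9, s)))  # clamp to the 1-9 range
    return stanines


scores = [10, 20, 30, 40, 50]  # hypothetical raw scores
stanines = to_stanines(scores)
```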
age and grade-based scales
used when testtakers’ performance as a function of age or grade is of critical interest
Likert scale
usually very reliable; each item typically offers 5 or 7 response alternatives (scored 1-5 or 1-7), usually on an agree-disagree continuum
rating scales
provide what kind of data?
grouping of words, statements, or symbols on which judgments of the strength of a particular thing are indicated by the testtaker
ALL rating scales provide ordinal data
method of paired comparisons
testtakers are presented with a pair of stimuli and must choose between them.
provide ordinal data
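A rough sketch of how paired-comparison choices turn into ordinal data: every stimulus is paired with every other one, each choice counts as a "win", and the win counts give a rank order. The stimuli, the `severity` lookup standing in for a testtaker's judgments, and the function names are all hypothetical.

```python
from itertools import combinations


def rank_by_paired_comparisons(stimuli, prefer):
    """Present every pair once and rank stimuli by how often each is chosen."""
    wins = {s: 0 for s in stimuli}
    for a, b in combinations(stimuli, 2):
        wins[prefer(a, b)] += 1
    ranking = sorted(stimuli, key=lambda s: wins[s], reverse=True)
    return ranking, wins


# Hypothetical testtaker who judges which behavior is more serious.
severity = {"littering": 1, "shoplifting": 2, "arson": 3}


def choose_more_serious(a, b):
    return a if severity[a] > severity[b] else b


ranking, wins = rank_by_paired_comparisons(list(severity), choose_more_serious)
```

The result is a rank order only: it says arson was chosen over everything else, not how much more serious the testtaker considers it.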
comparative scaling
testtaker must judge a stimulus in comparison with every other stimulus on the scale
categorical scaling
stimuli are placed into one of two or more alternative categories that differ quantitatively with respect to some continuum. For ex: sort into “never justified” “sometimes justified” “always justified”
Guttman scale
items on it range sequentially from weaker to stronger expressions of the thing. (everyone who agrees with the stronger statement agrees with the weaker ones). used in consumer research
data from the scale are analyzed with scalogram analysis
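The cumulative property of a Guttman scale can be checked mechanically: with items ordered from weakest to strongest, a consistent testtaker never agrees with a stronger item after disagreeing with a weaker one. This is a minimal sketch with a hypothetical function name, not a full scalogram analysis.

```python
def is_guttman_consistent(responses):
    """Return True if a response pattern fits the cumulative Guttman pattern.

    responses: True/False agreement with items ordered weakest to strongest.
    """
    seen_disagree = False
    for agrees in responses:
        if not agrees:
            seen_disagree = True
        elif seen_disagree:
            # Agreed with a stronger item after rejecting a weaker one.
            return False
    return True


consistent = is_guttman_consistent([True, True, False, False])
inconsistent = is_guttman_consistent([True, False, True, False])
```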
direct estimation vs indirect estimation
in direct estimation, you don’t need to transform a testtaker’s responses into some other scale. in indirect, you do need to transform those responses.
equal-appearing intervals method
the only rating scale described that has items that are interval in nature (ex: suicide scale) - there are presumed to be equal distances between the values on the scale (interval scale)
How many test items should an item pool contain for a multiple-choice test?
about twice the number of items planned for the final version of the test
item pool
assembly of many items (from brainstorming all possibilities or many possibilities of test items)
selected-response format (item types)
multiple choice
matching
true-false (binary-choice item)
constructed-response format (item types)
completion item
short answer
essay
looking for synthesis of info
item bank
collection of GOOD test items that continue to be selected, used, or rotated. Essentially the finalized version of the item pool.
CAT
computerized adaptive testing - test-taking process wherein the items presented to the testtaker are based on performance on previous items. Items may be displayed according to rules (e.g. only after you get 2 hard ones right, show the next level).
floor effect vs. ceiling effect reduced by?
CAT tends to reduce these. floor effect - not distinguishing between low scorers/low ability
ceiling effect - not distinguishing between high scorers/high ability
item branching
the ability of a computer to tailor the content and order of presentation of items on the basis of responses to previous items
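A toy sketch of item branching, using the kind of rule mentioned on the CAT card (move to the next level only after 2 correct answers in a row). The function name, the `streak_to_advance` parameter, and the item labels are assumptions for illustration; this version never moves back down a level, stops when the current level runs out of items, and expects at least one level of items.

```python
def run_branching(items_by_level, answer_is_correct, streak_to_advance=2):
    """Present items level by level, advancing after a streak of correct answers.

    items_by_level: list of item lists, ordered easiest level first.
    answer_is_correct: callable(item) -> bool, standing in for the testtaker.
    Returns the items presented (in order) and the final level reached.
    """
    level = 0
    streak = 0
    presented = []
    positions = [0] * len(items_by_level)  # next unused item per level
    while positions[level] < len(items_by_level[level]):
        item = items_by_level[level][positions[level]]
        positions[level] += 1
        presented.append(item)
        if answer_is_correct(item):
            streak += 1
            at_top = level == len(items_by_level) - 1
            if streak == streak_to_advance and not at_top:
                level += 1
                streak = 0
        else:
            streak = 0  # a miss resets the streak
    return presented, level


# Hypothetical item labels and a testtaker who misses one medium item.
items = [["e1", "e2", "e3"], ["m1", "m2", "m3"], ["h1", "h2"]]
knows = {"e1", "e2", "m1", "m3"}
presented, final_level = run_branching(items, lambda item: item in knows)
```

Here the testtaker skips `e3` entirely after two correct easy items, which is the tailoring of content and order that the card describes.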