Test Development Flashcards
It is the product of the thoughtful and sound application of established principles of test construction
Test development
1st step of Test development
Test conceptualization (what, how, who, when, should?)
Preliminary research surrounding the creation of a prototype of the test
Pilot study/research
2nd step of test development
Test construction
Process of setting rules for assigning numbers in measurement
scaling
Credited for being at the forefront of efforts to develop methodologically sound scaling methods
L. L. Thurstone
Type of scale that consists of groupings of words, statements, or symbols on which judgments of the strength of a particular trait, attitude, or emotion are indicated by the testtaker
Rating scale
A scale where the final score is obtained by summing the ratings across all items (e.g. Likert Scale)
Summative scale
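A minimal sketch of summative scoring, assuming five hypothetical items rated 1–5 (not from any real instrument):

```python
# Minimal sketch of summative (Likert-type) scoring.
# The ratings below are hypothetical: 5 items rated 1-5.
ratings = [4, 5, 3, 4, 2]

# Summative score: the sum of the ratings across all items.
total_score = sum(ratings)
print(total_score)  # 18
```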
A scale where testtakers are presented with pairs of stimuli that they are asked to compare
Method of paired comparison
Entails sorting tasks or judgments of a stimulus in comparison with every other stimulus on the scale (e.g., sort items from most justifiable to least justifiable)
Comparative scaling (ordinal)
Stimuli placed into one of two or more alternative categories that differ quantitatively with respect to some continuum
categorical scaling
Respondents who agree with stronger statements of the attitude will also agree with the milder statements
Guttman scale (ordinal)
Item analysis procedure and approach to test development that involves a graphic mapping of a testtaker’s responses
Scalogram analysis
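A minimal sketch of the cumulative logic a scalogram analysis looks for; the item ordering and response patterns are hypothetical:

```python
# Minimal sketch of a Guttman (cumulative) pattern check.
# Items are assumed ordered from mildest to strongest statement;
# 1 = endorsed, 0 = not endorsed. Data are hypothetical.
def is_guttman_pattern(responses):
    # A perfect Guttman pattern has all endorsements before all
    # refusals when items are ordered mild -> strong (e.g., 1,1,1,0,0).
    first_refusal = responses.index(0) if 0 in responses else len(responses)
    return all(r == 0 for r in responses[first_refusal:])

print(is_guttman_pattern([1, 1, 1, 0, 0]))  # True: consistent with the scale
print(is_guttman_pattern([1, 0, 1, 0, 0]))  # False: endorses a stronger item but not a milder one
```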
Scaling method used to obtain data that are presumed to be interval in nature
Equal-appearing intervals (Thurstone)
Reservoir from which items will or will not be drawn for the final version of the test
Item pool
Parts of a multiple-choice item
stem (sentence)
correct option
distractors/foils
Also called a short-answer item
Completion item
Limitations of essay items
Focus on a limited area; subjectivity in scoring
Relatively large and easily accessible collection of test questions
item bank
Interactive, computer-administered test taking process wherein items presented to the testtaker are based in part on the testtaker’s performance on previous items
Computerized-adaptive testing (CAT)
Ability of the computer to tailor the content and order of the presentation of test items on the basis of responses to previous items
Item branching
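A minimal sketch of one simple branching rule (administer the unused item whose difficulty is nearest the current ability estimate); the item bank, difficulties, and update step are hypothetical simplifications, not any specific CAT algorithm:

```python
# Minimal sketch of item branching in computerized-adaptive testing.
# Item names, difficulties (0-1), and the crude ability update are
# all hypothetical simplifications.
item_bank = {"i1": 0.2, "i2": 0.4, "i3": 0.5, "i4": 0.7, "i5": 0.9}

def next_item(ability, administered):
    # Branch to the unadministered item whose difficulty is closest
    # to the current ability estimate.
    candidates = {k: v for k, v in item_bank.items() if k not in administered}
    return min(candidates, key=lambda k: abs(candidates[k] - ability))

ability, administered = 0.5, []
for answered_correctly in [True, False, True]:      # hypothetical responses
    item = next_item(ability, administered)
    administered.append(item)
    ability += 0.1 if answered_correctly else -0.1  # step the estimate up or down
    print(item, round(ability, 2))
```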
Most commonly used scoring model
Cumulative scoring
A type of scoring used by some diagnostic systems wherein individuals must exhibit a certain number of symptoms to qualify for a specific diagnosis
Class/categorical scoring
Compares a testtaker’s score on one scale within a test to a score on another scale within that same test
Ipsative scoring
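A minimal sketch contrasting cumulative and ipsative scoring; scale names and item scores are hypothetical:

```python
# Minimal sketch contrasting cumulative and ipsative scoring.
# Scale names and item scores are hypothetical.
scores = {"scale_A": [3, 4, 5], "scale_B": [2, 2, 3]}

# Cumulative scoring: the higher the total, the more of the trait.
cumulative = {s: sum(items) for s, items in scores.items()}

# Ipsative scoring: compare one scale to another *within* the same
# testtaker, e.g., the difference between the two scale totals.
ipsative_contrast = cumulative["scale_A"] - cumulative["scale_B"]

print(cumulative)         # {'scale_A': 12, 'scale_B': 7}
print(ipsative_contrast)  # 5: scale_A is relatively stronger for this person
```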
3rd step in test development
test tryout
4th step in test development
Item analysis
Items that spur motivation and a positive testtaking attitude and lessen anxiety
Giveaway items
Percentage of people who said yes to, agreed with, or endorsed the item (not who passed the item)
Item endorsement index
Range of the optimal item difficulty
0.3–0.8 (the higher the index, the easier the item)
Formula for OID
Midpoint between chance performance and 1.00: OID = (chance + 1.00) / 2
OID for true-false item
0.75 (chance=0.5)
OID for multiple choice item 4 options
0.63 (chance=0.25)
OID for multiple choice item 5 options
0.60 (chance=0.2)
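All three values above come from the midpoint formula; a minimal sketch (the function name is just for illustration):

```python
# Optimal item difficulty: midpoint between chance performance and 1.00.
def optimal_item_difficulty(chance):
    return (chance + 1.0) / 2

print(optimal_item_difficulty(0.5))   # 0.75         true-false
print(optimal_item_difficulty(0.25))  # 0.625 ~ 0.63 4-option multiple choice
print(optimal_item_difficulty(0.2))   # 0.6          5-option multiple choice
```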
Equal to the product of the item-score standard deviation and the correlation between the item score and the total test score
Item reliability index
Item Analysis Techniques for questions with right/wrong answers
Item Difficulty
Item Discrimination
Distractor Analysis
Item Analysis Techniques for either right/wrong answers or self-report scales
Item reliability index
Cronbach’s alpha
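A minimal sketch of Cronbach's alpha from its standard formula, alpha = k/(k-1) × (1 − sum of item variances / variance of total scores); the item scores are hypothetical:

```python
# Minimal sketch of Cronbach's alpha. Item scores are hypothetical.
from statistics import pvariance

items = [
    [4, 3, 5, 2, 4],  # item 1 across 5 testtakers
    [3, 3, 4, 2, 5],  # item 2
    [5, 2, 4, 1, 4],  # item 3
]
k = len(items)
totals = [sum(col) for col in zip(*items)]  # total score per testtaker
alpha = (k / (k - 1)) * (1 - sum(pvariance(i) for i in items) / pvariance(totals))
print(round(alpha, 2))  # 0.87
```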
Equal to the product of the item-score standard deviation and the correlation between the item score and the criterion score
Item validity index
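The item-reliability index above and the item-validity index are parallel computations (item-score SD times a correlation); a minimal sketch of both, with hypothetical scores:

```python
# Minimal sketch of the item-reliability and item-validity indexes:
# item-score SD times the correlation with the total test score
# (reliability) or with an external criterion score (validity).
from statistics import mean, pstdev

def pearson_r(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
    return cov / (pstdev(x) * pstdev(y))

item      = [1, 0, 1, 1, 0, 1]              # one 0/1 item, 6 testtakers
total     = [48, 35, 52, 40, 30, 55]        # total test scores
criterion = [3.1, 2.0, 3.6, 2.8, 1.9, 3.9]  # hypothetical criterion (e.g., GPA)

item_reliability_index = pstdev(item) * pearson_r(item, total)
item_validity_index    = pstdev(item) * pearson_r(item, criterion)
print(round(item_reliability_index, 3), round(item_validity_index, 3))
```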
How adequately an item separates or discriminates between high scorers and low scorers on the entire test
Item discrimination index
What are the key properties of the Item-discrimination index?
- Symbolized by d
- Compares performance on a particular item by the high-ability group and the low-ability group (i.e., the top 27% and the bottom 27%)
- Items that discriminate well have a high positive d value (to a maximum of +1)
- A negative d value is a red flag: it means low scorers are doing better on that item than high scorers
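A minimal sketch of d with hypothetical counts:

```python
# Minimal sketch of the item-discrimination index d for one item.
# u = high scorers (top 27%) who passed the item,
# l = low scorers (bottom 27%) who passed it,
# n = testtakers per group. Counts are hypothetical.
def discrimination_index(u, l, n):
    return (u - l) / n

print(round(discrimination_index(24, 9, 27), 2))   #  0.56: discriminates well
print(round(discrimination_index(10, 16, 27), 2))  # -0.22: red flag
```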
The quality of each alternative within a multiple-choice item can be readily assessed with reference to the comparative performance of upper and lower scorers
Analysis of item alternatives (the test developer can get an idea of the effectiveness of a distractor by means of a simple eyeball test)
Graphic representation of item difficulty and item discrimination
Item characteristic curve (the steeper the slope, the greater the item discrimination)
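One common way to trace an ICC is with a logistic model; a sketch with hypothetical slope (discrimination) and difficulty values:

```python
# Minimal sketch of an item-characteristic curve via a two-parameter
# logistic model: P(correct) = 1 / (1 + exp(-a * (theta - b))).
# a (slope/discrimination) and b (difficulty) are hypothetical.
import math

def icc(theta, a=1.5, b=0.0):
    return 1 / (1 + math.exp(-a * (theta - b)))

for theta in [-2, -1, 0, 1, 2]:         # ability levels
    print(theta, round(icc(theta), 2))  # a steeper slope a -> sharper rise
```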
Test developer addresses the problem of guessing by including in the test manual…
- explicit instructions regarding this point for the examiner to convey to the examinees (e.g., instruct examinees to answer only if certain)
- specific instructions for scoring and interpreting omitted items
Can be used to identify biased items
item characteristic curves
Different shapes of item-characteristic curves for different groups when 2 groups do not differ in total test score
Differential item functioning
Rely primarily on verbal rather than mathematical procedures to explore how individual test items work
Qualitative item analysis (through group discussions, interviews)
Approach to cognitive assessment that entails having respondents verbalize thoughts as they occur
think aloud test administration (one-on-one basis)
Conducted during the test development process in which items are examined for fairness to all prospective testtakers and for the presence of offensive language, stereotypes or situations
Sensitivity review
last step in test development
test revision
Test revision in the life cycle of an existing test
APA suggests that an existing test be kept in its present form as long as it remains useful, but that it should be revised when significant changes in the domain represented, or new conditions of test use and interpretation, make the test inappropriate for its intended use
Revalidation of a test on a sample of testtakers other than those on whom test performance was originally found to be a valid predictor of some criterion
cross-validation (a key step in test development)
Decrease in item validities that inevitably occurs after cross-validation of findings
Validity shrinkage (expected and integral to the test development process)
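A minimal sketch of the idea: a validity coefficient estimated on the original sample is re-estimated on a fresh sample, where it typically comes out lower; all scores are hypothetical:

```python
# Minimal sketch of cross-validation and validity shrinkage.
# Test and criterion scores for both samples are hypothetical.
from statistics import mean, pstdev

def pearson_r(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
    return cov / (pstdev(x) * pstdev(y))

test1      = [10, 12, 15, 11, 14, 9]         # original sample
criterion1 = [2.4, 2.6, 3.4, 2.0, 2.9, 1.9]
test2      = [13, 10, 15, 12, 9, 14]         # new sample
criterion2 = [2.5, 2.4, 3.1, 2.4, 2.2, 2.6]

print(round(pearson_r(test1, criterion1), 2))  # validity in original sample
print(round(pearson_r(test2, criterion2), 2))  # lower here: shrinkage
```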
Test validation conducted on two or more tests using the same sample of testtakers
co-validation (also referred to as co-norming)
Examiners undergo training in test administration using the test manual
Quality assurance
A test protocol scored by a highly authoritative scorer that is designed as a model for scoring and a mechanism for resolving scoring discrepancies; ensure consistency in scoring
anchor protocol
A discrepancy between scoring in an anchor protocol and the scoring of another protocol
scoring drift
Evaluate how well an individual item is working to measure different levels of the underlying construct
IRT information curves
Item functions differently in one group of testtakers as compared to another group of testtakers known to have the same level of the underlying trait (by culture, gender, age)
Differential item functioning (DIF)
Test developers scrutinize group-by-group item response curves looking for DIF items
DIF analysis
Items that respondents from different groups, at the same level of the underlying trait, have different probabilities of endorsing as a function of their group membership
DIF items
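A minimal sketch of a rough DIF screen: at the same (matched) level of the underlying trait, do two groups endorse the item at different rates? The counts are hypothetical:

```python
# Minimal sketch of a DIF screen. Testtakers in the two groups are
# assumed matched on total score (a rough proxy for trait level);
# the endorsement counts are hypothetical.
def endorsement_rate(endorsed, n):
    return endorsed / n

group_a = endorsement_rate(endorsed=40, n=50)  # reference group
group_b = endorsement_rate(endorsed=22, n=50)  # focal group

# A large gap at the same trait level flags the item for DIF review.
print(round(group_a - group_b, 2))  # 0.36 gap -> candidate DIF item
```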
An advantage of the response format of the test
Great breadth (can cover many topics)