Test Development Flashcards
An emerging social phenomenon or pattern of behavior may serve as the stimulus for developing a new test, as may the need to assess mastery in an emerging occupation or profession.
Test Conceptualization
Criterion-referenced testing and assessment are commonly employed in _ and _ contexts.
Licensing
Educational context
The items that best discriminate between 2 groups would be considered the _ items.
Good items
A good item on a _ test is an item for which high scorers on the test respond correctly and low scorers respond incorrectly.
Norm-referenced test
The preliminary research surrounding the creation of a prototype of the test. It is done to evaluate whether test items should be included in the final form of the instrument.
Pilot work
The process by which a measuring device is designed and calibrated and by which numbers are assigned to different amounts of the trait, attribute, or characteristic being measured.
Scaling
He is credited with being at the forefront of efforts to develop methodologically sound scaling methods.
L. L. Thurstone
Types of scales
Age-based scale
Grade-based scale
Stanine scale
A type of scale where all raw scores on the test are to be transformed into scores that can range from 1-9.
Stanine scale
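The stanine transformation above can be sketched in code. This is a hypothetical illustration, assuming the common shortcut of linearly rescaling z-scores to a mean of 5 and SD of 2, then clipping to the 1-9 range; operational stanine tables assign scores by fixed percentile bands instead.

```python
# Sketch: converting raw scores to stanines via z-scores
# (assumed shortcut: stanine = round(2z + 5), clipped to 1-9).
from statistics import mean, pstdev

def to_stanines(raw_scores):
    m, s = mean(raw_scores), pstdev(raw_scores)
    stanines = []
    for x in raw_scores:
        z = (x - m) / s                      # standardize the raw score
        st = round(z * 2 + 5)                # rescale to mean 5, SD 2
        stanines.append(min(9, max(1, st)))  # clip to the 1-9 range
    return stanines

print(to_stanines([10, 12, 14, 16, 18, 20, 22, 24, 26]))
```

Note that the middle raw score maps to a stanine of 5 and the extremes are pulled toward, but never outside, 1 and 9.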
The 3 scaling methods
Rating Scale
Summative scale
Likert scale
A grouping of words, statements or symbols on which judgments of the strength of a particular trait, attitude or emotion are indicated by the test taker.
Rating Scale
Test score is obtained by summing the rating across all the items.
Summative scale
A type of summative rating scale that is used extensively in psychology to scale attitudes. Each item presents the testtaker with five alternative responses, usually on an agree-disagree or approve-disapprove continuum.
Likert scale
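Summative (Likert) scoring can be sketched as follows. This is a minimal illustration, assuming a 1-5 response format and the common practice of reverse-keying negatively worded items; the item indices and scores are made up.

```python
# Sketch of summative scoring: each item is rated 1-5 on an
# agree-disagree continuum, reverse-keyed items are flipped,
# and the total score is the sum of ratings across all items.

def score_likert(responses, reverse_keyed=()):
    total = 0
    for i, r in enumerate(responses):
        if i in reverse_keyed:
            r = 6 - r        # flip a 1-5 rating (5 becomes 1, etc.)
        total += r
    return total

# Hypothetical example: item 2 is worded negatively, so it is reverse-scored.
print(score_likert([5, 4, 1, 3], reverse_keyed={2}))  # → 17
```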
When one dimension is presumed to underlie the ratings.
Unidimensional
When more than 1 dimension is thought to guide the testtaker’s responses.
Multidimensional
What are the 4 scaling methods that produce ordinal data?
Method of paired comparison
Comparative scaling
Categorical scaling
Guttman scale
A scaling method that produces ordinal data. Testtakers are presented with pairs of stimuli which they are asked to compare, and they must select one of the stimuli according to some rule. Then they receive a higher score for selecting the option deemed more justifiable by the majority of a group of judges.
Method of Paired comparison
A scaling method that produces ordinal data. Stimuli such as printed cards, drawings, photographs or other objects are typically presented to testtakers for evaluation and must be sorted from most justifiable to least justifiable. It can also be accomplished through the use of a list of items on a sheet of paper.
Comparative scaling
A scaling method that produces ordinal data. Stimuli are placed into one of two or more alternative categories that differ quantitatively with respect to some continuum.
Categorical scaling
A scaling method that produces ordinal data. Items on it range sequentially from weaker to stronger expressions of the attitude, belief or feeling being measured. All respondents who agree with the stronger statements of the attitude will also agree with milder statements.
Guttman scale
The resulting data of a Guttman scale are analyzed by means of this. This is an item-analysis procedure and approach to test development that involves a graphic mapping of a testtaker’s responses.
Scalogram Analysis
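The cumulative property of a Guttman scale can be checked programmatically. This is a hypothetical sketch, assuming items are ordered from mildest to strongest and responses are coded 1 (agree) and 0 (disagree); a consistent pattern is a run of agreements followed only by disagreements.

```python
# Sketch: a Guttman-consistent response pattern (items ordered mild
# to strong) looks like 1,1,1,0,0 — once a respondent disagrees,
# no agreement with a stronger statement should follow.

def is_guttman_pattern(responses):
    seen_zero = False
    for r in responses:
        if r == 0:
            seen_zero = True
        elif seen_zero:       # an agreement after a disagreement
            return False
    return True

print(is_guttman_pattern([1, 1, 1, 0, 0]))  # consistent pattern
print(is_guttman_pattern([1, 0, 1, 0, 0]))  # inconsistent pattern
```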
The reservoir from which items will or will not be drawn for the final version of the test. It includes items available for use as well as new items created especially for the test.
Item pool
It is the form, plan, structure, arrangement and layout of individual test items.
Item format
The two types of item format:
Selected response format
Constructed response format
It requires testtakers to select a response from a set of alternative responses.
Selected response format
3 Types of selected response format:
Multiple choice format
Matching item
True-false
Several incorrect alternatives or options in a multiple choice format are referred to as _.
Distractors or foils
A selected response format where the testtaker is presented with 2 columns where they have to determine which response is best associated with which premise.
Matching item
A multiple choice item format that contains only two possible responses (binary choice) (agree or not, yes or no, right or wrong, fact or opinion). It usually takes the form of a sentence.
True-false
3 types of constructed response items:
Completion item
Short-answer item
Essay
A constructed response format that requires the examinee to provide a word or phrase that completes a sentence.
Completion item
A constructed response format where a word, term, sentence or paragraph may qualify as an answer.
Short-answer item
A constructed response format that requires the testtaker to respond to a question by writing a composition, typically one that demonstrates recall of facts, understanding, analysis and/or interpretation.
Essay
A relatively large and easily accessible collection of test questions.
Item bank
An interactive, computer-administered test-taking process wherein items presented to the testtaker are based in part on the test takers’ performance on previous items.
Computerized adaptive testing
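The branching idea behind computerized adaptive testing can be sketched simply. This is only an illustration of the principle, with made-up step sizes; real CAT systems select items using item response theory, not a fixed increment.

```python
# Sketch of the adaptive principle: after each response, move to a
# harder item if the answer was correct and an easier item if not.

def next_difficulty(current, correct, step=0.5):
    return current + step if correct else current - step

level = 0.0                          # start at average difficulty
for answer in [True, True, False]:   # hypothetical response sequence
    level = next_difficulty(level, answer)
print(level)  # → 0.5
```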
It refers to the diminished utility of an assessment tool for distinguishing testtakers at the low end of the ability, trait or other attribute being measured. Testtakers who have not yet achieved such ability might fail all the items.
Floor effect
It refers to the diminished utility of an assessment tool for distinguishing testtakers at the high end of the attribute being measured. Testtakers who answer all of the items correctly are likely to conclude that the test was too easy.
Ceiling effect
The ability of the computer to tailor the content and order of presentation of test items on the basis of responses to previous items.
Item branching
What are the 3 different scoring models?
Cumulative model
Class scoring or Category scoring
Ipsative scoring
Scoring model where the higher the score on the test, the higher the testtakers are on the ability or characteristic that the test purports to measure.
Cumulative model
Scoring model where testtakers earn credit toward placement in a particular class or category with other testtakers whose patterns of responses are presumably similar in some way. Used by some diagnostic systems.
Class scoring or Category scoring
Scoring model that compares a testtaker’s score on one scale within a test to another scale within that same test.
Ipsative scoring
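Ipsative scoring can be illustrated with a short sketch. This is a hypothetical example, assuming the simplest within-person comparison (each scale expressed relative to the testtaker's own mean across scales); the scale names and scores are invented.

```python
# Sketch of ipsative scoring: a testtaker's score on one scale is
# compared to that same testtaker's scores on other scales within
# the same test, rather than to other people's scores.

def ipsative_profile(scale_scores):
    m = sum(scale_scores.values()) / len(scale_scores)
    # Each scale is expressed relative to the testtaker's own mean.
    return {scale: score - m for scale, score in scale_scores.items()}

print(ipsative_profile({"dominance": 18, "affiliation": 12, "autonomy": 15}))
```

The resulting profile says only that this person is higher on one scale than another, not how they compare with other testtakers.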
The informal rule of thumb for test tryout is that there should be no fewer than _ subjects and preferably as many as _ for each item on the test.
5
10
Factors that actually are just artifacts of the small sample size.
Phantom factors
A lowercase italic “p” is used to denote _.
Item Difficulty
The larger the item difficulty index, the _ the item.
Easier
The optimal average item difficulty for maximum discrimination among the abilities of testtakers.
Approximately 0.5
The range of difficulty for individual items on the test.
0.3-0.8
For the possible effect of guessing, the optimal average item difficulty is usually the midpoint between _ and the chance success proportion.
1.00
The probability of answering correctly by random guessing.
Chance success proportion
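The difficulty statistics above can be computed directly. This is a minimal sketch, assuming dichotomously scored items (1 = correct, 0 = incorrect) and a multiple-choice format where the chance success proportion is 1 divided by the number of options.

```python
# Sketch: item-difficulty index p is the proportion of testtakers
# answering the item correctly; the guessing-adjusted optimal
# difficulty is the midpoint between 1.00 and the chance success
# proportion (1 / number of response options).

def item_difficulty(responses):
    return sum(responses) / len(responses)   # 1 = correct, 0 = incorrect

def optimal_difficulty(n_options):
    chance = 1 / n_options                   # chance success proportion
    return (1.00 + chance) / 2               # midpoint with 1.00

print(item_difficulty([1, 1, 1, 0, 0]))      # p = 0.6
print(optimal_difficulty(4))                 # 4-option item → 0.625
```

For a four-option multiple-choice item, chance success is .25, so the optimal average item difficulty is (1.00 + .25) / 2 = .625.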
The higher the item-reliability index, the greater the test’s _.
Internal consistency
A statistical tool useful in determining whether items on a test appear to be measuring the same thing.
Factor analysis
It is a statistic designed to provide an indication of the degree to which a test is measuring what it purports to measure.
Item-validity index
The higher the item validity index, the greater the test’s _.
Criterion-related validity
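The item-validity index can be sketched in code. Treat the exact formula as an assumption: one common formulation multiplies the item-score standard deviation by the correlation between the item score and the criterion score.

```python
# Sketch of one formulation of the item-validity index (assumed):
# item-score standard deviation × item-criterion correlation.
from statistics import pstdev

def pearson_r(x, y):
    # Pearson correlation between two equal-length score lists.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

def item_validity_index(item_scores, criterion_scores):
    return pstdev(item_scores) * pearson_r(item_scores, criterion_scores)

# Hypothetical data: item score (1/0) and criterion score per testtaker.
print(item_validity_index([1, 1, 0, 0], [10, 9, 3, 2]))
```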
It compares performance on a particular item with performance in the upper and lower regions of a distribution of continuous test scores.
Item discrimination index
Item discrimination index is symbolized by _.
Lowercase italic “d”
The _ the value of d, the more adequately the item discriminates the higher-scoring from the lower-scoring testtakers.
Higher
The highest possible value of d.
+1.00
The value of d that indicates the item is not discriminating for there is the same proportion of members of the upper and lower groups who pass the item.
0
The lowest value that an index of item discrimination can take. It indicates that all members of the upper group failed the item and all members of the lower group passed it.
-1.00
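The values of d above can be reproduced with a short sketch, assuming the common definition of d as the difference between the proportions of the upper- and lower-scoring groups who pass the item.

```python
# Sketch: item-discrimination index d = (proportion of upper group
# passing the item) - (proportion of lower group passing it).
# d ranges from -1.00 (all of the lower group and none of the
# upper group pass) to +1.00.

def discrimination_index(upper_pass, upper_n, lower_pass, lower_n):
    return upper_pass / upper_n - lower_pass / lower_n

print(round(discrimination_index(9, 10, 3, 10), 2))   # 0.6: discriminates well
print(round(discrimination_index(5, 10, 5, 10), 2))   # 0.0: no discrimination
print(round(discrimination_index(0, 10, 10, 10), 2))  # -1.0: reversed item
```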
It is a graphic representation of item difficulty and discrimination. The steeper the slope, the greater the item discrimination.
Item-characteristic curves
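The slope-discrimination relationship can be illustrated with a logistic model. This is a sketch, assuming the two-parameter logistic form common in item response theory, where a (the slope) reflects discrimination and b reflects difficulty; the parameter values are made up.

```python
# Sketch: a two-parameter logistic item-characteristic curve.
# P(correct) rises with ability theta; a steeper slope (larger a)
# separates low- and high-ability testtakers more sharply.
import math

def icc(theta, a, b):
    # Probability of a correct response at ability level theta.
    return 1 / (1 + math.exp(-a * (theta - b)))

# A highly discriminating item (a=2) vs. a weak one (a=0.5), both b=0.
for theta in (-1, 0, 1):
    print(round(icc(theta, a=2, b=0), 3), round(icc(theta, a=0.5, b=0), 3))
```

At theta = 0 both items give P = .5, but the steep item's probabilities spread much further apart across ability levels.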
It is an item that favors one particular group of examinees in relation to another when differences in group ability are controlled.
Biased test item
It is exemplified by different shapes of item-characteristic curves for different groups when the 2 groups do not differ in total test score.
Differential item functioning
These are techniques of data generation and analysis that rely primarily on verbal rather than mathematical or statistical procedures.
Qualitative methods
It is a general term for various nonstatistical procedures designed to explore how individual test items work. It involves exploration of the issues through verbal means.
Qualitative item analysis
A qualitative research tool designed to shed light on the testtaker’s thought processes during the administration of a test. They are asked to think aloud as they respond to each item.
“Think aloud” test administration
A study of test items, typically conducted during the test development process, in which the items are examined for fairness to all prospective testtakers and for the presence of offensive language, stereotypes or situations.
Sensitivity review
It refers to the revalidation of a test on a sample of testtakers other than those on whom test performance was originally found to be a valid predictor of some criterion.
Cross validation
The decrease in item validities that inevitably occurs after cross-validation of findings.
Validity shrinkage
Test validation process conducted on two or more tests using the same sample of testtakers.
Co-validation
Co-validation conducted in conjunction with the creation of norms or the revision of existing norms.
Co-norming
Discrepancies between scorers are resolved by a third scorer, who is called the _.
Resolver
A protocol scored by a highly authoritative scorer, designed as a model for scoring and as a mechanism for resolving scoring discrepancies.
Anchor protocol
Discrepancy between scoring in an anchor protocol and the scoring of another protocol.
Scoring drift
Items that respondents from different groups, at the same level of the underlying trait, have different probabilities of endorsing as a function of their group membership.
Differential item functioning (DIF) items