W3 - Chapter 8 - Test Development - DN Flashcards
anchor protocol
- a test answer sheet
- developed by a test publisher
- used to check the accuracy of examiners’ scoring
p.280
biased test item
- an item that favours one group in relation to another
- when differences in group ability are controlled
p.271
binary-choice item
- a multiple-choice item that contains only two possible responses
- (e.g., true-false)
p.254
categorical scaling
- system of scaling
- stimuli placed in one of two or more alternative categories that differ quantitatively with respect to some continuum
p.249
categorical scoring
- a method of evaluation
- where test responses earn credit toward placement in a particular class/category
- sometimes testtakers must meet a set number of responses corresponding to a particular criterion to be placed in a specific category
- also called class scoring
- contrast with cumulative scoring & ipsative scoring
p.260
ceiling effect
- diminished utility of a tool of assessment in distinguishing testtakers at the high end of the ability, trait, or other attribute being measured
p. 259, 307
class scoring
- a method of evaluation
- where test responses earn credit toward placement in a particular class/category
- sometimes testtakers must meet a set number of responses corresponding to a particular criterion to be placed in a specific category
- contrast with cumulative scoring & ipsative scoring
p.260
comparative scaling
- in test development
- a method of developing ordinal scales
- through the use of a sorting task
- entails judging a stimulus in comparison with every other stimulus used on the test
p.249
completion item
- requires an examinee to provide a word or phrase that completes a sentence
p. 254
computerized adaptive testing (CAT)
- an interactive, computer-administered testtaking process
- items are presented to the testtaker based in part on the testtaker’s performance on previous items
p.15, 255-256
co-norming
- the test norming process conducted on two or more tests
- using the same sample of testtakers
- when used to validate all of the tests being normed, this process may also be referred to as co-validation
p.138n4, 278
constructed-response format
- a form of test item requiring a testtaker to construct or create a response
- as opposed to simply selecting a response
- contrast with selected-response format
p.252
co-validation
- a test validation process conducted on two or more tests using the same sample of testtakers
- when used in conjunction with the creation of norms, the process may also be referred to as co-norming
p.278
cross-validation
- a revalidation of a test on a sample of testtakers
- other than those on whom test performance was originally found to be a valid predictor of some criterion
p.278
essay item
- a test item that requires a testtaker to write a composition
- typically one that demonstrates recall of facts, understanding, analysis, and/or interpretation
p.255
expert panel
- in test development process
- a group of people knowledgeable about the subject matter being tested and/or the population for whom the test is being designed
- they can provide input to improve the test’s content, fairness, etc.
p.274-275
floor effect
- a phenomenon arising from the diminished utility of a tool of assessment in distinguishing testtakers at the low end of the ability, trait, or other attribute being measured
p. 256-259
giveaway item
- a test item, usually near the beginning of a test of ability or achievement
- designed to be relatively easy
- usually for the purpose of building the testtaker’s confidence or reducing test-related anxiety
p.263n4
What three criteria must be met when correcting for the impact of guessing?
- must recognize that guesses are not normally totally random (some reflect partial knowledge)
- must deal with the problem of omitted items
- must account for the fact that some testtakers are luckier guessers than others
p.269-271
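A minimal sketch of the classic formula that such corrections are usually built on (the formula itself is not stated on this card; the numbers and variable names are illustrative):

```python
def corrected_score(num_right: int, num_wrong: int, k: int) -> float:
    """Classic correction for guessing: R - W / (k - 1).

    num_right: items answered correctly
    num_wrong: items answered incorrectly (omitted items are not counted,
               which is exactly the ambiguity the second criterion points to)
    k: number of response alternatives per item (2 for true-false)
    """
    return num_right - num_wrong / (k - 1)

# e.g., 40 right and 10 wrong on a test of five-option multiple-choice items
print(corrected_score(40, 10, 5))  # 37.5
```

The formula assumes wrong answers come from purely random guessing, which is precisely what the first and third criteria above call into question.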
Guttman scale
- a scale on which items range sequentially from weaker to stronger expressions of the attitude or belief being measured
- constructed so that endorsement of a stronger item implies endorsement of all the milder items that precede it
- named after its developer, Louis Guttman
p.249
ipsative scoring
- approach to scoring & interpretation
- responses & presumed strength of measured trait are interpreted relative to the measured strength of other traits for that testtaker
- contrast with class scoring & cumulative scoring
p.260
item analysis
- general term used to describe various procedures
- usually statistical, designed to explore how individual items work compared to others in the test & in the context of the whole test
- e.g., to explore the level of difficulty of individual items on an achievement test
- e.g., to explore the reliability of a personality test
- contrast with qualitative item analysis
p.262-275
item bank
- a collection of questions to be used in the construction of a test
p. 255, 257-259, 282-284
item branching
- in computerised adaptive testing (CAT)
- the individualised presentation of test items drawn from an item bank based on the testtaker’s previous responses
p.260
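A minimal sketch of the branching idea, assuming a hypothetical item bank organised by difficulty level and a simple step-up/step-down rule; real CAT systems use more sophisticated (IRT-based) item selection:

```python
import random

# Hypothetical item bank: ten items at each difficulty level from 1 (easy) to 5 (hard)
item_bank = {level: [f"item_{level}_{i}" for i in range(10)] for level in range(1, 6)}

def next_item(current_level: int, last_answer_correct: bool) -> tuple[int, str]:
    """Branch to a harder item after a correct answer, an easier one after an error."""
    level = min(current_level + 1, 5) if last_answer_correct else max(current_level - 1, 1)
    return level, random.choice(item_bank[level])

level = 3                                         # start near the middle of the range
for answer_correct in [True, True, False, True]:  # a simulated response pattern
    level, item = next_item(level, answer_correct)
    print(level, item)
```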
item-characteristic curve (ICC)
- a graphic representation of the probabilistic relationship between the strength of the trait (ability or characteristic) being measured and the probability of responding to an item in a predicted way
- also known as a category response curve or an item trace line
p.177, 268, 281
item-difficulty index
- items cannot be too easy or too hard if they are to differentiate between testtakers’ knowledge of the subject matter
- a statistic obtained by calculating the proportion of the total number of testtakers who answered an item correctly
- p is used to denote item difficulty
- a subscript refers to the item number (e.g., p1 for item 1)
- values can range from 0 to 1
- the larger the item-difficulty index, the easier the item
- (i.e., the higher the p, the easier the item, because p represents the proportion of people passing the item)
p.263-264
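A minimal sketch of the computation, assuming responses scored 1 for correct and 0 for incorrect (the response matrix is made up for illustration):

```python
# Rows are testtakers, columns are items; 1 = correct, 0 = incorrect
responses = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 1, 0, 1],
]

n_testtakers = len(responses)
n_items = len(responses[0])

# p for each item = proportion of testtakers who answered it correctly
p = [sum(row[j] for row in responses) / n_testtakers for j in range(n_items)]
print(p)  # [0.75, 0.75, 0.25, 1.0] -> item 4 is the easiest, item 3 the hardest
```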
item-discrimination index
- a measure of how well an item separates testtakers who score high on the test as a whole from those who score low
- symbolised by d
p.264-268
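A minimal sketch of one common way d is calculated, comparing an upper and a lower scoring group; the 27% cut-off and the toy data are illustrative assumptions rather than the only possible choices:

```python
def discrimination_index(item_correct: list[int], total_scores: list[float],
                         fraction: float = 0.27) -> float:
    """d = (U - L) / n: U and L are the numbers of testtakers in the upper and
    lower scoring groups who answered the item correctly; n is the group size."""
    order = sorted(range(len(total_scores)), key=lambda i: total_scores[i])
    n = max(1, round(len(order) * fraction))
    lower_group, upper_group = order[:n], order[-n:]
    upper_correct = sum(item_correct[i] for i in upper_group)
    lower_correct = sum(item_correct[i] for i in lower_group)
    return (upper_correct - lower_correct) / n

# Ten testtakers: whether each got the item right, and their total test scores
item = [1, 1, 1, 0, 1, 0, 0, 1, 0, 0]
totals = [95, 90, 88, 60, 85, 55, 50, 80, 45, 40]
print(discrimination_index(item, totals))  # 1.0 -> the item strongly favours high scorers
```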
item-endorsement index
- the name given to the item-difficulty index (used in achievement testing) when it is used in other contexts (e.g., personality testing)
p. 263
item fairness
- a reference to the degree of bias, if any, in a test item
p. 271-272
item format
- a reference to the form, plan, structure, arrangement, or layout of individual test items
- including whether the test items require testtakers to select or create a response
p.252-255
item pool
- the reservoir or well from which items will or will not be drawn for the final version of the test
- the collection of items to be further evaluated for possible selection for use in an item bank
p.251
item-reliability index
- provides an indication of the internal consistency of a test
- the higher the index, the greater the internal consistency
- index is equal to
- the product of the item-score standard deviation (s) and
- the correlation (r) between the item score and the total test score
p.264
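A minimal sketch of the computation described above (item-score standard deviation multiplied by the item-total correlation), using a made-up response matrix; note this uses the population standard deviation and requires Python 3.10+ for statistics.correlation:

```python
import statistics

# 1 = correct, 0 = incorrect; rows are testtakers, columns are items
responses = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [0, 0, 0],
    [1, 1, 1],
]
totals = [sum(row) for row in responses]             # total test score per testtaker

def item_reliability_index(item_idx: int) -> float:
    item_scores = [row[item_idx] for row in responses]
    s = statistics.pstdev(item_scores)               # item-score standard deviation
    r = statistics.correlation(item_scores, totals)  # item-total correlation
    return s * r

print([round(item_reliability_index(j), 3) for j in range(3)])
```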
item-validity index
- a statistic designed to provide an indication of the degree to which a test is measuring what it purports to measure
- important when a test developer’s goal is to maximise the criterion-related validity of a test
- the higher the item-validity index, the greater the test’s criterion-related validity
- to calculate we must first know
- the item-score standard deviation (symbolised as s1, s2, s3 etc.)
- and the correlation between the item score and the criterion score
- then we use the item-difficulty index p1 in the following formula
- s1 = √(p1 × (1 − p1))
- the correlation between the score on item 1 and a score on a criterion measure (r1c) is multiplied by item 1’s item-score standard deviation (s1)
- the product is an index of an item’s validity (s1 r1c)
p.264
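A minimal sketch of the calculation spelled out above, with made-up item responses and a hypothetical criterion measure (e.g., supervisor ratings); requires Python 3.10+ for statistics.correlation:

```python
import math
import statistics

item1_scores = [1, 1, 0, 1, 0, 1, 1, 0]      # 1 = correct, 0 = incorrect on item 1
criterion    = [8, 9, 4, 7, 5, 9, 6, 3]      # hypothetical criterion scores

p1  = sum(item1_scores) / len(item1_scores)            # item-difficulty index for item 1
s1  = math.sqrt(p1 * (1 - p1))                         # item-score standard deviation
r1c = statistics.correlation(item1_scores, criterion)  # item-criterion correlation

item_validity_index = s1 * r1c
print(round(item_validity_index, 3))
```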
Likert scale
- a summative rating scale with 5 alternative responses
- ranging on a continuum from e.g., “strongly agree” to “strongly disagree”
p.247
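A minimal sketch of how 5-point Likert responses might be converted to numbers and summed into a summative score; the response labels and the choice of which item is reverse-keyed are illustrative assumptions:

```python
SCALE = {"strongly disagree": 1, "disagree": 2, "neither agree nor disagree": 3,
         "agree": 4, "strongly agree": 5}

responses = ["agree", "strongly agree", "disagree", "agree"]  # one testtaker, four items
reverse_keyed = {2}  # hypothetical: the third item is worded in the opposite direction

total = sum(
    (6 - SCALE[ans]) if i in reverse_keyed else SCALE[ans]   # reverse-score where needed
    for i, ans in enumerate(responses)
)
print(total)  # summative score across the four items (here: 17)
```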
matching item
- the testtaker is presented with two columns
- premises on the left & responses on the right
- task is to determine which response is best matched to which premise
- young testtakers may be asked to draw a line between matches
- others are typically asked to write a letter or number as a response
p.253
method of paired comparisons
- a scaling method
- pairs of stimuli (e.g., photos) are presented, and the testtaker selects one of each pair according to a rule
- (e.g., “select the one that is more appealing”)
p.248
multiple-choice format
- one of the three types of selected-response item formats
- three elements
- a stem
- a correct alternative or option
- and several incorrect alternatives (referred to as distractors or foils)
p.252
pilot work
- also referred to as pilot study & pilot research
- preliminary research surrounding the creation of a prototype test
- general objective is to determine how best to
- gauge
- assess, or
- evaluate the targeted construct(s)
p.243-244
qualitative item analysis
- non-statistical procedures designed to explore how individual test items work
- both compared to other items in the test & in the context of the whole test
- unlike statistical measures, they involve exploration of the issues by verbal means
- (e.g., interviews & group discussions with testtakers & other relevant parties)
p.272-275
qualitative methods
- techniques of data generation & analysis
- rely primarily on verbal rather than mathematical or statistical procedures
p.272
rating scale
- a system of ordered numerical or verbal descriptors
- used to make judgements about the presence, absence, or magnitude of a particular trait, attitude, emotion, or other variable
p.205, 247, 371
scaling
- 1) in test construction
- the process of setting rules for assigning numbers in measurement
- 2) the process by which a measuring device
- is designed and calibrated &
- the way numbers (or other indices) are assigned to different amounts of a trait, attribute, or characteristic being measured
p.244-251
scalogram analysis
- an item-analysis procedure
- entails graphic mapping of a testtaker’s responses
p.250
scoring drift
- a discrepancy between the scoring in an anchor protocol and the scoring of another protocol
p. 280
selected-response format
- a form of test item
- requiring testtakers to select a response
- (e.g., true/false, multiple choice, and matching items)
- as opposed to creating one
- contrast with constructed-response format
p.252
sensitivity review
- a study of test items
- usually during test development
- items are examined for fairness to all prospective testtakers
- for the presence of offensive language, stereotypes, or situations
p.274
short-answer item
- may also be referred to as a completion item
- a response consisting of a word, term, sentence, or brief paragraph may qualify
- anything longer is an essay item
p.254
summative scale
- an index derived from the summing of selected scores on a test or sub-test
p. 247
test conceptualization
- an early stage of the test development process
- when an idea for a particular test or test revision is conceived
p.240, 241-244
test construction
- a stage in the process of test development
- entails writing test items (or rewriting/revising existing items)
- as well as formatting items, setting scoring rules, and otherwise designing and building a test
p.240
test development
- an umbrella term for all that goes into the process of creating a test
p. 240-284
test revision
- action taken to modify a test’s content or format
- for the purpose of improving the test’s effectiveness as a tool of measurement
p.240
test tryout
- a stage in the process of test development that entails administering a preliminary version of a test to a representative sample of testtakers
- under conditions that simulate the conditions under which the final version of the test will be administered
p.240, 261-262
“think aloud” test administration
- a method of qualitative item analysis
- examinees verbalize their thoughts as they take the test
- useful in understanding how
- individual items function in a test
- testtakers interpret or misinterpret the meaning of the individual items
p.274
true-false item
- a binary-choice item
- i.e., offers only two possible responses
- requires testtaker to indicate whether a statement is or is not a fact
p.254
validity shrinkage
- the decrease in item validities that inevitably occurs after cross-validation
p. 278
What is the optimal item difficulty?
- usually the midpoint between 1.00 and the probability of answering correctly by guessing
- that probability is called the chance success proportion
- e.g., a true-false item has a chance success proportion of .50, so the optimal difficulty is (.50 + 1.00)/2 = .75
- e.g., a five-option multiple-choice item has a chance success proportion of .20, so the optimal difficulty is (.20 + 1.00)/2 = .60
p.263
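A minimal sketch of the midpoint rule stated above:

```python
def optimal_difficulty(chance_success: float) -> float:
    """Midpoint between the chance success proportion and 1.00."""
    return (chance_success + 1.0) / 2

print(optimal_difficulty(0.50))  # true-false item             -> 0.75
print(optimal_difficulty(0.20))  # five-option multiple choice -> 0.60
```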
How can you create a visual representation of the best items on a test
(i.e., if the objective is to maximise criterion-related validity)?
- this can be achieved by plotting each item’s
- item-validity index and
- item-reliability index
p.265
Fig 8-5
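A minimal sketch of such a plot using matplotlib, with made-up index values for ten hypothetical items; in a plot like the book’s Figure 8-5, items toward the upper right (high on both indices) would be the strongest candidates to retain:

```python
import matplotlib.pyplot as plt

# Hypothetical item-reliability and item-validity index values for ten items
reliability_index = [0.10, 0.25, 0.35, 0.40, 0.15, 0.30, 0.45, 0.20, 0.38, 0.28]
validity_index    = [0.05, 0.20, 0.30, 0.35, 0.10, 0.12, 0.40, 0.22, 0.15, 0.33]

fig, ax = plt.subplots()
ax.scatter(reliability_index, validity_index)
for item_no, (x, y) in enumerate(zip(reliability_index, validity_index), start=1):
    ax.annotate(str(item_no), (x, y))           # label each point with its item number
ax.set_xlabel("item-reliability index (s × item-total r)")
ax.set_ylabel("item-validity index (s × item-criterion r)")
plt.show()
```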