W3 - Chapter 8 - Test Development - DN Flashcards
1
Q
anchor protocol
A
- a test answer sheet
- developed by a test publisher
- used to check the accuracy of examiners’ scoring
p.280
2
Q
biased test item
A
- an item that favours one group in relation to another
- when differences in group ability are controlled
p.271
3
Q
binary-choice item
A
- a multiple-choice item
- that contains only two possible responses (e.g., true-false)
p.254
4
Q
categorical scaling
A
- system of scaling
- stimuli placed in one of two or more alternative categories that differ quantitatively with respect to some continuum
p.249
5
Q
categorical scoring
A
- a method of evaluation
- where test responses earn credit toward placement in a particular class/category
- sometimes testtakers must meet a set number of responses corresponding to a particular criterion to be placed in a specific category
- also called class scoring
- contrast with cumulative scoring & ipsative scoring
p.260
6
Q
ceiling effect
A
- diminished utility of a tool of assessment in distinguishing testtakers at the high end of the ability, trait, or other attribute being measured
p. 259, 307
7
Q
class scoring
A
- a method of evaluation
- where test responses earn credit toward placement in a particular class/category
- sometimes testtakers must meet a set number of responses corresponding to a particular criterion to be placed in a specific category
- contrast with cumulative scoring & ipsative scoring
p.260
8
Q
comparative scaling
A
- in test development
- a method of developing ordinal scales
- through the use of a **sorting task**
- entails judging a stimulus in comparison with every other stimulus used on the test
p.249
9
Q
completion item
A
- requires an examinee to provide a word or phrase that completes a sentence
p. 254
10
Q
computerized adaptive testing (CAT)
A
- an interactive, computer-administered testtaking process
- items are presented to the testtaker based in part on the testtaker’s performance on previous items
p.15, 255-256
11
Q
co-norming
A
- the test norming process conducted on two or more tests
- using the same sample of testtakers
- when used to validate all of the tests being normed, this process may also be referred to as co-validation
p.138n4, 278
12
Q
constructed-response format
A
- a form of test item requiring a testtaker to construct or create a response
- as opposed to simply selecting a response
- contrast with selected-response format
p.252
13
Q
co-validation
A
- a test validation process conducted on two or more tests
- using the same sample of testtakers
- when conducted together with the creation of norms, the process may also be referred to as co-norming
p.278
14
Q
cross-validation
A
- a revalidation on a sample of testtakers
- other than the testtakers on whom test performance was originally found to be a valid predictor of some criterion
p.278
15
Q
essay item
A
- a test item that requires a testtaker to write a composition
- typically one that demonstrates recall of facts, understanding, analysis, and/or interpretation
p.255
16
Q
expert panel
A
- in test development process
- a group of people knowledgeable about the subject matter being tested and/or the population for whom the test is being designed
- they can provide input to improve the test’s content, fairness, etc.
p.274-275
17
Q
floor effect
A
- a phenomenon arising from the diminished utility of a tool of assessment in distinguishing testtakers at the low end of the ability, trait, or other attribute being measured
p. 256-259
18
Q
giveaway item
A
- a test item, usually near the beginning of a test of ability or achievement
- designed to be relatively easy
- usually for the purpose of building the testtaker’s confidence or reducing test-related anxiety
p.263n4
19
Q
What three criteria must be met when correcting for the impact of guessing?
A
- it must recognize that guesses are not always totally random
- it must deal with the problem of omitted items
- it must account for the fact that some testtakers are luckier guessers than others
p.269-271
20
Q
Guttman scale
A
- a scale on which items range sequentially from weaker to stronger expressions of the attitude or belief being measured
- constructed so that endorsement of a stronger (later) item implies endorsement of all of the weaker (earlier) items
- named after its developer, Louis Guttman
p.249
21
Q
ipsative scoring
A
- approach to scoring & interpretation
- responses & presumed strength of measured trait are interpreted relative to the measured strength of other traits for that testtaker
- contrast with class scoring & cumulative scoring
p.260
22
Q
item analysis
A
- general term used to describe various procedures
- usually statistical, designed to explore how individual items work compared to others in the test & in the context of the whole test
- e.g., to explore the level of difficulty of individual items on an achievement test
- e.g., to explore the reliability of a personality test
- contrast with qualitative item analysis
p.262-275
23
Q
item bank
A
- a collection of questions to be used in the construction of a test
p. 255, 257-259, 282-284
24
Q
item branching
A
- in computerised adaptive testing (CAT)
- the individualised presentation of test items drawn from an item bank based on the testtaker’s previous responses
p.260
25
item-characteristic curve (ICC)
* a **graphic** representation of the **probabilistic relationship** between a person's **level** of the **trait** (ability, characteristic) being measured and the **probability** of **responding** to an item in a **predicted** way
* also known as a category response curve or an item trace line
p.177, 268, 281
26
item-difficulty index
* items cannot be too easy or too hard if they are to differentiate between testtakers' knowledge of the subject matter
* a statistic obtained by calculating the **proportion** of the **total number** of **testtakers** who answered an item **correctly**
* *p* is used to denote item difficulty
* a subscript refers to the item number (e.g., *p*1 denotes the difficulty of item 1)
* can **range from 0-1**
* the larger the item-difficulty index, the easier the item
* (i.e., the higher the *p*, the easier the item, because *p* represents the **proportion of testtakers passing** the item)
p.263-264
27
item-discrimination index
* a measure of how well an item discriminates between high scorers and low scorers on the test as a whole
* symbolised by *d*
p.264-268
28
item-endorsement index
* the name given to the item-difficulty index (used in achievement testing) when the statistic is used in **other contexts** (e.g., personality testing)
p. 263
29
item fairness
* a reference to the **degree of bias**, if any, in a test item
p. 271-272
30
item format
* a reference to the **form, plan, structure, arrangement,** or **layout** of individual test items
* including whether the test items require testtakers to **select or create** a response
p.252-255
31
item pool
* the reservoir or well from which items will or will not be **drawn** for the final version of the test
* the **collection of items** to be further **evaluated** for **possible selection** for use in an **item bank**
p.251
32
item-reliability index
* provides an indication of the **internal consistency** of a test
* the **higher the index**, the greater the internal consistency
* index is equal to
* the product of the item-score standard deviation (*s*) and
* the correlation (*r*) between the item score and the total test score
p.264
33
item-validity index
* a statistic designed to provide an indication of the **degree** to which a **test is measuring** what it **purports to measure**
* **important** when a test developer's **goal** is to maximise the **criterion-related validity** of a test
* the higher the item-validity index, the greater the test's criterion-related validity
* to calculate we must first know
* the item-score standard deviation (symbolised as *s*1, *s*2, *s*3 etc.)
* and the correlation between the item score and the criterion score
* then we use the item difficulty index *p*1 in the following formula
* *s*1 = square root of *p*1 (1 - *p*1)
* the correlation between the score on item 1 and a score on a criterion measure (*r*1c) is multiplied by item 1's item-score standard deviation (*s*1)
* the product is an **index of an item's validity** (*s*1*r*1c)
p.264
34
Likert scale
* **summative rating scale** with **5 alternative responses**
* ranging on a continuum from e.g., "strongly agree" to "strongly disagree"
p.247
35
matching item
* the testtaker is presented with two columns
* *premises* on the left & *responses* on the right
* task is to determine which response is best matched to which premise
* young testtakers may be asked to draw a line between matching entries
* others are typically asked to write a letter or number as a response
p.253
36
method of paired comparisons
* a **scaling** method
* the testtaker is presented with **pairs of stimuli** (e.g., photos) and asked to select one **according to a rule**
* (e.g., "select the one that is more appealing")
p.248
37
multiple-choice format
* one of the three types of **selected-response** item formats
* three elements
1. a stem
2. a correct alternative or option
3. and several incorrect alternatives (referred to as distractors or foils)
p.252
38
pilot work
* also referred to as pilot study & pilot research
* **preliminary research** surrounding the creation of a prototype test
* general objective is to determine how best to
* **gauge**
* **assess**, or
* **evaluate** the **targeted construct**(s)
p.243-244
39
qualitative item analysis
* **non-statistical** procedures designed to explore how individual test items work
* both compared to **other items** in the test & in the **context** of the **whole test**
* unlike statistical measures, they involve **exploration** of the issues by **verbal means**
* (e.g., interviews & group discussions with testtakers & other relevant parties)
p.272-275
40
qualitative methods
* techniques of **data generation & analysis**
* rely primarily on **verbal** rather than mathematical or statistical procedures
p.272
41
rating scale
* a system of **ordered numerical** or **verbal descriptors**
* used to make **judgements** about the **presence, absence, or magnitude** of a particular trait, attitude, emotion, or other variable
p.205, 247, 371
42
scaling
* 1) in **test construction**
* the process of **setting rules** for **assigning numbers** in measurement
* 2) the process by which a measuring device
* is designed and calibrated &
* the way numbers (or other indices) are assigned to different amounts of a trait, attribute, or characteristic being measured
p.244-251
43
scalogram analysis
* an **item-analysis** procedure
* entails **graphic mapping** of a testtaker's **responses**
p.250
44
scoring drift
* a **discrepancy** between the scoring in an **anchor protocol** and the scoring of **another protocol**
p. 280
45
selected-response format
* a form of test item
* requiring testtakers to **select a response**
* (e.g., true/false, multiple choice, and matching items)
* as opposed to creating one
* contrast with constructed-response format
p.252
46
sensitivity review
* a **study of test items**
* usually during test development
* items are examined for **fairness** to all prospective testtakers
* for the presence of offensive language, stereotypes, or situations
p.274
47
short-answer item
* may also be referred to as a completion item
* a word, term, sentence, or paragraph may qualify
* anything beyond this is an essay item
p.254
48
summative scale
* an index derived from the **summing of selected scores** on a test or sub-test
p. 247
49
test conceptualization
* an early stage of the test development process
* when an **idea** for a particular test or test revision is **conceived**
p.240, 241-244
50
test construction
* a stage in the process of test development
* entails **writing test items** (or **rewriting/revising** existing items)
* as well as **formatting items, setting scoring rules**, and otherwise **designing** and **building** a **test**
p.240
51
test development
* an umbrella term for all that goes into the process of creating a test
p. 240-284
52
test revision
* action taken to **modify** a test's **content** or **format**
* for the purpose of **improving** the test's **effectiveness** as a tool of **measurement**
p.240
53
test tryout
* a stage in the process of test development that entails **administering a preliminary version** of a test to a **representative sample** of testtakers
* under **conditions** that **simulate** the **conditions** under which the **final version** of the test will be administered
p.240, 261-262
54
"think aloud" test administration
* a method of **qualitative** item analysis
* examinees **verbalize** their **thoughts** as they take the test
* useful in understanding how
* **individual items function** in a test
* testtakers **interpret or misinterpret** the **meaning** of the individual items
p.274
55
true-false item
* a **binary-choice** item
* i.e., the testtaker selects one of only two possible responses
* requires testtaker to indicate whether a statement **is or is not a fact**
p.254
56
validity shrinkage
* the **decrease** in item validities that inevitably occurs **after cross-validation**
p. 278
57
What is the optimal item difficulty?
* usually **midpoint** between **1.0** and the **probability** of answering **correctly** by **guessing**
* which is called the **chance success proportion**
* e.g., a five-option multiple-choice item (chance success proportion = .20): (.20 + 1.00) / 2 = .60; a true-false item (.50 chance): (.50 + 1.00) / 2 = .75
p.263
58
How can you create a **visual representation** of the **best items** on a test
(i.e., if the objective is to **maximise criterion-related validity**)?
* this can be achieved by **plotting** each item's
* item-validity index and
* item-reliability index
p.265
Fig 8-5