Ch. 8 - Test Development Flashcards

1
Q

5 stages of test development

A
1. conceptualization
2. construction
3. tryout
4. item analysis
5. test revision
2
Q

test construction

A

process of writing possible test items

3
Q

test tryout

A

administering a test to a representative sample of testtakers under conditions that simulate those of the final version of the test

4
Q

some questions to ask when developing a new test

A

What is the test designed to measure? (what construct)
Is there a need for this test?
Who will use and take the test?
How will the test be administered?
Is there any potential for harm?
How will meaning be attributed to the scores on the test?

5
Q

on a norm-referenced test, a good item is one that…

A

high scorers on the whole test get right

6
Q

on a criterion referenced test, you need to do exploratory/pilot work with…

A

a group known to have mastered the skill

7
Q

pilot work/study

Why is it done?

A

work done surrounding the creation of the prototype of a test
done to determine how to best measure a targeted construct

8
Q

scaling

A

setting rules for assigning numbers in measurement; the process by which a measuring device is designed and calibrated and by which numbers (or other indices), AKA scale values, are assigned to different amounts of the thing being measured

9
Q

stanine scale

A

all raw scores on the test can be transformed into scores that range from 1 to 9
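
Bonus: a minimal Python sketch of a stanine transformation. The 4-7-12-17-20-17-12-7-4 normal-curve percentages (and the percentile cutoffs they imply) are standard but not stated in the card.

```python
import bisect

# Cumulative percentile cutoffs implied by the standard stanine
# percentages 4-7-12-17-20-17-12-7-4.
CUTOFFS = [4, 11, 23, 40, 60, 77, 89, 96]

def stanine(percentile: float) -> int:
    """Map a percentile rank (0-100) to a stanine score of 1-9."""
    return bisect.bisect(CUTOFFS, percentile) + 1

print(stanine(50))  # 5 (the middle stanine)
print(stanine(97))  # 9
```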

10
Q

age and grade-based scales

A

used when testtakers’ performance as a function of age or grade is of critical interest

11
Q

Likert scale

A

tends to be very reliable; each item is rated on a 1-5 or 1-7 scale (e.g., strongly disagree to strongly agree)

12
Q

rating scales

provide what kind of data?

A

grouping of words, statements, or symbols on which judgments of the strength of a particular trait, attitude, or emotion are indicated by the testtaker
ALL rating scales provide ordinal data

13
Q

method of paired comparisons

A

testtakers are presented with a pair of stimuli and must choose between them.
provide ordinal data
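
Bonus: a standard counting fact (not from the card) showing why paired comparisons get unwieldy as stimuli are added:

```latex
\binom{n}{2} = \frac{n(n-1)}{2}
\qquad \text{e.g., } n = 10 \text{ stimuli} \Rightarrow \frac{10 \cdot 9}{2} = 45 \text{ pairs to judge}
```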

14
Q

comparative scaling

A

testtaker must judge a stimulus in comparison with every other stimulus on the scale

15
Q

categorical scaling

A

stimuli are placed into one of two or more alternative categories that differ quantitatively with respect to some continuum. For ex: sort statements into “never justified,” “sometimes justified,” and “always justified”

16
Q

Guttman scale

A

items on it range sequentially from weaker to stronger expressions of the attitude or trait being measured (everyone who agrees with the stronger statement also agrees with the weaker ones). used in consumer research
AKA scalogram analysis

17
Q

direct estimation vs indirect estimation

A

in direct estimation, you don’t need to transform a testtaker’s responses into some other scale. in indirect, you do need to transform those responses.

18
Q

equal-appearing intervals method

A

the only rating scale described that has items that are interval in nature (ex: suicide scale) - there are presumed to be equal distances between the values on the scale (interval scale)
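
Bonus: a minimal Python sketch of how scale values are commonly derived in the equal-appearing intervals (Thurstone) method; the judge data and the 11-category sort are assumptions for illustration.

```python
import statistics

# Hypothetical: 7 judges each sort one attitude statement into a
# category from 1 to 11 (the categories are treated as equally spaced).
judge_placements = [7, 8, 8, 9, 7, 8, 10]

# The item's scale value is the median of the judges' placements;
# a large spread suggests an ambiguous item that should be dropped.
scale_value = statistics.median(judge_placements)
spread = statistics.stdev(judge_placements)
print(f"scale value = {scale_value}, spread = {spread:.2f}")
```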

19
Q

How many test items should an item pool contain for a multiple-choice test?

A

roughly twice the number of items planned for the final version of the test (e.g., a 50-item final test calls for an item pool of about 100 items)

20
Q

item pool

A

assembly of many items (from brainstorming all possibilities or many possibilities of test items)

21
Q

selected-response format (item types)

A

multiple choice
matching
true-false (binary-choice item)

22
Q

constructed-response format (item types)

A

completion item
short answer
essay
looking for synthesis of info

23
Q

item bank

A

collection of GOOD test items that continue to be selected, used, or rotated across versions of the test; effectively the finalized version of the item pool

24
Q

CAT

A

computerized adaptive testing - a test-taking process wherein the items presented to the testtaker are based on performance on previous items. items may be displayed according to rules (e.g., show the next level only after you get 2 hard ones right).

25
Q

floor effect vs. ceiling effect reduced by?

A

CAT tends to reduce both.
floor effect - failure to distinguish among low scorers/low-ability testtakers
ceiling effect - failure to distinguish among high scorers/high-ability testtakers

26
Q

item branching

A

the ability of a computer to tailor the content and order of presentation of items on the basis of responses to previous items
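
Bonus: a minimal Python sketch of one possible branching rule (the items, levels, and move-up/move-down rule are hypothetical):

```python
# Hypothetical item pool organized by difficulty level.
items_by_level = {
    1: ["easy-1", "easy-2"],
    2: ["medium-1", "medium-2"],
    3: ["hard-1", "hard-2"],
}

def next_level(current_level: int, answered_correctly: bool) -> int:
    """Branch up one difficulty level after a correct response, down after a miss."""
    step = 1 if answered_correctly else -1
    return min(3, max(1, current_level + step))

level = 2
level = next_level(level, answered_correctly=True)  # moves up to level 3
print(items_by_level[level][0])                     # "hard-1"
```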

27
Q

CAT found to reduce…

A
# of test items (by 50%)
measurement error (by 50%)
28
Q

class scoring AKA

A
category scoring
testtakers will be placed in a certain group/class with other testtakers whose pattern of responses was similar in some way (e.g., diagnosis)
29
Q

cumulative scoring model

A

higher the score, the higher the testtaker is on the thing being measured

30
Q

ipsative scoring

A

compares a testtaker’s score on one thing within a test to another thing on the same test (thing = scale). comparing yourself with yourself.
e.g. Jon is cooler than he is smart BUT you can’t say Jon is cooler than Jenny

31
Q

test tryout - how many people?

A

no fewer than 5 testtakers for EACH item on the test; the more the better (preferably 10 per item - e.g., a 30-item test should be tried out on at least 150, ideally 300, people)

32
Q

How do you tell whether a test item is good?

A

item analysis

generally, a good test item is answered correctly by high scorers on the test as a whole

33
Q

item analysis

A

the statistical scrutiny of test data

34
Q

what do you scrutinize in item analysis?

A

each item’s difficulty, validity, reliability, and discrimination (IDDRV)

35
Q

item-difficulty index

expected values

A

denoted by lowercase italicized p
p = # of testtakers who answered the item correctly / total # of testtakers
the greater the p, the easier the item
if p > .90, ask whether the item is really needed, other than as a giveaway/warmup for those with test anxiety
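
Bonus: a minimal Python sketch of the computation above (the response data are hypothetical; 1 = correct, 0 = incorrect):

```python
# Item-difficulty index: p = # who answered correctly / total # of testtakers.
item_responses = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]  # hypothetical data

p = sum(item_responses) / len(item_responses)
print(f"p = {p:.2f}")  # p = 0.80 -> a fairly easy item
# If p > .90, consider whether the item is needed at all.
```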

36
Q

what is an item-difficulty index called on a personality test?

A

item-endorsement index

37
Q

item-reliability index

A

shows the internal consistency of a test
higher the value, the greater the test’s internal consistency
= s * r (item standard deviation × the correlation between the item score and the total test score)
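
Bonus: a minimal Python sketch of s * r (requires Python 3.10+ for statistics.correlation; the scores are hypothetical):

```python
import statistics

# Hypothetical scores for 6 testtakers: dichotomous item scores and
# total test scores for the same people.
item_scores = [1, 0, 1, 1, 0, 1]
total_scores = [52, 38, 47, 55, 40, 49]

s = statistics.pstdev(item_scores)                     # item standard deviation
r = statistics.correlation(item_scores, total_scores)  # item-total correlation
print(f"item-reliability index = s * r = {s * r:.3f}")
```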

38
Q

item-validity index

A

measures the degree to which an item is measuring what the test is supposed to measure

39
Q

item-discrimination index

A

measures how adequately an item separates/discriminates between high and low scorers on the entire test
yields a lowercase italicized d
d compares performance on a particular item between the upper and lower regions of a distribution of continuous test scores
the higher the d, the greater the proportion of high scorers answering the item correctly
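
Bonus: a minimal Python sketch of d = (U - L) / n; the counts are hypothetical, and the common top-and-bottom-27% split is an assumption beyond the card:

```python
# U = # in the upper-scoring group who got the item right,
# L = # in the lower-scoring group who got the item right,
# n = size of the larger of the two groups (here, top/bottom 27% of 100 testtakers).
U, L, n = 20, 16, 27

d = (U - L) / n
print(f"d = {d:.2f}")  # 0.15; a negative d would mean low scorers outperform high scorers
```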

40
Q

what does a high d value mean

A

the higher the d, the greater the proportion of high scorers answering the item correctly
Bonus: if d is negative, more low scorers than high scorers answer it correctly
Bonus: if d = 0, the same # of high and low scorers get it right

41
Q

what does a high p mean

A

the test item is easy

42
Q

analysis of item alternatives

A

for multiple-choice items, see how many people chose each distractor and evaluate the distractors appropriately (e.g., maybe one is too distracting or its wording needs to be changed)

43
Q

item-characteristic curve

A

ICC - a graphic representation of item difficulty and discrimination
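
Bonus: a minimal Python sketch that draws out the idea; the two-parameter logistic form below is one common way to model an ICC (an assumption beyond the card):

```python
import math

def icc(theta: float, a: float, b: float) -> float:
    """Two-parameter logistic ICC: P(correct response | ability theta).
    a = discrimination (slope), b = difficulty (location on the ability axis)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

print(f"{icc(0.0, a=1.0, b=0.0):.2f}")  # 0.50 -- at theta == b, success is 50/50
print(f"{icc(1.0, a=2.0, b=0.0):.2f}")  # 0.88 -- a steeper slope separates testtakers more sharply
```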

44
Q

steep slope in ICC means?

A

greater item discrimination

45
Q

“good” item looks like what in ICC

A

a straight line with a positive slope

46
Q

“good” item for a cutoff-score test or criterion-based test

A

looks like the top of an F

47
Q

guessing - what issues does it present in item analysis?

A
  • guesses are not made totally randomly
  • how do we deal with omitted items?
  • some people are luckier guessers than others
48
Q

item fairness

A

the degree (if any) to which a test item is biased

49
Q

biased test item

A

favors one particular group of examinees even when differences in group ability are controlled for

50
Q

if an item is fair, its ICC should…

A

not be significantly different for different groups when ability level is held constant

51
Q

item analysis in speed tests

A

yields misleading or uninterpretable results because items closer to the end appear more difficult simply because fewer people were able to finish them

52
Q

what are methods of qualitative item analysis

A

interviews, group discussions, “think aloud” test administration (sheds light on thought patterns), and sensitivity reviews

53
Q

sensitivity reviews

A

review by an expert panel: items are examined for fairness to all prospective testtakers and flagged for offensive language or stereotypes

54
Q

test revision (as a stage in test development) - strategy

A

characterize each item according to its strengths and weaknesses
consider the purpose of the test - if for hiring and firing, eliminate biased items
if for culling the most skilled performers - keep the items with the best item discrimination to ID the best of the best

55
Q

standardization

A

process used to introduce objectivity and uniformity into test administration, scoring, and interpretation

56
Q

What do you need to do after item analysis?

A

administer the revised test under standardized conditions, then cross-validation

57
Q

When should you revise a test?

A

stimuli look dated, vocabulary is dated, language is offensive, test norms are no longer adequate (group membership has changed), age-related shifts in abilities over time, to improve the reliability or validity of the test, the theory on which the test was based has improved

58
Q

steps to revising an existing test

A

-all the steps to make a new one (conceptualization, construction, tryout, item analysis, revision) + the need to determine whether there is equivalence between the old and new versions of the test. scores likely will not mean the same thing (use item analysis to evaluate the stability of items between revisions of the same test)

59
Q

cross-validation

what is inevitable?

A

re-validation of a test on a sample of testtakers other than the original group on which the test was found to be valid (aCROSS groups)
validity shrinkage is inevitable

60
Q

co-validation

A

test validation process conducted on two or more tests using the same sample of testtakers (economical - test subjects are identified once, saving personnel costs)

61
Q

co-norming

benefits?

A

co-validation of two or more tests combined with the creation of norms or the revision of existing norms
good for test users if the tests are often used together because they are normed on the same population (differences due to sampling error are essentially eliminated)
like co-validation, saves money

62
Q

quality assurance in test revision

A

confirming that a test is given the same way

63
Q

anchor protocol

A

test protocol scored by a highly trained scorer, designed as a model for scoring and mechanism for resolving scoring discrepancies

64
Q

protocol drift

A

the discrepancy between an anchor protocol and another scorer’s protocol

65
Q

differential item functioning

A

(DIF) - when an item functions differently in one group of testtakers as compared to another group known to have the same/similar level of the underlying trait. This means that, for some reason, respondents from different groups have different probabilities of endorsing an item as a function of their group membership (ex: Asian women fear cultural shame around feeling depressed)