Test Development Flashcards

1
Q

An emerging social phenomenon or pattern of behavior may serve as the stimulus for the development of a new test, as may the need to assess mastery in emerging occupations or professions.

A

Test Conceptualization

2
Q

Criterion-referenced testing and assessment are commonly employed in _ and _ contexts.

A

Licensing
Educational

3
Q

The items that best discriminate between 2 groups would be considered the _ items.

A

Good items

4
Q

A good item on a _ test is an item for which high scorers on the test respond correctly and low scorers respond incorrectly.

A

Norm-referenced test

5
Q

The preliminary research surrounding the creation of a prototype of the test. It is done to evaluate whether test items should be included in the final form of the instrument.

A

Pilot work

6
Q

The process by which a measuring device is designed and calibrated and by which numbers are assigned to different amounts of the trait, attribute, or characteristic being measured.

A

Scaling

7
Q

He is credited with being at the forefront of efforts to develop methodologically sound scaling methods.

A

L. L. Thurstone

8
Q

Types of scales

A

Age-based scale
Grade-based scale
Stanine scale

9
Q

A type of scale in which all raw scores on the test are transformed into scores that can range from 1 to 9.

A

Stanine scale
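As an illustrative sketch (not from the card set), one common way to compute a stanine is to linearly transform a z-score (stanines have a mean of 5 and an SD of 2) and clip the result to the 1-9 range:

```python
def stanine(z):
    """Convert a z-score to a stanine: mean 5, SD 2, clipped to 1-9."""
    return min(9, max(1, int(z * 2 + 5.5)))

print(stanine(0.0))   # average performance -> 5
print(stanine(2.5))   # far above average -> 9
print(stanine(-2.5))  # far below average -> 1
```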

10
Q

The 3 scaling methods

A

Rating Scale
Summative scale
Likert scale

11
Q

A grouping of words, statements, or symbols on which judgments of the strength of a particular trait, attitude, or emotion are indicated by the testtaker.

A

Rating Scale

12
Q

A test score is obtained by summing the ratings across all the items.

A

Summative scale

13
Q

A type of summative rating scale used extensively in psychology to scale attitudes. Each item presents the testtaker with five alternative responses, usually on an agree-disagree or approve-disapprove continuum.

A

Likert scale
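Summative (Likert) scoring can be sketched as follows; the reverse-keying of negatively worded items is a common practice assumed here, not something stated on the card:

```python
def likert_total(responses, reverse_keyed=(), n_points=5):
    """Summative scoring: sum ratings across all items, flipping
    reverse-keyed items so every item runs in the same direction."""
    total = 0
    for i, r in enumerate(responses):
        total += (n_points + 1 - r) if i in reverse_keyed else r
    return total

# Four items on a 1-5 agree-disagree continuum; item 2 is reverse-keyed
print(likert_total([4, 5, 2, 3], reverse_keyed={2}))  # 4 + 5 + (6 - 2) + 3 = 16
```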

14
Q

When one dimension is presumed to underlie the ratings.

A

Unidimensional

15
Q

When more than 1 dimension is thought to guide the testtaker’s responses.

A

Multidimensional

16
Q

What are the 4 scaling methods that produce ordinal data?

A

Method of paired comparison
Comparative scaling
Categorical scaling
Guttman scale

17
Q

A scaling method that produces ordinal data. Testtakers are presented with pairs of stimuli that they are asked to compare, and they must select one of the stimuli according to some rule. They receive a higher score for selecting the option deemed more justifiable by the majority of a group of judges.

A

Method of Paired comparison

18
Q

A scaling method that produces ordinal data. Stimuli such as printed cards, drawings, photographs, or other objects are typically presented to testtakers for evaluation and must be sorted from most justifiable to least justifiable. Sorting could also be accomplished through the use of a list of items on a sheet of paper.

A

Comparative scaling

19
Q

A scaling method that produces ordinal data. Stimuli are placed into one of two or more alternative categories that differ quantitatively with respect to some continuum.

A

Categorical scaling

20
Q

A scaling method that produces ordinal data. Items on it range sequentially from weaker to stronger expressions of the attitude, belief or feeling being measured. All respondents who agree with the stronger statements of the attitude will also agree with milder statements.

A

Guttman scale

21
Q

The resulting data of a Guttman scale are analyzed by means of this: an item-analysis procedure and approach to test development that involves a graphic mapping of a testtaker's responses.

A

Scalogram Analysis
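A minimal sketch of the check underlying a scalogram: with items ordered from mildest to strongest, a response pattern fits a Guttman scale only if every endorsed item is milder than every rejected one (the function name is my own):

```python
def fits_guttman(pattern):
    """pattern: 0/1 responses to items ordered from mildest to strongest.
    A perfect Guttman pattern is a run of 1s followed by a run of 0s:
    endorsing a stronger item implies endorsing every milder one."""
    seen_zero = False
    for response in pattern:
        if response == 1 and seen_zero:
            return False  # endorsed a stronger item after rejecting a milder one
        if response == 0:
            seen_zero = True
    return True

print(fits_guttman([1, 1, 1, 0, 0]))  # True: consistent scalogram pattern
print(fits_guttman([1, 0, 1, 0, 0]))  # False: inconsistent pattern
```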

22
Q

The reservoir from which items will or will not be drawn for the final version of the test. It consists of items already available for use as well as new items created especially for it.

A

Item pool

23
Q

It is the form, plan, structure, arrangement and layout of individual test items.

A

Item format

24
Q

The two types of item format:

A

Selected response format
Constructed response format

25
Q

It requires testtakers to select a response from a set of alternative responses.

A

Selected response format

26
Q

3 Types of selected response format:

A

Multiple choice format
Matching item
True-false

27
Q

Several incorrect alternatives or options in a multiple choice format are referred to as _.

A

Distractors or foils

28
Q

A selected response format where the testtaker is presented with 2 columns where they have to determine which response is best associated with which premise.

A

Matching item

29
Q

A multiple choice item format that contains only two possible responses (binary choice) (agree or not, yes or no, right or wrong, fact or opinion). It usually takes the form of a sentence.

A

True-false

30
Q

3 types of constructed response items:

A

Completion item
Short-answer item
Essay

31
Q

A constructed response format that requires the examinee to provide a word or phrase that completes a sentence.

A

Completion item

32
Q

A constructed response format where a word, term, sentence or paragraph may qualify as an answer.

A

Short-answer item

33
Q

A constructed response format that requires the testtaker to respond to a question by writing a composition, typically one that demonstrates recall of facts, understanding, analysis and/or interpretation.

A

Essay

34
Q

A relatively large and easily accessible collection of test questions.

A

Item bank

35
Q

An interactive, computer-administered test-taking process wherein items presented to the testtaker are based in part on the test takers’ performance on previous items.

A

Computerized adaptive testing

36
Q

It refers to the diminished utility of an assessment tool for distinguishing testtakers at the low end of the ability, trait or other attribute being measured. Testtakers who have not yet achieved such ability might fail all the items.

A

Floor effect

37
Q

It refers to the diminished utility of an assessment tool for distinguishing testtakers at the high end of the attribute being measured. Testtakers who answer all of the items correctly are likely to conclude that the test was too easy.

A

Ceiling effect

38
Q

The ability of the computer to tailor the content and order of presentation of test items on the basis of responses to previous items, rather than presenting items in a fixed or random order.

A

Item branching
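A minimal sketch of one possible branching rule (my own illustration, not the card's): step to a harder item after a correct response and an easier one after a miss, with difficulty expressed as the proportion passing (lower p = harder):

```python
def next_difficulty(current, correct, step=0.1, lo=0.0, hi=1.0):
    """Adaptive branching rule: present a harder item (lower p) after a
    correct response, an easier one (higher p) after an incorrect one."""
    new = current - step if correct else current + step
    return max(lo, min(hi, new))  # keep difficulty inside the valid range

d = 0.5                          # start at medium difficulty (p = .5)
for ans in [True, True, False]:  # two hits, then a miss
    d = next_difficulty(d, ans)
print(round(d, 1))  # 0.4: two steps harder, one step easier
```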

39
Q

What are the 3 different scoring models?

A

Cumulative model
Class scoring or Category scoring
Ipsative scoring

40
Q

Scoring model in which the higher the score on the test, the higher the testtaker is on the ability or characteristic that the test purports to measure.

A

Cumulative model

41
Q

Scoring model in which testtakers earn credit toward placement in a particular class or category with other testtakers whose patterns of responses are presumably similar in some way. Used by some diagnostic systems.

A

Class scoring or Category scoring

42
Q

Scoring model that compares a testtaker’s score on one scale within a test to another scale within that same test.

A

Ipsative scoring
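One way to sketch the ipsative idea is to express each scale score relative to the testtaker's own average across scales, so comparisons are made within the person rather than against a norm group (the scale names below are hypothetical):

```python
def ipsative_profile(scale_scores):
    """Express each scale score relative to the testtaker's own mean,
    so scores are compared within the person, not across people."""
    mean = sum(scale_scores.values()) / len(scale_scores)
    return {scale: score - mean for scale, score in scale_scores.items()}

# Hypothetical scales from one test, for a single testtaker
print(ipsative_profile({"dominance": 18, "affiliation": 12, "autonomy": 15}))
# dominance is 3 points above this person's own average, affiliation 3 below
```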

43
Q

The informal rule of thumb for test tryout is that there should be no fewer than _ subjects and preferably as many as _ for each item on the test.

A

5
10

44
Q

Factors that are actually just artifacts of a small sample size.

A

Phantom factors

45
Q

A lowercase italic "p" is used to denote _.

A

Item Difficulty
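The item difficulty index p is simply the proportion of testtakers who answered the item correctly; a quick sketch (the function name is my own):

```python
def item_difficulty(responses):
    """p: proportion of testtakers who answered the item correctly.
    The larger p is, the easier the item."""
    return sum(responses) / len(responses)

# 1 = correct, 0 = incorrect, for 10 testtakers on one item
print(item_difficulty([1, 1, 1, 0, 1, 0, 1, 1, 0, 1]))  # 0.7
```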

46
Q

The larger the item difficulty index, the _ the item.

A

Easier

47
Q

The optimal average item difficulty for maximum discrimination among the abilities of testtakers.

A

Approximately 0.5

48
Q

The range of difficulty for individual items on the test.

A

0.3-0.8

49
Q

For the possible effect of guessing, the optimal average item difficulty is usually the midpoint between _ and the chance success proportion.

A

1.00

50
Q

The probability of answering correctly by random guessing.

A

Chance success proportion
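The midpoint rule from the previous card can be worked through numerically: for a k-option multiple-choice item the chance success proportion is 1/k, and the optimal average difficulty is halfway between that and 1.00 (the function name is my own):

```python
def optimal_difficulty(n_options):
    """Midpoint between 1.00 and the chance success proportion (1 / k)."""
    chance = 1.0 / n_options
    return (1.0 + chance) / 2

print(optimal_difficulty(4))  # 4-option multiple choice: (1.00 + .25) / 2 = 0.625
print(optimal_difficulty(2))  # true-false item: (1.00 + .50) / 2 = 0.75
```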

51
Q

The higher the item-reliability index, the greater the test's _.

A

Internal consistency

52
Q

A statistical tool useful in determining whether items on a test appear to be measuring the same thing.

A

Factor analysis

53
Q

It is a statistic designed to provide an indication of the degree to which a test is measuring what it purports to measure.

A

Item-validity index

54
Q

The higher the item validity index, the greater the test’s _.

A

Criterion-related validity

55
Q

It compares performance on a particular item with performance in the upper and lower regions of a distribution of continuous test scores.

A

Item discrimination index

56
Q

Item discrimination index is symbolized by _.

A

Lowercase italic "d"

57
Q

The _ the value of d, the more adequately the item discriminates the higher-scoring from the lower-scoring testtakers.

A

Higher

58
Q

The highest possible value of d.

A

+/-1.00

59
Q

The value of d that indicates the item is not discriminating: the same proportion of members of the upper and lower groups pass the item.

A

0

60
Q

The lowest value that an index of item discrimination can take. It indicates that all members of the upper group failed the item and all members of the lower group passed it.

A
-1.00

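The preceding cards can be tied together in a short sketch of d = (U - L) / n, where U and L are the numbers of testtakers passing the item in the upper and lower scoring groups and n is the size of each group (the group sizes below are hypothetical):

```python
def discrimination_index(upper_pass, lower_pass, group_size):
    """d = (U - L) / n: difference between the number of high scorers (U)
    and low scorers (L) who passed the item, over the group size n."""
    return (upper_pass - lower_pass) / group_size

print(discrimination_index(27, 9, 32))   # 0.5625: discriminates well
print(discrimination_index(16, 16, 32))  # 0.0: no discrimination
print(discrimination_index(0, 32, 32))   # -1.0: lowest possible value
```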
61
Q

It is a graphic representation of item difficulty and discrimination. The steeper the slope, the greater the item discrimination.

A

Item-characteristic curves
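One common mathematical form for an item-characteristic curve (an assumption here, not something the card specifies) is a logistic function whose slope parameter a controls steepness, and hence discrimination:

```python
import math

def icc(theta, a, b):
    """Logistic item-characteristic curve: probability of a correct
    response at ability theta; a = slope (discrimination), b = difficulty."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# A larger a makes the curve steeper around b, so the item separates
# nearby ability levels more sharply.
print(icc(0.0, a=1.0, b=0.0))  # 0.5 at the item's own difficulty level
print(icc(1.0, a=2.0, b=0.0) > icc(1.0, a=0.5, b=0.0))  # True
```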

62
Q

It is an item that favors one particular group of examinees in relation to another when differences in group ability are controlled.

A

Biased test item

63
Q

It is exemplified by different shapes of item-characteristic curves for different groups when the 2 groups do not differ in total test score.

A

Differential item functioning

64
Q

These are techniques of data generation and analysis that rely primarily on verbal rather than mathematical or statistical procedures.

A

Qualitative methods

65
Q

It is a general term for various nonstatistical procedures designed to explore how individual test items work. It involves exploration of the issues through verbal means.

A

Qualitative item analysis

66
Q

A qualitative research tool designed to shed light on the testtaker’s thought processes during the administration of a test. They are asked to think aloud as they respond to each item.

A

“Think aloud” test administration

67
Q

A study of test items, typically conducted during the test development process, in which the items are examined for fairness to all prospective testtakers and for the presence of offensive language, stereotypes, or situations.

A

Sensitivity review

68
Q

It refers to the revalidation of a test on a sample of testtakers other than those on whom test performance was originally found to be a valid predictor of some criterion.

A

Cross validation

69
Q

The decrease in item validities that inevitably occurs after cross-validation of findings.

A

Validity shrinkage

70
Q

Test validation process conducted on two or more tests using the same sample of testtakers.

A

Co-validation

71
Q

Used in conjunction with the creation of norms or the revision of existing norms.

A

Co-norming

72
Q

Discrepancies between scorers are resolved by another scorer, who is called a _.

A

Resolver

73
Q

A protocol scored by a highly authoritative scorer, designed as a model for scoring and a mechanism for resolving scoring discrepancies.

A

Anchor protocol

74
Q

Discrepancy between scoring in an anchor protocol and the scoring of another protocol.

A

Scoring drift

75
Q

Items that respondents from different groups, at the same level of the underlying trait, have different probabilities of endorsing as a function of their group membership.

A

Differential item functioning (DIF) items