Chapter 8: Test Development Flashcards

1
Q

Stages in the Process of Developing a Test

A
Test Conceptualization
Test Construction
Test Tryout 
Item Analysis
Test Revision
2
Q

Test Construction

A

Drafting of items for the test

3
Q

Test Tryout

A

First draft of the test is then tried out on a group of sample testtakers

4
Q

Item Analysis

A

When statistical procedures are employed to assist in making judgments about which items are good as they are, which items need to be revised, and which items should be discarded
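
As a concrete illustration of the statistics involved, here is a minimal Python sketch (not from the chapter) computing two classical item-analysis statistics: an item difficulty index (proportion answering correctly) and a discrimination index based on upper- and lower-scoring groups. The response matrix and the 27% grouping fraction are illustrative assumptions.

```python
# Minimal classical item analysis on a toy matrix of scored responses
# (1 = correct, 0 = incorrect); rows are testtakers, columns are items.

def item_difficulty(responses):
    """Item p value: proportion of testtakers answering each item correctly."""
    n, k = len(responses), len(responses[0])
    return [sum(row[j] for row in responses) / n for j in range(k)]

def discrimination_index(responses, frac=0.27):
    """d = p(upper group) - p(lower group), using extreme-scoring groups."""
    ranked = sorted(responses, key=sum, reverse=True)
    g = max(1, int(len(ranked) * frac))
    upper, lower = ranked[:g], ranked[-g:]
    k = len(responses[0])
    return [sum(r[j] for r in upper) / g - sum(r[j] for r in lower) / g
            for j in range(k)]

scores = [  # illustrative data: five testtakers by four items
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 1],
    [0, 1, 0, 0],
    [0, 0, 0, 1],
]
print(item_difficulty(scores))       # p per item
print(discrimination_index(scores))  # items near zero or negative need review
```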

5
Q

Analysis of the Test’s Items Include

A

Analyses of item reliability
Analyses of item validity
Analyses of item discrimination

6
Q

Test Conceptualization

A

The thought that there ought to be a test designed to measure (____) in a (____) way; the stimulus for conceptualizing a test could be almost anything; includes a review of the related literature on existing tests

7
Q

Preliminary Questions to Ask During Test Conceptualization

A

What is the test designed to Measure?
What is the objective of the test?
Is there a need for this test?

8
Q

What is the test designed to Measure?

A

Closely linked to how the test developer defines the construct being measured and how that definition is the same as or different from other tests purporting to measure the same construct

9
Q

What is the objective of the test?

A

In service of what goal will the test be employed? In what way or ways is the objective of this test the same as or different from other tests with similar goals? What real-world behaviors would be anticipated to correlate with testtaker responses?

10
Q

Is there a need for this test?

A

Are there any other tests purporting to measure the same thing? In what ways will the new test be better than or different from existing ones? Will there be more compelling evidence for its reliability or validity? Will it be more comprehensive? Will it take less time to administer? In what ways would this test not be better than existing tests?

11
Q

Preliminary Questions to be Addressed

A

Who will use this test?
Who will take this test?
What content will the test cover?
How will the test be administered?
What is the ideal format of the test?
Should more than one form of the test be developed?
What special training will be required of test users for administering and interpreting the test?
What types of responses will be required of testtakers?
Who benefits from an administration of this test?
Is there any potential for harm as a result of an administration of this test?
How will meaning be attributed to scores on this test?

12
Q

Who will use this test?

A

Clinicians? Educators? Others? For what purpose or purposes would this test be used?

13
Q

Who will take this test?

A

Who is this test for? Who needs to take it? Who would find it desirable to take it? For what age range of testtakers is the test designed? What reading level is required of a testtaker? What cultural factors might affect the testtaker response?

14
Q

What content will the test cover?

A

Why should it cover this content? Is this coverage different from the content coverage of existing tests with the same or similar objectives? How and why is the content area different? To what extent is this content culture-specific?

15
Q

How will the test be administered?

A

Individually or in groups? Is it amenable to both group and individual administration? What differences will exist between individual and group administrations of this test? Will the test be designed for or amenable to computer administration? How might differences between versions of the test be reflected in test scores?

16
Q

What is the ideal format of the test?

A

Should it be true-false, essay, multiple-choice, or in some other format? Why is the format selected for this test the best format?

17
Q

Should more than one form of the test be developed?

A

On the basis of a cost-benefit analysis, should alternate or parallel forms of this test be created?

18
Q

What special training will be required of test users for administering or interpreting the test?

A

What background and qualifications will a prospective user of data derived from an administration of this test need to have? What restrictions, if any, should be placed on distributors of the test and on the test’s usage?

19
Q

What types of responses will be required of testtakers?

A

What kind of disability might preclude someone from being able to take this test? What adaptations or accommodations are recommended for persons with disabilities?

20
Q

Who benefits from an administration of this test?

A

What would the testtaker learn, or how might the testtaker benefit, from an administration of this test? What would the test user learn, or how might the test user benefit? What social benefit, if any, derives from an administration of this test?

21
Q

Is there any potential for harm as the result of an administration of this test?

A

What safeguards are built into the recommended testing procedure to prevent any sort of harm to any of the parties involved in the use of this test?

22
Q

How will meaning be attributed to scores on this test?

A

Will a testtaker’s score be compared to others taking the test at the same time? To others in a criterion group? Will the test evaluate mastery of a particular content area?

23
Q

Good item on a Norm-referenced Test

A

An item that high scorers on the test tend to answer correctly and that low scorers on the test tend to answer incorrectly

24
Q

Good item on a Criterion-Oriented Test

A

High scorers on the test get a particular item right whereas low scorers on the test get that same item wrong; each item should address the issue of whether the testtaker has met certain criteria

25
Q

Pilot Work/Pilot Study/Pilot Research

A

Refers to the preliminary research surrounding the creation of a prototype of the test; test items may be piloted to evaluate whether they should be included in the final form of the instrument; may involve open-ended interviews with research subjects believed for some reason (perhaps on the basis of an existing test) to be good sources of information about the targeted construct; the developer attempts to determine how best to measure that construct

26
Q

Pilot Work Process

A
Entails the creation, revision, and deletion of many test items, along with literature reviews, experimentation, and related activities
27
Q

Scaling

A

Assignment of numbers according to rules; defined as the process of setting rules for assigning numbers in measurement; the process by which a measuring device is designed and calibrated and by which numbers (or other indices), called scale values, are assigned to different amounts of the trait, attribute, or characteristic being measured

28
Q

Age-Based Scale

A

If the Testtaker’s test performance as a function of age is of critical interest

29
Q

Grade-Based Scale

A

If the testtaker’s test performance as a function of grade is of critical interest

30
Q

Stanine Scale

A

If all raw scores on the test are to be transformed into scores that can range from 1 to 9
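
A minimal Python sketch of one common way to make this transformation, assuming raw scores are first converted to z scores and then cut at the usual half-standard-deviation stanine boundaries; the sample scores are invented.

```python
# Convert raw scores to stanines (1-9) via z scores and fixed cut points.
from statistics import mean, stdev

def to_stanines(raw_scores):
    m, s = mean(raw_scores), stdev(raw_scores)
    cuts = [-1.75, -1.25, -0.75, -0.25, 0.25, 0.75, 1.25, 1.75]
    # a score's stanine is 1 plus the number of cut points its z exceeds
    return [1 + sum((x - m) / s > c for c in cuts) for x in raw_scores]

raw = [42, 55, 61, 48, 70, 35, 58, 66, 50, 44]
print(to_stanines(raw))  # every value falls in 1..9, centered on 5
```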

31
Q

Categorization of a Test Scale

A

Unidimensional vs. Multidimensional

Comparative vs. Categorical

32
Q

Rating Scale

A

Defined as a grouping of words, statements, or symbols on which judgments of the strength of a particular trait, attitude, or emotion are indicated by the testtaker; can be used to record judgments of oneself, others, experiences, or objects, and can take several forms

33
Q

Summative Scale

A

When the final test score is obtained by summing the ratings across all the items

34
Q

Likert Scale

A

Used extensively in psychology, usually to scale attitudes; each item presents alternative responses (classically five) along an agree-disagree or approve-disapprove continuum; relatively easy to construct
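
A minimal Python sketch of summative scoring applied to Likert-type items, assuming 5-point ratings with some reverse-keyed items; the keys and responses are invented.

```python
# Summative (Likert-type) scoring with reverse-keyed items flipped (1<->5, 2<->4).

def likert_total(ratings, reverse_keyed, points=5):
    return sum((points + 1 - r) if i in reverse_keyed else r
               for i, r in enumerate(ratings))

answers = [4, 2, 5, 1, 3]                           # one testtaker, five items
print(likert_total(answers, reverse_keyed={1, 3}))  # items 2 and 4 reverse-keyed -> 21
```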

35
Q

Method of Paired Comparisons

A

Testtakers are presented with pairs of stimuli that they are asked to compare, selecting one of the stimuli according to some rule, e.g., that they agree more with one statement than the other, or that they find one stimulus more appealing than the other
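
A minimal Python sketch of one way to score such judgments, assuming each record notes which of two stimuli the testtaker selected; tallying selections across all pairs yields an ordering of the stimuli. The stimuli and choices are invented.

```python
# Tally paired-comparison selections into a preference ordering.
from collections import Counter

judgments = [  # (stimulus A, stimulus B, chosen) -- illustrative records
    ("slogan1", "slogan2", "slogan1"),
    ("slogan1", "slogan3", "slogan3"),
    ("slogan2", "slogan3", "slogan3"),
]

stimuli = {s for a, b, _ in judgments for s in (a, b)}
wins = Counter({s: 0 for s in stimuli})          # start every stimulus at zero
wins.update(choice for *_, choice in judgments)  # count how often each was chosen

for stimulus, count in wins.most_common():
    print(stimulus, count)  # stimuli ordered by how often they were preferred
```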

36
Q

Comparative Scaling

A

Entails judgments of a stimulus in comparison with every other stimulus on the scale

37
Q

Categorical Scaling

A

Scaling system that relies on sorting; stimuli are placed into one of two or more alternative categories that differ quantitatively with respect to some continuum

38
Q

Guttman Scale

A

Another scaling method that yields ordinal-level measures; items range sequentially from weaker to stronger expressions of the attitude, belief, or feeling being measured; all respondents who agree with the stronger statements of the attitude will also agree with milder statements

39
Q

Scalogram Analysis

A

Item-analysis procedure and approach to test development that involves a graphic mapping of a testtaker’s responses; Objective for the developer of a measure of attitudes is to obtain an arrangement of items wherein endorsement of one item automatically connotes endorsement of less extreme positions
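
A minimal Python sketch of the kind of pattern checking this involves, assuming items are ordered from mildest (left) to strongest (right), 1 means endorsement, and errors are counted as deviations from the nearest perfect 1s-then-0s pattern. The response patterns are invented, and the .90 reproducibility benchmark is the commonly cited rule of thumb, not a figure from the chapter.

```python
# Check how well response patterns fit a Guttman (1s-then-0s) structure.

def guttman_errors(pattern):
    """Fewest cell changes to turn `pattern` into a perfect 1s-then-0s run."""
    best = len(pattern)
    for cut in range(len(pattern) + 1):
        ideal = [1] * cut + [0] * (len(pattern) - cut)
        best = min(best, sum(a != b for a, b in zip(pattern, ideal)))
    return best

respondents = [  # items ordered mildest -> strongest; illustrative data
    [1, 1, 1, 0],  # perfect pattern
    [1, 1, 0, 0],  # perfect pattern
    [1, 0, 1, 0],  # one deviation
]
errors = sum(guttman_errors(r) for r in respondents)
cells = sum(len(r) for r in respondents)
print(1 - errors / cells)  # coefficient of reproducibility (~.92 here; .90+ is the usual benchmark)
```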

40
Q

How to Create a Scale using Thurstone’s equal-appearing interval method

A

1. A reasonably large number of statements reflecting positive and negative attitudes toward a topic are collected.
2. Judges or experts evaluate each statement in terms of how strongly it indicates that the topic is justified; each judge is instructed to rate each statement on a scale as if the scale were interval in nature.
3. A mean and a standard deviation of the judges’ ratings are calculated for each statement.
4. Items are selected for inclusion in the final scale based on several criteria, including (a) the degree to which the item contributes to a comprehensive measurement of the variable in question and (b) the test developer’s degree of confidence that the items have indeed been sorted into equal intervals.
5. The scale is now ready for administration; the way the scale is used depends on the objectives of the test situation.
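
A minimal Python sketch of steps 2 through 4 above, assuming five judges rate each statement on an 11-point scale; the ratings and the retention cutoff are invented.

```python
# Thurstone-style item screening: scale value = mean judge rating,
# ambiguity = spread of judge ratings (high spread -> drop the statement).
from statistics import mean, stdev

judge_ratings = {  # statement -> ratings from five judges (1..11), illustrative
    "statement_a": [2, 3, 2, 2, 3],
    "statement_b": [6, 5, 7, 6, 6],
    "statement_c": [10, 2, 6, 9, 3],  # judges disagree -> ambiguous
}

for stmt, ratings in judge_ratings.items():
    m, s = mean(ratings), stdev(ratings)
    keep = s < 2.0  # illustrative cutoff on judge disagreement
    print(f"{stmt}: scale value={m:.1f}, spread={s:.2f}, keep={keep}")
```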

41
Q

Scaling Method Employed Depends on

A

Variables being measured
Group for whom the test is intended
Preferences of the test developer

42
Q

Questions to Ask for the Test Blueprint

A

What range of content should the items cover?
Which of the many different types of item formats should be employed?
How many items should be written in total and for each content area covered?

43
Q

Item Pool

A

Reservoir or well from which test items will or will not be drawn for the final version of the test

44
Q

Item Format

A

Includes variables such as the form, plan, structure, arrangement, and layout of individual test items

45
Q

Types of Response Formats

A

Selected-Response Format

Constructed-Response Format

46
Q

Selected-Response Format

A

Requires testtakers to select a response from a set of alternative responses

47
Q

Constructed-Response Format

A

Requires testtakers to supply or to create the correct answer, not merely to select it

48
Q

Types of Selected-Response Item Formats

A

Multiple Choice
Matching
True or False

49
Q

Elements of Multiple-Choice Format

A

Stem
Correct Alternative or option
Several incorrect alternatives or options variously referred to as distractors or foils

50
Q

Characteristics of a good multiple-choice item in an achievement test

A

Has one correct alternative
Has grammatically parallel alternatives
Has alternatives of similar length
Has alternatives that fit grammatically with the stem
Includes as much of the item as possible in the stem to avoid unnecessary repetition
Avoids ridiculous distractors

51
Q

Matching Item

A

Testtaker is presented with two columns: premises on the left and responses on the right

52
Q

Binary Choice Item

A

Multiple-choice item that contains only two possible responses

53
Q

True-False Item

A

The most familiar binary-choice item; type of selected-response item which takes the form of a sentence that requires the testtaker to indicate whether the statement is or is not a fact

54
Q

Good Binary Choice Item

A

Contains a single idea, is not excessively long, and is not subject to debate; the correct response must undoubtedly be one of the two choices

55
Q

Completion Item

A

Requires the examinee to provide a word or phrase that completes a sentence; also known as Short-Answer Item

56
Q

Good Completion Item

A

Should be worded so that the correct answer is specific; Should be written clearly enough that the testtaker can respond succinctly (with a short answer)

57
Q

Essay Item

A

Useful when the test developer wants the examinee to demonstrate a depth of knowledge about a single topic; permits restating of learned material and allows for the creative integration and expression of the material in the testtaker’s own words; drawbacks include subjectivity in scoring and inter-scorer differences

58
Q

Item Bank

A

Relatively large and accessible collection of test questions; advantage is accessibility to a large number of test items conveniently classified by subject area, item statistics, or other variables

59
Q

Item Branching

A

Technique with the ability to individualize testing; ability of the computer to tailor the content and order of presentation of test items on the basis of responses to previous items

60
Q

Computerized Adaptive Testing (CAT)

A

Refers to an interactive, computer-administered testtaking procedure wherein items presented to the testtaker are based in part on the testtaker’s performance on previous items; tends to reduce floor effects and ceiling effects
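
A minimal Python sketch of the adaptive logic only, assuming a small item bank tagged with difficulty values and a simple step rule (harder after a correct response, easier after an incorrect one). Operational CAT systems select items from IRT ability estimates, so this toy rule merely illustrates the tailoring idea.

```python
# Toy adaptive test: pick the unused item closest to a moving difficulty target.
import random

bank = [{"id": i, "difficulty": d}
        for i, d in enumerate([-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0])]

def administer(n_items=5, start=0.0, step=0.5):
    target, asked = start, set()
    for _ in range(n_items):
        item = min((i for i in bank if i["id"] not in asked),
                   key=lambda i: abs(i["difficulty"] - target))
        asked.add(item["id"])
        correct = random.random() < 0.5        # stand-in for a real response
        target += step if correct else -step   # branch harder or easier
        print(item["id"], item["difficulty"], correct)

administer()
```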

61
Q

Floor effect

A

Refers to the diminished utility of an assessment tool for distinguishing testtakers at the low end of the ability, trait, or other attribute being measured

62
Q

Ceiling Effect

A

Refers to the diminished utility of an assessment tool for distinguishing testtakers at the high end of the ability, trait, or other attribute being measured

63
Q

Class or Category Scoring

A

Employs testtaker responses which earn credit toward placement in a particular class or category with other testtakers whose pattern of responses is presumably similar in some way; used by some diagnostic systems wherein individuals must exhibit a certain number of symptoms to qualify for a specific diagnosis

64
Q

Ipsative Scoring

A

Comparing a testtaker’s score on one scale within a test to another scale within that same test
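
A minimal Python sketch of what ipsative interpretation looks like computationally, assuming four scale scores from one testtaker; each scale is expressed relative to the person’s own mean, so conclusions are within-person only (e.g., a stronger need for achievement than for affiliation), never comparisons with other testtakers. The scale names and scores are invented.

```python
# Express each scale score relative to the testtaker's own average.
from statistics import mean

scales = {"achievement": 18, "affiliation": 12, "autonomy": 15, "dominance": 9}
own_mean = mean(scales.values())

for scale, score in scales.items():
    print(f"{scale}: {score - own_mean:+.1f} relative to this testtaker's average")
```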

65
Q

Edwards Personal Preference Schedule

A

The EPPS is designed to measure the relative strength of different psychological needs

66
Q

Formal Item-Analysis

A

Cross-Validation
Co-Validation
Quality Assurance During Test Revision

67
Q

Tests Due For Revision When The Following Conditions Exist

A

Stimulus materials look dated and current testtakers cannot relate to them.
Verbal content of the test, including the administration instructions and the test items, contains dated vocabulary that is not readily understood by current testtakers.
As popular culture changes and words take on new meanings, certain words or expressions in the test items or directions may be perceived as inappropriate or even offensive to a particular group and must therefore be changed.
Test norms are no longer adequate as a result of age-related shifts in the abilities measured over time, and so an age extension of the norms (upward, downward, or in both directions) is necessary
The reliability or the validity of the test, as well as the effectiveness of individual test items, can be significantly improved by a revision
The theory on which the test was originally based has been improved significantly, and these changes should be reflected in the design and content of the test.

68
Q

Cross-Validation

A

Refers to the revalidation of a test on a sample of testtakers other than those on whom test performance was originally found to be a valid predictor of some criterion

69
Q

Validity Shrinkage

A

The decrease in item validities that inevitably occurs after cross-validation of findings; expected and viewed as integral to the test development process; infinitely preferable to a scenario wherein high item validities are published in a test manual as a result of inappropriately using the identical sample of testtakers for test standardization and cross-validation of findings
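
A minimal Python simulation of why shrinkage occurs, assuming items are chosen for a composite because they happened to correlate best with the criterion in the derivation sample (capitalizing on chance); on an independent sample the composite’s validity collapses toward its true value, here null. All data are simulated.

```python
# Demonstrate validity shrinkage: select items on one sample, revalidate on another.
import random

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def sample(n, k):
    items = [[random.gauss(0, 1) for _ in range(k)] for _ in range(n)]
    criterion = [random.gauss(0, 1) for _ in range(n)]  # truly unrelated to items
    return items, criterion

random.seed(7)
n, k = 30, 40
deriv_items, deriv_crit = sample(n, k)
cross_items, cross_crit = sample(n, k)

# keep the 5 items that happen to correlate best in the derivation sample
item_rs = [(pearson_r([row[j] for row in deriv_items], deriv_crit), j) for j in range(k)]
best = [j for _, j in sorted(item_rs, reverse=True)[:5]]

def composite(items):  # total score over the selected items
    return [sum(row[j] for j in best) for row in items]

print("derivation validity:      ", round(pearson_r(composite(deriv_items), deriv_crit), 2))
print("cross-validation validity:", round(pearson_r(composite(cross_items), cross_crit), 2))
```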

70
Q

Test Manual

A

Should outline the test development procedures used

Reliability information, including test-retest reliability and internal consistency estimates

71
Q

Co-Validation

A

Defined as a test validation process conducted on two or more tests using the same sample of testtakers

72
Q

Co-Norming

A

Process that occurs when co-validation is used in conjunction with the creation of norms or the revision of existing norms

73
Q

Anchor Protocol

A

Test protocol scored by a highly authoritative scorer that is designed to serve as a model for scoring and as a mechanism for resolving scoring discrepancies

74
Q

Scoring Drift

A

A discrepancy between scoring in an anchor protocol and the scoring of another protocol

75
Q

Roles of IRT in Test Construction

A

Evaluating existing tests for the purpose of mapping test revisions
Determining measurement equivalence across testtaker populations
Developing item banks

76
Q

IRT Information Curves

A

Help test developers evaluate how well an individual item (or entire test) is working to measure different levels of the underlying construct
Can be used to weed out uninformative questions
Eliminate redundant items to tailor an instrument to provide high information (Precision)
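
A minimal Python sketch of an item information function under the two-parameter logistic (2PL) model, where P(theta) = 1 / (1 + exp(-a(theta - b))) and I(theta) = a^2 * P * (1 - P); the discrimination (a) and difficulty (b) values are illustrative.

```python
# 2PL item information: peaks where the item measures most precisely (near theta = b).
import math

def p_2pl(theta, a, b):
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    p = p_2pl(theta, a, b)
    return a ** 2 * p * (1 - p)

for theta in (-2, -1, 0, 1, 2):
    # a highly discriminating item (a = 2.0, b = 0.0) is informative only near theta = 0
    print(theta, round(item_information(theta, a=2.0, b=0.0), 3))
```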

77
Q

Differential Item Functioning (DIF)

A

Phenomenon wherein an item functions differently in one group of testtakers as compared with another group of testtakers known to have the same (or similar) level of the underlying trait

78
Q

DIF Analysis

A

A process by which test developers scrutinize group-by-group item response curves, looking for DIF items; also used to evaluate the effect of different test administration procedures and item-ordering effects
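
A minimal Python sketch of the matching logic behind DIF screening, assuming testtakers are first stratified on total score as a proxy for the underlying trait and the item’s proportion correct is then compared across groups within each stratum; the records are invented.

```python
# Compare per-group proportion correct on one item within matched score strata.
from collections import defaultdict

records = [  # (group, total-score stratum, item correct) -- illustrative
    ("A", 1, 0), ("A", 1, 0), ("A", 2, 1), ("A", 2, 1), ("A", 3, 1),
    ("B", 1, 0), ("B", 1, 0), ("B", 2, 0), ("B", 2, 0), ("B", 3, 1),
]

cells = defaultdict(list)
for group, stratum, correct in records:
    cells[(group, stratum)].append(correct)

for stratum in sorted({s for _, s, _ in records}):
    rates = {g: sum(cells[(g, stratum)]) / len(cells[(g, stratum)])
             for g in ("A", "B") if cells[(g, stratum)]}
    print(stratum, rates)  # groups diverge at stratum 2 despite equal totals -> DIF flag
```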

79
Q

DIF Items

A

Items that respondents from different groups, at the same level of the underlying trait, have different probabilities of endorsing as a function of their group membership