Writing and Evaluating Test Items Flashcards

1
Q

Writing test items can be difficult. DeVellis (2016) provided several simple guidelines for item writing:

A

Define clearly what you want to measure.

Generate an item pool.

Avoid exceptionally long items.

Keep the level of reading difficulty appropriate for those who will complete the scale.

Avoid “double-barreled” items that convey two or more ideas at the same time.

Consider mixing positively and negatively worded items.

2
Q

When writing items, you need to be sensitive to ethnic and cultural differences.

A

Items on the CES-D (Center for Epidemiologic Studies Depression Scale) concerning appetite, hopefulness, and social interactions may have a different meaning for African American respondents than for white respondents.

3
Q

Tests may become obsolete.

A

The Armed Services Vocational Aptitude Battery (ASVAB) was studied over a 16-year period. Approximately 12% of the items became less reliable over this time.

4
Q

Items that retained their reliability were more likely to focus on ____, while those that lost reliability focused on ______

A

skills, while those that lost reliability focused on more abstract concepts

5
Q

dichotomous format

A

Offers two alternatives for each item (e.g., true-false).

6
Q

advantages of dichotomous

A

Advantages of true-false items include their obvious simplicity, ease of administration, and quick scoring. Another attractive feature is that true-false items require absolute judgment: the test taker must endorse one of the two alternatives.

Dichotomous items also make the scoring of subscales easy. All that a tester needs to do is count the number of items a person endorses from each subscale.

7
Q

disadvantages of dichotomous

A

MEMORIZE: true-false items encourage students to memorize material, making it possible for students to perform well on a test that covers material they do not really understand.

COMPLEXITY: “truth” often comes in shades of gray, and true-false tests do not allow test takers the opportunity to show they understand this complexity.

MANY ITEMS: the mere chance of getting any item correct is 50%. Thus, to be reliable, a true-false test must include many items.

LESS RELIABLE/PRECISE: overall, dichotomous items tend to be less reliable, and therefore less precise, than some of the other item formats.

8
Q

polytomous format

A

Each item has more than two alternatives (e.g., multiple choice).

Typically, a point is given for the selection of the correct alternative, and no point is given for selecting any other choice.

9
Q

advantage of polytomous

A

The major advantage of this format is that it takes little time for test takers to respond to a particular item because they do not have to write. Thus, the test can cover a large amount of information in a relatively short time.

10
Q

issues in the construction and scoring of multiple-choice tests

A

How many distractors should an item have? Psychometric theory suggests that adding more distractors should increase the reliability of the items. However, in practice, adding distractors may not actually increase reliability because it is difficult to find good ones. The reliability of an item is not enhanced by distractors that no one would ever select.

Ineffective distractors may actually hurt the reliability of the test because they are time consuming to read and can limit the number of good items that can be included in a test.

It is usually best to develop three or four good distractors for each item.

11
Q

Guessing

A

Because test takers can get some “correct” answers simply by guessing, a correction for guessing is sometimes used. If a correction for guessing is used, then random guessing will do you no good.

Some speeded tests are scored so that the correction-for-guessing formula includes only the items that were attempted; that is, items that were not attempted are counted neither right nor wrong. In this case, random guessing and leaving the items blank have the same expected effect.
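
The cards describe the correction but never state a formula. A minimal sketch in Python, assuming the classic correction for guessing from the psychometrics literature (not necessarily the exact formula the source text uses):

```python
def corrected_score(num_right: int, num_wrong: int, num_choices: int) -> float:
    """Classic correction for guessing: S = R - W / (C - 1).

    Omitted (blank) items are counted neither right nor wrong. With C
    choices, a pure guesser gets about 1/C of the guessed items right and
    (C - 1)/C wrong, so the expected corrected score from guessing is zero.
    """
    return num_right - num_wrong / (num_choices - 1)

# A pure guesser on 100 four-choice items expects 25 right and 75 wrong:
print(corrected_score(25, 75, 4))  # 0.0 -- random guessing gains nothing
```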

12
Q

How about cases where you do not know the right answer but can eliminate one or two of the alternatives?

A

Experts advise you to guess.

The correction formula assumes that you are equally likely to respond to each of the four categories. For a four-choice item, it would estimate your chance of getting the item correct by chance alone to be 1 in 4. However, if you can eliminate two alternatives, then the chances are actually 1 in 2. This gives you a slight advantage over the correction formula.
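
A quick worked check (my arithmetic, using the classic R - W/(C - 1) correction sketched above): if you can eliminate two alternatives on a four-choice item, guessing between the remaining two succeeds with probability 1/2. The expected corrected gain per item is then (1/2)(1) - (1/2)(1/3) = 1/3 of a point, whereas leaving the item blank gains exactly 0. This is why guessing after eliminating alternatives pays off.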

13
Q

students are more likely to guess when …

A

they anticipate a low grade on a test than when they are more confident

14
Q

discourage guessing by

A

giving students partial credit for items left blank

15
Q

guessing threshold

A

Describes the chances that a low-ability test taker will obtain each score.

16
Q

Likert format, the category scale, and the Q-sort

A

These formats do not judge any response as “right” or “wrong.” Rather, they attempt to quantify the characteristics of the response.

17
Q

essay

A

Essays are commonly used in classroom evaluation, and the Educational Testing Service now uses a writing sample as a component of its testing programs.

The reliability of the scoring procedure should be assessed by determining the association between two scores provided by independent scorers.

In practice, however, the psychometric properties of essay exams are rarely evaluated.
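
A minimal sketch of the inter-scorer check described above, using hypothetical ratings (the source does not prescribe a specific statistic; Pearson correlation is a common choice):

```python
import numpy as np

# Hypothetical essay scores assigned by two independent scorers (8 essays)
scorer_a = np.array([4, 3, 5, 2, 4, 3, 5, 1])
scorer_b = np.array([4, 2, 5, 3, 4, 3, 4, 2])

# The correlation between scorers estimates scoring reliability
r = np.corrcoef(scorer_a, scorer_b)[0, 1]
print(f"inter-scorer reliability estimate: r = {r:.2f}")
```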

18
Q

Likert format

A

Respondents indicate their degree of agreement with a particular attitudinal statement.

In some applications, six options are used to avoid allowing the respondent to be neutral.

Scoring requires that any negatively worded items be reverse scored; the responses are then summed. This format is especially popular in measurements of attitude.

When Likert items are subjected to factor analysis, test developers can find groups of items that go together.
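
A minimal sketch of the scoring procedure just described, with a hypothetical three-item scale on 5-point responses (item names and data are made up for illustration):

```python
# Reverse-score negatively worded items on a 5-point Likert scale,
# then sum across items to obtain the total attitude score.
SCALE_MAX = 5
responses = {"item1": 4, "item2": 2, "item3": 5}  # hypothetical answers
negatively_worded = {"item2"}                      # hypothetical key

def score_likert(responses, negatively_worded, scale_max=SCALE_MAX):
    total = 0
    for item, value in responses.items():
        # Reverse scoring maps 1 -> 5, 2 -> 4, ..., 5 -> 1
        total += (scale_max + 1 - value) if item in negatively_worded else value
    return total

print(score_likert(responses, negatively_worded))  # 4 + 4 + 5 = 13
```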

19
Q

issues with likert

A

Critics have challenged the appropriateness of using traditional parametric statistics to analyze Likert responses because the data are at the ordinal rather than the interval level of measurement.

20
Q

category format.

A

Similar to the Likert format, but uses an even greater number of choices; 10-point rating systems are a common example.

21
Q

issues with category

A

Responses to items on 10-point scales are affected by the groupings of the people or things being rated.

A particular player rated as a 6 when he was on a team with many outstanding players might be rated as a 9 if he were judged with a group of poorly coordinated players. People will change ratings depending on context.

When given a group of objects to rate, subjects have a tendency to spread their responses evenly across the 10 categories.

Reliability and validity may be higher if all response options are clearly labeled, as opposed to labeling only the categories at the extremes.

22
Q

how to avoid problems with category format

A

This problem can be avoided if the endpoints of the scale are clearly defined and the subjects are frequently reminded of the definitions of the endpoints.

For example, testers might show subjects films that depict the performance of a player rated as 10 and other films showing what a rating of 1 means. Under these circumstances, the subjects are less likely to offer a response that is affected by the other stimuli in the group.

23
Q

category format: how many response categories are best? (10-point scales; optimum of 4 to 7)

A

N. H. Anderson (1991) has found that a 10-point scale provides substantial discrimination among objects for a wide variety of stimuli. Some evidence suggests that increasing the number of response categories may not increase reliability and validity.

Increasing the number of choices beyond nine or so can reduce reliability because responses may be more likely to include an element of randomness when there are so many alternatives that respondents cannot clearly discriminate between the fine-grained choices.

The optimum number of categories is between four and seven. Reliability suffers when fewer than four categories are used, but reliability does not increase much when more than seven categories are available.

24
Q

visual analogue scale

A

The respondent is given a 100-millimeter line and asked to place a mark between two well-defined endpoints.

Visual analogue scales are popular for measuring self-rated health. However, they are not used often for multi-item scales, because scoring is time-consuming.

25
Q

Checklists

A

Adjective checklist: the respondent receives a long list of adjectives and indicates whether each one is characteristic of himself or herself. This is a dichotomous format.

26
Q

Q-Sorts

A

Q-sorts can be used to describe oneself or to provide ratings of others.

A subject is given statements and asked to sort them into nine piles. In one classic application, observers were given 100 statements about personal characteristics; the statements were sorted into piles that indicated the degree to which they appeared to describe a given person accurately.

Most of the cards are usually placed in piles 4, 5, and 6; the frequency of items placed in each of the categories usually looks like a bell-shaped curve.

27
Q

item difficulty

A

Defined by the proportion of people who get a particular item correct.

For example, if 84% of the people taking a particular test get item 24 correct, then the difficulty level for that item is .84.

The higher the proportion of people who get the item correct, the easier the item.
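
A minimal sketch of the computation, using a hypothetical 0/1 response matrix:

```python
import numpy as np

# Rows = examinees, columns = items; 1 = correct, 0 = incorrect (hypothetical)
responses = np.array([
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [0, 1, 0],
])

# Item difficulty = proportion of examinees answering each item correctly
difficulty = responses.mean(axis=0)
print(difficulty)  # [0.75 0.75 0.25] -- higher values mean easier items
```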

28
Q

optimal difficulty level for items is usually

A

Halfway between 100% of the respondents getting the item correct and the level of success expected by chance alone.

The optimum difficulty level for a four-choice item is approximately .625. To arrive at this value, take the halfway point between the chance performance level (.25) and the 100% success level (1.00): .25 + (1.00 - .25)/2 = .625.
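
The same arithmetic generalized to any number of choices (my formulation of the rule stated above, not a formula quoted from the source):

```python
def optimal_difficulty(num_choices: int) -> float:
    # Halfway between chance performance (1/C) and perfect performance (1.0)
    chance = 1.0 / num_choices
    return chance + (1.0 - chance) / 2

print(optimal_difficulty(4))  # 0.625, matching the four-choice example
print(optimal_difficulty(2))  # 0.75, the analogous value for true-false items
```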

29
Q

In constructing a good test, one must also consider human factors

A

Though items answered correctly by all students will have poor psychometric qualities, they may help the morale of the students who take the test. A few easier items may help keep test anxiety in check, which in turn adds to the reliability of the test.

30
Q

Discriminability

A

examine the relationship between performance on particular items and performance on the whole test

31
Q

item discriminability

A

determines whether the people who have done well on particular items have also done well on the whole test

32
Q

The Extreme Group Method

A

Compares people who have done well with those who have done poorly on a test.

For example, you might find the students with test scores in the top third and those in the bottom third of the class. Then, you would find the proportion of people in each group who got each item correct. The difference between these proportions is called the discrimination index.
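
A minimal sketch of the extreme group method on hypothetical data:

```python
import numpy as np

# Hypothetical data: total test scores and 0/1 results on a single item
totals = np.array([55, 90, 72, 40, 88, 65, 30, 95, 50, 80, 60, 35])
item   = np.array([ 0,  1,  1,  0,  1,  0,  0,  1,  1,  1,  0,  0])

# Split examinees into the top and bottom thirds by total score
order = np.argsort(totals)
n = len(totals) // 3
bottom, top = order[:n], order[-n:]

# Discrimination index = p(correct | top third) - p(correct | bottom third)
d = item[top].mean() - item[bottom].mean()
print(f"discrimination index: {d:.2f}")  # 1.00 - 0.25 = 0.75 here
```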

33
Q

The Point Biserial Method

A

Find the correlation between performance on the item (scored right or wrong) and performance on the total test.

If this value is negative or low, then the item should be eliminated from the test. The closer the value of the index is to 1.0, the better the item discriminates.
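
A minimal sketch using the same hypothetical data as above (scipy's pointbiserialr computes the item-total correlation directly):

```python
import numpy as np
from scipy.stats import pointbiserialr

# Hypothetical data: 0/1 item scores and total test scores
item   = np.array([0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0])
totals = np.array([55, 90, 72, 40, 88, 65, 30, 95, 50, 80, 60, 35])

# Point-biserial correlation between the dichotomous item and the total
r, _ = pointbiserialr(item, totals)
print(f"point-biserial r = {r:.2f}")  # low or negative -> drop the item
```

On short tests, the item's own contribution is sometimes removed from the total before correlating (a corrected item-total correlation), so the item is not partly correlated with itself.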

34
Q

item characteristic curve.

A

A graph drawn for each individual test item. On these graphs, the total test score is plotted on the horizontal (X) axis and the proportion of examinees who get the item correct is plotted on the vertical (Y) axis.

The total test score is used as an estimate of the amount of a “trait” possessed by individuals. The relationship between performance on the item and performance on the test gives some information about how well the item is tapping the information we want.
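
A minimal sketch of building the empirical curve, using hypothetical data and crude score bins (real analyses choose bins more carefully):

```python
import numpy as np

# Hypothetical data: total scores (0-100) and 0/1 results on one item
totals = np.array([30, 35, 40, 50, 55, 60, 65, 72, 80, 88, 90, 95])
item   = np.array([ 0,  0,  0,  1,  0,  0,  1,  1,  1,  1,  1,  1])

# Bin examinees by total score and compute the proportion correct per bin;
# plotting proportion (Y) against the bin's score range (X) traces the curve.
bins = np.array([0, 50, 70, 101])
for lo, hi in zip(bins[:-1], bins[1:]):
    mask = (totals >= lo) & (totals < hi)
    print(f"scores {lo}-{hi - 1}: p(correct) = {item[mask].mean():.2f}")
```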

35
Q

item characteristic curve for a “good” test item

A

A gradual positive slope of the line demonstrates that the proportion of people who pass the item gradually increases as test scores increase. This means that the item successfully discriminates at all levels of test performance.

36
Q

“none of the above.”

A

Students who are exceptionally knowledgeable in the subject area can sometimes rule out all the choices even though one of the alternatives has actually been designated as correct.

37
Q

Classical Test Theory

A

A score is derived from the sum of an individual’s responses to various items, which are sampled from a larger domain that represents a specific trait or ability.

38
Q

item response theory

A

A newer approach to item analysis. In IRT, each item on a test has its own item characteristic curve that describes the probability of getting that particular item right or wrong given the ability level of each test taker.

With a computer, items can be sampled, and the specific range of items where the test taker begins to have difficulty can be identified. Testers can then make an ability judgment without subjecting the test taker to all of the test items.

39
Q

advantages of item response theory

A

IRT builds on traditional models of item analysis and can provide information on item functioning, the value of specific items, and the reliability of a scale.

The most important message for the test taker is that his or her score is no longer defined by the total number of items correct, but instead by the level of difficulty of the items that he or she can answer correctly.

40
Q

implications of IRT

A

Some people believe that IRT was the most important development in psychological testing in the second half of the 20th century.

41
Q

various approaches to the construction of tests using IRT: two dimensions
A

Difficulty and discriminability. Other approaches add a third dimension: the probability that test takers with the lowest levels of ability will get a correct response.
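
A minimal sketch of how those dimensions enter a response model. This is the standard three-parameter logistic (3PL) model from the IRT literature; the cards themselves do not name a specific model:

```python
import math

def p_correct(theta: float, a: float, b: float, c: float = 0.0) -> float:
    """Three-parameter logistic (3PL) item characteristic curve.

    theta: test taker's ability
    a: discriminability (slope of the curve)
    b: difficulty (ability level where the curve rises fastest)
    c: pseudo-guessing floor (chance of success at the lowest ability levels)
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# A four-choice item: guessing floor near .25, moderate discrimination
for theta in (-2.0, 0.0, 2.0):
    print(theta, round(p_correct(theta, a=1.5, b=0.0, c=0.25), 2))
# Low ability stays near .25; high ability approaches 1.0
```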

42
Q

advantage of IRT

A

IRT-based tests are easily adapted for computer administration. The computer can rapidly identify the specific items that are required to assess a particular ability level.

Test takers do not have to suffer the embarrassment of attempting multiple items beyond their ability, or waste time and effort on items below their capability.

43
Q

disadvantage of IRT

A

Though the precision of the test is best for those at average ability levels, those with the lowest or highest ability levels are not well assessed by this type of test

44
Q

“rectangular conventional”

A

Requires that test items be selected to create a wide range in level of difficulty. The items are pretested and selected to cover evenly the span from easiest to most difficult.

The problem with this approach is that only a few items of the test are appropriate for individuals at each ability level; that is, many test takers spend much of their time responding to items either considerably below their ability level or too difficult to solve. As a result, measurement precision is constant across the range of test-taker abilities but relatively low for all people.

45
Q

IRT addresses traditional problems in test construction well

A

IRT can handle items that are written in different formats.

IRT can identify respondents with unusual response patterns and offer insights into the cognitive processes of the test taker.

Use of IRT may also reduce biases against people who are slow in completing test problems. In other words, by presenting questions at the test taker’s ability level, IRT and computer-adaptive testing allow the defined time spent taking the test to be used most efficiently by test takers.

46
Q

internal criteria

A

Item analysis has been persistently plagued by researchers’ continued dependence on internal criteria, or total test score, for evaluating items.

The examples we have just given demonstrate how to compare performance on an item with performance on the total test.

47
Q

external criterion

A

For example, if you were building a test to select airplane pilots, you might want to evaluate how well the individual items predict success in pilot training or flying performance. The advantages of using external rather than internal criteria to validate items have long been noted in the measurement literature.

External criteria are rarely used in practice.

48
Q

Linking Uncommon Measures

A

One challenge in test applications is how to determine linkages between two different measures.

Interpretation of the test results for students who took the test at different times requires that scores on each administration have the same meaning, even though the tests include different items; that is, we assume that a score of 600 means the same thing for two students even though they completed different tests.

However, researchers have concluded that it is not feasible to compare the wide array of commercial and state achievement tests to one another, and that developing transformation methods for individual scores should not be done.

49
Q

Items for Criterion-Referenced Tests

A

The traditional use of tests requires that we determine how well someone has done on a test by comparing the person’s performance to that of others.

A criterion-referenced test, by contrast, compares performance with some clearly defined criterion for learning. This approach is popular in individualized instruction programs. For each student, a set of objectives is defined that states exactly what the student should be able to do after an educational experience. The test is then used to determine whether each objective has been achieved.

Many educators regard criterion-referenced tests as diagnostic instruments.

50
Q

first step in developing criterion-referenced tests

A

Clearly specifying the objectives by writing clear and precise statements about what the learning program is attempting to achieve.

These statements are usually stated in terms of something the student will be able to do.

51
Q

To evaluate the items in the criterion-referenced test

A

One should give the test to two groups of students: one that has been exposed to the learning unit and one that has not.

Plotting the two groups’ scores typically yields a V-shaped distribution. The bottom of the V is the antimode, or the least frequent score. This point divides those who have been exposed to the unit from those who have not and is usually taken as the cutting score, or the point that marks the decision.
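
A minimal sketch of locating the antimode from the combined score distribution, using hypothetical scores (real work would smooth the histogram first):

```python
import numpy as np

# Hypothetical scores: unexposed students cluster low, exposed cluster high
scores = np.array([2, 3, 3, 4, 4, 5, 5, 6, 9, 10, 10, 11, 11, 12, 12, 13])

# Histogram over the score range; the antimode is the least frequent score
# between the two modes and serves as the cutting score
counts, edges = np.histogram(scores, bins=range(2, 15))
antimode_bin = np.argmin(counts)
print(f"cutting score near {edges[antimode_bin]}")  # prints 7 for these data
```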

52
Q

Criterion-referenced tests offer many advantages to newer educational approaches.

A

In computer-assisted instruction, each student works at his or her own pace on an individualized program of instruction, after which a criterion-referenced test is used to evaluate progress. Students who pass the test can move on to the next unit; students who do not pass can repeat some of the instruction until they pass.
Students who pass the test can move on to the next unit. Students who do not pass can repeat some of the instruction until they pass