Writing and Evaluating Test Items Flashcards
Writing test items can be difficult. DeVellis (2016) provided several simple
guidelines for item writing:
Define clearly what you want to measure.
Generate an item pool.
Avoid exceptionally long items.
Keep the level of reading difficulty appropriate for those who will complete the scale.
Avoid “double-barreled” items that convey two or more ideas at the same time.
Consider mixing positively and negatively worded items.
When writing items, you need to be sensitive to ethnic and cultural differences.
For example, items on the CES-D concerning appetite, hopefulness, and social interactions may have a different meaning for African American respondents than for white respondents.
tests may become obsolete.
The Armed Services Vocational Aptitude Battery was studied over a 16-year period. Approximately 12% of the items became less reliable over this time.
Items that retained their reliability were more likely to focus on ____, while those that lost reliability focused on ______
Items that retained their reliability were more likely to focus on skills, while those that lost reliability focused on more abstract concepts.
dichotomous format
two alternatives for each item
advantages of dichotomous
advantages of true-false items include their obvious simplicity, ease of
administration, and quick scoring. Another attractive feature is that the true-false items require absolute judgment. The test taker must declare one of the two
alternatives
They also make the scoring of subscales easy: all that a tester needs to do is count the number of items a person endorses from each subscale.
disadvantages of dichotomous
MEMORIZE - true-false items encourage students to memorize material, making it possible for students to perform well on a test that covers material they do not really understand.
COMPLEXITY - “truth” often comes in shades of gray, and true-false tests do not allow test takers the opportunity to show they understand this complexity.
MANY ITEMS - the mere chance of getting any item correct is 50%. Thus, to be reliable, a true-false test must include many items.
LESS RELIABLE/PRECISE - overall, dichotomous items tend to be less reliable, and therefore less precise, than some of the other item formats.
polytomous format
each item has more than two alternatives
a point is given for the selection of one of the alternatives, and no point is given for selecting any other choice
advantage of polytomous
The major advantage of this format is that it takes little time for test takers to respond to a particular item because they do not have to write. Thus, the test can cover a large amount of information in a relatively short time.
issues in the construction and
scoring of multiple-choice tests
How many distractors should an item have? Psychometric theory suggests that adding more distractors should increase the reliability of the items. However, in practice, adding distractors may not actually increase the reliability because it is difficult to find good ones. The reliability of an item is not enhanced by distractors that no one would ever select.
Ineffective distractors actually may hurt the reliability of the test because they are time consuming to read and can limit the number of good items that can be included in a test.
It is usually best to develop three or four good distractors for each item.
Guessing
Because test takers can get some “correct” answers simply by guessing, a correction for guessing is sometimes used. If a correction for guessing is used, then random guessing will do you no good. Some speeded tests are scored so that the correction-for-guessing formula includes only the items that were attempted; items that were not attempted are counted neither right nor wrong. In this case, random guessing and leaving the items blank have the same expected effect.
How about cases where you do not know the right answer but can eliminate one
or two of the alternatives?
In this case, you are advised to guess.
The correction formula assumes that you are equally likely to respond to each of the four categories. For a four-choice item, it would estimate your chance of getting the item correct by chance alone to be 1 in 4. However, if you can eliminate two alternatives, then the chances are actually 1 in 2. This gives you a
slight advantage over the correction formula
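The standard correction formula, score = R - W/(k-1) for k-choice items, can be sketched in code (the function name is illustrative, not from the text):

```python
def corrected_score(num_right, num_wrong, num_choices):
    """Classic correction for guessing: each wrong answer is penalized by
    1/(k-1), so that pure random guessing has an expected benefit of zero.
    Items left blank count neither right nor wrong."""
    return num_right - num_wrong / (num_choices - 1)

# A test taker who randomly guesses on 20 four-choice items expects
# 5 right and 15 wrong; the correction wipes out the 5 lucky hits:
print(corrected_score(5, 15, 4))   # 0.0

# If two of four alternatives can be eliminated, guessing 20 items is a
# 50/50 bet (about 10 right, 10 wrong), which beats the penalty:
print(corrected_score(10, 10, 4))  # ~6.67, better than chance under the formula
```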
students are more likely to guess when …
they anticipate a low grade on a test than when they are more confident
discourage guessing by
giving students partial credit for items left blank
guessing threshold
describes the chances that a low-ability
test taker will obtain each score
Likert format, the category scale, and the Q-sort
do not judge any response as “right”
or “wrong.” Rather, they attempt to quantify the characteristics of the response
essay
commonly used in classroom evaluation, and the Educational Testing Service now uses a writing sample as a component of its testing
programs.
reliability of the scoring procedure should be assessed by determining the association between two scores provided by independent scorers.
In practice, however, the psychometric properties of essay exams are rarely evaluated
Likert format
respondents indicate the degree of agreement with a particular attitudinal statement.
In some applications, six options are used to avoid allowing the respondent to be neutral.
Scoring requires that any negatively worded items be reverse scored and the responses are then summed. This format is especially popular in measurements of
attitude
When Likert responses are subjected to factor analysis, test developers can find groups of items that go together.
issues with likert
challenged the appropriateness of using traditional parametric statistics to analyze Likert responses because the data are at the ordinal rather than at an interval level
category format.
similar to the Likert format but uses an even greater number of choices, such as a 10-point rating system.
issues with category
responses to items on 10-point scales are affected by the groupings of the people or things being rated
A particular player rated as a 6 when
he was on a team with many outstanding players might be rated as a 9 if he were
judged with a group of poorly coordinated players
People will change ratings depending
on context
When given a group of objects to rate, subjects have a tendency to spread their responses evenly across the 10 categories
reliability and validity may be higher if all response options are clearly labeled, as opposed to just labeling the categories at the extremes
how to avoid problems with category format
problem can be avoided if the endpoints
of the scale are clearly defined and the subjects are frequently reminded of the
definitions of the endpoints.
For example, testers might show subjects films that depict the performance of a player rated as 10 and other films showing what a rating of 1 means. Under these circumstances, the subjects are less likely to offer a response that is affected by other stimuli in the group.
category format - how many categories?
between four and seven
N. H. Anderson (1991) has
found that a 10-point scale provides substantial discrimination among objects for a wide variety of stimuli. Some evidence suggests that increasing the number of response
categories may not increase reliability and validity.
increasing the number of choices beyond nine or so can reduce reliability because responses may be more likely to include an element of randomness when there are so many alternatives that respondents cannot clearly discriminate between the fine-grained choices
optimum number of categories is
between four and seven. Reliability suffers when fewer than four categories are used, but increases little when more than seven categories are available.
visual analogue scale
the respondent is given
a 100-millimeter line and asked to place a mark between two well-defined endpoints
popular for measuring self-rated
health. However, they are not used often for multi-item scales, because scoring is time-consuming.
Checklists
adjective checklist - the respondent receives a long list of adjectives and indicates whether each one is characteristic of himself or herself; the format is dichotomous.
Q-Sorts
can be used to describe oneself or to provide ratings of others
a subject is given statements and asked to sort them into nine piles.
For example, observers have been given 100 statements about personal characteristics and asked to sort them into piles indicating the degree to which the statements appeared to describe a given person accurately.
Most of the cards are usually placed in piles 4, 5, and 6.
frequency of items placed in each of the categories usually looks like a bell-shaped curve
item difficulty
defined by the proportion of people who get a particular item correct.
For example, if 84% of the people taking a particular test get item 24 correct, then the difficulty level for that item is .84.
The higher the proportion of people who get the item correct, the easier the item.
optimal difficulty level for items is usually
halfway between 100% of the respondents getting the item correct and the level of success expected by chance alone.
The optimum difficulty level for a four-choice item is approximately .625.
To arrive at this value, take the difference between the 100% success level (1.00) and the chance performance level (.25), divide it by 2, and add back the chance level: (1.00 - .25)/2 + .25 = .625.
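The calculation generalizes to any number of choices; a minimal sketch:

```python
def optimal_difficulty(num_choices):
    """Optimal item difficulty: halfway between perfect performance (1.0)
    and the chance-success level (1 / num_choices)."""
    chance = 1.0 / num_choices
    return chance + (1.0 - chance) / 2

print(optimal_difficulty(4))  # 0.625 for a four-choice item
print(optimal_difficulty(2))  # 0.75 for a true-false item
```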
In constructing a good test, one must also
consider human factors
For example, though items answered correctly by all students will have poor psychometric qualities, they may help the morale of the students who take the test. A few easier items may help keep test anxiety in check, which in turn adds to the reliability of the test.
Discriminability
examine the relationship between performance on particular items and performance on the whole test
item discriminability
determines whether the people who have done well on particular items have also done well on the whole test
The Extreme Group Method
compares people who have done well with those who have done poorly on a test.
find the students with test scores in the top third and those in the bottom third of the class.
Then, you would find the proportions
of people in each group who got each item correct.
The difference between these
proportions is called the discrimination index
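The extreme group method can be sketched as follows (the one-third split and the data are illustrative):

```python
def discrimination_index(item_correct, total_scores, fraction=1 / 3):
    """Extreme group method: proportion passing the item among the top
    scorers minus the proportion passing among the bottom scorers.
    item_correct: 1/0 per examinee for one item.
    total_scores: total test score per examinee."""
    order = sorted(range(len(total_scores)), key=lambda i: total_scores[i])
    n = max(1, int(len(order) * fraction))
    bottom, top = order[:n], order[-n:]
    p_top = sum(item_correct[i] for i in top) / n
    p_bottom = sum(item_correct[i] for i in bottom) / n
    return p_top - p_bottom

item = [1, 1, 1, 0, 1, 0, 0, 1, 0]       # one item, nine examinees
totals = [90, 85, 80, 70, 65, 60, 40, 35, 30]
# Top third all pass, bottom third 1/3 pass, so D = 1.0 - .33 ≈ .67
print(round(discrimination_index(item, totals), 2))
```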
The Point Biserial Method
find the correlation between performance on the item and performance on the total test
If this value is negative or low, then the item should be eliminated from the test. The closer the value of the index is to 1.0, the better the item discriminates.
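The point-biserial coefficient is simply the Pearson correlation between the 0/1 item scores and the total test scores; a minimal sketch with invented data:

```python
from statistics import mean, pstdev

def point_biserial(item_correct, total_scores):
    """Pearson correlation between a dichotomous (0/1) item and the total
    test score, which is equivalent to the point-biserial coefficient."""
    mx, my = mean(item_correct), mean(total_scores)
    sx, sy = pstdev(item_correct), pstdev(total_scores)
    cov = mean((x - mx) * (y - my)
               for x, y in zip(item_correct, total_scores))
    return cov / (sx * sy)

item = [1, 1, 1, 0, 0, 0]          # who passed this one item
totals = [95, 90, 80, 60, 55, 50]  # total test scores
r = point_biserial(item, totals)
print(round(r, 3))  # ≈ 0.95: high scorers tend to pass the item
```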
item characteristic curve.
graph for each individual test item. On these individual item graphs, the total test score is
plotted on the horizontal (X) axis and the proportion of examinees who get the item
correct is plotted on the vertical (Y) axis
total test score is used as an estimate of
the amount of a “trait” possessed by individuals.
relationship between performance on the item and performance on the test gives some information about how well the item is tapping the information we want.
item characteristic curve for a “good” test item
gradual positive slope of the line demonstrates that the proportion of people who pass the item gradually increases as test scores increase. This means that the item successfully discriminates at all levels of test performance.
“none of the above.”
Students who are exceptionally knowledgeable in the subject area can sometimes rule out all the choices even though one of the alternatives has actually been designated as correct.
Classical Test Theory
score is derived from the sum of an individual’s responses to various items, which are sampled from a larger domain that represents a specific trait or ability.
item response theory
item analysis
each item on a test has its own item characteristic curve that describes the probability of getting each particular item right or wrong given the ability level of each test taker.
With the computer, items can be sampled, and the specific range of items where the test taker begins to have difficulty can be identified
testers can make an ability judgment without subjecting the test taker to all of the test items.
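The item characteristic curves used in IRT are typically logistic functions. A minimal sketch of the common two-parameter logistic model (the three-parameter version adds a lower asymptote for guessing); the parameter values here are illustrative:

```python
import math

def icc_2pl(theta, a, b):
    """Two-parameter logistic item characteristic curve: probability of a
    correct response given ability theta, item discrimination a, and
    item difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# An easy, highly discriminating item vs. a hard, flatter one,
# both evaluated for an average examinee (theta = 0):
easy = icc_2pl(0.0, a=2.0, b=-1.0)  # high probability of a correct answer
hard = icc_2pl(0.0, a=0.8, b=1.5)   # low probability of a correct answer
print(round(easy, 2), round(hard, 2))
```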
advantages of item response theory
builds on traditional models of
item analysis and can provide information on item functioning, the value of specific
items, and the reliability of a scale
most important message for the test taker is that his or her score is no longer
defined by the total number of items correct, but instead by the level of difficulty
of items that he or she can answer correctly
implications of IRT
some people believe that IRT was the most important development in psychological testing in the second half of the 20th century
various approaches to the construction of tests using IRT- two dimensions
difficulty and discriminability. Other approaches add a third dimension for the probability that test takers with the lowest levels of ability will get a correct response
advantage of IRT
IRT-based tests are easily adapted for computer administration. The computer can rapidly identify the specific items that are required to assess a particular ability level.
test takers do not have to suffer the embarrassment of attempting multiple items beyond their ability, or waste time and effort on items below their capability
disadvantage of IRT
Though the precision of the test is best for those at average ability levels, those with the lowest or highest ability levels are not well assessed by this type of test
“rectangular conventional”
requires that test items be selected to create a wide range in level of difficulty.
The items are pretested and selected to cover evenly the span from easiest to most difficult.
problem with this approach is that only a few items of the test are appropriate for individuals at each ability level; that is, many test takers spend much of their time responding to items either considerably below their ability level or too difficult to solve. As a result, measurement precision is constant across the range of test-taker abilities but relatively low for all people,
IRT addresses traditional problems in test construction well
IRT can handle items that are written in different formats
IRT can identify respondents with unusual response patterns and offer insights into cognitive processes of the test taker
Use of IRT may also reduce the biases against people who are slow in completing test problems. In other words, by presenting questions at the test taker’s ability level, IRT and computer-adaptive testing allow the time spent taking the test to be used most efficiently.
internal criteria
Item analysis has been persistently plagued by researchers’ continued dependence on
internal criteria, or total test score, for evaluating items.
The examples we have just given demonstrate how to compare performance on an item with performance on the total test.
external criterion
For example, if you were building a test to select airplane pilots, you might want to evaluate how well the individual items predict success in pilot training or flying performance. The advantages of using external rather than internal criteria to validate items have long been recognized, but external criteria are rarely used in practice.
Linking Uncommon Measures
One challenge in test applications is how to determine linkages between two different measures.
Interpretation of the test results for students who took the test at different times requires that scores on each administration have the same meaning, even though the tests include different items—that is, we assume that a score of 600 means the same thing for two
students even though the two students completed different tests.
Researchers who have studied the problem concluded that it is not feasible to compare the wide array of commercial and state achievement tests to one another, and that developing transformation methods for individual scores should not be done.
Items for Criterion-Referenced Tests
traditional use of tests requires that we determine how well someone has done on a test by comparing the person’s performance to that of others.
criterion-referenced test compares performance with some clearly defined
criterion for learning. This approach is popular in individualized instruction
programs. For each student, a set of objectives is defined that state exactly what
the student should be able to do after an educational experience.
The test is then used to determine whether this objective has been achieved. Many educators regard criterion-referenced tests as diagnostic instruments.
first step in developing criterion-referenced tests
clearly specifying the objectives by writing clear and precise statements about what the learning program is attempting to achieve
These statements are usually stated in terms of something the student will be able to do
To evaluate the items in the criterion-referenced test
one should give the test to two groups of students—one that has been exposed to the learning unit and one that has not
V distribution
The bottom of the V is the antimode, or the least frequent score. This point divides those who have been exposed to the unit from those who have not been exposed, and is usually taken as the cutting score, the point that marks the decision.
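Locating the antimode of a V-shaped score distribution can be sketched as below; this is a toy version that assumes the two modes tie for the highest frequency, and the data are invented:

```python
from collections import Counter

def cutting_score(scores):
    """Antimode of a bimodal (V-shaped) frequency distribution: the least
    frequent score between the two peaks, taken as the cutting point
    separating instructed from uninstructed examinees.
    Assumes the two peaks share the same maximum frequency."""
    counts = Counter(scores)
    ordered = sorted(counts)                       # distinct scores, ascending
    peak = max(counts.values())
    peaks = [s for s in ordered if counts[s] == peak]
    lo, hi = peaks[0], peaks[-1]
    between = [s for s in ordered if lo < s < hi]
    return min(between, key=lambda s: counts[s])

# Uninstructed group clusters near 3, instructed group near 8;
# the least frequent score between the peaks is 5:
scores = [2, 3, 3, 3, 4, 4, 5, 6, 6, 8, 8, 8, 9]
print(cutting_score(scores))  # 5
```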
Criterion-referenced tests offer many advantages to newer educational approaches.
computer-assisted instruction, each student works at his or her own pace on an individualized program of instruction, after which a criterion-referenced test is used to evaluate progress.
Students who pass the test can move on to the next unit. Students who do not pass can repeat some of the instruction until they pass