Writing and Evaluating Test Items
How do you choose the format of items?
Choice of format comes from objectives and purpose of the test
Item writing guidelines
- define clearly what you want to measure
- generate an item pool
- avoid long items (tedious to read)
- keep reading difficulty appropriate (education level)
- use clear and concise wording (avoid double-barrelled and double-negative items)
- mix positively and negatively worded items in the same test
- make items as culturally neutral as possible
- make content relevant to the purpose of the test
How to write MCQ items
vary position of correct answer
all distractors must be plausible
True/false Qs
Keep true and false statements the same length
Include roughly equal numbers of true and false statements
5 types of item format
- Dichotomous format
- Polytomous format
- Likert format
- Category format
- Checklists and Q-sorts
Dichotomous format
- 2 alternatives
- True/False
- Yes/No
Dichotomous format advantages
- ease of administration
- quick scoring
- requires absolute judgement
Dichotomous format disadvantages
- less reliable (a 50% chance of guessing an item correct; a smaller range of scores for analyses)
- encourages memorisation
- often truth is not black/white
Polytomous format
more than 2 alternatives
e.g., multiple-choice questions (MCQs)
Polytomous format- distractors
- incorrect alternatives
- ideal to have 3-4 distractors to retain psychometric properties
- must be as plausible as the correct answer
- no 'cute' (joke) distractors
- make the test more reliable
- but difficult to find good distractors
Polytomous format- advantages
- easy to administer and score
- requires absolute judgement
- less likely to guess correctly than a dichotomous test
Correction for guessing
Corrected score = R - W/(n - 1)
(the number of right answers minus the number of wrong answers divided by the number of choices per item minus 1)
R = number of right answers
W = number of wrong answers
n = number of alternatives per item
Omitted answers are excluded in this calculation
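A minimal sketch of this correction in Python (the function name and example numbers are my own, not from the card):

```python
def corrected_score(n_right: int, n_wrong: int, n_alternatives: int) -> float:
    """Correction for guessing: R - W/(n - 1).
    Omitted items count as neither right nor wrong, so they are
    excluded automatically."""
    return n_right - n_wrong / (n_alternatives - 1)

# e.g., 30 right, 12 wrong, 8 omitted on a 4-alternative MCQ test
print(corrected_score(30, 12, 4))  # 30 - 12/3 = 26.0
```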
Likert Format
Named after Rensis Likert, who first used it for attitude scales
- indicates degree of agreement
- a 6-point scale (or another even number of options) is used to avoid the neutral response
- reverse-score negatively worded items
- use statements
- popular for attitude and personality scales
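A short Python sketch of reverse scoring on a 6-point scale (the responses are made up; the formula min + max - score is the standard one):

```python
def reverse_score(score: int, scale_min: int = 1, scale_max: int = 6) -> int:
    """Reverse-score a negatively worded item: 1<->6, 2<->5, 3<->4."""
    return scale_min + scale_max - score

responses = [1, 2, 6, 4]  # raw responses to a negatively worded item
print([reverse_score(r) for r in responses])  # [6, 5, 1, 3]
```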
Category Format
On a scale of 1 to 10…
Research suggests 7 categories work best
Category Format- disadvantages
- Tendency to spread responses across all categories
- Susceptible to the groupings of things being rated (context)
- Element of randomness
When is Category Format used?
- when people are highly involved with a subject, e.g., asking people in townships to rate service delivery (they are more motivated to make finer discriminations)
- when you want to measure the amount of something, e.g., road rage experienced in a given situation
- make sure your endpoints are clearly defined
Visual analogue scale
- respondents mark a point on a line between two clearly defined endpoints
Checklists
- Common in personality measures
- A list of adjectives, check which ones describe you best
Q-sorts
- place statements into piles
- piles indicate degree to which you think a statement describes a person/yourself
- category format implicit here
Item analysis
Item analysis is a general term used to describe a set of methods used to evaluate test items. Item difficulty and item discriminability are the most basic of these methods.
Item Difficulty
- the proportion of people who get a particular item correct
- the higher the value, the easier the item
- p = number answered item correctly / number taking the measure
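In Python, item difficulty is a one-liner over the 1/0 (correct/incorrect) responses to a single item (the data here are invented):

```python
def item_difficulty(responses: list[int]) -> float:
    """Proportion of test takers answering the item correctly
    (1 = correct, 0 = incorrect)."""
    return sum(responses) / len(responses)

# 10 test takers, 7 answered correctly -> p = 0.7 (a fairly easy item)
print(item_difficulty([1, 1, 1, 0, 1, 1, 0, 1, 1, 0]))
```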
Optimum difficulty level (ODL)
-between 0.30 and 0.70
Example: MCQ test with 4 alternatives
-4 answer options, therefore chance = 0.25
-Halfway between 100% and chance: (1.00 - 0.25)/2 = 0.375
-Add chance: 0.375 + 0.25 = 0.625
(Add chance because we require a difficulty level of at least chance)
-ODL = 0.625
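The worked example generalises to any number of alternatives; a small sketch (the function name is mine):

```python
def optimum_difficulty(n_alternatives: int) -> float:
    """Halfway between a perfect score (1.0) and chance (1/n),
    shifted up by chance."""
    chance = 1 / n_alternatives
    return (1.0 - chance) / 2 + chance

print(optimum_difficulty(4))  # 0.625, matching the worked example above
print(optimum_difficulty(2))  # 0.75 for true/false items
```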
exceptions to optimum difficulty level
- At times we need more difficult items e.g., selection process
- At times we need easier items e.g., special education
- At times we need to consider other factors e.g., boost morale
Item discriminability
Have those who did well on particular items also done well on the overall test?
Good item discriminability when:
People who do well on test overall get the item correct (and vice versa)
Discrimination Index (di)
Higher values indicate better discriminability
Item discriminability- extreme groups method
-Calculated as the proportion of people in the upper quartile (on total test score) who got the item correct, minus the proportion of people in the lower quartile who got it correct
-Essentially the difference in item difficulty between the top and bottom 25%
di = U/NU - L/NL
(U = upper-quartile test takers answering correctly, NU = number in the upper quartile; L = lower-quartile test takers answering correctly, NL = number in the lower quartile)
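A sketch of the extreme groups method in Python, assuming each test taker is a (total score, item correct) pair and quartiles are simple count-based splits:

```python
def discrimination_index(data):
    """Extreme groups method: item difficulty in the top quartile
    (by total score) minus item difficulty in the bottom quartile."""
    ranked = sorted(data, key=lambda pair: pair[0])
    q = len(ranked) // 4                       # size of each extreme group
    lower, upper = ranked[:q], ranked[-q:]
    p_upper = sum(correct for _, correct in upper) / q
    p_lower = sum(correct for _, correct in lower) / q
    return p_upper - p_lower

# 8 test takers as (total score, 1/0 on this item): top 2 vs bottom 2
data = [(55, 0), (60, 0), (70, 1), (75, 0), (80, 1), (85, 1), (90, 1), (95, 1)]
print(discrimination_index(data))  # 1.0 - 0.0 = 1.0, a highly discriminating item
```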
Item discriminability: The point-biserial method
-Also known as item-total correlation
- Item correlations can also be used for Likert-type items, category format items, etc.
Again, good items should be those that have a positive item-total correlation
For example:
If an item on a questionnaire measuring schizophrenia symptoms has a high correlation with total scores on the overall questionnaire, then the item is good at measuring schizophrenia symptoms
This correlation can be used as an indicator of whether to include or exclude an item from the test/questionnaire in future:
include items with higher item-total correlations; exclude those with lower ones
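Since the point-biserial is just Pearson's r with a dichotomous variable, the standard-library version (Python 3.10+) is enough for a sketch (data invented):

```python
from statistics import correlation

item   = [1, 1, 0, 1, 0, 0]        # 1/0 on the item for 6 test takers
totals = [88, 75, 60, 70, 55, 62]  # their total test scores

r_pb = correlation(item, totals)   # point-biserial item-total correlation
print(round(r_pb, 2))  # positive: people who get the item right score higher overall
```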
Item characteristic curves (ICCs)
The relationship between performance on an item and performance on the overall test tells us how well the item is tapping into what we want to measure.
-A graphical display of item functioning
Total test score plotted on X-axis
Proportion (i.e., 0.23, 0.50, etc.) getting the item correct plotted on Y-axis
-need discrete categories for scores.
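A minimal sketch of building ICC points: bin the total scores into discrete categories, then compute the proportion correct per bin (the bin width and data are my own choices):

```python
from collections import defaultdict

def icc_points(totals, item_correct, bin_width=10):
    """Group total scores into discrete categories (x-axis) and return
    the proportion getting the item correct in each (y-axis)."""
    bins = defaultdict(list)
    for total, correct in zip(totals, item_correct):
        bins[total // bin_width * bin_width].append(correct)
    return {b: sum(v) / len(v) for b, v in sorted(bins.items())}

totals = [45, 52, 58, 63, 67, 74, 78, 85]
item   = [0, 0, 1, 0, 1, 1, 1, 1]
print(icc_points(totals, item))  # rising proportions = the item discriminates well
```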
Item response theory (IRT)
A different model of psychological testing
-Makes extensive use of item analysis
-Computer generates items
Each of these items has a particular difficulty level
-Computer gives you an item
-If you answer it correctly, the next item will be of increased difficulty, if incorrectly, the next item will be of decreased difficulty
-Looks at what you can do and only gives you what it thinks you can handle
-Essentially, the test is ‘tailored’ to the individual
Example:
This person can answer most items correctly at the 0.30 (or 0.45 or 0.70, etc.) level of difficulty…
Rather than: This person got 30% or 45% or 70% on this test.
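A toy staircase sketch of the adaptive idea (not a real IRT ability estimator; the step size, bounds, and simulated test taker are all assumptions):

```python
def adaptive_session(answers_correctly, start=0.50, step=0.05, n_items=10):
    """Each correct answer raises the difficulty level of the next item;
    each error lowers it. The final level describes what the test taker
    can handle, rather than a percentage score."""
    level = start
    for _ in range(n_items):
        if answers_correctly(level):
            level = min(level + step, 0.95)
        else:
            level = max(level - step, 0.05)
    return level

# Simulated test taker who can handle items up to the 0.70 level
print(adaptive_session(lambda level: level <= 0.70))  # settles around 0.70
```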
Test performance in IRT- advantages
- Tests based on IRT can easily be adapted for computer administration
- Quicker tests
- Morale of test-taker is not broken down
- Reduces chances of cheating
Measurement precision: peaked conventional
- tests individuals at average ability best
- doesn’t assess high or low levels well
- high precision for average ability levels, low precision at either end
Measurement precision: rectangular conventional
- equal number of items assessing all ability levels
- relatively low precision across the board
Measurement precision: adaptive
- the test focuses on the range that challenges each individual test taker
- precision therefore high at every ability level
Criterion-referenced tests
-Compares performance with some objectively defined criterion
E.g., the extent to which performance on the QLT predicts success at stats in psychology
Develop tests based on learning outcomes
What is it that the student should be able to do?
E.g., At the end of this lecture you should be able to:
Describe an ICC
Calculate item discriminability
Calculate a point-biserial correlation
Evaluating items in Criterion-referenced tests
- 2 groups: 1 given the learning ‘unit’. 1 not given the learning ‘unit’
- collect scores and plot them on a graph; the distribution should form a V or U shape
- the bottom of the curve is the antimode (it can serve as the cutting score between the two groups)
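A toy sketch of finding the antimode, assuming integer scores and a clean two-peaked (V/U) distribution:

```python
from collections import Counter

def antimode(scores):
    """Least frequent score between the two peaks of a V/U-shaped
    distribution; can serve as the cutting score between groups."""
    counts = Counter(scores)
    peak_lo, peak_hi = sorted(s for s, _ in counts.most_common(2))
    between = [s for s in counts if peak_lo < s < peak_hi]
    return min(between, key=lambda s: counts[s])

# Uninstructed group clusters near 3, instructed group near 8
scores = [2, 3, 3, 3, 4, 4, 5, 7, 7, 8, 8, 8, 9]
print(antimode(scores))  # 5, the low point of the U between the peaks
```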
Limitations of Criterion-referenced tests
Tells you that you got something wrong, but not why
Emphasis on ranking students rather than identifying gaps in knowledge
‘Teaching to the test’
How does IRT differ from traditional testing methods?
Performance is defined by the level of difficulty of the items answered correctly,
instead of by the total test score, as in traditional methods