Psychometrics - Test Items Flashcards
Where does the choice of format come from?
Objectives and purposes of the test
Eight tips for item writing
- Define clearly what you want to measure
- Generate an item pool
- Avoid long items
- Keep the reading difficulty appropriate
- Use clear and concise wording
- Mix positively and negatively worded items
- Cultural neutrality
- Make the content relevant to the purpose
Five categories of item format
- Dichotomous
- Polytomous
- Likert
- Category
- Checklists + Q-sorts
Examples of dichotomous format
True-False questions and Yes-No questions
Pros and cons of Dichotomous format
+ Easy to administer and score
+ Participants can’t opt for neutral
- Less reliable, due to the smaller range of possible scores
- Encourages memorization in test setting
- Doesn’t account for the fact that the truth is often in shades of grey, not black and white
Example of polytomous format
MCQs
Tips for writing distractors
Use a minimum of three distractors (so four options in total): fewer gives too few options, and more makes it hard to write viable alternatives. Distractors need to be as plausible as the correct answer, and "cute" distractors should be avoided.
Pros of polytomous format
+ Easy to administer and score
+ Requires absolute judgement
+ More reliable than dichotomous because there is less chance of guessing correctly
Why do we do correction for guessing?
With dichotomous and polytomous formats, it’s easy for people to guess the right answer, which doesn’t indicate good performance, just luck.
Formula for correction for guessing
Corrected score = Right - Wrong / (number of alternatives - 1)
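A minimal sketch of the correction in Python (the function name and example numbers are illustrative, not from the source):

```python
def corrected_score(num_right: int, num_wrong: int, num_alternatives: int) -> float:
    """Correct a raw score for guessing: Right - Wrong / (n - 1).
    Omitted items count as neither right nor wrong."""
    return num_right - num_wrong / (num_alternatives - 1)

# Hypothetical example: 60 right, 20 wrong on a 4-option MCQ test
print(corrected_score(60, 20, 4))  # 60 - 20/3, roughly 53.3
```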
Debate around the neutral option in Likert format
Including a neutral midpoint lets respondents pick an indecisive option instead of committing to an answer; omitting it forces a choice even from people who are genuinely neutral.
When do we use Likert?
When items ask for a degree of agreement, e.g., in attitude or personality tests.
Problems with category format
- Tendency to spread responses across all categories
- Susceptible to context effects
- Element of randomness
When do we use the category format?
- When people are highly involved with a subject
- When wanting to measure the amount of something
Checklists vs Q-sorts
Checklist: you check the things on a list that apply to you
Q-sort: you sort statements into piles according to how well they describe the target (e.g., yourself)
Five steps of item analysis
- Item difficulty
- Item discriminability
- Item characteristic curves
- Item response theory
- Criterion referenced tests
What is item difficulty?
Proportion of people who get the particular item right, so the higher the value, the easier the item.
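As a quick sketch (hypothetical data), difficulty is just the mean of a 0/1 response vector:

```python
responses = [1, 0, 1, 1, 0, 1, 1, 1]  # 1 = correct, 0 = incorrect
difficulty = sum(responses) / len(responses)
print(difficulty)  # 0.75 -> a fairly easy item
```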
What is the optimum difficulty level?
Between 0.3 and 0.7
How do we calculate optimum difficulty?
Take the point halfway between chance-level success (pure guessing) and 1.0 (everyone correct): ODL = chance + (1 - chance) / 2.
Example of the ODL equation in a 4-option MCQ
4 options = 0.25 chance of guessing right
- (1-0.25)/2 = 0.375
- 0.25 + 0.375 = 0.625
0.625 is our ODL
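The same calculation as a small Python helper (a sketch; the function name is mine):

```python
def optimum_difficulty(num_options: int) -> float:
    """Halfway between chance-level success and everyone correct."""
    chance = 1 / num_options
    return chance + (1 - chance) / 2

print(optimum_difficulty(4))  # 0.625, matching the worked example
```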
When do we need to make exceptions to the ODL?
More difficult: selection processes
Easier: special education
Others: boosting morale etc.
What does item discriminability tell us?
Have those who have done well on the particular item done well on the test?
What does the extreme groups method do?
It computes a discrimination index: the proportion of test takers in the upper extreme of total scores who got the item right minus the proportion in the lower extreme who got it right.
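A sketch with hypothetical pilot data (group sizes and outcomes are made up):

```python
def discrimination_index(upper_correct, lower_correct):
    """Proportion correct in the upper group minus the lower group."""
    return sum(upper_correct) / len(upper_correct) - sum(lower_correct) / len(lower_correct)

# 9 of 10 top scorers vs 4 of 10 bottom scorers got the item right
upper = [1] * 9 + [0]
lower = [1] * 4 + [0] * 6
print(discrimination_index(upper, lower))  # 0.5 -> the item discriminates well
```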
What does the point-biserial method do?
It gives us the item-total correlation: whether how well people did on the item correlates with how well they did on the test as a whole.
Formula for point-biserial method
Correlation = ((Y1 - Y) / Sy) * SQRT(Px / (1 - Px))
- Y1 = mean test score of those who got the item right
- Y = mean test score of all test takers
- Sy = standard deviation of all test takers' scores
- Px = proportion of test takers who got the item correct
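A sketch of the same formula in Python (hypothetical pilot data; uses the population standard deviation):

```python
import math

def point_biserial(item_correct, total_scores):
    """Item-total correlation: ((Y1 - Y) / Sy) * sqrt(Px / (1 - Px))."""
    n = len(total_scores)
    y_bar = sum(total_scores) / n
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in total_scores) / n)
    p_x = sum(item_correct) / n
    y1_bar = sum(y for c, y in zip(item_correct, total_scores) if c) / sum(item_correct)
    return ((y1_bar - y_bar) / s_y) * math.sqrt(p_x / (1 - p_x))

item = [1, 1, 0, 1, 0, 0]          # 1 = got the item right
totals = [28, 25, 14, 22, 12, 16]  # total test scores
print(round(point_biserial(item, totals), 2))  # ~0.94 -> keep this item
```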
What can we do with the point-biserial correlation?
Use it to weed out bad questions after the pilot study: include items with higher correlations and exclude those with lower ones.
What is plotted on the axes of the item characteristic curve?
X - total test score
Y - proportion getting the item correct
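A sketch of how the points of an ICC could be computed from pilot data (binning scheme and numbers are mine):

```python
from collections import defaultdict

# Hypothetical (total_score, got_item_right) pairs
data = [(10, 0), (12, 0), (15, 1), (15, 0), (18, 1), (20, 1), (22, 1), (25, 1)]

groups = defaultdict(list)
for score, correct in data:
    groups[score // 5 * 5].append(correct)  # bin total scores into widths of 5

# X: total-score bin; Y: proportion getting the item correct
for bin_start in sorted(groups):
    outcomes = groups[bin_start]
    print(bin_start, sum(outcomes) / len(outcomes))
```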
How does item response theory work?
A computer generates items for the test taker depending on their performance on the previous items, with the outcome being defined by the level of difficulty of the items answered correctly.
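A toy sketch of the adaptive idea (not a real IRT engine, which estimates ability from a fitted item model):

```python
def adaptive_test(respond, num_items=10):
    """Step item difficulty up after a correct answer, down after an
    incorrect one; the final difficulty estimates the taker's level."""
    difficulty, step = 0.5, 0.25
    for _ in range(num_items):
        correct = respond(difficulty)  # administer an item at this difficulty
        difficulty += step if correct else -step
        step /= 2                      # narrow in on the taker's level
    return difficulty

# Hypothetical taker who answers correctly whenever difficulty < 0.7
print(adaptive_test(lambda d: d < 0.7))  # converges toward 0.7
```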
Pros of item response theory
+ Tests based on IRT are great for computer administration
+ Quicker
+ Morale is maintained
+ Reduces cheating
Three kinds of measurement precision
- Peaked conventional: best for measuring the middle bracket, where average scores sit, it’s not good at measuring top and bottom achievers
- Rectangular conventional: equal number of items assessing all ability levels, but relatively low precision across the board
- Adaptive: test focuses on the range that challenges each individual test taker, making for overall high precision
What is a criterion-referenced test designed to do?
Compare test performance with some objectively defined criterion; such tests are developed from specific learning outcomes.
How does one evaluate a criterion reference test?
Assess two groups, one given the learning unit and the other not. Collect their scores and plot both distributions on a graph; the point that best separates the two distributions can serve as the cutoff score.
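A sketch of that evaluation with hypothetical score data:

```python
from collections import Counter

instructed   = [18, 20, 22, 19, 24, 21, 23]  # given the learning unit
uninstructed = [10, 12, 9, 14, 11, 13, 8]    # not given the unit

for label, scores in (("instructed", instructed), ("uninstructed", uninstructed)):
    bins = Counter(s // 5 * 5 for s in scores)  # frequency bins of width 5
    print(label, dict(sorted(bins.items())))
# The score region separating the two distributions suggests a cutoff.
```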
Limitations of the criterion referenced test
- Tells you what you got wrong, but not why
- Emphasis on ranking students rather than on identifying gaps in their knowledge
- It increases the risk of “teaching to the test”