Wk 6 - Test construction Flashcards

1
Q

The item discrimination index, where the Upper and Lower Groups are defined by overall test score, can tell us how that single item contributes to a test’s… (x1)
Why? (x1)

A

Internal consistency
The question is describing the item discrimination index for reliability - which tells you the extent to which the item yields responses similar to the other items in the scale.

2
Q

An examination has an average item difficulty index of .95. Which of the following is most likely given this information? (x1)
Why? (x2)

A

The examination is very easy.
Item difficulty index tells you what percentage of test-takers got a question correct.
In this example, on average, 95% of test-takers answered each question correctly

3
Q

The following frequency data refers to a question on a four option multiple-choice examination; the options are denoted (i), (ii), (iii), (iv) below. Students are divided into an Upper Group (top third of class based on overall exam score) and a Lower Group (bottom third of class based on overall exam score).

Upper Group: (i) 12, (ii) 35, (iii) 0, (iv) 3.

Lower Group: (i) 23, (ii) 21, (iii) 0, (iv) 6.

For example, this data indicates that 12 people from the Upper Group chose Option (i). What is the item discrimination index for this item if the correct answer is Option (ii)?
And how calculated?

A

.28
Because:
U = 35
L = 21
nU = 50 (summing across the top row: 12 + 35 + 0 + 3)
nL = 50 (summing across the bottom row: 23 + 21 + 0 + 6)
d = (35/50) - (21/50) = .28
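
A minimal sketch of this calculation in Python (the function and variable names here are my own, not from the course):

```python
def discrimination_index(upper_counts, lower_counts, correct_option):
    """Item discrimination index: d = U/nU - L/nL.

    upper_counts / lower_counts map each response option to the number
    of students in that group who chose it.
    """
    n_upper = sum(upper_counts.values())  # nU: size of the Upper Group
    n_lower = sum(lower_counts.values())  # nL: size of the Lower Group
    return (upper_counts[correct_option] / n_upper
            - lower_counts[correct_option] / n_lower)

# Data from the item above; option (ii) is the correct answer.
upper = {"i": 12, "ii": 35, "iii": 0, "iv": 3}
lower = {"i": 23, "ii": 21, "iii": 0, "iv": 6}
print(round(discrimination_index(upper, lower, "ii"), 2))  # 0.28
```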

4
Q

True or false, and why? (x2)
On a norm-referenced aptitude test, the ‘optimal difficulty’ of an item is defined as a point halfway between chance and everyone getting the answer WRONG.

A

False
Optimal difficulty is actually halfway between chance and everyone getting the answer CORRECT
That’s why the optimal difficulty calculation requires you to add the chance of guessing to 100% before dividing by 2 to find the halfway point.

5
Q

True or false, and why? (x1)
The item discrimination index for validity tells us the extent to which the item contributes to the scale’s correlation with a relevant criterion measure.

A

True
The higher the item discrimination index for validity, the more that item is contributing to the scale's relationship with the external criterion measure in question.

6
Q

What is the optimal item difficulty index for two option multiple-choice questions in a norm-referenced achievement test?
How calculated? (x2)

A

.75

Chance for a two-option multiple-choice question is 50%. (50% + 100%)/2 = 75%.

7
Q

True or false?
The item discrimination index for reliability tells you the extent to which people are responding to that item in the same way as they are responding to the other items in the scale.

A

True

8
Q

True or false, and why? (x2)
If the item discrimination index for reliability is -1 (minus one) then it means that the item in question cannot distinguish between high and low scorers IN ANY WAY.

A

False.
If that were true, d would equal 0.
d = -1 means the item is discriminating - just in the opposite way to what you would predict (it probably means it was a reverse-scored item and you forgot to recode it)

9
Q

The following frequency data refers to a question on a four option multiple-choice examination; the options are denoted (i), (ii), (iii), (iv) below. Students are divided into an Upper Group (top third of class based on overall exam score) and a Lower Group (bottom third of class based on overall exam score).

Upper Group: (i) 34, (ii) 89, (iii) 21, (iv) 3.

Lower Group: (i) 67, (ii) 43, (iii) 7, (iv) 29.

For example, this data indicates that 34 people from the Upper Group chose Option (i). What is the item discrimination index for this item if the correct answer is Option (ii)?

A

.31
Because:
U = 89
L = 43
nU = 147 (summing across the top row: 34 + 89 + 21 + 3)
nL = 146 (summing across the bottom row: 67 + 43 + 7 + 29)
d = (89/147) - (43/146) = .605 - .295 = .31
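
The same arithmetic as a short self-contained Python sketch (data as above):

```python
upper = {"i": 34, "ii": 89, "iii": 21, "iv": 3}
lower = {"i": 67, "ii": 43, "iii": 7, "iv": 29}
d = upper["ii"] / sum(upper.values()) - lower["ii"] / sum(lower.values())
print(round(d, 2))  # 0.31
```
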
10
Q

The following frequency data refers to a question on a four option multiple-choice examination; the options are denoted (i), (ii), (iii), (iv) below. Students are divided into an Upper Group (top third of class based on overall exam score) and a Lower Group (bottom third of class based on overall exam score). Option (ii) is designated as the correct answer.

Upper Group: (i) 7, (ii) 12, (iii) 0, (iv) 31.

Lower Group: (i) 9, (ii) 35, (iii) 0, (iv) 6.

For example, this data indicates that 12 people from the Upper Group chose Option (ii).

True or false, and why? (x2)
This item appears to have a redundant distractor.

A

True
Nobody in either the upper or lower group chose option (iii),
Suggesting that it was obviously incorrect and might be a redundant distractor (in that it failed to distract anyone)

11
Q

The following frequency data refers to a question on a four option multiple-choice examination; the options are denoted (i), (ii), (iii), (iv) below. Students are divided into an Upper Group (top third of class based on overall exam score) and a Lower Group (bottom third of class based on overall exam score). Option (ii) is designated as the correct answer.

Upper Group: (i) 7, (ii) 12, (iii) 0, (iv) 31.

Lower Group: (i) 9, (ii) 35, (iii) 0, (iv) 6.

For example, this data indicates that 12 people from the Upper Group chose Option (ii).

True or false, and why? (x4)
There are grounds for suspecting that this item might contain a scoring error or be worded in a misleading way

A

True
Most people in the Upper Group chose option (iv) rather than the option designated as “correct” (i.e. option ii),
Despite most people in the Lower Group choosing option (ii).
This raises the suspicion that there might be a scoring error or a problem with the question - and hence that it should be double-checked.
This issue would also be flagged by the item's negative item discrimination index: d = (12/50) - (35/50) = -.46

12
Q

The following frequency data refers to a question on a four option multiple-choice examination; the options are denoted (i), (ii), (iii), (iv) below. Students are divided into an Upper Group (top third of class based on overall exam score) and a Lower Group (bottom third of class based on overall exam score).

Upper Group: (i) 16, (ii) 6, (iii) 5, (iv) 7.

Lower Group: (i) 3, (ii) 23, (iii) 4, (iv) 4.

For example, this data indicates that 16 people from the Upper Group chose Option (i).
How would you best describe the data, assuming the examiner has designated Option (i) to be the correct answer? (x1)
Why? (x2)

A

It is a difficult item, but not problematic.
Because, in the upper group, a greater proportion of people chose the right answer (d = 16/34 - 3/34 ≈ .38).
The fact that the lower group went for a different option is fine – they are probably being appropriately misled as a result of not knowing the material as well

13
Q

Is it appropriate to use item discrimination and item difficulty for speed tests? (x1)
Why? (x2)

A

No
Because what discriminates between people is not how many items they get correct but how many they complete
(as compared with power tests where the focus is on which items are correct)

14
Q

True or false, and why?

For power tests, it is appropriate to calculate both Item Discrimination and Item Difficulty Indices.

A

True

It is the difficulty of the questions that is doing the job of separating high scorers from low scorers

15
Q

What are the five steps involved in creating and evaluating a test?

A

Test conceptualisation - what, why and how
Create the materials needed - e.g. Likert scale items
Design/run studies to assess validity, reliability, standardisation and item quality
Test revision - if it all works, consider improvements
Release it into the wild!

16
Q

Give examples of some of the issues to be considered when conceptualizing a test (x4)

A

Why is it worth creating a new test? What has been done before? What does your test offer beyond that?
Who will be using it?
Context - how many test-takers per year?
Length - reliability vs practicality

17
Q

Give an example of the sort of practicalities that might need to be considered when conceptualizing a test (x6)

A

Training/skill of administrator – eg real IQ test takes a full training course for proper competence
Test used by older adults – text big enough to read
How big is your budget?
How much time do you have?
How many people are you testing? (costs etc)
Need internet access?

18
Q

What sort of ethical issues might you need to consider when conceptualizing a test?

A

Anonymity - eg Qualtrics now accessible by CIA

Sensitive/offensive content - e.g. sex/impulsivity, or age when single Ps could be ID’d that way

19
Q

What sort of issues should you consider when creating the materials for a new test? (x7)

A

Format of test items - Selected/constructed response
How to score - Likert etc
Whether items give sufficient variability in responses (if it’s a norm-referenced test)
Content validity - do the questions measure the trait we're aiming at?
If criterion-referenced, do items map onto the criterion? E.g. do exam questions map onto course content?
How many items?
Face validity – do you have/want it?

20
Q

Give examples of some of the key considerations you might need to take into account when drafting written content for a test (x4)

A

Avoid ambiguity/lack of clarity – are they getting right answers for the right reasons?
Include reversed questions – eg so they can agree and disagree
If multiple choice - Plausible distractors
Can they guess correctly without actually knowing?

21
Q

What are selected-response questions? (x3)

A

Multiple choice
Matching options – e.g. provide three different types of test design and ask them to match each with the appropriate statistical analysis from a list
True/false

22
Q

What are constructed-response questions? (x2)

A

Fill in the blank

Write an essay/account

23
Q

In a multiple choice question, what is the “stem”?

What are the other components? (x2)

A

The question part
Correct alternative
Distractors

24
Q

Describe four strategies you could use for pilot research into your new test.

A

Give open-ended version of test, eg written response over selected, to gather ideas for options
Give preliminary/fake version to people for comment
Get people to give running commentary on their thoughts as they complete a draft of test
Verbal protocol – what goes through their mind while doing the test? Gives an idea of how people are interpreting your questions

25
Q

Describe five strategies you could use for evaluating the quality of items in an achievement test.

A

Item difficulty index
Item discrimination index - reliability: similarity to other items
Item discrimination index - validity: contribution to correlation with criterion
If multiple choice – examine the pattern of responses across the response options (how are high and low scorers reacting to the distractors?)
In-depth analysis – is the item answered wrong by those you'd expect to get it right? Are those you'd expect to get it wrong scoring above chance?

26
Q

Describe four strategies you could use to evaluate item quality in a personality-type test.

A

Examine histogram/frequency table of responses to each item – is there an appropriate spread?
Item discrimination index - reliability: similarity to other items
Item discrimination index - validity: contribution to correlation with criterion
In-depth analysis - is the item answered wrong by those you'd expect to get it right? Are those you'd expect to get it wrong scoring above chance?

27
Q

What is the item difficulty index? (x1)
And how to calculate? (x1)
And interpret? (x1)

A

(p) is the proportion of people who got the item correct – between zero and one
p = number of people correct / total N
Multiply by 100 to turn the decimal into the percentage who got it correct
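
As a quick Python sketch (the function name and the example numbers are illustrative, not from the course):

```python
def item_difficulty(num_correct, total_n):
    """Item difficulty index p: the proportion of test-takers who
    answered the item correctly (between 0 and 1)."""
    return num_correct / total_n

p = item_difficulty(19, 68)  # hypothetical: 19 of 68 people correct
print(f"p = {p:.2f} ({p * 100:.0f}% correct)")  # p = 0.28 (28% correct)
```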

28
Q

What is the optimal item difficulty (p) for a multiple choice question in a norm-referenced aptitude test? (x1)
How to calculate? (x3)

A

Optimal is halfway between chance and everyone being correct
By adding chance level to 100% and dividing by two:
Eg 5-response option = 1/5 chance of guessing, so
(20% + 100%)/2 = 60%.
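
A minimal sketch of the optimal-difficulty calculation (the function name is my own):

```python
def optimal_difficulty(num_options):
    """Optimal p: halfway between chance (1/num_options) and 100%."""
    chance = 1 / num_options
    return (chance + 1.0) / 2

print(optimal_difficulty(5))  # 0.6   -> (20% + 100%) / 2
print(optimal_difficulty(2))  # 0.75  -> (50% + 100%) / 2, as in card 6
print(optimal_difficulty(4))  # 0.625 -> (25% + 100%) / 2
```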

29
Q

How can you examine the spread of scores for items in a personality-type test?

A

Plot a separate histogram or output a frequency table for each item
Are the scores spread across all response options for all items, without too much skew?

30
Q

What is the item discrimination index for reliability?

A

How similar is the item to the rest of the items in the scale?

31
Q

How do you calculate the item discrimination index (d) with respect to internal consistency? (x5)

A

Work out the total score for the quiz (count up correct answers for each person).
Define Upper and Lower groups as the top 25% and bottom 25% based on total score.
For each item, count the people in the high-scoring group who got the item correct, and the people in the low-scoring group who got it correct (ignore other Ps).
Work out the proportion correct in each group (number correct divided by group size).
Subtract the lower-group proportion from the upper-group proportion: d = U/nU - L/nL
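
A sketch of the whole procedure in Python, assuming responses are already scored 0/1 (all names here are my own; NumPy's quantile is used to define the 25% cut-offs):

```python
import numpy as np

def discrimination_indices(responses, quantile=0.25):
    """d per item for a 0/1-scored test: Upper/Lower groups are the
    top and bottom `quantile` of people by total test score."""
    totals = responses.sum(axis=1)                 # step 1: total scores
    lower_cut = np.quantile(totals, quantile)      # step 2: group cut-offs
    upper_cut = np.quantile(totals, 1 - quantile)
    upper = responses[totals >= upper_cut]         # high scorers
    lower = responses[totals <= lower_cut]         # low scorers
    # steps 3-5: proportion correct per item in each group, then subtract
    return upper.mean(axis=0) - lower.mean(axis=0)

# Toy data: 8 people x 4 items (invented for illustration).
rng = np.random.default_rng(0)
scores = rng.integers(0, 2, size=(8, 4))
print(discrimination_indices(scores))
```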

32
Q

What do the different possible values of the item discrimination index mean? (x3)

A

If d = 1, all those in the upper group got the item correct, and all those in the lower group got it wrong
d = 0, no difference in performance of upper/lower groups
d = -1, all low scorers get it right and all high scorers get it wrong (discriminating, but the opposite of what you wanted - scoring error?)

33
Q

What is the item discrimination index with respect to criterion validity? (x2)

A

Does the item contribute to the scale’s relationship to an external criterion measure?
The higher the absolute value (d ranges from -1 to 1), the higher the item's correlation with the criterion variable

34
Q

How do you calculate the item discrimination index for validity?
ie how to evaluate items in a criterion-referenced test (x4)

A

Define upper and lower groups as the upper and lower 25% of scores on the external criterion
For each item, count the people in the high-scoring group who got the item correct, and the people in the low-scoring group who got it correct (ignore other Ps)
Work out the proportion correct in each group (number correct divided by group size)
Subtract the lower-group proportion from the upper-group proportion: d = U/nU - L/nL
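
In Python, the validity version differs only in how the groups are defined (a sketch; names and data are my own):

```python
import numpy as np

def validity_discrimination(responses, criterion, quantile=0.25):
    """d per item, with Upper/Lower groups defined by an external
    criterion score instead of the test's own total score."""
    lower_cut = np.quantile(criterion, quantile)
    upper_cut = np.quantile(criterion, 1 - quantile)
    upper = responses[criterion >= upper_cut]
    lower = responses[criterion <= lower_cut]
    return upper.mean(axis=0) - lower.mean(axis=0)

# Invented example: 4 people x 3 items, plus a criterion score each.
resp = np.array([[1, 1, 0], [1, 0, 0], [0, 0, 0], [1, 1, 1]])
crit = np.array([10.0, 4.0, 2.0, 12.0])
print(validity_discrimination(resp, crit))  # [1. 1. 1.]
```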

35
Q

How could you go about evaluating items for a criterion-referenced test? (x4)

A

Test one group of people who should fulfill the criterion (e.g. experienced colonoscopists).
Compare with second group of people who should not (e.g. novices who shouldn’t know anything about colonoscopy).
Work out item discrimination indices with these two groups substituted for the Upper Group (experienced colonoscopists) and Lower Group (novices).
If multiple choice, see if novices perform above chance on any questions

36
Q

What can examining the pattern of responses across items for high and low scorers in a multiple-choice test tell you? (x2)

A

You can see how people are reacting to the distractors in the response options
Is a disproportionate number of high scorers getting it wrong?
Or vice versa?
i.e. when people get it wrong, is it because particular distractors are especially plausible (or implausible)?

37
Q

Why does Mark do those horrid double true/false quiz questions? (x2)

A

Because the pass mark for the course is 50%,

So if he split them up, you could hypothetically pass the course not knowing anything at all

38
Q

Give an example of how Mark has used previous exams as a pilot for the current one (x3)

A

Open-ended pilot method:
Old versions of the test had a written component
Those written answers were used to generate the multiple-choice response options now that the course is so much bigger

39
Q

What might be a typical strategy for designing a validity/reliability study? (x5)

A

E.g. testing the correlation between Ps' test scores and an actual job performance rating would establish the criterion validity coefficient:
Recruit lots of Ps
Get them all to do the test
Get an expert to rate their real-life performance
Then test which items do (and don't) contribute to reliability/validity

40
Q

What is the ideal item difficulty (p) for a norm-referenced test? (x1)
Why? (x1)

A

50%

So half get it right and half wrong - maximum potential for discriminating

41
Q

What might you do if the spread of scores for items in a personality-type test is skewed? (x2)

A

Just because individual items are skewed, doesn’t mean that the total score will be.
Could try to correct the skew – perhaps through the wording (e.g. increasing the speed asked about)

42
Q

What do we mean by ‘criterion score’? (x3)

A

In addition to doing our new test,
Ps also complete a different test that happens to validly and reliably measure the same/related thing
This gives us a criterion score

43
Q

How do you calculate the item discrimination index for personality-type tests? (x5)

A

Need to create an arbitrary threshold/categories to take the place of the correct and incorrect categories,
E.g. treat 1-2 as 'incorrect' and 3-5 as 'correct' on a five-point Likert scale, then
For each item, count the people in the high-scoring group who got the item 'correct', and the people in the low-scoring group who got it 'correct' (ignore other Ps)
Work out the proportion 'correct' in each group (number 'correct' divided by group size)
Subtract the lower-group proportion from the upper-group proportion: d = U/nU - L/nL
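
A sketch for Likert-scored items, dichotomising at 3 as in the example above (assumptions: 1-5 ratings in a NumPy array; all names are my own):

```python
import numpy as np

def likert_discrimination(responses, threshold=3, quantile=0.25):
    """d per item for 1-5 Likert items: ratings >= threshold count
    as 'correct'; groups come from total raw score."""
    scored = (responses >= threshold).astype(int)  # 3-5 -> 'correct'
    totals = responses.sum(axis=1)                 # group on raw totals
    lower_cut = np.quantile(totals, quantile)
    upper_cut = np.quantile(totals, 1 - quantile)
    upper = scored[totals >= upper_cut]
    lower = scored[totals <= lower_cut]
    return upper.mean(axis=0) - lower.mean(axis=0)

# Invented data: 8 people x 3 Likert items.
rng = np.random.default_rng(1)
ratings = rng.integers(1, 6, size=(8, 3))
print(likert_discrimination(ratings))
```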

44
Q

How do you interpret item discrimination index scores?

A

A d of around .36 is OK

A d close to zero can reveal a ceiling effect - both upper and lower groups got the item right, so it fails to discriminate

45
Q

What might you do during the test revision phase of test construction? (x2)

A

Using all the data from your item analysis, add, delete, or alter items and retest it.
When finalised, might be time to test a standardization sample if required (assuming your existing study data can’t be used for this purpose)