Lecture 4 Test Development 2: Item Analysis & Score Meaning (Catherine) Flashcards
To provide a summary of Lecture 4 (Item Analysis & Score Meaning) for the purposes of revision
What are the stages of Test Development?
- Test Conceptualisation
- Test Construction
- Test Try Out
- Item Analysis
- Test Revision
- Repeat stages 4 & 5 as required
What constitutes a ‘Good’ Item?
A good item has (just like a good test):
*Content validity
*Criterion-related validity
*Test-retest reliability
*Internal consistency reliability
It also needs to differentiate between test-takers & have good discrimination (high scorers score well, low scorers score poorly on the item)
What constitutes a ‘Bad’ Item?
A bad item lacks differentiation between test-takers & has poor discrimination (low scorers score well, high scorers score poorly on the item). It also has poor:
*Content validity
*Criterion-related validity
*Test-retest reliability
*Internal consistency reliability
Such an item requires revision or deletion
How are good items identified?
*Good items are identified through the process of item analysis, which involves the analysis of individual items & overall test scores
Why is item analysis important?
It allows a test developer to identify items that do not perform well, which can be revised or discarded, thus improving the reliability & validity of the test
*The test also needs to be brief to prevent test fatigue, so it is important to retain only the best items
How does item analysis identify good items?
Item analysis is typically quantitative, although qualitative approaches are also possible, such as expert panels & “think aloud” exercises
What determines the type of item analysis implemented by the test developer?
- The analysis will be dependent on the test developer's objectives, i.e. the purpose of the test
- Some test developers want to create ‘clusters’ or item scales which hang together, leading to an emphasis on internal consistency & construct validity, as the cluster is measuring the same phenomenon
- Some test developers may want to create a test that predicts some criterion (e.g. job performance) & may not be so concerned with how well clusters of items hang together (emphasising criterion-related validity)
What are the analytic tools that Test Developers use to analyse and select items, & which are the most important?
- Item-difficulty index
- Item-discrimination index
- Item-validity index
- Item-reliability index
Item characteristics of particular interest are item difficulty & item discrimination
What are the key ways of calculating the Item-difficulty index?
- The proportion of test-takers who answered the item correctly (p)
- p can range from 0 (no-one answers correctly) to 1 (everyone answers correctly)
- Each item on a test has a corresponding p value (item 1 = p1; item 2 = p2)
- This statistic is also referred to as the ‘item endorsement index’ in non-achievement tests (e.g. personality tests)
Give a few quick examples to emphasise how simple it is to calculate the item-difficulty index
- if an item is answered correctly by 50 out of 100 (50% of) people, p = 0.5
- if an item is answered correctly by 75 out of 100 (75% of) people, p = 0.75
- if an item is answered correctly by 95 out of 100 (95% of) people, p = 0.95
Thus item 1 is harder than item 2, and item 2 is harder than item 3 (the lower the p value, the harder the item)
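A minimal sketch of the p calculation in Python (the function name and the 0/1 scoring convention are illustrative, not from the lecture):

```python
def item_difficulty(responses):
    """Proportion of test-takers answering the item correctly (p).

    `responses` is a list of 0/1 scores: 1 = correct, 0 = incorrect.
    """
    return sum(responses) / len(responses)

# The three worked examples above:
print(item_difficulty([1] * 50 + [0] * 50))  # 0.5  (50/100 correct)
print(item_difficulty([1] * 75 + [0] * 25))  # 0.75 (75/100 correct)
print(item_difficulty([1] * 95 + [0] * 5))   # 0.95 (95/100 correct)
```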
What is the ‘Ideal’ level of item difficulty for a test as a whole and how is it calculated?
- The optimal average item difficulty is 0.5, with individual items ranging from 0.3 to 0.8 (0.3 = somewhat difficult; 0.8 = somewhat easy)
- Items that no-one answers correctly (p = 0) or everyone answers correctly (p = 1) do not discriminate between test-takers
- The index of item difficulty for a test as a whole is calculated as the average of all the p values for the test items
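As a sketch, the test-level index is just the mean of the item p values; the flagging step below applies the 0.3-0.8 guideline from this card (the variable names are illustrative):

```python
p_values = [0.50, 0.75, 0.95]        # p for items 1-3 from the earlier examples

test_difficulty = sum(p_values) / len(p_values)
print(round(test_difficulty, 2))     # 0.73

# Flag items outside the suggested 0.3-0.8 range
flagged = [i + 1 for i, p in enumerate(p_values) if not 0.3 <= p <= 0.8]
print(flagged)                       # [3] -- item 3 (p = 0.95) is too easy
```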
When calculating the ‘Ideal’ level of item difficulty for a test as a whole what also needs to be taken into account when analysing items that use the selected-response format?
*The effect of guessing needs to be taken into account when analysing items that use the selected-response format
*Optimal item difficulty is the mid-point between 1 and the probability of guessing the answer correctly:
*True-false items (2 options)
the probability of guessing correctly = 0.5
therefore the optimal average item difficulty is p = (1 + 0.5)/2 = 0.75
*Multiple-Choice items (5 options)
the probability of guessing correctly = 0.2
therefore the optimal average item difficulty is p = (1 + 0.2)/2 = 0.60
*Multiple-Choice items (4 options)
the probability of guessing correctly = 0.25
therefore the optimal average item difficulty is p = (1 + 0.25)/2 = 0.625 ≈ 0.63
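A sketch of the guessing correction as a small Python function (the chance level is 1 divided by the number of options; the function name is illustrative):

```python
def optimal_difficulty(n_options):
    """Optimal average item difficulty: the midpoint between 1
    and the probability of guessing correctly (1 / n_options)."""
    chance = 1 / n_options
    return (1 + chance) / 2

print(optimal_difficulty(2))  # 0.75  (true-false)
print(optimal_difficulty(5))  # 0.6   (5-option multiple choice)
print(optimal_difficulty(4))  # 0.625 (4-option multiple choice, ~0.63)
```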
What is the Item-discrimination index?
- The Item-discrimination index is the degree to which an item differentiates correctly on the behaviour the test is designed to measure
- An item is a good item if:
- Most of the high scorers on the test overall answer the item correctly
- Most of the low scorers on the test overall answer the item incorrectly
- An item is not doing its job if it is more likely to be answered correctly by test-takers who least understand the subject matter than by those who most understand it
What are the key properties of the Item-discrimination index?
- Symbolised by d
- Compares performance on a particular item by the high-ability group & the low-ability group (i.e. the top 27% and the bottom 27% of overall test scorers)
- Items that discriminate well will have a high positive d value (to a maximum of +1)
- A negative d value is a red flag, as it means low scorers are doing better on that item than high scorers
What are the key ways of calculating the Item-discrimination index?
d is the difference between:
*The no. of high scorers answering the item correctly (U)
*The no. of low scorers answering the item correctly (L)
divided by the number of test-takers in each group (n):
d = (U - L) / n
If the distribution of test scores is normal, the number of test-takers in the upper group will equal the number in the lower group. Whatever the distribution, dividing U and L by the number in each group converts them to proportions on the same scale, so the two groups can be compared directly
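A sketch of d using the top/bottom 27% split described above (the sorting and group-size handling are illustrative assumptions):

```python
def item_discrimination(item_scores, total_scores, fraction=0.27):
    """d = (U - L) / n for one item.

    `item_scores`  : 0/1 score on this item for each test-taker
    `total_scores` : overall test score for the same test-takers
    """
    order = sorted(range(len(total_scores)), key=lambda i: total_scores[i])
    n = max(1, int(len(order) * fraction))       # size of each extreme group
    low_group, high_group = order[:n], order[-n:]
    U = sum(item_scores[i] for i in high_group)  # correct answers among high scorers
    L = sum(item_scores[i] for i in low_group)   # correct answers among low scorers
    return (U - L) / n

# Toy data: 10 test-takers; the item is answered correctly mainly by high scorers
item = [0, 0, 0, 1, 0, 1, 1, 1, 1, 1]
total = [10, 12, 15, 18, 20, 25, 28, 30, 33, 35]
print(item_discrimination(item, total))  # 1.0 (n = 2 per group; U = 2, L = 0)
```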
What are the key points to remember about Item-Characteristic Curves (ICCs) associated with the Item-discrimination index?
*They are graphic representations of item difficulty & item discrimination
*ICCs have the following characteristics:
- Ability is plotted on the x-axis & the probability of a correct response on the y-axis
- The steeper the slope, the greater the discrimination between high & low scorers
- Item difficulty is reflected in the shift of the ICC along the x-axis
- The more difficult an item is, the further the ICC shifts to the right, as fewer people answer the item correctly
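The lecture describes ICCs only qualitatively; as an assumed illustration, one common way to draw such a curve is the two-parameter logistic model, where a slope parameter a controls discrimination and a location parameter b controls difficulty (larger b shifts the curve to the right):

```python
import math

def icc(ability, a, b):
    """2PL item-characteristic curve: probability of a correct response.

    a = discrimination (slope of the curve)
    b = difficulty (location along the ability axis)
    """
    return 1 / (1 + math.exp(-a * (ability - b)))

# A harder item (b = 1.0) vs an easier item (b = -1.0) at average ability (0):
print(round(icc(0.0, a=1.5, b=1.0), 2))   # 0.18 -- hard item, curve shifted right
print(round(icc(0.0, a=1.5, b=-1.0), 2))  # 0.82 -- easy item, curve shifted left
```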