Lecture 4 Test Development 2: Item Analysis & Score Meaning (Catherine) Flashcards
To Provide a Summary of Lecture 4: Item Analysis & Score Meaning for the purposes of revision
What are the stages of Test Development?
- Test Conceptualisation
- Test Construction
- Test Try Out
- Item Analysis
- Test Revision
- Repeat stages 4 & 5 (Item Analysis & Test Revision) as required
What constitutes a ‘Good’ Item?
A Good Item has (just like a good test):
- Content Validity
- Criterion-Related Validity
- Test-Retest Reliability
- Internal Consistency Reliability
It also needs to:
- differentiate between test-takers
- have good discrimination (high scorers score well, low scorers score poorly on the item)
What constitutes a ‘Bad’ Item?
A Bad Item:
- lacks differentiation between test-takers
- has poor discrimination (low scorers score well, high scorers score poorly on the item)
- is weak on the same properties that make an item good: Content Validity, Criterion-Related Validity, Test-Retest Reliability, Internal Consistency Reliability
Such an item requires revision or deletion
How are good items identified?
*Good items are identified through the process of item analysis which involves the analysis of individual items & overall test-scores
Why is item analysis important?
It allows a test developer to identify items that do not perform well, which can be revised or discarded, thus improving the reliability & validity of the test
*The test also needs to be brief to prevent test fatigue, thus it is important to have only the best items
How does item analysis identify good items?
Item analysis is typically quantitative although qualitative approaches are also possible such as expert panels & “think aloud” exercises
What determines the type of item analysis implemented by the test developer?
- The analysis will depend on the test developer's objectives, i.e. the purpose of the test
- Some test developers want to create 'clusters' or item scales that hang together, leading to an emphasis on internal consistency & construct validity, since the cluster is measuring the same phenomenon
- Some test developers may want to create a test that predicts some criterion (e.g. job performance) & may not be so concerned with how well clusters of items hang together (emphasising criterion-related validity)
What are the analytic tools that Test Developers use to analyse and select items, & which are the most important?
- Item-difficulty index
- item-discrimination index
- item-validity index
- Item-reliability index
Item Characteristics of particular interest are item-difficulty & item-discrimination
What are the key ways of calculating the Item-difficulty index
- The proportion of test-takers who answered the item correctly (p)
- p can range from 0 (no-one answers correctly) to 1 (everyone answered correctly)
- Each item on a test has a corresponding p value (item 1 = p1; item 2 = p2)
- This statistic is also referred to as the 'item endorsement index' in non-achievement tests (e.g. personality tests)
Give a few quick examples to emphasise how simple it is to calculate the item-difficulty index
- if an item is answered correctly by 50 out of 100 (50% of) people, p = 0.5
- if an item is answered correctly by 75 out of 100 (75% of) people, p = 0.75
- if an item is answered correctly by 95 out of 100 (95% of) people, p = 0.95
Thus item 1 is harder than item 2, and item 2 is harder than item 3 (see the sketch below)
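As a quick illustration of the arithmetic, a minimal Python sketch (the `responses` matrix is made up):

```python
# Minimal sketch: item-difficulty index (p) per item.
# 'responses' is a made-up 0/1 matrix: rows = test-takers, columns = items,
# 1 = answered correctly.
responses = [
    [1, 1, 1],
    [0, 1, 1],
    [1, 0, 1],
    [0, 1, 1],
]

n_takers = len(responses)
n_items = len(responses[0])

# p_i = proportion of test-takers answering item i correctly
p_values = [sum(row[i] for row in responses) / n_takers for i in range(n_items)]
print(p_values)  # [0.5, 0.75, 1.0] -> the last item fails to discriminate
```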
What is the ‘Ideal’ level of item difficulty for a test as a whole and how is it calculated?
- The optimal average item difficulty is 0.5, with individual items ranging from 0.3 (somewhat difficult) to 0.8 (somewhat easy)
- Items that no-one answers correctly (p = 0) or everyone answers correctly (p = 1) do not discriminate between test-takers
- The index of item difficulty for a test as a whole is calculated as the average of all the p-values for the test items
When calculating the ‘Ideal’ level of item difficulty for a test as a whole what also needs to be taken into account when analysing items that use the selected-response format?
*The effect of guessing needs to be taken into account when analysing items that use the selected-response format
*Optimal item difficulty is the mid-point between 1 and the probability of guessing the answer correctly:
*True-false items (2 options)
the probability of guessing correctly = 0.5
therefore the optimal average item difficulty is p = (1 + 0.5) / 2 = 0.75
*Multiple-Choice items (5 options)
the probability of guessing correctly = 0.2
therefore the optimal average item difficulty is p = (1 + 0.2) / 2 = 0.60
*Multiple-Choice items (4 options)
the probability of guessing correctly = 0.25
therefore the optimal average item difficulty is p = (1 + 0.25) / 2 = 0.625 ≈ 0.63 (see the sketch below)
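A minimal sketch of that guessing adjustment (the function name `optimal_difficulty` is just illustrative):

```python
# Minimal sketch: optimal item difficulty adjusted for guessing, i.e. the
# mid-point between 1 and the chance probability of guessing correctly.
def optimal_difficulty(n_options: int) -> float:
    chance = 1 / n_options      # probability of guessing correctly
    return (1 + chance) / 2     # mid-point between 1 and chance

print(optimal_difficulty(2))    # true-false: 0.75
print(optimal_difficulty(5))    # 5-option multiple choice: 0.6
print(optimal_difficulty(4))    # 4-option multiple choice: 0.625
```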
What is the Item-discrimination index?
- The Item-discrimination index is the degree to which an item differentiates correctly on the behaviour the test is designed to measure
- An item is a good item if:
- Most of the high scorers on the test overall answer the item correctly
- Most of the low scorers on the test overall answer the item incorrectly
- An item is not doing its job if it is more likely to be answered correctly by test-takers who least understand the subject matter than by those who most understand it.
What are the key properties of the Item-discrimination index?
- Symbolised by d
- Compares performance on a particular item by the high ability group & the low ability group (i.e. the top 27% and the bottom 27% of scorers)
- Items that discriminate well will have a high positive d value (to a maximum of 1)
- A negative d value is a red flag as it means low scorers are doing better on that item than high scorers
What are the key ways of calculating the Item-discrimination index?
The difference between:
*The no. of high scorers answering the item correctly (U)
*The no. of low scorers answering the item correctly (L)
*Divided by the number of scores in each group (n)
d = (U - L) / n
If the distribution is normal, the number in U will equal the number in L. Irrespective of the distribution, the division of U and L by the number in each group will yield the same result (see the sketch below)
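A minimal sketch of the d calculation (the counts and group size are illustrative and match the worked examples further down):

```python
# Minimal sketch: item-discrimination index d = (U - L) / n, where U and L are
# the numbers of correct answers to the item in the upper and lower scoring
# groups (e.g. top and bottom 27% of test-takers) and n is the size of each group.
def discrimination_index(correct_upper: int, correct_lower: int, group_size: int) -> float:
    return (correct_upper - correct_lower) / group_size

print(discrimination_index(30, 10, 32))   # 0.625 -> keep
print(discrimination_index(20, 16, 32))   # 0.125 -> revise
print(discrimination_index(0, 32, 32))    # -1.0  -> discard (negative d is a red flag)
```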
What are the key points to remember about Item-Characteristic Curves (ICCs) associated with the Item-discrimination index?
*They are graphic representations of item difficulty & item discrimination
*ICCs have the following characteristics:
-Ability is plotted on the x-axis & the probability of a correct response on the y-axis
-The steeper the slope the greater the discrimination between high & low scorers
-Item-difficulty is reflected in the shift of the ICC along the x-axis
The more difficult an item is, the more the ICC shifts to the right, as fewer people have answered the item correctly (illustrated in the sketch below)
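The lecture treats ICCs graphically; purely as an illustration, the sketch below assumes a simple logistic curve in which a slope parameter stands in for discrimination and a difficulty parameter shifts the curve along the ability axis (all parameter values are made up):

```python
import math

# Illustration only: a simple logistic curve as one way of drawing an ICC.
# 'a' (slope) controls how sharply the item discriminates; 'b' (difficulty)
# shifts the curve along the ability (x) axis.
def icc(ability: float, a: float, b: float) -> float:
    return 1 / (1 + math.exp(-a * (ability - b)))

for ability in [-2, -1, 0, 1, 2]:
    easy_item = icc(ability, a=1.5, b=-1.0)   # easier item: curve sits to the left
    hard_item = icc(ability, a=1.5, b=1.0)    # harder item: curve shifted right
    print(f"ability {ability:+d}: easy={easy_item:.2f}  hard={hard_item:.2f}")
```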
Describe the Item-Characteristic Curves (ICCs) associated with the Item-discrimination index?
a Bad item will have:
- a negative slope or
- an inverted U (or n shaped) curve
A good item will have:
- a positive slope
- or an inverted L (ONLY if pass/fail criteria)
Give examples of good and bad items on the Item-discrimination index
- U=20, L=16, U-L=4, n=32, d=0.13 revise
- U=30, L=10, U-L=20, n=32, d=0.63 keep
- U=32, L=0, U-L=32, n=32, d=1.00 keep
- U=16, L=16, U-L=0, n=32, d=0.00 revise
- U=0, L=32, U-L=-32, n=32, d=-1.00 discard
What are the key points in relation to the Item-Validity Index?
- The Item-Validity index provides an indication of the degree to which a test measures what it purports to measure
- This index is equal to the product of the item-score standard deviation (Si) & the correlation between the item score and the criterion score (riC)
- The item-score standard deviation (Si) is calculated from the item's item-difficulty index score (pi); for a dichotomously scored item, Si = √(pi(1 − pi))
- We can therefore use information about the item's difficulty to work out the item's validity
- If the item correlates with the criterion it is supposed to predict (e.g. job performance), it has good criterion-related validity; if it does not, it is low in validity and a candidate for removal (see the sketch below)
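A minimal sketch of the item-validity index under the definitions above (the p value and correlation are made up):

```python
import math

# Minimal sketch: item-validity index = S_i * r_iC, where S_i is the item-score
# standard deviation (sqrt(p * (1 - p)) for a 0/1 item) and r_iC is the
# correlation between the item score and the criterion. Values are made up.
p_i = 0.6         # item-difficulty index for the item
r_iC = 0.35       # hypothetical item-criterion correlation

s_i = math.sqrt(p_i * (1 - p_i))        # item-score standard deviation
print(round(s_i * r_iC, 3))             # item-validity index, ~0.171
```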
What are the key points in relation to the Item-Reliability Index?
- The Item-Reliability index is used to provide an indication of the internal consistency of a test, which helps us to determine whether all items consistently measure the same construct
- The selection of items that are associated most strongly with the total test score helps to increase a test's internal consistency reliability
- There are 2, inter-related, ways to approach this:
- Calculate an item’s reliability index
- calculate the internal consistency reliability of the test/scale (Cronbach’s alpha)
How does one calculate an item’s reliability in relation to the Item-Reliability Index?
- This index is equal to the product of the item-score standard deviation (Si) & the correlation between the item score and the total test score: Si × riT
- Again, the item-score standard deviation of an item (denoted by Si) is calculated using the item's item-difficulty index score (pi) (see the sketch below)
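A minimal sketch of the item-reliability index (item and total scores are made up; `statistics.correlation` needs Python 3.10+):

```python
import math
from statistics import correlation

# Minimal sketch: item-reliability index = S_i * r_iT, where r_iT is the
# correlation between the item score and the total test score.
# The scores below are made up for illustration.
item_scores  = [1, 0, 1, 1, 0, 1]      # 0/1 scores on one item
total_scores = [9, 4, 8, 7, 5, 10]     # total test scores for the same people

p_i = sum(item_scores) / len(item_scores)   # item-difficulty index
s_i = math.sqrt(p_i * (1 - p_i))            # item-score standard deviation
r_iT = correlation(item_scores, total_scores)
print(round(s_i * r_iT, 3))                 # item-reliability index
```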
How does one calculate the test’s internal consistency reliability (AKA Reliability of Scales) in relation to the Item-Reliability Index?
- We are primarily interested in the reliability of clusters of items (i.e. of a scale or a test as a whole)
- Measures of inter-item consistency are useful in examining whether a scale is homogeneous (all items are measuring the same thing)
- For a dichotomous response scale (true-false) use the Kuder-Richardson formula 20
- For a non-dichotomous response scale (Likert) use Cronbach's Alpha (see the sketch below)
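A minimal sketch of Cronbach's Alpha for a made-up score matrix (with dichotomous 0/1 items the same calculation reproduces Kuder-Richardson formula 20):

```python
from statistics import pvariance

# Minimal sketch: Cronbach's Alpha for a small, made-up set of Likert scores.
# Rows = test-takers, columns = items.
scores = [
    [3, 4, 3, 5],
    [2, 2, 3, 3],
    [4, 5, 4, 5],
    [1, 2, 2, 2],
]

k = len(scores[0])                                      # number of items
item_vars = [pvariance([row[i] for row in scores]) for i in range(k)]
total_var = pvariance([sum(row) for row in scores])     # variance of total scores
alpha = (k / (k - 1)) * (1 - sum(item_vars) / total_var)
print(round(alpha, 2))                                  # ~0.96 for this toy data
```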
What are the key points in relation to Factor Analysis and Inter-Item consistency?
- Factor Analysis is a statistical tool useful in determining whether items on a test appear to be measuring the same thing
- Through Factor Analysis, items that do not load on the factor they were written to tap can be revised or eliminated
- If too many items appear to be tapping a particular area, the weakest of such items can be eliminated/removed (a loading check is sketched below)
- Factor Analysis is not the same as cluster analysis
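A minimal sketch of checking factor loadings, assuming scikit-learn is available (the data matrix here is random placeholder data, so the loadings are meaningless; a real analysis would use respondents' item scores):

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

# Minimal sketch: fit a 2-factor model and inspect which items load where.
# 'data' stands in for a respondents-by-items score matrix (placeholder only).
rng = np.random.default_rng(0)
data = rng.normal(size=(200, 6))      # 200 respondents, 6 items (random placeholder)

fa = FactorAnalysis(n_components=2)
fa.fit(data)

# Rows = factors, columns = items; items with weak loadings on their intended
# factor are candidates for revision or removal.
print(np.round(fa.components_, 2))
```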
What are important issues when considering the meaning of test scores?
- The types of tests used
- Maximum-Performance tests (exams)
- Typical-Performance tests (personality tests)
- The intended use of the test
- do we want to understand performance relative to some objective standard (Criterion-referenced test - driving licence)
- do we want to understand performance relative to other people (Norm-referenced test - IQ test)
*How test results are expressed (which is dependent on the purpose of the test)