Lecture 4 Test Development 2: Item Analysis & Score Meaning (Catherine) Flashcards
To Provide a Summary of Lecture 4: Item Analysis & Score Meaning for the purposes of revision
What are the stages of Test Development?
- Test Conceptualisation
- Test Construction
- Test Try Out
- Item Analysis
- Test Revision
- Repeat stages 4 & 5 (Item Analysis & Test Revision) as required
What constitutes a ‘Good’ Item?
A good item has (just like a good test):
*Content validity
*Criterion-related validity
*Test-retest reliability
*Internal consistency reliability
It also needs to:
*differentiate between test-takers
*have good discrimination (high scorers on the test score well on the item, low scorers score poorly on it)
What constitutes a ‘Bad’ Item?
A bad item lacks differentiation between test-takers & has poor discrimination (low scorers score well on the item, high scorers score poorly on it). It also has poor:
*Content validity
*Criterion-related validity
*Test-retest reliability
*Internal consistency reliability
Such an item requires revision or deletion
How are good items identified?
*Good items are identified through the process of item analysis which involves the analysis of individual items & overall test-scores
Why is item analysis important?
It allows a test developer to identify items that do not perform well, which can be revised or discarded, thus improving the reliability & validity of the test
*The test also needs to be brief to prevent test fatigue, thus it is important to have only the best items
How does item analysis identify good items?
Item analysis is typically quantitative although qualitative approaches are also possible such as expert panels & “think aloud” exercises
What determines the type of item analysis implemented by the test developer?
- The analysis will depend on the test developer's objectives, i.e. the purpose of the test
- Some test developers want to create 'clusters' or item scales that hang together, leading to an emphasis on internal consistency & construct validity, as the cluster is measuring the same phenomenon
- Some test developers may want to create a test that predicts some criterion (e.g. job performance) & may not be so concerned with how well clusters of items hang together (emphasising criterion-related validity)
What are the analytic tools that Test Developers use to analyse and select items, & which are the most important?
- Item-difficulty index
- Item-discrimination index
- Item-validity index
- Item-reliability index
Item Characteristics of particular interest are item-difficulty & item-discrimination
What are the key ways of calculating the Item-difficulty index?
- The proportion of test-takers who answered the item correctly (p)
- p can range from 0 (no-one answers correctly) to 1 (everyone answered correctly)
- Each item on a test has a corresponding p value (item 1 = p1; item 2 = p2)
- This statistic is also referred to as the 'item endorsement index' in non-achievement tests (e.g. personality tests)
Give a few quick examples to emphasise how simple it is to calculate the item-difficulty index
- if an item is answered correctly by 50 out of 100 (50% of) people, p = 0.5
- if an item is answered correctly by 75 out of 100 (75% of) people, p = 0.75
- if an item is answered correctly by 95 out of 100 (95% of) people, p = 0.95
Thus item 1 is harder than item 2, and item 2 is harder than item 3 (the lower the p value, the harder the item)
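As a minimal sketch (Python; the function and variable names are illustrative, not from the lecture), the index is a single division:

```python
# Illustrative sketch: item-difficulty index p = proportion answering correctly.
def item_difficulty(num_correct, num_test_takers):
    """Return the item-difficulty index p for a single item."""
    return num_correct / num_test_takers

# The three examples above:
print(item_difficulty(50, 100))  # 0.5  (hardest of the three)
print(item_difficulty(75, 100))  # 0.75
print(item_difficulty(95, 100))  # 0.95 (easiest of the three)
```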
What is the ‘Ideal’ level of item difficulty for a test as a whole and how is it calculated?
- The optimal average item difficulty is 0.5, with individual items ranging from 0.3 (somewhat difficult) to 0.8 (somewhat easy)
- Items that no-one answers correctly (p = 0) or everyone answers correctly (p = 1) do not discriminate between test-takers
- The index of item difficulty for a test as a whole is calculated as the average of all the p-values for the test items
When calculating the ‘Ideal’ level of item difficulty for a test as a whole what also needs to be taken into account when analysing items that use the selected-response format?
*The effect of guessing needs to be taken into account when analysing items that use the selected-response format
*Optimal item difficulty is the mid-point between 1 and the probability of guessing the answer correctly:
*True-false items (2 options):
the probability of guessing correctly = 0.5
therefore the optimal average item difficulty is p = (1 + 0.5) / 2 = 0.75
*Multiple-choice items (5 options):
the probability of guessing correctly = 0.2
therefore the optimal average item difficulty is p = (1 + 0.2) / 2 = 0.60
*Multiple-choice items (4 options):
the probability of guessing correctly = 0.25
therefore the optimal average item difficulty is p = (1 + 0.25) / 2 = 0.625 ≈ 0.63
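A minimal sketch of this rule (the function name is my own, not from the lecture):

```python
# Sketch: optimal item difficulty for selected-response items,
# taking guessing into account (midpoint between 1 and chance level).
def optimal_difficulty(num_options):
    chance = 1 / num_options   # probability of guessing correctly
    return (1 + chance) / 2    # midpoint between 1 and chance

print(optimal_difficulty(2))   # 0.75   (true-false)
print(optimal_difficulty(5))   # 0.6    (5-option multiple choice)
print(optimal_difficulty(4))   # 0.625  (4-option multiple choice, ~0.63)
```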
What is the Item-discrimination index?
- The Item-discrimination index is the degree to which an item differentiates correctly on the behaviour the test is designed to measure
- An item is a good item if:
- Most of the high scorers on the test overall answer the item correctly
- Most of the low scorers on the test overall answer the item incorrectly
- An item is not doing its job if it is more likely to be answered correctly by test-takers who least understand the subject matter than by those who most understand it.
What are the key properties of the Item-discrimination index?
- Symbolised by d
- Compares performance on a particular item by the high-ability group & the low-ability group (i.e. the top 27% and the bottom 27% of overall test scores)
- Items that discriminate well will have a high positive d value (to a maximum of +1)
- A negative d value is a red flag, as it means low scorers are doing better on that item than high scorers
What are the key ways of calculating the Item-discrimination index?
The difference between:
*the number of high scorers answering the item correctly (U)
*the number of low scorers answering the item correctly (L)
*divided by the number of test-takers in each group (n):
d = (U - L) / n
The upper and lower groups contain the same number of test-takers, so dividing U - L by n always yields a d value between -1 and +1
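A minimal sketch of the calculation (Python; names are my own):

```python
# Sketch: item-discrimination index d = (U - L) / n, where U and L are the
# numbers of correct answers in the upper and lower scoring groups and n is
# the size of each group (e.g. the top and bottom 27% of test-takers).
def item_discrimination(u_correct, l_correct, group_size):
    return (u_correct - l_correct) / group_size

print(item_discrimination(30, 10, 32))  # 0.625 -> a well-discriminating item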
What are the key points to remember about Item-Characteristic Curves (ICCs) associated with the Item-discrimination index?
*They are graphic representations of item difficulty & item discrimination
*ICCs have the following characteristics:
-Ability is plotted on the x-axis & probability of correct response is plotted on the y-axis
-The steeper the slope the greater the discrimination between high & low scorers
-Item-difficulty is reflected in the shift of the ICC along the x-axis
The more difficult an item is, the more the ICC shifts to the right as fewer people have answered the item correctly
Describe the Item-Characteristic Curves (ICCs) associated with the Item-discrimination index?
A bad item will have:
- a negative slope or
- an inverted U (or n shaped) curve
A good item will have:
- a positive slope
- or an inverted L shape (ONLY if there is a pass/fail criterion)
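The lecture describes these shapes qualitatively. As an illustrative sketch, the two-parameter logistic (2PL) model from item response theory can draw curves with exactly these properties; the model choice and parameter values below are my assumptions, not part of the lecture:

```python
# Sketch of ICCs using the 2PL model: P(correct) = 1 / (1 + exp(-a*(ability - b)))
#   a = discrimination (slope), b = difficulty (left-right shift of the curve)
import numpy as np
import matplotlib.pyplot as plt

ability = np.linspace(-4, 4, 200)                  # x-axis: ability

def icc(ability, a, b):
    return 1 / (1 + np.exp(-a * (ability - b)))    # y-axis: P(correct)

plt.plot(ability, icc(ability, a=2.0, b=0.0), label="steep slope: good discrimination")
plt.plot(ability, icc(ability, a=0.5, b=0.0), label="shallow slope: poor discrimination")
plt.plot(ability, icc(ability, a=2.0, b=1.5), label="shifted right: more difficult item")
plt.plot(ability, icc(ability, a=-1.0, b=0.0), label="negative slope: bad item")
plt.xlabel("Ability")
plt.ylabel("Probability of correct response")
plt.legend()
plt.show()
```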
Give examples of good and bad items on the Item-discrimination index
- U=20, L=16, U-L=4, n=32, d=0.13 revise
- U=30, L=10, U-L=20, n=32, d=0.63 keep
- U=32, L=0, U-L=32, n=32, d=1.00 keep
- U=16, L=16, U-L=0, n=32, d=0.00 revise
- U=0, L=32, U-L=-32, n=32, d=-1.00 discard
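A quick sketch that recomputes these verdicts; the cutoffs used (keep if d ≥ 0.3, discard if d < 0) are my assumption for illustration, since the lecture gives verdicts rather than a threshold:

```python
# Sketch: recomputing d = (U - L) / n for the examples above.
# The keep/revise/discard cutoffs are illustrative assumptions only.
examples = [(20, 16), (30, 10), (32, 0), (16, 16), (0, 32)]
n = 32
for u, l in examples:
    d = (u - l) / n
    verdict = "keep" if d >= 0.3 else ("discard" if d < 0 else "revise")
    print(f"U={u:2d} L={l:2d} d={d:+.2f} -> {verdict}")
```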
What are the key points in relation to the Item-Validity Index?
- The Item-Validity index provides an indication of the degree to which a test measures what it purports to measure
- This index is equal to the product of the item-score standard deviation (Si) & the correlation between the item score and the criterion score (riC)
- The item-score standard deviation (Si) is calculated from the item's item-difficulty index score (pi): for a dichotomous item, Si = √(pi(1 - pi))
- We can therefore use information about item difficulty to work out the item's validity
- If the item correlates with the criterion it is intended to predict (e.g. job performance), it has good criterion-related validity; if the correlation is poor, it is low in criterion-related validity and a candidate for removal
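A minimal sketch, assuming a dichotomously scored item so that Si = √(pi(1 - pi)); the values and names are invented for illustration:

```python
# Sketch: item-validity index = Si * riC (item-score SD times
# item-criterion correlation), for a dichotomously scored item.
import math

def item_validity_index(p, r_item_criterion):
    s_i = math.sqrt(p * (1 - p))   # item-score standard deviation from pi
    return s_i * r_item_criterion

print(item_validity_index(p=0.5, r_item_criterion=0.4))  # 0.5 * 0.4 = 0.2
```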
What are the key points in relation to the Item-Reliability Index?
- The Item-Reliability index is used to provide an indication of the internal consistency of a test, which helps us to determine whether all items consistently measure the same construct
- The selection of items that are associated most strongly with the total test score helps to increase a test's internal consistency reliability
- There are two inter-related ways to approach this:
- calculate an item's reliability index
- calculate the internal consistency reliability of the test/scale (Cronbach's alpha)
How does one calculate an item’s reliability in relation to the Item-Reliability Index?
- This index is equal to the product of the item-score standard deviation (Si) & the correlation between the item score and the total test score: Si riT
- Again, the item-score standard deviation of an item (denoted by Si) is calculated using the item's item-difficulty index score (pi): Si = √(pi(1 - pi))
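A minimal sketch that estimates riT as the correlation between 0/1 item scores and total test scores, then forms Si riT; the data are invented purely for illustration:

```python
# Sketch: item-reliability index = Si * riT for one dichotomous item.
import numpy as np

item = np.array([1, 1, 0, 1, 0, 0, 1, 0])          # one item, 8 test-takers
total = np.array([28, 25, 14, 22, 11, 9, 26, 15])  # total test scores

p = item.mean()                         # item-difficulty index pi
s_i = np.sqrt(p * (1 - p))              # item-score SD from pi
r_it = np.corrcoef(item, total)[0, 1]   # item-total correlation riT
print(s_i * r_it)                       # item-reliability index
```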
How does one calculate the test’s internal consistency reliability (AKA Reliability of Scales) in relation to the Item-Reliability Index?
- We are primarily interested in the reliability of clusters of items (i.e. of a scale or a test as a whole)
- measures of inter-item consistency are useful in examining whether a scale is homogeneous (all items are measuring the same thing)
- For a dichotomous response scale (true-false) use the Kuder-Richardson formula 20
- For a non-dichotomous response scale (Likert) use Cronbach’s Alpha
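As a minimal sketch (Python, with invented Likert data; the lecture names the statistics, not any implementation), Cronbach's alpha can be computed directly from an items matrix, and for 0/1 items the same formula reduces to KR-20:

```python
# Sketch: Cronbach's alpha for a scale (rows = test-takers, columns = items).
# For dichotomous (0/1) items this formula gives KR-20.
import numpy as np

def cronbach_alpha(scores):
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                          # number of items
    item_vars = scores.var(axis=0, ddof=1)       # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Invented 5-person, 4-item Likert data, purely for illustration:
data = [[4, 5, 4, 4],
        [2, 2, 3, 2],
        [5, 5, 4, 5],
        [3, 3, 3, 4],
        [1, 2, 2, 1]]
print(cronbach_alpha(data))
```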
What are the key points in relation to Factor Analysis and Inter-Item consistency?
- Factor Analysis is a statistical tool useful in determining whether items on a test appear to be measuring the same thing
- Through Factor Analysis, items that do not load on the factor they were written to tap can be revised or eliminated
- If too many items appear to be tapping a particular area, the weakest of such items can be eliminated/removed
- Factor Analysis is not the same as cluster analysis
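A hedged sketch using scikit-learn's FactorAnalysis on invented data (the lecture names the technique, not this library or these data): items with weak loadings on their intended factor are candidates for revision or removal.

```python
# Sketch: inspecting item loadings with factor analysis.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 1))               # one underlying trait
noise = rng.normal(scale=0.5, size=(200, 4))
items = latent @ np.ones((1, 4)) + noise         # 4 items tapping the trait

fa = FactorAnalysis(n_components=1).fit(items)
print(fa.components_)  # loadings: items loading weakly could be revised/removed
```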
What are important issues when considering the meaning of test scores?
- The types of tests used
- Maximum-Performance tests (exams)
- Typical-Performance tests (personality tests)
- The intended use of the test
- do we want to understand performance relative to some objective standard? (criterion-referenced test, e.g. driving licence)
- do we want to understand performance relative to other people? (norm-referenced test, e.g. IQ test)
*How test results are expressed (which is dependent on the purpose of the test)
What different types of tests are there?
- Maximum-Performance tests (exams)
- Test-takers do their best work, so their ability, whether achieved or potential, is being tested
- Achievement tests - measure what we have learned
- Aptitude tests - measure what we are capable of
- Used to gain admission to university or employment
- Typical-Performance tests (personality tests)
- Focus is on what test-takers actually do or what they are really like rather than what they are capable of
- Exemplified by personality tests, which have no correct or incorrect answers
How do we make meaning of test scores?
Depends on the intended use of the test:
- do we want to understand performance relative to some objective standard
- Criterion-referenced test - e.g. driving licence
- do we want to understand performance relative to other people
- Norm-referenced test - e.g. IQ test
The purpose of the test informs how test results are expressed
How do we make meaning of test scores with criterion-referenced tests (where we want to understand performance relative to some objective standard, e.g. a driving licence)?
- Scores are expressed in terms of the absolute level of knowledge, skills, or ability achieved (i.e. mastery) NOT in comparison with other people
- Mastery does not constitute “complete” knowledge, but rather a high level of knowledge of a given area, e.g. 80% / HD
- Criterion Referenced testing is common in assessing achievement in educational settings
- AKA: domain-, content- or objective-referenced tests
What are some of the advantages & disadvantages of criterion-referenced tests?
- We are able to compare test-takers with an “absolute standard” of perfection
- Examples of test scores are: % correct and letter grades (A-E; HD-P)
- Key advantage: these scores tell us about the test-taker’s knowledge of the content of a domain of interest
- Key disadvantage: there is no way of expressing the scores in a generalised fashion (they tell us nothing about a test-taker's standing relative to others)
How do we make meaning of test scores with norm-referenced tests (where we want to understand performance relative to other people)?
- This involves comparing the test-taker with similar others (i.e. how you sit relative to your peers)
- Uses norms: a known distribution of scores for some specified reference group
- Norms are critical to understanding the meaning of test scores
- if norms are out of date, a test-taker’s scores cannot be interpreted meaningfully
- Norm-referenced tests use derived scores NOT raw scores
Why do norm-referenced tests use derived scores NOT raw scores?
Derived scores make test interpretation easier
*by ensuring scores from different tests are more comparable by expressing them in the same metric
*To help us make more meaningful interpretations of test scores
NB: *We must have accurate raw scores to get accurate derived scores
*No amount of statistical manipulation can make up for the use of a poor test or for making mistakes in scoring
-If a raw score has been normed (e.g. converted to a z-score), we can interpret it in relation to the mean and SD, and so compare performance on our test with performance on another test.
What has been proposed to remedy some of the problems with norm-referenced scoring?
Standardised scoring systems have been proposed to remedy some of the problems with norm-referenced scoring
What are the 5 types of Standardised scoring within norm-referenced scoring?
*Linear Standard Scoring
*Rank within Groups
*Normalised Standard Scores
also
*Range of scores within a group
*Status of those obtaining the same score
These scores are relatively independent of content difficulty because they base a test-taker’s score on the performance of other people in the comparison group
What are the main properties of Linear Standard Scoring?
- These scores are based on the Standard Deviation
- They have properties making them more valuable in research than most other derived scores
- All Linear standard scores tell us the location of an examinee’s raw score in relation to the mean of some specified group & in terms of the group’s standard deviation
- Examples of Linear Standard Scoring are:
- z-scores
- T-scores
- Deviation IQ scores
What are the properties making Linear Standard Scores more valuable in research than most other derived scores?
- For every test & group, each type of score has the same mean and SD (e.g. z-scores always have a mean of 0 and an SD of 1)
- These scores retain the shape of the raw distribution, changing only the calibration or metric
- They permit inter-group and inter-test comparisons that are not possible with most other types of score
- They can be treated mathematically (e.g. averaged) in ways that other scores cannot be
What are the properties of z-scores in Norm-Referenced Linear Standard Scores?
- z-scores tell us the distance, in standard deviation units, between a group's mean and any specified raw score value: z = (X - M) / SD
- z-scores involve the transformation of raw scores into standard scores, which are then related to the normal distribution
- If a z-score is found for each examinee in the group, the mean z-score will be 0 and the SD will be 1
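A minimal sketch of the transformation (the raw scores are invented for illustration):

```python
# Sketch: converting raw scores to z-scores with z = (X - M) / SD.
import numpy as np

raw = np.array([35, 42, 50, 58, 65])
z = (raw - raw.mean()) / raw.std()

print(z.round(2))         # each score's distance from the mean in SD units
print(z.mean(), z.std())  # ~0.0 and 1.0, as the card states
```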
What are the Advantages & Disadvantages of z-scores in Norm-Referenced Linear Standard Scores?
Advantages:
- They permit an understanding of an individual’s score in relation to other test-takers
- They allow for comparison across different tests
Disadvantages:
- Half of all z-scores are negative (which can be viewed as undesirable)
- All z-scores are expressed to one or two decimal places
Other linear standard scores have been designed to eliminate the decimal point and obtain smaller units & eliminate negative values (e.g. T-Scores)
What are the properties of T-scores in Norm-Referenced Linear Standard Scores?
*The T-Score is one of the most common linear standard scores
The Rationale is the same as the z-score except Mean = 50 & SD = 10
*The T-Score has many of the same advantages & disadvantages as the z-score
*It is less useful than the z score for certain research purposes, but is more convenient to interpret as there are no negative values
T = 10z + 50
What are the properties of Deviation IQ scores in Norm-Referenced Linear Standard Scores?
Wechsler IQ Scores
- These tests of intelligence yield 3 scores: a verbal IQ, Performance IQ and a Full Scale IQ score
- have a Mean of 100 & SD of 15
- Formula for this IQ score is: IQ = 15z +100
Stanford-Binet IQ Scores
- Revised from Ratio IQ to Deviation IQ in the 1960s
- have a Mean of 100 & SD of 16
- Formula for this IQ score is: IQ = 16z +100
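A small sketch showing how one z-score is re-expressed on each of these linear standard scales (the formulas come from the cards above; the function names are my own):

```python
# Sketch: the same z-score on three linear standard scales.
def t_score(z):
    return 10 * z + 50   # T-score: M = 50, SD = 10

def wechsler_iq(z):
    return 15 * z + 100  # Wechsler deviation IQ: M = 100, SD = 15

def stanford_binet_iq(z):
    return 16 * z + 100  # Stanford-Binet deviation IQ: M = 100, SD = 16

z = 1.0  # one SD above the mean
print(t_score(z), wechsler_iq(z), stanford_binet_iq(z))  # 60.0 115.0 116.0
```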
What are the properties of “Rank Within Group” scores in Norm-Referenced Scores?
- Unlike Linear Standard Scores, “rank within group” scores are based on the number of people with scores higher or lower than a specified score value (i.e. I came first)
- Unlike Linear Standard Scores, "rank within group" scores should not be averaged (ranks are ordinal)
- Examples of Rank within Group scores are:
- Ranks
- Percentile ranks
- Decile ranks
- ENTER score derived from VCE
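As an illustrative sketch, here is one common convention for computing a percentile rank; the lecture does not specify a formula, so the midpoint convention below is an assumption:

```python
# Sketch: percentile rank of a score within a norm group, using the
# midpoint convention PR = (below + 0.5 * equal) / N * 100.
def percentile_rank(score, norm_group):
    below = sum(1 for s in norm_group if s < score)
    equal = sum(1 for s in norm_group if s == score)
    return 100 * (below + 0.5 * equal) / len(norm_group)

norms = [10, 12, 15, 15, 18, 20, 22, 25, 27, 30]
print(percentile_rank(20, norms))  # 55.0 -> scored as well as or better than ~55%
```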
What are the Advantages & Disadvantages of “Rank Within Group” scores in Norm-Referenced Scores?
Advantage:
Some of these scores (especially normalised standard scores) have the effect of creating a distribution that is closer to normal than the actual distribution of obtained raw scores (Lyman, 1998)
Disadvantage:
Information about a test-taker's distance from the mean is lost
What are the properties of “Normalised Standard Scores” in Norm-Referenced Scores?
*Normalised standard scores are assigned standard-score-like values but are computed from percentile ranks
*With Linear Standard Scores (e.g. z-scores, T-scores, Deviation IQs) the shape of the distribution of raw scores is faithfully reproduced
*Normalised standard scores have the property of making a distribution that is a closer approximation of the normal probability distribution
*Examples are:
-T-Scaled scores (like T-Scores)
-Stanine scores
('standard score with nine units')
Mean = 5; SD = 2; formula: 2z + 5
-Sten scores ('standard ten'): Mean = 5.5; SD = 2
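A sketch of assigning stanines from percentile ranks using the standard 4-7-12-17-20-17-12-7-4 percentage bands; the band method is standard for stanines but is not spelled out in the lecture, and rounding 2z + 5 (bounded to 1-9) is the quick approximation from the formula above:

```python
# Sketch: stanine assignment from percentile ranks via cumulative % cutoffs.
import numpy as np

def stanine_from_percentile(pr):
    cutoffs = [4, 11, 23, 40, 60, 77, 89, 96]  # cumulative % band boundaries
    return int(np.searchsorted(cutoffs, pr, side="right")) + 1

for pr in [2, 15, 50, 85, 99]:
    print(pr, "->", stanine_from_percentile(pr))  # 1, 3, 5, 7, 9
```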