Lecture 4 Test Development 2: Item Analysis & Score Meaning (Catherine) Flashcards

A summary of Lecture 4: Item Analysis & Score Meaning, for revision purposes

1
Q

What are the stages of Test Development?

A
  1. Test Conceptualisation
  2. Test Construction
  3. Test Try Out
  4. Item Analysis
  5. Test Revision
  6. Repeat stages 4 & 5 as required
2
Q

What constitutes a ‘Good’ Item?

A
A Good Item has:
*Content Validity
*Criterion-Related Validity
*Test-Retest Reliability
*Internal Consistency Reliability
(just like a good test). It also needs to:
*differentiate between test-takers &
*have good discrimination (high scorers do well on the item; low scorers do poorly)
3
Q

What constitutes a ‘Bad’ Item?

A
A Bad Item lacks differentiation between test-takers & has poor discrimination
(low scorers score well, high scorers score poorly on the item)
The item requires revision or deletion
It also has poor:
*Content Validity
*Criterion-Related Validity
*Test-Retest Reliability
*Internal Consistency Reliability
(the same criteria as for a good test)
4
Q

How are good items identified?

A

*Good items are identified through the process of item analysis, which involves the analysis of individual items & overall test scores

5
Q

Why is item analysis important?

A

It allows a test developer to identify items that do not perform well, which can be revised or discarded, thus improving the reliability & validity of the test
*The test also needs to be brief to prevent test fatigue, so it is important to retain only the best items

6
Q

How does item analysis identify good items?

A

Item analysis is typically quantitative, although qualitative approaches such as expert panels & “think aloud” exercises are also possible

7
Q

What determines the type of item analysis implemented by the test developer?

A
  • The analysis will depend on the test developer’s objectives, i.e. the purpose of the test
  • Some test developers want to create ‘clusters’ or item scales that hang together, leading to an emphasis on internal consistency & construct validity, as the cluster is measuring the same phenomenon
  • Some test developers may want to create a test that predicts some criterion (e.g. job performance) & may not be so concerned with how well clusters of items hang together (emphasising criterion-related validity)
8
Q

What are the analytic tools that Test Developers use to analyse and select items, & which are the most important?

A
  • Item-difficulty index
  • Item-discrimination index
  • Item-validity index
  • Item-reliability index

Item Characteristics of particular interest are item-difficulty & item-discrimination

9
Q

What are the key ways of calculating the Item-difficulty index?

A
  • The proportion of test-takers who answered the item correctly (p)
  • p can range from 0 (no-one answers correctly) to 1 (everyone answers correctly)
  • Each item on a test has a corresponding p value (item 1 = p1; item 2 = p2)
  • This statistic is also referred to as the ‘item endorsement index’ in non-achievement tests (e.g. personality tests)
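A minimal sketch of the calculation in Python (the response matrix and values are illustrative, not from the lecture):

```python
import numpy as np

# Rows = test-takers, columns = items; 1 = correct, 0 = incorrect.
responses = np.array([
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 1],
    [0, 1, 1],
    [1, 1, 0],
    [0, 1, 1],
])

p = responses.mean(axis=0)  # item-difficulty index per item, e.g. [0.67 0.83 0.67]
print(p)
print(p.mean())             # difficulty index for the test as a whole (average of p values)
```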
10
Q

Give a few quick examples to emphasise how simple it is to calculate the item-difficulty index

A
  1. if an item is answered correctly by 50 out of 100 (50% of) people, p = 0.50
  2. if an item is answered correctly by 75 out of 100 (75% of) people, p = 0.75
  3. if an item is answered correctly by 95 out of 100 (95% of) people, p = 0.95
    Thus Item 1 is harder than item 2, and item 2 is harder than item 3
11
Q

What is the ‘Ideal’ level of item difficulty for a test as a whole and how is it calculated?

A
  • The optimal average item difficulty is 0.5 with individual items ranging from 0.3 to 0.8
    (0.3 = somewhat difficult to 0.8 = somewhat easy)
  • Items that no-one answers correctly (p = 0) or everyone answers correctly (p = 1) do not discriminate between test-takers
  • The Index of item difficulty for a test as a whole is calculated as the average of all the p-values for the test items
12
Q

When calculating the ‘Ideal’ level of item difficulty for a test as a whole what also needs to be taken into account when analysing items that use the selected-response format?

A

*The effect of guessing needs to be taken into account when analysing items that use the selected-response format
*Optimal item difficulty is the mid-point between 1 and the probability of guessing the answer correctly:
*True-false items (2 options):
the probability of guessing correctly = 0.5
therefore the optimal average item difficulty is p = (1 + 0.5) / 2 = 0.75
*Multiple-choice items (5 options):
the probability of guessing correctly = 0.2
therefore the optimal average item difficulty is p = (1 + 0.2) / 2 = 0.60
*Multiple-choice items (4 options):
the probability of guessing correctly = 0.25
therefore the optimal average item difficulty is p = (1 + 0.25) / 2 = 0.625 ≈ 0.63
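A couple of lines make the mid-point rule concrete (a sketch; the function name is illustrative):

```python
def optimal_difficulty(num_options: int) -> float:
    # Mid-point between 1 and the probability of guessing correctly.
    chance = 1 / num_options
    return (1 + chance) / 2

print(optimal_difficulty(2))  # true-false: 0.75
print(optimal_difficulty(5))  # 5-option multiple choice: 0.6
print(optimal_difficulty(4))  # 4-option multiple choice: 0.625
```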

13
Q

What is the Item-discrimination index?

A
  • The Item-discrimination index is the degree to which an item differentiates correctly on the behaviour the test is designed to measure
  • An item is a good item if:
  • Most of the high scorers on the test overall answer the item correctly
  • Most of the low scorers on the test overall answer the item incorrectly
  • An item is not doing its job if it is more likely to be answered correctly by test-takers who least understand the subject matter than by those who most understand it
14
Q

What are the key properties of the Item-discrimination index?

A
  • Symbolised by d
  • Compares performance on a particular item by the high ability group & the low ability group
    (i.e. the top 27% and the bottom 27%)
  • Items that discriminate well will have a high positive score (to a maximum of 1)
  • A negative d value is a red flag, as it means low-scoring test-takers are doing better on that item than high-scoring test-takers
15
Q

What are the key ways of calculating the Item-discrimination index?

A

The difference between:
*The no. of high scorers answering the item correctly (U) &
*The no. of low scorers answering the item correctly (L),
*Divided by the number of scorers in each group (n):
d = (U - L) / n
If the distribution is normal, the number of test-takers in the upper group will equal the number in the lower group; irrespective of the distribution, U and L are divided by the same group size n, so the result is comparable across items
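A sketch of d in Python, assuming 0/1 item scores and total test scores are available, and using the top/bottom 27% convention from the previous card (names are illustrative):

```python
import numpy as np

def item_discrimination(item_correct: np.ndarray, total_scores: np.ndarray) -> float:
    """d = (U - L) / n, comparing the top 27% of scorers with the bottom 27%."""
    n = max(1, round(0.27 * len(total_scores)))
    order = np.argsort(total_scores)    # test-takers sorted by total score, ascending
    U = item_correct[order[-n:]].sum()  # high scorers answering the item correctly
    L = item_correct[order[:n]].sum()   # low scorers answering the item correctly
    return (U - L) / n
```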

16
Q

What are the key points to remember about Item-Characteristic Curves (ICCs) associated with the Item-discrimination index?

A

*They are graphic representations of item difficulty & item discrimination
*ICCs have the following characteristics:
-Ability is plotted on the x-axis & the probability of a correct response on the y-axis
-The steeper the slope, the greater the discrimination between high & low scorers
-Item difficulty is reflected in the shift of the ICC along the x-axis:
the more difficult an item is, the more the ICC shifts to the right, as fewer people answer the item correctly
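The lecture does not give a functional form for an ICC, but a common choice is a logistic curve; in this sketch the parameter a controls the slope (discrimination) and b the rightward shift (difficulty):

```python
import numpy as np
import matplotlib.pyplot as plt

def icc(ability, a, b):
    """Probability of a correct response as a function of ability."""
    return 1 / (1 + np.exp(-a * (ability - b)))

ability = np.linspace(-4, 4, 200)  # ability on the x-axis
plt.plot(ability, icc(ability, a=2.0, b=0.0), label="steep slope: good discrimination")
plt.plot(ability, icc(ability, a=0.4, b=0.0), label="flat slope: poor discrimination")
plt.plot(ability, icc(ability, a=2.0, b=1.5), label="shifted right: harder item")
plt.xlabel("Ability")
plt.ylabel("Probability of correct response")
plt.legend()
plt.show()
```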

17
Q

Describe the Item-Characteristic Curves (ICCs) associated with the Item-discrimination index

A

A bad item will have:

  • a negative slope or
  • an inverted U (or n shaped) curve

A good item will have:

  • a positive slope
  • or an inverted L (ONLY if pass/fail criteria)
18
Q

Give examples of good and bad items on the Item-discrimination index

A
  1. U=20, L=16, U-L=4, n=32, d=0.13 → revise
  2. U=30, L=10, U-L=20, n=32, d=0.63 → keep
  3. U=32, L=0, U-L=32, n=32, d=1.00 → keep
  4. U=16, L=16, U-L=0, n=32, d=0.00 → revise
  5. U=0, L=32, U-L=-32, n=32, d=-1.00 → discard
19
Q

What are the key points in relation to the Item-Validity Index?

A
  • The Item-Validity index provides an indication of the degree to which a test measures what it purports to measure
  • This index is equal to the product of the item-score standard deviation (Si) & the correlation between the item score and the criterion score
  • The item-score standard deviation (Si) is calculated using the item’s item-difficulty index score (pi)
  • We can use information about the item difficulty to work out the item’s validity
  • If the item correlates with the criterion it is supposed to predict, e.g. job performance, then it has good criterion-related validity; if it does not, it is low in criterion-related validity and a candidate for removal
20
Q

What are the key points in relation to the Item-Reliability Index?

A
  • The Item-Reliability index is used to provide an indication of the internal consistency of a test, which helps us to determine whether all items consistently measure the same construct
  • The selection of items that are associated most strongly with the total test score helps to increase a test’s internal consistency reliability
  • There are 2 inter-related ways to approach this:
  • Calculate an item’s reliability index
  • calculate the internal consistency reliability of the test/scale (Cronbach’s alpha)
21
Q

How does one calculate an item’s reliability in relation to the Item-Reliability Index?

A
  • This index is equal to the product of the item-score standard deviation (Si) & the correlation between the item score and the total test score (riT): Si × riT
  • Again, the item-score standard deviation of an item (denoted by Si) is calculated using the item’s item difficulty index score (pi)
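A sketch covering both product formulas, this card’s item-reliability index (Si × riT) and card 19’s item-validity index, assuming dichotomous items so that Si = sqrt(pi(1 − pi)); the function and variable names are illustrative:

```python
import numpy as np

def item_indexes(item_scores, total_scores, criterion_scores):
    """Return (item-reliability index, item-validity index) for one 0/1 item."""
    p_i = item_scores.mean()                                  # item-difficulty index (p)
    s_i = np.sqrt(p_i * (1 - p_i))                            # item-score standard deviation (Si)
    r_iT = np.corrcoef(item_scores, total_scores)[0, 1]       # item-total correlation
    r_iC = np.corrcoef(item_scores, criterion_scores)[0, 1]   # item-criterion correlation
    return s_i * r_iT, s_i * r_iC
```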
22
Q

How does one calculate the test’s internal consistency reliability (AKA Reliability of Scales) in relation to the Item-Reliability Index?

A
  • We are primarily interested in the reliability of clusters of items (i.e. of a scale or a test as a whole)
  • measures of inter-item consistency are useful in examining whether a scale is homogeneous (all items are measuring the same thing)
  • For a dichotomous response scale (true-false) use the Kuder-Richardson formula 20
  • For a non-dichotomous response scale (Likert) use Cronbach’s Alpha
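A sketch of Cronbach’s Alpha; because the item variances of 0/1 items equal p × q, running it on dichotomous data gives exactly the Kuder-Richardson formula 20:

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """alpha = k/(k-1) * (1 - sum of item variances / variance of total scores)."""
    k = scores.shape[1]                   # number of items
    item_vars = scores.var(axis=0)        # variance of each item (p*q for 0/1 items)
    total_var = scores.sum(axis=1).var()  # variance of the total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)
```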
23
Q

What are the key points in relation to Factor Analysis and Inter-Item consistency?

A
  • Factor Analysis is a statistical tool useful in determining whether items on a test appear to be measuring the same thing
  • Through Factor Analysis, items that do not load on the factor they were written to tap can be revised or eliminated
  • If too many items appear to be tapping a particular area, the weakest of such items can be eliminated/removed
  • Factor Analysis is not the same as cluster analysis
24
Q

What are important issues when considering the meaning of test scores?

A
  • The types of tests used
  • Maximum-Performance tests (exams)
  • Typical-Performance tests (personality tests)
  • The intended use of the test
  • do we want to understand performance relative to some objective standard (Criterion-referenced test - driving licence)
  • do we want to understand performance relative to other people (Norm-referenced test - IQ test)

*How test results are expressed (which is dependent on the purpose of the test)

25
Q

What different types of tests are there?

A
  • Maximum-Performance tests (exams)
  • Test-takers do their best work, so their ability, either achieved or potential, is being tested
  • Achievement tests - measure what we have learned
  • Aptitude tests - measure what we are capable of
  • Used to gain admission to university or employment
  • Typical-Performance tests (personality tests)
  • Focus is on what test-takers actually do or what they are really like rather than what they are capable of
  • Exemplified by personality tests, which have no correct or incorrect answers
26
Q

How do we make meaning of test scores?

A

Depends on the intended use of the test:

  • do we want to understand performance relative to some objective standard
  • Criterion-referenced test - e.g. driving licence
  • do we want to understand performance relative to other people
  • Norm-referenced test - e.g. IQ test

The purpose of the test informs how test results are expressed

27
Q

How do we make meaning of test scores with Criterion-referenced tests (where we want to understand performance relative to some objective standard, e.g. a driving licence)?

A
  • Scores are expressed in terms of the absolute level of knowledge, skills, or ability achieved (i.e. mastery) NOT in comparison with other people
  • Mastery does not constitute “complete” knowledge, but rather a high level of knowledge of a given area, e.g. 80% / HD
  • Criterion Referenced testing is common in assessing achievement in educational settings
  • AKA: domain-, content- or objective-referenced tests
28
Q

What are some of the advantages & Disadvantages of Criterion -referenced tests?

A
  • We are able to compare test-takers with an “absolute standard” of perfection
  • Examples of test scores are: % correct and letter grades (A-E; HD-P)
  • Key advantage: these scores tell us about the test-taker’s knowledge of the content of a domain of interest
  • Key Disadvantage: There is no way of illustrating the scores in a generalised fashion
29
Q

How do we make meaning of test scores with Norm-referenced test (where we want to understand performance relative to other people)?

A
  • This involves comparing the test-taker with similar others (i.e. how you sit relative to your peers)
  • Uses Norms : a known distribution of scores for some specified reference group
  • Norms are critical to understanding the meaning of test scores
  • if norms are out of date, a test-taker’s scores cannot be interpreted meaningfully
  • Norm-referenced tests use derived scores, NOT raw scores
30
Q

Why do Norm-referenced tests use derived scores, NOT raw scores?

A

Derived scores make test interpretation easier:
*by ensuring scores from different tests are more comparable, expressing them in the same metric
*by helping us make more meaningful interpretations of test scores
NB:
*We must have accurate raw scores to get accurate derived scores
*No amount of statistical manipulation can make up for the use of a poor test or for mistakes in scoring
*If a raw score has been normed (e.g. converted to a z-score), we can interpret it in relation to the mean (M) and standard deviation (SD), and so compare our test with another test

31
Q

What has been proposed to remedy some of the problems with norm-referenced scoring?

A

Standardised scoring systems have been proposed to remedy some of the problems with norm-referenced scoring

32
Q

What are the 5 types of Standardised scoring within norm-referenced scoring?

A

*Linear Standard Scoring
*Rank within Groups
*Normalised Standard Scores
also
*Range of scores within a group
*Status of those obtaining the same score

These scores are relatively independent of content difficulty because they base a test-taker’s score on the performance of other people in the comparison group

33
Q

What are the main properties of Linear Standard Scoring?

A
  • These scores are based on the Standard Deviation
  • They have properties making them more valuable in research than most other derived scores
  • All Linear standard scores tell us the location of an examinee’s raw score in relation to the mean of some specified group & in terms of the group’s standard deviation
  • Examples of Linear Standard Scoring are:
  • z-scores
  • T-scores
  • Deviation IQ scores
34
Q

What are the properties making Linear Standard Scores more valuable in research than most other derived scores?

A
  • For every test & group, each type of score has the same mean and SD
  • These scores retain the shape of the raw distribution, changing only the calibration or metric
  • They permit inter-group and inter-test comparisons that are not possible with most other types of score
  • They can be treated mathematically (e.g. averaged) in ways that other scores cannot be
35
Q

What are the properties of z-scores in Norm-Referenced Linear Standard Scores?

A
  • Tell us the distance between a group’s mean and any specified raw score value
  • z-scores involve the transformation of raw scores into standard scores, which are then related to the normal distribution
  • If a z-score is found for each examinee in the group, the mean z-score will be 0 and the SD will be 1
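A quick sketch with illustrative raw scores:

```python
import numpy as np

raw = np.array([12, 15, 18, 21, 24])  # illustrative raw scores for one group
z = (raw - raw.mean()) / raw.std()    # z = (X - M) / SD
print(z)                              # [-1.41 -0.71  0.  0.71  1.41]: mean 0, SD 1
```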
36
Q

What are the Advantages & Disadvantages of z-scores in Norm-Referenced Linear Standard Scores?

A

Advantages:

  • They permit an understanding of an individual’s score in relation to other test-takers
  • They allow for comparison across different tests

Disadvantages:

  • Half of all z-scores are negative (which can be viewed as undesirable)
  • All z-scores are expressed to one or two decimal places

Other linear standard scores have been designed to eliminate the decimal point, use smaller units & eliminate negative values (e.g. T-Scores)

37
Q

What are the properties of T-scores in Norm-Referenced Linear Standard Scores?

A

*The T-Score is one of the most common linear standard scores
*The rationale is the same as for the z-score, except Mean = 50 & SD = 10
*The T-Score has many of the same advantages & disadvantages as the z-score
*It is less useful than the z score for certain research purposes, but is more convenient to interpret as there are no negative values
T = 10z + 50

38
Q

What are the properties of Deviation IQ scores in Norm-Referenced Linear Standard Scores?

A

Wechsler IQ Scores

  • These tests of intelligence yield 3 scores: a Verbal IQ, a Performance IQ & a Full Scale IQ score
  • have a Mean of 100 & SD of 15
  • Formula for this IQ score is: IQ = 15z + 100

Stanford-Binet IQ Scores

  • Revised from Ratio IQ to Deviation IQ in the 1960s
  • have a Mean of 100 & SD of 16
  • Formula for this IQ score is: IQ = 16z + 100
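The linear conversions from the last two cards, starting from an illustrative z-score:

```python
z = 1.2                  # illustrative z-score

T = 10 * z + 50          # T-score (M = 50, SD = 10)                  -> 62.0
wechsler = 15 * z + 100  # Wechsler deviation IQ (M = 100, SD = 15)   -> 118.0
binet = 16 * z + 100     # Stanford-Binet deviation IQ (M = 100, SD = 16) -> 119.2
```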
39
Q

What are the properties of “Rank Within Group” scores in Norm-Referenced Scores?

A
  • Unlike Linear Standard Scores, “rank within group” scores are based on the number of people with scores higher or lower than a specified score value (i.e. I came first)
  • Unlike Linear Standard Scores, “rank within group” scores must not be averaged
  • Examples of Rank within Group scores are:
  • Ranks
  • Percentile ranks
  • Decile ranks
  • ENTER score derived from VCE
40
Q

What are the Advantages & Disadvantages of “Rank Within Group” scores in Norm-Referenced Scores?

A

Advantage:
Some of these scores (especially normalised standard scores) have the effect of creating a distribution that is more nearly normal than the actual distribution of obtained raw scores (Lyman, 1998)

Disadvantage:
Information about a test-taker’s distance from the Mean is lost

41
Q

What are the properties of “Normalised Standard Scores” in Norm-Referenced Scores?

A

*Normalised Standard Scores are derived scores that are assigned standard-score-like values but are computed from percentile ranks
*With Linear Standard Scores (e.g. z-scores, T-scores, Deviation IQs) the shape of the distribution of raw scores is faithfully reproduced
*Normalised standard scores have the property of making a distribution that is a closer approximation of the normal probability distribution
*Examples are:
-T-Scaled scores (like T-Scores)
-Stanine scores
‘Standard score with nine units’
Mean = 5; SD = 2; formula: 2z + 5
-Sten Scores (‘standard score with ten units’: Mean = 5.5; SD = 2; formula: 2z + 5.5)