Item Response Theory (Item Analysis) Flashcards

1
Q

Classical Test Theory

A

➢ So far, we have focused on the main principles of Classical Test Theory (CTT)
❖ Observed scores are a function of true psychological differences + measurement error
❖ Aims to understand & improve the reliability of psychological tests
➢ CTT is one of the most common approaches used in psychometrics
❖ Used to build a good measurement instrument
➢ Generally, CTT focuses on the overall scale/ test score
❖ Alternate Forms & Test-Retest Reliability
o They look at consistency between two different test scores
❖ Convergent & Discriminant Validity
o Considers patterns of correlations between scores of different variables
➢ CTT does also consider item analysis to be important
❖ Internal Consistency of Items (Reliability)
❖ Factor Analysis/ Item Loadings (Validity)

2
Q

Classical Test Theory (Limitations)

A

➢ There are a few limitations with CTT
1. Only used to analyse the performance of one sample on one measurement
❖ If different groups take a test, then we cannot directly compare the psychometric properties
2. Estimates of reliability are heavily influenced by error of measurement
❖ CTT basically tries to estimate how much error impacts scores
3. CTT assumes that all items on a test make an equal contribution to the test scores
Example:
➢ If we conclude that a measure has 7 items which reflect a construct
➢ CTT assumes that all 7 items contribute equally to the overall test score

3
Q

Item Analysis

A

➢ An alternative way to understand tests is via item analysis
➢ Item analysis is the evaluation of individual items within the context of a test
❖ The effectiveness of items in tests
❖ Assesses the quality of items for the test as a whole
➢ Item analysis therefore focuses on item scores
and their relation to the scale they are assigned
➢ Item analysis is valuable for a number of reasons :
❖ Selecting the most appropriate items
❖ Rejecting inappropriate or misleading items
❖ Modifying the structure of the items
❖ Improving items which will be used again in later tests
➢ Classical item analysis includes two key principles:
❖ Item Difficulty
❖ Item Discrimination

4
Q

Item Difficulty

A

➢ Item Difficulty
❖ An item with high difficulty is less likely to be answered correctly/highly
❖ Compared to an item with relatively low difficulty
Example: Maths Test
❖ “What is 2+2?” – likely to be answered correctly/ highly
❖ “What is the square root of 10,000?” – harder difficulty, less likely to be answered correctly
Example: Psychological Attribute (e.g., extroversion)
❖ “I enjoy having conversations with friends”
❖ “I enjoy speaking in front of large audiences”
➢ Item difficulty estimates the level of ability/ attribute needed to “pass” an item
❖ Measured by calculating the proportion of individuals who answer an item correctly
o i.e., How many people score highly on an item (or correctly)

5
Q

Item Difficulty Index

A

➢ Item difficulty is indicated by an item difficulty index
❖ The proportion of times that an item was answered correctly (or highly)
➢ Often presented as a percentage of those who answered an item correctly
❖ Ranges from 0 to 100 (e.g., 70 = 70% of people answered the item correctly)
❖ The higher the value = the easier the question (more people got it correct/scored high)
➢ This percentage is often converted into a statistical value - Proportion Correct (P value)
❖ This P value is displayed as a number from 0 to 1
❖ When multiplied by 100, the P value converts back to the percentage
o P value = .45 would mean 45% answered correctly
➢ Item difficulty is relevant for determining who has high ability/ attribute levels
❖ Helps discriminate between those with high or low trait levels
o An item will have low discrimination if it is so difficult that almost everyone gets it wrong
o An item will have low discrimination if it is so easy that almost everyone gets it right
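The proportion-correct calculation described above can be sketched in a few lines of Python (a hypothetical example; the response data are made up):

```python
# Hypothetical sketch: item difficulty index (proportion correct, "P value").
# `responses` holds 1 for a correct (or high) answer and 0 otherwise.
responses = [1, 1, 0, 1, 0, 1, 1, 0, 1, 0]  # 10 test-takers, one item

p_value = sum(responses) / len(responses)   # proportion who answered correctly
percentage = p_value * 100                  # multiply by 100 for the percentage

print(p_value)     # 0.6
print(percentage)  # 60.0 -> 60% answered the item correctly
```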

6
Q

Item Discrimination

A

➢ Item Discrimination
❖ The degree an item can differentiate between high trait levels and low trait levels
o The extent to which high scores on an item/ question relate to high overall test scores
➢ A positive item discrimination = item is consistent with trait being measured
❖ The larger the discrimination value = the stronger the link between the item & trait
➢ A negative item discrimination = item is oppositely linked to the trait
❖ An item discrimination of 0 = item has no relation to the trait

7
Q

Item Discrimination Index

A

➢ Item discrimination is reflected by a discrimination index (D)
➢ A discrimination index indicates how discriminating items are
❖ How well an item distinguishes those who are knowledgeable and those who are not
❖ How well an item can differentiate between good candidates and less able ones
➢ Example: Comparing item scores between high and low scoring respondents
❖ Identify two groups of higher and lower performers on the overall test
❖ Subtract the number who got the item correct in the lower group from those in the higher group
❖ Then divide by the number of students in each group
➢ The discrimination index value for an item ranges from -1 to +1
❖ Closer to 1 = more discriminating the item is (better at identifying high and low scorers)
❖ The closer to 0 = less discriminating among high and low performers
❖ A discrimination index of 0.3 or greater is normally considered highly discriminating
➢ A negative discrimination value = issue with the item
❖ Those with high test scores are less likely to score high on the item
❖ Perhaps the item has not been reverse scored
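The high/low-group method above can be sketched as follows (hypothetical data; assumes two equal-sized groups):

```python
# Hypothetical sketch: discrimination index (D) via the high/low-group method.
# 1 = answered the item correctly, 0 = incorrectly, within each group.
upper_group = [1, 1, 1, 1, 0, 1, 1, 1, 1, 1]  # top scorers on the overall test
lower_group = [0, 1, 0, 0, 1, 0, 0, 1, 0, 0]  # bottom scorers on the overall test

n = len(upper_group)  # number of students in each group
D = (sum(upper_group) - sum(lower_group)) / n

print(D)  # 0.6 -> above 0.3, so the item discriminates well
```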

8
Q

Item-Total Correlations

A

➢ Tests for internal consistency can also indicate item discrimination
❖ When testing Cronbach's alpha
➢ These are known as corrected item-total correlations
❖ Item-total correlations above 0.30 suggests the item discriminates well
➢ A low or negative item-total correlation may indicate a problematic item

9
Q

Item Response Theory (IRT)

A

➢ Item Response Theory (IRT) is an alternative psychometric approach to CTT
❖ IRT is an extension of item analysis (places the emphasis on the items)
➢ IRT assumes that an individual's response to a particular item is influenced by:
❖ Qualities of the individual (i.e., test-taker)
❖ Qualities of the item(s)
➢ IRT takes into account that some items may be more difficult than others
❖ The probability of success on items is due both:
o Individual ability (or individual traits)
o The item difficulty
➢ IRT produces information that is often considered superior to CTT as it relates to:
❖ Individuals
❖ Items
❖ Overall Test

10
Q

Item Response Theory (Benefits)

A

➢ IRT is considered more sophisticated but also complex
❖ Mostly employed in cognitive assessment
o Also known as latent trait theory
➢ Involves mathematical models
❖ Tests relationships between latent traits (unobservable) & their observed outcomes
➢ Provides maximum amount of information about items
❖ Helps indicate what respondents know and what they can answer (i.e., different ability levels)
o May indicate that one person has a certain level of knowledge,
but another person exceeds that knowledge level
➢ Identifies the best items to assess abilities which can be used to set benchmarks
❖ Identifies criteria that individuals need to reach a next level (of performance)
➢ Can identify which items have the correct level of difficulty for a sample
❖ Items that don’t discriminate individual ability either revised or removed
❖ Produces a better overall measure and reduces measurement error

11
Q

Individual (Trait) Characteristic

A

➢ An aspect that impacts an individual’s ability to answer an item correctly/ highly
❖ Their level of the psychological trait being measured by the item

12
Q

Item Characteristics: Item Difficulty

A

➢ Item Difficulty
❖ An item with high difficulty is less likely to be answered correctly/highly
Example: Maths Test
1. “What is 2+2?” – more likely to be answered correctly
2. “What is the square root of 10,000?” – harder difficulty = less likely to be answered correctly
➢ Question 2 has harder difficulty
❖ Requires much higher ability/ trait to be able to answer correctly
➢ Thus, item difficulty is connected to the individual’s trait level characteristic
❖ Some items require greater ability
➢ Trait levels & item difficulties are normally standardised in IRT
❖ The mean value = 0 (positive scores = above average; negative scores = below average)
o Item difficulty = 0 - a person with an average trait level has a 50% chance of being correct

13
Q

Item Characteristics: Item Discrimination

A

➢ Item Discrimination
❖ The degree an item can differentiate between high trait levels & low trait levels
o The extent high scores on an item relate to high scores on the overall test
➢ How well an item helps identify those who are knowledgeable/ high in an attribute
❖ & those that are not
➢ A positive item discrimination value = item is consistent with trait being measured
❖ Larger discrimination value = stronger link between the item and attribute
o Looking for values above .30
➢ A discrimination value of 0 = item has no relation to the trait
➢ A negative item discrimination = item is oppositely linked to the trait
❖ Those scoring high on the item are actually likely to score low on the test
o May be a problematic item

14
Q

Item Characteristics: Guessing

A

➢ Guessing may also occur on some tests
❖ Respondents answer some questions correctly/ highly simply by chance
➢ Guessing is particularly prominent on:
❖ Multiple Choice Quizzes (MCQ)
❖ True/ False Questions
➢ IRT includes a guessing component (only in some models)
❖ The probability that items may be answered correctly purely by chance
➢ The guessing component is influenced by how many answers are possible:
❖ On true/ false questions = 50% chance that someone could guess correctly
❖ Multiple choice with 4 answer options = 25% chance someone could guess correctly
➢ Guessing more applicable to tests of knowledge, skills, ability, & achievement
❖ Rather than measures of psychological attributes
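The chance-level probabilities above follow directly from the number of response options; a minimal sketch:

```python
# Minimal sketch: chance-level guessing probability by answer format.
def guess_probability(n_options: int) -> float:
    """Probability of answering correctly purely by chance."""
    return 1 / n_options

print(guess_probability(2))  # 0.5  -> true/false item
print(guess_probability(4))  # 0.25 -> 4-option multiple-choice item
```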

15
Q

Mathematical Models

A

➢ IRT proposes various mathematical models to identify:
❖ The probability that a person with a particular trait level will respond to an item
o Accounts for the components that affect a person responding to an item
➢ Used to create scores, evaluate items, & design tests
❖ Mathematical links between characteristics of items, people, and response likelihood
➢ These models differ in two ways:
1. The number of item characteristics (parameters) they account for
❖ Item Difficulty
❖ Item Difficulty + Item Discrimination
❖ Item Difficulty + Item Discrimination + Guessing
2. The response format they are suitable for
❖ Some models only useful for dichotomous/ binary responses (yes/no; true/ false)
❖ Other models can be used for more than two answer options (Likert scales)

16
Q

Types Of Models

A

➢ There are 4 specific models proposed within IRT
1. One-Parameter Model (only for binary responses [two options])
❖ Individuals’ trait level characteristic
❖ One item characteristic – item difficulty
2. Two-Parameter Model (only for binary responses [two options])
❖ Individuals’ trait level characteristic
❖ Two item characteristics – item difficulty & item discrimination
3. Three-Parameter Model (only for binary or limited responses options)
❖ Individuals’ trait level characteristic
❖ Three item characteristics – item difficulty, item discrimination, & guessing
4. Graded Model (only for more than two answer options [Likert scales])
❖ Individuals’ trait level characteristic
❖ Two item characteristics – item difficulty & item discrimination
❖ Probability of specific responses

17
Q
  1. One-Parameter Model
A

➢ This is the simplest of the IRT models (the Rasch Model)
❖ Only suitable for responses to binary/ dichotomous items (i.e. true or false)
➢ Assumes an individual's response to a dichotomous item is determined by:
❖ Individuals’ trait level characteristics
❖ One parameter (item characteristic) = item difficulty
➢ Probability a person with a certain trait level
will answer an item of a certain difficulty correctly (highly)
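A minimal sketch of the one-parameter (Rasch) model's logistic formula, assuming a standardised trait level (theta) and item difficulty (b), both with mean 0:

```python
import math

# Minimal sketch: one-parameter (Rasch) model.
# theta = person trait level, b = item difficulty (both standardised, mean 0).
def rasch_probability(theta: float, b: float) -> float:
    """Probability of a correct response given trait level and item difficulty."""
    return 1 / (1 + math.exp(-(theta - b)))

print(rasch_probability(0.0, 0.0))            # 0.5 -> average trait, average difficulty
print(round(rasch_probability(1.5, 0.0), 2))  # 0.82 -> higher trait, higher probability
```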

18
Q

Item Characteristic Curve (ICC)

A

➢ Item Characteristic Curves (ICC) help graph probabilities to trait levels
➢ Reflects the probabilities that different trait levels will answer items correctly
❖ Each item has its own curve
❖ X-axis reflects different average trait levels
❖ Y-axis reflects the probability they will get it correct
➢ 0.0 trait level = the average trait ability
❖ How a person with average trait levels will answer items
➢ Item 1 is the easiest here
❖ An average trait level has an 85% chance of being correct
➢ Item 5 is the most difficult
❖ An average trait level has a 17% chance of being correct
➢ Items further right and lower in trajectory are generally harder

19
Q
  2. Two-Parameter Model
A

➢ This model is more complex & still only suitable for binary responses
➢ Assumes an individual's response to a dichotomous item is determined by:
❖ Individuals’ trait level characteristics
❖ Two parameters (item characteristic) = item difficulty & item discrimination
➢ Probability a person with a certain trait level
will answer an item of a certain difficulty & discrimination correctly (highly)
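The two-parameter model adds a discrimination (slope) parameter, a; a hypothetical sketch:

```python
import math

# Hypothetical sketch: two-parameter model.
# a = item discrimination (slope), b = item difficulty.
def two_pl_probability(theta: float, a: float, b: float) -> float:
    return 1 / (1 + math.exp(-a * (theta - b)))

# A steeper slope (higher a) separates trait levels more sharply:
print(round(two_pl_probability(0.5, 0.5, 0.0), 2))  # 0.56 -> shallow, low-discrimination item
print(round(two_pl_probability(0.5, 2.0, 0.0), 2))  # 0.73 -> steep, high-discrimination item
```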

20
Q

Item Characteristic Curve
(Two Parameters)

A

➢ The discrimination parameter is allowed to vary between items
❖ The curves of different items can intersect and have different slopes
➢ The steeper the slope = the better discrimination the item has
❖ Detects subtle differences in the ability of the respondents
➢ The solid black curve (Item 1)
❖ Is the easiest item
❖ But it doesn't discriminate well (shallow curve)
o Relatively low and high trait levels can get it correct
➢ The steeper dotted curve (Item 4)
❖ Is a harder item
❖ But it discriminates well (steeper curve)
o Lower trait levels unlikely to get it correct
o Higher trait levels much more likely to get it correct

21
Q
  3. Three-Parameter Model
A

➢ This model now incorporates a guessing component
❖ Only suitable for binary or limited response items
o e.g. Multiple choice options with only one correct answer
➢ Assumes an individual's response to a dichotomous item is determined by:
❖ Individuals’ trait level characteristics
❖ Three parameters = item difficulty, item discrimination, & guessing
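A hypothetical sketch of the three-parameter formula, where the guessing parameter c sets a floor on the probability of a correct response:

```python
import math

# Hypothetical sketch: three-parameter model with a guessing floor.
# c = guessing parameter (e.g., 0.25 for a 4-option multiple-choice item).
def three_pl_probability(theta: float, a: float, b: float, c: float) -> float:
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

# Even a very low trait level cannot fall below the guessing probability:
print(round(three_pl_probability(-3.0, 1.0, 0.0, 0.25), 2))  # 0.29 -> near the 0.25 floor
print(round(three_pl_probability(3.0, 1.0, 0.0, 0.25), 2))   # 0.96 -> high trait level
```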

22
Q
  4. Graded Model
A

➢ Graded (response) models can handle tests with two or more response options
❖ This includes the specific order or ranking of responses
➢ Example: Responding to the question “I like having conversations with friends”
❖ 5 response options (Strongly Disagree – Strongly Agree)
➢ Indicate the probability a person with average trait levels will select each response
1. Disagree or higher (.98 [98%])
2. Neutral or higher (.79 [79%])
3. Agree or higher (.26 [26%])
4. Just Strongly Agree (.03 [3%])
➢ This will be modified depending on each item’s difficulty and discrimination
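As a sketch, the graded response model computes a cumulative probability for reaching each response category "or higher"; the parameter values below are made up, chosen only to roughly reproduce the probabilities in the example above:

```python
import math

# Hypothetical sketch: graded response model for a 5-point Likert item.
# Each threshold b_k is the difficulty of reaching that category "or higher".
def cumulative_p(theta: float, a: float, b_k: float) -> float:
    return 1 / (1 + math.exp(-a * (theta - b_k)))

a = 1.5                               # item discrimination (made up)
thresholds = [-2.5, -0.9, 0.7, 2.3]   # Disagree+, Neutral+, Agree+, Strongly Agree
theta = 0.0                           # average trait level

cum = [cumulative_p(theta, a, b) for b in thresholds]
print([round(p, 2) for p in cum])     # [0.98, 0.79, 0.26, 0.03]

# Probability of selecting exactly "Agree" = P(Agree or higher) - P(Strongly Agree)
print(round(cum[2] - cum[3], 2))      # 0.23
```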

23
Q

Model Fit Indices

A

➢ IRT models provide various psychometric information relating to:
❖ Different item scores
❖ Overall test scores
➢ However, we need confidence that a model fully represents the actual responses
Example: Imagine an individual answered 10 questions in the following pattern
❖ Answered the 4 easy questions correctly
❖ Answered the 4 moderate questions incorrectly
❖ Answered the 2 difficult questions correctly
o Answering the easy but not the moderate questions suggests low trait/ ability levels
o But answering the difficult questions suggests high trait/ ability levels
➢ This response pattern does not really fit with an IRT model and interpretation
❖ Poor model fit = we need to be cautious with the findings
❖ Good model fit = we can proceed with our interpretation

24
Q

Value Of IRT

A

➢ IRT models can be applied in many settings to develop tests
❖ Education (very useful for academic ability tests)
❖ Cognitive/ Forensic Psychology
❖ Health outcomes/ research
❖ Business tools
➢ IRT can be used to design and modify scales/measures
❖ Identify items with high discrimination & add precision to the measurement
❖ Help remove problematic items & reduce overly long questionnaires/ tests
➢ IRT can also compare items from different measures (if measuring the same construct)
❖ Classical Test Theory cannot do this

25
Q

Applications Of IRT

A
  1. Test Development & Improvement
    ❖ Identify the most important & effective items to discriminate individuals
    ❖ Remove items that are too easy, difficult, or ineffective
    ❖ Improve the overall test accuracy
    o Fraley et al. (2000) used IRT to examine 4 inventories to improve test quality
  2. Differential item functioning
❖ Evaluate whether items function (behave) differently across respondents
    ❖ Does the difficulty level of certain items differ between groups (i.e., age groups)
  3. Person Fit
    ❖ Can help identify individuals with unexpected response patterns
    o Can identify patterns of cheating, random responding, or scoring errors
  4. Computerised Adaptive Testing
    ❖ Help provide accurate & efficient assessments for computer-based tests
    o Procedures of IRT can be embedded with computerised tests