L4: Modern test theory Flashcards
what are the pros of classical test theory?
- intuitive & easy to apply:
- it's in SPSS & easy to do in Excel
- no large sample sizes/ many items needed
- can easily calculate reliability from it
what are the cons of classical test theory?
- focus on the test, not the items
- test properties depend on the pop. (ex: reliability & difficulty of a test)
- person properties depend on the test (ex: sum score is higher if the test is easy and lower if the test is difficult)
what does modern test theory do?
it specifies a measurement model in which we mathematically link the item scores to the construct aka latent variable / trait / factor
what are the assumptions of modern test theory?
unidimensionality: you only measure 1 construct
what is item response theory?
specific theory within modern test theory about how the construct measured is related to its items using 1 of 4 models that all have an Item Characteristic Curve
what are the 4 item response theory models?
- 1 parameter logistic (1PL) / Rasch model: the item characteristic curve shows the probability of getting the item correct in relation to the level of the construct you have, using item difficulty (beta)
- 2 parameter logistic (2PL) model: uses item discrimination (alpha, steepness of the curve) as well as item difficulty
- 3 parameter logistic (3PL) model: uses a guessing value (Ci) as well as item discrimination (alpha, steepness of the curve) and item difficulty
- graded response model (GRM) for Likert scales
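A minimal Python sketch of the three logistic models listed above, to make the role of each parameter concrete. The function names are made up for illustration, and fixing the discrimination at 1 in the 1PL is one common convention, assumed here.

```python
import math

def icc_3pl(theta, b, a=1.0, c=0.0):
    """Probability of a correct response under the 3PL model.

    theta: trait level of the person
    b: item difficulty (beta)
    a: item discrimination (alpha, steepness of the curve)
    c: guessing value (lower asymptote of the curve)
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def icc_2pl(theta, b, a):
    # 2PL: difficulty and discrimination, no guessing
    return icc_3pl(theta, b, a=a, c=0.0)

def icc_1pl(theta, b):
    # 1PL: difficulty only (discrimination fixed at 1 by assumption)
    return icc_3pl(theta, b, a=1.0, c=0.0)

# Trace the three curves for one illustrative item with difficulty 0
for theta in (-2, -1, 0, 1, 2):
    print(theta,
          round(icc_1pl(theta, b=0.0), 3),
          round(icc_2pl(theta, b=0.0, a=2.0), 3),
          round(icc_3pl(theta, b=0.0, a=2.0, c=0.25), 3))
```

Without guessing, the probability is exactly 50% when the trait level equals the difficulty; with a guessing value c it is c + (1 - c) / 2 at that point.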
define item characteristic curve
graphical representation that shows prob of answering an item correctly across different levels of the trait
what can you do with the 4 models of item response theory?
- scale analysis (judge from the graph how good the items are: difficulty (medium), discrimination (high), guessing etc.) & test construction & improvement (get item info from the item characteristic function) -> select items that provide most info
- person fit: identifies ppl whose response patterns do not fit the expected model, which could indicate guessing or inconsistent responding
- study item fairness (differential item functioning, DIF)
- computerized adaptive testing CAT
what is the item information function?
a function that shows at what point on the item characteristic curve (at which trait level) we get the most info
all the item info functions together make up the scale info function
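A small Python sketch of an item information function, assuming a 2PL item, where the information is a^2 * P * (1 - P). The parameter values are illustrative only.

```python
import math

def p_2pl(theta, a, b):
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Item information for a 2PL item: a^2 * P * (1 - P)."""
    p = p_2pl(theta, a, b)
    return a * a * p * (1.0 - p)

# Illustrative item: discrimination 1.5, difficulty 0.65
a, b = 1.5, 0.65

# Search a grid of trait levels for the point of maximum information
grid = [x / 100 for x in range(-400, 401)]
peak_theta = max(grid, key=lambda t: item_information(t, a, b))
print(peak_theta, item_information(peak_theta, a, b))
```

In this sketch the peak sits at the trait level equal to the item difficulty, and a higher discrimination makes the peak taller.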
what is the scale information function? what is it useful for?
it shows the total info you get from all item info functions added together
useful when making a
- norm referenced test (when u compare a score to the population, ex: intelligence test), cause then you need to make sure that your scale info covers the whole construct scale of the population, so the function should be roughly flat across the trait range
- criterion referenced test (when u determine if someone passes a certain cut off, ex: exam), cause then u need the info to be concentrated around the cut off
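The two test types can be compared in a short Python sketch, assuming 2PL items and treating scale information as the sum of the item information functions. The item parameters and the cut off of 1.0 are made up for illustration.

```python
import math

def item_information(theta, a, b):
    """2PL item information: a^2 * P * (1 - P)."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)

def scale_information(theta, items):
    """Scale information = sum of the item information functions."""
    return sum(item_information(theta, a, b) for a, b in items)

# Hypothetical item sets as (discrimination, difficulty) pairs
spread_items = [(1.5, b) for b in (-2.0, -1.0, 0.0, 1.0, 2.0)]  # norm referenced
cutoff_items = [(1.5, b) for b in (0.8, 0.9, 1.0, 1.1, 1.2)]    # cut off near 1.0

for theta in (-2.0, 0.0, 1.0, 2.0):
    print(theta,
          round(scale_information(theta, spread_items), 2),
          round(scale_information(theta, cutoff_items), 2))
```

The spread-out set gives some information everywhere on the trait scale, while the clustered set gives more information at the cut off and very little far below it.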
how can you study item fairness with item response theory?
see if there's differential item functioning (item works differently for different groups; normally ppl from different groups with the same trait level should have the same expected item score)
like a specific question in a stat exam that references the Dutch education system is harder for non-Dutch ppl to answer
mean difference in trait doesn't matter cause u consider 2 ppl of the same level on the trait and see how likely they are to respond correctly
- guessing is not commonly considered as a source of DIF
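A toy Python illustration of what DIF looks like, assuming a 2PL item whose parameters were estimated separately in two groups. The group names and numbers are hypothetical; a real DIF analysis would estimate them from data and test the difference statistically.

```python
import math

def p_correct(theta, a, b):
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Hypothetical (discrimination, difficulty) for the same item in two groups:
# the item is harder for group_B
item_params = {"group_A": (1.2, 0.0), "group_B": (1.2, 0.8)}

theta = 0.5  # two people with the SAME trait level, one from each group
probs = {group: p_correct(theta, a, b) for group, (a, b) in item_params.items()}
gap = probs["group_A"] - probs["group_B"]

print(probs, round(gap, 3))
```

Because both people have the same trait level, any gap in the probability of a correct response comes from the item, not from a difference in the trait.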
what is computerized adaptive testing?
it's an algorithm that calculates what items are most relevant to you (where prob of correct response is 50%) from an item bank, aka creates a test that's adapted to someone's ability on the latent variable
(cause often there's many items that are either way too easy or way too hard, so doing them doesn't give you any extra info)
correct? -> more difficult item
incorrect? -> easier item
continue until the estimate for the latent trait doesn't change anymore
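The loop above can be sketched in Python. This is a deliberately simplified version: the trait estimate is moved up or down by a shrinking step, whereas real CAT systems typically re-estimate the trait with maximum likelihood or Bayesian methods. The item bank, the simulated respondent, and all names are hypothetical.

```python
def run_cat(item_bank, answer, start=0.0, step=1.0, tol=0.05, max_items=30):
    """Simplified adaptive test.

    item_bank: list of item difficulties
    answer: callable taking a difficulty and returning True if answered correctly
    Stops when the change in the estimate (step) drops below tol.
    """
    estimate = start
    remaining = list(item_bank)
    administered = []
    while remaining and len(administered) < max_items and step >= tol:
        # Most relevant item: difficulty closest to the current estimate,
        # i.e. where the probability of a correct response is about 50%
        b = min(remaining, key=lambda d: abs(d - estimate))
        remaining.remove(b)
        correct = answer(b)
        administered.append((b, correct))
        # correct -> move towards more difficult items, incorrect -> easier
        estimate += step if correct else -step
        step *= 0.75  # smaller adjustments as the estimate settles
    return estimate, administered

# Hypothetical bank: 61 items with difficulties from -3.0 to 3.0
bank = [round(-3.0 + 0.1 * i, 1) for i in range(61)]

def simulated_respondent(b):
    # Toy respondent who answers every item easier than 1.0 correctly
    return b < 1.0

estimate, administered = run_cat(bank, simulated_respondent)
print(round(estimate, 2), len(administered))
```

With these settings the sketch stops after a small share of the bank and lands close to the level where the simulated respondent starts failing items.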
what are the pros of item response theory?
- population (latent trait) and test (item difficulty, discrimination) are independent
- focus on individual items
what are the cons of item response theory?
- statistically complex
- needs large samples
compare classical test theory to modern/item response theory
Classical test theory
* Sum score is taken as the construct score
* Focus on test
* Reliability of a test
Item Response Theory
* Score on latent variable is construct score
* Focus on items
* Difficulty / discrimination
* Item/Scale information
* Computerized Adaptive Testing (CAT)
what are the 4 key factors affecting responses to test items?
- Respondent Trait Level: ex: ppl w higher levels of the trait more likely to answer difficult items correctly.
- Item Difficulty: how challenging an item is. In IRT, difficulty refers to the point on the trait continuum at which an individual has a 50% chance of answering the item correctly.
- Item Discrimination: how well an item distinguishes between individuals with different levels of the trait.
- Guessing: particularly important in multiple-choice tests, where respondents might guess the correct answer without knowing it
In an IRT analysis the item difficulty of item 9 is found to be 0.78. What can you conclude?
People with a trait level of 0.78 have a 50% chance of answering item 9 correctly
What are differences between the GRM on the one hand and the 1PL, 2PL and 3PL on the other hand?
- The GRM is designed for polytomous items, the 1PL, 2PL and 3PL model for dichotomous items
- The GRM estimates multiple difficulty parameters per item, the 1PL, 2PL and 3PL model only one
If you have 16 items that are scored on a 5-point Likert scale, how many difficulty parameters does a GRM include for each item? Give a round number
4
In a 2PL with 15 items, how many parameters are there in total? Give a round number
30
In a 1PL with 20 items, how many parameters are there in total? Give a round number
20
In a 3PL with 25 items, how many parameters are there in total? Give a round number
75
In a GRM with 10 items that use a 5-point Likert scale, how many parameters are there in total? Give a round number
50
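The counts in the questions above follow one rule per model, sketched here in Python. The function name is made up, and the 1PL is counted as difficulty only, as in these cards; some parameterizations add one shared discrimination parameter.

```python
def n_parameters(model, n_items, n_categories=2):
    """Total number of item parameters for a given IRT model."""
    if model == "1PL":
        per_item = 1                       # difficulty
    elif model == "2PL":
        per_item = 2                       # difficulty + discrimination
    elif model == "3PL":
        per_item = 3                       # difficulty + discrimination + guessing
    elif model == "GRM":
        # one difficulty (threshold) per step between categories,
        # plus one discrimination
        per_item = (n_categories - 1) + 1
    else:
        raise ValueError(f"unknown model: {model}")
    return per_item * n_items

print(n_parameters("2PL", 15))                  # 30
print(n_parameters("1PL", 20))                  # 20
print(n_parameters("3PL", 25))                  # 75
print(n_parameters("GRM", 10, n_categories=5))  # 50
```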
What is the difference between the ICCs of the GRM and the ICCs of the 1PL model?
- The ICCs of the 1PL model are based on the difficulty parameters, while the ICCs of the GRM are based on difficulty and discrimination parameters
- In the 1PL model there is only one curve per item, in the GRM there are multiple curves per item
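A Python sketch of why the GRM has multiple curves per item: each threshold gets its own cumulative logistic curve, and the probability of each response category is the difference between neighbouring curves. The parameter values are illustrative, not taken from the cards.

```python
import math

def grm_category_probs(theta, a, thresholds):
    """Category probabilities for one GRM item.

    a: item discrimination
    thresholds: ordered difficulty parameters, one fewer than the
                number of response categories
    """
    # Probability of responding in category k or higher, for each threshold
    at_or_above = [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds]
    cumulative = [1.0] + at_or_above + [0.0]
    # Probability of exactly category k = difference of neighbouring curves
    return [cumulative[k] - cumulative[k + 1] for k in range(len(thresholds) + 1)]

# 5-point Likert item: 4 difficulty parameters + 1 discrimination
probs = grm_category_probs(theta=0.0, a=1.5, thresholds=[-1.5, -0.5, 0.5, 1.5])
print([round(p, 3) for p in probs])
```

For a 5-point item this gives 5 category probabilities that sum to 1 at every trait level.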
What is the problem with a test that only has easy items, with regard to test information?
The test provides little information at high trait levels, since the items do not discriminate well among people with high trait levels
In a 2PL, what can you conclude about the item information if an item has a difficulty of 0.65?
At a trait level of 0.65, the item provides the most information
For a norm referenced test, which of the following statements about the applications of IRT are correct?
1. Creating tests with good psychometric quality involves selecting items with average difficulty and high discrimination
2. Creating tests with good psychometric quality involves checking if there are items that function differently in different groups of people
only 2 is correct
what are the steps of CAT?
- item properties are obtained for a large number of items
- respondents are presented w items w average difficulty
- respondents are presented w items adapted to their trait level
- the computer makes a solid estimation of the respondent's trait level
What is the main benefit of using a CAT instead of a conventional test?
a CAT requires administration of fewer items