Lecture 3: IRT Parameter estimation and model fit Flashcards
Why is model fit so important? (2)
Because we can check whether a model fits the data and, if not, try another model. This ability to test model fit is a key reason we focus on latent variable models rather than CTT.
Also, because otherwise you would be drawing conclusions from a model that is incorrect; once you decide to use a model, you should establish that it fits.
What is the first thing you should look at when establishing model fit?
How many dimensions there are; if the data have more than one dimension, it is useless to use unidimensional IRT models (which assume one latent variable).
Name three other things to check when establishing model fit
- Equal item discrimination
- Absence of guessing (1-/2- vs. 3-parameter model)
- Model predictions: apply the model to the data and see what it predicts for those data. If model and data agree, keep the model; if not, look for a different model (see the sketch below)
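A minimal sketch of how such a comparison could be done in R with the ltm package (assuming data is a data frame of dichotomous item scores; the object names are made up):
library(ltm)
fit1pl <- rasch(data)        # one common discrimination parameter
fit2pl <- ltm(data ~ z1)     # item-specific discrimination parameters
fit3pl <- tpm(data)          # adds a guessing parameter per item
anova(fit1pl, fit2pl)        # likelihood-ratio test: is equal discrimination tenable?
anova(fit2pl, fit3pl)        # likelihood-ratio test: is absence of guessing tenable?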
In LVM you cannot calculate things like slopes and regression weights directly from the data, as in CTT. What do you calculate instead?
Maximum likelihood: you evaluate for which values of your model parameters the observed data are most likely to occur. That is, which values of the parameters a1, a2, b1, b2, θ1, θ2, etc. maximise the likelihood of the data? In other words, under which parameter values would the observed data be most probable? It is one of the most important estimation methods.
What equations are used to calculate the likelihood for dichotomous IRT models?
P_i = P(X_pi = 1 | θ_p) = exp(a_i(θ_p − b_i)) / (1 + exp(a_i(θ_p − b_i)))
Q_i = P(X_pi = 0 | θ_p) = 1 / (1 + exp(a_i(θ_p − b_i)))
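A small sketch of these two probabilities in R (the item parameters and θ value below are made up for illustration):
# 2PL probability of a correct (P) and an incorrect (Q) response
p_correct   <- function(theta, a, b) exp(a * (theta - b)) / (1 + exp(a * (theta - b)))
p_incorrect <- function(theta, a, b) 1 / (1 + exp(a * (theta - b)))
p_correct(theta = 0.5, a = 1.2, b = -0.3)    # about 0.72
p_incorrect(theta = 0.5, a = 1.2, b = -0.3)  # 1 minus the value above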
What two assumptions does maximum likelihood estimation rely on?
Uni-dimensionality: All items assess one construct
Local independence: conditional on θ, all of the items are independent; in other words, θ accounts for all of the correlations among the items.
How do you calculate maximum likelihood?
E.g. if a subject has the response pattern 1, 0, 0, 1, we can calculate the probability of that pattern as:
P(X_p1 = 1, X_p2 = 0, X_p3 = 0, X_p4 = 1 | θ_p)
= P_1 × Q_2 × Q_3 × P_4
= [exp(a_1(θ_p − b_1)) / (1 + exp(a_1(θ_p − b_1)))] × [1 / (1 + exp(a_2(θ_p − b_2)))] × [1 / (1 + exp(a_3(θ_p − b_3)))] × [exp(a_4(θ_p − b_4)) / (1 + exp(a_4(θ_p − b_4)))]
Which can be written as:
∏_{i=1}^{4} [ exp(X_pi · a_i(θ_p − b_i)) / (1 + exp(a_i(θ_p − b_i))) ]
By applying this to the whole data set (multiplying the probabilities over all persons) and varying the parameter values, you obtain the likelihood of the data.
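A sketch of this calculation in R for the response pattern 1, 0, 0, 1 (all parameter values below are made up):
# 2PL probability of the observed response x (1 or 0) on one item
p_item <- function(x, theta, a, b) exp(x * a * (theta - b)) / (1 + exp(a * (theta - b)))
x     <- c(1, 0, 0, 1)           # observed response pattern
a     <- c(1.0, 1.5, 0.8, 1.2)   # hypothetical discrimination parameters
b     <- c(-0.5, 0.0, 0.5, 1.0)  # hypothetical difficulty parameters
theta <- 0.3                     # hypothetical latent trait value
prod(p_item(x, theta, a, b))     # likelihood of the whole pattern (local independence)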
What does the addition of Xpi in the maximum likelihood notation do?
∏_{i=1}^{4} [ exp(X_pi · a_i(θ_p − b_i)) / (1 + exp(a_i(θ_p − b_i))) ]
The exponent X_pi means that the term changes depending on whether the response is a 1 or a 0. If X_pi = 1, it becomes the probability of a correct response:
exp(a_i(θ_p − b_i)) / (1 + exp(a_i(θ_p − b_i)))
If X_pi = 0, the numerator becomes exp(0) = 1, so the term becomes the probability of an incorrect response:
1 / (1 + exp(a_i(θ_p − b_i)))
How does the assumption of local independence affect this equation?
Due to the assumption of local independence we can simply multiply the individual item probabilities. Without this assumption the equation would be very long and hard to deal with, because we would have to account for every association between the items.
Describe the graph of the maximum likelihood function with one difficulty parameter. What can be derived from this?
Difficulty parameter on the x-axis, likelihood on the y-axis. The curve is typically skewed, rising with a gradual slope to a single peak. The highest point (the maximum) of this curve marks the parameter value with the maximum likelihood, i.e. the ML estimate.
What values do the likelihood values on the y-axis take?
They are extremely small, because they represent the probability of observing exactly the data that were observed. For this reason we focus on the log-likelihood, which gives more manageable numbers; this is a very important concept.
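A minimal sketch of such a curve in R, for one Rasch item (a = 1) with hypothetical known abilities and responses: the log-likelihood is evaluated over a grid of difficulty values, and the maximum marks the ML estimate.
theta <- c(-1, -0.5, 0, 0.5, 1)    # hypothetical ability values
x     <- c(0, 0, 1, 1, 1)          # hypothetical responses to the item
loglik <- function(b) sum(x * (theta - b) - log(1 + exp(theta - b)))
b_grid <- seq(-3, 3, by = 0.01)
ll     <- sapply(b_grid, loglik)
plot(b_grid, ll, type = "l", xlab = "difficulty b", ylab = "log-likelihood")
b_grid[which.max(ll)]              # difficulty value with the highest (log-)likelihood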
What is meant by the term asymptotic properties? What does this mean in a practical sense?
These are properties that ML estimation has as the sample size approaches infinity. For relatively small samples they may not hold, but for reasonably large samples (e.g. 200), these properties should hold approximately.
What three asymptotic properties does maximum likelihood have?
- The expected values of the estimates are the true values of the parameters (if you estimate the two-parameter model, the expected value of your ML estimate is equal to the true value; this means the estimator is unbiased, it is not systematically incorrect)
• Maximum likelihood estimates (MLEs) are denoted with a hat, e.g. ^a_1 for the estimate of a_1
• Thus: E(^a_1) = a_1
- The curvature at the maximum determines the standard errors of the estimates
• A wider curve means that a wider range of values has a high likelihood of being the best value for the parameter
- The MLEs have a normal distribution such that
^a_1 ~ N(a_1, se(^a_1)) (the normal distribution has a mean equal to the true parameter value a_1 and a standard deviation equal to the standard error of your estimate)
What do these three properties of MLE allow for?
Because of these properties you can do a simple significance test on a parameter; a test on 𝐻0:𝑎1 =𝜇 can be conducted using the so-called Wald test
z = (^a_1 − μ) / se(^a_1)
where z has a standard normal distribution if H0 is true
Give a practical example on how carrying out a significance test with MLE could be useful
To see if a discrimination parameter is significantly larger than zero: if it is not, the item does not measure the latent variable, and its item characteristic curve is flat because the item does not discriminate at all.
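A minimal sketch of such a Wald test in R, using made-up numbers for the estimate and its standard error (in practice these come from the ML output):
a_hat <- 1.40                  # hypothetical ML estimate of a discrimination parameter
se_a  <- 0.35                  # hypothetical standard error of that estimate
z     <- (a_hat - 0) / se_a    # Wald statistic for H0: a1 = 0
2 * (1 - pnorm(abs(z)))        # two-sided p-value under the standard normal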
What three estimation approaches are very popular in item response theory?
Joint maximum likelihood
Conditional maximum likelihood
Marginal maximum likelihood
What is involved in joint MLE?
You estimate all ai, bi, ci, θ simultaneously
Joint maximum likelihood thus seems like a good approach, as you obtain all the estimates you need at once. What are some disadvantages of this method? (4)
It is a massive estimation problem: you must estimate every item parameter (of which there can be hundreds or even thousands) and a θ parameter for every participant (of which there can also be hundreds or even thousands).
Some properties of joint MLE are also unhelpful. For instance, if a person answers all items correctly, ^θ → ∞, and if all are incorrect, ^θ → −∞. This may make sense intuitively, but statistically it is a disaster, because such estimates are very hard to work with and to draw inferences from.
The same holds for the items: if every person answers an item correctly, or every person answers it incorrectly, then ^b_i → −∞ or ^b_i → +∞, respectively.
It also has undesirable asymptotic properties: as the sample size approaches infinity, you get a new θ parameter for each subject, so the number of θ (latent trait) parameters also grows to infinity. More subjects should increase certainty, but every added subject brings a new parameter and hence new uncertainty. This means the usual asymptotic properties of maximum likelihood do not hold.
How do the other two ML estimation methods tackle the challenges posed by joint MLE?
They do not require the latent trait value to be estimated explicitly for each person; θ is no longer a parameter, and the only free parameters are the item parameters.
What is involved in conditional maximum likelihood?
Conditional on the sum score, θ disappears. When you plug in the sum score for each person, the latent trait is cancelled out from the likelihood function and you only have the item parameters.
What are the disadvantages of the conditional MLE?
It is only applicable to the Rasch model: only when all items share the same discrimination is the sum score a sufficient statistic for θ, so models with item-specific discrimination parameters cannot be estimated this way.
How do you calculate conditional MLE in R? What kind of item parameter does this estimation give?
Using the package eRm:
library(eRm)
res <- RM(data)   # fit the Rasch model with conditional ML
res               # print the results
coef(res)         # extract the item parameters
Estimates item easiness b_i:
P(X_pi = 1 | θ_p) = exp(θ_p + b_i) / (1 + exp(θ_p + b_i))
(rather than the difficulty parameterisation exp(θ_p − b_i) / (1 + exp(θ_p − b_i)))
(So the larger the value, the easier the item)
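So, assuming (as stated above) that coef(res) returns easiness parameters, a difficulty-style parameterisation can be obtained by simply flipping the sign (a sketch):
easiness <- coef(res)   # easiness parameters from the conditional ML fit
-easiness               # difficulty parameterisation: larger values = harder items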
What is the advantage of using marginal MLE?
It is applicable to practically all IRT models
What is a disadvantage of using marginal MLE?
It assumes a normal distribution for θ. You need this distribution to marginalise the latent variable out of the likelihood function: you integrate (sum) over all possible values of the latent variable.
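A rough sketch of this idea in R: the marginal likelihood of a response pattern is the conditional likelihood averaged over a normal distribution for θ, here approximated with a simple grid (all parameter values are made up):
x <- c(1, 0, 0, 1)            # observed response pattern
a <- c(1.0, 1.5, 0.8, 1.2)    # hypothetical discrimination parameters
b <- c(-0.5, 0.0, 0.5, 1.0)   # hypothetical difficulty parameters
cond_lik <- function(theta) prod(exp(x * a * (theta - b)) / (1 + exp(a * (theta - b))))
theta_grid <- seq(-6, 6, by = 0.01)
sum(sapply(theta_grid, cond_lik) * dnorm(theta_grid) * 0.01)  # approximate marginal likelihood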
How do you calculate marginal MLE in R?
Use the package ltm in R (which implements marginal ML estimation for many IRT models):
library(ltm)
res1pl <- rasch(data)   # fit the one-parameter (Rasch) model with marginal ML
res1pl                  # print the estimates
In docs you’ll see a screenshot of conditional MLE output. Describe what information this output gives
The conditional log-likelihood indicates the value of the likelihood of the data, given these parameter estimates, at the maximum. The parameter section gives an estimate for each item.
Note that these betas are easiness parameters: in this output the first beta listed is one of the easiest items (a high value) and the last beta listed is one of the most difficult items (a low value) (true only for these data).
In docs you’ll see a screenshot of marginal MLE output. Describe what information the coefficients give
You get a single common discrimination parameter at the bottom, because the one-parameter (Rasch) model fitted here constrains all slopes to be equal. The rest are ordinary difficulty parameters, loosely listed from easier to more difficult items (true only for these data), with lower values indicating easier items.
The log.Lik indicates the log-likelihood: the value of the likelihood of the data, given these parameter estimates, at the maximum.
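A sketch of how these numbers can be extracted in R (assuming res1pl is the rasch() fit from above; coef() should give the difficulty and discrimination estimates):
coef(res1pl)      # item difficulty (Dffclt) and common discrimination (Dscrmn) estimates
summary(res1pl)   # also shows standard errors and the log-likelihood at the maximum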