Lecture 3: IRT Parameter estimation and model fit Flashcards

1
Q

Why is model fit so important? (2)

A

Because with a model we can check whether it fits the data and, if not, try another one; this is a key reason for focusing on latent variable models as opposed to CTT.

You should also check because otherwise you may be using a model that is incorrect: once you use a model, you should establish its fit.

2
Q

What is the first thing you should look at when establishing model fit?

A

How many dimensions the data have; if there is more than one dimension, it is useless to use unidimensional IRT models (which assume a single latent variable)

3
Q

Name three other things to check when establishing model fit

A
  • Equal item discrimination
  • Absence of guessing (1/2- vs 3-parameter model)
  • Model predictions: apply the model to the data and see what it predicts for those data. If predictions and data agree, keep the model; if not, look for a different model
4
Q

In LVM you can’t directly calculate quantities like slopes and regressions as in CTT. What do you calculate instead?

A

Maximum likelihood: you evaluate for which values of your model parameters the data are most likely to occur. So: which values of the parameters a1, a2, b1, b2, θ1, θ2, etc. maximise the likelihood of your data? In other words, which values are most likely given the data. It is one of the most important estimation methods

5
Q

What is the equation for calculating the likelihood of dichotomous models?

A

Pi = P(Xpi = 1 | θp) = e^(ai(θp − bi)) / (1 + e^(ai(θp − bi)))

Qi = P(Xpi = 0 | θp) = 1 / (1 + e^(ai(θp − bi)))
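As an illustration (a sketch, not part of the lecture; the parameter values below are hypothetical), the two formulas can be computed directly:

```python
import math

def p_correct(a, b, theta):
    # Pi: probability of a correct response, e^(a(theta - b)) / (1 + e^(a(theta - b)))
    z = a * (theta - b)
    return math.exp(z) / (1 + math.exp(z))

def q_incorrect(a, b, theta):
    # Qi: probability of an incorrect response, 1 / (1 + e^(a(theta - b)))
    return 1 - p_correct(a, b, theta)

# When theta equals the item difficulty b, P is exactly 0.5
print(p_correct(a=1.5, b=0.0, theta=0.0))  # 0.5
```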

6
Q

What two assumptions does maximum likelihood rely on?

A

Uni-dimensionality: All items assess one construct

Local independence: Conditional on 𝜃, all of the items are independent or 𝜃 accounts for all the correlations among the items.

7
Q

How do you calculate maximum likelihood?

A

E.g. if a subject has a response pattern of 1, 0, 0, 1, we can calculate the probability of that pattern as:
P(Xp1 = 1, Xp2 = 0, Xp3 = 0, Xp4 = 1 | θp)
= P1 × Q2 × Q3 × P4
= e^(a1(θp − b1)) / (1 + e^(a1(θp − b1))) × 1 / (1 + e^(a2(θp − b2))) × 1 / (1 + e^(a3(θp − b3))) × e^(a4(θp − b4)) / (1 + e^(a4(θp − b4)))

Which can be written as:
∏(i = 1 to 4) e^(Xpi · ai(θp − bi)) / (1 + e^(ai(θp − bi)))

By applying this to the whole dataset and tweaking the parameters, you get the likelihood of the data
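A minimal sketch of this product (with hypothetical item parameters, not the lecture's data):

```python
import math

def p_correct(a, b, theta):
    # 2PL item response function
    z = a * (theta - b)
    return math.exp(z) / (1 + math.exp(z))

def pattern_likelihood(responses, a, b, theta):
    # Multiply Pi for correct (1) and Qi = 1 - Pi for incorrect (0) responses;
    # multiplying the item terms is valid under local independence
    L = 1.0
    for x, ai, bi in zip(responses, a, b):
        p = p_correct(ai, bi, theta)
        L *= p if x == 1 else (1 - p)
    return L

# Hypothetical parameters for a 4-item test and the pattern 1, 0, 0, 1
a = [1.0, 1.2, 0.8, 1.5]
b = [-1.0, 0.0, 0.5, 1.0]
L = pattern_likelihood([1, 0, 0, 1], a, b, theta=0.0)
print(L)  # a single small probability between 0 and 1
```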

8
Q

What does the addition of Xpi in the maximum likelihood notation do?

∏(i = 1 to 4) e^(Xpi · ai(θp − bi)) / (1 + e^(ai(θp − bi)))

A

The exponent Xpi makes each term switch depending on whether the response is a 1 or a 0. If it is a 1, the term becomes the probability of a correct response:
e^(ai(θp − bi)) / (1 + e^(ai(θp − bi)))
If it is a 0, the numerator becomes e^0 = 1 and the term becomes the probability of an incorrect response:
1 / (1 + e^(ai(θp − bi)))

9
Q

How does the assumption of local independence affect this equation?

A

Due to the assumption of local independence we can simply multiply the individual probabilities. Without this assumption, the equation would be very long and hard to deal with, as we would have to account for every association between the items.

10
Q

Describe the graph of the maximum likelihood function with one difficulty parameter. What can be derived from this?

A

The difficulty parameter is on the x axis and the likelihood on the y axis. The curve is typically skewed to the left with a gradual slope. The highest point (the maximum) of this curve is the value with the maximum likelihood, i.e. the maximum likelihood estimate.

11
Q

What values do the likelihood values on the y axis take?

A

They are extremely small values, as they are the probability of observing exactly the data that were observed. For this reason we focus on the log-likelihood, which gives values that are much easier to work with numerically; this is a very important concept.
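A quick numerical illustration (a sketch assuming only floating-point arithmetic): multiplying many tiny probabilities underflows to zero, while summing their logs stays perfectly workable:

```python
import math

# The likelihood is a product of many small probabilities...
probs = [1e-4] * 200
likelihood = 1.0
for p in probs:
    likelihood *= p
print(likelihood)  # 0.0 -- underflows below the smallest representable float

# ...whereas the log-likelihood is a sum of logs and poses no problem
loglik = sum(math.log(p) for p in probs)
print(loglik)  # about -1842.07
```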

12
Q

What is meant by the term asymptotic properties? What does this mean in a practical sense?

A

These are the properties ML has as your sample size approaches infinity. For relatively small sample sizes they may not hold, but for larger sample sizes (e.g. 200), these properties should hold approximately

13
Q

What three asymptotic properties does maximum likelihood have?

A
  1. The expected values of the estimates are the true values of the parameters (if you estimate the two-parameter model, the expected value of your ML estimate is equal to the true value; the estimator is unbiased, i.e. not systematically incorrect)
    • Maximum likelihood estimates (MLEs) are denoted with a hat, e.g., â1 for the estimate of a1
    • Thus: E(â1) = a1
  2. The curvature at the maximum determines the standard errors of the estimates
    • A wider (flatter) curve means that a wider range of values has a high likelihood, so the estimate is less certain and the standard error is larger
  3. The MLEs have a normal distribution such that
    â1 ~ N(a1, se(â1)) (the normal distribution has a mean equal to the true parameter value a1 and a standard deviation equal to the standard error of your estimate)
14
Q

What do these three properties of MLE allow for?

A

Because of these properties you can do a simple significance test on a parameter; a test of H0: a1 = μ can be conducted using the so-called Wald test:
z = (â1 − μ) / se(â1)
where z has a standard normal distribution if H0 is true
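A sketch of the Wald test with made-up numbers (the estimate and standard error below are hypothetical):

```python
def wald_z(estimate, mu, se):
    # Wald statistic: z = (a1_hat - mu) / se(a1_hat)
    return (estimate - mu) / se

# Hypothetical: a1_hat = 1.8 with se = 0.4, testing H0: a1 = 0
z = wald_z(1.8, 0.0, 0.4)
print(z)  # about 4.5 -- well beyond 1.96, so significant at the 5% level
```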

15
Q

Give a practical example of how carrying out a significance test with MLE could be useful

A

To see if a discrimination parameter is significantly larger than zero; because if it is not, the item doesn’t measure the latent variable. The item characteristic curve will be flat if the item doesn’t discriminate at all.

16
Q

What three estimation approaches are very popular in item response theory?

A

Joint maximum likelihood

Conditional maximum likelihood

Marginal maximum likelihood

17
Q

What is involved in joint MLE?

A

You estimate all ai, bi, ci, θ simultaneously

18
Q

Joint maximum likelihood then seems like a good approach, as you obtain all the estimates you require at once. What are some disadvantages of this method? (4)

A

This is a massive estimation problem, as you estimate every item parameter (of which there can be hundreds) jointly with a θ for every participant (of which there can be hundreds or even thousands).

Some properties of joint MLE also do not help: for instance, if a subject answers all items correctly, θ̂ → ∞; if all incorrectly, θ̂ → −∞. This may make sense intuitively, but statistically it is a disaster, as such estimates are really hard to calculate and draw inferences from.

The same holds for the items: if an item is answered correctly (or incorrectly) by every person, then b̂i → −∞ (or b̂i → ∞).

It also has undesirable asymptotic properties: if your sample size approaches infinity, you get a new θ parameter for each subject, giving infinitely many θ (latent trait) parameters. More subjects should increase certainty, but every added subject brings a new parameter and thus new uncertainty. This essentially violates the asymptotic properties of maximum likelihood, since they no longer hold.

19
Q

How do the other two MLE approaches tackle the challenges posed by joint MLE?

A

They do not require estimating the latent trait values explicitly for each person separately: θ is no longer a parameter; the only free parameters are the item parameters

20
Q

What is involved in conditional maximum likelihood?

A

Conditional on the sum score, θ disappears: when you condition on each person's sum score, the latent trait cancels out of the likelihood function and only the item parameters remain.

21
Q

What are the disadvantages of the conditional MLE?

A

It is only applicable to the Rasch model (where all items have equal discrimination, so the sum score is a sufficient statistic for θ)

22
Q

How do you calculate conditional MLE in R? What kind of parameter does this calculation use?

A

Using the package eRm:
res <- RM(data)
res

coef(res) # to get the item parameters

eRm estimates item easiness parameters bi:
P(Xpi = 1 | θp) = e^(θp + bi) / (1 + e^(θp + bi))
(rather than the difficulty parameterisation P(Xpi = 1 | θp) = e^(θp − bi) / (1 + e^(θp − bi)))
(So the larger the value, the easier the item)

23
Q

What is the advantage of using marginal MLE?

A

It is applicable to practically all IRT models

24
Q

What is a disadvantage of using marginal MLE?

A

It assumes a normal distribution for θ; you need this normal distribution to marginalise out the latent variable, summing (integrating) over all possible values of the latent variable in your likelihood function

25
Q

How do you calculate marginal MLE in R?

A

Use the package ltm in R (which also implements many other IRT estimation methods):

library(ltm)
res1pl <- rasch(data)
res1pl

26
Q

In docs you’ll see a screenshot of conditional MLE output. Describe what information this output gives

A

The conditional log-likelihood indicates the value of the (log-)likelihood of the data given these parameters at the maximum. The item parameters listed are easiness parameters: the first beta listed is one of the easiest items, with a high value, and the last beta listed is one of the most difficult items, with a low value (true only for this data).

27
Q

In docs you’ll see a screenshot of marginal MLE output. Describe what information the coefficients give

A

You get a common discrimination parameter at the bottom, since the Rasch model fitted with rasch() constrains the slopes to be equal. The rest are ordinary difficulty parameters, loosely listed from easier to more difficult (true only for this data), with lower values indicating easier items.

The log.Lik indicates the log-likelihood: the value of the likelihood of the data given these parameters at the maximum.

28
Q

After the output of the marginal MLE in docs there is another screenshot of output. What does this show?

A

The marginal MLE of a two-parameter model obtained in another way (res_ltm = ltm(data ~ z1); summary(res_ltm))

This gives more or less the same information, but also adds the standard error and z value for each of the difficulty parameters, as well as the AIC and BIC of the model

29
Q

What do you obtain if you run plot(res_ltm) following the previous code:

res_ltm = ltm(data ~ z1)
summary(res_ltm)

A

You get the item characteristic curves of each of the items, with probability on the y axis and ability (theta) on the x axis. See docs

30
Q

Below the first item characteristic curve in docs you see two others which are perhaps more readable. What are these and how are they obtained?

A

These are items 1:10 and items 11:20 plotted separately, with the plotted range for theta specified as −10 to 5 and the lines made thicker, achieved through:
plot(res_ltm,item=1:10,z=seq(-10,5,by=.1),lwd=2)
plot(res_ltm,item=11:20,z=seq(-10,5,by=.1),lwd=2)

31
Q

What is required from the model in order for maximum likelihood to work? Describe what this means

A

It must be ‘identified’; this has to do with the fact that the latent trait in the model is an unobserved variable and thus does not have a scale. A parameter in a two-parameter model could, for instance, have a value of 1000, but this doesn’t mean anything unless we specify what a score of 1000 means.

32
Q

What occurs if the model is not identified in MLE?

A

You can plug different parameter values into your model, e.g.:
a1 = 2, b1 = 3, θ = 1; μθ = 0, σθ = 1:
e^(2 × (1 − 3)) / (1 + e^(2 × (1 − 3))) × N(1; 0, 1)
and
a1 = 2, b1 = 5, θ = 3; μθ = 2, σθ = 1:
e^(2 × (3 − 5)) / (1 + e^(2 × (3 − 5))) × N(3; 2, 1)

Both give a probability of 0.004355 despite different parameter values for both the difficulty and the latent trait. You can keep playing with these parameters and still get the same probability, which is useless for maximum likelihood estimation, where you want to find the values that maximise the likelihood of the data.
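These two calculations can be verified numerically; this is a sketch using only the quantities above:

```python
import math

def p_correct(a, b, theta):
    # 2PL item response function
    z = a * (theta - b)
    return math.exp(z) / (1 + math.exp(z))

def norm_pdf(x, mu, sd):
    # Density of N(mu, sd) evaluated at x
    return math.exp(-((x - mu) ** 2) / (2 * sd ** 2)) / (sd * math.sqrt(2 * math.pi))

# Shifting b, theta, and the mean of theta all by +2 leaves the value unchanged
v1 = p_correct(2, 3, 1) * norm_pdf(1, 0, 1)
v2 = p_correct(2, 5, 3) * norm_pdf(3, 2, 1)
print(v1, v2)  # the same value (about 0.0044) twice
```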

33
Q

Why do these identical outcomes occur if the model is not identified?

A

The mean and distribution of the latent trait follow the difficulty parameters, and the slopes determine the width of this distribution. If the model is not identified and no scale is given to theta, inferences are drawn relative to an arbitrary normal distribution, so the same conclusions follow regardless of the location or span of the range of the latent trait. This is visualised in docs

34
Q

What steps should you typically follow in identifying your model?

A

𝜃𝑝 is an unobserved variable which has no unit. Therefore, you need to identify the unit; A latent variable is identified by fixing an arbitrary location and an arbitrary scale

35
Q

Give three examples of how the location of a model can be set, and what package employs each method if relevant

A

You can set bi = 0 for some item i, e.g. the first item's difficulty serves as the reference point for θ = 0 (difficulty of first item = 0)

Another option is fixing the sum/average of the item difficulties to 0: Σ(i = 1 to n) bi = 0. This is used in the R package ‘eRm’

A further option is to set the mean of the latent trait to 0: μθ = 0. This is used in the R package ‘ltm’

Each of these are visualised in docs

36
Q

Give three examples of how the scale of a model can be set, and what package employs each method if relevant

A

Similarly, you can set ai = 1 for some item i, e.g. the slope of the first item serves as the reference unit for the scale of theta

Another option is fixing the product of the item slopes to 1: ∏(i = 1 to n) ai = 1.

A further option is to set the variance of the latent trait to 1: σ²θ = 1. This is used in the R package ‘ltm’

Each of these are visualised in docs

37
Q
When carrying out MLE on a Rasch model in eRm:
res <- RM(data)
res
What would you get if you computed:
sum(res$betapar)?
A

You would get a value very close to zero as the sum of the difficulty parameters is used as a reference point of 0 for the latent trait

38
Q

When assessing model fit; how do you test for uni-dimensionality?

A

If the data are uni-dimensional, there should be one dominant component in the data
→ Therefore you can use a principal component analysis of the tetrachoric correlation matrix (despite PCA not being a latent variable analysis, it can inform you about the dimensionality of the data)

39
Q

Why do we use this type of correlation matrix to carry out the PCA to assess dimensionality?

A

Because in IRT we have ordinal or dichotomous data, we cannot use an ordinary (Pearson) correlation matrix; we have to use a tetrachoric correlation matrix, which is designed for categorical data.

40
Q

How can we assess whether there is one dominant eigenvalue?

A

Eigenvalues are listed in decreasing order, and you should be able to eyeball a dominant first eigenvalue from the output, e.g.:
13.76 1.81 1.09 1.06 1.01 0.99 0.97 0.96 0.94 0.91 0.90 0.88 0.86 0.86 0.84 0.82 0.81 0.80 0.79 0.79 0.77 0.75 0.73 0.73 0.72 0.70 0.70 0.68 0.67 0.67 0.66 0.65 0.63

You can then use Kaiser’s criterion (eigenvalues larger than 1 are kept; not really recommended), a scree plot, and/or a parallel analysis to assess this further

41
Q

What does carrying out a parallel analysis to check for one dominant component in a PCA entail?

A

R simulates data with uncorrelated variables that resemble your variables (so zero components in the data, or as many components as you have items), i.e. pure noise. This produces a good reference point (what you would expect if there were no components in the data), which you can plot as a line on your scree plot; you extract the components above the line. This is shown in docs

42
Q

How can you check for equal discrimination parameters in your model fit?

A

You can carry out a rough check using item-test correlations/ item-rest correlations. If the items discriminate equally well, these correlations should be equal.

43
Q

How would you check for equal discrimination parameters in your model fit in R?

A

nit <- ncol(data) # number of items
Xtotal <- apply(data, 1, sum) # sum score per subject

ou2 <- c()
for (i in 1:nit) {
  ou2[i] <- cor(data[, i], Xtotal - data[, i]) # item-rest correlations
}
ou2

You can then look at the correlations produced and assess whether they are roughly equal; you can also plot them in a histogram to assess this

44
Q

When assessing model fit, how would you assess the effect of guessing? Give three methods

A

Test on the basis of θ estimates

Test on the basis of sum scores

Test on the basis of the three parameter model

45
Q

What does testing on the basis of θ estimates entail?

A

If there is no guessing, subjects in the lower range of θ will fail on the difficult items.

46
Q

What is a problem with testing on the basis of theta estimates?

A

If you determine theta, you already assume an IRT model

47
Q

What does testing on the basis of the sum scores entail methodologically?

A

If there is no guessing, subjects in the lower range of the sum score will fail on the difficult items. You can test this with item-test regressions: calculate the proportion correct among all subjects with a given test score, and plot it against the test score. An idea of the difference between the graphs for guessing and non-guessing is given in docs: the proportion of correct responses should approach zero as the overall test score approaches 0, especially for difficult items. If the curve instead levels off above zero, or shows big jumps, there is likely guessing

48
Q

How would you test for guessing using the three parameter model? Also give the R code

A

First fit a three-parameter model to the data:
library(ltm)
res3pl <- tpm(data, start.val = matrix(c(.5, 1, 1), 20, 3, TRUE)) # starting values for the 20 items
res3pl

Then read off the estimated guessing parameter for each item to assess whether guessing is happening on that item

49
Q

How would you assess the model predictions?

A

Construct m ability groups and determine the model residuals: Pij − E(Pij)
• Pij is the observed proportion correct in ability group j for item i
• E(Pij) is the expected proportion correct in ability group j for item i according to the model, e.g.:
E(Pij) = P(Xij = 1 | θj) = e^(ai(θj − bi)) / (1 + e^(ai(θj − bi)))

I.e. the difference between the observed and expected proportions; ideally this is 0 and your model perfectly predicts the observed proportions. You can then plot the observed proportions against the expected proportions and adjust your model accordingly, e.g. adjust ai to make the curve steeper

50
Q

What statistic is used in model predictions?

A

To standardise the residuals you use the following formula:
Zij = (Pij − E(Pij)) / sqrt(E(Pij)[1 − E(Pij)] / Nj)
You then square these and sum over the ability groups to get the Qi statistic:
Qi = Σ(j = 1 to m) Nj [Pij − E(Pij)]² / (E(Pij)[1 − E(Pij)])
where m is the number of ability groups and Nj is the number of subjects in ability group j

It has the nice property of following a chi-squared distribution:
Qi ~ χ²(m − k)
where k is the number of parameters per item (i.e., 2 for the two-parameter model, but also for the one-parameter model!)
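A sketch of the Qi computation, with hypothetical group proportions and a hypothetical item (a = 1, b = 0):

```python
import math

def p2pl(a, b, theta):
    # Expected proportion correct E(Pij) under the 2PL
    z = a * (theta - b)
    return math.exp(z) / (1 + math.exp(z))

def q_statistic(p_obs, n_group, theta_group, a, b):
    # Qi = sum over groups j of Nj * (Pij - E(Pij))^2 / (E(Pij) * (1 - E(Pij)))
    q = 0.0
    for p, n, th in zip(p_obs, n_group, theta_group):
        e = p2pl(a, b, th)
        q += n * (p - e) ** 2 / (e * (1 - e))
    return q

# Hypothetical: m = 3 ability groups of 50 subjects each
p_obs = [0.30, 0.52, 0.70]      # observed proportion correct per group
n_group = [50, 50, 50]          # Nj
theta_group = [-1.0, 0.0, 1.0]  # ability level per group
qi = q_statistic(p_obs, n_group, theta_group, a=1.0, b=0.0)
print(qi)  # small value: observed proportions lie close to the model's predictions
```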

51
Q

How can you obtain the Qi statistic in R? Interpret the output

A

item.fit(res2pl)
where the model is stored in res2pl

The X² statistic in the output is the Qi statistic. A significant result indicates a significant difference between the observed proportions and those predicted by the model, i.e. misfit for that item

52
Q

Adding _____ will always increase the likelihood. Why is this not always a good solution?

A

Adding parameters will always increase the likelihood; a more complex model will always fit the data at hand better, but it often models noise rather than the important patterns, overfitting the data. It is better to choose a simpler model over a complex one if they perform about the same

53
Q

What test can you carry out in order to compare two models?

A

A likelihood ratio test; null hypothesis test to compare statistical models

54
Q

What conditions does a likelihood ratio test assume?

A
  1. Models need to be nested
    - one model needs to be a constrained version of the other, e.g. the one-parameter model is a constrained version of the two-parameter model
  2. Constraints cannot be boundary constraints
    - e.g., a correlation of 1, a variance of 0, a guessing parameter of 0
    : a boundary constraint fixes a parameter at the edge of its parameter space; fixing the guessing parameter to 0 is one, since 0 is the absolute minimum (lower bound) the guessing parameter can take. The same logic applies to the correlation and variance above
55
Q

How do you calculate the likelihood ratio?

A

likelihood ratio = L(constrained) / L(unconstrained)
i.e. less complex model / more complex model
e.g. L(1-parameter model) / L(2-parameter model)

56
Q

When can we say that one model is nested by another model?

A

If you can go from one model to another by fixing some parameters, the models are nested. E.g. to go from the three-parameter model to the two-parameter model you fix ci to 0; you can further go to the one-parameter model by equating ai across items in the two-parameter model

57
Q

Why can’t you compare the three parameter model and two parameter model using the likelihood ratio?

A

Because fixing the guessing parameter ci to 0 is a boundary constraint

58
Q

What hypothesis are you testing with a likelihood ratio test?

A

Whether some constraints hold e.g:
H0: a = a1 = a2 = … = an
aka H0: testing that discrimination parameters are equal

60
Q

How do you carry out the likelihood ratio test?

A

Get the likelihood ratio:
likelihood ratio = L(constrained) / L(unconstrained)
i.e. less complex model / more complex model

Then take −2 log of the ratio as the test statistic:
test statistic = −2 log(likelihood ratio)
= −2(log L(constrained) − log L(unconstrained))
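As a sketch, using the rounded log-likelihoods from the worked example later in these cards:

```python
def lrt_statistic(loglik_constrained, loglik_unconstrained):
    # -2 * (log L(constrained) - log L(unconstrained))
    # = -2 * log(likelihood ratio)
    return -2 * (loglik_constrained - loglik_unconstrained)

# Rounded log-likelihoods for a 1PL (constrained) vs 2PL (unconstrained) fit
lrt = lrt_statistic(-13774, -13667)
print(lrt)  # 214 -- compare to a chi-squared distribution
```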

61
Q

Why do we take the -2log of the likelihood ratio? (2)

A

To obtain a chi-squared distribution for the test statistic, and because log-likelihoods are much nicer to work with than likelihoods

62
Q

What are the properties of the likelihood ratio test?

A

LRT ~ χ²(k_unconstrained − k_constrained)

where k denotes the number of parameters in the unconstrained and constrained models

63
Q

How can you calculate this test statistic given the output for your two models?

A

Take the log-likelihood of each, e.g.:
log L(1-parameter / constrained) = −13774
log L(2-parameter / unconstrained) = −13667

Likelihood ratio test:
LRT = −2(−13774 − (−13667)) ≈ 214 (the reported 215.36 comes from the unrounded log-likelihoods)

Degrees of freedom: k_2PM − k_Rasch = 40 − 21 = 19
• k_Rasch: 1 discrimination and 20 difficulties (21 parameters)
• k_2PM: 20 discriminations and 20 difficulties (40 parameters)

64
Q

How can you calculate this test statistic in R?

A

res2pl = ltm(data ~ z1)
res1pl = rasch(data)
anova(res1pl, res2pl)

note: despite the function name, this has nothing to do with an ANOVA

65
Q

How do you interpret the significance of a likelihood ratio test?

A

If it is significant, it means the constraints do not hold; e.g. the discrimination parameters are not equal across items

66
Q

How can you interpret the AIC and BIC in the likelihood ratio test output?

A

They can be treated as fit indices: whichever model has the smallest value fits best. They assess accuracy while penalising models with larger numbers of parameters

67
Q

How does the AIC and BIC make up for some of LRT’s shortcomings?

A

They can be used even when the models differ by a boundary constraint, so they can compare a three-parameter IRT model with a two-parameter model