Lecture 1: Classical Test Theory Flashcards
When psychologists assess the quality of a test, what two metrics do they typically refer to?
Validity and reliability
What is test variance and how do you calculate it? (2)
Item variance is the measure of dispersion of the scores on item i; test variance is the measure of dispersion of the total test scores. A covariance matrix is constructed in which the variance of each item lies along the diagonal and the covariance between each pair of items fills the off-diagonal cells. The test variance is the sum of all the values in this matrix, which equals the variance of the final test scores: it's the same value.
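A minimal sketch of this identity (with simulated, made-up scores for 3 items): the sum of all elements of the item covariance matrix equals the variance of the total test scores.

```python
import numpy as np

# Simulated data: 200 persons answering 3 items that share a common signal.
rng = np.random.default_rng(0)
common = rng.normal(size=(200, 1))          # shared signal across items
items = common + rng.normal(size=(200, 3))  # 200 persons, 3 items

cov = np.cov(items, rowvar=False)           # item variances on the diagonal,
                                            # covariances off the diagonal
test_scores = items.sum(axis=1)             # total test score per person

# Sum of the whole covariance matrix == variance of the test scores:
print(np.isclose(cov.sum(), np.var(test_scores, ddof=1)))  # → True
```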
What's the difference between covariance and correlation, if there is one?
Covariance is an unscaled measure of association between variables; correlation is covariance scaled by the standard deviations, so it is bounded between -1 and 1.
What can be used to infer the dimensionality of a test in CTT?
Principal component analysis (PCA)
What is meant by Principal component analysis (PCA)?
Principal component analysis (PCA) is a technique for reducing the dimensionality of such datasets, increasing interpretability while minimising information loss. It does so by creating new, uncorrelated variables that successively maximise variance. E.g., reducing a description of something (e.g., a tumour) with 30 dimensions (smoothness, volume) to two principal components.
Summarise the main steps of how PCA is calculated
We calculate the covariance matrix of our data, we calculate the eigenvectors of the covariance matrix, and this gives us our principal components. The eigenvector with the largest eigenvalue is the first principal component, and the eigenvector with the smallest eigenvalue is the last principal component.
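The steps above can be sketched as follows, on made-up 3-dimensional data whose axes have clearly different spreads:

```python
import numpy as np

# Simulated data: 500 observations, 3 variables with unequal variances.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3)) * np.array([3.0, 1.0, 0.2])
X = X - X.mean(axis=0)                     # centre the data

cov = np.cov(X, rowvar=False)              # step 1: covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)     # step 2: eigendecomposition
order = np.argsort(eigvals)[::-1]          # largest eigenvalue first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

components = X @ eigvecs                   # project onto the principal components
print(eigvals / eigvals.sum())             # proportion of variance per component
```

Note that `np.linalg.eigh` returns eigenvalues in ascending order, so they are re-sorted descending: the first column of `components` is then the first principal component.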
What does X_gp represent in CTT?
X_gp is a random variable denoting the repeatedly sampled measurements of test g on subject p.
What two fundamental equations can be derived from CTT?
- E(X_gp) = T_gp
The expected value of X_gp is equal to the true score.
- E_gp = X_gp - T_gp (for a fixed subject)
(error = observed score - true score)
What three assumptions are there within CTT?
(a) the measurement is on an interval scale;
(b) the variance of observed scores, σ²(X_gp), is finite;
(c) the measurements are repeatedly sampled in a linear, experimentally independent way.
What 8 properties are derived from CTT?
- The expected error score is zero;
- The correlation between true and error scores is zero;
- The correlation between the error score on one measurement and the true score on another measurement is zero;
- The correlation between errors on linearly experimentally independent measurements is zero;
- The expected value of X_gp over persons is equal to the expected value of the true score random variable over persons;
- The variance of E_g over persons is equal to the expected value, over persons, of σ²(X_gp) (the variance within persons);
- Sampling over persons with any T_gp, the expected value of the error score random variable is zero;
- The variance of observed scores is the sum of the variance of true scores and the variance of error scores;
Give proof that the expected error score is 0
*Not required but gives an idea of how CTT is derived
- E(X_gp) = T_gp (fundamental Eq. 1)
- E_gp = X_gp - T_gp (fundamental Eq. 2)
E(E_gp) = E(X_gp - T_gp) = E(X_gp) - E(T_gp)* = T_gp - T_gp = 0
*For one person, T_gp is fixed
Give the proof for the following:
- The correlation between true and error scores is zero;
- The correlation between the error score on one measurement and the true score on another measurement is zero;
- The correlation between errors on linearly experimentally independent measurements is zero;
*Not needed to reproduce exact theorems
- The correlation between true and error scores is zero;
X_g = T_g + E_g
or X = T + E
E(E_gp) = 0 (property 1) ⇒ E(E_g | T_g = T_gp) = E(E_gp) = 0 for all T_gp ⇒ ρ(E_g, T_g) = 0
If the expected error is zero for each person (i.e., at every level of the true score), the error cannot be correlated with the true score.
- The correlation between the error score on one measurement and the true score on another measurement is zero;
E(E_g) = 0 (property 1) ⇒ E(E_g | T_h = T_hp) = 0 for all T_hp ⇒ ρ(E_g, T_h) = 0
Same logic: if the expected error is zero at every level of the other test's true score, it cannot be correlated with that true score.
- The correlation between errors on linearly experimentally independent measurements is zero;
E(E_g) = 0 (property 1) ⇒ E(E_g | E_h = E_hp) = 0 for all E_hp ⇒ ρ(E_g, E_h) = 0
Same logic: if the expected error is zero at every level of the other error, the two errors cannot be correlated.
Give the proof of property 8: The variance of observed scores is the sum of the variance of true scores and the variance of error scores;
- ρ(E_g, T_g) = 0 (property 2)
- X_g = T_g + E_g (population model)
σ²(X_g) = σ²(T_g + E_g)* = σ²(T_g) + σ²(E_g) + 2σ(T_g, E_g)
⇒ σ²(X_g) = σ²(T_g) + σ²(E_g)
or σ²_X = σ²_T + σ²_E
*Variance-of-a-sum rule (expand via the covariance matrix); the covariance term drops out by property 2
How can reliability be defined in these terms (conceptually, with proof)?
Conceptually: reliability is the squared correlation, over persons, between the test score and the true score.
Using fundamental Equations 1 and 2, and property 2, reliability can be defined as:
ρ(X_g, T_g)
= σ(X_g, T_g) / (σ(X_g) σ(T_g))
= σ(T_g + E_g, T_g) / (σ(X_g) σ(T_g))
= (σ(T_g, T_g) + σ(E_g, T_g)) / (σ(X_g) σ(T_g))
= (σ²(T_g) + 0) / (σ(X_g) σ(T_g))
= σ(T_g) / σ(X_g)
⇒ ρ_g = ρ²(X_g, T_g) = (σ(T_g) / σ(X_g))² = σ²(T_g) / σ²(X_g)
Step by step:
= the correlation between test score and true score
= the formula for a correlation
= substitute X_g = T_g + E_g
= the covariance of T + E with T can be written as the covariance of T with T plus the covariance of E with T (rule)
= the covariance of T with itself is its variance, and the covariance between E and T is 0, as explained before
= one σ(T_g) cancels against the denominator, leaving σ(T_g) / σ(X_g)
= not there yet: the reliability of X is the squared correlation between X and T (how much, in %, of the total score variance is due to the true score)
= the correlation squared is what we derived before
= the variance of T divided by the variance of X
How insightful is this definition of reliability?
Quite insightful: since σ²(X_g) = σ²(T_g) + σ²(E_g) (property 8), reliability is the proportion of observed-score variance that is due to true-score variance.
These are theoretical equations, we cannot calculate them without the variance of true scores. How do we try to do this?
The concept of parallel tests: a test g together with a parallel test form h.
What are the assumptions of parallel tests
You assume the true scores are identical on the two tests, and that the error variances are equal.
How are parallel tests g and h defined mathematically?
T_hp = T_gp ⇒ T_h = T_g = T
The true score on one test is the same as the true score on the other for each subject.
σ²(E_hp) = σ²(E_gp) ⇒ σ²(X_h) = σ²(X_g) = σ²(X)
If you have the same error variance, you have the same test score variance.
How do these definitions help us calculate the reliability of a test?
If the two tests have the same true scores and the same error (and hence test score) variance, then the correlation between the test scores equals the reliability of each test.
Prove mathematically that the correlation between the test scores is equal to the reliability of each test
ρ(X_h, X_g)
= σ(T_h + E_h, T_g + E_g) / (σ(X_h) σ(X_g))
= (σ(T_h, T_g) + σ(E_h, T_g) + σ(E_g, T_h) + σ(E_g, E_h)) / (σ(X_h) σ(X_g))
= (σ(T_h, T_g) + 0 + 0 + 0) / (σ(X_h) σ(X_g))
Step by step: plug X = T + E from CTT into the correlation formula (covariance divided by the product of standard deviations); distribute the covariance over the sums (same trick as before, properties 3 + 4); every covariance involving an error term is 0, leaving the covariance of the true scores over the product of the test score standard deviations.
Since parallel tests also say:
T_h = T_g = T
σ²(X_h) = σ²(X_g) = σ²(X):
σ(T_h, T_g) / (σ(X_h) σ(X_g))
= σ²(T) / σ²(X)
= ρ²(X, T)
= ρ_g
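A quick simulation sketch of this result (all parameter values are made up): two parallel forms share the same true scores and have equal error variance, so the correlation between them should approximate the reliability σ²(T)/σ²(X).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
true_var, error_var = 4.0, 1.0

T = rng.normal(0.0, np.sqrt(true_var), n)         # shared true scores
X_g = T + rng.normal(0.0, np.sqrt(error_var), n)  # parallel form g
X_h = T + rng.normal(0.0, np.sqrt(error_var), n)  # parallel form h

reliability = true_var / (true_var + error_var)   # sigma^2(T) / sigma^2(X) = 0.8
observed = np.corrcoef(X_g, X_h)[0, 1]            # should be close to 0.8
print(reliability, round(observed, 3))
```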
What is the next step towards really calculating reliability?
Cronbach's alpha: it treats all the items as parallel tests/items, in order to use the ideas above to calculate reliability. It is the most used index of reliability in psychology.
How is Cronbach's alpha given mathematically and conceptually?
α_gg = (n / (n - 1)) × (σ²(X) - Σᵢ σ²(Xᵢ)) / σ²(X)
= (n / (n - 1)) × (ΣΣ_{i≠j} σ(Xᵢ, Xⱼ)) / σ²(X)
where n is the number of items
= n/(n-1) times (the test score variance minus the sum of the item variances, i.e., the diagonal of the covariance matrix), divided by the test variance
= n/(n-1) times (the sum of all elements of the covariance matrix minus its diagonal), divided by the sum of all elements of the covariance matrix
So: n/(n-1) times the sum of the off-diagonal elements of the covariance matrix, divided by the sum of the whole matrix.
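The formula above can be sketched directly from the item covariance matrix; the simulated data are made up for illustration (four items sharing one common true score).

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """items: 2D array with rows = persons, columns = items."""
    cov = np.cov(items, rowvar=False)   # item covariance matrix
    n = cov.shape[0]                    # number of items
    test_var = cov.sum()                # sum of all elements = test variance
    item_vars = np.trace(cov)           # diagonal = sum of item variances
    return (n / (n - 1)) * (test_var - item_vars) / test_var

rng = np.random.default_rng(1)
theta = rng.normal(size=2000)                       # common true score
data = theta[:, None] + rng.normal(size=(2000, 4))  # 4 noisy items
print(round(cronbach_alpha(data), 2))               # population value is 0.8 here
```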
When does Cronbach's alpha give the exact reliability? What consequences does this have?
Cronbach's alpha gives the reliability if the test is essentially tau-equivalent, i.e.,
T_i = T_j + a_ij ⇒ σ²(T_i) = σ²(T_j)
Since this is rarely ever fulfilled, Cronbach's alpha underestimates the reliability: it is a lower bound for the reliability.
How is Cronbachβs alpha a lower bound?
When Cronbach's alpha is derived, the reliability can be rewritten as:
ρ_g = A + α_gg - B
reliability = A + Cronbach's alpha - B
→ If the items/parts are essentially tau-equivalent: A = B, so that A - B = 0
→ If not, A will always be larger than B: A > B. Thus, Cronbach's alpha is a lower bound!
Note: A and B signify other parts of the derived equation that satisfy these points
How was Cronbach's alpha first introduced, and how is that relevant to how it is treated today?
Cronbach's alpha was just one of six proposed measures of reliability in the same paper, although it is sometimes treated as the only measure. It is λ3 of the λ1-λ6 proposed in that paper. Cronbach reinvented it, but it existed before.
What measure of reliability was later proposed by Woodward and Bentler (1980)?
Greatest lower bound:
Under Classical Test Theory, the variance of the test scores is given by:
σ²(X) = σ²(T) + σ²(E)
Then, the greatest lower bound (Woodward & Bentler, 1980) is given by:
GLB_gg = 1 - max(Σᵢ σ²(Eᵢ)) / σ²(X)
i.e., the maximum total error variance possible, given that σ²(T) should remain positive.
How can GLB be estimated?
The GLB can only be estimated numerically, using an algorithm (there's an R package for it).
How do 4 of these reliability estimates compare to each other in regards to size? What could be inferred from this?
λ1-λ6 were proposed; λ3 is Cronbach's alpha (C.A.).
λ1 < α_gg ≤ λ2 ≤ GLB_gg
λ1 < α_gg = λ2 = GLB_gg (for essentially tau-equivalent items)
This suggests that the GLB would be the safest to use: in the worst case it equals C.A., and in the best case it is larger. C.A. remains valuable because it follows from parallel testing, but there are other indices out there, some of which are arguably better.
Name 6 other practically useful statistics from CTT and name when they are useful
• Split-half reliability: ρ_SB = 2ρ_{X1X2} / (1 + ρ_{X1X2})
> X1 and X2 are the two halves
> If lower bounds are not meaningful, e.g., in randomized experimental trials
• Test-retest reliability: ρ_test-retest = ρ_{X1X2}
> X1 and X2 are the two administrations
> If the underlying construct is stable enough, and there are no memory effects
• Standard Error of Measurement (SEM): SEM = σ_E = σ_X √(1 - ρ_g)
> To determine a confidence interval around T_p
• Correction for attenuation (attenuation = the weakening of an observed correlation due to measurement error): ρ_{T_g T_h} = ρ_{X_g X_h} / √(ρ_g ρ_h)
> Where X_g is from one test and X_h from another
β’ Item mean
> As a measure of item difficulty
β’ Item-rest correlation
> As a measure of item discrimination
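A sketch of the first of these formulas as small functions; all input numbers below are hypothetical examples, not values from the lecture.

```python
import math

def spearman_brown(r_halves: float) -> float:
    """Split-half reliability from the correlation between the two halves."""
    return 2 * r_halves / (1 + r_halves)

def standard_error_of_measurement(sd_x: float, reliability: float) -> float:
    """SEM = sigma_X * sqrt(1 - rho_g)."""
    return sd_x * math.sqrt(1 - reliability)

def disattenuate(r_gh: float, rel_g: float, rel_h: float) -> float:
    """Correction for attenuation: estimated true-score correlation."""
    return r_gh / math.sqrt(rel_g * rel_h)

print(round(spearman_brown(0.6), 2))                        # → 0.75
print(round(standard_error_of_measurement(15.0, 0.91), 2))  # → 4.5
print(round(disattenuate(0.30, 0.8, 0.8), 3))
```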
What criticisms have there been for test theory?
The true score is nothing more than the expected value of X on test g:
X_gp = T_gp + E_gp
X_gp = E(X_gp) + E_gp
As E(X_gp) is just a statistical expectation about test g:
- The true score does not necessarily correspond to a unidimensional construct score
- Statistics from CTT depend on both the item properties and the properties of the subjects
- The true score contains irrelevant -but systematic- item specific effects
What is meant by saying that the true score does not necessarily correspond to a unidimensional construct score?
The true score is just an expected value on a test; it is a statistical quantity. Some people seem to give it some kind of magical status and see it as the construct, a dimension, or a latent variable, but it is just an expected value. If your test accurately measures a unidimensional construct, then your true score might represent it, but even then you cannot be sure.
What is meant by unidimensional constructs and why would we want to measure them?
Constructs with just one dimension, e.g., working memory, extraversion, openness to experience, as opposed to higher-order constructs with multiple dimensions, such as intelligence or emotional intelligence. The benefit of trying to measure unidimensional constructs is that a high score can be interpreted as a high level of a single attribute, rather than a possible mixture of high scores on several attributes.
Every test has a true score; you can therefore apply CTT to any test. You can calculate a sum score or a reliability, and you have applied classical test theory.
What is wrong with this?
You may not have checked whether this is the right thing to do and whether CTT is appropriate for your data. E.g., with a questionnaire of three unrelated questions you could likely still get good test-retest reliability, a sum score, etc., despite the test measuring nothing.
Alternatively, a test could be measuring two constructs. This may be observed by looking at the correlation matrix and seeing two groups of questions that correlate with each other. Cronbach's alpha and the GLB, however, sum all items, so how do you interpret this sum score, given that it has two dimensions?
What does it mean to say that βStatistics from CTT depend on both the item properties and the properties of the subjectsβ?
Each intelligence test (X_g, X_h, X_i, etc.) contains a different true score (T_g, T_h, T_i, etc.) with its own scale, depending on
- the number of items (10 items vs 1000 items means a difference of steps of 0.1 or 0.001 on your scale)
- the difficulty/discrimination of the items
- the skill of the subjects that took the test (your score in a high-ability vs a low-ability sample)
However, all of these tests are supposed to be measuring the same thing, so intuitively they should share one true score; in CTT they do not.
As a result, all statistics from classical test theory depend on
1. the properties of the test
β’ Item difficulty and item discrimination
2. the properties of the sample
β’ Mean and variance of the true scores of the subjects
How does variance affect reliability?
Say a group that does not differ much on a construct is measured accurately (e.g., UvA professors and intelligence): it will then be hard to obtain high reliability, even if you always get similar answers, because the true-score variance is factored into the reliability equation (it is the numerator).
What does it mean to say that the true score contains irrelevant -but systematic- item specific effects?
E.g in the following example
- At parties, I always talk to everybody
- I like giving talks for large audiences
- In business meetings, I am the centre of attention
- If someone hurts me, I will stand up for myself
All involve extraversion, and extraversion plays a role in each answer, but each item also carries extra noise (item-specific error) on top of the measurement error variance, for example because each item takes place in a different setting. Perhaps someone loves parties but has no experience with business meetings; perhaps someone is very involved in their work and never goes to parties. The sum score nevertheless counts the answers to all these items as part of the true score, because these item-specific effects are systematic and therefore also carry reliability.
What was proposed to deal with these criticisms?
Latent variable models
How do Latent Variable models deal with these criticisms?
They specify an explicit measurement model: a statistical model which describes the relationship between the construct and the items. This differs from CTT, where the true score is not an explicit construct but an expected value. In CTT the expected value of the score on item i equals the true score; in latent variable models the expected value depends on the latent variable.
What is meant by latent variables and item parameters?
Latent variable (person parameter):
Unobserved dimension of individual differences
that underlies all items in a test
Item parameters:
Model the item properties (comparable to e.g.,
item-rest correlation, item means)
Show mathematically how, in four important latent variable models, the latent variable enters the measurement model E(X_ip | θ_p)
θ_p refers to the latent variable / person parameter
μ_i, λ_i, α_i, β_i, π_0i, π_1i refer to item parameters
• Factor analysis:
e.g., E(X_ip | θ_p) = μ_i + λ_i θ_p
• Item response theory:
e.g., E(X_ip | θ_p) = exp(α_i θ_p + β_i) / (1 + exp(α_i θ_p + β_i))
• Latent class analysis: e.g., E(X_ip | θ_p) = π_0i^θ_p × π_1i^(1-θ_p)
• Latent profile analysis: e.g., E(X_ip | θ_p) = π_0i^θ_p × π_1i^(1-θ_p)
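The IRT measurement model above can be sketched as a small function: the expected score on item i given latent trait θ_p. The α (discrimination) and β (location) values below are illustrative assumptions, not values from the lecture.

```python
import math

def irt_expected_score(theta: float, alpha: float, beta: float) -> float:
    """Logistic IRT model: E(X_ip | theta_p) = exp(a*theta+b) / (1 + exp(a*theta+b))."""
    z = alpha * theta + beta
    return math.exp(z) / (1 + math.exp(z))

# The expected score rises with theta, and a larger alpha makes the item
# discriminate more sharply between low and high trait levels:
print(round(irt_expected_score(0.0, 1.0, 0.0), 2))  # → 0.5
print(round(irt_expected_score(1.0, 2.0, 0.0), 2))  # → 0.88
```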
What does a structural model, E(θ_p | B_p), represent?
Structural model, E(θ_p | B_p):
A statistical model describing the relation between the construct and other variables, B_p
β’ E.g., similar to a regression model, ANOVA, t-test, etc
What is the relationship between the structural model and the measurement model?
With the structural model you can take your latent variable for the construct and enter it into a regression model, ANOVA, etc. The measurement model accounts for all the measurement properties of the items, so that the inferences you make with the structural model about the latent variable do not suffer from the problems the true score suffers from.
How do latent variable models address the following criticism of CTT?
The true score does not necessarily correspond to a construct score
Latent variable models are falsifiable
• There will be no latent variable in the data of unrelated questions
• only item-specific effects, which inflate test-retest reliability
• You will be able to tell from the latent variable variance (it approaches 0)
A latent variable model with one latent variable will also not fit multidimensional data (model fit indices will indicate this).
Instead, a latent variable model with two latent variables will fit these data (model fit indices will indicate this).
How do latent variable models address the following criticism of CTT?
The true score depends on the scale of test g
In a latent variable model, test and sample properties are separated:
β’ Test properties will be captured by the item parameters
β’ Sample properties will be captured by the latent variable
Thus, all intelligence tests will be measuring the same latent variable
How do latent variable models address the following criticism of CTT?
The true score contains irrelevant but systematic
item specific effects
Recall that a latent variable is defined as: "Unobserved dimension of individual differences that underlies all items in a test". Item-specific effects do not underlie all items, so they are not absorbed into the latent variable; they end up in the item-specific (residual) terms of the model instead.
Thus summarise the advantages of latent variable models and give two disadvantages
Latent variables explicitly model the dimensionality of a test
Latent variable models are falsifiable
Latent variable models are not test and sample dependent
Latent variable models explicitly account for item specific error
But:
Require much larger sample sizes
Statistically more complex