L2: Classical test theory Flashcards
Ch 5, 6, 7
what is the central statistic of classical test theory?
definition & synonyms
summed item score (sum of the scores on the items)
synonyms: sum score, test score, score on the test
what is the central idea behind classical test theory?
- every test taker has a true score on a test, which is underlying the summed item score
- true score: score that you would get using a perfect measurement instrument
- observed score will generally not equal the true score due to measurement error
define measurement error
- influences other than the true score that cause random noise in the observed score
- goal is to minimize this error to improve reliability
what are the 2 core assumptions underlying classical test theory?
assumptions & what follows
- observed scores are true scores plus measurement error: Xo = Xt + Xe
- measurement error is random
what follows:
- mean of the error = 0, because a nonzero mean would make the measurement error systematic
- correlation between true score and error = 0 (Rte = 0), because the error is random
- observed score variance = true score variance + error variance (So^2 = St^2 + Se^2) (see the simulation sketch below)
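Not from the course materials, just a minimal numpy sketch of these assumptions: simulate true scores plus random error and check that the error is uncorrelated with the true scores and that the variances add up.

```python
# minimal simulation sketch (my own illustration) of Xo = Xt + Xe with random error
import numpy as np

rng = np.random.default_rng(0)
x_true = rng.normal(loc=25, scale=5, size=100_000)  # true scores, St = 5
x_err = rng.normal(loc=0, scale=3, size=100_000)    # random error, mean 0, Se = 3
x_obs = x_true + x_err                              # Xo = Xt + Xe

print(np.corrcoef(x_true, x_err)[0, 1])             # ~0: Rte = 0
print(x_obs.var(), x_true.var() + x_err.var())      # So^2 ~= St^2 + Se^2
```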
what are the 4 ways of thinking about reliability?
as a proportion of variance:
- ratio of true score variance to observed score variance
- lack of error variance (reliable tests have minimal error variance)
as shared variance:
- correlation between observed scores & true scores (reliability is the squared correlation between these 2)
- lack of correlation between observed scores & error scores (a highly reliable test shows little correlation between observed scores & error)
how can you define reliability as a proportion of variance?
comes from So^2 = St^2 (signal) + Se^2 (noise) assumption
high reliability when most of So^2 is St^2
low reliability when most of So^2 is Se^2
reliability = signal / (signal + noise) = St^2 / (St^2 + Se^2) = St^2 / So^2
and
reliability = 1 - noise / (signal + noise) = 1 - (Se^2 / (St^2 + Se^2)) = 1 - Se^2 / So^2
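A small arithmetic sketch (made-up variances, matching the simulated St = 5 and Se = 3 above) showing that the two expressions give the same number:

```python
# reliability as a proportion of variance, with example variances St^2 = 25, Se^2 = 9
st2, se2 = 5.0**2, 3.0**2     # signal (true score variance) and noise (error variance)
so2 = st2 + se2               # observed score variance
print(st2 / so2)              # St^2 / So^2        ~0.735
print(1 - se2 / so2)          # 1 - Se^2 / So^2    same value
```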
how can you define reliability as shared variance?
low reliability if Xt (true score) shares little variance with Xo (observed score)
high reliability if Xt shares a lot of variance with Xo
reliability = correlation (Xo, Xt)^2 = Rot^2 aka the amount of variance shared by observed score and true score
and
reliability = 1- correlation (Xo, Xe)^2 = 1- Roe^2 aka 1 - the amount of variance shared by observed score and error score
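In the same made-up simulation (repeated here so the snippet runs on its own), the squared correlation between observed and true scores lands on the same value as St^2 / So^2, and 1 - Roe^2 matches it too:

```python
# reliability as shared variance, checked in a simulation (my own sketch)
import numpy as np

rng = np.random.default_rng(0)
x_true = rng.normal(0, 5, 100_000)   # true scores
x_err = rng.normal(0, 3, 100_000)    # random error
x_obs = x_true + x_err               # observed scores

print(np.corrcoef(x_obs, x_true)[0, 1] ** 2)      # Rot^2 ~ 25/34 ~ 0.735
print(1 - np.corrcoef(x_obs, x_err)[0, 1] ** 2)   # 1 - Roe^2, same value
```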
what are the 4 models to test reliability from most restrictive to least restrictive?
- parallel test (most restrictive)
- tau equivalent test
- essentially tau equivalent test
- congeneric test (least restrictive)
what are the restrictions of parallel test?
restriction on Xt1 (first test's true score): needs to = Xt2
restriction on Se1^2 and Se2^2: need to be equal to each other (the variances of the measurement errors)
implication:
- mean of Xt1 and Xt2 need to be equal
- variance of Xt1 and Xt2 need to be equal
- mean of Xo1 and Xo2 need to be equal CAN BE TESTED
- correlation between Xt1 and Xt2 = 1 (Rt1t2 = 1)
- reliability of test 1 = reliability of test 2, so also Rt1o1 = Rt2o2 (correlation between observed and true score)
- variance of observed scores on test 1 and test 2 are equal (So1^2 = So2^2) CAN BE TESTED
what is the model of observed score of test 1 and 2 according to parallel test? and of the true score?
model observed score on test 1: Xo1 = Xt1 + Xe1 and Se1^2 = Se2^2
model observed score on test 2:
Xo2 = Xt1 + Xe2 and Se2^2 = Se1^2
model for the true score:
Xt2 = Xt1
what are the 2 types of reliability based on the parallel test model?
- test retest reliability
- split halves reliability
what are the restrictions on the tau equivalent test?
restriction on Xt1 (first test's true score): needs to = Xt2
no restriction on measurement error variances
implications
- mean of Xt1 = mean of Xt2
- variance of Xt1 = variance of Xt2 (St1^2 = St2^2)
- mean of Xo1 = mean of Xo2 CAN BE TESTED
- correlation between true scores on test 1 and test 2 = 1 (Rt1t2 = 1)
what type of reliability is based on essential tau equivalent test model?
cronbachs alpha
what is the model of observed score of test 1 and 2 according to essential tau equivalent test model? and of the true score?
model for true score:
Xt2 = a + Xt1
model observed score of test 1: Xo1 = Xt1 + Xe1
model observed score of test 2: Xo2 = a + Xt1 + Xe2
what is the model of observed score of test 1 and 2 according to tau equivalent test model? and of the true score?
model for true score: Xt2 = Xt1
model observed score of test 1: Xo1 = Xt1 + Xe1
model observed score of test 2: Xo2 = Xt1 + Xe2
what are the restrictions on the essentially tau equivalent test?
restriction on Xt2: = a + Xt1 (true scores on the second test equal the true scores on the first plus a constant a)
no restriction on measurement error variances
implications:
- means of the true scores are different
- variance of true scores are equal (St1^2 = St2^2)
- correlation between true scores on test 1 and test 2 is 1 (Rt1t2 = 1)
what is the model of observed score of test 1 and 2 according to congeneric test model? and of the true score?
model for true score:
Xt2 = a + bXt1
model observed score test 1:
Xo1 = Xt1 + Xe1
model observed score test 2:
Xo2 = a + bXt1 + Xe2
what are the restrictions on the congeneric test?
Xt2 = a + bXt1
no restriction on measurement error variances
implications:
- means of the true scores are different
- variances of the true scores are different
- correlation between true scores on test 1 and 2 = 1 (Rt1t2 = 1)
what reliability measure is based on congeneric model?
omega
what are 3 methods of estimating reliability?
- alternate forms reliability
- test retest reliability
- internal consistency reliability
what is the alternate forms reliability estimation technique?
- assumes parallel test model (meaning the two forms measure the same trait with the same amount of error variance)
- apply 2 versions of the same test
- correlation between the 2 forms is the reliability
what are the main challenges with the alternate forms reliability estimation technique?
- constructing the alternate forms of the same test is hard
- carry over effects (lack of motivation, fatigue etc)
what is the test retest reliability estimation technique?
- assumes parallel test model
- apply same test twice to same group but at different times
- correlation is the reliability
- assumes the trait being measured remains stable over time (which isnt always the case)
what are the main challenges with test retest reliability technique?
- carry over effects
- change in the true score: for constructs that fluctuate, like mood, the true score might change between the 2 tests
what is the internal consistency reliability estimation technique?
- looks at how consistent the items within a single test are with each other; if the items on a test measure the same trait, the test should be internally consistent
- assumes parallel or essentially tau equivalent test model (there are multiple internal consistency techniques)
- consider (blocks of) items as separate tests
- formula will give the reliability
what are the main challenges with the internal consistency reliability technique?
carry over effects (by the end of the test you might be much better at answering, or more tired, so different parts of the test might not be as consistent with each other)
what are the 3 types of internal consistency reliability techniques?
- split half:
  - assumes parallel test model
  - split the test into 2 parts
  - formula gives reliability (see table 6.2)
- cronbachs alpha (or KR20 for binary items, or standardized alpha):
  - assumes essentially tau equivalent test model
  - each item is considered a separate part
  - formula gives reliability (see table 6.2; sketch below)
- omega:
  - assumes congeneric test model (or stricter)
  - not applied in practice yet
  - estimate the true score variance using unidimensional factor analysis
  - reliability = true score variance / observed score variance
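Cronbach's alpha itself is just a formula over the item variances and the test score variance; here is a minimal sketch with made-up data (my own code, see table 6.2 for the book's formula):

```python
# Cronbach's alpha from a respondents-by-items score matrix (illustrative sketch)
import numpy as np

def cronbach_alpha(items):
    """items: 2D array, rows = respondents, columns = items."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)       # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)   # variance of the summed test score
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

# made-up data: 5 respondents, 3 Likert items
scores = np.array([[4, 5, 4],
                   [2, 2, 3],
                   [5, 4, 5],
                   [3, 3, 3],
                   [1, 2, 2]])
print(cronbach_alpha(scores))
```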
evaluate the 4 main ways of estimating reliability: alternate forms, test retest, split half, cronbachs alpha
- alternate forms: hardly feasible in practice, only in specific situations
- test retest: important to establish reliability for new test, in research only rarely used
- split half: depends highly on the split used (undesirable), still used frequently
- cronbachs alpha/KR20: very popular in research due to its ease, though its assumption (essential tau equivalence) is rarely met; alpha is a lower bound of the reliability, so the actual reliability will be equal to or higher than cronbachs alpha
what are the COTAN guidelines about reliability for psych tests?
- tests used for high impact inferences at the individual level (ex: personnel selection, diagnosis of learning disabilities etc):
  - good: 0.9 or larger
  - sufficient: 0.8-0.9
  - insufficient: smaller than 0.8
- tests used for less impactful inferences at the individual level (descriptive use, ex: study/therapy progress, career choice tests etc):
  - good: 0.8 or larger
  - sufficient: 0.7-0.8
  - insufficient: smaller than 0.7
- tests used at the group level (ex: customer/team satisfaction, student evaluations, comparing groups etc):
  - good: 0.7 or larger
  - sufficient: 0.6-0.7
  - insufficient: smaller than 0.6
what is item discrimination?
differences in the item scores reflect differences in the construct (indicates how good the items are)
- item total correlation: correlation between item scores & sum scores (how well does an item predict the total score on the test?)
- corrected item total correlation: correlation between item scores & rest scores (how well does an item predict the other item scores, excluding the item you are looking at)
what is the problem with the item-total correlation? what resolves this?
it is biased upwards because you are (partly) correlating an item with itself
-> this problem is solved with the corrected item total correlation, as it excludes the item itself (see the sketch below)
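A small made-up example (my own sketch, not from the book) of the two correlations for a single item:

```python
# item-total vs corrected item-total correlation for item 1 of a made-up 4-item test
import numpy as np

scores = np.array([[4, 5, 4, 3],
                   [2, 2, 3, 2],
                   [5, 4, 5, 4],
                   [3, 3, 3, 5],
                   [1, 2, 2, 1]], dtype=float)

item = scores[:, 0]                     # the item being evaluated
total = scores.sum(axis=1)              # sum score, including the item
rest = total - item                     # rest score, excluding the item

print(np.corrcoef(item, total)[0, 1])   # item-total correlation (biased upwards)
print(np.corrcoef(item, rest)[0, 1])    # corrected item-total correlation (lower)
```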
what are the factors affecting reliability?
- test length
- sample heterogeneity
- the correlation between pretest and posttest scores
how does a tests reliability get affected by test length?
lengthening a test will generally increase reliability
what is the equation for a test's reliability if its length has been changed?
Rnew = (n x Roriginal) / (1 + (n-1) x Roriginal)
where n = new number of items / original number of items (see the sketch below)
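The lengthening (Spearman-Brown) formula in code, with made-up example numbers:

```python
# Spearman-Brown formula: predicted reliability after changing test length by factor n
def spearman_brown(r_original, n):
    return n * r_original / (1 + (n - 1) * r_original)

print(spearman_brown(0.70, 2))    # doubling a .70 test -> ~.82
print(spearman_brown(0.70, 0.5))  # halving it -> ~.54
```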
how does a test reliability get affected by sample heterogeneity?
in homogeneous samples, reliability will be smaller than in heterogeneous samples
R = St^2 / (St^2 + Se^2)
in homogeneous samples, St^2 will be smaller because people are relatively similar to each other
while in heterogeneous samples St^2 will be larger because people are relatively dissimilar to each other
how does a tests reliability get affected by the correlation between pretest and posttest scores?
difference score: Di = Xi (post test) - Yi (pretest): difference between post and pretest scores
difference score reliability:
Rd = (Sxo^2 x Rxx + Syo^2 x Ryy - 2 x Rxoyo x Sxo x Syo) / (Sxo^2 + Syo^2 - 2 x Rxoyo x Sxo x Syo)
in words: (posttest variance x posttest reliability + pretest variance x pretest reliability - 2 x pre-post correlation x sd of posttest x sd of pretest) / (posttest variance + pretest variance - 2 x pre-post correlation x sd of posttest x sd of pretest)
important properties:
- if the correlation between pretest and posttest is large, the difference score reliability will be small
- the difference score reliability also depends on the reliability of the pretest and the posttest
- sensitive to difference in variance between Xi and Yi
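A sketch with made-up numbers, showing how a high pretest-posttest correlation drags the difference score reliability down even when both tests are reliable:

```python
# difference score reliability (X = posttest, Y = pretest), illustrative numbers only
def diff_score_reliability(s_x, s_y, r_xx, r_yy, r_xy):
    """s_x, s_y: observed sd's; r_xx, r_yy: reliabilities; r_xy: pre-post correlation."""
    num = s_x**2 * r_xx + s_y**2 * r_yy - 2 * r_xy * s_x * s_y
    den = s_x**2 + s_y**2 - 2 * r_xy * s_x * s_y
    return num / den

print(diff_score_reliability(5, 5, 0.9, 0.9, 0.3))  # low pre-post correlation -> ~.86
print(diff_score_reliability(5, 5, 0.9, 0.9, 0.8))  # high pre-post correlation -> .50
```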
how can you estimate true scores?
- true score estimate = summed item score
- true score estimate: Xest = mean of Xo (observed scores) + Reliability x (Xo - mean of Xo), where Xo is the observed score of the person you're interested in
- based on regression to the mean
- due to unreliability, high scoring persons will likely score lower on a next test
-> the lower the reliability, the more the true score estimate is pulled toward the mean
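Sketch with made-up numbers (test mean 100, observed score 130) of how the estimate is pulled toward the mean:

```python
# regression-to-the-mean true score estimate: Xest = mean + reliability * (Xo - mean)
def true_score_estimate(x_obs, mean_obs, reliability):
    return mean_obs + reliability * (x_obs - mean_obs)

print(true_score_estimate(130, 100, 0.90))  # 127.0: barely pulled toward the mean
print(true_score_estimate(130, 100, 0.50))  # 115.0: pulled strongly toward the mean
```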
what is the standard error aka standard error of measurement?
amount of error present in an individual's score
Sem = So (sd of observed scores) * sqrt(1 - Reliability)
so higher reliability = smaller sem
lower reliability = higher sem
can be used to construct 95% confidence interval around true score estimate
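Sketch (made-up IQ-style numbers, and the usual +/- 1.96 for a 95% interval) of the SEM and the confidence interval around the true score estimate:

```python
# standard error of measurement and a 95% CI around the true score estimate
import math

def sem(sd_obs, reliability):
    return sd_obs * math.sqrt(1 - reliability)

s = sem(sd_obs=15, reliability=0.90)              # ~4.74; higher reliability -> smaller SEM
estimate = 100 + 0.90 * (130 - 100)               # true score estimate from the previous card
print(estimate - 1.96 * s, estimate + 1.96 * s)   # 95% confidence interval
```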
what is attenuation?
effect size/correlations observed will be SMALLER than the effect sizes / correlations of the true scores (because observed scores are diluted by error)
+ a correlation is smaller & less likely to be significant for a less reliable test anyway (so always consider reliability)
in other words, when measurements aren't reliable (because of measurement error), the relationships observed between variables are weakened
what is wrong w corrections for attenuation?
the corrections applied to the observed correlations in order to remove the error from them can themselves be wrong! (see the sketch below)
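The usual correction divides the observed correlation by the square root of the product of the two reliabilities, so if the reliability estimates are off, the "corrected" correlation is off too. A sketch with made-up numbers (check the book for its exact formula):

```python
# standard correction for attenuation (illustrative sketch)
import math

def disattenuate(r_observed, rel_x, rel_y):
    return r_observed / math.sqrt(rel_x * rel_y)

print(disattenuate(0.40, 0.70, 0.80))  # ~0.53
# if the reliability estimates are too low, the corrected value overshoots
# (it can even exceed 1), which is exactly how the correction goes wrong
```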
When is reliability high?
- when there is little error score variance relative to the true score variance
- when the sum of the true score variance & error variance comes close to the true score variance
- when the proportion of error variance in the observed variance is small
what is the relationship between standard error of measurement & reliability?
A smaller standard error of measurement means that there is less deviation of observed scores from true scores, so a more reliable test
Consider two tests that purport to measure the same construct. In a pilot study, a researcher finds their observed test score means to be the same, but their test score variances to not be the same. Which of the test models do these data follow?
tau equivalent test
Which test model does Cronbach’s alpha assume?
essentially tau equivalent test model or stricter
which model does the test-retest reliability assume?
parallel test model
which model does the split half reliability assume?
parallel test model
In a hypothetical dataset that contains the test scores on two tests, the true score mean and true score variance differ across the two tests. Which test model does this dataset follow?
the congeneric test model
If you find the reliability of two tests measuring the same construct to be the same, what test model do these tests follow?
parallel test model
In a hypothetical dataset that contains the test scores on two tests, the true score mean and true score variance are equal across the two tests. Which test model does this dataset follow?
parallel test model & tau equivalent test model
Say you want to assess the consistency between the observed scores of one test and those of another test. Which method for estimating reliability do you use?
alternate forms reliability
which criteria do 2 test forms need to meet, in order to legitimately use the alternate forms method of estimating reliability?
tests need to have identical true scores & identical error variance
Jimmy conducts a study into aggression, for which he uses the Aggression Questionnaire (AGQ; Buss & Perry, 1992). He wants to know how reliable the AGQ is. Therefore, he lets his respondents fill in the questionnaire again.
Which method of estimating reliability does Jimmy intend to use here?
test retest
When someone calculates Cronbach’s alpha to estimate the reliability of a test, what general method of estimating reliability is that person using?
internal consistency
For which reliability methods is it problematic if the true scores differ across the two tests?
- alternative forms
- test retest
for which reliability methods is it problematic if there are carry over effects?
- test retest
- internal consistency
- alternative forms
What is a problem that arises from using the internal consistency method for estimating reliability?
A correlation between the item’s error scores caused by carry-over effects
for which items are the following reliability measures suitable? raw alpha, KR20, standardized alpha
raw alpha: Likert scale items that do not differ in variance too much
KR20: binary items
standardized alpha: Likert scale items that differ substantially in their item variance
what is a difference between raw alpha, KR20, and standardized alpha? not concerning which items they are used on
Raw alpha and KR20 are based on the item covariances and item variances; standardized alpha only uses item correlations
In what situation is it good to use standardized alpha?
When the item variances differ a lot from each other and thus the test score mostly reflects items with high variances
What can we do to improve the reliability of a test?
Add more items to the test that are perfectly parallel to the original items
is the reliability of the test smaller in a heterogeneous sample or a homogeneous sample?
homogeneous sample
If the pretest and posttest are both reliable, the reliability of the difference scores can still be relatively small if the pretest and posttest are…
highly correlated
where can you find the split halves reliability? what about the reliability of one half?
split halves reliability: Spearman-Brown coefficient
reliability of one half: correlation between the forms
Which statistic do we use when we want to know the consistency between one item and the other items of a test?
Corrected item-total correlation
define reliability
consistency or stability of test scores across repeated applications. It’s a crucial aspect of psychometrics because it determines how much trust can be placed in test results.
what is the main assumption of the parallel test?
that 2 tests measure the same trait w equal true scores and error variances
what is the main assumption of the tau equivalent test?
that 2 tests have equal true scores (and thus equal true score variances) but can differ in error variances
What is the main assumption of the congeneric tests?
that a linear relationship between the true scores of the 2 tests exists, allowing for flexibility in error & true score variances
what is the domain sampling theory?
treats test items as a sample from a larger domain (bucket) of possible items
- the reliability of a test is the average correlation between all possible pairs of tests drawn from that domain: basically how consistent the test results would be if you made different tests by pulling out different sets of items from the bucket of all possible questions
basically this theory is saying "we want to know whether the questions we randomly picked give a reliable picture of the person's true ability, even if we swapped them out for different questions from the same big bucket"