Module 2: Norms and Reliability Flashcards
Classical Test Theory (CTT):
is a model for understanding measurement.
CTT is based on the True Score Model. See Notes.
True score:
is a person’s actual true ability level (i.e., measured without error).
Error:
is the component of the observed score that is unrelated to the test-taker’s true ability or the trait being measured.
True variance and error variance:
thus refer to the portions of variability in a collection/population of test scores that are attributable to true scores and to error, respectively.
Reliability:
refers to consistency in measurement. See Notes.
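A minimal sketch (with made-up numbers) of the True Score Model, X = T + E, and of reliability as the ratio of true-score variance to observed-score variance:

```python
import numpy as np

rng = np.random.default_rng(0)

# True Score Model: observed score X = true score T + random error E.
# In practice T is unknown; here we simulate it to illustrate the definitions.
n_people = 1000
true_scores = rng.normal(loc=100, scale=15, size=n_people)   # T
error = rng.normal(loc=0, scale=5, size=n_people)            # E (random error)
observed = true_scores + error                                # X

# Reliability = true variance / total (observed) variance
reliability = true_scores.var() / observed.var()
print(round(reliability, 2))  # ~0.90 with these made-up parameters
```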
Systematic error:
affects everyone in the same way, e.g., being in a noisy classroom made everyone perform 10 points worse.
Random error:
is unrelated to the person’s true score or to the testing environment. There is nothing you can do about random error; you just need to be aware it is there. Reliability estimates are concerned with random error, because a systematic error that affects everyone treats everyone the same.
Sources of Measurement Error:
- Test Construction: Variation due to differences in items on the same test or between tests (i.e., item/content sampling).
- Test Administration: Variation due to testing environment.
• Test taker variables (e.g., arousal, stress, physical discomfort, lack of sleep, drugs, medication).
• Examiner variables (e.g., physical appearance, demeanour).
- Test scoring and interpretation: Variation due to differences in scoring and interpretation.
- Sampling error: Variation due to representativeness of sample.
• The larger the sample, the smaller the sampling error.
- Methodological errors: Variation due to poor training, unstandardized administration, unclear questions, biased questions.
Item Response Theory (IRT).
IRT provides a way to model the probability that a person with X ability level will correctly answer a question that is “tuned” to that ability level.
IRT incorporates considerations of item Difficulty and Discrimination.
o Difficulty: relates to an item not being easily accomplished, solved, or comprehended.
o Discrimination: refers to the degree to which an item differentiates among people with higher or lower levels of the trait, ability, or construct being measured.
You want items varying in degree of difficulty (e.g., speeding fine vs. heroin use) and items that discriminate between levels of the trait (a speeding fine reflects a low level of the trait; heroin use reflects a high level of the trait).
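A rough sketch of how an IRT item response function behaves, using a two-parameter logistic model; the “speeding fine” and “heroin use” items and all parameter values here are invented for illustration:

```python
import numpy as np

def p_endorse(theta, a, b):
    """2PL model: probability that a person with trait level theta endorses the item,
    where b = item difficulty and a = item discrimination."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

theta = np.linspace(-3, 3, 7)            # trait level (e.g., risk taking)
print(p_endorse(theta, a=1.5, b=-1.0))   # "speeding fine": easy item, endorsed even at low trait levels
print(p_endorse(theta, a=1.5, b=2.0))    # "heroin use": hard item, endorsed only at high trait levels
```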
CTT True Score Model vs. Alternatives
- The True Score Model of measurement (based on CTT) is simple, intuitive, and thus widely used.
- Another widely used model of measurement is Item Response Theory (IRT).
o CTT assumptions are more readily met than IRT’s, and CTT assumes only two components of measurement (true score and error).
o But CTT assumes all items on a test have equal ability to measure the underlying construct of interest.
o E.g., a test measures risk taking: “Have you ever had a speeding fine?” (risk taker) and “Have you ever used heroin?” (risk taker). A speeding fine is far more common than heroin use, yet under Classical Test Theory those two items would be treated as equivalent indicators of risk-taking behaviour, which in reality is not a fair assumption.
o Item Response Theory can address this by examining items individually and seeing how each item performs with respect to the construct being measured.
Reliability Estimates:
Because a person’s true score is unknown, we use different mathematical methods to estimate the reliability of tests.
Common examples include:
• Test-retest reliability
• Parallel and alternate forms reliability
• Internal consistency reliability
o E.g., split half, item correlation, Cronbach’s alpha
• Interrater/interscorer reliability
Test-retest reliability:
is an estimate of reliability over time.
- Obtained by correlating pairs of scores from the same people on administrations of the same test at different times (as sketched below).
- Appropriate for stable variables (e.g., personality, NOT mood); if the construct is meant to stay the same over, say, 1 week, then it is fine.
- Estimates tend to decrease as time passes.
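A small sketch of the test-retest calculation, with invented scores for the same people at two time points:

```python
import numpy as np

# Test-retest reliability: correlate scores from the same people at two times.
time1 = np.array([12, 18, 25, 30, 22, 15, 28, 20])  # made-up scores, time 1
time2 = np.array([14, 17, 24, 31, 20, 16, 27, 22])  # made-up scores, time 2 (same people)

r_test_retest = np.corrcoef(time1, time2)[0, 1]
print(round(r_test_retest, 2))
```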
Parallel forms:
Two versions of a test are parallel if, in both versions, the means and variances of test scores are equal. E.g., in neuropsychology we might want to test the same thing twice; we can’t use the exact same test because the participant might remember the answers. The forms have to be equivalent for this to work, otherwise you are testing separate things on separate tests.
• More strict
Alternate forms:
there is an attempt to create two forms of a test, but they do not meet the strict requirements of parallel forms.
• Obtained by correlating the scores of the same people measured with the different forms.
Internal consistency measures:
Split half reliability
Inter-item consistency/correlation
Kuder-Richardson formula 20
Coefficient alpha
Split half reliability:
obtained by correlating two pairs of scores obtained from equivalent halves of a single test administered once.
Entails 3 steps:
1. Divide the test into two halves.
2. Correlate scores on the two halves of the test.
3. Step up the half-test reliability to the full-test reliability using the Spearman-Brown formula (see the sketch below).
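A sketch of the three steps on a made-up item-score matrix, using an odd-even split:

```python
import numpy as np

# Made-up item scores: rows = people, columns = items
scores = np.array([
    [1, 1, 0, 1, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0],
    [1, 0, 1, 1, 1, 1],
])

# Step 1: divide the test into two halves (here, odd-numbered vs. even-numbered items).
half_a = scores[:, 0::2].sum(axis=1)
half_b = scores[:, 1::2].sum(axis=1)

# Step 2: correlate scores on the two halves.
r_half = np.corrcoef(half_a, half_b)[0, 1]

# Step 3: step up to the full-test reliability with the Spearman-Brown formula.
r_full = (2 * r_half) / (1 + r_half)
print(round(r_half, 2), round(r_full, 2))
```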
Inter-item consistency/correlation:
the degree of relatedness of items on a test. Used to gauge the homogeneity of a test.
Kuder-Richardson formula 20:
statistic of choice for determining the inter-item consistency of dichotomous items.
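A minimal sketch of the KR-20 calculation on made-up dichotomous (0/1) item scores:

```python
import numpy as np

# KR-20 = (k / (k - 1)) * (1 - sum(p*q) / var(total)), for dichotomous items.
items = np.array([
    [1, 1, 0, 1],
    [0, 1, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 0, 1, 1],
])  # made-up data: rows = people, columns = items

k = items.shape[1]
p = items.mean(axis=0)                       # proportion passing each item
q = 1 - p                                    # proportion failing each item
total_var = items.sum(axis=1).var(ddof=1)    # variance of total test scores
kr20 = (k / (k - 1)) * (1 - (p * q).sum() / total_var)
print(round(kr20, 2))
```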
Coefficient alpha:
the mean of all possible split-half correlations, corrected by the Spearman-Brown formula. The most popular internal consistency estimate. Values range from 0 to 1.
Cronbach’s Alpha
- Cronbach’s Alpha is often incorrectly used.
- It is a lower-bound estimate of reliability.
- It is not a measure of unidimensionality.
- I.e., it is a function only of the number of items and the average inter-item correlation.
- Only use it if you think the items on your test are homogeneous. If your items are not similar (do not measure the same thing), then don’t use Cronbach’s alpha.
- You need factor analysis to examine the underlying correlation structure between items (a computational sketch of alpha follows below).
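A sketch of coefficient (Cronbach’s) alpha computed from a made-up item-score matrix; in practice you would first check that the items are homogeneous:

```python
import numpy as np

def cronbach_alpha(items):
    """Coefficient alpha: k/(k-1) * (1 - sum of item variances / variance of total scores)."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Made-up Likert-type responses: rows = people, columns = items
data = np.array([
    [4, 5, 4, 3],
    [2, 2, 3, 2],
    [5, 5, 4, 5],
    [3, 2, 2, 3],
    [4, 4, 5, 4],
])
print(round(cronbach_alpha(data), 2))
```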
Interrater/interscorer reliability:
is the degree of agreement/consistency between two or more scorers (or judges or raters).
• Often used with behavioural measures (e.g., the Strange Situation procedure).
• Guards against biases or idiosyncrasies in scoring.
• Obtained by correlating scores from different raters:
• Use intraclass correlation for continuous measures.
• Use Cohen’s kappa for categorical measures.
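A sketch of Cohen’s kappa for two raters’ categorical codes; the ratings below are invented:

```python
import numpy as np

# Cohen's kappa = (observed agreement - chance agreement) / (1 - chance agreement)
rater1 = np.array(["secure", "avoidant", "secure", "secure", "resistant", "avoidant"])
rater2 = np.array(["secure", "avoidant", "secure", "avoidant", "resistant", "avoidant"])

categories = np.unique(np.concatenate([rater1, rater2]))
p_observed = np.mean(rater1 == rater2)                                   # raw agreement
p_chance = sum(np.mean(rater1 == c) * np.mean(rater2 == c) for c in categories)
kappa = (p_observed - p_chance) / (1 - p_chance)
print(round(kappa, 2))
```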
Choosing Reliability Estimates:
We will never know the true score.
The nature of the test will often determine the reliability metric, e.g.:
• Are the test items homogeneous or heterogeneous in nature?
• Is the characteristic, ability, or trait being measured presumed to be dynamic or static?
• The range of test scores is or is not restricted.
• The test is a speed or power test.
• Speed: how fast you can do the questions.
• Power: increasing difficulty across the test. (A simple first-half vs. second-half split is not good for a power test, because the two halves differ in difficulty.)
• The test is or is not criterion-referenced.
• Criterion-referenced: in order to pass, you need to reach a certain threshold.
…Otherwise, you can select whatever you like.
The Standard Error of Measurement (SEM):
provides a measure of the precision of an observed test score, i.e., an estimate of the amount of error in an observed score.
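A common formula is SEM = SD × √(1 − reliability); a small sketch with made-up values:

```python
import numpy as np

# SEM: the standard deviation of the error around an observed score.
sd = 15               # test standard deviation (made-up, IQ-style scale)
reliability = 0.90    # made-up reliability estimate
sem = sd * np.sqrt(1 - reliability)

# A rough 95% confidence band around an observed score of 110:
observed = 110
print(round(sem, 2), (round(observed - 1.96 * sem, 1), round(observed + 1.96 * sem, 1)))
```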
The Standard Error of the Difference (SED):
is a measure of how large a difference between two test scores must be before it can be considered “statistically significant”. See Notes.
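The SED can be computed from the SEMs of the two scores; a small sketch with made-up reliabilities:

```python
import numpy as np

# SED = sqrt(SEM1**2 + SEM2**2): how large a difference between two scores needs to be
# before it is unlikely to reflect measurement error alone.
sd = 15
sem_test1 = sd * np.sqrt(1 - 0.90)   # made-up reliability for test/score 1
sem_test2 = sd * np.sqrt(1 - 0.85)   # made-up reliability for test/score 2
sed = np.sqrt(sem_test1**2 + sem_test2**2)

# A difference larger than about 1.96 * SED would be called "statistically significant" (p < .05).
print(round(sed, 2), round(1.96 * sed, 2))
```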