Module 2: Norms and Reliability Flashcards
What is Classical Test Theory (CTT)?
CTT is a model for understanding measurement.
CTT is based on the True Score Model:
- For each person, the observed score on a test comprises two components: Observed score (X) = True score (T) + Error (E)
What is a true score?
True score is a person’s actual true ability level (i.e. measured without error).
What is error?
Error is the component of an observed score unrelated to the test taker's true ability or the trait being measured.
True variance and Error variance thus refer to the variability in a collection/population of test scores.
What is reliability?
Reliability refers to consistency in measurement.
- According to CTT: reliability is the proportion of total variance attributable to true variance
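As a sketch of this variance decomposition, a small simulation (hypothetical trait with mean 100, SD 15, and error SD 5) shows reliability as the ratio of true variance to observed variance:

```python
import random

random.seed(0)

# Simulate the CTT model: each observed score X = T + E.
true_scores = [random.gauss(100, 15) for _ in range(10_000)]  # hypothetical true scores
errors = [random.gauss(0, 5) for _ in range(10_000)]          # random error, mean 0
observed = [t + e for t, e in zip(true_scores, errors)]

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# Reliability = true variance / total (observed) variance.
# With SD(T) = 15 and SD(E) = 5, the theoretical value is 225 / 250 = 0.90.
reliability = variance(true_scores) / variance(observed)
print(round(reliability, 2))
```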
What is test administration error?
Test administration: variation due to the testing environment
- Test-taker variables (e.g., arousal, stress, physical discomfort, lack of sleep, drugs, medication)
- Examiner variables (e.g., physical appearance, demeanour)
What is test scoring and interpretation error?
Test scoring and interpretation:
Variation due to differences in scoring and interpretation
What are methodological errors?
Variation due to poor training, unstandardized administration, unclear questions, biased questions.
CCT True-score Model vs. Alternative
- True Score Model of measurement (based on CCT) is simple, intuitive, and thus widely used
- Another widely used model of measurement is Item Response Theory (IRT)
- CTT's assumptions are more readily met than IRT's, and it assumes only two components of measurement (true score and error)
- But, CTT assumes all items on a test have an equal ability to measure the underlying construct of interest.
Item Response Theory (IRT)
- IRT provides a way to model the probability that a person with X ability level will correctly answer a question that is ‘tuned’ to that ability level.
What does IRT incorporate and consider?
- IRT incorporates considerations of item Difficulty and discrimination
o Difficulty relates to an item not being easily accomplished, solved, or comprehended.
o Discrimination refers to the degree to which an item differentiates among people with higher or lower levels of the trait, ability, or construct being measured.
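A sketch of this idea using the two-parameter logistic (2PL) IRT model, where a is the item's discrimination and b its difficulty (the numbers below are hypothetical):

```python
import math

def p_correct(theta, a, b):
    """2PL IRT model: probability that a person with ability theta answers
    correctly an item with discrimination a and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# A person whose ability exactly matches the item's difficulty: 50% chance.
print(p_correct(theta=0.0, a=1.5, b=0.0))   # 0.5
# Ability above the item's difficulty raises the probability.
print(round(p_correct(theta=1.0, a=1.5, b=0.0), 2))   # 0.82
# A less discriminating item separates these ability levels less sharply.
print(round(p_correct(theta=1.0, a=0.5, b=0.0), 2))   # 0.62
```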
Reliability estimates
Because a person’s true score is unknown, we use different mathematical methods to estimate the reliability of tests.
Common examples include:
- Test-retest reliability
- Parallel and alternate forms reliability
- Internal consistency reliability
o E.g., split-half, inter-item correlation, Cronbach's alpha
- Interrater/interscorer reliability
Test-retest reliability
Test-retest reliability is an estimate of reliability over time
- Obtained by correlating pairs of scores from the same people on two administrations of the same test at different times
- Appropriate for stable variables (e.g., personality)
- Estimates tend to decrease as time passes
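For example (hypothetical scores), test-retest reliability is just the Pearson correlation between the two administrations:

```python
# Test-retest reliability: correlate scores from the same people on two
# administrations of the same test (hypothetical data).
time1 = [12, 15, 9, 20, 17, 11, 14, 18]
time2 = [13, 14, 10, 19, 18, 10, 15, 17]

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

test_retest = pearson_r(time1, time2)
print(round(test_retest, 2))
```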
Parallel and Alternate Forms Reliability
- Parallel forms: two versions of a test are parallel if, in both versions, the means and variances of test scores are equal
- Alternate forms: there is an attempt to create two forms of a test, but they do not meet the strict requirements of parallel forms
- Obtained by correlating the scores of the same people measured with the different forms.
Split half reliability
Obtained by correlating pairs of scores from equivalent halves of a single test administered once.
Entails three steps:
- Step 1: Divide the test into two halves
- Step 2: Correlate scores on the two halves of the test.
- Step 3: Generalise the half-test reliability to the full-test reliability using the Spearman-Brown formula.
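The three steps can be sketched with hypothetical item scores (odd- vs even-numbered items as the two halves):

```python
# One administration of a 6-item test: one row of 0/1 item scores per person
# (hypothetical data).
items = [
    [1, 0, 1, 1, 0, 1],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 1, 1],
    [0, 1, 0, 0, 1, 0],
]

# Step 1: divide the test into two halves (here: odd- vs even-numbered items).
odd_half = [sum(row[0::2]) for row in items]
even_half = [sum(row[1::2]) for row in items]

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Step 2: correlate scores on the two halves.
r_half = pearson_r(odd_half, even_half)

# Step 3: generalise to full-test reliability with the Spearman-Brown formula.
r_full = (2 * r_half) / (1 + r_half)
print(round(r_half, 2), round(r_full, 2))
```

Note that the Spearman-Brown correction always raises the half-test correlation, because the full test is twice as long as each half.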
Inter-item correlation
The degree of relatedness of items on a test; able to gauge the homogeneity of a test.
Kuder-Richardson formula 20
The statistic of choice for determining the inter-item consistency of dichotomous items.
Coefficient alpha
The mean of all possible split-half correlations, corrected by the Spearman-Brown formula. The most popular approach to internal consistency. Values range from 0 to 1.
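A minimal computation of coefficient alpha from a person-by-item matrix of hypothetical scores (with dichotomous 0/1 items, alpha is equivalent to KR-20):

```python
# Person-by-item matrix of dichotomous (0/1) scores (hypothetical data).
scores = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

k = len(scores[0])  # number of items
item_vars = [variance([row[i] for row in scores]) for i in range(k)]
total_var = variance([sum(row) for row in scores])  # variance of total scores

# Coefficient alpha = (k / (k - 1)) * (1 - sum of item variances / total variance)
alpha = (k / (k - 1)) * (1 - sum(item_vars) / total_var)
print(round(alpha, 2))
```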
Interrater/InterScorer Reliability
The degree of agreement/consistency between two or more scorers (or judges or raters).
- Often used with behavioural measures
- Guards against biases or idiosyncrasies in scoring
- Obtained by correlating scores from different raters:
o Use intraclass correlation for continuous measures
o Use Cohen’s Kappa for categorical measures
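For categorical ratings, Cohen's kappa corrects observed agreement for agreement expected by chance (the ratings below are hypothetical):

```python
# Two raters assigning categorical codes to the same eight observations
# (hypothetical ratings).
rater1 = ["yes", "yes", "no", "yes", "no", "no", "yes", "no"]
rater2 = ["yes", "no", "no", "yes", "no", "yes", "yes", "no"]

n = len(rater1)
observed = sum(a == b for a, b in zip(rater1, rater2)) / n

# Chance agreement: probability both raters pick the same category by chance.
categories = set(rater1) | set(rater2)
chance = sum((rater1.count(c) / n) * (rater2.count(c) / n) for c in categories)

kappa = (observed - chance) / (1 - chance)
print(observed, kappa)   # 0.75 0.5
```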
Choosing Reliability Estimates
The nature of the test will often determine the reliability metric, e.g.,
- Are the test items homogeneous or heterogeneous in nature?
- Is the characteristic, ability, or trait being measured presumed to be dynamic or static?
- Is the range of test scores restricted or unrestricted?
- Is the test a speed test (how many items can you complete in a set time) or a power test (items of increasing difficulty)?
- Is the test criterion-referenced (you must reach a threshold to pass)?
Otherwise, you can select whatever you think is appropriate.
How do we account for reliability in a single score?
- Our reliability coefficient tells us about error in our test in general
- We can use this reliability estimate to understand how confident we can be in a single observed score for one person
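Concretely, the reliability coefficient gives the standard error of measurement, SEM = SD × sqrt(1 − reliability), which puts a confidence interval around a single observed score (the numbers below are hypothetical):

```python
# Standard error of measurement: SEM = SD * sqrt(1 - reliability).
# All numbers below are hypothetical.
sd = 15             # test's standard deviation (an IQ-style scale)
reliability = 0.91  # the test's reliability coefficient
score = 106         # one person's observed score

sem = sd * (1 - reliability) ** 0.5

# Approximate 95% confidence interval around the observed score.
lower = score - 1.96 * sem
upper = score + 1.96 * sem
print(round(sem, 2), round(lower, 1), round(upper, 1))   # 4.5 97.2 114.8
```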
Standard Error of the Difference (SED)
The SED is a measure of how large a difference in test scores would need to be to be considered 'statistically significant'.
Helps with three questions (Note: test 1&2 must be on the same scale)
- How did Person A's performance on test 1 compare with their own performance on test 2?
- How did Person A's performance on test 1 compare with Person B's performance on test 1?
- How did Person A's performance on test 1 compare with Person B's performance on test 2?
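A sketch with hypothetical numbers: each test's SEM is SD × sqrt(1 − reliability), and the SED combines the two:

```python
# SED between two scores on the same scale:
# SED = sqrt(SEM1**2 + SEM2**2), with each SEM = SD * sqrt(1 - reliability).
# All numbers below are hypothetical.
sd = 10
r_test1, r_test2 = 0.90, 0.84

sem1 = sd * (1 - r_test1) ** 0.5
sem2 = sd * (1 - r_test2) ** 0.5
sed = (sem1 ** 2 + sem2 ** 2) ** 0.5

# A difference larger than about 1.96 * SED is 'significant' at the .05 level.
min_difference = 1.96 * sed
print(round(sed, 2), round(min_difference, 1))
```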
Standardization
is the process of administering tests to representative samples to establish norms.
Sampling
the process of selecting a portion of the intended population for the test; the population has at least one common, observable characteristic.
Stratified-random sampling
is a sampling design that divides the population into subgroups (strata) and randomly samples from each, so that every member of the population has an opportunity of being included in the sample.
Purposive sample
is arbitrarily selecting a sample believed to be representative of the population.
Incidental/convenience
sample that is convenient or available for use. May not be representative of the population.
o Generalisation of findings from convenience samples must be made with caution.
Process of developing norms:
Having obtained the normative sample:
- Administer the test with standard set of instructions
- Recommend a setting for test administration
- Collect and analyse data
- Summarize data using descriptive statistics including measures of central tendency and variability
- Provide a detailed description of the standardization and administration protocol
Types of Norms
Percentiles: the percentage of people in the normative sample whose score was below a particular raw score.
- Percentiles are popular because they are easily calculated and interpreted.
- Problem: real differences between raw scores may be minimized near ends of distribution and exaggerated in the middle of the distribution.
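Calculating a percentile rank is straightforward (the normative scores below are hypothetical):

```python
# Percentile rank: percentage of the normative sample scoring below a raw score.
normative_sample = [8, 10, 11, 11, 12, 13, 13, 14, 15, 18]  # hypothetical norms

def percentile_rank(raw, norms):
    below = sum(score < raw for score in norms)
    return 100 * below / len(norms)

print(percentile_rank(13, normative_sample))   # 50.0 (five of ten scores fall below 13)
```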
Age norms: average performance of normative sample segmented by age.
Grade norms: average performance of normative sample segmented by grade.
Subgroup: a normative sample can be segmented by any criteria initially used in selecting sample.
National norms: derived from normative sample that was nationally representative of the population.
National anchor norms: equivalency table for scores on two different tests. Allows common comparison.
Local norms: provide normative information with respect to the local population's performance on some test.
The normal curve
The normal curve is a bell-shaped, smooth, mathematically defined curve that is symmetrical about its mean.
Standard Scores
Standard score: a raw score converted from its original scale to a new scale with a predefined mean and standard deviation.
Z-score
Z-Score: conversion of a raw score into a number indicating how many standard deviation units the raw score is below or above the mean
T-scores
T-Scores: aka ‘fifty plus or minus ten scale’ – scale has set mean = 50 and standard deviation = 10
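Both conversions in one sketch (the normative mean and SD below are hypothetical):

```python
# Converting one raw score to a z-score and a T-score,
# given hypothetical normative statistics.
mean, sd = 40, 8
raw = 52

z = (raw - mean) / sd   # SD units above (+) or below (-) the mean
t = 50 + 10 * z         # T-score: mean 50, SD 10
print(z, t)             # 1.5 65.0
```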
Culture and Inference
- In selecting a test for use, responsible test users should research all available norms to check if norms are appropriate for use with your patient
- When interpreting test results, it helps to know about the culture and era of the test-taker
- It is important to conduct culturally informed assessment