Reliability Flashcards
Classical Test Theory
definition & formula
Measurement theory of how test scores relate to a construct (or concept that the measure is trying to get at).
X = T + e
X: observed score
T: true score
e: measurement error
CTT Error
Assumes random error of measurement, not systematic
3 CTT Assumptions
- Expected value of error (e) is 0, i.e., error is random.
- T and e are not correlated.
- Error at time 1 is not correlated with error at time 2.
Path Diagram Element: Box
Observable variable
Path Diagram Element: Circle
Latent (unobservable) variable
Path Diagram Element: single-headed straight arrow
Regression path; one variable influences the other
Path Diagram Element: curved double-headed arrow
Covariance (association) between two variables; unstandardized, so it depends on the scales of the variables (unlike correlation, which is standardized and bounded between -1 and 1)
Index of Reliability
- Square root of rxx
- Factor loading
- Estimate of the correlation between true scores and observed scores
Coefficient of Reliability
- rxx
- Estimated correlation between X1 and X2
- Association between a measure and itself over time (or with another measure)
Test-Retest Reliability
- For a stable construct in which the correlation between T across time is 1, we estimate the reliability as the correlation between observed scores at time 1 and time 2
- The extent to which the time 1 and time 2 scores do not correlate perfectly (distance from 1) is attributed to measurement error (rather than a change in the true score)
Systematic Error
- Can be either positive or negative
- Influences scores consistently for a person or sample; same value every time
- Affects mean of scores, not variability of scores = biased estimate of average
- Decreases accuracy of group-level and individual scores
Random Error
- Expected value = 0
- Errors occur due to chance, do not have consistent effect on individual/sample
- Affect variability (noise around the mean), do not affect the mean
- Large number of observations cancels out random errors
- Group-level means are accurate, but individual scores are less precise
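The "cancels out" point can be demonstrated with made-up numbers: a single noisy observation may miss the true score by a lot, but the mean of many noisy observations lands close to it.

```python
import random
import statistics

# Sketch: random error averages toward zero over many observations.
random.seed(7)
true_score = 70.0
single = true_score + random.gauss(0, 10)  # one noisy measurement; may be far off
many = [true_score + random.gauss(0, 10) for _ in range(10_000)]
mean_many = statistics.mean(many)          # noise cancels out at the group level

print(round(mean_many, 1))                 # close to the true score of 70
```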
Within-Person Random Error
No person or group level bias; means approximate T
Within-Person Systematic Error
Positive bias for individual
Between-Person Random Error
Person-level bias, group approximates T, group variance is inflated
Between-Person Systematic Error
Increases bias for each person, group mean is higher than T
Reliability
- Consistency, repeatability
- equal to (true score variance) / (observed score variance)
- Can only be estimated because we only have observed scores, not T or e
- Coefficient of reliability: inversely related to measurement error; depends on variance of scores
- Increases with greater # of items (because we’re averaging)
- Shortcut: test many people over time
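The variance-ratio definition and the "more items" point can be illustrated together in a simulation (only possible here because we generate T ourselves; with real data T is unobservable): averaging more items shrinks the error variance, so var(T)/var(X) rises.

```python
import random
import statistics

# Hypothetical: true scores with SD 1; each item = T + independent error (SD 1).
random.seed(5)
n = 10_000
T = [random.gauss(0, 1) for _ in range(n)]

def reliability(k):
    # Observed score = mean of k items; error variance shrinks as 1/k.
    X = [t + statistics.mean(random.gauss(0, 1) for _ in range(k)) for t in T]
    return statistics.variance(T) / statistics.variance(X)

r1, r10 = reliability(1), reliability(10)
# Expected roughly 1/(1+1) = 0.50 with 1 item, 1/(1+0.1) = 0.91 with 10 items
print(round(r1, 2), round(r10, 2))
```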
Standard Error of Measurement
- Estimates the extent to which an observed score deviates from T (true score)
- About 95% of the time, T is expected to fall within +/- 2 SEoM of the observed score
- Higher reliability = lower SEoM
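The standard formula is SEM = SD × sqrt(1 − r_xx). A short worked example with hypothetical values (score SD of 15, reliability of 0.91):

```python
import math

# Hypothetical inputs: observed-score SD and an estimated reliability.
sd_x = 15.0
r_xx = 0.91
sem = sd_x * math.sqrt(1 - r_xx)   # SEM = SD * sqrt(1 - reliability)

observed = 100.0                   # a person's observed score
lo, hi = observed - 2 * sem, observed + 2 * sem   # ~95% band for T
print(round(sem, 1), round(lo, 1), round(hi, 1))  # 4.5 91.0 109.0
```

Note how higher reliability shrinks the band: at r_xx = 0.99 the SEM would drop to 1.5.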
Test-Retest Reliability & Types
- Consistency of scores across time, typically 2 weeks apart
1. Relative (coefficient of stability, dependability)
2. Absolute (coefficient of repeatability)
Coefficient of Stability
- Relative measure of test-retest reliability
- Pearson’s correlation between T1 and T2 over time (days/weeks)
Coefficient of Dependability
- Relative measure of test-retest reliability
- Pearson’s correlation between T1 and T2 immediately (minutes)
Coefficient of Repeatability
- Absolute measure of test-retest reliability
- Reflects consistency of scores across time by defining a range in which 95% of score differences are expected to be
- Higher CR = greater unreliability
- Smaller CR = more consistent scores = stronger reliability
- Uses a Bland Altman Plot
Bland Altman Plot
- Used to demonstrate (absolute test-retest) coefficient of repeatability
- Bias (aka systematic measurement error) plus and minus the CR define the upper and lower limits of agreement (LOA)
  - Bias = mean difference between T1 and T2 scores for all subjects
    - If closer to 0, indicates stronger absolute test-retest reliability at the group level
- X-axis: mean of scores across time
- Y-axis: difference between T1 and T2 scores
- Line of identity: y=0, meaning perfect consistency across time
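The quantities behind the plot can be computed directly. A minimal sketch with made-up time-1/time-2 scores for six subjects, using the common convention CR = 1.96 × SD of the differences:

```python
import statistics

# Hypothetical time-1 and time-2 scores for six subjects.
t1 = [10.0, 12.0, 11.0, 14.0, 13.0, 12.0]
t2 = [11.0, 11.0, 12.0, 13.0, 14.0, 12.0]

diffs = [a - b for a, b in zip(t1, t2)]        # y-axis of the Bland-Altman plot
means = [(a + b) / 2 for a, b in zip(t1, t2)]  # x-axis of the plot

bias = statistics.mean(diffs)                  # systematic error across subjects
cr = 1.96 * statistics.stdev(diffs)            # coefficient of repeatability
loa = (bias - cr, bias + cr)                   # 95% limits of agreement
print(round(bias, 2), round(cr, 2))
```

A bias near 0 with a narrow CR would indicate strong absolute test-retest reliability.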
Inter-rater Reliability
- Consistency of scores across raters
- Uses intraclass correlation for continuous data
- Uses Cohen’s kappa for binary/categorical variables
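Cohen's kappa corrects raw agreement for agreement expected by chance. A sketch with made-up binary ratings from two raters:

```python
# Hypothetical yes/no judgments from two raters on ten cases.
r1 = ["yes", "yes", "no", "no", "yes", "no", "yes", "yes", "no", "yes"]
r2 = ["yes", "no", "no", "no", "yes", "no", "yes", "yes", "yes", "yes"]

n = len(r1)
p_observed = sum(a == b for a, b in zip(r1, r2)) / n   # raw agreement

# Chance agreement: product of each rater's marginal proportions,
# summed over the categories.
p_chance = sum(
    (r1.count(c) / n) * (r2.count(c) / n) for c in set(r1) | set(r2)
)
kappa = (p_observed - p_chance) / (1 - p_chance)
print(round(kappa, 2))
```

Here raw agreement is 0.80, but chance agreement is 0.52, so kappa is substantially lower than the raw percentage.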
Intra-rater Reliability
Consistency of scores within the same rater
Parallel-Forms Reliability
- Consistency of scores across two parallel forms (two equivalent measures of the same construct)
- Can be immediate or delayed
- Coefficient of equivalence: Pearson’s correlation, ratio of true score variance to observed score variance
- Controls for specific error (error that is particular to a specific measure)
- Assumes that parallel forms are equivalent, which requires same T and variability
Coefficient of Equivalence
- Measure of parallel-forms reliability
- Pearson’s correlation, ratio of true score variance to observed score variance
Internal Consistency
- Consistency of scores across the items within a measure
- Necessary but insufficient for unidimensionality
- Use split-half reliability (Spearman-Brown prediction, Cronbach’s alpha, Omega)
Split-Half Reliability
- Used to measure internal consistency
- Randomly take half of items on a measure and relate scores to the other half
- But CTT states that fewer items = less reliable…
- So Spearman-Brown prediction formula predicts reliability after changing test length
Cronbach’s Alpha
- Approximately equal to the mean reliability estimate of all possible split-halves
- Affected by:
  - # of items; more items = inflated alpha
  - Variance of item scores; if low variance/SD, it underestimates internal consistency
  - Violations of assumptions (that items equally relate to the construct (tau equivalence) and that the scale is unidimensional)
- Omega is better option
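Alpha can be computed directly from an item-score matrix via alpha = k/(k−1) × (1 − Σ item variances / variance of totals). A sketch with a small made-up dataset (5 respondents × 3 items):

```python
import statistics

# Hypothetical item scores: one row per respondent, one column per item.
scores = [
    [4, 5, 4],
    [2, 3, 2],
    [5, 4, 5],
    [3, 3, 4],
    [1, 2, 2],
]
k = len(scores[0])                      # number of items
items = list(zip(*scores))              # one tuple per item (column)
totals = [sum(row) for row in scores]   # total score per respondent

item_var = sum(statistics.variance(col) for col in items)
alpha = (k / (k - 1)) * (1 - item_var / statistics.variance(totals))
print(round(alpha, 2))
```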
Types of Omega & Uses
- Used to estimate reliability of split-halves; better option than Cronbach’s Alpha
- Omega total for continuous, unidimensional data
- Hierarchical omega for continuous, multi-dimensional data
- Categorical omega for categorical data
Reliability Type Rankings
delayed parallel-forms < test–retest, inter-rater < immediate parallel-forms, internal consistency, intra-rater
Why High Test-Retest Does Not Guarantee Reliability
A high level of test-retest reliability does not by itself mean a measure is reliable: systematic error does not reduce stability or repeatability coefficients; in fact, it artificially increases within-person stability!
Generalizability Theory
- Alternative to CTT because it views sources of measurement error (facets) as conditions that influence a score
- Examines extent to which scores are consistent across a specific set of conditions
- Takes into account multiple sources of error (facets) simultaneously
- “UNIVERSE SCORE” = person’s true score across all conditions in universe
- Measures reliability, validity via GENERALIZABILITY COEFFICIENT (relative)