Test Construction: Reliability Flashcards
Reliability Coefficient:
Range and Reliability
Range: 0.0 to +1.0
Acceptable Test Reliability: r = .80 or higher
Reliability is also known as…
Consistency, Precision
*a reliable test yields consistent, dependable results
Test-Retest Reliability: Overview
Same group
Same exam repeated
r = coefficient of stability/consistency
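A minimal sketch of how the coefficient of stability could be computed, assuming it is the Pearson correlation between scores from the two administrations. The score lists and the helper name `pearson_r` are hypothetical, chosen only for illustration.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical scores for the same five examinees at two administrations.
time_1 = [10, 12, 15, 18, 20]
time_2 = [11, 13, 14, 19, 21]

stability = pearson_r(time_1, time_2)
print(round(stability, 2))
```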
Test-Retest Reliability: Source of error
Time Sampling Factors
Alternate Forms Reliability
aka Equivalent Forms/Parallel Forms
Same group
Compare scores on alternate/equivalent form of exam
r = coefficient of equivalence
Alternate Forms Reliability: Sources of Error
Content Sampling
Time Sampling
Test-retest reliability and alternate forms reliability should NOT be used when there is a high probability of these 2 types of error
Time Sampling
Practice Effects
Internal Consistency Reliability: Two Methods
Split-Half Reliability
Cronbach’s Alpha
Internal Consistency Reliability: Split-Half Reliability
One test
One group
Test is split into equal halves so that each examinee has two scores
r = correlation between two halves/scores
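One possible way to sketch the split-half procedure, assuming an odd/even split of the items (other ways of halving the test are equally valid). The item scores and the function names are hypothetical.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def split_half_r(item_scores):
    """Correlate odd-item totals with even-item totals (one row per examinee)."""
    odd_half = [sum(row[0::2]) for row in item_scores]
    even_half = [sum(row[1::2]) for row in item_scores]
    return pearson_r(odd_half, even_half)

# Hypothetical 0/1 item scores: 5 examinees x 6 items.
scores = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]

r_half = split_half_r(scores)
print(round(r_half, 2))
```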
Internal Consistency Reliability: Split-Half reliability underestimation
Each half is shorter than the full test, and shorter tests = underestimation of reliability (fewer items sampled)
*Corrected with Spearman-Brown prophecy formula
Internal Consistency Reliability: Spearman-Brown formula
r = split-half reliability if it were based on full test
“Long Spear cuts you in half”
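A short sketch of the Spearman-Brown correction, using the general form r_new = (n * r) / (1 + (n - 1) * r), where n is the factor by which the test is lengthened (n = 2 for a split-half estimate). The function name and the example value of .70 are assumptions for illustration.

```python
def spearman_brown(r, length_factor=2.0):
    """Predicted reliability when test length is multiplied by length_factor."""
    return (length_factor * r) / (1 + (length_factor - 1) * r)

# A hypothetical split-half correlation of .70, stepped up to full test length.
full_test_r = spearman_brown(0.70)
print(round(full_test_r, 2))
```

With the default factor of 2, the corrected value is always at least as large as the split-half correlation (for positive r), which is why the uncorrected figure counts as an underestimate.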
Internal Consistency Reliability: Cronbach’s Alpha
Like split-half:
One test
One group
Looks at all items for inter-item consistency
Internal Consistency Reliability:
Cronbach’s Alpha: characteristics
Considered the lower boundary of the test’s reliability
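A rough sketch of Cronbach's alpha, assuming the usual formula alpha = k/(k-1) * (1 - sum of item variances / variance of total scores), with k items. Population variance is used for both terms here; that choice, the function name, and the ratings below are illustrative assumptions.

```python
from statistics import pvariance

def cronbach_alpha(item_scores):
    """item_scores: one row per examinee, one column per item."""
    k = len(item_scores[0])
    item_vars = [pvariance([row[i] for row in item_scores]) for i in range(k)]
    total_var = pvariance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Hypothetical 1-5 ratings: 4 examinees x 3 items.
ratings = [
    [2, 3, 3],
    [4, 4, 5],
    [1, 2, 2],
    [5, 5, 4],
]

alpha = cronbach_alpha(ratings)
print(round(alpha, 2))
```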
Internal Consistency Reliability: Kuder-Richardson Formula 20 (KR-20)
Cronbach Alpha Variation for Dichotomous scores (yes/no)
“KRonbach 2.0”
two point oh = 2 for another version of Cronbach and 2 for dichotomous
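A sketch of KR-20 for dichotomous (0/1) items, assuming the standard form KR-20 = k/(k-1) * (1 - sum(p*q) / variance of total scores), where p is the proportion answering an item correctly and q = 1 - p. The data and function name are hypothetical, and population variance is an assumption made so that p*q matches the item variance.

```python
from statistics import pvariance

def kr20(item_scores):
    """item_scores: one row per examinee, each item scored 0 or 1."""
    n = len(item_scores)
    k = len(item_scores[0])
    # p = proportion of examinees passing item i; p * (1 - p) is its variance.
    p = [sum(row[i] for row in item_scores) / n for i in range(k)]
    sum_pq = sum(pi * (1 - pi) for pi in p)
    total_var = pvariance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum_pq / total_var)

# Hypothetical right/wrong scores: 5 examinees x 6 items.
answers = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]

kr = kr20(answers)
print(round(kr, 2))
```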
Internal consistency reliability:
Sources of error
Content sampling
Heterogeneity of content
Internal Consistency Reliability:
Most Effective Application
When test measures single characteristic
When characteristic fluctuates over time
When practice effects are likely
Internal consistency reliability: when not to use
Speed test
*fixed time prevents completion of all items
Best Reliability Approach for Speed Test
Alternate Forms Reliability
Inter-Rater Reliability: Two Methods, three statistics
a.k.a. Inter-Scorer, Inter-Observer
Used when scores depend on a rater’s judgment
Two methods:
- Correlation coefficient
  - kappa (k) statistic
  - coefficient of concordance (Kendall)
- Percent agreement
Inter-Rater Reliability: When to Use Kappa Statistic (k)
Nominal or Ordinal scales
*Mnemonic: kNOw the other raters
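A minimal sketch contrasting percent agreement with kappa, assuming the two-rater form known as Cohen's kappa: kappa = (observed agreement - chance agreement) / (1 - chance agreement). The ratings and the function name are hypothetical.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters on nominal categories."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Agreement expected by chance, from each rater's category proportions.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / n ** 2
    return (observed - expected) / (1 - expected)

# Hypothetical yes/no judgments from two raters on ten cases.
rater_a = ["yes", "yes", "yes", "no", "no", "no", "yes", "no", "yes", "no"]
rater_b = ["yes", "yes", "no", "no", "no", "yes", "yes", "no", "yes", "no"]

percent_agreement = sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)
kappa = cohens_kappa(rater_a, rater_b)
print(percent_agreement, round(kappa, 2))
```

Here the raters agree on 80% of cases, but kappa is lower because half of that agreement would be expected by chance alone.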
Inter-Rater Reliability: Coefficient of Concordance
aka Kendall’s Coefficient
3 or more raters
Ranked Ratings
“Everyone ranked Kendall Jenner Pepsi Ad as horrible”
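A sketch of Kendall's coefficient of concordance (W), assuming complete rankings with no ties: W = 12 * S / (m^2 * (n^3 - n)), where m is the number of raters, n the number of items ranked, and S the sum of squared deviations of the items' rank sums from their mean. The rankings and function name are hypothetical.

```python
def kendalls_w(rankings):
    """rankings: one list of ranks (1..n, no ties) per rater."""
    m, n = len(rankings), len(rankings[0])
    rank_sums = [sum(rater[i] for rater in rankings) for i in range(n)]
    mean_sum = m * (n + 1) / 2
    s = sum((rs - mean_sum) ** 2 for rs in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))

# Hypothetical rankings of four items by three raters.
rankings = [
    [1, 2, 3, 4],
    [1, 3, 2, 4],
    [2, 1, 3, 4],
]

w = kendalls_w(rankings)
print(round(w, 2))
```

W runs from 0 (no agreement) to 1 (all raters give identical rankings).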
Inter-Rater Reliability: Sources of Error
- motivation
- biases
- characteristics of measuring device
- observer drift
Inter-rater reliability: Observer Drift
Error introduced when raters working together influence each other’s ratings so that they become more similar
Factors that affect Reliability: Test Length
Shorter test = fewer items, less data, and more error
Factors that affect Reliability: How to Increase range
heterogeneous examinees
varied degree of difficulty of items
(r is maximized when range is unrestricted)
Lowest to Highest Test Reliability
True/false (most guessing) < multiple choice < free recall (least amount of guessing)
Interpreting Reliability: the reliability coefficient
Proportion of variability that is attributable to true score variability
e.g. r = .84
84% = variability due to true score differences
16% = measurement error
Interpreting Reliability: Standard Error of Measurement
Similar to standard deviation
Useful to interpret an individual examinee’s test score
SEM is calculated from the test’s SD and reliability and is used to create a confidence interval
Standard Error and Confidence Intervals
Add and subtract SEM to Test Score to create CI
[Same percentages as SD and same math as Z score]
E.g. Test score = 100, SEM = 10:
68% CI = ±1 SEM [90-110]
95% CI = ±2 SEM [80-120]
99% CI = ±3 SEM [70-130]
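A small sketch of the arithmetic, assuming the standard formula SEM = SD * sqrt(1 - r). The function names, the SD of 15, and the reliability of .84 are illustrative assumptions; the multiples 1, 2, and 3 are the rounded values used on the card.

```python
from math import sqrt

def standard_error_of_measurement(sd, reliability):
    """SEM = SD * sqrt(1 - reliability coefficient)."""
    return sd * sqrt(1 - reliability)

def confidence_interval(score, sem, n_sems):
    """Interval of +/- n_sems standard errors around an obtained score."""
    return (score - n_sems * sem, score + n_sems * sem)

# Hypothetical test with SD = 15 and reliability r = .84.
sem = standard_error_of_measurement(15, 0.84)
print(round(sem, 2))

# The card's example: test score = 100, SEM = 10.
ci_68 = confidence_interval(100, 10, 1)
ci_95 = confidence_interval(100, 10, 2)
ci_99 = confidence_interval(100, 10, 3)
print(ci_68, ci_95, ci_99)
```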
Standard Error of the Difference
Compares ONE examinee’s performance on:
two different tests
(or same test at different times)
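A brief sketch, assuming the usual formula in which the standard error of the difference is the square root of the sum of the two squared SEMs. The function name and the SEM values of 3 and 4 are hypothetical.

```python
from math import sqrt

def standard_error_of_difference(sem_1, sem_2):
    """SE of the difference between two scores, from each score's SEM."""
    return sqrt(sem_1 ** 2 + sem_2 ** 2)

# Hypothetical SEMs for the two tests (or two administrations).
se_diff = standard_error_of_difference(3, 4)
print(se_diff)
```

Because the two error terms combine, the standard error of the difference is larger than either SEM alone.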
Factor that decreases Cronbach’s Alpha
Heterogeneity of content
Error: Content Sampling
Interaction between:
Examinee knowledge and different content on alternate form
Error: Time Sampling Factors
Random factors related to passage of time
- Difference in examinees’ anxiety, motivation, etc.
- Random variations in testing environment
Error: Carryover Effects
When scores are affected by taking more than one test or administration:
order of test administration (order effects)
repeated measurement (practice effects)