Ch. 5 - Reliability Flashcards
reliability
consistency in measurement (not good or bad, right or wrong, just consistent); the proportion of the total variance attributed to true variance
reliability coefficient
a proportion that indicates the ratio between the true score variance on a test and the total variance
concept of reliability - equation
Observed Score = True Score + Error (X = T + E)
we use ____ to describe test score variability and reliability
variance
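To make the variance idea concrete, here is a minimal Python sketch (not part of the flashcards; all numbers are invented) that simulates X = T + E and recovers reliability as the ratio of true-score variance to total observed variance:

```python
import numpy as np

# Simulate the classical test theory model X = T + E with made-up values
rng = np.random.default_rng(0)
true_scores = rng.normal(loc=100, scale=15, size=10_000)  # hypothetical true scores
error = rng.normal(loc=0, scale=5, size=10_000)           # random measurement error
observed = true_scores + error                            # X = T + E

# Reliability = true-score variance / total observed variance
reliability = true_scores.var() / observed.var()
print(round(reliability, 3))  # approx 15**2 / (15**2 + 5**2) = 0.90
```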
the proportion of the total variance attributed to true variance is
reliability
the greater the reliability…
the more true variance you are capturing relative to "noise"
measurement error
all of the factors associated with the process of measuring some variable, other than the variable being measured
error variance
variance from irrelevant, random sources
sources of error variance
test construction (content sampled, the way items are worded); test administration (environment: lighting, temperature; testtaker variables: sick, bad mood; examiner-related variables: "giving away" answers with tone of voice)
more sources of error variance
computer glitches or errors in hand-scoring; testtakers may over- or under-report
sampling error - only contacting voters with landlines
test-retest reliability
a method of reliability. obtained by correlating pairs of scores from the same people on two different administrations of the same test. use when measuring something that’s stable over time (trait)
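As a hedged illustration (the scores below are made up), a test-retest estimate is just the Pearson correlation between the two administrations:

```python
import numpy as np

# Hypothetical scores from the same seven people, two administrations apart
time1 = np.array([12, 15, 9, 20, 17, 11, 14])
time2 = np.array([13, 14, 10, 19, 18, 12, 15])

# Test-retest reliability = Pearson r between the two sets of scores
r_test_retest = np.corrcoef(time1, time2)[0, 1]
print(round(r_test_retest, 3))
```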
as the time between test administrations increases, the correlation usually…
decreases
coefficient of stability
the estimate of test-retest reliability, when the interval between testing is greater than six months
coefficient of equivalence
the degree of the relationship between various forms of a test
parallel forms (reliability)
for each form of the test, the means and variances of observed test scores are equal
alternate forms (reliability)
these don't necessarily meet the requirements of parallel forms (same means and variances) but are equivalent in terms of content, level of difficulty, etc.
parallel or alternate forms reliability
the extent to which item sampling and other errors have affected test scores on versions of the same test
how do you obtain parallel or alternate forms reliability estimates?
administer the test twice to the same group (like test-retest, but no waiting period required)
same problems: scores affected by item sampling, testtaker variables, etc.
time-consuming and expensive
estimate of inter-item consistency
degree of correlation among all items on a scale
how do you do a split-half reliability estimate?
(1) divide test into equivalent halves
(2) find Pearson r between the scores on each half
(3) adjust the half-test reliability with Spearman-Brown formula
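A sketch of these three steps in Python, assuming a made-up 0/1 item-response matrix and an odd-even split (one of several defensible splits):

```python
import numpy as np

# Hypothetical responses: 6 testtakers x 8 dichotomous items
items = np.array([
    [1, 1, 0, 1, 1, 0, 1, 1],
    [0, 1, 0, 0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0, 1, 0],
    [1, 0, 1, 1, 0, 1, 1, 1],
    [0, 1, 1, 0, 1, 1, 0, 0],
])

# (1) divide into equivalent halves (odd-even split by item position)
odd_half = items[:, 0::2].sum(axis=1)   # items 1, 3, 5, 7
even_half = items[:, 1::2].sum(axis=1)  # items 2, 4, 6, 8

# (2) Pearson r between scores on the two halves
r_half = np.corrcoef(odd_half, even_half)[0, 1]

# (3) Spearman-Brown adjustment up to full test length
r_full = (2 * r_half) / (1 + r_half)
print(round(r_half, 3), round(r_full, 3))
```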
what is a split-half reliability estimate?
a reliability estimate obtained by evaluating the internal consistency of a single test (no need for two forms or a time interval between administrations)
how should you split the test for a split-half reliability estimate?
not down the middle
randomly assign items
split odd-even
divide by content and difficulty
i.e. make mini parallel forms!
Spearman-Brown Adjustment
estimates the reliability of a whole test from the reliability of a shortened version (e.g., one half)
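The general ("prophecy") form of the formula predicts reliability when a test is lengthened or shortened by any factor; the numbers below are illustrative:

```python
def spearman_brown(r, n):
    """Predicted reliability of a test changed in length by factor n.

    r: reliability of the existing test; n: new length / old length.
    """
    return (n * r) / (1 + (n - 1) * r)

print(spearman_brown(0.70, 2))    # doubling a .70 test -> ~.82
print(spearman_brown(0.70, 0.5))  # halving it -> ~.54
```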
don’t use split-half reliability with what kind of test?
heterogeneous (measures more than one trait)
reliability usually increases as…
test length increases
alternatives to the Spearman-Brown reliability estimate (for split-half)
Kuder-Richardson (for tests with dichotomous items)
Average Proportional Distance
Cronbach’s alpha - “mean of all possible split-half correlations”
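As a hedged sketch, Cronbach's alpha can be computed directly from an item-score matrix; with dichotomous (0/1) items this same computation reduces to KR-20. The data below are simulated, not from the chapter:

```python
import numpy as np

def cronbach_alpha(items):
    """items: 2-D array, rows = testtakers, columns = item scores."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()  # sum of item variances
    total_var = items.sum(axis=1).var(ddof=1)    # variance of total scores
    return (k / (k - 1)) * (1 - item_vars / total_var)

# Simulated dichotomous items that all tap one underlying ability
rng = np.random.default_rng(1)
ability = rng.normal(size=100)
data = (ability[:, None] + rng.normal(size=(100, 8)) > 0).astype(int)
print(round(cronbach_alpha(data), 3))
```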
reliability coefficients range from
0 to 1; a negative value is possible but usually indicates a data-entry mistake
measures of reliability are subject to
error. they are estimates
a reliability coefficient may not be acceptable if
it was computed using the same test but a very different set of testtakers
what’s a good reliability?
like grades! .90 is an A, .80 is a B
if reliability is really high on a split-half estimate, what is likely the cause?
redundancy in test items
the more homogeneous a test is…
the more inter-item consistency it can be expected to have (duh)
split-half reliability, odd-even, Spearman-Brown formula, Kuder-Richardson (KR-20), alpha, and Average Proportional Distance are all methods of evaluating…
the internal consistency of a test
inter-scorer reliability
the degree of agreement or consistency between two or more scorers/judges/raters
if inter-scorer reliability is high,…
test scores can be derived in a systematic, consistent way by trained scorers
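The chapter doesn't name a specific statistic here, but one common index of inter-scorer agreement for categorical ratings is Cohen's kappa, sketched below with invented ratings:

```python
import numpy as np

def cohens_kappa(rater1, rater2):
    """Agreement between two raters, corrected for chance agreement."""
    rater1, rater2 = np.asarray(rater1), np.asarray(rater2)
    categories = np.union1d(rater1, rater2)
    p_o = np.mean(rater1 == rater2)  # observed agreement
    # chance agreement from each rater's marginal proportions
    p_e = sum(np.mean(rater1 == c) * np.mean(rater2 == c) for c in categories)
    return (p_o - p_e) / (1 - p_e)

print(round(cohens_kappa([1, 2, 2, 3, 1], [1, 2, 3, 3, 1]), 3))  # ~0.71
```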
what are the three approaches for estimating reliability?
test-retest, alternate or parallel forms, internal or inter-item consistency
what about the nature of a test might influence reliability? (5)
homogeneous vs heterogeneous test; dynamic vs static characteristics; restriction or inflation of range; speed vs power test; criterion-referenced vs norm-referenced tests
heterogeneous vs homogeneous test
measures different factors; measures one factor/trait
traditional ways of estimating reliability are often not appropriate for what kind of test?
criterion-referenced
what kind of reliability estimate is best for a heterogeneous test?
test-retest (not inter-item consistency, because that will be low)
what kind of reliability estimate is best for a measurement of dynamic characteristics?
inter-item consistency (not test-retest)
power test
has a long time limit, but some items are so hard that no testtaker will get a perfect score
speed test
must be completed within a set time limit; items are easy, but it's tough to get them all done (e.g., a typing test)
classical test theory believes that…
everyone has a "true score" on a test; that true score is very test-dependent, though
what are alternatives to classical test theory?
domain sampling theory
generalizability theory
Item Response Theory (IRT)
domain sampling theory
a test's reliability is an objective measure of how precisely the test measures its "domain" (e.g., a domain of behavior); takes issue with the true score + error = observed score model
generalizability theory
a person's test scores vary from testing to testing because of variables in the testing situation; takes issue with the true score + error = observed score model
Item Response Theory (IRT)
hundreds of varieties; items vary in many different ways, including difficulty and discrimination
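For example, in the widely used two-parameter logistic (2PL) model, an item's difficulty and discrimination jointly determine the probability of a correct response at a given ability level. The parameter values below are invented for illustration:

```python
import math

def p_correct(theta, a, b):
    """2PL item characteristic curve.

    theta: testtaker ability; a: item discrimination; b: item difficulty.
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# An average-ability testtaker (theta = 0) on a hard vs an easy item
print(round(p_correct(0.0, a=1.5, b=1.0), 3))   # hard item -> ~0.18
print(round(p_correct(0.0, a=1.5, b=-1.0), 3))  # easy item -> ~0.82
```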
what tells us how much error could be in single test score?
Standard Error of Measurement (SEM)
Standard Error of Measurement
estimates the extent to which an observed score deviates from a “true” score
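The standard formula computes the SEM from a test's standard deviation and its reliability estimate; the numbers here are hypothetical:

```python
import math

sd = 15             # standard deviation of the test's scores (hypothetical)
reliability = 0.91  # reliability estimate for the test (hypothetical)

# SEM = SD * sqrt(1 - reliability)
sem = sd * math.sqrt(1 - reliability)
print(round(sem, 2))  # 4.5
```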
the higher the reliability of a test, the ____ the SEM
lower
if a person were to take a bunch of equivalent tests, scores would be…
normally distributed with their true score at the mean
confidence interval
the range or band of scores that is likely to contain the true score
95% confidence interval - what does it mean?
we are 95% confident that the true score lies within ±2 standard errors of measurement of the observed score. 95% of this testtaker's scores would be expected to fall within this range on the distribution
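Continuing the hypothetical SEM of 4.5 from the sketch above, a 95% confidence band around an observed score looks like this (the 2 is the rounded z-value of 1.96):

```python
observed = 106  # hypothetical observed score
sem = 4.5       # hypothetical standard error of measurement

lower, upper = observed - 2 * sem, observed + 2 * sem
print(f"95% CI: {lower} to {upper}")  # 97.0 to 115.0
```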
true differences in a characteristic being measured might be from another source besides error or change from one testing to another. what might that be?
an actual difference; this might be exactly what you're looking for in psychotherapy outcome research
standard error of the difference helps you determine
whether a difference between two scores is statistically significant, rather than just measurement error
the standard error of the difference will always be ___ compared to the standard error of measurement for a score.
larger, because it reflects the measurement error contained in both scores.
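A sketch of the textbook formula, which assumes both scores come from tests on the same scale (same standard deviation); values are hypothetical:

```python
import math

sd = 15              # shared standard deviation of the two tests (hypothetical)
r1, r2 = 0.90, 0.85  # reliability estimates for the two tests (hypothetical)

# SED = SD * sqrt(2 - r1 - r2); always larger than either test's SEM
sed = sd * math.sqrt(2 - r1 - r2)
print(round(sed, 2))  # 7.5
```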