Lec2 - Ch6 Empirical estimates of reliability Flashcards
Empirical estimates of reliability
why can’t reliability be calculated through real data?
reliability is a theoretical property of test scores, therefore it can only be estimated
(impossible to know the real true and error scores)
what are the three methods to estimate reliability?
- alternate forms
> two versions of the same test
- test-retest
> two times of testing
- internal consistency
> parts of the test
- see picture 1
What must we take into account regarding the methods of estimation?
- no single method is completely accurate
> the accuracy of each method depends on a variety of assumptions
- each method requires at least two testings
- consistency is at the basis of reliability
Alternate forms method
- two different forms of the same test
> compute the correlation between the two forms and interpret it as estimation of reliability
when can we use the alternate forms method?
- only if the two test forms are parallel
> identical true scores
> same error variance
> correlation = reliability
!! we can never be entirely sure that the two tests are parallel, but if they are “close enough”, then we could still use this method
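A minimal simulation sketch of why the correlation between parallel forms estimates reliability. Everything here is an illustrative assumption (sample size, score means, error SD, the names `form_a` and `form_b`); with real data the true scores are never observed, which is exactly why the correlation is needed as an estimate.

```python
import random
from statistics import mean, pvariance

def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ss_x = sum((a - mx) ** 2 for a in x)
    ss_y = sum((b - my) ** 2 for b in y)
    return cov / (ss_x * ss_y) ** 0.5

rng = random.Random(0)  # fixed seed so the sketch is reproducible
n_people = 5000

# Hypothetical simulated data: true scores are only known because we made them up.
true_scores = [rng.gauss(50, 10) for _ in range(n_people)]

# Parallel forms: identical true scores, independent errors with equal variance.
form_a = [t + rng.gauss(0, 5) for t in true_scores]
form_b = [t + rng.gauss(0, 5) for t in true_scores]

# Theoretical reliability = true-score variance / observed-score variance.
theoretical = pvariance(true_scores) / pvariance(form_a)
# Alternate-forms estimate = correlation between the two forms.
estimated = pearson(form_a, form_b)

print(f"theoretical reliability:  {theoretical:.3f}")
print(f"alternate-forms estimate: {estimated:.3f}")
```

With these settings both values should land near 100 / (100 + 25) = 0.8; the closer the forms are to parallel, the closer the two numbers agree.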
what are the disadvantages of alternate forms method?
- we cannot be sure that the tests are parallel
- we cannot be sure that the alternate forms reflect the same psychological construct
- potential for carryover or contamination effect (due to repeated testing)
> might cause error scores on one form to be correlated with error scores on the other form
- see picture 2
Test-retest method
- administering the same test twice
- correlation = reliability
> the lower the correlation, the higher the effects of measurement error
- we can be sure both testings measure the same construct (it is the same test)
- “stability coefficient”
when can we use the test-retest method?
- when the tests are parallel
- when the tests are measuring a trait-like psychological construct (stable, does not change between tests)
what are some disadvantages of test-retest method?
- carryover effects
> make the two test situations as similar as possible
- true scores can change between pre- and post-test
> some psychological attributes are unstable across time (e.g. mood)
- if traits change, the correlation reflects both the reliability and the amount of change
- many requirements (taking test twice - expensive - time-consuming - …)
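A rough sketch of the point that the test-retest correlation mixes reliability with true-score change. The data are simulated and all numbers (means, SDs, the helper `observe`) are assumptions for illustration only.

```python
import random
from statistics import mean

def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ss_x = sum((a - mx) ** 2 for a in x)
    ss_y = sum((b - my) ** 2 for b in y)
    return cov / (ss_x * ss_y) ** 0.5

rng = random.Random(1)  # fixed seed so the sketch is reproducible
n_people = 5000

def observe(true_scores, error_sd=5):
    """Observed score = true score + random measurement error."""
    return [t + rng.gauss(0, error_sd) for t in true_scores]

# Hypothetical true scores at the first testing.
true_t1 = [rng.gauss(50, 10) for _ in range(n_people)]

# Trait-like construct: true scores are identical at both testings.
stable_r = pearson(observe(true_t1), observe(true_t1))

# Transient construct: each person's true score shifts by a different amount.
true_t2 = [t + rng.gauss(0, 8) for t in true_t1]
unstable_r = pearson(observe(true_t1), observe(true_t2))

print(f"stable trait:   r = {stable_r:.3f}")
print(f"changing trait: r = {unstable_r:.3f}")
```

The measurement error is the same in both cases, yet the second correlation is lower. Read as a reliability estimate, it would understate how precisely the test measures.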
test-retest method
what are three factors affecting the confidence of the assumption of stability of traits?
- kind of attribute measured (trait-like vs transient)
- length of test-retest interval
> large intervals = large psychological changes
> short intervals = carryover or contamination effects
- period at which intervals occur
> e.g. the same interval allows different amounts of change in knowledge depending on age
Internal consistency method
- complete only one test, once
- can be used for composite test scores
- different parts of the test can be treated as different forms of a test
internal consistency method
what factors affect the reliability of test scores?
- consistency among parts of the test
> if strong correlation, then likely to be reliable
- test’s length
> long test is more likely to produce reliable scores than short test
what are the four internal consistency methods?
- split-half approach
- “raw alpha” approach
- “standardized alpha” approach
- omega
- see picture 3
Split-half estimates of reliability
- how to calculate it?
1- divide the items into two parallel subtests
> equal true scores and error variance
2- compute correlation between subtests
> if reliable, you find consistency between the two halves
3- enter correlation in formula (Spearman-Brown formula)
> we use formula because correlation is based only on halves
- see picture 4
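The three steps above can be sketched as follows. The odd/even split, the function names and the tiny data set are assumptions for illustration; the correction used is the standard Spearman-Brown split-half formula, 2r / (1 + r).

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ss_x = sum((a - mx) ** 2 for a in x)
    ss_y = sum((b - my) ** 2 for b in y)
    return cov / (ss_x * ss_y) ** 0.5

def spearman_brown(r_halves):
    """Step 3: step the half-test correlation up to full-test length."""
    return 2 * r_halves / (1 + r_halves)

def split_half_reliability(item_scores):
    """item_scores: one list of item responses per respondent."""
    # Step 1: odd/even split (one arbitrary choice among many possible splits).
    half_1 = [sum(row[0::2]) for row in item_scores]
    half_2 = [sum(row[1::2]) for row in item_scores]
    # Step 2: correlation between the two half-test scores.
    r_halves = pearson(half_1, half_2)
    return r_halves, spearman_brown(r_halves)

# Hypothetical responses of 6 people to 4 items (1-5 scale).
responses = [
    [4, 5, 4, 4],
    [2, 1, 2, 3],
    [3, 3, 4, 3],
    [5, 4, 5, 5],
    [1, 2, 1, 2],
    [3, 4, 3, 3],
]

r_halves, reliability = split_half_reliability(responses)
print(f"correlation between halves: {r_halves:.3f}")
print(f"split-half reliability:     {reliability:.3f}")
```

The corrected value is always higher than the raw correlation between halves (for positive correlations), because each half is only half as long as the full test.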
difficulties of the split-halves method
- arbitrary choice of how to split the test
> for the split not to matter, all items should be highly correlated with each other (unrealistic)
- not accurate for speeded tests
> split-half reliability is almost always close to 1 (unrealistically high)
in the table, which one is the correlation and which one is the reliability?
see picture 5
in the item-level perspective, what do the methods differ on?
- different response formats (binary vs nonbinary items)
- applicability to data for different assumptions (parallel vs less strict tests)
- different forms of data used (item variances, covariances, …)
Raw coefficient alpha
- each item is conceived as a subtest
→ consistency of all items is used to estimate the reliability of scores for the whole test
- (Cronbach’s alpha)
how do you compute Cronbach’s alpha?
1- variance of scores on complete test
2- covariance between each pair of items
3- sum the covariances
4- enter the total variance and the sum of covariances in the equation
- see picture 6
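A minimal sketch of the four steps. Picture 6 is not reproduced here, so the equation is assumed to be the usual covariance form of raw alpha, k / (k - 1) × (sum of inter-item covariances / total-score variance), where each pair of items is counted in both orders (the off-diagonal of the covariance matrix). The data and function names are hypothetical.

```python
from statistics import mean

def covariance(x, y):
    """Sample covariance (n - 1 denominator) between two score lists."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)

def raw_alpha(item_scores):
    """item_scores: one list of item responses per respondent."""
    items = list(zip(*item_scores))  # one tuple of scores per item
    k = len(items)
    # Step 1: variance of scores on the complete test.
    totals = [sum(row) for row in item_scores]
    total_variance = covariance(totals, totals)
    # Steps 2-3: covariance for every ordered pair of different items, summed
    # (so each unordered pair contributes twice).
    cov_sum = sum(
        covariance(items[i], items[j])
        for i in range(k)
        for j in range(k)
        if i != j
    )
    # Step 4: enter both values in the equation.
    return (k / (k - 1)) * (cov_sum / total_variance)

# Hypothetical responses of 6 people to 4 items (1-5 scale).
responses = [
    [4, 5, 4, 4],
    [2, 1, 2, 3],
    [3, 3, 4, 3],
    [5, 4, 5, 5],
    [1, 2, 1, 2],
    [3, 4, 3, 3],
]

alpha = raw_alpha(responses)
print(f"raw alpha: {alpha:.3f}")
```

If every item gave exactly the same scores, the sum of covariances would make up the largest possible share of the total variance and alpha would equal 1.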
Cronbach’s alpha
- lower bound to the reliability
- Cronbach’s alpha tends to underestimate reliability
- real reliability is usually equal to or higher than Cronbach’s alpha
what does it mean to have a 0 covariance between two items?
- differences among participants’ responses in item 1 are inconsistent with differences among responses in item 2
> they don’t measure the same construct (or)
> one is heavily affected by measurement error
- we would like all positive covariances among items
what does the sum of covariances indicate?
- it reflects the degree to which responses to all of the items are generally consistent with each other
- the larger the sum is, the more consistent the items are with each other
how do we make inferences from our sample to the population with Cronbach’s alpha?
- use sample’s alpha as point estimate for the population
- confidence interval (reflects that point estimate is a guess)
what is a confidence interval?
+ important facts
- represented by two values
- usually 95% C.I.
- “we are 95% confident that the alpha of the population lies between those two values”
! small samples will produce a wide and imprecise confidence interval
! negative values in C.I. are inconsistent with the concept of reliability
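A rough sketch of how sample size affects the width of a confidence interval around alpha. It uses a percentile bootstrap, which is an assumption swapped in here for illustration and not necessarily the method used in the lecture; the simulated data, sample sizes and function names are also hypothetical.

```python
import random

def variance(xs):
    """Sample variance (n - 1 denominator)."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def raw_alpha(item_scores):
    """Raw alpha in its variance form: k/(k-1) * (1 - sum item var / total var)."""
    k = len(item_scores[0])
    item_vars = sum(variance(col) for col in zip(*item_scores))
    total_var = variance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - item_vars / total_var)

def bootstrap_alpha_ci(item_scores, n_boot=500, level=0.95, seed=0):
    """Percentile bootstrap interval: resample respondents with replacement."""
    rng = random.Random(seed)
    n = len(item_scores)
    alphas = []
    for _ in range(n_boot):
        sample = [item_scores[rng.randrange(n)] for _ in range(n)]
        try:
            alphas.append(raw_alpha(sample))
        except ZeroDivisionError:
            continue  # skip degenerate resamples with no total-score variance
    alphas.sort()
    tail = (1 - level) / 2
    lower = alphas[int(tail * len(alphas))]
    upper = alphas[int((1 - tail) * len(alphas)) - 1]
    return lower, upper

def simulate(n_people, n_items, rng):
    """Hypothetical data: each item = person's true score + independent error."""
    rows = []
    for _ in range(n_people):
        t = rng.gauss(0, 1)
        rows.append([t + rng.gauss(0, 1) for _ in range(n_items)])
    return rows

data_rng = random.Random(42)
small_sample = simulate(20, 5, data_rng)
large_sample = simulate(300, 5, data_rng)

small_lo, small_hi = bootstrap_alpha_ci(small_sample)
large_lo, large_hi = bootstrap_alpha_ci(large_sample)
print(f"n = 20:  alpha = {raw_alpha(small_sample):.2f}, 95% C.I. [{small_lo:.2f}, {small_hi:.2f}]")
print(f"n = 300: alpha = {raw_alpha(large_sample):.2f}, 95% C.I. [{large_lo:.2f}, {large_hi:.2f}]")
```

The interval for the small sample should come out clearly wider than the one for the large sample, which is the imprecision the card warns about.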