Chapter 3: Reliability and Validity Flashcards
What is reliability?
The stability or consistency of a test
Tells us if the test provides good measurement
Tests are used in decision-making
Why is consistency in a test important?
Because an inconsistent test means:
Our test doesn’t provide a good measure of stable traits or attributes.
Basically, we could end up making bad decisions.
What is classical test theory?
A psychometric theory of measurement
Most commonly used approach to measurement in psychology.
x = T + e
x is the observed score on the test
T is the true score
e is the error of measurement
Classical Test Theory's equation for error
e = x - T
Assumptions of Classical Test Theory
- The mean error of measurement is 0
- True scores and errors are uncorrelated
- Errors on different measures are uncorrelated
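A minimal Python sketch of the x = T + e model and the first two assumptions, using made-up normal distributions (the specific means and standard deviations are illustrative assumptions, not values from the chapter):

```python
# Classical test theory sketch: observed score = true score + error.
import numpy as np

rng = np.random.default_rng(0)

T = rng.normal(loc=50, scale=10, size=1000)  # hypothetical true scores
e = rng.normal(loc=0, scale=5, size=1000)    # measurement error
x = T + e                                    # observed scores

print(round(e.mean(), 2))                 # near 0: mean error is 0
print(round(np.corrcoef(T, e)[0, 1], 2))  # near 0: T and e uncorrelated
print(np.allclose(e, x - T))              # True: e = x - T by definition
```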
Methods of Estimating Reliability
- Test-retest method
- Parallel forms
- Split-half methods
- Internal consistency methods
Test-Retest Reliability
Give the same group of people the same test at two different points in time, then correlate the two sets of scores by computing a correlation coefficient. (Here the reliability coefficient is better thought of as a stability coefficient.)
Measures the stability of scores over time.
Pearson Product Moment Correlation (r)
The most common correlation coefficient.
It is used when two sets of scores are continuous and normally distributed.
Correlation coefficients can vary from 0 (no relationship) to +1 or −1 (perfect positive or negative relationship)
A coefficient of .70 or above is generally needed for acceptable reliability
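As a sketch, the test-retest correlation can be computed with any standard statistics library; here is one way using scipy (the five score pairs are invented for illustration):

```python
# Test-retest reliability: correlate scores from two administrations.
from scipy.stats import pearsonr

time1 = [12, 18, 25, 30, 22]  # hypothetical scores, first administration
time2 = [14, 17, 27, 29, 20]  # hypothetical scores, retest

r, p = pearsonr(time1, time2)
print(f"stability coefficient r = {r:.2f}")  # want .70 or above
```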
Test-retest methods: Error & Issues
Error is due solely to measurement error
Some issues:
– Carryover effects (interval between tests)
– Memory
– Stability of construct
– Fatigue
– Reactivity (people may learn about the topic between tests)
– Motivation (people may not be motivated when taking the test a second time)
– It is difficult to determine a suitable interval between tests (wait too long and the person may have genuinely changed; retest too soon and there will be carryover effects)
Problems with method:
– Time-consuming
– Expensive
Alternate Forms Reliability (aka equivalent forms)
Give a test to a group of people; after a suitable amount of time, give them a different form of the test; then correlate the scores.
Has to be administered either at different times or in succession
Half must take Form A then Form B, and half must take Form B then Form A (counterbalancing)
Alternate forms methods: Error & Issues
Error due to test content & perhaps passage of time (if the forms are not given back to back)
Some issues:
– Need same number and type of items on each test
– Item difficulty must be the same on each test
– Variability of scores must be the same on each test
– Item sampling
– Temporal aspects
Developing an equivalent alternative test can be extremely time-consuming and sometimes impossible.
Example: it is easy to come up with equivalent tests of math knowledge, but nearly impossible to create two equal tests of depression, because only a limited number of items relate to depression while there are infinitely many math questions you can ask.
Alternate forms methods: Bonuses
Bonuses
– Shorter interval
– Carryover effects are lessened
– Reactivity is partially controlled
Split Half Methods
Give the test to a group of people, split it in half (usually odd items vs. even items), then correlate the scores on the two halves
Concerned with internal consistency
Determines to what extent the test is composed of homogeneous items.
Some psychologists think tests should be homogeneous, while others don't care whether they are homogeneous or heterogeneous; they only care how well the test works
The reliability of the split-half method
From the viewpoint of item sampling (not temporal stability), the longer the test, the higher its reliability will be
The Spearman-Brown formula (allows us to estimate the reliability of the entire test from a split-half administration):
estimated r = [k (obtained r)] / [1 + (k − 1)(obtained r)]
k is the number of times the test is lengthened or shortened
For split half tests, k is 2
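A small sketch of the formula in code (the .60 input is an arbitrary example value):

```python
# Spearman-Brown: estimate full-test reliability from a part-test r.
def spearman_brown(obtained_r, k=2):
    """k = 2 for the split-half case (a half-test 'lengthened' to full)."""
    return (k * obtained_r) / (1 + (k - 1) * obtained_r)

print(round(spearman_brown(0.60), 2))  # half-test r of .60 -> ~.75 full test
```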
Split-half methods: Error & Issues
Error due to differences in item content between the halves of the test
Some issues:
– Deciding which split-half reliability estimate to use
Split-half methods: bonuses
Bonus:
– Carryover, reactivity, and time are minimized
The Rulon Formula
Alternative to Spearman-Brown formula
estimated r = 1 − (variance of differences / variance of total scores)
Four scores are generated for each person: odd items, even items, difference (odd − even), and total (odd + even)
If scores were perfectly consistent, there would be no variance in the differences, so the "variance of differences" would be 0 and r would equal 1
The ratio of the two variances reflects the proportion of error variance; when this is subtracted from 1, we get the proportion of "true" variance, i.e., the reliability
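A sketch of the Rulon computation on a hypothetical set of odd/even half-scores for four test takers (the numbers are invented):

```python
# Rulon formula: r = 1 - var(odd - even) / var(odd + even).
import numpy as np

odd = np.array([10, 14, 8, 12])   # hypothetical sums on odd items
even = np.array([11, 13, 9, 12])  # hypothetical sums on even items

diff = odd - even    # difference scores
total = odd + even   # total scores

r = 1 - diff.var(ddof=1) / total.var(ddof=1)
print(round(r, 2))   # would be exactly 1 if the halves agreed perfectly
```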
Why do we want Variability and how do we increase it?
Variability of scores among individuals, that is, individual differences, makes statistical calculations such as the correlation coefficient possible.
For greater variability, increase the range of responses and create a test that is neither too easy nor too difficult.
The number of items also matters: a 10-item true-false scale can theoretically yield scores from 0 to 10, but a 25-item scale can yield scores from 0 to 25, and that of course is precisely the message of the Spearman-Brown formula.
Internal Consistency Methods
Examines the items.
Give the test to a group, compute the correlations among all items, take the average of these intercorrelations, and use a formula such as coefficient alpha to estimate the reliability
Two assumptions of Internal Consistency Method
First, the interitem reliability, like split-half reliability, is meaningful only if the test is made up of homogeneous items that all assess the same thing
Second, if each item were perfectly reliable, we would obtain only two possible test scores
Example: on a 100-item test you would score either 0 or 100
In the real world, items are not perfectly reliable or consistent with each other, which results in individual differences and variability
Types of internal consistency estimates
These estimate the reliability of a test based on the number of items in the test (k) and the average intercorrelation among test items.
– Coefficient Alpha: Calculates the mean reliability coefficient one would obtain for all possible split halves
Most widely used method of internal consistency
Only requires one test administration
Included in most statistical packages
Works with multi-point items; for example, the response "never" is given 5 points and "occasionally" is given 4
It is suggested that alpha should be .80 or above to be considered reliable (sometimes too harsh on short tests, because reliability increases as the number of items increases)
– Kuder-Richardson Formula 20 (K-R 20): Used with dichotomous items (right/wrong, true/false, yes/no); see the sketch below
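A sketch of coefficient alpha on a hypothetical person-by-item score matrix; applied to 0/1 items, the same computation gives K-R 20 (all data below are invented):

```python
# Coefficient alpha: k/(k-1) * (1 - sum of item variances / total variance).
import numpy as np

def coefficient_alpha(scores):
    """scores: rows = people, columns = items."""
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1).sum()
    total_var = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars / total_var)

likert = np.array([[5, 4, 5], [3, 3, 4], [2, 2, 1], [4, 5, 4]])  # 1-5 items
print(round(coefficient_alpha(likert), 2))   # compare against the .80 bar

binary = np.array([[1, 1, 1], [1, 0, 1], [0, 0, 0], [1, 1, 0]])  # 0/1 items
print(round(coefficient_alpha(binary), 2))   # K-R 20 for dichotomous items
```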
Takeaway points on reliability
– No such thing as "the" reliability; different methods assess consistency from different perspectives
– Reliability coefficients apply to data, NOT the instrument
– Any reliability is only an estimate of consistency
– Depends more on what one is trying to do with the test scores than on the scores themselves
What do all the methods of reliability stem from?
All the methods stem from the notion that a test score is composed of a "true" score plus an "error" component, and that reliability reflects the ratio of true-score variance to total (observed) score variance; if reliability were perfect, the error component would be zero.
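A closing simulation sketch of that ratio (the variance values are arbitrary assumptions): with true-score variance 100 and error variance 25, reliability should come out near 100/125 = .80.

```python
# Reliability as true-score variance over observed-score variance.
import numpy as np

rng = np.random.default_rng(1)
T = rng.normal(50, 10, 100_000)    # true scores: variance ~100
x = T + rng.normal(0, 5, 100_000)  # observed scores: variance ~125

print(round(T.var() / x.var(), 2))  # ~.80, the reliability
```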