reliability cont. Flashcards
summary of sources of measurement error
- Time Sampling (when the test is given)
- Item Sampling (which items were selected)
- Internal Consistency (whether the items are all measuring the trait of interest)
- Inter-rater Differences (whether different raters assign the same score)
explain item sampling
- DOMAIN = infinite pool of potential items
- Any test must sample items
- The process of sampling items introduces error, because we cannot be certain that we sampled the items randomly
explain alternate form
- Two parallel (alternate) forms of the same test are constructed
- Each one is the same length
- Equivalent (but not identical) items on each
- The forms (A and B) are given to the same sample of examinees on the same day
- Order is counterbalanced
- Correlation between the scores on Form A and the scores on Form B is known as the alternate form reliability
what does it mean if you have alternate form reliability
- Results are not due to error: the test is a reliable measure, and the error that is involved (item sampling) is under control
problem and solution with alternate form reliability
- Problem: it is difficult to create alternate forms for some tests
- Solution: a single test is split in half and the two halves are treated as the two forms
- This is known as the split-half method.
what is split half method
- How do we split the test?
- Odd/even numbered items
- Typically do not do first half and second half because on some tests the second half is harder
- Can’t be used with speed tests (e.g., Coding)
problem with split half reliability
- Reliability is related to test length
- All other things being equal, longer tests have higher reliability than shorter tests (more observations on longer tests, more opportunity for the +/- errors to cancel out)
- Each half is only half as long as the full test
- The split-half method will underestimate the Alternate Form Reliability (and consequently overestimate the amount of error associated with item sampling)
solution to the problem with split half reliability
Spearman-Brown Formula
- Enables us to predict what the Alternate Form Reliability would be from the Split Half Reliability
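The odd/even split and the Spearman-Brown correction can be sketched in a few lines of Python. This is only an illustration: the `responses` data are made up, and `pearson` and `split_half_reliability` are hypothetical helper names, not part of any library.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length sequences of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def split_half_reliability(responses):
    """responses: one list of item scores per examinee, items in test order."""
    odd_totals = [sum(row[0::2]) for row in responses]   # items 1, 3, 5, ...
    even_totals = [sum(row[1::2]) for row in responses]  # items 2, 4, 6, ...
    r_half = pearson(odd_totals, even_totals)
    # Spearman-Brown with n = 2: the full test is twice as long as each half
    r_full = (2 * r_half) / (1 + r_half)
    return r_half, r_full

# Made-up data: 5 examinees x 6 binary (0/1) items
responses = [
    [1, 1, 1, 1, 1, 0],
    [1, 0, 1, 1, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 0, 0],
]
r_half, r_full = split_half_reliability(responses)
print(round(r_half, 2), round(r_full, 2))
```

With these data the corrected value is higher than the raw half-test correlation, which is the underestimate the correction is meant to undo.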
Spearman Brown Formula in words
- Step 1. Calculate n (New Test Length divided by Old Test Length)
- Step 2. Multiply this by the current (“old”) test reliability
- Step 3. Subtract 1 from n (Step 1) and multiply this by the current (“old”) test reliability
- Step 4. Add 1 to Step 3.
- Step 5. Divide the result of Step 2 by the result of Step 4.
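The five steps translate directly into a small function. A minimal sketch; the function name `spearman_brown` and the example reliability of .70 are assumptions chosen for illustration.

```python
def spearman_brown(r_old, n):
    """Predicted reliability when the test length is multiplied by n.

    r_old: current ("old") test reliability
    n:     new test length divided by old test length (Step 1)
    """
    numerator = n * r_old               # Step 2
    denominator = 1 + (n - 1) * r_old   # Steps 3 and 4
    return numerator / denominator      # Step 5

# Split-half case: the full test is twice as long as each half, so n = 2.
corrected = spearman_brown(0.70, 2)
print(round(corrected, 3))  # about .824
```

Values of n above 1 model adding items; values between 0 and 1 model deleting items.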
what does the general form of Spearman Brown Formula allow us to estimate
- what the reliability of the test would be if we added items to the test
- what the reliability of the test would be if we deleted items from the test
- how many items we would have to add to the test in order to achieve a desired reliability
explain reliability and test length
- When we increase the length of the test from 100 to 120 items, the reliability INCREASES from .90 to .915
- When we decrease the length of the test from 100 to 80 items, the reliability DECREASES from .90 to .878
- Reliability is related to test length
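The two figures on this card can be checked by hand with the Spearman-Brown formula, writing r_tt for the current reliability, r_nn for the predicted reliability, and n for new length divided by old length:

```latex
r_{nn} = \frac{n \, r_{tt}}{1 + (n - 1)\, r_{tt}}

% Lengthening from 100 to 120 items: n = 120/100 = 1.2
r_{nn} = \frac{1.2 \times .90}{1 + (1.2 - 1) \times .90} = \frac{1.08}{1.18} \approx .915

% Shortening from 100 to 80 items: n = 80/100 = 0.8
r_{nn} = \frac{0.8 \times .90}{1 + (0.8 - 1) \times .90} = \frac{0.72}{0.82} \approx .878
```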
how to use SBF to Estimate how many Items to Add
Rearrange the equation and solve for n
SBF to estimate how many items to add in words
- Step 1. Subtract the current reliability (rtt) from 1
- Step 2. Multiply result of Step 1 by the desired reliability (rnn)
- Step 3. Subtract the desired reliability from 1
- Step 4. Multiply the result of Step 3 by the current reliability
- Step 5. Divide the result of Step 2 by the result of Step 4.
- Finally, multiply result of Step 5 (n) by the current test length to get the length of the test needed to get the desired reliability
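The rearranged formula can be sketched the same way. The function name `length_factor` and the example numbers (a 40-item test with reliability .80 and a target of .90) are assumptions for illustration only.

```python
def length_factor(r_current, r_desired):
    """Solve Spearman-Brown for n: how many times longer the test must be."""
    step1 = 1 - r_current        # Step 1
    step2 = step1 * r_desired    # Step 2
    step3 = 1 - r_desired        # Step 3
    step4 = step3 * r_current    # Step 4
    return step2 / step4         # Step 5

current_length = 40                      # hypothetical test length
n = length_factor(0.80, 0.90)            # about 2.25
needed_length = n * current_length       # about 90 items in total
items_to_add = needed_length - current_length
print(round(n, 2), round(needed_length), round(items_to_add))
```

In practice the needed length would be rounded up to a whole number of items.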
caution for SBF
- The items that are ADDED or ELIMINATED must not change the test
- The added items must be selected from the same domain, i.e., they must be EQUIVALENT in terms of measurement properties to the original items
- The deleted items must be deleted RANDOMLY
what is internal consistency
- A group of items (i.e., scale) is homogeneous or internally consistent when all the items are measuring the same construct equally well
- BUT items are usually not equally good measures of the construct which introduces error
how can we be sure if the total score accurately reflects the standing on the construct regardless of which specific items were passed or failed?
- The total score is an accurate measure of the construct if the scale is internally consistent, i.e., the items are interchangeable as equally good measures of the construct
- If the scale lacks internal consistency, then we can’t be sure that the total score on the scale always has the same meaning
ways of assessing internal consistency
Inter-item correlation (will not use)
Item-total correlation (will not use)
Formulas
- Kuder-Richardson
- Cronbach’s Alpha
four kinds of correlations
Pearson, point-biserial, phi coefficient, Spearman's rho
pearson correlation
continuous (interval, ratio) and continuous
point biserial
continuous and binary (nominal)
phi coefficient
binary and binary
spearman rho
ranks (ordinal) and ranks
inter-item correlation
- The correlation between each pair of items on the scale is calculated
- If there are n items on the test, then there are n(n-1)/2 unique correlations.
- These are phi coefficients (binary X binary)
The mean of these correlations is a measure of the scale’s internal consistency
item total correlation
Correlation between the item score (0, 1) and the total score on the test
- If there are n items, there will be n item-total correlations
- This is the point-biserial (binary X continuous)
The mean of these n item-total correlations is a measure of the scale’s internal consistency
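Both indices can be sketched with one Pearson helper, because the Pearson formula applied to two binary items gives the phi coefficient, and applied to a binary item and a continuous total gives the point-biserial. The data and the helper name `pearson` are made up for illustration; the item-total correlations here are the uncorrected kind, where each item is still included in the total.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length sequences of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Made-up data: 5 examinees (rows) x 6 binary items (columns)
responses = [
    [1, 1, 1, 1, 1, 0],
    [1, 0, 1, 1, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 0, 0],
]
items = list(zip(*responses))              # one tuple of scores per item
totals = [sum(row) for row in responses]   # total score per examinee

# Inter-item: one phi coefficient per unique pair, n(n-1)/2 in all
inter_item = [pearson(a, b) for a, b in combinations(items, 2)]
mean_inter_item = sum(inter_item) / len(inter_item)

# Item-total: one point-biserial per item, n in all
item_total = [pearson(item, totals) for item in items]
mean_item_total = sum(item_total) / len(item_total)

print(len(inter_item), len(item_total))
print(round(mean_inter_item, 2), round(mean_item_total, 2))
```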
when can kuder richardson be used
binary items only
explain Cronbach’s alpha (SDi^2 in the equation is the variance of scores on item i)
- Most common statistic for estimating the internal consistency of a scale
- Values normally range from 0 to 1 (a negative value is possible and signals a problem, e.g., items that need reverse scoring)
- Larger values = greater internal consistency
- Alpha is the average of all possible split-half reliabilities
- What’s the minimum acceptable value?
- Certainly not lower than 0.6
- Low Alpha = consider constructing smaller subscales (factor analysis)
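For reference, the usual textbook form of the equation is alpha = (k / (k - 1)) * (1 - sum of SDi^2 / SDx^2), where k is the number of items, SDi^2 is the variance of item i, and SDx^2 (a symbol assumed here) is the variance of the total scores. A minimal sketch with made-up data; `cronbach_alpha` is a hypothetical helper name.

```python
from statistics import pvariance

def cronbach_alpha(responses):
    """responses: one list of item scores per examinee."""
    k = len(responses[0])                                  # number of items
    items = list(zip(*responses))                          # scores grouped by item
    item_variances = [pvariance(item) for item in items]   # SDi^2 for each item
    total_variance = pvariance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)

# Made-up data: 5 examinees x 6 binary items
responses = [
    [1, 1, 1, 1, 1, 0],
    [1, 0, 1, 1, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 0, 0],
]
alpha = cronbach_alpha(responses)
print(round(alpha, 2))  # about .83
```

For binary items the population variance of an item equals p times q (proportion passing times proportion failing), so this calculation gives the same value as Kuder-Richardson formula 20.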
caution with cronbach’s alpha
- Very large values for alpha should be viewed with suspicion
- Items could be redundant, i.e., essentially identical: sampling the same behavior using different words (e.g., I am rarely sad; I am usually happy)
- In this case, the inter-item correlations will also be very high (approaching 1.00)
Redundancy is LESS likely if:
- Alpha is large
- Item-item correlations are only moderately large