Reliability and coefficient alpha Flashcards
What is reliability?
Reliability is the desired consistency or reproducibility of test scores.
- A measure of the extent of error present in a test.
- The degree to which a test produces consistent results under similar conditions
Do we assume there is always some error in measurement?
Yes, and that error is random.
What do we assume leads to the differences in a person’s score on a test?
Measurement error. It is unlikely that a person’s true score will change every time they take a test.
What do we expect the distribution of scores to be for a test?
Normal distribution.
What four assumptions underlie classical test theory?
- Each person has a true score we could obtain if there were no measurement error.
- There is measurement error, but this error is random.
- The true score of an individual doesn’t change with repeated applications of the same test even though their observed score does.
- The distribution of random errors (and thus observed test scores) will be the same for all people.
What is the domain sampling model?
- It is another central concept of classical test theory.
- If we construct a test on something, we can’t ask all possible questions, so we only use a few test items (a sample)
- Using fewer test items can lead to the introduction of error. We need to determine whether the test items adequately sample the domain or construct.
What is the point of reliability analysis?
Reliability analysis is conducted to ascertain how much error we would make by using a score from a shorter test as an estimate of someone’s true ability.
What are three things to note regarding reliability analysis?
- Reliability = variance of true scores / variance of observed scores (the proportion of observed-score variance that is due to true scores).
- Observed test scores should be correlated with the true score.
- As the sample of items gets larger, the estimate becomes more accurate.
- It would be easy to work out reliability if we had the true score (a simulation of this identity is sketched below).
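A minimal simulation sketch of this variance-ratio definition, using made-up numbers (true scores with SD 10, random error with SD 5, so the expected reliability is 100/125 = 0.8):

```python
import numpy as np

rng = np.random.default_rng(0)

n_people = 10_000
true = rng.normal(loc=50, scale=10, size=n_people)   # unobservable true scores
error = rng.normal(loc=0, scale=5, size=n_people)    # random measurement error
observed = true + error                              # observed = true + error

# Reliability: proportion of observed-score variance due to true scores
reliability = true.var() / observed.var()

# Equivalently, the squared correlation between observed and true scores
r_xt = np.corrcoef(observed, true)[0, 1]

print(f"reliability         = {reliability:.3f}")   # close to 0.8
print(f"r(observed, true)^2 = {r_xt ** 2:.3f}")     # close to 0.8
```

This also shows why observed scores should correlate with the true score: the squared correlation between them recovers the same reliability value.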
What can affect reliability measurements?
- Different ways of measuring reliability are sensitive to measurement error.
- We consider various sources of measurement error.
What is “Standard Error of Measurement”?
- We can estimate how much measurement error we have by working out how much, on average, an observed score on our test differs from the true score.
- We know that a person’s observed score differs from their true score, and that their true score is unknowable. But we can calculate the range in which a person’s true score should fall by calculating the Standard Error of Measurement.
What is the formula for standard error of measurement?
SEM = SD * sqrt(1 - r), where:
- SD is the standard deviation of the test scores
- r is the reliability of the test.
What do we do once we know the SEM?
- We can use it to create confidence intervals.
- The z-score for a 95% confidence interval is 1.96.
- Lower bound = x - (1.96 * SEM)
- Upper bound = x + (1.96 * SEM)
where x is the person’s score on the test.
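A worked sketch using hypothetical numbers: an IQ-style test with SD = 15, reliability r = 0.90, and a person's observed score of x = 110.

```python
import math

sd = 15.0   # standard deviation of the test scores (hypothetical)
r = 0.90    # reliability of the test (hypothetical)

sem = sd * math.sqrt(1 - r)   # SEM = SD * sqrt(1 - r), about 4.74 here

x = 110                       # the person's observed score
z = 1.96                      # z-score for a 95% confidence interval
lower = x - z * sem           # about 100.7
upper = x + z * sem           # about 119.3

print(f"SEM = {sem:.2f}")
print(f"95% CI for the true score: [{lower:.2f}, {upper:.2f}]")
```

Note that the more reliable the test, the smaller the SEM and the narrower the resulting confidence interval.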
What are the different types of reliability?
- Test-retest reliability
- Parallel forms reliability
- Internal consistency (split-half reliability, Kuder-Richardson 20 reliability, coefficient/Cronbach’s alpha)
- Inter-rater reliability
What is test-retest reliability?
- The simplest way to establish reliability is to administer the test or scale to a sample on two different occasions. If the scale is reliable, the scores at the test and retest administrations should be strongly correlated.
- The correlation between the two scores is also known as the coefficient of stability.
- The source of error measured is time sampling.
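A minimal sketch of the coefficient of stability, assuming hypothetical scores for the same eight people at test and retest:

```python
import numpy as np

# Hypothetical scores for the same eight people on two occasions
test   = np.array([12, 18, 25, 30, 22, 15, 28, 20])
retest = np.array([14, 17, 27, 29, 21, 16, 30, 19])

# The coefficient of stability is the Pearson correlation between the two
r_stability = np.corrcoef(test, retest)[0, 1]
print(f"test-retest reliability = {r_stability:.3f}")
```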
What are the issues with test-retest reliability?
- What is the optimal length of time that should elapse between the administrations? If it is too soon, participants may recall their answers from the first administration. If left too long, extraneous events may influence the scores on the scale.
- There are issues around using it to measure things that are more transient, like mood.
- What if some event happens between the first and second administrations?
What is parallel forms reliability?
Alternate-forms reliability requires the construction of two equivalent versions of the same test, which have items that are closely matched. Then the two forms are administered to the same set of people either at different times or at the same time.
- The correlation between the two forms is known as the coefficient of equivalence
- The source of error measured is item sampling.
How do you change the form of a test for parallel forms reliability?
- Question response alternatives are reworded.
- Order is changed (this reduces practice effects)
- Can change the wording of the question.
What are the issues with parallel forms reliability?
- A problem with the alternative forms method is that it is both difficult and expensive to produce alternate forms that are sufficiently independent and similar.
- Difficult to generate a big enough item pool.
What is inter-rater reliability?
- Measures how consistently two or more raters/judges agree when rating something.
- Multiple raters for measurement can improve measurement reliability.
- Could do this by correlating raters’ scores.
- This method does not factor in the number of times raters agree simply by chance.
What are the 2 different calculations used for inter-rater reliability?
- Cohen’s kappa is used when there are 2 raters or judges.
- Fleiss’ kappa is used when there are more than 2 raters or judges.
- Kappa ranges from 1 (perfect agreement) through 0 (agreement no better than chance) to -1.
- > 0.75: excellent agreement
- 0.50-0.75: satisfactory agreement
- < 0.40: poor agreement
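A minimal from-scratch sketch of Cohen's kappa with hypothetical ratings from two judges; the chance-agreement term p_e is exactly what a plain percent-agreement or correlation figure fails to correct for:

```python
import numpy as np

def cohens_kappa(rater1, rater2):
    """Cohen's kappa for two raters assigning categorical ratings."""
    rater1, rater2 = np.asarray(rater1), np.asarray(rater2)
    categories = np.union1d(rater1, rater2)

    # Observed agreement: proportion of items both raters labelled the same
    p_o = np.mean(rater1 == rater2)

    # Chance agreement: product of the raters' marginal proportions,
    # summed over all categories
    p_e = sum(np.mean(rater1 == c) * np.mean(rater2 == c) for c in categories)

    return (p_o - p_e) / (1 - p_e)

# Hypothetical 0/1 ratings of ten items by two judges
judge_a = [1, 1, 0, 1, 0, 1, 1, 0, 1, 0]
judge_b = [1, 1, 0, 1, 0, 0, 1, 0, 1, 1]

print(f"kappa = {cohens_kappa(judge_a, judge_b):.3f}")  # 0.583 for this data
```

Here the raw agreement is 8/10 = 0.80, but chance agreement alone is 0.52, so kappa is a much more modest 0.583.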
What is internal consistency reliability?
- It asks: do the different items within one test all measure the same thing to the same extent?
- Are items within a single test highly correlated?
- The source of error measured is the internal consistency/reliability of one test administered on one occasion.
What are the three different ways of measuring internal consistency?
- Split-half reliability
- Coefficient alpha
- KR-20, which is a special case of coefficient alpha for the dichotomous (e.g. true/false) item format.
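Coefficient alpha can be computed from the item variances and the total-score variance: alpha = (k / (k - 1)) * (1 - sum of item variances / total-score variance). A minimal sketch with made-up item scores:

```python
import numpy as np

def cronbach_alpha(items):
    """Coefficient alpha for an (n_people, k_items) matrix of item scores."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)        # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)    # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical responses of six people to four Likert-style items
scores = [[3, 4, 3, 4],
          [2, 2, 3, 2],
          [4, 5, 4, 5],
          [1, 2, 1, 2],
          [3, 3, 4, 3],
          [5, 4, 5, 5]]

print(f"alpha = {cronbach_alpha(scores):.3f}")
```

Passing a matrix of dichotomous (0/1) item scores to the same function gives KR-20, since KR-20 is simply alpha for the dichotomous case.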
What is split half-reliability?
- A test is split in half, each half is scored separately, total scores for each half are then correlated to determine whether they yield similar measures.
- A major advantage is that we only need one test.
- A major challenge is dividing the test into equivalent halves.
- Full-test reliability is estimated from the half-test correlation using the Spearman-Brown formula: r_full = (2 * r_half) / (1 + r_half) (see the sketch below).
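A minimal sketch with made-up dichotomous items, showing an odd-even split, the half-test correlation, and the Spearman-Brown correction:

```python
import numpy as np

# Hypothetical 0/1 responses of six people to a six-item test
scores = np.array([[1, 0, 1, 1, 0, 1],
                   [0, 0, 1, 0, 1, 0],
                   [1, 1, 1, 1, 1, 1],
                   [0, 1, 0, 0, 0, 1],
                   [1, 1, 1, 0, 1, 1],
                   [0, 0, 0, 1, 0, 0]])

# Odd-even split: one common way to form two roughly equivalent halves
half_a = scores[:, ::2].sum(axis=1)    # items 1, 3, 5
half_b = scores[:, 1::2].sum(axis=1)   # items 2, 4, 6

r_half = np.corrcoef(half_a, half_b)[0, 1]

# Spearman-Brown: estimate full-test reliability from the half-test
# correlation, compensating for each half being only half as long
r_full = (2 * r_half) / (1 + r_half)

print(f"half-test r = {r_half:.3f}, Spearman-Brown corrected = {r_full:.3f}")
```

The corrected value is higher than the raw half-test correlation, reflecting the point below that shorter tests are less reliable.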
What are two of the major issues with split-half reliability?
- The fewer items we have, the lower our reliability (reference to the domain sampling model)
- Therefore, each half of the split test will have reduced reliability compared to the total test.
- Dividing the test into equivalent halves is hard: the correlation will change for each different split. Ideally the halves should be equivalent.