Psychometrics: reliability Flashcards
What is a reliable test?
- consistency in measurement
- the precision with which the test score measures achievement
What is reliability?
- the desired consistency or reproducibility of test scores (does it give me the same accurate measurement each time it is used?)
- no test is free from error
Reliability formula
x = T + e
x - the observed score
T - the true score
e - the error
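The formula above can be illustrated with a small simulation (a sketch with made-up numbers; the true score, error spread, and seed are all arbitrary assumptions):

```python
import random
import statistics

random.seed(42)  # arbitrary seed, for reproducibility only

true_score = 50  # T: the person's fixed true score (hypothetical)

# x = T + e, where e is random error drawn from a normal distribution
observed = [true_score + random.gauss(0, 5) for _ in range(10_000)]

# Because the error is random, it averages out over many administrations:
# the mean observed score lands close to the true score
print(round(statistics.mean(observed), 1))
```

This mirrors the assumption that error is random: any single observed score differs from T, but the errors cancel out on average.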
The Four Assumptions of Classical Test Score Theory
- Each person has a true score we could obtain if there was no measurement error
- there is measurement error, but this error is random
- the true score of an individual doesn't change with repeated applications of the same test, even though their observed score does
- the distribution of random errors, and thus of observed test scores, will be the same for all people
Standard Error of measurement (SEM)
- works out how much measurement error we have by working out how much, on average, an observed score on our test differs from the true score
- it is the standard deviation of the distribution of error scores
Problems with Classical Test Score Theory
- Population dependent
- Test dependent
- Assumes equal measurement error for all people
Domain Sampling Model
- a central concept of Classical Test Theory
- cant ask all possible questions on a test so only use a few test items (sample)
- using fewer test items can lead to the introduction of error
- as sample gets larger, estimate is more accurate
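The "larger sample, more accurate estimate" point can be sketched with a toy simulation (the domain size, success rate, and sample sizes are all invented for illustration):

```python
import random
import statistics

random.seed(1)  # arbitrary

# Hypothetical "domain" of 1000 possible items;
# True = this person would answer that item correctly
domain = [random.random() < 0.7 for _ in range(1000)]
true_proportion = statistics.mean(domain)  # the "true" domain score

# Estimate that proportion from a small vs a large sample of items
small_sample_estimate = statistics.mean(random.sample(domain, 10))
large_sample_estimate = statistics.mean(random.sample(domain, 500))

# Larger item samples give more stable estimates of the domain score
# (on average; any single small sample may still happen to be close)
print(true_proportion, small_sample_estimate, large_sample_estimate)
```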
4 Types of reliability
- Test-retest reliability
- Parallel forms reliability
- Internal consistency
- inter-rater reliability
Test-retest reliability
- give someone a test and then give them the same test later on
- if scores are highly correlated, we have a good test-retest reliability
- correlation between 2 scores = co-efficient of stability
- time sampling
Issues with test-retest
- can it be used when measuring mood/stress?
- scores may increase simply through practice effects (having done the test before)
- what if the thing being measured changes over time?
- what if an event happens between test administrations that changes the thing being tested?
Parallel forms reliability
- 2 forms of the same test (questionnaire with different items)
- correlation between the two = co-efficient of equivalence
- item sampling
Ways to change test in parallel forms reliability
- question response alternatives are reworded
- order is changed
- change wording of question
Issues with parallel forms reliability
- what if different forms are given at two different times?
- do you give the form to the same or different people?
- what if people work out how to answer the one form from doing the other form?
- do two forms of the test already exist, or do we have to develop a second form of the same test?
Internal Consistency
- do the different items within a test all measure the same thing, to some extent?
Examples of internal consistency tests
- split-half reliability
- KR-20 (Kuder-Richardson 20)
- coefficient alpha
Split-half reliability
- test is split in half and each half is scored separately
- total scores for each half are correlated
advantage of split-half reliability
- only need one test (don't need 2 forms)
challenge of split-half reliability
-how to divide the test into equivalent halves
issues with split-half reliability
- splitting the test leaves fewer items in each half, and fewer items means lower reliability
- the correlation changes depending on how the items are split
Spearman-Brown formula
- the solution to the problem with split tests (that each half will have reduced reliability compared to the total test)
- corrected full-test reliability: r_full = 2r_half / (1 + r_half)
Coefficient/Cronbach’s Alpha
- estimates the consistency of responses to different scale items
- takes the average of all possible split-half correlations for a test
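A minimal sketch of Cronbach's alpha using the variance form, alpha = k/(k-1) * (1 - sum of item variances / variance of total scores), with hypothetical data:

```python
from statistics import pvariance  # population variance


def cronbach_alpha(scores):
    """Cronbach's alpha for scores: rows = people, columns = items."""
    k = len(scores[0])  # number of items
    item_vars = [pvariance(col) for col in zip(*scores)]       # per-item variance
    total_var = pvariance([sum(row) for row in scores])        # variance of totals
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical responses: 4 people x 4 items that rise and fall together
scores = [
    [4, 5, 4, 5],
    [2, 3, 2, 2],
    [5, 5, 4, 4],
    [1, 2, 1, 2],
]
print(round(cronbach_alpha(scores), 3))  # high: items behave consistently
```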
What do the coefficient values mean? Cronbach's A
0- no consistency in measurement
1- perfect consistency in measurement
What level of reliability is appropriate? Cronbach's A
- .7 - exploratory research
- .8 - basic research
- .9 - applied scenarios
Cronbach’s alpha can be affected by
- multidimensionality
- bad test items
- number of items
Inter-rater reliability
- measures how consistently 2 or more judges agree on rating something
- measured by correlating the raters' scores
Cohen’s kappa
-2 judges/raters
- ranges from -1 to 1: 1 = perfect agreement, 0 = agreement expected by chance, negative values = less agreement than expected by chance
>0.75 - excellent agreement
0.4-0.7 - satisfactory
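Cohen's kappa is (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance from each rater's marginal proportions. A sketch with hypothetical ratings:

```python
# Hypothetical category labels from two raters for the same 10 cases
rater1 = ["yes", "yes", "no", "no", "yes", "no", "yes", "no", "yes", "no"]
rater2 = ["yes", "no",  "no", "no", "yes", "no", "yes", "yes", "yes", "no"]


def cohens_kappa(a, b):
    n = len(a)
    # p_o: proportion of cases where the raters agree
    observed = sum(x == y for x, y in zip(a, b)) / n
    # p_e: chance agreement from each rater's marginal proportions
    cats = set(a) | set(b)
    expected = sum((a.count(c) / n) * (b.count(c) / n) for c in cats)
    return (observed - expected) / (1 - expected)


print(round(cohens_kappa(rater1, rater2), 3))  # prints 0.6
```

Here the raters agree on 8 of 10 cases (p_o = 0.8) but would agree on half by chance (p_e = 0.5), giving kappa = 0.6, i.e. satisfactory on the scale above.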
Fleiss’ kappa
extends Cohen's kappa to more than 2 judges/raters
Intra-class correlation (ICC)
used for inter-rater reliability when rating interval and ordinal measurements
ICC vs Cohen's/Fleiss' kappa
- ICC for continuous data (interval and ordinal)
- kappa for observations in a category (nominal/categorical data)
SEM calculation
SEM = s * sqrt(1 - r)
s - standard deviation of test scores
r - reliability of the test
confidence intervals using SEM
- z score for 95% confidence interval = 1.96
- lower bound = x - 1.96 * SEM
- upper bound = x + 1.96 * SEM
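Putting the SEM formula and the confidence interval together (the standard deviation, reliability, and observed score below are hypothetical):

```python
import math

s = 10    # standard deviation of test scores (hypothetical)
r = 0.91  # reliability of the test (hypothetical)
x = 75    # a person's observed score (hypothetical)

sem = s * math.sqrt(1 - r)  # SEM = s * sqrt(1 - r) = 3.0 here

# 95% confidence interval around the observed score
lower = x - 1.96 * sem
upper = x + 1.96 * sem
print(round(sem, 2), round(lower, 2), round(upper, 2))
```

So with these numbers we can be 95% confident the true score lies between about 69.1 and 80.9; note how higher reliability shrinks the SEM and therefore the interval.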
Factors influencing reliability
- number of items in scale
- variability of the sample (better with wider population)
- extraneous variables (testing situation, ambiguous items, unstandardised procedures, perceived demand effect)
how to improve reliability
- item analysis
- Use identical instructions
- Eliminate questions that evoke inconsistent responses
- Cover entire range of the dimension
- Clear conceptualization
- Standardization
- Inter-rater training
- Use more precise measurement
- Use multiple indicators
- Pilot-testing