Ch 4 - Reliability Flashcards
Reliability
Is based on the consistency and precision of the results of the measurement process
= trustworthiness
Measurement error
• Any fluctuation in scores that results from factors related to the measurement process that are irrelevant to what is being measured
*Measurements are ALWAYS subject to some fluctuation/error, but we want to limit it as much as possible
True scores
the hypothetical entities that would result from error-free measurement (do not actually exist)
Not calculated the same way in individual and group scores
T is the value you would obtain if you were to administer the test to an individual an infinite number of times (without practice effect) and average all those scores
Individual true score
average score in the hypothetical distribution that would result if the person took a test an infinite number of times
Observed scores
the scores that individuals actually obtain when taking a test
Composed of:
True Score + Error Score
Sample (or pop) variance is composed of…
true variance + error variance
The reliability of scores increases as the error component decreases
How can we calculate a reliability coefficient (rxx) using the variance of a sample
rxx = true variance / total variance (the remainder, 1 - rxx, is the proportion of error variance)
If all the test score variance were true variance, score reliability would be perfect (1.0)
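A quick sketch of the calculation with hypothetical numbers (invented for illustration, not values from the book):

```python
# Hypothetical variance values, for illustration only.
total_variance = 50.0   # observed (total) score variance in a sample
error_variance = 10.0   # portion of that variance attributable to measurement error
true_variance = total_variance - error_variance  # 40.0

r_xx = true_variance / total_variance  # reliability coefficient
print(r_xx)  # 0.8 -> 80% of the score variance is true variance, 20% is error
```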
Why is it an error to say that a test is reliable? Which factors can influence reliability?
It's equivalent to saying that it will be reliable for every use, at every time, in every respect (which is not true)
This is why we consider that reliability is about scores, not about tests
Many factors can influence the reliability of a score
• Test taker (fatigue, unmotivated, mood, drugs, etc)
• Environment of test (room, temperature, noise, etc)
• Others
Why is the reliability of scores variable and not fixed?
When score data are obtained from a large sample under standardized conditions, the resulting measurement errors are assumed to be relatively small and to cancel each other out across individual scores (even then, reliability will still vary from sample to sample)
But the extent to which possible sources of error in measurement enter into any specific use of a test must be taken into account each time a test is used, because some factors may vary
*Judgements about what counts as a source of error need to be made in relation to what the test is trying to assess - the same condition can be interpreted differently (ex: noise that is intentionally used to distract test takers vs. noise that happens in the lab by mistake and distracts them) - so these judgements will vary across different uses of the test
3 sources of error in test scores
- The context in which the testing takes place (administrator, test scorer, environment, etc)
- The test taker (carelessness, etc) (can be difficult to eliminate)
- Specific characteristics of the test itself
Consistent error
Systematic error that affects every measurement in the same way (for example, a scale that weighs everyone 2 kilos too heavy)
Estimates of reliability may fail to detect this kind of error - which also affects the validity of the measurement
Interscorer/Interrater Differences
Label assigned to the errors that may enter into scores when the element of subjectivity plays a role in scoring a test
Can happen even if:
• The scoring guidelines are clear and well-explained
• The scorers are conscientious in applying the guidelines
It does not imply carelessness from the scorers
Scorer Reliability
Scorer reliability (AKA inter-rater reliability, AKA interscorer reliability)
Method for estimating error due to interscorer differences
Having at least 2 individuals score the same set of tests
The correlations between the sets of scores obtained are indications of scorer reliability
• Measures the degree to which score positions stay the same over 2 raters, NOT whether raters give the same score
High and positive = the error due to scorer differences is small (less than 10%)
• Symbol for inter-rater reliability - r (established by the prof, there is none specified in the book)
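A minimal sketch of how such a coefficient could be computed, assuming two raters and a small set of invented scores (the data are hypothetical; the correlation between the two sets of scores is the part the notes describe):

```python
import numpy as np

# Hypothetical ratings: two scorers rate the same 8 tests (invented numbers).
rater_a = np.array([12, 15, 9, 20, 17, 11, 14, 18])
rater_b = np.array([13, 14, 10, 19, 18, 10, 15, 17])

# Pearson correlation between the two sets of scores = scorer reliability estimate.
# It reflects whether the ratings keep the same relative positions across raters,
# not whether the two raters assign identical scores.
r = np.corrcoef(rater_a, rater_b)[0, 1]
print(round(r, 2))
```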
Time Sampling error
Refers to the variability inherent in test scores as a function of the fact that they are obtained at one point in time rather than at another
ANY construct/behaviour is subject to some fluctuation from time to time
Some constructs/behaviours are less subject to change than others
In the realm of personality:
Traits - more enduring
States - fluctuating/temporary
Some cognitive abilities (like attention) may also be more vulnerable to change
Test-Retest Reliability
Test-retest reliability
Giving the same test on 2 occasions to account for time sampling errors
• The correlation between the scores is the test-retest reliability, or stability, coefficient (rtt)
○ Index of how much scores are likely to fluctuate due to time sampling error
• The time interval between the administrations has to be specified too - no specific interval can be suggested because it can change based on various factors
* That interval should be selected with purpose - it should be consistent with the theory behind the test and with what it is supposed to measure
* Attrition, practice, mood, etc. could all influence the scores between time 1 and time 2
Content sampling error
Trait-irrelevant variability that can enter into test scores as a result of fortuitous factors related to the content of the specific items included in a test
• When the content of a test either favors or disadvantages some test takers, for reasons outside the test developer's control
Ex: an exam that covers only 2 of the 3 assigned chapters - students who focused on the chapter left out of the exam are unfairly disadvantaged
Alternate-Form Reliability
Split-Half Reliability
Alternate-Form Reliability
Intended to estimate the amount of error in test scores that is attributable to content sampling error
• Two forms of the test (same purpose but different content) are administered to the same subjects
• Alternate-form reliability (r11) coefficients are then obtained (Pearson correlation between the 2 scores that each examinee obtains)
• Chance/random factors are unlikely to affect participants in this case
In the book, the coefficient is designated by r11
Split-Half Reliability - what does it estimate?
Administer the test to a group and create two scores by splitting the test in half
* Estimates content sampling error
* Interitem inconsistency - only up to a certain point, since it evaluates reliability between the 2 halves of the test, not between individual items
• This method is a way to estimate content sampling error in tests for which NO alternate form is available
○ Which is true for most tests - few tests have alternate forms
How can we split the test for split-half reliability? What does it depend on?
• How to split? Depends on
○ Systematic differences across test items
§ Ex: increasing difficulty, spiral omnibus format (items pertaining to certain variables alternate in the same order for the whole test)
○ Whether test performance depends primarily on speed
§ Ex: clerical tests where you need to find the mistake in series of characters as fast as possible; time limit is set so that most won’t finish the test
There are various ways to split the test in half (even-odd, half-half, quarters, etc)
Spiral omnibus format
(items pertaining to certain variables alternate in the same order for the whole test)
How is the split-half reliability coefficient calculated? What adjustment do we need to make to it, and why?
the split-half reliability coefficient (rhh) is calculated with the correlation between the 2 halves of the test
Then, Spearman-Brown formula is applied to rhh to obtain an estimate for the full test, which will INCREASE the value of the coefficient (to account for both halves)
rS-B = (2rhh) / (1+rhh)
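A minimal sketch combining the odd-even split mentioned above with the Spearman-Brown correction, using invented 0/1 item responses (the data are hypothetical, for illustration only):

```python
import numpy as np

# Hypothetical 0/1 item responses: 6 examinees x 8 items (invented data).
items = np.array([
    [1, 1, 0, 1, 1, 0, 1, 1],
    [0, 1, 0, 0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1, 1, 1, 0],
    [0, 0, 0, 1, 0, 0, 1, 0],
    [1, 0, 1, 1, 1, 1, 0, 1],
    [0, 1, 0, 0, 0, 1, 0, 0],
])

# Odd-even split: one half-score from the odd-numbered items, one from the even-numbered items.
odd_half = items[:, 0::2].sum(axis=1)
even_half = items[:, 1::2].sum(axis=1)

# Correlation between the two halves = rhh (reliability of a half-length test).
r_hh = np.corrcoef(odd_half, even_half)[0, 1]

# Spearman-Brown correction estimates the reliability of the full-length test.
r_sb = (2 * r_hh) / (1 + r_hh)
print(round(r_hh, 2), round(r_sb, 2))  # r_sb is larger than r_hh
```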
Interitem inconsistency
Refers to error in scores that results from fluctuations in items across an entire test, as opposed to the content sampling error emanating from the particular configuration of items included in the test as a whole
• Can result from many possible sources, including:
○ Content sampling error
○ Content heterogeneity
• Statistically, it is visible as low inter-item correlations - the degree to which responses to individual items maintain their relative positions across the whole item set
○ This analysis is at the level of individual item scores, NOT at the level of scores for the whole test
• Ex: those who get item 17 correct also generally get item 92 correct (and likewise for those who fail both)
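A minimal sketch of what inspecting inter-item correlations could look like, using invented 0/1 response data (items and values are hypothetical):

```python
import numpy as np

# Hypothetical 0/1 responses: rows = examinees, columns = items (invented data).
items = np.array([
    [1, 1, 0, 1],
    [0, 1, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 0, 1, 1],
])

# Inter-item correlation matrix: each entry is the correlation between two items'
# response patterns across examinees. Low values suggest interitem inconsistency.
inter_item_r = np.corrcoef(items, rowvar=False)
print(np.round(inter_item_r, 2))
```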
Content Heterogeneity
Results from the inclusion of items/sets of items that tap content knowledge or psychological functions that differ from those tapped by other items in the same test
Cannot be considered a source of error if the test was intended to be heterogeneous
—
Heterogeneity of item content across one scale within a particular test
• If responses across the individual item are not consistent, it’s hard to argue that they come from the same domain
Internal Consistency Measures - why can’t we use the split half reliability coefficient to measure this?
Are statistical procedures designed to assess the extent of inconsistency across test items
• Split-half coefficients can do that to some extent, BUT a test can be divided in so many ways that the coefficients will vary each time
○ Solution 1: an odd-even split
○ Solution 2: formulas that take interitem correlation into account
§ Kuder-Richardson formula 20 (K-R 20)
§ Coefficient alpha (AKA Cronbach's alpha)
Name the 2 factors that make the magnitude of the Kuder-Richardson formula 20 (K-R 20) and Coefficient alpha (AKA Cronbach's alpha) vary
- The number of items in the test
- The ratio of the variability in test takers' performance across all the items in the test to the total test score variance
Indeed, their magnitude will be higher as:
• The number of items increases
• The ratio of item score variance to total test score variance decreases
• BOTH formulas require a single administration of a test to a group
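The notes do not write the formulas out; below is a minimal sketch using the standard form of coefficient alpha (which reduces to K-R 20 when items are scored 0/1), with invented response data. Note how the number of items (k) and the ratio of summed item variance to total score variance are exactly the two factors listed above:

```python
import numpy as np

def coefficient_alpha(items: np.ndarray) -> float:
    """Coefficient alpha for a matrix of item scores (rows = examinees, cols = items).

    Standard form: alpha = k/(k-1) * (1 - sum(item variances) / total score variance).
    With dichotomous (0/1) items this reduces to the Kuder-Richardson formula 20.
    """
    k = items.shape[1]                              # number of items
    item_variances = items.var(axis=0, ddof=1)      # variance of each item
    total_variance = items.sum(axis=1).var(ddof=1)  # variance of total test scores
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Hypothetical 0/1 responses (invented data, for illustration only).
responses = np.array([
    [1, 1, 0, 1, 1],
    [0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0],
    [1, 0, 1, 1, 1],
    [0, 1, 0, 0, 0],
])
print(round(coefficient_alpha(responses), 2))
```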
Conceptually, the estimates of reliability that the Kuder-Richardson formula 20 (K-R 20) and the Coefficient alpha (AKA Cronbach's alpha) produce are similar to…?
Both formulas produce estimates of reliability that are (conceptually) equivalent to the average of ALL the possible split-half reliability coefficients we could obtain if we split the test in every possible way
What type of error do the Kuder-Richardson formula 20 (K-R 20) and Coefficient alpha (AKA Cronbach's alpha) represent?
• They represent an estimate of content sampling error and content heterogeneity
What other techniques can be used to evaluate test homogeneity?
Factor analytic techniques
Time sampling and content sampling error combined can also be evaluated with..
Delayed Alternate-Form Reliability
Delayed Alternate-Form Reliability
Good for estimating time sampling and content sampling error in a single coefficient
Can be calculated when 2+ alternate forms of the same test are administered on 2 different occasions (separated by a time interval), to 1+ groups of people
If the time interval is small: mostly assessing content sampling
If the interval is larger: assessing both content and time sampling
Practice effects + myth about rtt
Practice effects: increase in performance due to repeated exposure to the test items (ex: taking the test twice)
• More significant with small intervals
• Procedures that require more than one trial are susceptible to practice effects
• Must be taken into account when relevant
Common myth about practice effects and the test-retest method: practice effects lower test-retest reliability coefficients
• Test-retest involves the EXACT SAME test being given twice - if the interval is short, there might very well be practice effects
• The myth is NOT true: if practice effects are constant throughout the sample (ex: all people improve by approximately the same amount), they will tend not to lower the rtt coefficient
○ BUT if practice effects are differential (vary), then yes it may lower rtt
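A small simulation (invented parameters) illustrating the point: adding a roughly constant gain to everyone's retest score leaves rtt essentially unchanged, while gains that vary widely across people pull it down:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical time-1 scores for 200 test takers (invented distribution).
time1 = rng.normal(100, 15, size=200)

# Constant practice effect: everyone gains about 5 points (plus ordinary retest noise).
time2_constant = time1 + 5 + rng.normal(0, 3, size=200)

# Differential practice effect: the size of the gain varies widely across people.
time2_differential = time1 + rng.normal(5, 10, size=200) + rng.normal(0, 3, size=200)

r_constant = np.corrcoef(time1, time2_constant)[0, 1]
r_differential = np.corrcoef(time1, time2_differential)[0, 1]

# A uniform shift does not change a Pearson correlation, so r_constant stays high;
# the differential gains add person-to-person variability and lower r_differential.
print(round(r_constant, 2), round(r_differential, 2))
```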
What are the 2 times at which test reliability matters most to a test user?
○ The stage of test selection
○ Test score interpretation
Name the 4 steps to consider reliability in test selection
- Determine the potential sources of error that may affect the scores
- Examine the reliability data available on the instruments of choice, as well as the types of normative samples used for this data
- Evaluate the reliability data in light of other factors (time, cost, validity, etc)
- All other things being equal, choose the test that promises the most reliable scores
There are no fixed rules that apply when selecting a test - it ALWAYS depends on the circumstances
Name other aspects relating to reliability that must be taken into account when choosing a test (4)
- Scoring involving subjective judgement (scorer reliability)
- Possible time sampling error and practice effects (when evaluating scores over time)
- The need for high delayed alternate-form reliability, when the testing plan involves people being tested more than once
- The desire for homogeneity across the entire test (K-R20 or alpha coefficient)
Which of these matters most will depend on the intentions for using the test
Why should we look at the composition of the sample used to calculate the reliability of a test?
It must be taken into account that the reliability coefficients shown in test manuals and similar documents apply ONLY to the samples that were used by the test authors
• Therefore, small differences in coefficients between different tests do not matter as much as other considerations
○ For tests intended to be used in individual assessment, the sample composition IS very important
○ Overall, the higher the coefficient the better (usually over .80 is best)