Lecture 2 - Reliability Flashcards
reliability is a ______ property of a test
-explain
-who is reliability important for and why
If a test is not reliable it will never be valid; i.e. reliability is a
necessary (but obviously not sufficient) condition for validity
Reliability is particularly important for applied psychologists
(clinical psychologists, clinical neuropsychologists, educational
psychologists) as they deal with individual cases
what is a reliability coefficient
Reliability coefficients tell us how much of the variability
in scores on tests is true variability (i.e., signal) and how
much of it is measurement error (i.e., noise)
what is
- true variability
- measurement error
-If a psychological test has a reliability coefficient of (say)
0.8, then 80% of the variability in scores is true variability
(i.e., the test is picking up real differences in the construct
being measured)
It follows that 20% of the variability in scores reflects measurement error, i.e., noise in the instrument:
something other than the construct being measured that affects performance
reliability coefficient
The reliability coefficient can be seen as a signal-to-(signal plus
noise) ratio
Reliability (i.e., r11) = true variance / total variance
You will often see the reliability coefficient denoted as r11 or
rxx because it can be seen as the test’s correlation with (a
strictly parallel version of) itself – there is always
measurement error so the correlation is not perfect
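As a quick check on the ratio above, here is a minimal Python sketch. The function name and the variance values (80 and 20) are made up for illustration; they are chosen only so the result matches the 0.8 example in these notes.

```python
def reliability(true_variance: float, error_variance: float) -> float:
    """Reliability as true variance divided by total (true + error) variance."""
    total_variance = true_variance + error_variance
    if total_variance <= 0:
        raise ValueError("total variance must be positive")
    return true_variance / total_variance


# Hypothetical variance components (not from any real test).
r11 = reliability(true_variance=80.0, error_variance=20.0)
print(f"reliability (signal share) = {r11:.2f}")
print(f"measurement error share    = {1 - r11:.2f}")
```

The denominator is signal plus noise, which is why the coefficient can never exceed 1.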
why reliability is important- what does it allow for
Reliability allows us to quantify the confidence we have in our
test results and allows us to assess whether differences
between an individual’s scores are liable to reflect true
differences in ability or may have simply arisen by chance
(i.e., measurement error)
can we reify a test score ?
-reliability coefficients
Psychologists are often warned not to reify a test score: it is
only an estimate of an individual’s true ability level or mood
level etc
Reliability coefficients allow us to form confidence intervals
on scores to help remind us of the above (we will cover this
later)
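A sketch of one common textbook way to turn a reliability coefficient into a confidence interval: compute the standard error of measurement (SEM = SD x sqrt(1 - reliability)) and place an interval around the obtained score. This is an assumption about the method; the lecture may use a different approach later (e.g., centring on an estimated true score). The SD of 15 and the score of 100 are illustrative values.

```python
import math


def standard_error_of_measurement(sd: float, reliability: float) -> float:
    """SEM = SD * sqrt(1 - reliability)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return sd * math.sqrt(1.0 - reliability)


def confidence_interval(score: float, sd: float, reliability: float, z: float = 1.96):
    """Interval of +/- z SEMs around the obtained score (z = 1.96 for about 95%)."""
    sem = standard_error_of_measurement(sd, reliability)
    return score - z * sem, score + z * sem


# Illustrative values: an index-type scale with SD 15 and reliability 0.98.
sem = standard_error_of_measurement(sd=15.0, reliability=0.98)
low, high = confidence_interval(score=100.0, sd=15.0, reliability=0.98)
print(f"SEM = {sem:.2f}")
print(f"95% interval: {low:.1f} to {high:.1f}")
```

The lower the reliability, the wider the interval, which is the reminder not to reify the single obtained score.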
what happens if we ignore reliability of tests
-chapman and chapman 1973 study
Furthermore, as much of clinical practice is concerned with
differences between an individual’s abilities, a failure to consider the
reliability of measures can lead the psychologist astray
Chapman & Chapman (1973) provided a classic illustration of
artefacts arising from differences in reliability
◦ Schizophrenic patients were compared to a healthy control sample on
two tasks
◦ The schizophrenic sample appeared to have a severe deficit on only one of the tasks (abstract reasoning)
◦ It was in fact the same task, but one version had been rendered less reliable (by
shortening the test)
(They used a short version of the test, so it was not very reliable; the test for the schizophrenic group was also shortened by half)
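To see how shortening a test could produce this kind of artefact, here is a hedged sketch using two standard psychometric results: the Spearman-Brown formula (predicted reliability after changing test length) and the attenuation of a standardised group difference by the square root of the reliability. All numbers are hypothetical and are not taken from Chapman & Chapman (1973); the sketch also assumes the same reliability applies within both groups.

```python
import math


def spearman_brown(reliability: float, length_factor: float) -> float:
    """Predicted reliability when test length is multiplied by length_factor."""
    n = length_factor
    return (n * reliability) / (1.0 + (n - 1.0) * reliability)


def observed_effect_size(true_d: float, reliability: float) -> float:
    """Group difference in observed-score SD units, given the difference in true-score SD units."""
    return true_d * math.sqrt(reliability)


full_reliability = 0.80                                   # hypothetical full-length task
half_reliability = spearman_brown(full_reliability, 0.5)  # same task cut to half length
true_d = 1.0                                              # hypothetical: identical true deficit on both versions

for label, r in (("full-length", full_reliability), ("half-length", half_reliability)):
    print(f"{label}: reliability = {r:.2f}, observed deficit = {observed_effect_size(true_d, r):.2f} SD")
```

The true deficit is identical in both rows, yet the observed deficit differs purely because one version is less reliable.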
how high should reliability coefficients be
There is no absolute rule (will depend on purpose) but various
standards have been proposed:
◦ Nunnally & Bernstein (1994) take a hard line and propose that
reliability coefficients should be above 0.90
Others are less demanding:
◦ Sattler (2001) suggests that tests with reliabilities of 0.70 and
above should be considered to be “reliable”
◦ Similarly, Cicchetti (1994) suggests tests with reliabilities below
0.70 should be considered “unreliable”
can reliability be too high? high reliability as a problem
-give an example
Yes:
if we are trying to measure a broad, multifaceted, construct
then a very high reliability may indicate a problem (Boyle, 1985)
Suggests we’re not measuring the whole concept
Take example of an anxiety measure:
- We could ask people ten different ways about whether they
experience muscle tension (a symptom of anxiety)
-The “measure” would be very reliable but would not be a good
measure of anxiety itself: anxiety is multifaceted (the test just asks how tense people feel; tension is only one symptom of anxiety, so the test does not necessarily measure anxiety itself)
how can we decide if a test is reliable
- Cronbach’s Alpha
- Test-retest reliability
To be considered reliable a test should provide a consistent
measure
what is Cronbach’s alpha
-used when
-determined by
-what does it indicate
-used in questionnaire type tests
Cronbach’s alpha is determined by:
(a) the number of items in the test
(b) the size of the correlations between the items
Longer tests are more reliable
Tests in which the items have higher correlations with each
other are more reliable
You don’t need any maths to see why that makes sense
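For anyone who does want the maths, the two determinants (a) and (b) can be sketched with the standardised form of Cronbach's alpha, which depends only on the number of items k and the mean inter-item correlation. The function name and the example values are illustrative assumptions, and this is the standardised variant rather than alpha computed from raw item variances.

```python
def standardized_alpha(n_items: int, mean_inter_item_r: float) -> float:
    """Standardised alpha = k * r / (1 + (k - 1) * r), with r the mean inter-item correlation."""
    k = n_items
    return (k * mean_inter_item_r) / (1.0 + (k - 1) * mean_inter_item_r)


# Illustrative values only.
print(f"10 items, mean r = .30: alpha = {standardized_alpha(10, 0.30):.2f}")
print(f"20 items, mean r = .30: alpha = {standardized_alpha(20, 0.30):.2f}")  # longer test
print(f"10 items, mean r = .50: alpha = {standardized_alpha(10, 0.50):.2f}")  # more highly correlated items
```

Holding one input fixed and raising the other increases alpha in both cases.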
-reliability and test length
-vocabulary test
Take the example of a Vocabulary test
If we use only, say, 4 items the test is not going to be very reliable
There are an enormous number of words out there and we will not be
able to sample them at all well with only 4 items
Some people will, by chance, do much better on the particular 4 words than they would if we tested their vocabulary for all words
Equally, others will, by chance, do worse than their real overall level of
vocabulary knowledge
However, if we up the number of words substantially, these chance
advantages or disadvantages will even out
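The "evening out" can be sketched with the standard error of a proportion. Assume, hypothetically, that a person knows 60% of all words and that test words are sampled at random and independently; the proportion correct on an n-item test then has a chance error of about sqrt(p(1 - p) / n). The 60% figure and the item counts are illustrative.

```python
import math


def standard_error_of_proportion(p_known: float, n_items: int) -> float:
    """Typical chance error in the proportion correct on a random n-item sample."""
    return math.sqrt(p_known * (1.0 - p_known) / n_items)


p_known = 0.60  # hypothetical true proportion of words the person knows
for n_items in (4, 16, 64, 256):
    se = standard_error_of_proportion(p_known, n_items)
    print(f"{n_items:>3} items: chance error of about +/- {se:.3f} in proportion correct")
```

Quadrupling the number of items halves the chance error, so lucky or unlucky draws matter less and less as the test gets longer.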
are all longer tests reliable?
longer tests will be more reliable only provided
other things are equal
Suppose a psychologist is developing a test and carefully
selects items they think will be suitable
If the reliability is disappointing, simply throwing in a bunch
of additional poor items (items that are not closely related to
the other items or have ceiling or floor effects) will not help
much
longer tests are more reliable provided that the items in the longer test are as good (as
highly correlated with the other items) as the shorter version
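A hedged sketch of the "other things equal" caveat, using standardised alpha computed from an item correlation matrix. The matrices are invented: 4 "good" items correlating .50 with each other, padded either with 4 more good items or with 4 "poor" items that correlate only .05 with everything.

```python
def cronbach_alpha_from_correlations(corr: list) -> float:
    """Standardised alpha from a k x k item correlation matrix (unit item variances)."""
    k = len(corr)
    off_diagonal_sum = sum(corr[i][j] for i in range(k) for j in range(k) if i != j)
    variance_of_total = k + off_diagonal_sum  # variance of the sum of standardised items
    return (k / (k - 1)) * (1.0 - k / variance_of_total)


def item_correlations(n_good: int, n_poor: int, r_good: float, r_poor: float) -> list:
    """Hypothetical matrix: good items correlate r_good together; any pair involving a poor item gets r_poor."""
    k = n_good + n_poor
    matrix = []
    for i in range(k):
        row = []
        for j in range(k):
            if i == j:
                row.append(1.0)
            elif i < n_good and j < n_good:
                row.append(r_good)
            else:
                row.append(r_poor)
        matrix.append(row)
    return matrix


original = cronbach_alpha_from_correlations(item_correlations(4, 0, 0.50, 0.05))
padded_poor = cronbach_alpha_from_correlations(item_correlations(4, 4, 0.50, 0.05))
padded_good = cronbach_alpha_from_correlations(item_correlations(8, 0, 0.50, 0.05))
print(f"4 good items:          alpha = {original:.2f}")
print(f"4 good + 4 poor items: alpha = {padded_poor:.2f}")
print(f"8 good items:          alpha = {padded_good:.2f}")
```

In this made-up case the poor items do not just fail to help, they pull alpha down, whereas doubling the length with equally good items raises it.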
how can psychologists save time and shorten tests / short form tests
psychologists are always
looking for ways to save time and try and develop short-forms of tests
Sometimes this can be done with only a marginal lowering of reliability because poor items (e.g., items that are not highly correlated with the other items)
are selectively dropped
reliability (cronbach’s alpha) is a function of…
reliability (Cronbach’s alpha) of a scale is a function of the correlation between items and the number of items
designed to measure the same underlying construct. It evaluates how closely related the items are as a group.
Reliability coefficients for the WAIS-IV
-the reliability of a composite is a function…
The reliability of a composite (an Index or IQ in this case) is a
function of the reliability of components (subtests) and the
correlation between the components
The reliability of a composite score, such as an index or IQ, indeed depends on both the reliability of its individual components (the subtests) and the correlations among those components.
do composites have superior reliability to the components
Composites will always have superior reliability to the
components they are derived from if the components are
correlated (and they always are)
We can see this when we compare the reliabilities of the WAIS-IV
subtests with those for the Indexes
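The composite rule can be sketched with the standard formula for the reliability of a sum of components: 1 minus (summed error variance / variance of the composite). The sketch assumes components standardised to unit variance; the reliabilities (.80) and correlations (.60 and 0) are made-up values, not WAIS-IV figures.

```python
def composite_reliability(reliabilities: list, corr: list) -> float:
    """Reliability of an unweighted sum of unit-variance components.

    reliabilities: reliability of each component.
    corr: full correlation matrix between the components (1.0 on the diagonal).
    """
    composite_variance = sum(sum(row) for row in corr)    # k + sum of off-diagonal correlations
    error_variance = sum(1.0 - r for r in reliabilities)  # each component contributes (1 - r_ii)
    return 1.0 - error_variance / composite_variance


# Two hypothetical subtests, each with reliability .80.
correlated = composite_reliability([0.80, 0.80], [[1.0, 0.6], [0.6, 1.0]])
uncorrelated = composite_reliability([0.80, 0.80], [[1.0, 0.0], [0.0, 1.0]])
print(f"components correlate .60: composite reliability = {correlated:.3f}")
print(f"components uncorrelated:  composite reliability = {uncorrelated:.3f}")
```

With correlated components the composite beats either subtest; with uncorrelated components of equal reliability there is no gain, which is why the "if the components are correlated" condition matters.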
reliability coefficients for WAIS - IV IQs and indexes
The reliability of WAIS-IV Indexes and FSIQ are uniformly
excellent – among the highest of any psychological instrument
In case of FSIQ (in both US and UK), r11 is 0.98 so 98% of the variance in test scores is true variance and only 2% is measurement error
reliability for processing speed is ______ than others
why
a bit lower
-in part because it is a composite made up of only two components (coding and symbol search)
what is temporal stability
-how is temporal stability tested
Temporal stability refers to the extent to which a measure
yields consistent scores over time, i.e. stability coefficients
allow us to gauge extent to which performance is affected by
day to day fluctuations / differences in mood, testing conditions
-refers to the consistency of a measure over time
temporal stability is assessed using the test-retest method
the temporal stability or test retest reliability of a scale is simply….
the correlation between scores at test and retest
why is it important to set an appropriate interval between administrations?
-why do you need to avoid inflating the estimate
Normally the interval between administrations is set so it is
unlikely that true change has occurred in the underlying ability
Against that must be set the need to avoid inflating the estimate of stability due to a testee’s memory for their previous answers
higher
some of the chance fluctuations on components will cancel each other out
Temporal stability of FSIQ and Indexes are highly satisfactory
– again among the highest of any psychological instrument
Temporal stability of mental tests is generally very
impressive: e.g. Deary et al. (2000) found a (corrected)
correlation of .73 between an IQ test administered at age 11
and again at age 77 (66 year follow-up)
what are practice effects on cognitive tests
A psychologist often wants to know if an individual’s cognitive abilities have genuinely improved (e.g., as they recover from a head injury, or as a result of a psychological,
pharmacological or surgical intervention etc)
Similarly, a psychologist often wants to know if an individual’s cognitive abilities have genuinely deteriorated
(e.g., as a result of a degenerative condition, or as an unfortunate consequence of surgical intervention etc)
A complication is that there are practice effects on most
cognitive tests
Practice effects on cognitive tests refer to the improvements in test scores that result from repeated exposure to the same or similar assessments
May exaggerate or give false impression of recovery /
improvement
May mask a deterioration in functioning
The WAIS-IV has no alternate forms, so the same test has to be administered when retesting
-would alternate forms abolish practice effects?
-if a test had high test retest reliability, does this mean there will not be practice effects?
High test-retest reliability does NOT mean an absence of practice effects
example of practice effects
-graph
To illustrate, here is an example where the test-retest reliability is 1.0 (i.e., scores at test and
retest are perfectly correlated)
However, everyone improved by 15 points (i.e., there is a large practice effect of 15 points)
-example: the case marked * scored 30 at test but 45 at retest (is this a genuine improvement, or have they just become familiar with the test?)
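The same point in a small Python sketch: the scores are invented, and every retest score is simply the test score plus 15. The correlation (test-retest reliability) comes out perfect even though every score has shifted.

```python
import math


def pearson_r(x: list, y: list) -> float:
    """Pearson correlation between two equal-length lists of scores."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


test_scores = [30, 40, 50, 55, 70]              # hypothetical scores at first testing
retest_scores = [s + 15 for s in test_scores]   # everyone gains exactly 15 points

r = pearson_r(test_scores, retest_scores)
mean_gain = sum(b - a for a, b in zip(test_scores, retest_scores)) / len(test_scores)
print(f"test-retest correlation = {r:.2f}")
print(f"mean practice effect    = {mean_gain:.1f} points")
```

A correlation only reflects whether people keep their rank order and relative spacing; it says nothing about a shift in the mean, so practice effects have to be checked separately.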
practice effects are fairly substantial for some WAIS-IV indexes
-which subtests do they particularly mark
effects on perceptual reasoning, working memory, processing speed
Particularly marked on visuoperceptual / psychomotor subtests
Practice effects are over 1/3rd of an SD for overall IQ, Perceptual
Reasoning, and Working Memory
On Processing Speed (PS) the practice effect is over 2/3rds of an SD
(a massive effect)
Perhaps counterintuitively, an identical score on PS at retest
would therefore be a cause for concern!
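To put those fractions of an SD into score points, here is a small sketch. It assumes the usual index-score scale with an SD of 15 (an assumption not stated in these notes), and it treats 1/3 and 2/3 of an SD as lower bounds because the notes say "over".

```python
from fractions import Fraction

INDEX_SD = 15  # assumption: index scores are scaled with a standard deviation of 15


def sd_units_to_points(effect_in_sd: Fraction, sd: int = INDEX_SD) -> Fraction:
    """Convert an effect expressed in SD units into score points."""
    return effect_in_sd * sd


one_third = sd_units_to_points(Fraction(1, 3))
two_thirds = sd_units_to_points(Fraction(2, 3))
print(f"1/3 SD = {one_third} points (overall IQ, Perceptual Reasoning, Working Memory: gains exceed this)")
print(f"2/3 SD = {two_thirds} points (Processing Speed: gains exceed this)")

# An unchanged Processing Speed score at retest sits at least this far below the expected retest score.
shortfall_if_unchanged = two_thirds
print(f"unchanged PS score is roughly {shortfall_if_unchanged}+ points below expectation")
```

This is why an identical Processing Speed score at retest is a concern: without any real change the score would be expected to rise by the practice effect.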
are practice effects extreme in clinical settings?
Practice effects are not liable to be as extreme with more clinically realistic
retest intervals
practice effects can still be detected after ___
a 7 year gap
Why is it important for psychologists to be aware of practice effects in cognitive testing?
Understanding Variability: Practice effects can vary across different tests, affecting interpretation.
Informal Adjustments: Psychologists can informally factor in practice effects when interpreting a person’s scores.
Formal Methods: There are statistical methods available to account for practice effects in analyses.