Week 3 Reliability and Validity Flashcards
Define reliability
The degree to which a test tool produces consistent results (when measuring the same thing).
e.g. a scale that measures a consistent weight each time is considered reliable.
Define validity
The extent to which a test measures the construct it is intended to measure. e.g. a scale measures weight and nothing else; an IQ test measures intelligence
Why are reliability and validity important?
- diagnosis
- assessment of ability
- treatment decisions and monitoring outcomes
- research
True or false, tests can be reliable without being valid.
TRUE
tests can consistently produce the same results but not accurately measure what you want them to.
True or false, tests can be valid without being reliable.
FALSE
Tests cannot be valid without being reliable.
Describe classical test theory (Charles Spearman)
States that test scores are the result of:
- factors which contribute to consistency - stable under examination (“True Scores”)
- factors which contribute to inconsistency - characteristics of test taker, or situation that are not related to the characteristic being tested (errors of measurement/ confounders)
What is the formula for test theory?
X = T + e
X= obtained score T= true score e= errors of measurement
e.g. Anxiety score on test =(true) anxiety + error
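The X = T + e decomposition can be illustrated with a small simulation (the numbers are hypothetical, chosen only for illustration): because errors of measurement are assumed to be random, they average out across many administrations, leaving the true score.

```python
import random

random.seed(1)

true_anxiety = 30.0  # T: the stable "true score"
# X = T + e: each observed score is the true score plus random error
observed = [true_anxiety + random.gauss(0, 4) for _ in range(1000)]

# Random errors cancel out over many administrations,
# so the mean observed score approaches the true score:
mean_observed = sum(observed) / len(observed)
print(round(mean_observed, 1))  # close to 30
```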
List the different sources of error
Item selection
Test administration
Test scoring
Systematic measurement error
Describe the following source of error: item selection
sample of items chosen may not be reflective of every individual’s true score
Describe the following source of error: test administration
general environmental conditions e.g. temperature, lighting, noise, states/mood of the test taker
Describe the following source of error: test scoring
Subjectively scored tests e.g. projective tests and essay exams
Describe the following source of error: systematic measurement error
test may consistently tap into something other than the attribute being tested
e.g. a test of introversion may actually tap aspects of social anxiety without the test developer knowing
Explain domain sampling theory
Central concept in classical test theory;
With Domain Sampling, tests are constructed by randomly selecting a specified number of measures from a homogeneous, infinitely large pool.
A sample of items is reliable to the extent that the score it produces correlates highly with the true score (the score across the whole domain)
Are longer tests more reliable?
Technically, yes: according to domain sampling theory, a longer test includes more items from the “universe” of possible items and thus samples more aspects of the construct.
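This can be quantified with the general Spearman-Brown prophecy formula, which predicts the reliability of a test lengthened k-fold (the .60 starting reliability below is a hypothetical figure):

```python
def spearman_brown(r, k):
    """Predicted reliability when a test is lengthened k-fold.
    r: reliability of the original test, k: length multiplier."""
    return k * r / (1 + (k - 1) * r)

# Doubling a test whose reliability is .60:
print(round(spearman_brown(0.60, 2), 2))  # 0.75
# Tripling it raises the estimate further:
print(round(spearman_brown(0.60, 3), 2))  # 0.82
```

Note the diminishing returns: each added item improves the estimate by less than the one before.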
What are two elements of reliability that are observed/tested?
Stability over time - extent to which test remains stable when it is administered on more than one occasion
Internal consistency - extent to which a psychological test is homogeneous or heterogeneous
Describe the test-retest (stability) measure of evaluating reliability.
same test administered to same group twice at two different time points.
What are considerations/ limitations for the test-retest measure?
- look for a strong correlation between the 2 sets of scores
- consider time lapse between test administrations
- practice effects, maturation, treatment effects/ setting all impact scores
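The test-retest coefficient is simply the Pearson correlation between the two administrations. A minimal sketch with hypothetical scores for 8 test-takers:

```python
# Hypothetical scores for the same 8 people at two time points:
time1 = [12, 18, 9, 22, 15, 11, 20, 17]
time2 = [13, 17, 10, 21, 14, 12, 19, 18]

n = len(time1)
mean1, mean2 = sum(time1) / n, sum(time2) / n
cov = sum((x - mean1) * (y - mean2) for x, y in zip(time1, time2))
sd1 = sum((x - mean1) ** 2 for x in time1) ** 0.5
sd2 = sum((y - mean2) ** 2 for y in time2) ** 0.5
r = cov / (sd1 * sd2)  # Pearson's r: the test-retest reliability estimate
print(round(r, 2))     # a high r suggests stability over time
```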
Test-retest is an appropriate measure for I___ and E___
It is inappropriate for S___ A__ and W__ of a b_
Intelligence and Extraversion (stable over time)
State anxiety and weight of a baby
Describe the parallel or alternative forms measure of evaluating reliability.
two forms of the same test developed; different items selected according to the same rules. e.g. alternative exam for PSY3041
please select one of the two options:
- same
- different
parallel forms have ___ distribution across scores (means and variance equal)
same
please select one of the two options:
- same
- different
alternate forms have ___ distribution of scores
different
means and variance may not be equal
What are the similarities between parallel and alternate forms of reliability?
- both are matched for content and difficulty
- stable construct required
- two tests administered to the same group (looking for strong correlations between the versions)
- influenced by changes between testing times e.g. fatigue
- additional source of error: item sampling/ slightly diff items
Describe the split half method of evaluating reliability.
test is divided into halves which are compared (randomly split, odd-even system or top vs bottom).
rationale: if scores on 2 half tests from a single administration are highly correlated, scores on 2 whole tests from separate administrations should be highly correlated
- estimates of reliability will be smaller because smaller number of items
What is the purpose of the Spearman-Brown formula/ correction?
As reliability based on the split half is smaller due to the smaller number of items, the Spearman-Brown formula is applied to estimate the reliability the test would have if each half were the same length as the full test.
-internal consistency tested
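The two steps above can be sketched together: correlate an odd-even split, then apply the Spearman-Brown correction (the k = 2 case). Item scores below are hypothetical:

```python
# Hypothetical scores on a 6-item test; rows = people, columns = items:
scores = [
    [3, 4, 3, 5, 4, 4],
    [1, 2, 2, 1, 2, 1],
    [4, 5, 5, 4, 5, 5],
    [2, 2, 3, 2, 2, 3],
    [5, 4, 5, 5, 4, 4],
]
odd = [sum(p[0::2]) for p in scores]   # half-test score: items 1, 3, 5
even = [sum(p[1::2]) for p in scores]  # half-test score: items 2, 4, 6

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

r_half = pearson(odd, even)
r_full = 2 * r_half / (1 + r_half)  # Spearman-Brown correction to full length
print(round(r_half, 2), round(r_full, 2))
```

The corrected value is always higher than the raw half-test correlation (for 0 < r < 1), reflecting the full test's greater length.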
Which reliability coefficient is used to measure internal consistency?
Cronbach’s alpha - a generalised reliability coefficient for scoring systems that are graded for each item.
- it is the mean of ALL possible split-half correlations, corrected by the spearman-brown formula
- ranges from 0 (no similarity ) to 1 (perfectly identical).
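Cronbach's alpha can also be computed directly from item and total-score variances: alpha = (k / (k - 1)) * (1 - sum of item variances / variance of total scores). A sketch with hypothetical responses:

```python
# Hypothetical responses to a 4-item scale; rows = people, columns = items:
data = [
    [3, 4, 3, 4],
    [1, 2, 2, 1],
    [4, 5, 5, 4],
    [2, 2, 3, 2],
    [5, 4, 5, 5],
]
k = len(data[0])  # number of items

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)  # sample variance

item_vars = [variance([row[i] for row in data]) for i in range(k)]
total_var = variance([sum(row) for row in data])  # variance of total scores

alpha = (k / (k - 1)) * (1 - sum(item_vars) / total_var)
print(round(alpha, 2))
```

Here alpha comes out above .90, which by the guideline on the next card may indicate item redundancy.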
What are acceptable levels of reliability?
.70 - .80 acceptable or good
greater than .90 may indicate redundancy of items –> high reliability is important in clinical settings when making decisions e.g. decision making capacity assessment
What is the standard error of measurement? (SEM)
allows for the estimation of the precision of a specific (individual) test score.
The larger the SEM the less certain we are that the test score represents the true test score.
- Confidence intervals (CI) are often used
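The SEM is computed from the test's standard deviation and reliability: SEM = SD * sqrt(1 - reliability). A sketch using hypothetical IQ-style figures (SD = 15, reliability = .91):

```python
import math

def sem(sd, reliability):
    """Standard error of measurement: SD * sqrt(1 - r)."""
    return sd * math.sqrt(1 - reliability)

s = sem(15, 0.91)               # hypothetical: SD = 15, reliability = .91
score = 100                     # an individual's obtained score
ci_95 = (score - 1.96 * s, score + 1.96 * s)  # 95% confidence interval
print(round(s, 1), [round(b, 1) for b in ci_95])  # 4.5 [91.2, 108.8]
```

Note that a perfectly reliable test (r = 1) would give SEM = 0, i.e. the obtained score would equal the true score.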
A way of conceptualising validity is:
A test is valid to the extent that inferences made from it are, a___, m___ and u__.
A test is valid to the extent that inferences made from it are, APPROPRIATE , MEANINGFUL and USEFUL.
- validity must be relevant to the CONTEXT and POPULATION in which you are measuring a construct.
What are the types of validity evidence?
- Face validity
- Content validity
- Criterion-related validity
  - Predictive evidence
  - Concurrent evidence
- Construct validity
  - Convergent evidence
  - Discriminant evidence
Explain face validity and how it’s measured
Does the test look like it measures the relevant construct?
- social acceptability issue
- must be an obvious link between construct and test items
- assessed using test-taker’s opinion
Explain content validity and how it’s measured
The extent to which the items on a test represent the universe of behaviour the test was designed to measure
e.g. anxiety test examines all aspects of anxiety not just affect
- sampling issue
- commonly used in vocational settings/ achievement
- assessed by experts’ opinions on the subject matter - logical deduction
What are some issues for content-related validity
Construct underrepresentation
- failure to capture an important component of a construct e.g. depression scale which only measures emotions and thoughts but not behaviour
Construct- irrelevant variance
- measuring things other than desired construct
- e.g. wording of scale may cause people to answer in a socially desirable way
Explain criterion-related validity
extent to which a measure is related to an outcome (criterion)
e.g. high school marks used to predict university performance or relationship satisfaction used to predict separation
What is concurrent evidence? (criterion-related validity)
a comparison between measure in question and an outcome assessed at the same time
What is predictive evidence? (criterion-related validity)
How well a test predicts performance on a criterion. Compares measure in question with an outcome assessed at a later time. e.g. ATAR score used to predict uni marks
What is construct validity?
A multi-faceted process concerned with establishing how well a test measures a psychological construct.
What is convergent evidence? (construct validity)
The degree to which two constructs that should be theoretically related are actually related. e.g. the relationship between low self-esteem and depression
- correlate test scores between the two measures
What is discriminant evidence? (construct validity)
a.k.a divergent evidence
aims to demonstrate that the test is unique. Low correlations should be observed with constructs which are unrelated to what the test is trying to measure
- also want to discern between similar but different constructs e.g. self-esteem and self efficacy
e.g. scores on an anxiety measure should be distinct from depression scores if both are assessed within the same test.
What is factor analysis? (construct validity)
a statistical technique used to observe patterns of correlation among test items.
- some items within a test may be highly related and form a set; whereas others may not be related to these and form a different set.
Explain the two methods of factor analysis
Exploratory factor analysis
- don’t know how many underlying constructs/ clusters will be formed
Confirmatory factor analysis
- pre-developed test e.g. DASS, analysis occurs to confirm that there are actually multiple factors being measured
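The underlying idea can be sketched by simulating two hypothetical latent factors and showing that items cluster by the factor they load on (this illustrates the correlation pattern factor analysis looks for, not the full estimation procedure):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500  # simulated test-takers

# Two hypothetical latent factors, e.g. "anxiety" and "self-esteem":
f1 = rng.normal(size=n)
f2 = rng.normal(size=n)

# Items 0-2 load on factor 1, items 3-5 on factor 2, plus random noise:
items = np.column_stack(
    [f1 + rng.normal(scale=0.5, size=n) for _ in range(3)]
    + [f2 + rng.normal(scale=0.5, size=n) for _ in range(3)]
)

r = np.corrcoef(items, rowvar=False)  # 6x6 item correlation matrix
# Items sharing a factor correlate highly; items on different factors do not:
print(round(r[0, 1], 2), round(r[0, 4], 2))
```

Factor analysis works backwards from such a correlation matrix to recover the sets ("factors") that generated it.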