Reliability and Validity Flashcards
What is a psychometric test?
A standardised test that uses psychological measurements to quantify a person’s ability, strengths or characteristics
What does a psychometric test consist of?
One or more stimuli that people respond to. Responses can be overt (e.g., key press) or covert (e.g., skin conductance response)
How are psychometric tests standardised?
Data is collected from a large number of people to establish norms, including cut-offs. This allows identification of individuals who fall inside or outside the population norms.
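To make the idea of norms concrete, here is a minimal sketch (not from the cards; the sample data and the 2-SD cut-off are hypothetical) of how a raw score is compared against population norms via a z-score:

import numpy as np

# Hypothetical norm data from a large standardisation sample
rng = np.random.default_rng(0)
norm_scores = rng.normal(loc=100, scale=15, size=10_000)
norm_mean, norm_sd = norm_scores.mean(), norm_scores.std()

# Express an individual's raw score relative to the norms as a z-score
raw_score = 130
z = (raw_score - norm_mean) / norm_sd

# An example cut-off: flag scores more than 2 SDs from the norm mean
print(f"z = {z:.2f}, outside norms: {abs(z) > 2}")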
What are two key properties of a good psychometric test?
Reliability (internal properties of the scale) and validity (external properties of the scale)
What is reliability?
Consistency/stability of a measure across:
- Time
- Setting
- Individuals
‘Quality’ of measurement (the extent to which you can measure a person’s ‘true’ score each time)
What is True Score Theory?
A person’s ‘true’ score reflects their genuine ability, characteristics, or potential. Every observed score is made up of the true score plus some measurement error.
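In standard true-score notation this decomposition is:

Observed score (X) = True score (T) + Measurement error (E)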
What are the two sources of variability (error) in observed scores?
- Variability in ‘true’ scores within a population (individual differences)
- Fluctuations in measurement error (systematic and random)
What are the two types of measurement error that affect reliability?
Systematic error
Random error
How does systematic error affect reliability?
Systematic error consistently affects measurement (bias), e.g.:
- A fire alarm during an exam
- Race/gender bias in the test items
- Driving errors because the test equipment has a faulty controller
Systematic errors shift the mean of the data: everyone's score shifts in the same direction.
How does random error affect reliability?
- Random variations are not consistent across the sample (noise/chance)
- Test takers might be hungry, tired, nervous, etc., but not everyone taking the test might be in the same state
- Random variations increase the variability of the data
- (error is added to or subtracted from each person's true score at random)
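A small simulation (illustrative only; all numbers are hypothetical) makes the contrast concrete: systematic error shifts the mean while leaving the spread alone, whereas random error leaves the mean alone but inflates the variability:

import numpy as np

rng = np.random.default_rng(1)
true_scores = rng.normal(loc=50, scale=10, size=1_000)  # hypothetical 'true' scores

# Systematic error: the same bias applied to everyone (e.g., a fire alarm)
systematic = true_scores - 5
# Random error: independent noise for each person (e.g., hunger, fatigue)
noisy = true_scores + rng.normal(loc=0, scale=5, size=true_scores.size)

print(f"true:       mean={true_scores.mean():.1f}, sd={true_scores.std():.1f}")
print(f"systematic: mean={systematic.mean():.1f}, sd={systematic.std():.1f}")  # mean shifts
print(f"random:     mean={noisy.mean():.1f}, sd={noisy.std():.1f}")            # sd grows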
What is reliability determined by?
How much error variance is in the measurement
How do we calculate reliability?
Reliability = variability of true scores / variability of observed scores
*The greater the denominator relative to the numerator (i.e., the more error variance), the less reliable the measure
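Writing this out with the true-score decomposition (X = T + E, assuming T and E are uncorrelated):

Reliability = Var(T) / Var(X) = Var(T) / (Var(T) + Var(E))

So with no error variance, reliability is 1; as Var(E) grows, the denominator grows and reliability falls towards 0.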
What are the types of reliability?
Test-retest
Alternate/parallel forms
Internal consistency (including coefficient alpha, split-half/inter-item reliability, and item-total reliability)
Inter-rater agreement
What are the types of validity?
Construct validity (including content and face validity)
Criterion validity (including predictive, known groups, convergent and discriminant validity)
What is predictive validity?
The extent to which data from a measure can predict something it should theoretically be able to predict. For example, does a high extraversion score predict the number of friends someone has?
What is convergent validity?
The degree to which multiple measures of the same construct show similar results. For example, do self-report and clinician-rated measures of depression show similar scores?
What is discriminant validity?
The extent to which a measure does NOT relate to constructs it should NOT relate to. For example, a measure of verbal ability should not be strongly correlated with measures of athletic ability.
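In practice, convergent and discriminant validity are often checked with simple correlations. A minimal sketch (all variables and data here are hypothetical):

import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(2)
n = 200
depression = rng.normal(size=n)                           # latent construct
self_report = depression + rng.normal(scale=0.5, size=n)  # measure 1
clinician = depression + rng.normal(scale=0.5, size=n)    # measure 2
athletic = rng.normal(size=n)                             # unrelated construct

r_conv, _ = pearsonr(self_report, clinician)  # convergent: should be high
r_disc, _ = pearsonr(self_report, athletic)   # discriminant: should be near 0
print(f"convergent r = {r_conv:.2f}, discriminant r = {r_disc:.2f}")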
What are the two types of errors in measurement?
Systematic error (bias)
Random error (noise/chance)
What is systematic error?
Error that consistently affects measurement. Examples include a fire alarm during an exam or race/gender bias in test items. Systematic errors shift the mean of the data in the same direction for everyone.
What is random error?
Random variations in scores that don’t have a consistent effect across the sample. Examples include test-takers being hungry, tired or nervous. Random error increases the variability of the data.
What is test-retest reliability?
A type of reliability assessed by administering the same test to the same people on multiple occasions and calculating the correlation between scores. A higher correlation indicates higher reliability.
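Computationally this is just a correlation between the two administrations; a minimal sketch with hypothetical scores:

from scipy.stats import pearsonr

# Hypothetical scores for the same people at time 1 and time 2
time1 = [12, 18, 25, 30, 22, 15, 28, 20]
time2 = [14, 17, 27, 29, 21, 16, 26, 22]

r, p = pearsonr(time1, time2)
print(f"test-retest reliability r = {r:.2f}")  # closer to 1 = more reliable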
What considerations are important for test-retest reliability?
The time interval between tests is important. Longer intervals allow more room for natural change, while shorter intervals can lead to carryover effects. Test-retest reliability is only useful for stable characteristics.
What is alternate/parallel forms of reliability?
A type of reliability assessed by developing multiple versions of a test, administering each version to the same participants on separate occasions, and correlating the results. A higher correlation indicates better reliability.
What is split-half reliability?
A type of reliability assessed by administering a test to participants, splitting the test in half, and computing the correlation between the two halves.
What are the problems with split-half reliability?
The way that the test is split is important, as different splits can yield different reliability values. Reliability is also reduced because the number of items in each half is smaller.
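A minimal sketch of a split-half computation, using an odd/even item split plus the Spearman-Brown correction to compensate for the halved test length (the data are hypothetical, simulated so that the items share a common trait):

import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(3)
trait = rng.normal(size=(100, 1))                       # shared 'true' trait
items = trait + rng.normal(scale=1.0, size=(100, 10))   # 10 noisy items

half1 = items[:, 0::2].sum(axis=1)  # odd-numbered items
half2 = items[:, 1::2].sum(axis=1)  # even-numbered items

r_half, _ = pearsonr(half1, half2)
# Spearman-Brown correction: estimate reliability of the full-length test
r_full = (2 * r_half) / (1 + r_half)
print(f"split-half r = {r_half:.2f}, corrected = {r_full:.2f}")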
What is coefficient alpha?
A measure of internal consistency reliability based on how strongly the test items correlate with one another. The most common version is Cronbach's alpha.
What is a problem with coefficient alpha?
It is sensitive to test length: adding more items can inflate the reliability estimate even if the items are poorly inter-correlated.
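For reference, Cronbach's alpha can be computed directly from its standard formula, alpha = k/(k-1) * (1 - sum of item variances / variance of total scores); a minimal NumPy sketch with hypothetical data:

import numpy as np

def cronbach_alpha(items):
    # Cronbach's alpha for an (n_people, n_items) score matrix
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()  # sum of item variances
    total_var = items.sum(axis=1).var(ddof=1)    # variance of total scores
    return (k / (k - 1)) * (1 - item_vars / total_var)

rng = np.random.default_rng(4)
trait = rng.normal(size=(100, 1))
scores = trait + rng.normal(scale=1.0, size=(100, 8))  # 8 noisy items
print(f"alpha = {cronbach_alpha(scores):.2f}")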
What are the two main types of validity?
Construct validity (how well a measure reflects the intended construct) and criterion validity (how well a measure relates to concrete, observable criteria)
What are the subtypes of construct validity?
Content validity
Face validity
What is content validity?
The extent to which the content of each test item measures the intended construct
What is face validity?
A superficial measure of whether the test appears to measure the intended construct. It can be important because it can affect how respondents approach a test.
What is known-groups validity?
The extent to which a measure differentiates between groups who should theoretically perform differently on it
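One common way to check known-groups validity is to compare group means, e.g., with a t-test; a minimal sketch with hypothetical data:

from scipy.stats import ttest_ind

# Hypothetical anxiety-scale scores for a clinical group and a control group
clinical = [32, 28, 35, 30, 33, 29, 31]
control = [18, 22, 20, 17, 21, 19, 23]

t, p = ttest_ind(clinical, control)
print(f"t = {t:.2f}, p = {p:.4f}")  # a clear difference supports known-groups validity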
What are some good practices for constructing valid test items?
Avoid ambiguous language
Avoid items that might cause response bias
Use a blend of positively and negatively keyed items
Ensure items assess all aspects of a construct
What is social desirability and why is it important to consider in test construction?
The tendency for people to want to present themselves in a positive light. Items that no one wants to rate themselves high or low on tend to be poor items because they fail to capture the range of variability in the construct. These items can be high in reliability but low in validity.