Lecture 12: Reliability and Validity Flashcards
Observed vs latent constructs
Constructs (recap) = an abstract feature of interest within a population, such as intelligence, perseverance, or education
Observed constructs = constructs that can be measured directly (e.g., height, weight, age, number of visits to the gym).
Latent constructs = constructs that are measured indirectly (e.g., through observed indicators, which can be questions): attitudes, opinions (inside a participant’s head). These constructs require an operational definition, for example to determine what counts as “successful” or “extraverted”. Together, several observed indicators help us capture the underlying latent construct.
Measurement error
The difference between the “true score” and the observed response
E.g., the question is “Do you enjoy talking to people?”
This response can be affected by random factors (e.g., a sleepless night) or by the questionnaire itself => everyone interprets the question somewhat differently, and since extraversion is a broad construct, a person’s extraversion can easily be over- or underestimated.
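A toy simulation of this idea (in the spirit of classical test theory: observed score = true score + random error; all numbers below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

true_score = 4.0                      # a participant's "true" extraversion (hypothetical)
error = rng.normal(0, 0.8, size=10)   # random error, e.g., sleepless nights, item wording
observed = true_score + error         # observed responses over 10 hypothetical occasions

print(observed.round(2))         # individual scores scatter around the true score
print(observed.mean().round(2))  # averaging reduces the impact of random error
```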
Contra-indicative items
Statements whose wording is aligned with the construct are indicative.
Statements whose wording is opposed to the construct are contra-indicative OR reverse coded.
For example, if you measure deceitfulness, the statement “Honesty is the best policy in all cases” is contra-indicative, so it must be reverse coded.
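A minimal sketch of reverse coding, assuming a 1–5 response scale (reversed score = min + max − score):

```python
def reverse_code(score, low=1, high=5):
    """Reverse-code a contra-indicative item, e.g., 5 -> 1, 4 -> 2, ..."""
    return low + high - score

print(reverse_code(5))  # 1: strong agreement with "Honesty is the best policy..."
print(reverse_code(2))  # 4: maps onto a higher deceitfulness score
```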
Sum scores
X_sum = ∑_{i=1}^{k} X_i
∑_{i=1}^{k} = the sum over all k items in the scale
X_i = the score on item i
Every question is given a value (e.g., 1-3). After all the questions are answered, the values are added up to give a sum score
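For instance, a toy computation with made-up responses to k = 4 items scored 1–3:

```python
items = [2, 3, 1, 3]   # one participant's scores on k = 4 items
x_sum = sum(items)     # X_sum = sum of the item scores
print(x_sum)           # 9
```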
Limitations of sum scores
The value of the sum depends on the number of items answered: if a person has a missing value on one item, they receive no points for that item and therefore score lower, even if that does not reflect their true standing. This limitation can be addressed by calculating a mean score, whose value does not depend on the number of items, so missing values do not drag the score down (see the sketch after the list below).
However, two limitations remain:
1) Each item is still considered to be equally important
2) Measurement error is still ignored
Mean scores are therefore somewhat better than sum scores, but not perfect.
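A minimal sketch of the missing-value problem, assuming a 4-item scale and made-up responses (np.nan marks the missing item):

```python
import numpy as np

# Responses of two participants to a 4-item scale (toy data)
p1 = np.array([2, 3, 1, 3])
p2 = np.array([2, 3, np.nan, 3])

print(np.nansum(p1), np.nansum(p2))    # 9 vs 8: the sum penalises the missing item
print(np.nanmean(p1), np.nanmean(p2))  # 2.25 vs ~2.67: the mean does not
```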
Reliability
Does the instrument consistently measure the same thing?
Test-retest reliability: are individuals’ scores similar across multiple occasions? (E.g., is your score on extraversion still the same in a month)
Internal consistency: are scores across different questions similar for the same individual?
Inter-rater reliability: do different raters report the same score for the same thing?
E.g., when you ask 2 classmates to rate you on a scale of extroversion, do they answer the same?
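As a rough sketch, test-retest reliability can be estimated as the correlation between two measurement occasions; the scores below are made up for illustration:

```python
from scipy.stats import pearsonr

# Extraversion scores for six participants, measured one month apart (toy data)
time1 = [3.2, 4.1, 2.5, 3.8, 4.6, 2.9]
time2 = [3.0, 4.3, 2.7, 3.5, 4.4, 3.1]

r, p = pearsonr(time1, time2)
print(round(r, 2))  # a high r => scores are stable across occasions
```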
Limitations of test-retest reliability
Learning effects: after participants have been exposed to your questionnaire once, they can react to it differently the second time
Memory effects => participants’ scores stay the same because they remember the question (and their previous answer), not because their characteristics have necessarily stayed the same
People change over time => therefore, you should find an interval that is long enough to minimise learning and memory effects, but short enough that participants have not genuinely changed.
Internal consistency
Measures the association among items within a test. It can be estimated using methods such as split halves or Cronbach’s Alpha.
Split halves
1) Split the test in two halves
2) Correlate scores of first half with second half
3) Apply the Spearman-Brown correction to estimate the reliability of the entire test from the correlation r between the two halves (see the sketch below):
r’ = 2r / (1 + r)
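A sketch of these steps, assuming an odd-even split of the items and made-up responses:

```python
import numpy as np

# Item responses: rows = participants, columns = items (toy data)
X = np.array([[2, 3, 2, 3],
              [1, 1, 2, 1],
              [3, 3, 3, 2],
              [2, 2, 1, 2],
              [3, 2, 3, 3]])

half1 = X[:, ::2].sum(axis=1)   # odd-numbered items
half2 = X[:, 1::2].sum(axis=1)  # even-numbered items

r = np.corrcoef(half1, half2)[0, 1]  # correlation between the two halves
r_full = 2 * r / (1 + r)             # Spearman-Brown correction
print(round(r_full, 2))
```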
Cronbach’s Alpha
Estimates internal consistency (if you are measuring the same thing across all your questions).
Formula: α = (k · c̄) / (v̄ + (k − 1) · c̄), i.e., the number of items k times the average inter-item covariance c̄, divided by the average item variance v̄ plus (k − 1) times the average covariance (see the sketch after the rules of thumb below).
If you want to calculate Cronbach’s Alpha, contra-indicative items must be reverse coded.
Alpha increases with number of items + when items are more similar (this might lower content validity if our items are very similar).
Rules of thumb:
> .90 Excellent
> .70 Acceptable
> .60 Questionable
< .50 Unacceptable
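A from-scratch sketch of Cronbach’s Alpha, using the equivalent variance-based form α = k/(k − 1) · (1 − Σ item variances / variance of the sum score); the data are made up, and contra-indicative items are assumed to be reverse coded already:

```python
import numpy as np

def cronbach_alpha(X):
    """Cronbach's alpha for an item matrix (rows = participants, cols = items)."""
    X = np.asarray(X, dtype=float)
    k = X.shape[1]
    item_vars = X.var(axis=0, ddof=1).sum()  # sum of the item variances
    total_var = X.sum(axis=1).var(ddof=1)    # variance of the sum score
    return (k / (k - 1)) * (1 - item_vars / total_var)

X = [[2, 3, 2, 3],
     [1, 1, 2, 1],
     [3, 3, 3, 2],
     [2, 2, 1, 2],
     [3, 2, 3, 3]]
print(round(cronbach_alpha(X), 2))  # ~0.81 for this toy data
```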
Item-total correlation
We could compute a total scale score by adding or averaging item responses. Each item ought to measure the same construct as this total score. We can then observe the correlation between each item and the total score.
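A minimal sketch with made-up data, correlating each item with the total score as described above:

```python
import numpy as np

# Item responses: rows = participants, columns = items (toy data)
X = np.array([[2, 3, 2, 3],
              [1, 1, 2, 1],
              [3, 3, 3, 2],
              [2, 2, 1, 2],
              [3, 2, 3, 3]], dtype=float)

total = X.sum(axis=1)  # total scale score per participant
for i in range(X.shape[1]):
    r = np.corrcoef(X[:, i], total)[0, 1]
    print(f"item {i + 1}: r = {r:.2f}")  # a low r => the item may not fit the construct
```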
Validity
Does the instrument measure what it intends to measure?
Face validity: at first glance, does the instrument appear to assess the correct construct? Is the wording clear, readable, understandable and unambiguous?
E.g., does “I enjoy swimming” measure extroversion?
Content validity: does the test cover all aspects of the construct?
Example of poor content validity: early intelligence tests. White Americans scored highest => widespread racial discrimination caused people from lower socio-economic backgrounds to score lower on these tests.
Criterion validity: is the test associated with outcomes or indicators of the construct it is designed to measure?
A scale should correlate with another validated scale, a behavioural measure of the construct (GPA correlates with intelligence), or an outcome of the construct (altruism correlates with giving to charity).
=> test of extraversion should predict number of friends
Psychometrics
Aims to measure an underlying latent construct using multiple observed indicators.