Reliability and Validity Flashcards
1
Q
Reliability
A
- Reliability refers to the consistency of a measure: a reliable instrument produces consistent scores when the underlying construct has not changed.
- Psychologists consider three types of reliability:
- Over time (test-retest reliability)
- Across items (internal consistency)
- Across different researchers (inter-rater reliability)
Reliability is commonly quantified using correlation coefficients, although several different methods are available.
2
Q
Test-Retest reliability
A
- One way to think about reliability is that a person should get the same score on a questionnaire if they complete it at two different points in time.
- A reliable instrument will produce similar scores at both points in time.
- When researchers measure a construct that they assume to be consistent across time, then the scores they obtain should also be consistent across time.
- Test-retest reliability is the extent to which this is actually the case.
- So the question we need to ask is: does the tool give the same measurement both times it is administered to an individual?
- The problem for many tools is that the second administration does not take place under exactly the same conditions (for example, the person may remember the items or have practised the task).
- Someone who is scared of R would score highly on an R-phobia scale and they should also score similarly highly on the R-phobia scale if we test them a month later.
- Intelligence is thought to be consistent across time.
- For example, a person who is intelligent would score highly on an IQ test today and they should also score similarly highly on the IQ test if we test them a month later.
- Assessing test-retest reliability requires using the measure on a group of people at one time and using it again on the same group of people at a later time, and then looking at the test-retest correlation between the two sets of scores.
- This is typically done by computing Pearson’s r (both r and the ICC are demonstrated in the sketch at the end of this card).
- In general, a test-retest correlation of +.80 or greater is considered to indicate good reliability.
- Distance matters!
- We need to think of absolute agreement, not just association: two sets of scores can correlate perfectly while one is systematically higher than the other.
- Intraclass correlations (ICCs) look at the absolute agreement between variables.
- The ICC assesses how much of the total variance in the data is due to differences between individuals rather than error.
- ICC close to 1: most variation comes from differences between individuals.
- ICC close to 0: most variation comes from measurement error or differences within individuals.
- ICC benchmarks:
- < 0.5: poor reliability
- 0.5 to 0.75: moderate reliability
- 0.75 to 0.9: good reliability
- > 0.9: excellent reliability
- High test-retest correlations make sense when the construct being measured is assumed to be consistent over time, which is the case for constructs such as intelligence and self-esteem.
- However, some constructs are not assumed to be stable over time.
For example, because mood changes over time, a low test-retest correlation over a period of a month would not be a problem for a measure of mood.
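A minimal sketch of both checks in R, assuming the psych package; the scores are simulated, so the variable names and values are made up:

```r
# Test-retest reliability: same people, same measure, two occasions.
library(psych)

set.seed(123)
time1 <- rnorm(50, mean = 25, sd = 5)         # simulated scores at time 1
time2 <- time1 + rnorm(50, mean = 0, sd = 2)  # similar scores a month later

# Test-retest correlation (Pearson's r): association between the two occasions
cor(time1, time2)

# Intraclass correlation: absolute agreement, not just association.
# psych::ICC() expects one row per person and one column per occasion.
ICC(data.frame(time1, time2))
```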
3
Q
Internal consistency
A
- Internal consistency is the consistency of people’s responses across the items on a multiple-item measure.
- In general, all the items on such measures are supposed to reflect the same underlying construct.
- Thus, people’s scores on a set of items should be correlated with each other.
- On a scale measuring loneliness, for example, people who agree that they feel left out of things should also agree that they feel alone.
- If people’s responses to the different items are not correlated with each other, then it would no longer make sense to claim that they are all measuring the same underlying construct.
On the Rosenberg Self-Esteem Scale, people who agree that they are satisfied with themselves should also agree that they have a positive attitude toward themselves. A quick way to check this pattern is to inspect the inter-item correlations, as in the sketch below.
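A minimal sketch in R with simulated items (the data are made up):

```r
# Items that tap the same construct should correlate positively with
# each other; here each simulated item is the construct plus noise.
set.seed(1)
latent <- rnorm(100)  # simulated underlying construct
items <- data.frame(
  item1 = latent + rnorm(100, sd = 0.5),
  item2 = latent + rnorm(100, sd = 0.5),
  item3 = latent + rnorm(100, sd = 0.5)
)

round(cor(items), 2)  # inter-item correlations should all be sizeable
```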
4
Q
Internal consistency: split-half method
A
- Internal consistency can be assessed using the split-half method.
- This method involves splitting the items on a questionnaire into two halves with each half measuring the same elements but in slightly different ways.
- For example, the items could be split into two sets such as the first and second halves of the items or the even- and odd-numbered items.
- Then a score is computed for each set of items and the relationship between the two sets of scores is examined.
- If a scale is very reliable, a person’s score on one half of the scale should be the same as (or similar to) their score on the other half. Thus, across several participants, scores from the two halves of the questionnaire should correlate perfectly (or at least very highly)!
- The correlation between the two halves is the statistic computed in the split-half method, with large correlations being a sign of reliability.
- A split-half correlation of +.80 or greater is generally considered good internal consistency.
The problem with this method is that a set of items can be split into two halves in many different ways, so the result can depend on how the split was made. The sketch below shows a single odd/even split alongside an average over many possible splits.
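A minimal sketch of the split-half method in R with simulated items, assuming the psych package for the many-splits version:

```r
set.seed(2)
latent <- rnorm(100)
items <- as.data.frame(replicate(10, latent + rnorm(100, sd = 0.7)))

# One particular split: odd- versus even-numbered items
odd_half  <- rowSums(items[, seq(1, 10, by = 2)])  # items 1, 3, 5, 7, 9
even_half <- rowSums(items[, seq(2, 10, by = 2)])  # items 2, 4, 6, 8, 10
cor(odd_half, even_half)                           # split-half correlation

# psych::splitHalf() averages over many possible splits, which avoids
# the result depending on one arbitrary choice of split.
library(psych)
splitHalf(items)
```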
5
Q
Internal consistency: Cronbach’s α
A
- The most common measure of internal consistency is a statistic called Cronbach’s α.
- Cronbach’s alpha refers to how closely related a set of items are as a group.
- Extent to which different items on the same test (or the same subscale on a larger test) correlate with each other.
- Alpha coefficient ranges from 0 to 1: the higher the score, the more reliable the scale is.
- A value of +.70 or greater is generally taken to indicate good internal consistency (Kline, 1999).
- Problems with Cronbach’s alpha:
- It is a lower-bound estimate, meaning it gives the lowest (most pessimistic) estimate of reliability.
- It assumes tau-equivalence, i.e. that every item reflects the same true score equally (all items have the same factor or component loadings). This is unlikely in practice, and violations can reduce alpha estimates by up to 11%.
- More questions = higher alpha:
The Three-Factor Eating Questionnaire (TFEQ) (24, 25) is a 51-item self-report questionnaire that assesses restraint, disinhibition, and susceptibility to hunger. The reliability of the total measure in the sample studied was α = 0.90, and the reliabilities of the restraint, disinhibition, and hunger subscales were 0.77, 0.84, and 0.85, respectively. Note that the 51-item total scale beats each of the much shorter subscales partly just because it contains more items.
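A minimal sketch of Cronbach’s alpha in R, assuming the psych package and a hypothetical data frame with one column per item:

```r
library(psych)

set.seed(3)
latent <- rnorm(100)
items <- as.data.frame(replicate(8, latent + rnorm(100, sd = 0.7)))

# psych::alpha() reports raw_alpha; >= .70 is conventionally "good".
# The explicit psych:: prefix avoids a clash with ggplot2::alpha().
psych::alpha(items)
```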
6
Q
Reliability thresholds
A
- 1: perfect reliability
- ≥ 0.9: excellent reliability
- 0.8 to < 0.9: good reliability
- 0.7 to < 0.8: acceptable reliability
- 0.6 to < 0.7: questionable reliability
- 0.5 to < 0.6: poor reliability
- < 0.5: unacceptable reliability
0: no reliability.
7
Q
McDonald’s omega
A
- McDonald’s Omega works in very much the same way as Cronbach’s alpha but does not require tau equivalence, so it works even when items vary in their contribution to the total score.
The practice code this week includes code for both Cronbach’s alpha and McDonald’s Omega.
8
Q
Omega total vs omega hierarchical
A
- Omega Hierarchical (ωh) : assesses the extent to which variance on a measure is due to a general factor (g). For example, an intelligence measure may have discrete factors (spatial intelligence, emotional intelligence etc.) but should also tap into a general factor (intelligence).
- Omega Total (ωt): assesses reliability for all factors (general and all other factors).
You can compare the two to get an idea of what is more important: the general factor or the subfactors. The sketch below shows both.
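A minimal sketch in R, assuming the psych package; psych::omega() fits a factor model (so tau equivalence is not required) and reports both Omega Hierarchical and Omega Total:

```r
library(psych)

# Simulate nine items loading on one general factor plus three group factors
set.seed(4)
g <- rnorm(200)                      # general factor
f <- replicate(3, rnorm(200))        # three group factors
items <- as.data.frame(sapply(1:9, function(i)
  g + f[, ceiling(i / 3)] + rnorm(200, sd = 0.8)))

# Compare "Omega Hierarchical" with "Omega Total" in the output:
# similar values suggest the general factor dominates; a much larger
# Omega Total suggests the group (sub)factors carry real variance too.
omega(items, nfactors = 3)
```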
9
Q
Inter-rater reliability
A
- Inter-rater reliability is the extent to which different observers are consistent in their judgments.
- Inter-rater reliability is assessed using Cronbach’s α and ICCs when the judgments are quantitative, or Cohen’s κ when the judgments are categorical (both cases are sketched at the end of this card).
- Inter-rater reliability would have been measured in Bandura’s Bobo doll study.
- The observers’ ratings of how many acts of aggression a particular child committed while playing with the Bobo doll should have been highly positively correlated.
- Imagine you are interested in cultural differences in university students’ expressiveness. You could record interactions between American students and British students. Then you could have two or more observers watch the videos and rate each student’s level of expressiveness (e.g., types of facial expression, arm movement, gesticulation, etc.). The observers’ ratings should be highly positively correlated.
- Work at the University of Liverpool has explored terrorists under police interview.
- Their interactions were coded based on interpersonal behaviour circles.
Two coders carry out this coding, and we check that they are consistent.
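A minimal sketch of both cases in R, assuming the psych package; the ratings are invented for illustration:

```r
library(psych)

# Quantitative judgments (e.g., two observers counting aggressive acts):
# rows are children, columns are raters -> intraclass correlation.
rater1 <- c(12, 5, 9, 14, 3, 8, 11, 6)
rater2 <- c(11, 6, 9, 13, 4, 8, 12, 5)
ICC(data.frame(rater1, rater2))

# Categorical judgments (e.g., coding each act as verbal or physical):
# Cohen's kappa on the two coders' contingency table.
codes1 <- c("verbal", "physical", "verbal", "verbal", "physical", "verbal")
codes2 <- c("verbal", "physical", "physical", "verbal", "physical", "verbal")
cohen.kappa(table(codes1, codes2))
```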
10
Q
Validity
A
- Validity is the extent to which the scores from a measure represent the variable they are intended to measure.
- Essentially, validity is concerned with ascertaining if something does what it is supposed to do.
- It therefore represents the truthfulness of a measure.
- There are three basic kinds:
- Face validity
- Content validity
Criterion validity
11
Q
Face validity
A
- Face validity is the extent to which a measurement method appears “on its face” to measure the construct of interest.
- It is a simple test of validity although it is largely subjective.
- Face validity is usually assessed informally. It could be assessed as part of the pilot stage.
- For instance, a researcher could ask pilot participants: “does this measure appear to measure x?”
- Most people would expect a self-esteem questionnaire to include items about whether they see themselves as a person of worth and whether they think they have good qualities. Thus, a questionnaire that included these kinds of items would have good face validity.
- Consider the 10-item conscientiousness subscale from Goldberg (1992).
The item ‘get upset easily’ is not face valid for this subscale, as it relates to emotional stability (neuroticism).
12
Q
Content validity
A
- Content validity is the extent to which a measure “covers” the construct of interest.
- So we need to ask the question: does the measure include all necessary items to measure the concept in question?
- Content validity is assessed by carefully checking the measurement method against the conceptual definition of the construct.
- A measure of Facebook addiction should include all necessary questions to assess Facebook addiction as outlined in the literature.
If a researcher defines a behavioural addiction in line with the six addiction components (i.e., tolerance, salience, mood modification, withdrawal, relapse, and conflict), then a new measure of Facebook addiction should include at least one item reflecting each of the six elements in order to ensure its content validity.
13
Q
Criterion validity
A
- Criterion validity is the extent to which people’s scores on a measure are correlated with other variables (known as the criteria) that one would expect them to be correlated with.
- A criterion can be any variable that one has reason to think should be correlated with the construct being measured.
- You would expect test anxiety scores to be positively correlated with general anxiety and with blood pressure during an examination.
- You would expect test anxiety scores to be negatively correlated with exam performance and course grades.
- Imagine a researcher develops a new measure of adolescent online risk taking. Adolescents’ scores should be correlated with their participation in “risky” online activities such as speaking to strangers and gambling.
- Or imagine that a researcher develops a new measure of physical risk taking. People’s scores on this measure should be correlated with their participation in “risky” activities such as snow-boarding and bouldering.
- When the criterion is measured at the same time as the construct, criterion validity is referred to as concurrent validity; however, when the criterion is measured at some point in the future (after the construct has been measured), it is referred to as predictive validity (because scores on the measure have “predicted” a future outcome).
- Concurrent validity = how well does the test correlate with other established tests, at around the same time.
- Predictive validity = how well does the test predict something in the future such as job performance or degree grade.
- Criteria can also include other measures of the same construct. This is known as convergent validity.
- For example, one would expect new measures of test anxiety to be positively correlated with existing measures of the same construct.
This is simply a correlational test.
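A minimal sketch of a concurrent/convergent check in R; the variables are simulated, so the names and effect sizes are made up:

```r
set.seed(5)
test_anxiety     <- rnorm(80)
general_anxiety  <- 0.6 * test_anxiety + rnorm(80, sd = 0.8)   # expect positive r
exam_performance <- -0.5 * test_anxiety + rnorm(80, sd = 0.9)  # expect negative r

cor.test(test_anxiety, general_anxiety)   # criterion it should relate to
cor.test(test_anxiety, exam_performance)  # criterion it should relate to inversely
```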
14
Q
Discriminant validity
A
- Discriminant validity is the extent to which scores on a measure are not correlated with measures of variables that are conceptually distinct (i.e., the degree to which items designed to measure different constructs discriminate between each other).
- A new scale should not correlate with other scales designed to measure a different construct.
This is simply a correlational test.
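A minimal sketch in R; here the two simulated scales share no variance, so the expected correlation is near zero:

```r
set.seed(6)
new_scale           <- rnorm(80)
unrelated_construct <- rnorm(80)  # conceptually distinct measure

cor.test(new_scale, unrelated_construct)  # expect a small, non-significant r
```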