Midterm 2 - Ch. 5 Flashcards
Operational Definitions (in Measurement):
- A concrete way to measure an abstract concept
Quality of operational definitions evaluated by (2):
- Reliability
– Is your measure consistent?
- Construct Validity
– Are you measuring what you hope you’re measuring?
– Accuracy
How are these concepts (reliability & validity) helpful?
Ways to evaluate operational definitions
Especially measurement instruments (e.g., scales, surveys, coding schemes)
Check that reliability & validity have been demonstrated
Which kinds of designs are typically most concerned with demonstrating reliability & validity?
A) Correlational/survey/quasi-experimental designs
B) Experimental designs
- Answer: A, since these designs rely on measured (rather than manipulated) variables
4 components of evaluating operational definitions:
- Reliability
- Construct Validity
- Internal Validity
- External Validity
Reliability:
- Does it measure the construct with little error?
- Is it a stable & consistent measure?
Construct Validity
Are we measuring what we think we’re measuring?
Internal Validity
Can we infer causality?
External Validity
Can we generalize our findings beyond this group and setting?
Reliability - True Scores
Each participant has a true score
- That’s the target, but we can’t observe it
Must rely on measurements, which have “measurement error” (deviation from the target)
A measure is considered reliable if it has relatively little measurement error
What’s the first concern with any measure?
Reliability is your first concern with any measure
If it isn’t measuring the construct consistently, then validity (accuracy) is not even an issue
Types of Reliability:
- Test-retest reliability
- Internal consistency reliability
- Inter-rater reliability
Test-Retest Reliability
Is a participant’s score consistent across time?
- EX: an extrovert at Time 1 should still be an extrovert (socializing, not staying in) at Time 2
Expect a positive linear relationship/correlation between the two sets of scores
- Rule of thumb: minimum r = +.80
*For relatively stable constructs
Internal Consistency Reliability
Is a P’s score on this construct similar across items aimed at measuring related aspects of the construct?
- Items = questions (“is talkative”, “is full of energy”, “is rarely shy”)
From text:
- Split-half
- Cronbach’s alpha
- Item-total correlations
Inter-rater Reliability:
How similar are a participant’s scores when measured by different raters?
Relevant when behaviour is observed or texts are coded by multiple “raters”
Validity
- Are you measuring what you hope you’re measuring?
- Accuracy
- Is it measuring what it’s supposed to?
Components of Construct Validity
- Face Validity
- Content Validity
Face Validity
Look at each item.
Does it look like it’s assessing loneliness?
- If yes, then high face validity
Usually happens, but not a requirement of measures.
Alternative to FV:
– Give a whole bunch of items to a large group, see what predicts loneliness (don’t care why)
Content Validity
Look at the whole measure.
Is it capturing all the important parts of what it means to be lonely, and nothing more?
Theoretical question
Can be debated!
Predictive Validity
Predicts future, conceptually related behaviours
- Q: Do people with high scores on your measure at T1 go on to do relevant behaviours at T2?
Concurrent Validity
Able to distinguish between theoretically relevant behaviours
- Q: Do people with high scores on your measure behave in ways you’d expect them to behave if they were high on this construct?
- Constructs that are supposed to be related, ARE related
Types of Construct Validity - Behaviours
- Predictive
- Concurrent
Types of Construct Validity - Other Constructs
- Convergent
- Discriminant
Convergent Validity
Related to scores on measures of similar constructs
- Q: Do people with high scores on your measure have high (or low) scores on measures of related constructs (i.e., high correlation)?
- Do your happiness scores correlate with other established measures of happiness?
Discriminant Validity
Not related (i.e., low or zero correlation) to what it shouldn’t relate to
- Q: Do people with high scores on your measure randomly vary in how much they show constructs that could be alternative explanations of what your scale is measuring?
SUMMARY - Evaluating a measure
Reliability:
- Test-retest
- Inter-rater
- Internal consistency
Construct Validity:
- Face
- Content
- Predictive
- Concurrent
- Convergent
- Discriminant
External Validity:
- Generalizability
Self-report measures:
Used to study personality traits, in clinical counselling, and for clinical diagnoses
Better to use existing measures/scales than creating your own
- Will have reliability and validity data to help decide which measure to use
- Also able to compare findings with prior research using the same measure
- Can find these using the Mental Measurements Yearbook
READINGS
Reliability - Any measurement involves two components:
- True Score
- Measurement Error
True Score
person’s actual level of the variable of interest
Measurement Error
any contributor to a measure’s score that is not based on the actual level of the variable of interest
- e.g., measuring response time with button pressing (after a beep)
Other factors: someone reacts quickly but presses the wrong button, then has to react again to press the right one (contributes to our measure of reaction time, but unrelated to what we’re truly interested in)
Consider how much measurement error exists in a measure
How do measurement errors affect reliability?
Excess measurement error makes it hard to detect any true relationship between variables, because the amount of true variance being captured is small
Are there measures that DON’T have measurement error?
ALL measures contain some amount of measurement error
Key is to minimize this measurement error and maximize the amount of true score being captured by any particular measure
In many areas, reliability can be increased by…
making multiple observations of the same variable
A method normally found in assessing personality traits and cognitive abilities
Reliability increases as the number of items increases
e.g., a scale with ten or more questions designed to assess the same trait
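A minimal simulation sketch of this point (not from the text; all values are made up): each person's observed score is their true score plus item-level noise, and averaging more items makes the test-retest correlation of the total score climb toward 1.

```python
import numpy as np

rng = np.random.default_rng(0)
n_people = 1000
true = rng.normal(0, 1, n_people)  # unobservable true scores

def scale_score(n_items, noise_sd=2.0):
    """Average of n_items noisy observations of each true score."""
    items = true[:, None] + rng.normal(0, noise_sd, (n_people, n_items))
    return items.mean(axis=1)

for n_items in (1, 5, 10, 20):
    t1, t2 = scale_score(n_items), scale_score(n_items)  # "test" and "retest"
    r = np.corrcoef(t1, t2)[0, 1]
    print(f"{n_items:2d} item(s): test-retest r = {r:.2f}")
```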
How can we know how reliable a measure is?
Can assess the stability of measures using a correlation coefficient: number that tells us how strongly two variables are related to each other
Most common method of calculating correlation coefficients: the Pearson correlation coefficient (lowercase r)
- Ranges from 0.00 to +1.00 when positive, and from 0.00 to −1.00 when negative
- 0 = no relation
- Closer to +1 or −1 = stronger correlation
- Relationships can be positively linear or negatively linear
Test-retest reliability:
Test-retest reliability is assessed by giving many people the same measure twice
- EX: reliability can be assessed by giving it to a group of people on one day and then again a week later
- With two scores for each person, you would calculate a correlation coefficient to determine the relationship between the first/second scores
No agreed-upon cut-off for determining when a correlation is high enough to be acceptable, but some suggest a test-retest correlation of at least .80
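A hedged sketch of that calculation (the scores below are invented for illustration; `scipy.stats.pearsonr` is one standard way to compute r):

```python
from scipy.stats import pearsonr

# Hypothetical scale scores for 8 people, one week apart
time1 = [12, 18, 9, 22, 15, 30, 11, 25]
time2 = [14, 17, 10, 21, 13, 28, 12, 26]

r, p = pearsonr(time1, time2)
print(f"test-retest r = {r:.2f}")  # compare against the .80 rule of thumb
```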
Issues with test-retest reliability:
- Practice effects: correlations between the two scores can become inflated if people are likely to remember how they responded the first time
– SOLUTION - alternate forms reliability: two different forms of the same test are administered on two separate occasions
- Some constructs are relatively stable across time (like personality traits); others aren’t and are expected to change (like mood)
- Obtaining two measures from the same people at different points in time can be difficult
Internal consistency reliability:
Examines how successful the different items in a scale are at measuring the same construct or variable
EX: think of each item as a different attempt to measure the same construct
- When people respond similarly across these different attempts, it suggests that the measure is reliable
Most common indicator of internal consistency
Cronbach’s Alpha
Researcher calculates how well each item correlates with every other item, which produces a large number of inter-item correlations
- Item-total correlations give info about each individual item and its relation to the total score
Items that don’t correlate with the others can be eliminated to increase measure’s reliability
While reliability can increase with longer measures, a shorter version can be more convenient to administer and still have acceptable reliability
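A minimal sketch of computing Cronbach's alpha by hand (hypothetical data; uses the standard formula alpha = k/(k−1) × (1 − sum of item variances / variance of total scores)):

```python
import numpy as np

def cronbach_alpha(items):
    """items: 2-D array with rows = people, columns = scale items."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]                         # number of items
    item_vars = items.var(axis=0, ddof=1)      # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)  # variance of people's total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical 5-item scale (1-5 ratings) answered by 6 people
data = [[4, 5, 4, 4, 5],
        [2, 2, 3, 2, 2],
        [3, 3, 3, 4, 3],
        [5, 4, 5, 5, 4],
        [1, 2, 1, 2, 2],
        [4, 4, 3, 4, 4]]
print(f"alpha = {cronbach_alpha(data):.2f}")
```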
Other forms of internal consistency
A more recent alternative, considered superior - coefficient omega
Another form - split-half reliability: attempts to determine the degree to which all the items in a scale are related to one another
- Split the items in a scale into two parts based on some random process, then administer both halves to a group of people
- After scoring each half, can calculate a correlation to see how well performance on one half is related to performance on the second half
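A sketch of the split-half procedure on simulated data (everything here is invented; the Spearman-Brown step at the end is a common correction for each half being only half as long as the full scale):

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate 50 people answering 10 items that share a true score
true = rng.normal(3, 1, size=(50, 1))
items = true + rng.normal(0, 1, size=(50, 10))

# Randomly split the items into two halves and score each half
order = rng.permutation(items.shape[1])
half1 = items[:, order[:5]].sum(axis=1)
half2 = items[:, order[5:]].sum(axis=1)

r = np.corrcoef(half1, half2)[0, 1]
print(f"split-half r = {r:.2f}")
print(f"Spearman-Brown corrected = {2 * r / (1 + r):.2f}")
```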
Inter-rater reliability:
In some research, raters observe behaviours and make ratings or judgements
- i.e. rating the amount of perceived emotion in someone
To make these ratings, raters follow a strict set of guidelines in order to make these judgements as systematic as possible
- To improve, have more than one person as a rater
Reliability of these ratings can be determined by calculating inter-rater reliability: the extent to which raters agree in their observations - if one rater gives a target a high score, the other raters also rate this behaviour as high
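A minimal sketch for continuous ratings (invented numbers): agreement between two raters can be summarized with a correlation; Cohen's kappa or an intraclass correlation are common alternatives, especially for categorical judgements.

```python
from scipy.stats import pearsonr

# Hypothetical emotion ratings (1-10) of the same 8 video clips by two raters
rater_a = [7, 3, 8, 5, 2, 9, 4, 6]
rater_b = [6, 4, 8, 5, 3, 9, 5, 6]

r, _ = pearsonr(rater_a, rater_b)
print(f"inter-rater r = {r:.2f}")
```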
A measure can be highly reliable, but that doesn’t mean…
it’s measuring what it’s intending to measure
Internal Validity
Degree to which an experiment is well-designed and can support a causal claim
When it comes to operationalization, what’s most relevant is…
Construct validity: whether a variable’s operationalization is accurate in capturing the intended phenomenon
Degree to which the operationalization of a variable reflects the true theoretical meaning of the variable
Variables can be measured and manipulated in a variety of different ways, and there is never a perfect operationalization of a variable
- Thus, different indicators of construct validity are used to build an argument that a construct has been accurately operationalized and is properly measured by a particular scale
Indicators of construct validity:
- How do we know that a measure is a valid indicator of a particular construct?
- We gather construct validity information by examining many different forms of validity
- Helps us build an overall case for the broader category
Face Validity:
The measure appears, “on the face of it”, to measure what it’s supposed to measure - whether it appears to assess the intended variable
- Involves only a judgement of whether the content of the measure appears to measure this variable; subjective process
- i.e. pointless Buzzfeed quizzes
- Not sufficient by itself to conclude that a measure has construct validity
Content validity:
Evaluated by comparing the content of the measure with the theoretical definition of the construct, ensuring that the measure captures all aspects of the construct and nothing extraneous to the construct
- i.e. a construct has 3 different aspects, your scale should try to measure all 3 of these aspects
- Focus on assessing whether the content of a measure reflects the meaning of the construct being measured (like face validity)
Predictive validity:
seeing if the measure can usefully predict some future behaviour that is theoretically related
- EX: academic motivation at the beginning of term to predict final grades at the end of term
- Grades are the standard or criterion by which we are judging the validity of our measure
- Important when studying measures designed to improve our ability to make predictions about different behaviours
Concurrent validity:
similar to predictive validity in that it examines the prediction of a criterion, but instead of a future behaviour it examines a criterion measured at the same time as the measure is administered
- One common method is to study whether two or more groups of people differ on the measure in expected ways
- i.e. psychopaths versus the general public
- Another is studying how people who score either low or high on the measure behave in different situations
Convergent validity:
extent to which scores on the target measure are related to scores on the other measures of the same construct or similar constructs
Different measures of similar constructs should “converge”, or be related to one another
EX: one measure of psychopathy should correlate highly with another psychopathy measure or measures of similar constructs
Discriminant validity:
sort of the opposite of convergent validity, in that it’s a demonstration that the measure is not related to variables that are conceptually unrelated to the construct of interest
The measure should discriminate between the construct being measured and other, unrelated constructs
Scores on the measure should diverge rather than converge with these measurements of unrelated constructs
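A sketch of how this pattern of correlations might be checked (all data simulated): a new measure of a construct should correlate strongly with an established measure of the same construct (convergent) and near zero with a conceptually unrelated variable (discriminant).

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200

trait = rng.normal(0, 1, n)                  # the underlying construct
new_scale = trait + rng.normal(0, 0.5, n)    # our new measure of it
established = trait + rng.normal(0, 0.5, n)  # an established measure of it
unrelated = rng.normal(0, 1, n)              # a conceptually unrelated variable

print("convergent r   =", round(np.corrcoef(new_scale, established)[0, 1], 2))  # should be high
print("discriminant r =", round(np.corrcoef(new_scale, unrelated)[0, 1], 2))    # should be ~0
```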
Reactivity:
A potential issue with measuring behaviours is that people can behave differently when they know they’re being observed
People reacting to the act of measurement and changing their behaviour
- If this occurs, then we’re no longer learning about how someone would behave in the real world, only how they would behave when they know they’re being observed
Measures of behaviour vary in terms of their potential reactivity
How to minimize reactivity
Ways to minimize:
- allowing time for people to become used to the presence of the observer or the recording equipment
- Measure something without that person noticing or knowing (AKA non-reactive or nonobtrusive operationalizations)
- Clever ways of measuring
A variable’s levels can be conceptualized in terms of 4 different kinds of measurement scales:
- Nominal
- Ordinal
- Interval
- Ratio
How can the different measurement scales for variables affect research?
- the conclusions drawn from the research
- options available for establishing construct validity
- the kinds of statistical analyses that are possible and appropriate to use when analyzing your data
Nominal
no numerical or quantitative properties
Instead, categories or groups simply differ from one another
EX: country of birth - people are born in a certain country, and we can classify people based on what country they were born in
- Don’t have numerical properties; one country can’t be “more” or “less” than another: the levels are merely different
EX: attractive vs not attractive
- Doesn’t tell you anything about how attractive they find the person, but tells you whether or not they do find them attractive (like a yes or no)
Ordinal
allows us to order the levels of the variable in terms of rank
Instead of having categories that are different, the categories can be ordered from first to last
- EX: Olympic medals, birth order
However, we don’t know anything about the distance or difference between each element
No particular value is attached to the intervals between the numbers
Interval
differences between the numbers on the scale are equal in size
- EX: difference between 1 and 2 is the same size as the differences between 2 and 3
- An increase of two (e.g., 15 to 17) is the same size anywhere on the scale
- Zero doesn’t indicate a complete absence of quantity; only an arbitrary reference point
- Cannot form ratios based on these numbers (e.g., 20°C is not “twice as hot” as 10°C)
Ratio
like an interval scale, except it does have a meaningful absolute zero point that indicates total absence of the variable being measured
- Can enable statements such as “a person who weighs 100 kilograms weighs twice as much as a person who weighs 50 kilograms” (not possible with other scales)
- Often used with variables involving physical measures
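A small worked check of this point (illustrative numbers only): ratios of weights survive a change of units, while “ratios” of Celsius temperatures do not survive conversion to Fahrenheit, because 0°C is an arbitrary zero.

```python
# Ratio scale: weight. The 2:1 ratio holds in any unit.
kg = (100, 50)
lb = tuple(k * 2.20462 for k in kg)
print(kg[0] / kg[1], lb[0] / lb[1])  # 2.0 and 2.0

# Interval scale: Celsius temperature. The "ratio" changes with the unit.
c = (20, 10)
f = tuple(x * 9 / 5 + 32 for x in c)
print(c[0] / c[1], f[0] / f[1])      # 2.0 vs 1.36 -> "twice as hot" is meaningless
```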