Ch5 Good Measurement Flashcards
Conceptual variables
(Constructs) abstract theoretical concepts a researcher intends to study
Operational variable
How a variable will be measured or manipulated
How to operationalize conceptual variables
First define construct of interest
Then create an operational definition-
think about how you could quantify the construct / turn it into a #
3 types of measures
1) self report
2) observational measure
3) physiological measure
Self-report
+ can children do self reports?
Variable operationalized by recording people’s answers to questions about themselves in a questionnaire or interview
In research with children, self-reports are sometimes replaced with parent or teacher reports
Example of self-report
Diener’s 5 item scale
+ ladder of life
Both self report measures of life satisfaction
How did Ed Diener operationalize “subjective well-being”
+ how did most people score?
created a 5-item questionnaire about life satisfaction, rated on a 7-point scale
(1 = strongly disagree, 7 = strongly agree)
most people scored above 20 (the neutral point)
observational measures
(also called behavioral measures)
operationalizing a variable by recording observable behaviors or physical traces of a behavior
example of observational measure
happiness- how often someone smiles
allergies- how often someone sneezes
can intelligence be observationally measured?
intelligence tests can be considered observational measures because the people who administer the test in person are observing intelligent behaviors (such as solving a puzzle)
physiological measure
operationalizing a variable by recording biological data
often requires equipment to amplify, record, analyze biological data
examples of physiological measure
brain activity
heart rate
fMRI brain scanning during wins vs. losses at Rock Paper Scissors
which operationalization is best?
a single construct can be operationalized in any of the 3 ways; there is no single best one
what matters is that the different ways of measuring show similar patterns of results
which type of measure is mistakenly considered most accurate?
physiological, but it has to be corroborated with other measures
example of corroborating physiological measures with other measures
- to use fMRI scans to study intelligence and brain efficiency, participants’ intelligence first had to be established via an IQ test (a behavioral measure)
- fMRI as a measure of happiness could only work by first asking participants how happy they feel (self-report)
how many levels must variables have
at least two, to allow for change
how can levels of operational variables be coded?
using different scales of measurement
categorical variables (nominal variables)
levels of the variables are qualitatively distinct categories
(categorized by name only)
researchers may number the levels for data entry (1 = male, 2 = female), but the numbers have no quantitative meaning (1 isn’t higher than 2)
Quantitative variables
levels are coded with meaningful #’s
example of categorical variables
sex, species
example of quantitative variables
height, weight, IQ scores
does Diener’s scale of subjective well-being use categorical or quantitative variables? why?
quantitative, because the numbers have meaning- a score of 35 is higher than 7
types of quantitative variables
- ordinal scale
- interval scale
- ratio scale
ordinal scale
#’s represent a ranked order, with unequal intervals between levels
example of ordinal scale
places in a race- 1st is faster than 2nd, but we don’t know by how much
interval scale
#’s represent equal distances between levels + there’s no true zero point (zero doesn’t mean ‘nothing’- 0° does not mean no temperature)
what kind of scale do most questionnaires use?
including Diener’s SWB
interval scales
example of interval scale
IQ scores (100 to 105 is the same distance as 105 to 110)
since there’s no true zero in interval scales, what can’t researchers say about them that can be said about ratio scales?
can’t say that something is “twice” or “three times” as much as something else
ratio scale
#’s represent equal intervals and there IS a true zero point (zero means “none”)
example of ratio scale
height, distance traveled
exam scores (because zero means “nothing correct”)
reliability
how consistent results/ scores of a measure are
validity
is the operationalization measuring what it’s supposed to? - how accurate is it?
types of reliability
- test-retest reliability
- interrater reliability
- internal reliability
what do researchers do before deciding on a measure? Why?
they collect (or review others’) data before deciding how to operationalize something
in order to see if it is reliable- that it will yield consistent patterns of results
test-retest reliability
refers to whether scores are consistent every time the measure is used (time 1, time 2)
example of test-retest reliability
IQ tests should have similar results at beginning (time 1) and end (time 2) of a semester
what kind of operationalizations can test-retest reliability apply to?
self-report, observational, and physiological measures
when is test-retest reliability most relevant?
when a construct is expected to be relatively stable- it’s not expected to change over time
example of when test-retest reliability is NOT relevant
happy mood– expected to change over time
interrater reliability
refers to consistency of scores no matter who is measuring the variable
example of interrater reliability
two observers measure and record how often a child smiles during an hour- results should be consistent
internal reliability (internal consistency)
pattern of answers in self-report should be consistent no matter how a question is phrased
for what kind of measure is internal reliability relevant?
self-report scales with multiple items only
example of internal reliability
in Diener’s scale, all the differently phrased questions measure the same construct, so answers to them should be consistent
statistical devices for data analysis
- scatterplots
- correlation coefficient r
what kind of claim is evidence for reliability an example of?
association claim- of one time with another, one coder with another, one version of a question and another
how are correlations used to document reliability?
use head circumference to explain
test-retest: measure head twice, two different times
interrater: have two different people measure
measurements should be the same/similar with some measurement error
(internal reliability doesn’t apply- head circumference isn’t a multi-item self-report)
what does interrater agreement look like on a scatterplot?
an upward or downward slope, with points clustered close to the line drawn through them
what does interrater disagreement look like on a scatterplot?
points are scattered farther from the line
the correlation coefficient “r”
a single # that describes how close dots on a scatterplot are to a line drawn through them
in what ways can scatterplots differ?
- slope direction (negative, positive, or zero slope)
- strength of relationship (dots lying closer to the line indicate a stronger relationship)
how does the slope act when “r” is positive? negative?
when slope is positive, r is positive
when slope is negative, r is negative
what is the range of the value of “r”
falls between 1.0 and -1.0
what is the value of r when the relationship is strong? weak?
when relationship is strong, r is close to 1.0 or -1.0
1.0 indicates the strongest possible positive relationship
-1.0 indicates the strongest possible negative relationship
when relationship is weak, r is close to 0
using “r” in test-retest reliability- how can you tell if test-retest reliability is good or poor?
measure same participants twice, then compute value of r
if value for r is strong and positive (.5 or above) then test-retest reliability is good.
if r is positive but weak, then it means the score changed between time 1 and time 2- poor test-retest reliability
using “r” in interrater reliability - how to tell if interrater reliability is strong
two observers rate same participant at the same time, then compute r
if value for r is strong and positive (0.7 or above) interrater reliability is strong
if weak and positive, interrater reliability is poor, cannot trust observers’ ratings
negative r would indicate terrible interrater reliability
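a minimal sketch (Python, made-up scores) of computing r for test-retest or interrater reliability:
```python
from scipy.stats import pearsonr

# made-up data: the same 6 participants measured at two times
# (for interrater reliability, the two lists would be rater A's and rater B's ratings instead)
time1 = [28, 15, 22, 31, 19, 25]
time2 = [27, 17, 21, 30, 20, 26]

r, p = pearsonr(time1, time2)
print(f"r = {r:.2f}")  # a strong positive r suggests good test-retest (or interrater) reliability
```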
if interrater reliability is weak, what can be done?
either retrain coders or
refine operational definition
when in interrater reliability should you use ‘r’ and when should you use kappa?
use r when rating quantitative variable
if the variable is categorical, the correlation coefficient ‘kappa’ is used
kappa
the correlation coefficient used in interrater reliability
measures the extent to which two raters place participants into the same categories
works like r in that a value of 1.0 means the raters are in perfect agreement
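a minimal sketch of computing kappa by hand (Python, made-up category codes); the idea: kappa = (observed agreement - chance agreement) / (1 - chance agreement):
```python
from collections import Counter

# made-up data: two raters classify the same 8 children as smiling ("S") or not ("N")
rater_a = ["S", "S", "N", "S", "N", "N", "S", "S"]
rater_b = ["S", "S", "N", "N", "N", "N", "S", "S"]

n = len(rater_a)
observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n  # proportion of actual agreement

# chance agreement: probability the raters land on the same category by coincidence
count_a, count_b = Counter(rater_a), Counter(rater_b)
chance = sum((count_a[c] / n) * (count_b[c] / n) for c in set(rater_a) | set(rater_b))

kappa = (observed - chance) / (1 - chance)
print(f"kappa = {kappa:.2f}")  # values near 1.0 mean the raters agree beyond chance
```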
when is r used in internal reliability?
using r is relevant in internal reliability for measures that use multiple items (questions) to approach the same construct
how can you tell if a set of items has internal consistency?
set of items has internal consistency if its items correlate strongly with one another
meaning you can average across those items to get a single overall score for each participant
cronbach’s alpha (coefficient alpha)
a correlation-based statistic to see if measurement scale has internal reliability
closer to 1.0 = better reliability- 0.7 or above is considered good
.9 or above means q’s may be redundant
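a minimal sketch of coefficient alpha computed straight from its formula, alpha = k/(k-1) * (1 - sum of item variances / variance of total scores) (Python, made-up responses to a 5-item scale):
```python
import numpy as np

# made-up data: rows = participants, columns = the 5 items of a scale (1-7 responses)
items = np.array([
    [6, 5, 6, 7, 5],
    [3, 4, 3, 2, 4],
    [5, 5, 6, 5, 6],
    [2, 3, 2, 2, 3],
    [7, 6, 7, 6, 7],
])

k = items.shape[1]                           # number of items
item_vars = items.var(axis=0, ddof=1).sum()  # sum of each item's variance
total_var = items.sum(axis=1).var(ddof=1)    # variance of participants' total scores

alpha = (k / (k - 1)) * (1 - item_vars / total_var)
print(f"alpha = {alpha:.2f}")  # .70 or above is usually considered good internal reliability
```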
if a set of items has good internal reliability, what can researchers do? what if it doesn’t?
if good reliability, researchers can combine items
if poor, researchers must revise items, or select only items that correlate strongly
what types of reliabilities are used for self-report?
internal and test-retest. interrater is unnecessary because there are no observers/ coders, the subject is evaluating themselves
what are reliability and validity important in establishing?
construct validity- because they show us that our chosen measure for the construct is consistent and accurate (measures what it’s supposed to)
Is head measurement as a measurement for intelligence reliable? valid?
reliable, because measurement would be consistent
not valid as an intelligence test
how can we know if indirect operational measures of a construct are really measuring that construct?
by collecting a variety of data
examples of indirect measures of happiness
- wellbeing inventory
- daily smile rate
- stress hormone levels
Can you say definitively that a measure is or isn’t valid?
no, it’s a matter of degree
ask: what is the weight of evidence in favor of this measure’s validity?
subjective ways to assess validity
- face validity
- content validity
face validity
looks like what we want it to measure
example of face validity
head circumference has good face validity for hat size,
low face validity for intelligence
how do researchers check face validity?
+ example
generally by consulting experts
ex: asking a panel of personality psychologists about how reasonable Diener’s SWB scale is for measuring happiness
content validity
a way to see if our measure contains all the parts the theory says it should contain
must capture all parts of a defined construct
example of content validity
conceptual definition of intelligence contains many parts (plan, ability to reason, learn quickly, etc)
to have good content validity, an operationalization of intelligence should include items to assess each component-
this is why IQ tests have sections
empirical ways to assess validity
- criterion validity
- convergent validity
- discriminant validity
what is the point of empirical ways to assess validity
to make sure the measurement is associated with something it theoretically should be associated with
criterion validity
whether the measure is related to a concrete behavioral outcome it should be related to
example of criterion validity
a test to predict the aptitude of job applicants-
when applicants’ test scores correlate strongly with their later sales performance, criterion validity is high (points lie close to the line)
which type of measure is criterion validity important for? why?
criterion validity is especially important for self-report measures
because the correlation can indicate how well people’s self reports predict their actual behavior
what is criterion validity assessed using?
typically represented by a correlation coefficient
but can also be assessed with a known-groups paradigm
known-groups paradigm
examines whether scores on the measure can distinguish among a set of groups whose behavior is already well understood
example of known-groups paradigm
salivary cortisol levels-
measure cortisol in people about to give a speech and in people sitting in the audience
we know giving a speech is stressful
if salivary cortisol is a valid measure of stress, levels should be higher among those giving the speech
BDI
Beck Depression Inventory- a 21-item self-report scale that asks about major symptoms of depression
how was the known- groups paradigm used to test the BDI?
known-groups paradigm was used to test the criterion validity of the BDI by giving it to two groups
one not depressed
one diagnosed as depressed by 4 psychiatrists
depressed people scored higher (closer to the maximum of 63), so criterion validity was established
- the known groups were also used to calibrate what counts as low, medium, and high scores
how did Diener’s SWB use known-groups paradigm?
in a review article, SWB scale averages from various studies
college students scored much higher than prisoners- such known-groups patterns provide strong evidence for criterion validity
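a minimal sketch of a known-groups check (Python, made-up scores): compare groups whose standing on the construct is already known:
```python
from scipy.stats import ttest_ind

# made-up BDI-style scores for a group already diagnosed as depressed vs. a nondepressed group
depressed = [31, 27, 35, 29, 33, 26]
nondepressed = [8, 11, 6, 9, 12, 7]

t, p = ttest_ind(depressed, nondepressed)
print(f"t = {t:.2f}, p = {p:.4f}")
# if the measure has criterion validity, the known depressed group should score clearly higher
```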
convergent validity
whether a measure correlates strongly with other measures of the same construct
example of convergent validity
to see if the BDI quantified depression, researchers had adults complete the BDI along with another self-report measure of depression (the CES-D)
scores were strongly correlated (r was .68)
Can you definitively establish validity?
no single definitive outcome will establish validity
the validity of all the measures involved (the BDI and CES-D, for example) has to be established with evidence
eventually may be satisfied that a measure is valid after evaluating the WEIGHT and PATTERN of the evidence
which validity do many researchers think best predicts actual behaviors?
criterion validity
can similar (not same) constructs be used to establish convergent validity?
yes- for example SWB scores were used to establish convergent validity for the BDI- scores had negative correlation (r = -.65)
Discriminant validity
a measure should correlate less strongly with measures of different constructs
sometimes helpful in differentiating similar diagnoses
usually not relevant to establish between something completely unrelated- should be something similar but different
example of discriminant validity
- BDI and Physical health problems weakly correlated (r = .16), evidence for discriminant validity
- whether a child has autism or only a language delay
- a scale to diagnose learning disabilities shouldn’t correlate strongly with IQ
what type of measures are convergent and discriminant validity usually used for? how does it help?
convergent and discriminant validity are usually evaluated together as a pattern of correlations among self-report measures
no strict rule for what the correlations should be; the overall pattern helps show whether the operationalization measures what it’s supposed to
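a minimal sketch of inspecting that pattern (Python/pandas, made-up scores on the BDI, a second depression scale, and a physical-health checklist):
```python
import pandas as pd

# made-up scores for the same participants on three measures
scores = pd.DataFrame({
    "BDI":    [30, 12, 25, 8, 18, 22],
    "CES_D":  [28, 10, 24, 9, 17, 20],  # same construct: should correlate strongly (convergent)
    "health": [5, 3, 2, 4, 6, 3],       # different construct: should correlate weakly (discriminant)
})

print(scores.corr())  # look at the overall pattern of correlations, not any single r
```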
can a measure be more reliable than valid? more valid than reliable?
a measure can be more reliable than valid, but not the other way around-
needs to be consistent with itself in order to be strongly associated with something else
reliability is a necessary condition for validity but it is not sufficient
when you read a research study, ask about the measures:
did the researchers collect evidence that their measures have construct validity?
if they didn’t do it themselves, did they review construct validity evidence of others?
where in journal articles will you find reliability and validity info?
methods section