Chapter 5- Identifying good measurement Flashcards
Conceptual definition
The researcher’s definition of the variable in question at a theoretical level.
Operational definition
A researcher’s decision about how to measure or manipulate the conceptual variable
How are conceptual variables operationalized?
Researchers start by stating a definition of their construct (the conceptual variable) and then create an operational definition. Ex- measuring gratitude toward a partner by asking people how often they thank their partner for something they did. Even a simple variable like gender needs to be operationalized.
3 common types of measures
- Self report measures
- Observational measures
- Physiological measures
Self report measures
Operationalize a variable by recording people’s answers to questions about themselves in a questionnaire or interview. Diener’s five-item scale is an example, as is asking someone to report their gender identity. For children, self reports can be replaced with parent or teacher reports
Observational measures
Operationalize a variable by recording observable behaviors or physical traces of behaviors. For example, operationalizing happiness by observing how many times a person smiles. Intelligence tests are also observational measures, since an individual’s intelligent behaviors are being observed
Physiological measures
Operationalize a variable by recording biological data, such as brain activity, hormone levels, or heart rate. Usually requires equipment to amplify, record, or analyze the data. One way to operationalize stress could be to measure the amount of cortisol (a stress hormone) released in the saliva.
Which operationalization is best?
One construct can be operationalized many different ways. Physiological measures aren’t necessarily the most accurate, and they must also be validated by other measures. For example, fMRI studies suggest that the brains of more intelligent people work more efficiently. However, in this case participant intelligence was determined prior to the scans using an IQ test- an observational measure.
How many levels must each variable have?
All variables must have at least 2 levels, but the levels of operational variables can be coded using different scales of measurement.
Categorical/nominal variables
Variables that are categories- sex, species, and others. The researcher might assign numbers to each category, but the numbers don’t have numerical meaning or quantify the difference between categories.
Quantitative/continuous variables
Variables that are coded with meaningful numbers, like height, weight, level of brain activity, or scales that produce quantitative scores (Diener’s scale of well being).
3 types of quantitative variables
- Ordinal scale
- Interval scale
- Ratio scale
Ordinal scale
Applies when the numbers of a quantitative variable represent a ranked order- these rankings could be unequal. For example, a bookstore might rank their top 10 best selling books, but we don’t know how many more copies of book 1 were sold than book 2.
Interval scale
Applies when the numerals of a quantitative variable represent equal intervals (distances) between levels and there is no “true zero”. For example, the distance between each degree on the Celsius scale is equal. There is also no true zero, because 0 degrees Celsius (freezing point) does not mean that something has “no temperature”.
Ratio scale
Applies when the numerals of a quantitative variable have equal intervals, and a value of zero actually means “none”. For example, a score of zero on a knowledge test when measuring how many questions people get right does actually mean zero- the individual got 0 questions correct.
Reliability
Refers to how consistent the results of a measure are
Validity
Refers to whether the operationalization is measuring what it’s supposed to measure
3 types of reliability
- Test-retest reliability
- Interrater reliability
- Internal reliability
Test-retest reliability
A study participant gets pretty much the same score each time they are measured with the same instrument. Applies whether the operationalization is self report, observational, or physiological, but it’s most relevant when researchers are measuring constructs that should stay stable over time, such as intelligence. If participants take an IQ test one day and then take it again a month later, the pattern of scores should be consistent.
Interrater reliability
Consistent scores are obtained no matter who measures the variable.
Internal reliability
A study participant gives a consistent pattern of answers, no matter how the researchers phrase the question. People who agree with the first question on the well being scale should also agree with the next few questions
Statistical devices that researchers can use for data analysis (2)
Scatterplots and the correlation coefficient
How can scatterplots indicate interrater reliability?
Scatterplots can show interrater agreement or disagreement. Example- two observers rate how happy children seem while playing. With high interrater reliability, the two observers give each child similar ratings, and the data points cluster close to the line of best fit on the scatterplot. With low interrater reliability, the ratings differ more, and the points are not clustered close to the line.
What could cause low interrater reliability?
The observers might not have a clear enough operational definition of happiness. Also, the coders might not have been trained well enough yet.
Correlation coefficient (r)
A single number that indicates how close the dots are to the line on the scatterplot. The sign of r (positive or negative) indicates the slope direction. The value of r is always between -1 and 1. A strong relationship means r is close to -1 or 1. If there is no relationship, r will be .00 or close to it
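As a rough illustration of how r is obtained from paired scores, here is a minimal sketch using the standard Pearson formula. The function name `pearson_r` and the two observers' ratings are made up for this example.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient for two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products of each score's deviation from its mean
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations for each variable
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical happiness ratings of five children by two observers
observer_1 = [1, 2, 3, 4, 5]
observer_2 = [2, 4, 6, 8, 10]
r = pearson_r(observer_1, observer_2)  # dots fall exactly on a rising line

# Reversing the order of one set of scores flips the slope direction
r_negative = pearson_r([1, 2, 3], [3, 2, 1])
```

The sign of the result gives the slope direction, and its distance from zero gives the strength of the relationship.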
Slope direction
The direction of the slope of the line of best fit. It can be positive, negative, or zero
Strength of the relationship
The relationship between variables is considered to be strong when the dots in a scatterplot are close to the line
How is test-retest reliability assessed using r?
We measure the same set of participants at least twice, record each person’s score at time 1 and time 2 (around 2 months apart), and calculate r. If r is positive and strong (.5 or higher), test-retest reliability is good. If r is positive but weak, participants’ scores have changed between the two times.
When would a low r value indicate poor test-retest reliability?
A low r is a sign of poor reliability if we are measuring something that should stay the same over time. IQ, for example, should stay the same over a span of two months. For something like seasonal stress, a low r is expected because the construct itself changes over time
How is interrater reliability assessed using r?
To test this, we would ask two observers to rate the same participants at the same time, and then compute r. If r is positive and strong (.70 or higher), interrater reliability is good. If r is positive but weak, reliability is low. A negative correlation is rare but would indicate a problem with the observers
r is best used to evaluate interrater reliability when observers are rating which type of variable?
r can be used to evaluate interrater reliability when the observers are rating a quantitative variable. A statistic called kappa is more appropriate when observers are rating a categorical variable. A kappa close to 1 means that the raters agree.
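The card above does not say which kappa is meant. A minimal sketch, assuming Cohen's kappa for two raters (agreement corrected for the agreement expected by chance), might look like this. The function name `cohen_kappa` and the category codes are hypothetical.

```python
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters coding the same cases into categories."""
    n = len(rater_a)
    # Proportion of cases where the two raters chose the same category
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    # Agreement expected by chance, from each rater's category frequencies
    p_expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    # Undefined (division by zero) if chance agreement is exactly 1
    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical codes from two observers for six children
rater_a = ["happy", "happy", "sad", "sad", "happy", "sad"]
rater_b = ["happy", "happy", "sad", "sad", "happy", "happy"]
kappa = cohen_kappa(rater_a, rater_b)
```

Here the raters agree on 5 of 6 children, but because half of that agreement would be expected by chance, kappa comes out lower than the raw agreement rate.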
When is internal reliability relevant?
Internal reliability is relevant for measures that use multiple items or observations to get at the same construct. A scale with 5 items that say roughly the same things worded differently should mean that a participant should answer all items consistently
How are responses quantified to determine internal reliability?
Researchers ask the participants to answer all of the items, then compute the correlation between every item and every other item. The average of all of these correlations is the average inter-item correlation (AIC). An AIC from .15-.50 means that the items go well together. Researchers then compute Cronbach’s alpha, which mathematically combines the AIC and the number of items in the scale. The closer alpha is to 1, the better the scale’s reliability
Construct validity
How well a measure captures the conceptual variable it was intended to measure
How are validity and reliability different?
Validity and reliability are separate concepts. For example, a bathroom scale might say an adult weighs 50 pounds every time they step on it. It’s reliable (consistent), but not valid (the measurement isn’t accurate). Reliability is necessary for validity- a measure can be less valid than it is reliable, but it can’t be more valid than it is reliable. If a measure doesn’t correlate with itself, then how can it be more strongly associated with some other variable?
2 subjective ways to assess validity
Face and content validity
3 empirical ways to assess validity
Criterion, convergent, and discriminant validity
How do we measure validity of abstract concepts?
Abstract concepts would include happiness, intelligence, stress, and self-esteem. There is no way of directly measuring how happy someone is, although we can estimate it in multiple ways. We can know if operationalizations are measuring our construct by collecting a variety of data and evaluating it in light of our theory about the construct
Face validity
A measure has face validity if it is subjectively considered to be a plausible operationalization of the conceptual variable in question. Measures with this validity align well with the conceptual definition.
Example- head circumference would have a high face validity for hat size but low face validity for intelligence
Content validity
A measure must capture all parts of a defined construct. Ex- a conceptual definition of intelligence could be the ability to “reason, plan, solve problems, think abstractly, comprehend complex ideas, learn quickly, and learn from experience”. To have adequate content validity, an operationalization of intelligence should include questions or items to assess each of these 7 components.
Criterion validity
Evaluates whether the measure under consideration is associated with a concrete behavioral outcome that it should be associated with, according to the conceptual definition. Criterion validity is important for self report measures because the correlation can indicate how well people’s self reports predict their actual behavior.
Types of evidence for criterion validity (2)
- Correlational evidence for criterion validity
- Known-groups evidence for criterion validity
Correlational evidence for criterion validity
For example, a sales company is choosing between aptitude test A and aptitude test B. Both have face and content validity, but do they correlate with the key behavior- work success? The company can collect data on how well each aptitude test correlates with sales success. Both tests are given to all current sales representatives, and each representative’s number of sales is recorded. Two scatterplots are made: one for aptitude test A and sales, and one for aptitude test B and sales. If aptitude test A has the stronger correlation, we can conclude that test A has better criterion validity as a measure of selling ability
Known groups paradigm
Another way to gather evidence for criterion validity. Researchers see whether scores on the measure can discriminate among two or more groups whose behavior is already well understood. For example, to validate salivary cortisol as a measure of stress, a researcher could compare the salivary cortisol levels in two groups of people- those who are about to give a speech in front of a classroom and those who are in the audience. If salivary cortisol is a valid measure of stress, people in the stress group (public speaking) should have higher cortisol levels than those in the audience.
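A minimal sketch of the known-groups comparison, using made-up cortisol readings in arbitrary units. The function name `known_groups_difference` is hypothetical, and a real study would also test whether the difference is statistically significant, which this sketch omits.

```python
from statistics import mean

def known_groups_difference(high_group, low_group):
    """Mean of the group expected to score high minus the mean of the other group."""
    return mean(high_group) - mean(low_group)

# Hypothetical salivary cortisol readings (arbitrary units)
speakers = [18.2, 21.5, 19.8, 23.1, 20.4]  # about to give a speech
audience = [12.1, 14.3, 11.8, 13.5, 12.9]  # sitting in the audience

difference = known_groups_difference(speakers, audience)
# A positive difference is in the direction predicted for a valid stress measure
supports_validity = difference > 0
```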
How is the known groups method used to validate self report measures?
An example is the Beck Depression Inventory (BDI). This is a 21 item self report scale where participants circle one of 4 choices for each item. The scores are added to get a total from 0-63. Participants completed the inventory. Then, psychiatrists conducted clinical interviews to determine whether each person was depressed and, if so, their level of depression. The average BDI score of the known group of depressed people was higher than the average score of the known group of people who were not depressed. BDI scores also correlated with the psychiatrists’ ratings of the level of depression
How was convergent validity determined for the BDI?
If the BDI really quantifies depression, it should be correlated with other self report measures of depression. A strong positive correlation between the 2 scores provides evidence for the convergent validity of the BDI. Convergent validity evidence also includes similar constructs, not just the same one. BDI scores were also strongly negatively correlated with a score quantifying psychological well being. The negative direction makes sense because people who are depressed are expected to have lower levels of well being
How was discriminant validity determined for the BDI?
The BDI should not correlate strongly with measures of constructs that are very different from depression- it should show discriminant validity with them. We would not expect the BDI to be strongly correlated with a measure of perceived physical health problems, for example. We would expect the BDI to be much more strongly correlated with similar constructs than with constructs that aren’t similar. Example- many developmental disorders have similar symptoms. We wouldn’t want a screening instrument to diagnose a child with autism when they actually have a speech delay. It’s not necessary to establish discriminant validity with random other variables- we want to focus on other variables that are “near neighbors” of the one being evaluated.
Convergent and discriminant validity
Convergent validity and discriminant validity refer to the pattern of correlations with measures of theoretically similar and dissimilar constructs. They are usually evaluated together, as a pattern of correlations among self report measures. A measurement should have higher correlations (higher r values) with similar traits (convergent validity) than it does with dissimilar traits (discriminant validity).