final Flashcards
Psychometrics
How can we measure constructs like depression, anxiety, loneliness, etc
Psychology is the science of thoughts, feelings, and behaviors
Problem– thoughts and feelings are not directly observable
Solution to this is psychometrics
A test serves as a proxy (an indirect way to estimate) for what we cannot see
psychometrics measurement
scoring individuals on characteristics that can’t be easily observed
Measure development– Writing items, scoring procedures
Measure evaluation– Determining whether measure is reliable and valid
Measurement is not a linear process; it is typically ongoing
Items and scales will always be revised over and over again to get the most accurate results possible
psychometrics examples
cognitive ability via tasks/tests requiring cognition
Knowledge of t-tests and ANOVA via exam 2 scores
Conscientiousness via your answer to 10 questions
Stress via your salivary cortisol levels
Looking at biological measures
Goals of psychometrics–
classify/group people into categories
nominal/ordinal variables
Ex– questions about educational attainment, attachment style
Essentially just grouping people
Goals of psychometrics–
interval/ratio variables
Ex– questions about extent of conscientiousness, severity of depression symptoms
Measure reliability
consistency/precision of scores across
Time– want to see similarities between time one and time two
Items– responding to items in a consistent manner
Raters– don’t want two raters to observe two things and come to two completely different conclusions
Measure validity
accuracy of scores
Are the scores measuring what they are supposed to measure?
“Construct validity”– are you looking at what you are intending to
Reliability versus validity
they do relate, but the relationship is imperfect
Dots on target examples
Dots close to target– accurate
Dots close together– consistent/precise
Can an unreliable measure be valid– no, has to have consistency for it to be valid
Can an invalid measure be reliable– yes, something can be reliable but consistently wrong
Test-retest reliability
are scores similar when measured at different time points
Official name for asking about time
Always relevant for trait-like constructs
Personality
Intelligence
Test-retest reliability– Less relevant for state-like constructs– things that vary from day to day
Stress
Positive affect (emotion)
Negative affect
test-retest reliability– method
Relate time one scores to time two scores
Want the scores to be highly consistent with one another
Usually use correlation
Looking for a correlation of .70 or higher, generally
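A minimal sketch of the test-retest check on this card, assuming made-up scores for the same five people at two time points. The names pearson_r, time_one, and time_two are illustrative only.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares around each mean
    sp = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return sp / sqrt(ss_x * ss_y)

# Hypothetical scores for the same five people at two time points
time_one = [10, 12, 15, 18, 20]
time_two = [11, 12, 16, 17, 21]

r = pearson_r(time_one, time_two)
acceptable = r >= 0.70  # rule of thumb from the card
print(round(r, 2), acceptable)
```

People who scored high at time one also scored high at time two, so the correlation clears the .70 guideline.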
retest w/ paired samples t test
a bit controversial
Want to see a non-significant result (no change)
Controversial to look for a null result because it runs the risk of a type one or type two error
Consistency across items– internal consistency reliability
Most psychological scales contain multiple items that, together, create a score
Internal consistency reliability– do items in a scale positively relate to one another
Not that they should be answering exactly the same way, but they should be close and there should be a pattern
Measurement error– differences in responses across the items
Attempt to correct this is aggregating the scores (adding them up) to cancel the small amounts of error
internal consistency reliability– method one (split-half correlation)
Split half correlation– 10 diff questions people answer, create two random halves and create a sum score of 5 items in each set
Then look at the relationship between set 1 and set 2 with the expectation they will be related
If you don’t have .7, may have to start over
Problem with split-half correlation– the random halves you pick might just happen to differ
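A rough sketch of the split-half procedure on this card, assuming invented responses from six people to ten items rated 1-5. The seed, the data, and the helper pearson_r are assumptions for illustration.

```python
import random
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sp = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return sp / sqrt(ss_x * ss_y)

# Hypothetical data: each row is one person's answers to 10 items
responses = [
    [1, 2, 1, 2, 1, 1, 2, 1, 2, 1],
    [2, 2, 3, 2, 2, 3, 2, 2, 3, 2],
    [3, 3, 3, 4, 3, 3, 3, 4, 3, 3],
    [4, 4, 3, 4, 4, 4, 5, 4, 4, 4],
    [5, 4, 5, 5, 5, 4, 5, 5, 5, 5],
    [2, 1, 2, 2, 1, 2, 2, 1, 2, 2],
]

rng = random.Random(0)        # fixed seed so the random split is reproducible
items = list(range(10))
rng.shuffle(items)
half_a, half_b = items[:5], items[5:]   # two random sets of 5 items

# Sum score per person within each half
sum_a = [sum(person[i] for i in half_a) for person in responses]
sum_b = [sum(person[i] for i in half_b) for person in responses]

split_half_r = pearson_r(sum_a, sum_b)
print(round(split_half_r, 2))
```

Changing the seed changes which items land in each half, which is the weakness of this method noted on the card.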
internal consistency reliability– method two (Cronbach’s alpha– variance framework)
Dividing covariance (relationships) among all possible pairs of items over total variance across all items
More reliable than split half correlation
Essentially taking the average
Increase covariation across items = higher alpha
Increase item variance = lower alpha
Usually ranges between 0 and 1 (can go negative if items relate negatively to one another)
Want alpha of .7 or higher
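A minimal sketch of Cronbach's alpha using the common variance form, alpha = k/(k-1) * (1 - sum of item variances / variance of total scores). The data and the function name cronbach_alpha are made up for illustration; statistical software may differ in details.

```python
from statistics import variance

def cronbach_alpha(responses):
    """responses: list of people, each a list of item scores."""
    k = len(responses[0])                      # number of items
    # Variance of each item across people
    item_vars = [variance([p[i] for p in responses]) for i in range(k)]
    # Variance of each person's total (sum) score
    total_var = variance([sum(p) for p in responses])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Hypothetical data: 5 people x 4 items
responses = [
    [1, 2, 1, 2],
    [2, 2, 3, 2],
    [3, 3, 3, 4],
    [4, 4, 3, 4],
    [5, 4, 5, 5],
]

alpha = cronbach_alpha(responses)
print(round(alpha, 2), alpha >= 0.7)
```

When items covary strongly, the variance of the total score is much larger than the sum of the item variances, which pushes alpha up.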
Interrater reliability
how consistent are two separate investigators’ scores for the same group of participants
Two investigators observe the same behavior in the same person at the same time point and score it
Very important– not about two separate experiments
Simple calculation– percentage of agreement between scores
Most relevant for behavioral measures, but sometimes can be relevant for surveys too
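A small sketch of the percentage-of-agreement calculation, assuming two hypothetical raters coding the same five observations. The category labels are invented.

```python
def percent_agreement(rater_a, rater_b):
    """Percentage of observations that both raters scored identically."""
    matches = sum(1 for a, b in zip(rater_a, rater_b) if a == b)
    return 100 * matches / len(rater_a)

# Hypothetical codes for the same five behaviors, same people, same time
rater_a = ["aggressive", "calm", "calm", "aggressive", "calm"]
rater_b = ["aggressive", "calm", "aggressive", "aggressive", "calm"]

agreement = percent_agreement(rater_a, rater_b)
print(agreement)
```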
Face validity
does this look like it’s measuring what it is supposed to measure
Arguably weakest type
Ex– strong face validity to measure pandemic worry
Rate agreement from 1-5– “I am worried about the COVID-19 pandemic”
Use low face validity to measure social desirability
Ex– “I like to gossip at times”
Gives you an idea of how much participant is willing to lie
face validity problems
Completely subjective
Can have good measures, but will still have low face validity
Make decisions based on the questions you ask
Content validity
how does the operational definition match the conceptual definition of the construct
making sure each part of the construct you want to study is measured in some way
Looking at the match between your measure and the actual content itself
Still somewhat subjective
Not a formal test
Sometimes good measure will have low content validity
content validity example
stress
Conceptually, includes both psychological and physiological responses
In theory, good measures should include both
Criterion validity
are scores on the measure related to measures of other constructs (criteria) that they theoretically should be related to
Arguably the strongest
Looking for relationships through hypothesis tests
Concurrent criterion validity– criteria measured at the same time
Predictive criterion validity– criteria measured in the future
Ex– stress scale
Today’s negative affect (concurrent)
Tomorrow’s negative affect (predictive)
Convergent validity
new measure is correlated with other established measures of the same construct
When developing a new stress measure
Have to find its relationship with
The perceived stress scale– an established self-report stress scale
Different from criterion because convergent is looking at measures of the same construct
However, with both you are trying to prove that they relate in some way
Discriminant validity
are scores on the measure not related to the measures of distinct constructs
Sort of controversial form of validity
Hard to find constructs that don’t relate to things like stress/depression, etc
Don’t want to find a relationship
Failing to reject the null
Don’t want a negative relationship either because it’s still a relationship
discriminant validity example
developing a new measure of stress
Find its relationship with
Social desirability scores
Demographic characteristics
Testing validity
Face– no test, visual inspection
Content– no test, visual inspection/theoretical deep dive
Criterion, convergent, discriminant– hypothesis tests
testing validity– t test and anova
wanting to establish relationships with grouping variables
testing validity– correlation or regression
wanting to establish relationships with other numeric variables
Sources of error in research studies
All research studies have error
Research design flaws– confounds, equipment failure, poor measurement/manipulation
Participants– lack of motivation/attention/understanding/human error
Data coding and entry– coding/entry errors
Outliers
extreme values, usually impossibly extreme
Two types
Error outliers– values that look extreme because of a mistake
Not real (ex– coding mistake, entering wrong)
Interesting outliers– values that are extreme but are not mistakes
Exceptions to general trends, usually worthy of follow-up
Quantitative tools
If our variables are normally distributed, we have a sense of how unlikely it is to see extreme values
Can use z scores
Want to find z scores beyond 2.24 in either the positive or negative direction
If any are, that calls for further investigation
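A minimal sketch of the z-score screen on this card, assuming invented scores and using the sample standard deviation. The 2.24 cutoff comes from the card; the function name flag_outliers is illustrative.

```python
from statistics import mean, stdev

def flag_outliers(scores, cutoff=2.24):
    """Return scores whose z score exceeds the cutoff in either direction."""
    m, s = mean(scores), stdev(scores)
    return [x for x in scores if abs((x - m) / s) > cutoff]

# Hypothetical scores; 40 is far from the rest
scores = [12, 14, 15, 13, 16, 14, 15, 13, 14, 15, 40]

flagged = flag_outliers(scores)
print(flagged)
```

A flagged value is only a prompt to investigate; it still has to be checked as an error or an interesting case.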
How to handle outliers
Verify whether the outliers are meaningful or just errors
Determine if they are impossible
Check study logs and raw data
Impossible if it’s outside the scales range
Correct errors when you find them
If you can’t find an error or are unsure, treat it as interesting
Influential outlier– an outlier that changes results based on whether it’s present or not
If there’s an outlier, run the analysis with and without the outlier and report that that’s what you did
Step one is still finding out if it’s an error or interesting
Inattentive responding
People who misunderstand or don’t carefully respond to what they are being asked
Not fully engaged
Ex– answering strongly agree for all the answers
Typically seen at research pools at universities or surveys, but can happen in lab tasks too
Their data is essentially noise
“Infrequency” items
if people are paying attention, they should all give the same response
Subtle– I was born on February 30th
Does not exist, so should always be false
Usually better to go with something more subtle
Overt– please answer 2 for this question
End of survey items
Asking if participant answered all of the items thoroughly/ what strategy did you use to answer the items
Problem is people usually lie
Better– asking open ended question asking what their approach was to answer questions
Free response questions can be very telling
Bots will usually report an answer that doesn’t make sense
Logic
looking at how fast participant took survey
Need to pilot the study to figure out a reasonable response time
Then subtract a small value to account for someone exceptionally fast
Online survey programs will track time per page
In the lab, administrators can track time themselves
Low variability
someone who is always answering the questions in the exact same way
Long strings of the same answer
Approach– calculate individuals’ SDs across items to assess variability
Choose a minimum SD cutoff in advance
Controversial because it only works if you have positively and negatively worded items
Poor example– life satisfaction scale
Good example
People would describe me as someone willing to share my time with others
Maintaining close relationships is difficult for me
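A rough sketch of the low-variability screen, assuming raw (not yet reverse-scored) responses to a scale that mixes positively and negatively worded items. The cutoff MIN_SD and the participant IDs are hypothetical.

```python
from statistics import stdev

MIN_SD = 0.5  # hypothetical cutoff, chosen in advance

# Hypothetical raw responses (1-5) to six mixed-wording items
responses = {
    "p1": [4, 2, 5, 1, 4, 2],
    "p2": [3, 3, 3, 3, 3, 3],   # same answer to every item
    "p3": [5, 1, 4, 2, 5, 1],
}

# Each person's SD across their own item responses
person_sd = {pid: stdev(items) for pid, items in responses.items()}
flagged = [pid for pid, sd in person_sd.items() if sd < MIN_SD]
print(flagged)
```

With mixed wording, an attentive person's raw answers should swing between high and low, so an SD near zero is suspicious.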
How to handle inattentive responding
Always want to be on the side of inclusion
Plan your sample size to allow for some inattentive respondents
Can’t ethically require people to cooperate or punish them for low effort (withholding compensation)
Conduct the analysis once with inattentive responses and once without to see if anything changes
Choose cutoff in advance and report how many people you dropped
Missing data
Have to honor when people just don’t answer the question– can’t ask people to go back and finish
Some R functions won’t work
Calculations that assume you have complete data may be incorrect
Biggest issue– reduces sample size/ generalizability
Need to understand how much missing data we have and consider that in our analyses and conclusions
Missing completely at random
missingness that is unrelated to any study variable
Won’t impact conclusion
No correlation to missing data and other variables in data set
Missing at random
missingness that can be fully accounted for by other variables
Won’t impact conclusion
missingness can be fully accounted for by other variables in data set
Reason exists why they don’t answer the question
Planned missing data
data you chose not to collect
Best missing data, not a problem
Choosing to give some people some questions and other people other questions
Purpose is to shorten survey
Systematically missing data can impact your conclusions
Questions that are unclear, too sensitive, or inappropriate
Ex– asking gender identity and only having male/female option
Questions or measure at the end of a long study
Attrition– survey is too long, people stop answering
Participants with certain characteristics skip items about those characteristics
Ex– people with anxiety won’t respond to items about anxiety because of their anxiety
Listwise deletion
completely delete or ignore any participant that is missing data on any variable in your analysis
pros– all analyses now have the same sample size
Cons– you’re losing data
Pairwise deletion
use all of the data you can, exclude participants only when you don’t have enough information to complete an analysis
Pros– you can use all the data you have
Cons– diff sample sizes for diff analysis mean diff levels of power and precision
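A minimal sketch contrasting listwise and pairwise deletion, assuming a tiny invented data set where None marks a skipped question. The variable names stress, sleep, and mood are illustrative.

```python
# None marks a missing response (hypothetical data)
data = [
    {"stress": 10, "sleep": 7, "mood": 3},
    {"stress": 14, "sleep": None, "mood": 2},
    {"stress": None, "sleep": 6, "mood": 4},
    {"stress": 9, "sleep": 8, "mood": None},
    {"stress": 12, "sleep": 5, "mood": 2},
]

def listwise(rows, variables):
    """Keep only people with no missing value on any analysis variable."""
    return [r for r in rows if all(r[v] is not None for v in variables)]

def pairwise(rows, var_a, var_b):
    """Keep people who have both variables needed for one specific analysis."""
    return [r for r in rows if r[var_a] is not None and r[var_b] is not None]

n_listwise = len(listwise(data, ["stress", "sleep", "mood"]))
n_stress_sleep = len(pairwise(data, "stress", "sleep"))
n_stress_mood = len(pairwise(data, "stress", "mood"))
n_sleep_mood = len(pairwise(data, "sleep", "mood"))
print(n_listwise, n_stress_sleep, n_stress_mood, n_sleep_mood)
```

Listwise deletion leaves one small shared sample, while pairwise deletion keeps more people per analysis but uses a different subset each time.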
missing data– solution 2
maximum likelihood estimation– imputation that occurs during the estimation process of complex analyses
Uses all available data for each person
Determines what their most likely value would be based on available data
Estimates model parameters based on these values
Works pretty well as long as
Proportion of missing data is not too large
You are confident that data are missing at random/planned
Model will estimate what the person will look like and include them in estimation if you call for it
missing data– solution 3
imputation
Mean imputation– the mean is our best guess at any single value
Problem– substituting the mean for missing values can distort the variances
because we’re usually trying to explain variance, that’s a big problem
Replace missing data with avg value
However, will mess up variance and may not be reflective of what they actually look like
Multiple imputation solves this variance problem
Impute several plausible values
Run analysis with all plausible values
Pool the results to obtain a stable estimate
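A small sketch of the variance problem with mean imputation described on this card, assuming five invented observed scores and three missing ones. It does not implement multiple imputation.

```python
from statistics import mean, variance

observed = [2, 4, 6, 8, 10]     # hypothetical observed scores
n_missing = 3                   # hypothetical number of missing values

# Mean imputation: fill every missing value with the observed mean
imputed = observed + [mean(observed)] * n_missing

var_before = variance(observed)
var_after = variance(imputed)
print(var_before, round(var_after, 2))
```

The mean stays the same, but every imputed value sits exactly at the mean, so the variance shrinks.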
non response bias
Response rate– the percentage of people who actually participated out of the total number invited
In research, we invite a lot of people who might not actually participate
We can’t make conclusions about people who didn’t participate
At minimum, we need to report our response rate when we can
Grouping variables normally have a manageable number of levels
Usually nominal or ordinal variables
Typically we don’t want to group/classify people
Instead, grouping things on a more dimensional level
Ex– to what extent are you depressed
Want to see if the rank on one scale relates to the rank on another scale
Why we use correlation
Correlation tests
Statistical methods to measure and describe the linear relationship between two continuous, numeric variables
Linear relationship– changes in one variable tend to be accompanied by consistent changes in the other variable
Predictable relationship
Have a rating for everybody in your data set
Almost never talking about an experimental design, instead it’s an observational study
Survey correlating two variables/people’s scores together
Naturally occurring, no manipulation
Examples of observational, continuous variables
Individual difference measures– personality, intelligence
Key– how high is your intelligence level, not are you intelligent or not
Typical use for correlation
Prediction testing
Ex– does SES predict health
Can’t randomly assign people to have low SES
When ethically constrained, correlation testing is the next best option
Validity of a questionnaire
Validity– measures accuracy
Not talking about cause, just want to know if it relates
Reliability of a questionnaire
Reliability– measures consistency
Use for test retest
Depicting correlations
scatterplots
Each person has to have data on two continuous variables
Meet at the point where x and y axis score line up
Point on scatterplot is defined by score on both variables
Interpreting scatterplots
Form– linear, curved, clusters, no pattern
Direction– positive, negative, no direction
Positive– straight line, rising from left to right
Negative– straight line, falling from left to right
Strength– how closely the points follow the main form
If it’s linear, how close to the line are they
If its close, indicates a strong relationship
Perfectly horizontal line– no relationship
Effects of outliers, restriction of range and rescaling
Correlations are not robust against restriction of range
Ex– the age variable only ranges from 40-68
First 40 years are not accounted for, meaning you have a restriction of range
Rescaling of variables does not change correlations
Outliers can make your correlation look stronger or weaker than it actually is
Could be error or interesting
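A minimal sketch of the rescaling point on this card, assuming invented study-time and exam-score data. Converting hours to minutes or adding a constant to every score leaves the correlation unchanged.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sp = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return sp / sqrt(ss_x * ss_y)

hours = [1, 2, 3, 4, 5]            # hypothetical hours studied
score = [52, 55, 61, 60, 68]       # hypothetical exam scores

minutes = [h * 60 for h in hours]  # same variable, different unit
shifted = [s + 10 for s in score]  # same variable, constant added

r_hours = pearson_r(hours, score)
r_minutes = pearson_r(minutes, score)
r_shifted = pearson_r(hours, shifted)
print(round(r_hours, 3), round(r_minutes, 3), round(r_shifted, 3))
```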
Test statistic for correlation
Absolute value– tells us the strength of the relationship
Sign– tells us the direction of the relationship
Looking at the extent to which the variables covary with one another
More shared variability– stronger relationship, higher correlation
Vice versa for lower shared variability
simple regression
One predictor and one outcome
Simple linear regression– assessing the relationship between one predictor and one outcome
Matters which variable is predictor and which is outcome
In write up can’t say x causes y, have to say x predicts y
Results are scaled based on the outcome (y)
Enables us to predict y based on x with a linear model
Linear modeling
We represent the relationship between x and y with a straight line
A lot of the time the data are not perfectly linear and the model is not exactly correct, but a linear model can still be useful
Using a straight line
Keeps things simple and easy to see
Identifies the midpoint of the relationship between x and y
Takes the actual mean
Allows us to make predictions
How to draw this line
Use formula– y=bx + a
Same thing as y=mx + b
A– intercept
Point where we expect the line to cross the y axis
When x = 0
B– the slope of the line
How steep it is
Larger b values = steeper slope
Direction
Positive slope– sloped up from left to right
Negative slope– slopes down from left to right
Regression analysis goal one– least squares solution
Least squares solution
Want to find the line, on average, that is best representing the data set
Minimizes error
Want to take the line with the least amount of vertical distance between predicted data point and actual data point
Problem– just because it has the lowest error of all possible options, does not mean there is no error
Just because it’s the best does not mean it’s automatically good
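A rough sketch of the least squares solution for y = bx + a, assuming five invented data points. The helper names least_squares and sum_squared_error are illustrative.

```python
def least_squares(x, y):
    """Return slope b and intercept a that minimize squared vertical error."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sp = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    ss_x = sum((xi - mean_x) ** 2 for xi in x)
    b = sp / ss_x                 # slope
    a = mean_y - b * mean_x       # intercept: predicted y when x = 0
    return b, a

def sum_squared_error(x, y, b, a):
    """Total squared vertical distance between actual and predicted y."""
    return sum((yi - (b * xi + a)) ** 2 for xi, yi in zip(x, y))

x = [1, 2, 3, 4, 5]     # hypothetical predictor scores
y = [3, 5, 4, 8, 10]    # hypothetical outcome scores

b, a = least_squares(x, y)
best_error = sum_squared_error(x, y, b, a)
print(round(b, 2), round(a, 2), round(best_error, 2))
```

Any other slope or intercept gives a larger squared error, but the best line here still misses some points, which is the caveat on the card.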
Regression analysis goal 2– standard error of the estimate
Standard error of the estimate– standard distance between the actual and predicted values of y
Taking vertical distance and finding average/squaring
Want a small standard error
Smaller the standard error/closer to 0, better model will perform
Problems with just interpreting standard error
Depends on scale of measure
How to get everyone to agree
Effect size/goodness of fit– we can find r^2 for our regression model
Just like η^2 (eta squared), interpret as the proportion of variance in the outcome that is explained by the predictor
How much variance are we explaining in the model
r^2 of .34– 34% of the variance in y is accounted for by the model
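A minimal sketch of the standard error of the estimate and r^2 for a simple regression, assuming five invented data points and the usual n - 2 degrees of freedom for the residual.

```python
from math import sqrt

x = [1, 2, 3, 4, 5]     # hypothetical predictor scores
y = [3, 5, 4, 8, 10]    # hypothetical outcome scores
n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n

# Least squares slope and intercept
sp = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
ss_x = sum((xi - mean_x) ** 2 for xi in x)
b = sp / ss_x
a = mean_y - b * mean_x

predicted = [b * xi + a for xi in x]
ss_residual = sum((yi - pi) ** 2 for yi, pi in zip(y, predicted))  # unexplained
ss_total = sum((yi - mean_y) ** 2 for yi in y)                     # all variability in y

standard_error = sqrt(ss_residual / (n - 2))  # typical distance from the line, in y units
r_squared = 1 - ss_residual / ss_total        # proportion of variance explained
print(round(standard_error, 2), round(r_squared, 2))
```

The standard error is in the units of y, so it depends on the scale of the measure; r^2 is a proportion, so it can be compared across measures.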
Significance testing
Regression and ANOVA– same analysis with the same process
Both break down variance
Both use omnibus F-ratios to test the overall model before assessing the various components of the model
Only diff
Use regression when you have continuous predictors
Use ANOVA when you have discrete, grouping predictors
Analysis of regression
Null– slope of regression is zero
Alternative– slope of regression is not zero
Overall significance of the regression equation can be evaluated by computing an F ratio
To compute the F ratio, you first calculate a variance, or MS (mean square), for the predicted variability and for the unpredicted variability
If there is a positive correlation between x and y then the regression equation, y = bx +a will have
b>0
Positive correlation means slope will be positive as well
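A small sketch of the F ratio for a simple regression, assuming five invented data points, 1 degree of freedom for the predicted variability (one predictor), and n - 2 for the unpredicted variability.

```python
x = [1, 2, 3, 4, 5]     # hypothetical predictor scores
y = [3, 5, 4, 8, 10]    # hypothetical outcome scores
n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n

# Least squares slope and intercept
sp = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
ss_x = sum((xi - mean_x) ** 2 for xi in x)
b = sp / ss_x
a = mean_y - b * mean_x
predicted = [b * xi + a for xi in x]

# Break the variability in y into predicted and unpredicted parts
ss_regression = sum((pi - mean_y) ** 2 for pi in predicted)
ss_residual = sum((yi - pi) ** 2 for yi, pi in zip(y, predicted))

ms_regression = ss_regression / 1      # df = number of predictors
ms_residual = ss_residual / (n - 2)    # df = n - 2
f_ratio = ms_regression / ms_residual
print(round(f_ratio, 2), b > 0)
```

The data are positively correlated, so the slope b comes out positive, matching the last point on the card.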
Multiple regression
2 or more predictors, 1 outcome
Def– regression analysis involving more than one predictor
Why psych needs them
Things are complex– any one predictor can only explain so much
Because things are complex, some people might run several linear regression models
Pointless because predictors are all related so we need to do it in one test
Predictor overlap
many variables are correlated, at least to a small extent
Just adding variables to the model does not mean better predictive accuracy
Predictors with the least overlap possible are the most valuable
How much unique variability are you adding?
Too related means adding virtually nothing
Multiple regression equations
Determined by a least squared error solution
Minimize squared distance between the actual y value and the predicted y value
Same as simple linear regression, but now two or more predictors
Y = b1x1 + b2x2 + a
Adding second slope and second variable
X1 and x2– two diff predictor variables
B1 and b2– regression coefficients (slopes) for those variables
Intercept a is the predicted value when both x1 and x2 are 0
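A rough sketch of the least squares solution for two predictors, Y = b1x1 + b2x2 + a, using the deviation-score formulas. The data are constructed so that the true values are b1 = 2, b2 = 3, a = 1; the helper names are illustrative.

```python
def dev_products(u, v):
    """Sum of cross-products of deviations from the means (SS when u is v)."""
    mean_u, mean_v = sum(u) / len(u), sum(v) / len(v)
    return sum((p - mean_u) * (q - mean_v) for p, q in zip(u, v))

def two_predictor_fit(x1, x2, y):
    """Least squares b1, b2, a for y = b1*x1 + b2*x2 + a."""
    ss1, ss2 = dev_products(x1, x1), dev_products(x2, x2)
    sp12 = dev_products(x1, x2)            # overlap between the two predictors
    sp1y, sp2y = dev_products(x1, y), dev_products(x2, y)
    denom = ss1 * ss2 - sp12 ** 2          # zero if the predictors overlap perfectly
    b1 = (sp1y * ss2 - sp12 * sp2y) / denom
    b2 = (sp2y * ss1 - sp12 * sp1y) / denom
    a = sum(y) / len(y) - b1 * sum(x1) / len(x1) - b2 * sum(x2) / len(x2)
    return b1, b2, a

x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 5]
# Outcome built from known coefficients so the answer can be checked
y = [2 * p + 3 * q + 1 for p, q in zip(x1, x2)]

b1, b2, a = two_predictor_fit(x1, x2, y)
print(round(b1, 2), round(b2, 2), round(a, 2))
```

The sp12 term is the predictor overlap: each slope is adjusted for the variability the two predictors share.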
Comparing slopes of single predictors
We can calculate standardized regression coefficients by transforming all the raw scores to z scores before we begin the analysis
Extremely important to do in multiple regression
Written as β (beta), which looks like an italicized b
Unstandardized b coefficients don’t have a natural scale so they are not directly comparable
Interpreted in terms of standard deviation
Can’t compare the size of slopes across predictors if you don’t standardize them
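A minimal sketch of why standardizing matters, assuming invented data and, for simplicity, a single predictor measured in two different units. The unstandardized slope changes with the unit, while the slope on z scores does not.

```python
from statistics import mean, stdev

def slope(x, y):
    """Least squares slope of y on x."""
    mean_x, mean_y = mean(x), mean(y)
    sp = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    ss_x = sum((xi - mean_x) ** 2 for xi in x)
    return sp / ss_x

def z_scores(values):
    """Transform raw scores to z scores (mean 0, SD 1)."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

hours = [1, 2, 3, 4, 5]              # hypothetical predictor in hours
minutes = [h * 60 for h in hours]    # same predictor in minutes
score = [3, 5, 4, 8, 10]             # hypothetical outcome

b_hours = slope(hours, score)        # change in score per hour
b_minutes = slope(minutes, score)    # change in score per minute
beta_hours = slope(z_scores(hours), z_scores(score))
beta_minutes = slope(z_scores(minutes), z_scores(score))
print(round(b_hours, 3), round(b_minutes, 3), round(beta_hours, 3), round(beta_minutes, 3))
```

The standardized slope is interpreted in standard deviation units, so it stays the same no matter how the predictor was scaled.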
ANCOVA
Mix between anova and regression
Can have continuous and grouping predictors at the same time
Usually, the grouping variables are manipulated– independent
The continuous variables are measured
Start with a simple regression model predicting your DV from your covariates
After fitting the regression model, use ANOVA to understand the residual variance
residual variance
whatever variance wasn’t explained by the initial regression
Error or noise
Achieving constancy
Achieving equal impacts of confounds across levels/conditions of an independent variable
Extraneous– anything that differs across people
Confounding– when an extraneous variable systematically differs across experimental groups
When we anticipate a confounding variable, we should measure it and statistically control for it in our models
Statistical adjustments
Statistically removes variance from extraneous/confounding variables by holding their effects constant across groups
Tells us what the effect of the iv is above and beyond the effects of the extraneous or confounding variables
Problems with statistical controls (controlling for extraneous variables)
Need to know what the extraneous/confounding variable is beforehand
Fixing the problem after the fact
Systematic differences already exist– essentially just putting a band aid over it
Better to eliminate from the outset
More of a hail mary to fix study
Suggests that there may be problems with the study
what does internal consistency measure
Measures consistency across items
what does test-retest measure
Measures consistency across time
Testing people once and then retesting at a later date to see if changes occurred
Want to do it on constructs you don’t expect to change depending on context
what does interrater consistency measure
Measures consistency across raters
Convergent validity
Is the new measure related to other measures of the same construct
Want to make sure it correlates to other established measures
Criterion validity– predictive/concurrent
Correlating it with test of a different construct that it theoretically should relate to
Ex– stress and negative emotion are two diff constructs, but should be related
Why is multiple imputation usually the gold standard
Pick an algorithm and then pick the amount of imputations you want to do
Algorithm comes out with multiple possible values based on data
Allows you to get multiple estimates
Taking pooled estimate across all of them instead of just picking one
Safer and better than just taking the mean
Why do some researchers hate big data sets
Too much statistical power
Means everything will be statistically significant
Larger samples need smaller test statistics to be significant
Diff between simple linear regression and multiple regression coefficients
“While holding x constant”– multiple regression
What two pieces of info can we get from r^2
Tells us what % of variance is accounted for
Ex– r^2 of .36– 36% of variance is accounted for by predictor variable
Also tells us what is not accounted for
64% is not accounted for