Path Analysis Flashcards
Diagram conventions
Square
□
= observed / measured variables
Diagram conventions
Circle
◯
= latent / unobserved variables
Diagram conventions
Double-headed arrow
↔
= covariance
Diagram conventions
Single-headed arrow
→
= regression path
What issues do path models solve?
A path model allows us to test several linear models together as a set ( = multiple non-nested equations)
They are based on the correlation matrix of the measured variables in your study
Exogenous variables
have direct arrows going out → but none going in
they are essentially independent variables
- nothing affects these variables
- they affect outcomes
I like to remember them as an ex: no one likes them so no one goes in (no arrows in), but they’re still chasing people (arrows out)
Endogenous variables
have direct arrows going in ← (can also have them going out)
they are dependent variables in at least one part of the model (hence the arrow going in)
- they are predicted by something but can also predict something else
in linear models there is only one endogenous variable but in path models we can have multiple
Endogeneity bias
= a hidden variable we haven’t accounted for that still affects our study
e.g. leaving out a measure of intelligence on a study of school test scores
Basic structure of path modelling
Input of path models (study results)
↓
Correlation matrix
↓
Define a model that explains the relationship
↓
How well can our model reproduce the observed correlation matrix?
Lavaan
Latent variable analysis
this is the package in R which we use to fit path models (it has sensible defaults so most of the time we just give it our specified model and our dataset)
it requires 3 steps:
1) specify the model and create a model object
2) run the model using the sem() function
3) evaluate the model
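The three steps can be sketched roughly as below. This is only an illustration: the data are simulated and the variable names (x1, x2, m, y) are made up for the example, not taken from any real study.

```r
library(lavaan)

# Simulated data purely for illustration (hypothetical variables)
set.seed(1)
n  <- 300
x1 <- rnorm(n)
x2 <- 0.3 * x1 + rnorm(n)
m  <- 0.5 * x1 + 0.3 * x2 + rnorm(n)
y  <- 0.4 * m + rnorm(n)
dat <- data.frame(x1, x2, m, y)

# 1) specify the model and create a model object (a character string)
model <- "
  m ~ x1 + x2
  y ~ m
"

# 2) run the model using sem()
fit <- sem(model, data = dat)

# 3) evaluate the model: estimates, fit indices, standardised solution
summary(fit, fit.measures = TRUE, standardized = TRUE)
```

Here the paths y ~ x1 and y ~ x2 are left out on purpose, so the model has 2 df and fit indices can be computed.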
Lavaan
Model statements
observed variable = use the name given in the dataset
latent variable = give a new name
covariance = use ~~
regression path = use ~
Model specification
What is specification?
Specification concerns which variables relate to which others and in what ways
it is also where we formally set out our theory and hypothesis
for path analysis this is where we outline our model and then use the sem() function
This means basically writing down the paths that are included in your theoretical model.
Model specification
Path model standard rules
1) all exogenous variables correlate
2) for endogenous variables, we correlate the residuals, not the variables
3) endogenous variable residuals do NOT correlate with exogenous variables (we hope)
4) all paths are recursive (i.e. we can’t have loops)
Model identification
what is identification?
Identification concerns the number of knowns vs the number of unknowns ( = degrees of freedom)
Model identification
The Knowns
- variances of measured variables
- covariances between the variables
- the unique values in a correlation matrix
- in the correlation matrix these are the values on the diagonal and below
Model identification
The Unknowns
- the parameters we want to estimate
= all the lines we include in our diagram
= the variances of all variables (estimated), covariances and regression paths
Model identification
Degrees of Freedom (of path models)
= difference between the knowns and unknowns
df must not be negative for the model to be estimated, and must be positive to test model fit = we must have more knowns than unknowns = meaning our model simplifies our data
Model identification
t-rule
Used to calculate the knowns :
[ k * (k+1) ] / 2
Where:
k = number of observed variables
e.g. k = 5
[ 5 * (5+1)] / 2 = (5*6)/2 = 30/2 = 15 knowns
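The same calculation as a small R helper (the function name n_knowns is just made up for this card), with a cross-check that counts the cells on the diagonal and below of a 5 x 5 matrix:

```r
# Number of knowns for k observed variables:
# unique variances and covariances = k * (k + 1) / 2
n_knowns <- function(k) {
  k * (k + 1) / 2
}

n_knowns(5)  # 15

# Cross-check: count the cells on the diagonal and below of a 5 x 5 matrix
sum(lower.tri(diag(5), diag = TRUE))  # 15
```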
Model identification
levels of identification
Under identified models
Have <0 df
Model identification
levels of identification
Just identified models
Have 0 df
- all standard linear models (lm) are just-identified
Model identification
levels of identification
Over identified models
Have >0 df
= some flexibility to estimate parameters
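A minimal sketch of the three levels, assuming we have already counted the parameters by hand. The helper name identification_status and the three-variable example are hypothetical.

```r
# Classify a model from the number of observed variables (k)
# and the number of parameters we estimate (n_params)
identification_status <- function(k, n_params) {
  df <- k * (k + 1) / 2 - n_params   # knowns minus unknowns
  status <- if (df < 0) {
    "under-identified"
  } else if (df == 0) {
    "just-identified"
  } else {
    "over-identified"
  }
  list(df = df, status = status)
}

# Three observed variables give 6 knowns.
# 3 regression paths + 1 variance + 2 residual variances = 6 unknowns
identification_status(3, 6)   # df = 0, just-identified

# Drop one regression path: 5 unknowns
identification_status(3, 5)   # df = 1, over-identified

# Try to estimate 7 parameters from 6 knowns
identification_status(3, 7)   # df = -1, under-identified
```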
Model Estimation
estimating path models
Model estimation = finding the ‘best’ values for the unknown parameters
path model estimation = finds values for parameters that minimise the difference between the observed correlation matrix and the model correlation matrix
maximum likelihood estimation is the most common method used
- it is an iterative process that terminates when altering model values no longer improves the model = convergence has been reached
- if the model fails to converge follow the same steps as MLM
Model Evaluation
If a simplified model can reproduce the relationships in the data, it is a good model
comparing the observed correlation matrix with the model implied correlation matrix is key to evaluating how good our model is
Model evaluation
path tracing
path tracing = when we specify a model, we use the parameter estimates to recalculate the correlations/covariances
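A small worked example of path tracing, assuming a three-variable model with standardised paths X → M (a), M → Y (b) and X → Y (c). The path values are invented for illustration.

```r
# Hypothetical standardised path estimates
a <- 0.5   # X -> M
b <- 0.4   # M -> Y
c <- 0.2   # X -> Y

# Trace every legal route between each pair of variables and add them up
r_xm <- a            # X -> M
r_xy <- c + a * b    # X -> Y directly, plus X -> M -> Y
r_my <- b + a * c    # M -> Y directly, plus M <- X -> Y

# Model-implied correlation matrix
vars <- c("X", "M", "Y")
implied <- matrix(c(1,    r_xm, r_xy,
                    r_xm, 1,    r_my,
                    r_xy, r_my, 1),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(vars, vars))
implied
```

Comparing this matrix with the observed correlation matrix is what model evaluation is based on.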
Model fit
in path models
in path models we tend not to focus on variance explained in the outcome (as we would for MLM)
instead we ask does our model fit the data? if so, what are the parameter estimates?
‘fitting the data’ refers to how well our model implied correlation matrix reproduces the observed correlation matrix
- if it does this well = it fits (but this is a continuum so some fit better than others)
just-identified models will always fit perfectly
If we have positive df we can calculate model fit indices
Model fit
model fit indices
Global Fit (chi squared)
Statistically significant chi squared = POOR FIT
when we use MLE we obtain a chi squared value for the model which can be compared to a chi squared distribution with the same dfs as our model to determine significance
BUT this does not work well in practice as it leads to the rejection of models that are only trivially mis-specified
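The comparison to the chi squared distribution can be done by hand in base R. The chi squared value and df here are made-up numbers, not output from a real model.

```r
# Hypothetical model chi squared and degrees of freedom
chisq_value <- 3.84
model_df    <- 1

# Probability of a chi squared this large or larger if the model is correct
p_value <- pchisq(chisq_value, df = model_df, lower.tail = FALSE)
p_value   # roughly 0.05

# p < .05 would be read as statistically significant = poor fit
```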
Model fit
model fit indices
Absolute fit (SRMR)
values <0.05 = GOOD FIT
SRMR = standardised root mean-squared residual
measures the discrepancy between observed correlation matrix and model implied
ranges from 0 (perfect fit) to 1 (terrible fit): lower is better, the opposite direction to CFI and TLI, which is confusing
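A simplified sketch of the idea behind SRMR, using two invented correlation matrices. The value lavaan reports may be computed slightly differently in detail, so treat this as an illustration of the logic only.

```r
vars <- c("X", "M", "Y")

# Hypothetical observed correlation matrix
observed <- matrix(c(1.0, 0.5, 0.4,
                     0.5, 1.0, 0.5,
                     0.4, 0.5, 1.0),
                   nrow = 3, byrow = TRUE, dimnames = list(vars, vars))

# Hypothetical model implied correlation matrix
implied <- matrix(c(1.00, 0.50, 0.25,
                    0.50, 1.00, 0.50,
                    0.25, 0.50, 1.00),
                  nrow = 3, byrow = TRUE, dimnames = list(vars, vars))

# Residuals = observed minus implied
resid_mat <- observed - implied

# Square the unique residuals (diagonal and below), average, take the root
idx  <- lower.tri(resid_mat, diag = TRUE)
srmr <- sqrt(mean(resid_mat[idx]^2))
srmr
```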
Model fit
model fit indices
Parsimony Corrected (RMSEA)
values <0.05 = good fit
RMSEA = root mean-squared error of approximation
this corrects for the complexity of the model and rewards simpler models by adding a penalty for each extra estimated parameter (i.e. for having fewer dfs)
ranges from 0 (perfect fit) to 1 (terrible fit): lower is better, the opposite direction to CFI and TLI, which is confusing
Model fit
model fit indices
Incremental fit indices
Comparative fit index (CFI) = >0.95 = good fit
- ranges from 0 to 1 where 1 = perfect fit
Tucker-Lewis index (TLI) = >0.95 = good fit
- includes a parsimony correction
Compares the model to a more restricted baseline model - usually an ‘independence’ model where all observed variable covariances are fixed to 0
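A sketch of how the comparison to the baseline model turns into CFI and TLI, using the usual textbook formulas. The chi squared values are invented, and lavaan calculates both indices for you.

```r
# Hypothetical chi squared and df for our model and the baseline model
chisq_model <- 10;  df_model <- 5
chisq_base  <- 200; df_base  <- 10

# CFI: how much of the baseline model's misfit our model removes
cfi <- 1 - max(chisq_model - df_model, 0) /
           max(chisq_base - df_base, chisq_model - df_model, 0)

# TLI: same idea but on chi squared per df, which brings in the parsimony correction
tli <- ((chisq_base / df_base) - (chisq_model / df_model)) /
       ((chisq_base / df_base) - 1)

round(c(CFI = cfi, TLI = tli), 3)
```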
Model fit
model fit indices
Local Fit
it is possible to examine local areas of mis-fit
Modification indices = estimate the improvement in chi squared that could be expected from including an additional parameter
Expected parameter changes = estimates the value of the parameter, were it to be included
Model modifications
= they indicate how much your model would improve if you added a path to your model
modification indices and expected parameter changes can be helpful for identifying how to improve a model but this is purely EXPLORATORY
they can be extracted in R using:
modindices(model)
(where model is the fitted model object returned by sem())
HOWEVER:
- modifications should be done iteratively
- they might just be capitalising on chance
- must ensure modifications can be justified
- ideally, we would need to replicate the new model in an independent sample
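A rough sketch of pulling out modification indices and expected parameter changes. The data are simulated and the variable names are hypothetical. The direct path y ~ x is left out of the model on purpose so there is something to find.

```r
library(lavaan)

# Simulated data for illustration only
set.seed(2)
n <- 300
x <- rnorm(n)
m <- 0.5 * x + rnorm(n)
y <- 0.4 * m + 0.3 * x + rnorm(n)
dat <- data.frame(x, m, y)

# Model with the direct path y ~ x omitted, which leaves 1 df
model <- "
  m ~ x
  y ~ m
"
fit <- sem(model, data = dat)

# mi  = expected drop in chi squared if the parameter were added
# epc = expected value of that parameter if it were added
mi <- modindices(fit, sort. = TRUE)
mi[, c("lhs", "op", "rhs", "mi", "epc")]
```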
Interpreting path models
If our specified model fits the data, we can interpret the parameter estimates
Recall these are just correlation and regression paths so we interpret them the same way we would r and β coefficients
What is mediation?
Mediation is when a predictor X has an effect on outcome Y via the mediating variable M
The mediator transmits the effect of X to Y
In reality there is no such thing as direct effects - everything occurs via mediation
e.g.
- anxiety (X) decreases physical health (Y) due to lack of sleep (M)
Path model mediation
traditional methods of testing mediation were based on comparing across linear models but these suffer from low power and are very cumbersome
path model mediation is better than traditional methods but should only really be used with longitudinal data as mediation occurs over time
Path model mediation (on cross-sectional data)
Indistinguishable models
mediation is possible to do on cross-sectional data but there is a big conceptual problem:
we are modelling correlations → cross-sectional data means we have multiple indistinguishable models → so there is nothing to demonstrate whether one model is better than another
What is moderation?
moderation is when a moderator z modifies the effect of x on y
- e.g. the effect of x on y is stronger at higher levels of z
- also known as an interaction between x and z
Path Mediation
what are total effects?
= the overall effect of a predictor on the outcome is known as the total effect
total effect = indirect + direct effect
They can be interpreted as:
the unit increase in Y expected to occur when X increases by one unit
Path Mediation
what are direct effects?
= The effect of x on y (NOT via the mediator)
In a path model it would look like this:
X → Y
They can be interpreted as:
the unit increase in Y expected to occur with a unit increase in X over and above the increase transmitted by M
NOTE: the direct effect may not be direct in real life - it could be affected by other mediators we haven’t included in our model
Path Mediation
what are indirect effects?
= the effects of X on Y transmitted VIA the mediator
To estimate indirect effects we multiply the paths
( X → M) by ( M → Y)
They can be interpreted as:
the unit increase in Y expected to occur via M when X increases by one unit
Path mediation
Testing Mediation
Demonstrating mediation will usually rely on:
- evaluating the significance of direct, total and indirect effects
- considering the proportion of the total effects which is due to the mediated path
Proportion mediated = indirect / total
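A worked example of the arithmetic with made-up path estimates, where a is the X → M path, b is the M → Y path and c is the direct X → Y path:

```r
# Hypothetical path estimates
a <- 0.5   # X -> M
b <- 0.4   # M -> Y
c <- 0.2   # X -> Y (direct effect)

indirect <- a * b             # effect transmitted via M
total    <- indirect + c      # total effect = indirect + direct
prop_med <- indirect / total  # proportion mediated

c(indirect = indirect, total = total, proportion_mediated = prop_med)
```

With these numbers half of the total effect of X on Y goes via M.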
Path Mediation
Testing a path mediation model in lavaan
1) Specification
= create a lavaan syntax object
2) Estimation
= e.g. using maximum likelihood
3) Evaluation / interpretation
= inspect the model to judge how good it is
= interpret the parameter estimates
We constrain some of the paths in our model to 0 (saying there is no effect along that path) so we can test how well our model reproduces our observed correlation matrix given the restricted paths
e.g. basically, pick 2 arrows on the diagram and see how well they predict, then pick different arrows and see if they predict better, etc.
- we can choose specific paths to answer specific research questions
Path Mediation
coding effects
to calculate the indirect effects of X on Y in path mediation, we first need to create some new parameters
We label these from our path model:
a = regression coefficient for M ~ X
b = regression coefficient for Y ~ M
c = regression coefficient for Y ~ X
In R we then use := to create a new parameter e.g.
indirect := a*b
total := (a*b) + c
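Putting the labels and the new parameters together in one lavaan model might look like this. It is a sketch with simulated data and hypothetical variable names x, m and y.

```r
library(lavaan)

# Simulated data for illustration only
set.seed(3)
n <- 300
x <- rnorm(n)
m <- 0.5 * x + rnorm(n)
y <- 0.4 * m + 0.2 * x + rnorm(n)
dat <- data.frame(x, m, y)

# Labels go in front of the predictor with *, new parameters use :=
model <- "
  m ~ a * x
  y ~ b * m + c * x

  indirect := a * b
  total    := (a * b) + c
"
fit <- sem(model, data = dat)

# The defined parameters appear in the output with the operator :=
pe <- parameterEstimates(fit)
pe[pe$op == ":=", c("label", "est", "se", "pvalue")]
```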
Path Mediation
Model evaluation
We want to see:
- model estimates
- model fit
- standardised solutions
- (possibly modification indices)
Path Mediation
Model Output
Things to note:
1) significant effects = look at p-values
2) degrees of freedom = if they are positive we can assess model fit
Path Mediation
Significance of Indirect effects
As indirect effects are calculated from other parameters (a product of regression coefficients) rather than estimated directly, their standard errors cannot be obtained in the usual way
The default method of assessing statistical significance of indirect effects assumes a normal sampling distribution
BUT this may not hold up for indirect effects that are the product of regression coefficients
instead we use bootstrapping (if 95% CI includes 0, indirect effect is not significant at 0.05 sig level)
Path Mediation
Significance of Indirect effects
Bootstrapping CIs in lavaan
1) run the model
- using se = "bootstrap" inside sem()
2) view the output with CIs
3) (if needed) standardise parameters (e.g. if measurements don’t have easy interpretations)
- using “std = T”
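A sketch of the bootstrapping steps with simulated data and hypothetical variable names. Only 200 bootstrap draws are used here to keep it quick; many more would normally be wanted. standardizedSolution() is used as one way of getting standardised parameters.

```r
library(lavaan)

# Simulated data for illustration only
set.seed(4)
n <- 300
x <- rnorm(n)
m <- 0.5 * x + rnorm(n)
y <- 0.4 * m + 0.2 * x + rnorm(n)
dat <- data.frame(x, m, y)

model <- "
  m ~ a * x
  y ~ b * m + c * x
  indirect := a * b
  total    := (a * b) + c
"

# 1) run the model with bootstrapped standard errors
fit <- sem(model, data = dat, se = "bootstrap", bootstrap = 200)

# 2) view the output with percentile bootstrap 95% CIs
pe <- parameterEstimates(fit, boot.ci.type = "perc")
pe[pe$op == ":=", c("label", "est", "ci.lower", "ci.upper")]

# 3) (if needed) standardised parameters
std <- standardizedSolution(fit)
std[std$op == ":=", c("label", "est.std")]
```

If the CI for indirect does not include 0, the indirect effect is significant at the 0.05 level.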
What if the model doesn’t fit?
REMEMBER the goal is not to achieve model fit
if model fit is poor we should not draw substantive conclusions from it but we can assess why fit is poor.
Path mediation
Model modification
you may want to modify your initially hypothesised model e.g. remove non-significant paths, include some other paths etc.
BUT as soon as we make a modification we are no longer testing the model in a confirmatory way
Our analysis shifts to being led by the data rather than theory and this is not preferred
Similar to MLM modification
Reporting path mediation models
Method/analysis strategy
mention:
- the model being tested
e.g. Y was regressed on both X and M, and M was regressed on X
- the estimator used
e.g. maximum likelihood
- the method used to test significance of indirect effects
e.g. bootstrapped 95% CIs
Reporting path mediation models
Results
*model fit (for over identified models)
*parameter estimates for path mediation and their statistical significance
- can be useful to present in a SEM diagram BUT the diagrams in R are not considered publication quality
Reporting path mediation models
SEM diagram
- include key parameter estimates
- include statistically significant paths (indicated with *)
- basically just add numbers to our path diagram
*include figure note that explains how statistically significant paths are identified and at what level
Reporting path mediation models
Visualising the model
There are a number of R packages that will produce path diagrams
BUT the presentation of these is not always clear and it can be difficult to refine them
John used PowerPoint to make the diagrams
Reporting path mediation models
The indirect effects
- results = the coefficients for the indirect effect (significance, direction of effect +/- etc.) and the bootstrapped 95% CIs
- common to report proportion mediated - but this should always be interpreted within the context of the study
- interpretation can be tricky if there’s a mix of + and - effects involved
Other path analysis models
Anything that can be expressed in terms of regressions between observed variables can be tested as a path model
- can include ordinal and binary data
- can include moderation