ANOVA, Correlation, Regression, Multiple Regression, Hierarchical Models, Mediation Flashcards
Describe the basic ANOVA
predictor = categorical
outcome = continuous
A significant p-value for the F statistic = a significant difference
(somewhere in the means of our conditions)
Follow up with post-hoc tests of some sort
What is the F statistic
F = ratio of explained variance / unexplained variance
(in our model)
F = Goodness of model/Badness of model = Signal/Noise
The F-test can tell whether our multiple regression model is a good one, given the data
What is the model we’re testing in ANOVA?
there is a difference between the categories
comparing evidence in favour of a difference against the evidence that there’s nothing at all going on
How many conditions to do a post-hoc test
When we have at least 3 conditions we can do a variety of post-hoc tests for a basic ANOVA.
When we have only 2 conditions there’s no need for post-hocs (we’re already basically doing a t-test)
Name 4 post-hoc tests
Least Significant Difference (LSD) (boo!)
same as a t-test
does not adjust the type 1 error
Bonferroni (yikes!)
very strict adjustment
likely to make a type 2 error
Tukey’s Honest Significant Difference (HSD)
not as strict as Bonferroni, but still more restrictive than the LSD test
Dunnett
If we’re only interested in the comparisons against a control:
need to tell Jamovi which one is the control condition so all comparisons are based on it
more power to find significant differences
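A minimal Python sketch of why Bonferroni is so strict: the more pairwise comparisons there are, the smaller the alpha each one has to beat. The function names and the alpha of .05 are assumptions for illustration, not a description of what jamovi does internally.

```python
def n_pairwise_comparisons(n_conditions):
    """Number of unique pairs that can be formed from n_conditions groups."""
    return n_conditions * (n_conditions - 1) // 2


def bonferroni_alpha(alpha, n_comparisons):
    """Per-comparison alpha after a Bonferroni adjustment."""
    return alpha / n_comparisons


# Hypothetical design with 4 conditions, which gives 6 pairwise comparisons
comparisons = n_pairwise_comparisons(4)
adjusted = bonferroni_alpha(0.05, comparisons)
print(comparisons, round(adjusted, 4))
```

With 4 conditions, each comparison has to reach roughly p < .008 instead of p < .05, which is where the extra risk of a type 2 error comes from.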
Name 3 basic ANOVAs
One-way ANOVA: Tests a single predictor against any number of outcomes (all tested independently)
Univariate ANOVA: Tests one or more predictors against a single outcome (just called ANOVA in jamovi)
Repeated Measures ANOVA: Tests one or more within-groups predictors against a single outcome
Requires data to be laid out differently than the other two
How many means in t-test vs ANOVAs
t-test for 2 means exactly
ANOVAs for 2 or more means
Name two types of averages
mean and median are both averages: two different ways to assess what is typical. They are measures of central tendency
Name 3 types of t-tests
independent = each row is a different person
paired sample = one row provides more than one mean
one sample t-test = compares one mean to zero
i.e., is the thing that I found different from zero?
Should ANOVA and correlational analysis agree
The ANOVA did not find a significant difference because it was comparing the group means with each other
however, the correlational analysis is not comparing means but instead finding the line of best fit: the two do not have to agree since they test different hypotheses
Even if there is not a major difference from one year to the next, the correlation can pick up on the change over time
if you have a correlation of 0.170, then r2 is the proportion of variance that is explained by the variable
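As a quick worked example of that last point, squaring the correlation of 0.170 shows how small the explained portion is. This is plain arithmetic; the variable names are just for illustration.

```python
r = 0.170                      # correlation from the flashcard
r_squared = r ** 2             # proportion of variance explained
percent_explained = 100 * r_squared
print(round(r_squared, 4), round(percent_explained, 2))
```

So a correlation of 0.170 explains a little under 3% of the variance.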
How many predictors in correlation versus linear regression
the number of predictors is the only difference: a linear regression can have more than one predictor, whereas a correlation assumes there is only one
What is the difference between R from correlation and linear regression in Jamovi
linear regression option in Jamovi = calculates R2 first then ‘unsquares’ it.
important because the value will always be positive
Which betas are standardized vs unstandardized
the beta in the R is standardized using Z-scores
used to describe the strength of the relationship
the beta in the predictor is unstandardized
used to draw the line of best fit
Describe correlations
The correlation coefficient (r) is a statistic that represents the strength of the relationship between two continuous variables.
The correlation coefficient gets a p-value attached by working in assumptions about how likely it is to find a relationship as strong as the one we observed by chance alone, given the sample size
In this way, significance is determined just like it is for any of the other statistics we’ve discussed
The relationship being described is always a straight line – or the line of best fit
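A minimal sketch of how r is computed from raw scores, using only the standard library. The data (hours and score) are made up for illustration.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Co-variation of x and y, and the variation of each on its own
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


hours = [1, 2, 3, 4, 5]   # hypothetical predictor
score = [2, 4, 5, 4, 5]   # hypothetical outcome
r = pearson_r(hours, score)
print(round(r, 3))
```

The sign of r gives the direction of the straight-line relationship and its size gives the strength; the p-value would come from a separate step that also takes the sample size into account.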
What is the basic regression model
The regression model is our attempt to use a straight line to represent the relationship between two variables.
The formula for a regression line of best fit can be written out as:
Y = bX + c
The strength of this relationship is the beta coefficient
It describes the slope of the straight line as an unstandardized measure
Once standardized, it is equivalent to r, as long as we have only two variables in the model
standardized b = r (only with 2 variables; jamovi's R drops the sign)
c is the intercept, aka the point at which the line crosses the vertical axis
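A minimal sketch of how b and c in Y = bX + c can be found by least squares. The function name fit_line and the data are assumptions for illustration.

```python
def fit_line(x, y):
    """Return (b, c) for the least-squares line Y = bX + c."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (v - mean_y) for a, v in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    b = sxy / sxx               # unstandardized slope
    c = mean_y - b * mean_x     # intercept: the line passes through the two means
    return b, c


x = [1, 2, 3, 4, 5]             # hypothetical predictor scores
y = [2, 4, 5, 4, 5]             # hypothetical outcome scores
b, c = fit_line(x, y)           # slope 0.6, intercept 2.2 (up to rounding)
predicted_at_6 = b * 6 + c      # use the line to predict Y when X = 6
print(round(b, 2), round(c, 2), round(predicted_at_6, 2))
```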
Describe the two kinds of beta
Unstandardized
Retains the original units of measurement for the variables
Difficult (or, realistically, often impossible) to compare against other betas in multiple regression model
Standardized
Converts variables to Z scores before calculating correlation
Allows for easy comparison against other betas in multiple regression model
e.g. is this predictor stronger than this one
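A small sketch of the difference between the two kinds of beta: changing the units of the predictor changes the unstandardized slope, but not the standardized one. The data and the helper names are made up; the standardized beta is obtained here by rescaling the slope with the two standard deviations, which gives the same result as converting both variables to Z-scores first.

```python
import statistics


def slope(x, y):
    """Unstandardized least-squares slope of y on x."""
    mean_x = statistics.mean(x)
    mean_y = statistics.mean(y)
    sxy = sum((a - mean_x) * (v - mean_y) for a, v in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def standardized_beta(x, y):
    """Slope rescaled by sd(x) / sd(y), as if both variables were Z-scores."""
    return slope(x, y) * statistics.stdev(x) / statistics.stdev(y)


height_m = [1.50, 1.60, 1.70, 1.80, 1.90]      # hypothetical heights in metres
height_cm = [h * 100 for h in height_m]        # same heights in centimetres
weight_kg = [52, 60, 63, 75, 80]               # hypothetical outcome

b_m = slope(height_m, weight_kg)               # kg per metre
b_cm = slope(height_cm, weight_kg)             # kg per centimetre
beta_m = standardized_beta(height_m, weight_kg)
beta_cm = standardized_beta(height_cm, weight_kg)
print(round(b_m, 2), round(b_cm, 2), round(beta_m, 3), round(beta_cm, 3))
```

The two unstandardized slopes differ by a factor of 100, while the standardized betas match, which is why only standardized betas can be compared across predictors.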
Describe the Multiple Regression Model
just a regression, where we add more lines to the model.
Y = b1X1 + b2X2 + c
Or, more completely, as the Field textbook likes to write it:
Yi = (b0 + b1X1i + b2X2i) + ei
You can have as many (bnXn) combinations as you like, to suit your needs
Each one is a different IV within your model
When you add an IV, you change how the variability in the data are interpreted, and hopefully how much is explained
The multiple regression model assumes causality: the predictors (x) are taken to impact the outcome (y)
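A minimal sketch of fitting Y = b1X1 + b2X2 + c with two predictors, using only the standard library. The function name and the noise-free data are assumptions for illustration; real software handles any number of predictors with matrix algebra.

```python
def fit_two_predictors(x1, x2, y):
    """Least-squares b1, b2 and c for Y = b1*X1 + b2*X2 + c."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    d1 = [v - m1 for v in x1]
    d2 = [v - m2 for v in x2]
    dy = [v - my for v in y]
    s11 = sum(a * a for a in d1)
    s22 = sum(a * a for a in d2)
    s12 = sum(a * b for a, b in zip(d1, d2))     # overlap between the two predictors
    s1y = sum(a * b for a, b in zip(d1, dy))
    s2y = sum(a * b for a, b in zip(d2, dy))
    det = s11 * s22 - s12 ** 2                   # zero if the predictors are perfectly collinear
    b1 = (s1y * s22 - s2y * s12) / det
    b2 = (s2y * s11 - s1y * s12) / det
    c = my - b1 * m1 - b2 * m2
    return b1, b2, c


# Hypothetical data built so that Y = 2*X1 + 3*X2 + 1 exactly
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2, 1, 4, 3, 6, 5]
y = [2 * a + 3 * b + 1 for a, b in zip(x1, x2)]
b1, b2, c = fit_two_predictors(x1, x2, y)
print(round(b1, 3), round(b2, 3), round(c, 3))
```

Each slope is estimated while taking the other predictor into account, which is why adding or removing an IV changes the other coefficients.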
Describe the Three Kinds Of Variability
Total Variability
This represents the variability between the observed scores and the most basic statistical model
x - xbar
Residual Sum of Squares
This represents the variability between the observed scores and the line of best fit
This variability is indicated by the blue dashed lines
Now we’re comparing observations against the solid blue line instead of the orange line
x - line of best fit
Model Sum of Squares
This represents the variability between the overall mean model and the line of best fit
If this value is large, then the regression model is better than the mean model
line of best fit - xbar
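A minimal sketch of the three kinds of variability for a one-predictor regression. The data are made up, and the outcome scores are called y here (the cards write them as x).

```python
def sums_of_squares(x, y):
    """Return (SSt, SSr, SSm) for the least-squares line of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (v - mean_y) for a, v in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    b = sxy / sxx
    c = mean_y - b * mean_x
    fitted = [b * a + c for a in x]
    ss_total = sum((obs - mean_y) ** 2 for obs in y)                  # observed vs mean model
    ss_resid = sum((obs - fit) ** 2 for obs, fit in zip(y, fitted))   # observed vs line of best fit
    ss_model = sum((fit - mean_y) ** 2 for fit in fitted)             # line of best fit vs mean model
    return ss_total, ss_resid, ss_model


x = [1, 2, 3, 4, 5]   # hypothetical predictor
y = [2, 4, 5, 4, 5]   # hypothetical outcome
ss_total, ss_resid, ss_model = sums_of_squares(x, y)
print(round(ss_total, 2), round(ss_resid, 2), round(ss_model, 2))
```

For a least-squares line the total variability splits exactly into the model and residual parts, so SSt = SSm + SSr.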
How can you calculate the proportion of variance explained
overall proportion of variance explained by a multiple regression model.
R2 = SSm/SSt
R is just like r but it’s meant for more than two variables
SSt is the Total Variability
SSm is the Model Sum of Squares
we lose the direction of the relationship because we square the values and get rid of the negative values
How can you calculate the f statistic of model fit
Model Sum of Squares / Residual Sum of Squares = F model fit.
Two conversions are needed first: SS –> MS (divide each SS by its degrees of freedom)
MSm = SSm/k
MSr = SSr/(N - k - 1)
F = MSm/MSr
The F statistic helps us understand whether the multiple regression model predicts the outcome from our predictors better than the mean model does.
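A minimal sketch of the SS to MS conversions and the resulting F, assuming k is the number of predictors, N is the sample size, and the residual degrees of freedom are N - k - 1 (the usual convention). The sums of squares passed in are made-up numbers.

```python
def model_fit(ss_model, ss_resid, n, k):
    """Convert sums of squares to mean squares and return the F statistic."""
    ms_model = ss_model / k               # model df = number of predictors
    ms_resid = ss_resid / (n - k - 1)     # residual df = N - k - 1
    f_stat = ms_model / ms_resid
    return ms_model, ms_resid, f_stat


# Hypothetical values: SSm = 3.6, SSr = 2.4, N = 5 observations, k = 1 predictor
ms_m, ms_r, f_stat = model_fit(ss_model=3.6, ss_resid=2.4, n=5, k=1)
print(round(ms_m, 2), round(ms_r, 2), round(f_stat, 2))
```

The p-value is then read from the F distribution with k and N - k - 1 degrees of freedom, a step left to the statistics software here.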
What is the p-value of F statistic telling us
The p-value helps us decide if we are confident enough that the regression model is strong enough to reject the null hypothesis
What can you do from a non-significant R model
A non-significant R model can still point us to how we can simplify our model to regain power and diminish the lost degrees of freedom
Name the first 8 assumptions of multiple regression
- Linearity
- No Perfect Multicollinearity
- Independent Errors
- Homoscedasticity
- Normally Distributed Errors
- No Missing Predictors
- All Variables are Continuous
- All Variables Must Vary
Describe the assumption of linearity
We need to be able to deal with straight lines
In the simplest terms, the outcome variable (DV) should be linearly, not curvilinearly, related to each predictor (IV)
Ideally, the influences of the predictors are additive
That is, they each provide independent prediction of the DV: each adds a unique contribution to the equation and doesn’t overlap with the other predictors
Describe the assumption of multicollinearity
Multicollinearity occurs when two or more predictors are highly correlated.
There are two statistics that can help us detect it: Tolerance and VIF
Tolerance: Ranges from 0 – 1, and a tolerance value below .10 is likely a problem
VIF: Starts at 1 and goes up, values above 10 are likely a problem
to fix it, we must remove one of the predictors to reduce the collinearity
Describe VIF
The Variance Inflation Factor
Calculated by running a multiple regression using only the IVs (predictors).
The DV is not included so VIF is the same regardless of your DV
VIF takes each IV in turn, and uses it as a DV that’s predicted by the remaining IVs.
The R2 of this regression tells us how much dependence there is, so:
VIF = 1/(1 - R2)
In other words, what proportion of the variance in one IV can be explained by the other IVs
Describe Tolerance
Tolerance is the bottom part of the VIF equation:
VIF = 1/(1 - R2)
Tolerance = 1-R2
Will always agree with VIF
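A minimal sketch showing why Tolerance and VIF always agree: both are computed from the same R2 (the one obtained by predicting one IV from the remaining IVs). The R2 values in the loop are made up for illustration.

```python
def tolerance(r_squared):
    """Tolerance: the share of an IV's variance NOT explained by the other IVs."""
    return 1 - r_squared


def vif(r_squared):
    """Variance Inflation Factor: the reciprocal of tolerance."""
    return 1 / (1 - r_squared)


# Hypothetical R2 values from regressing one IV on the remaining IVs
for r2 in (0.0, 0.5, 0.9, 0.95):
    print(r2, round(tolerance(r2), 2), round(vif(r2), 1))
```

An R2 of .90 sits right at both cut-offs at once (tolerance of .10, VIF of 10), so the two statistics can never disagree about a predictor.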
Describe the assumption of Independent Errors
For any two observations, the residuals should be uncorrelated. Residuals are the unexplained portion of variance (leftovers)
In other words, what is not explained by the model (the residuals) should be random. If we violate this assumption, then the MR model p-value is invalid
In what scenario is the assumption of independent error rarely violated
Fortunately, independence is rarely an issue for researchers in psychology since we work with people, and it is unlikely that different people's residuals would correlate
When would you use the Durbin-Watson d test
To assess whether the assumption of independent errors is violated in time-series data. The Durbin-Watson d test is a statistical test that shows whether you have a big problem.
This test is called the Autocorrelation test in jamovi
Produces a value varying from 0 – 4, where optimal (no issue) is 2
You should treat values below 1 or above 3 as a cause for concern
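A minimal sketch of the Durbin-Watson d statistic, computed from residuals listed in time order. The two residual series are made up to show the extremes.

```python
def durbin_watson(residuals):
    """Durbin-Watson d: squared changes between neighbouring residuals over squared residuals."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Hypothetical residuals in time order
trending = [-2, -1.5, -1, -0.5, 0.5, 1, 1.5, 2]   # neighbours look alike (positive autocorrelation)
alternating = [1, -1, 1, -1, 1, -1, 1, -1]        # neighbours flip sign (negative autocorrelation)

d_trending = durbin_watson(trending)
d_alternating = durbin_watson(alternating)
print(round(d_trending, 2), round(d_alternating, 2))
```

The trending series lands well below 1 and the alternating series well above 3, so both would be a cause for concern; residuals with no pattern over time give a value near 2.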
Describe the assumption of homoscedasticity
At each level of prediction, the variance of the residuals should be constant.
I.e., the height of a residuals plot should be the same all the way from left to right
If we violate this assumption, the p-values of our betas are invalid. On the bright side, the beta values themselves are still accurate
If you see a funnel shape or bowtie, back away slowly
Our human brain tends to overestimate the spread of the cloud, so one point on the side can look like a funnel shape, but in reality the majority of the points are behaving randomly
Describe the assumption of normally distributed errors
We assume the residuals (unexplained variance) are random, normally distributed, and have an overall mean of 0.
If we violate this assumption, the p-values of our betas are again invalid. You can easily work around this problem by using large sample sizes.
Large sample sizes increase confidence, so it’s easy to trust our results are right regardless.
You can also use bootstrapping to overcome issues (in theory, because jamovi does not offer this option)
In jamovi: Under Assumption Checks, add the Q-Q plot of residuals. Look for big swings away from the 45-degree diagonal
Name two ways to have invalid p-values of betas
either violating homoscedasticity or violating normally distributed errors
Name two additional concerns to be aware of while running multiple regression
- You need a sufficient sample size
- You shouldn’t have outliers or, more accurately, influential cases. Because we want our line to represent the majority and not be strongly influenced by a few individual cases
Why is sample size important
For any correlational analysis it’s important to have a reasonably large sample size.
If you don’t, there’s a high probability of getting unstable correlation coefficients (i.e., big changes across samples)
This problem extends to MR, and is amplified because you now have multiple unstable coefficients
General rule of thumb:
Have a bare minimum of 40 participants per predictor, but get double or triple that if you can
Is it possible to have no missing predictors
We must accept that almost all of the time we will be missing at least one predictor that could impact the relationship observed
Remember that there are two components to the multiple regression model.
1. The overall model, including all variance shared with the DV (R)
2. The individual predictors, showing the unique shared variance with the DV (beta)
Every time you add or remove a predictor variable you change both components, fundamentally
The impact tends to be most evident on the betas
This is just the nature of partial correlations
What is Partial Correlation
Imagine variance is a circle
For two variables you have two circles, and the shared variance would be the amount of overlap between the circles
This overlap is also the correlation
Which means it’s the beta too
What is the relationship between partial correlations and dependence
In multiple regression you have at least three variance circles (e.g. cake icing example)
With three variance circles you also get three shared variances
Two of the shared variances are with the DV
One shared variance is among the IVs
Hiding inside the covariances with the DV is the shared variance among the IVs
How can multiple regression help us with partial correlations
Multiple regression attempts to assign importance to the different predictors, in the context of all the others by partialling the variance.
Beta: The overlap of the IV and DV is examined after discarding any shared variance
This is just the nature of partial correlations
Through Beta we uncover the unique variance that is shared between the IV and the DV – it is, of course, only part of the total effect the IV has on the DV
When one (standardized) Beta is bigger than another, it has a stronger influence. In this case, orange icing does seem to taste better
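A minimal sketch of partialling with two predictors: the standardized betas can be written directly in terms of the three correlations among the variables. The correlation values are made up for illustration.

```python
def standardized_betas(r_y1, r_y2, r_12):
    """Standardized betas for two predictors, from the three pairwise correlations.

    r_y1, r_y2: correlation of each IV with the DV
    r_12: correlation between the two IVs
    """
    denom = 1 - r_12 ** 2
    beta1 = (r_y1 - r_y2 * r_12) / denom
    beta2 = (r_y2 - r_y1 * r_12) / denom
    return beta1, beta2


# Hypothetical correlations: the IVs overlap with each other (r = .30)
beta1, beta2 = standardized_betas(r_y1=0.50, r_y2=0.40, r_12=0.30)
print(round(beta1, 3), round(beta2, 3))

# With no overlap between the IVs, each beta is just that IV's correlation with the DV
print(standardized_betas(r_y1=0.50, r_y2=0.40, r_12=0.0))
```

Once the IVs overlap, each beta drops below its simple correlation because the variance shared between the IVs is no longer credited to either predictor.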
Why is it important that all variables must vary
If the IV doesn’t change, then we cannot make any prediction of the DV
If the DV doesn’t change, then there is no variance to explain, so neither R2 nor beta can be calculated: y always equals the same value regardless of the IV
When they both vary, we can make predictions by calculating R2 and beta and drawing a line of best fit
Do all variables have to be continuous
Yes, and no.
More accurately, the difference between smaller and larger numbers must always be meaningful.
Ideally, variables are measured using an interval or ratio scale, but a good ordinal scale works well enough too.
A 2-option categorical variable (coded as 0/1 preferably) also satisfies this requirement
A value of 1 must indicate more of that category than 0
Can you use categorical variable in correlational analysis
when using a two-category variable as the predictor, the line of best fit simply connects the two group means, and deciding if the slope is significant is the same as asking if the means are significantly different
therefore a significant slope = significant t-test
We cannot use a categorical variable with more than two categories this way because the numbers are assigned arbitrarily: we could change the nature of the relationship just by reordering the categories
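A minimal sketch of why a 0/1 predictor works in a regression: the line of best fit runs through the two group means, so the slope equals the difference between them. The group codes and scores are made up for illustration.

```python
def fit_line(x, y):
    """Return (b, c) for the least-squares line Y = bX + c."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (v - mean_y) for a, v in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    b = sxy / sxx
    return b, mean_y - b * mean_x


group = [0, 0, 0, 0, 1, 1, 1, 1]   # hypothetical 2-option variable coded 0/1
score = [3, 4, 5, 4, 6, 7, 8, 7]   # hypothetical outcome

b, c = fit_line(group, score)
mean_0 = sum(s for g, s in zip(group, score) if g == 0) / group.count(0)
mean_1 = sum(s for g, s in zip(group, score) if g == 1) / group.count(1)
print(b, c, mean_0, mean_1)
```

The intercept is the mean of the group coded 0 and the slope is the gap between the two means, which is the same quantity an independent t-test evaluates.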
How can you define outliers
An influential outlier is a single observation (or perhaps a very small number of observations) that doesn’t match the pattern established by the rest of the sample.
Influential outliers are a problem because they increase the error in every conclusion we want to make
Influential outliers are a concern for all regressions, but in a multiple regression we’re especially worried about multivariate outliers.
These are unusual observations based on combinations of variables
“Unusual” has a fuzzy definition, but should certainly not be more than 10% of your sample
How can we calculate Multivariate Outliers
- Mahalanobis Distance: Looking for unusual combinations of predictors
- Cook’s Distance: Looking for unusual combinations of all variables
Unlike Tolerance and VIF, these two do not always agree.
Why are they both described as distances?
Both try to identify the cluster of normality (what a typical score looks like) within the space defined by the variables, then work out how far any individual observation is from that cluster
Describe Cook’s Distance
This distance tends to find fewer outliers, if you follow traditional conventions. It produces a minimum score of 0 and goes up “to infinity, and beyond!”
Values above 1 are strongly indicative of outliers
Values below 1 can still be considered outliers, if they are both relatively large and very unlike the other distance values
The calculation looks at all variables in your regression model, including the dependent variable
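A minimal sketch of Cook's Distance for a one-predictor regression, using the standard formula based on each case's residual and leverage. The data are made up, with the last row deliberately out of line; with more predictors the leverage calculation needs matrix algebra, which is left to the software.

```python
def cooks_distances(x, y):
    """Cook's Distance for every case in a simple (one-predictor) regression."""
    n = len(x)
    p = 2                                   # parameters estimated: slope and intercept
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (v - mean_y) for a, v in zip(x, y))
    b = sxy / sxx
    c = mean_y - b * mean_x
    residuals = [v - (b * a + c) for a, v in zip(x, y)]
    mse = sum(e ** 2 for e in residuals) / (n - p)
    distances = []
    for a, e in zip(x, residuals):
        leverage = 1 / n + (a - mean_x) ** 2 / sxx   # how unusual the predictor value is
        distances.append((e ** 2 / (p * mse)) * leverage / (1 - leverage) ** 2)
    return distances


# Hypothetical data: the last row has an outcome far above the trend of the rest
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [1.1, 1.9, 3.2, 3.8, 5.1, 6.0, 6.9, 8.2, 8.8, 20.0]
cooks = cooks_distances(x, y)
for row, d in enumerate(cooks, start=1):
    print(row, round(d, 3))
```

Only the last row goes above 1 here; the distance uses both the predictor (through leverage) and the dependent variable (through the residual).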
Describe Mahalanobis Distance
This distance is a multidimensional Z-score based only on the IVs in your model, that also begins with a minimum score of 0 and goes up.
Distance values above but near 1 are quite small here, and not indicative of outliers
The distance values are not directly interpretable; we need to calculate a p-value
The p-values for the distances are calculated by comparing against a χ2 distribution, taking into account the distance itself and the number of predictors in the model (degree of freedom)
We want to exclude p-values below .001 (not .05)
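A minimal sketch of the Mahalanobis calculation for a model with exactly two IVs. Two assumptions are worth flagging: the squared distance is what gets compared against the χ2 distribution here, and the p-value uses a shortcut, exp(-D2/2), that is only valid when the degrees of freedom equal 2 (two predictors). The data are made up, with one unusual combination added as the last row.

```python
import math


def mahalanobis_p_values(x1, x2):
    """Squared Mahalanobis distance and chi-square p-value for each row (two IVs only)."""
    n = len(x1)
    m1, m2 = sum(x1) / n, sum(x2) / n
    s11 = sum((a - m1) ** 2 for a in x1) / (n - 1)                    # variance of IV 1
    s22 = sum((b - m2) ** 2 for b in x2) / (n - 1)                    # variance of IV 2
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2)) / (n - 1)  # covariance
    det = s11 * s22 - s12 ** 2
    results = []
    for a, b in zip(x1, x2):
        d1, d2 = a - m1, b - m2
        d_squared = (s22 * d1 ** 2 - 2 * s12 * d1 * d2 + s11 * d2 ** 2) / det
        p = math.exp(-d_squared / 2)   # chi-square upper tail, valid only for df = 2
        results.append((d_squared, p))
    return results


# Hypothetical IVs that rise together, plus one row (low x1, high x2) that breaks the pattern
x1 = list(range(1, 31)) + [5]
x2 = [i + (i % 3) - 1 for i in range(1, 31)] + [25]

results = mahalanobis_p_values(x1, x2)
flagged = [row for row, (_, p) in enumerate(results) if p < 0.001]
print(flagged)
```

Neither 5 nor 25 is an extreme score on its own; it is the combination that is unusual, which is what makes the last row a multivariate outlier.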
How can you identify influential outliers using Mahalanobis Distance
All you really need are the row numbers, if provided
If no outliers are found, you will just see a message saying no outliers were found.
If there are outliers, you will get a table giving the Row #, Distance, and p-value for each outlier
As p is always < .001, if you want to see the number you will need to change jamovi’s default p-value format
How can you identify influential outliers using Cook’s Distance
Look for values above 1 in the table or examine the plot
The numbers shown above the lines here are row numbers, the height of the line is the Cook’s Distance.
A visual inspection is the easiest way to spot irregularities, where scores are unusual despite not going above 1
When should you exclude influential outliers
You should look at both the Cook’s Distance and Mahalanobis Distance results and exclude rows of data that were deemed outliers by either method.
In theory, you repeat this process until no new outliers are found
What are hierarchical models
We don’t always have clear-cut models in mind for our multiple regressions. Other times we have very specific predictors in mind and want to test them.
Sometimes we want to go exploring, letting parsimony guide us
Sometimes we want to see how certain variables change the result
Name two ways you can build Hierarchical Models
There are two ways to make hierarchical models.
- Build a model up from nothing by adding predictor variables in stages
- Take a complex model and make it simpler by removing predictors
Remember: Every variable you include in a model contributes to the definition of shared variance for that model, affecting the independent contributions of the other variables
The blocks are evaluated separately from each other – only the R2 change is affected by the hierarchy you create
What is the logic of Hierarchical Models
Whether you are adding or removing predictor variables, you want to see what effect that change had.
As an overall effect, you want to see how R2 changed
You also want to know whether that amount of change was significant
If your model produced a significant change, you would then want to look at how the predictor betas were affected by the change in variables
If the change is not significant and we keep the predictor, we are going against parsimony which says that simpler assumptions make better models
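A minimal sketch of the test for a change in R2 between two nested models, using the standard F-change formula. The R2 values, sample size and predictor counts are made up, and the function name is an assumption for illustration.

```python
def r2_change_f(r2_old, r2_new, n, k_old, k_new):
    """F statistic for the change in R2 when predictors are added to a model."""
    k_added = k_new - k_old            # numerator degrees of freedom
    df_resid = n - k_new - 1           # denominator degrees of freedom (larger model)
    delta_r2 = r2_new - r2_old
    f_change = (delta_r2 / k_added) / ((1 - r2_new) / df_resid)
    return delta_r2, f_change, k_added, df_resid


# Hypothetical blocks: 2 predictors (R2 = .20), then a third is added (R2 = .30), N = 103
delta_r2, f_change, df1, df2 = r2_change_f(0.20, 0.30, n=103, k_old=2, k_new=3)
print(round(delta_r2, 2), round(f_change, 2), df1, df2)
```

The p-value for the change comes from the F distribution with those two degrees of freedom; if it is not significant, parsimony favours the smaller model.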
What does the R2 from the VIF and tolerance calculation refer to?
VIF takes each IV in turn, and uses it as a DV that’s predicted by the remaining IVs. The R2 of this regression tells us how much dependence there is, so:
VIF = 1/(1 - R2)
This is not the R2 of the overall model, but the R2 from the new regression in which one predictor is predicted by all the others
What important assumption does the multiple regression model make?
The multiple regression model assumes causality: the predictors (x) are taken to impact the outcome (y)
What happens to your VIF if you add a DV
nothing since it only uses IVs
If you add an IV to your multiple regression analysis, what happens to your variability
you change how the variability in the data are interpreted, and hopefully how much is explained
If our assumption of independent error is violated, what is compromised
our MR model p-value is invalid
If our assumption of homoscedasticity is violated, what is compromised
the p-values of our betas, but not the betas themselves
If our assumption of normally distributed error is violated, what is compromised
the p-values of our betas again, but not the betas themselves
How can you fix the issues around the normally distributed error
use large sample sizes (which increase confidence in the results) or use bootstrapping
Why is sample size especially important in MR
high probability of getting unstable correlation coefficients, big changes across samples. This problem extends to MR, and is amplified because you now have multiple unstable coefficients
What is the difference between R and beta
overall model prediction vs the individual contribution of each predictor
What should be the mean of residuals
0
What is the difference between r and R
R is just like r but it’s meant for more than two variables
Which statistics is used to describe the proportion of variance explained
R2
Which statistics is used to describe the strength of association between predictors and outcomes
R
Which statistics is used to decide whether our model is a good fit
F
How should you look at Mahalanobis Distance
exclude cases with p-values below 0.001; the distances themselves are not directly interpretable
Which unit is used to calculate Mahalanobis Distance
multidimensional Z-scores based only on the IVs of the model
SART: if you just made a mistake, your odds of making another one are
Higher