ANOVA, Correlation, Regression, Multiple Regression, Hierarchical Models, Mediation Flashcards
Describe the basic ANOVA
The basic ANOVA is used when we have a predictor that is categorical, and an outcome that is continuous.
We look at the p-value of the F statistic to determine whether there is a significant difference
If significant, it tells us there is some difference, somewhere in the means of our conditions
We usually follow up with post-hoc tests of some sort
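To make this concrete, here is a minimal Python sketch (the three groups and their scores are made up, and SciPy's f_oneway stands in for jamovi's ANOVA):

```python
# One-way ANOVA on three hypothetical groups: categorical predictor (group),
# continuous outcome (score).
from scipy import stats

group_a = [4.1, 5.0, 4.7, 5.3, 4.9]
group_b = [5.8, 6.1, 5.5, 6.4, 5.9]
group_c = [4.5, 4.8, 5.1, 4.4, 5.0]

# f_oneway returns the F statistic and its p-value
f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
```

A significant p here only says some difference exists somewhere among the three means; it does not say which pair differs.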
What is the F statistic
The F statistic represents the ratio of explained variance to unexplained variance in our model.
F = Goodness of model/Badness of model = Signal/Noise
The F-test can tell whether our multiple regression model is a good one, given the data
What is the model we’re testing in ANOVA?
It’s the suggestion that there is a difference between the categories
So, we’re essentially comparing the evidence in favour of a difference against the evidence that there’s nothing at all going on
How many conditions to do a post-hoc test
When we have at least 3 conditions, we can run a variety of post-hoc tests for a basic ANOVA.
When we have only 2 conditions there’s no need for post-hocs (we’re already basically doing a t-test)
Name 4 post-hoc tests
Least Significant Difference (LSD) (boo!)
same as a t-test
does not adjust the type 1 error
Bonferroni (yikes!)
very strict adjustment
likely to make a type 2 error
Tukey’s Honest Significant Difference (HSD)
not as strict as Bonferroni, but still more restrictive than the LSD test
Dunnett
If we’re only interested in the comparisons against a control:
need to tell Jamovi which one is the control condition so all comparisons are made against it
more power to find significant differences
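As an illustration, a hedged sketch of Tukey's HSD on the same hypothetical groups (assumes SciPy >= 1.8, where stats.tukey_hsd was added; jamovi would normally do this for us):

```python
# Tukey's HSD post-hoc test: all pairwise comparisons, with the type 1 error
# adjusted less harshly than Bonferroni.
from scipy import stats

group_a = [4.1, 5.0, 4.7, 5.3, 4.9]
group_b = [5.8, 6.1, 5.5, 6.4, 5.9]
group_c = [4.5, 4.8, 5.1, 4.4, 5.0]

result = stats.tukey_hsd(group_a, group_b, group_c)
print(result.pvalue)  # matrix of adjusted p-values for each pair of groups
```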
Name 3 basic ANOVAs
One-way ANOVA: Tests a single predictor against any number of outcomes (all tested independently)
Univariate ANOVA: Tests one or more predictors against a single outcome (just called ANOVA in jamovi)
Repeated Measures ANOVA: Tests one or more within-groups predictors against a single outcome
Requires data to be laid out differently than the other two
How many means in t-test vs ANOVAs
t-test is for exactly 2 means
ANOVAs are for 2 or more means
Name two types of averages
The mean and the median are both averages: two different ways to assess what is typical. They are both measures of central tendency
Name 3 types of t-tests
independent = each row is a different person (each person contributes to only one mean)
paired sample = one row provides more than one mean (the same person measured twice)
one sample t-test = one mean compared against zero
is the thing that I found different from zero?
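A small sketch of all three in SciPy (all scores invented for illustration):

```python
from scipy import stats

before = [12.1, 11.4, 13.0, 12.7, 11.9]  # e.g. the same people, time 1
after  = [13.2, 12.0, 13.8, 13.1, 12.5]  # the same people, time 2
others = [10.8, 11.2, 12.4, 11.6, 12.0]  # a separate group of people

print(stats.ttest_ind(before, others))       # independent: each row is a different person
print(stats.ttest_rel(before, after))        # paired: one row provides more than one mean
print(stats.ttest_1samp(before, popmean=0))  # one sample: is this mean different from zero?
```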
Should ANOVA and correlational analysis agree
The ANOVA did not find a significant difference because it was comparing the group means to each other
however, the correlational analysis is not comparing means but instead finding the line of best fit: they do not have to agree since they are based on different hypotheses
Even if there is not a major difference from one year to the next, the correlation can pick up on the change over time
if you have a correlation of r = 0.170, then r2 = .029, the portion of variance that is explained by the predictor (about 2.9%)
How many predictors in correlation versus linear regression
the number of predictors is the only difference: a linear regression can have more than one predictor, whereas a correlation assumes there is only one
What is the different between R from correlation and linear regression in Jamovi
Using the linear regression option in Jamovi, R is calculated by computing R2 first and then 'unsquaring' it. This is important because the resulting value will always be positive, even when the relationship is negative
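A quick sketch of why this matters (hypothetical data where y decreases as x increases):

```python
import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([9.8, 8.1, 6.9, 5.2, 4.0])  # negative relationship

r, p = stats.pearsonr(x, y)
print(r)                # negative r from the correlation analysis
print(np.sqrt(r ** 2))  # "unsquared" R-squared: positive, direction lost
```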
Which betas are standardized vs unstandardized
the beta reported as R is standardized, based on Z-scores
used to describe the strength of the relationship
the beta reported for the predictor is unstandardized
used to draw the line of best fit
Describe correlations
The correlation coefficient (r) is a statistic that represents the strength of the relationship between two continuous variables.
The correlation coefficient gets a p-value attached by working in assumptions about how likely it is to find a relationship as strong as the one we observed by chance alone, given the sample size
In this way, significance is determined just like it is for any of the other statistics we’ve discussed
The relationship being described is always a straight line – or the line of best fit
What is the basic regression model
The regression model is our attempt to use a straight line to represent the relationship between two variables.
The formula for a regression line of best fit can be written out as:
Y = bX + c
The strength of this relationship is the beta coefficient
It describes the slope of the straight line as an unstandardized measure
It’s equivalent to R as a standardized measure, as long as we have only two variables in the model
b = R (only with 2 variables)
c is the intercept, i.e. the point where the line crosses the vertical axis
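A minimal sketch of fitting Y = bX + c (data invented; scipy.stats.linregress used for illustration):

```python
from scipy import stats

x = [1, 2, 3, 4, 5, 6]
y = [2.3, 3.9, 6.1, 7.8, 10.2, 11.9]

fit = stats.linregress(x, y)
print(fit.slope)      # b: unstandardized slope of the line of best fit
print(fit.intercept)  # c: where the line crosses the vertical axis
print(fit.rvalue)     # r: the standardized strength (only 2 variables here)
```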
Describe the two kinds of beta
Unstandardized
Retains the original units of measurement for the variables
Difficult (or, realistically, often impossible) to compare against other betas in multiple regression model
Standardized
Converts variables to Z scores before calculating correlation
Allows for easy comparison against other betas in multiple regression model
e.g. is this predictor stronger than this one
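A sketch of the difference, reusing the same invented data as above: converting both variables to Z scores first makes the slope equal to r.

```python
import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([2.3, 3.9, 6.1, 7.8, 10.2, 11.9])

unstd_beta = stats.linregress(x, y).slope                            # original units
std_beta = stats.linregress(stats.zscore(x), stats.zscore(y)).slope  # Z-scored first
r, _ = stats.pearsonr(x, y)

print(unstd_beta)   # depends on the units of x and y
print(std_beta, r)  # these two match: with one predictor, standardized beta = r
```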
Describe the Multiple Regression Model
The multiple regression model is essentially just a regression where we add more predictors (more bX terms) to the model.
It can be written out as:
Y = b1X1 + b2X2 + c
Or, more completely, as the Field textbook likes to write it:
Yi = (b0 + b1X1i + b2X2i) + ei
You can have as many (bnXn) combinations as you like, to suit your needs
Each one is a different IV within your model
When you add an IV, you change how the variability in the data are interpreted, and hopefully how much is explained
The model also carries an assumption of causality: the predictors (X) are the variables assumed to impact the outcome (Y)
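A sketch of the two-predictor formula using plain least squares (all numbers invented; jamovi's Linear Regression does the same kind of fit):

```python
import numpy as np

x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y  = np.array([3.1, 3.9, 7.2, 7.8, 11.5, 11.0])

# design matrix: a column of 1s for the intercept (b0), then one column per IV
X = np.column_stack([np.ones_like(x1), x1, x2])
betas, *_ = np.linalg.lstsq(X, y, rcond=None)
b0, b1, b2 = betas
print(b0, b1, b2)  # intercept and the two unstandardized betas
```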
Describe the Three Kinds Of Variability
Total Variability
This represents the variability between the observed scores and the most basic statistical model (the mean of all scores)
x - xbar
Residual Sum of Squares
This represents the variability between the observed scores and the line of best fit
In a scatterplot, this is the distance from each observation to the regression line rather than to the mean line
x - line of best fit
Model Sum of Squares
This represents the variability between the overall mean model and the line of best fit
If this value is large, then the regression model is better than the mean model
line of best fit - xbar
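A sketch that computes all three for a simple regression (hypothetical data), showing that SSt = SSm + SSr:

```python
import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([2.3, 3.9, 6.1, 7.8, 10.2, 11.9])

fit = stats.linregress(x, y)
predicted = fit.slope * x + fit.intercept        # the line of best fit

ss_total = np.sum((y - y.mean()) ** 2)           # observed vs the mean model
ss_resid = np.sum((y - predicted) ** 2)          # observed vs the line of best fit
ss_model = np.sum((predicted - y.mean()) ** 2)   # line of best fit vs the mean model

print(ss_total, ss_model + ss_resid)             # the two match
print(ss_model / ss_total)                       # R-squared: proportion of variance explained
```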
How can you calculate the proportion of variance explained
Using the first and last of these variability models together, we can calculate the overall proportion of variance explained by a multiple regression model.
R2 = SSm/SSt
R is just like r but it’s meant for more than two variables
SSt is the Total Variability
SSm is the Model Sum of Squares
we lose the direction of the relationship because squaring the values removes the negative signs
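For example (made-up numbers): if SSm = 30 and SSt = 100, then R2 = 30/100 = .30, so the model explains 30% of the total variability, but we can no longer tell whether each underlying relationship was positive or negative.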
How can you calculate the f statistic of model fit
Using the Residual Sum of Squares and the Model Sum of Squares together, we can calculate the F statistic of model fit.
But two conversions are needed first, from sums of squares (SS) to mean squares (MS):
MSm = SSm / k (where k = number of predictors)
MSr = SSr / (N - k - 1)
F = MSm / MSr
F statistics help us understand if the multiple regression model predicts the relationship between our predictor and the outcome better than the mean model.
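A small numeric sketch with invented values (SSm = 30, SSr = 70, one predictor, N = 20), using the MS formulas above:

```python
from scipy import stats

ss_model, ss_resid = 30.0, 70.0
k, n = 1, 20                       # k predictors, N observations

ms_model = ss_model / k            # MSm = SSm / k
ms_resid = ss_resid / (n - k - 1)  # MSr = SSr / (N - k - 1)
f = ms_model / ms_resid
p = stats.f.sf(f, k, n - k - 1)    # p-value from the F distribution
print(f, p)
```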
What is the p-value of F statistic telling us
The p-value helps us decide whether we can be confident enough in the regression model (relative to chance) to reject the null hypothesis
What can you do from a non-significant R model
A non-significant R model can still point us to ways of simplifying the model, regaining power and recovering otherwise wasted degrees of freedom
Name the first 8 assumptions of multiple regression
- Linearity
- No Perfect Multicollinearity
- Independent Errors
- Homoscedasticity
- Normally Distributed Errors
- No Missing Predictors
- All Variables are Continuous
- All Variables Must Vary
Describe the assumption of linearity
We need to be able to deal with straight lines
In the simplest terms, the outcome variable (DV) should be linearly, not curvilinearly, related to each predictor (IV)
Ideally, the influences of the predictors are additive
That is, they each provide independent prediction of the DV: each adds a unique contribution to the equation that doesn't overlap with the other predictors
Describe the assumption of multicollinearity
Multicollinearity is when your model includes two or more predictors that are highly correlated.
There are two statistics that can help us detect it: Tolerance and VIF
Tolerance: Ranges from 0 – 1, and a tolerance value below .10 is likely a problem
VIF: Starts at 1 and goes up, values above 10 are likely a problem
to fix it, we can remove one of the predictors to reduce the collinearity
Describe VIF
The Variance Inflation Factor is calculated by performing a multiple regression using only the IVs (or only the predictors). The actual DV is not included in the analysis, so VIF is the same regardless of your DV
VIF takes each IV in turn, and uses it as a DV that’s predicted by the remaining IVs. The R2 of this regression tells us how much dependence there is, so:
VIF = 1 / (1 - R2)
In other words, it reflects the proportion of variance in one IV that can be explained by the other IVs
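A hedged sketch of computing VIF and Tolerance by hand with NumPy (the three IVs are simulated; iv2 is deliberately built to be highly correlated with iv1):

```python
import numpy as np

rng = np.random.default_rng(1)
iv1 = rng.normal(size=100)
iv2 = iv1 * 0.9 + rng.normal(scale=0.3, size=100)  # strongly related to iv1
iv3 = rng.normal(size=100)                         # unrelated to the others
ivs = np.column_stack([iv1, iv2, iv3])

for i in range(ivs.shape[1]):
    target = ivs[:, i]                  # this IV takes a turn as the "DV"
    others = np.delete(ivs, i, axis=1)  # the remaining IVs predict it
    X = np.column_stack([np.ones(len(target)), others])
    betas, *_ = np.linalg.lstsq(X, target, rcond=None)
    predicted = X @ betas
    r2 = 1 - np.sum((target - predicted) ** 2) / np.sum((target - target.mean()) ** 2)
    print(f"IV{i + 1}: R2 = {r2:.2f}, VIF = {1 / (1 - r2):.2f}, Tolerance = {1 - r2:.2f}")
```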
Describe Tolerance
Tolerance is the bottom part of the VIF equation:
VIF = 1 / (1 - R2)
Tolerance = 1 - R2
Will always agree with VIF