ANOVA, Correlation, Regression, Multiple Regression, Hierarchical Models, Mediation Flashcards

1
Q

Describe the basic ANOVA

A

The basic ANOVA is used when we have a predictor that is categorical, and an outcome that is continuous.

We look at the p-value of the F statistic to determine whether there is a significant difference

If significant, it tells us there is some difference, somewhere in the means of our conditions
We usually follow up with post-hoc tests of some sort

2
Q

What is the F statistic

A

The F statistic represents the ratio of explained variance to unexplained variance in our model.

F = Goodness of model/Badness of model = Signal/Noise

The F-test can tell whether our multiple regression model is a good one, given the data

3
Q

What is the model we’re testing in ANOVA?

A

It’s the suggestion that there is a difference between the categories
So, we’re essentially comparing the evidence in favour of a difference against the evidence that there’s nothing at all going on

4
Q

How many conditions to do a post-hoc test

A

When we have at least 3 conditions, we can run a variety of post-hoc tests for a basic ANOVA.

When we have only 2 conditions there’s no need for post-hocs (we’re already basically doing a t-test)

5
Q

Name 4 post-hoc tests

A

Least Significant Difference (LSD) (boo!)
same as a t-test
does not adjust the type 1 error

Bonferroni (yikes!)
very strict adjustment
likely to make a type 2 error

Tukey’s Honest Significant Difference (HSD)
not as strict as Bonferroni, but still more restrictive than the LSD test

Dunnett
If we’re only interested in the comparisons against a control:
need to tell jamovi which one is the control condition so all comparisons are made against it
more power to find significant differences

6
Q

Name 3 basic ANOVAs

A

One-way ANOVA: Tests a single predictor against any number of outcomes (all tested independently)

Univariate ANOVA: Tests one or more predictors against a single outcome (just called ANOVA in jamovi)

Repeated Measures ANOVA: Tests one or more within-groups predictors against a single outcome
Requires data to be laid out differently than the other two

7
Q

How many means in t-test vs ANOVAs

A

t-tests compare exactly 2 means

ANOVAs compare 2 or more means

8
Q

Name two types of averages

A

Mean and median are both averages: two different ways to assess what is typical. They are measures of central tendency.

9
Q

Name 3 types of t-tests

A

independent = each row is a different person

paired sample = one row provides more than one mean

one sample t-test = one mean compared to zero
is the thing that I found different from zero?

10
Q

Should ANOVA and correlational analysis agree

A

The ANOVA did not find a significant difference because it was comparing the group means;
the correlational analysis is not comparing means but instead finding the line of best fit. The two do not have to agree, since they are based on different hypotheses.

Even if there is not a major difference from one year to the next, the correlation can pick up on the change over time.

If you have a correlation of 0.170, then r2 (0.170 × 0.170 ≈ 0.029, or about 3%) would be the proportion of variance explained by the variable.

11
Q

How many predictors in correlation versus linear regression

A

The number of predictors is the only difference: a linear regression can have more than one predictor, whereas a correlation assumes there is only one.

12
Q

What is the difference between R from correlation and linear regression in jamovi

A

Using the linear regression option in jamovi, it calculates R2 first and then ‘unsquares’ it. This is important because the resulting value will always be positive, even when the underlying relationship is negative.

13
Q

Which betas are standardized vs unstandardized

A

the beta reported with R is standardized, using Z-scores
used to describe the strength of the relationship

the beta reported for each predictor is unstandardized
used to draw the line of best fit

14
Q

Describe correlations

A

The correlation coefficient (r) is a statistic that represents the strength of the relationship between two continuous variables.

The correlation coefficient gets a p-value attached by making assumptions about how likely it is to find a relationship as strong as the one we observed by chance alone, given the sample size

In this way, significance is determined just like it is for any of the other statistics we’ve discussed

The relationship being described is always a straight line – or the line of best fit

15
Q

What is the basic regression model

A

The regression model is our attempt to use a straight line to represent the relationship between two variables.

The formula for a regression line of best fit can be written out as:
Y = bX + c

The strength of this relationship is the beta coefficient
It describes the slope of the straight line as an unstandardized measure

It’s equivalent to R as a standardized measure, as long as we have only two variables in the model

standardized b = R (only with 2 variables)
c is the intercept, i.e. the point at which the line crosses the vertical axis
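To make the formula concrete, here is a minimal Python sketch (not part of the course materials) that fits Y = bX + c by least squares on made-up data; all values are illustrative.

```python
# A minimal sketch: fitting Y = bX + c by least squares on invented data.
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # predictor
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # outcome

# Classic least-squares formulas:
# b = cov(X, Y) / var(X),  c = mean(Y) - b * mean(X)
b = np.cov(X, Y, ddof=1)[0, 1] / np.var(X, ddof=1)
c = Y.mean() - b * X.mean()
print(f"Y = {b:.3f}X + {c:.3f}")
```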

16
Q

Describe the two kinds of beta

A

Unstandardized
Retains the original units of measurement for the variables
Difficult (or, realistically, often impossible) to compare against other betas in a multiple regression model

Standardized
Converts variables to Z-scores before calculating the correlation
Allows for easy comparison against other betas in a multiple regression model
e.g. is this predictor stronger than that one?
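As a quick check of the distinction, here is a hedged Python sketch (invented data) showing that converting both variables to Z-scores makes the slope equal to the Pearson correlation, which is why the standardized beta equals r in a two-variable model.

```python
# Sketch: the standardized beta from a two-variable regression equals Pearson r.
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

def zscore(v):
    # Convert a variable to Z-scores (mean 0, SD 1)
    return (v - v.mean()) / v.std(ddof=1)

zX, zY = zscore(X), zscore(Y)
beta_std = np.cov(zX, zY, ddof=1)[0, 1] / np.var(zX, ddof=1)  # slope on Z-scores
r = np.corrcoef(X, Y)[0, 1]
print(beta_std, r)   # the two values match
```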

17
Q

Describe the Multiple Regression Model

A

The multiple regression model is essentially just a regression where we add more predictors (more bX terms) to the model.
It can be written out as:
Y = b1X1 + b2X2 + c
Or, more completely, as the Field textbook likes to write it:
Yi = (b0 + b1X1i + b2X2i) + ei

You can have as many (bnXn) combinations as you like, to suit your needs
Each one is a different IV within your model
When you add an IV, you change how the variability in the data are interpreted, and hopefully how much is explained

The multiple regression model makes an assumption of causality: the predictors (X) are treated as impacting the outcome (Y)
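For illustration only, here is a small Python sketch (invented data; jamovi would report the same betas) fitting Y = b1X1 + b2X2 + c by least squares.

```python
# Sketch: a two-predictor multiple regression, Y = b1*X1 + b2*X2 + c.
import numpy as np

rng = np.random.default_rng(0)
X1 = rng.normal(size=50)
X2 = rng.normal(size=50)
Y = 2.0 * X1 - 1.0 * X2 + 0.5 + rng.normal(scale=0.3, size=50)

# Design matrix: one column per IV, plus a column of 1s for the intercept c
design = np.column_stack([X1, X2, np.ones_like(X1)])
(b1, b2, c), *_ = np.linalg.lstsq(design, Y, rcond=None)
print(f"Y = {b1:.2f}*X1 + {b2:.2f}*X2 + {c:.2f}")
```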

18
Q

Describe the Three Kinds Of Variability

A

Total Variability (SSt)
This represents the variability between the observed scores and the most basic statistical model (the overall mean)
observation - mean

Residual Sum of Squares (SSr)
This represents the variability between the observed scores and the line of best fit
Here we're comparing observations against the line of best fit instead of the mean line
observation - line of best fit

Model Sum of Squares (SSm)
This represents the variability between the overall mean model and the line of best fit
If this value is large, then the regression model is better than the mean model
line of best fit - mean

19
Q

How can you calculate the proportion of variance explained

A

Using the first and last of these variability models together, we can calculate the overall proportion of variance explained by a multiple regression model.
R2 = SSm/SSt

R is just like r but it’s meant for more than two variables
SSt is the Total Variability
SSm is the Model Sum of Squares
we lose the direction of the relationship because squaring removes the negative signs
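The decomposition is easy to verify numerically. Below is a hedged sketch (continuing the invented two-predictor example from the earlier block) computing SSt, SSr, and SSm and confirming that R2 = SSm/SSt.

```python
# Sketch: the three sums of squares and R2 = SSm / SSt on invented data.
import numpy as np

rng = np.random.default_rng(0)
X1, X2 = rng.normal(size=50), rng.normal(size=50)
Y = 2.0 * X1 - 1.0 * X2 + 0.5 + rng.normal(scale=0.3, size=50)

design = np.column_stack([X1, X2, np.ones_like(X1)])
coef, *_ = np.linalg.lstsq(design, Y, rcond=None)
Y_hat = design @ coef                      # predictions from the line of best fit

SSt = np.sum((Y - Y.mean()) ** 2)          # total: observations vs mean model
SSr = np.sum((Y - Y_hat) ** 2)             # residual: observations vs best fit
SSm = np.sum((Y_hat - Y.mean()) ** 2)      # model: best fit vs mean model

print(SSm + SSr, SSt)                      # SSm + SSr equals SSt
print("R2 =", SSm / SSt)
```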

20
Q

How can you calculate the f statistic of model fit

A

Using the Residual Sum of Squares and the Model Sum of Squares together, we can calculate the F statistic of model fit.

But two conversions are needed: SS → MS

MSm = SSm / k
MSr = SSr / (N - k - 1)
F = MSm / MSr

F statistics help us understand if the multiple regression model predicts the relationship between our predictor and the outcome better than the mean model.
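Here is a self-contained Python sketch of that calculation, using made-up sums of squares (k = 2 predictors, N = 50 observations); scipy's F distribution supplies the p-value.

```python
# Sketch: F from the mean squares, with illustrative (made-up) inputs.
from scipy import stats

SSm, SSr = 120.0, 30.0     # invented model and residual sums of squares
k, N = 2, 50               # k predictors, N observations

MSm = SSm / k                       # model mean square
MSr = SSr / (N - k - 1)             # residual mean square
F = MSm / MSr                       # signal / noise
p = stats.f.sf(F, k, N - k - 1)     # p-value from the F distribution
print(f"F({k}, {N - k - 1}) = {F:.2f}, p = {p:.3g}")
```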

21
Q

What is the p-value of F statistic telling us

A

The p-value helps us decide whether the regression model is strong enough for us to confidently reject the null hypothesis

22
Q

What can you do from a non-significant R model

A

A non-significant R model can still show us how to simplify the model, regaining power and recovering lost degrees of freedom

23
Q

Name the first 8 assumptions of multiple regression

A
  1. Linearity
  2. No Perfect Multicollinearity
  3. Independent Errors
  4. Homoscedasticity
  5. Normally Distributed Errors
  6. No Missing Predictors
  7. All Variables are Continuous
  8. All Variables Must Vary
24
Q

Describe the assumption of linearity

A

We need to be able to deal with straight lines
In the simplest terms, the outcome variable (DV) should be linearly, not curvilinearly, related to each predictor (IV)

Ideally, the influences of the predictors are additive
That is, each provides an independent prediction of the DV, adding a unique contribution to the equation that doesn’t overlap with the other predictors

25
Q

Describe the assumption of multicollinearity

A

Multicollinearity is when your model includes two or more predictors that are highly correlated.
There are two statistics that can help us detect it: Tolerance and VIF

Tolerance: Ranges from 0 – 1, and a tolerance value below .10 is likely a problem

VIF: Starts at 1 and goes up, values above 10 are likely a problem
to fix it, we must remove one of the predictors to reduce the collinearity

26
Q

Describe VIF

A

The Variance Inflation Factor is calculated by performing a multiple regression using only the IVs (or only the predictors). The actual DV is not included in the analysis, so VIF is the same regardless of your DV

VIF takes each IV in turn, and uses it as a DV that’s predicted by the remaining IVs. The R2 of this regression tells us how much dependence there is, so:
VIF = 1 / (1 - R2)
In other words, the proportion of the variance in one IV that can be explained by the other IVs
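A hedged Python sketch of that procedure (invented data and names): each IV is regressed on the remaining IVs, and VIF = 1/(1 - R2); Tolerance is just its reciprocal.

```python
# Sketch: VIF and Tolerance by regressing each IV on the other IVs.
import numpy as np

rng = np.random.default_rng(1)
iv1 = rng.normal(size=100)
iv2 = 0.9 * iv1 + rng.normal(scale=0.5, size=100)   # deliberately correlated
iv3 = rng.normal(size=100)
ivs = np.column_stack([iv1, iv2, iv3])

def vif(ivs, j):
    # R2 from predicting column j with the remaining IVs (the DV never enters)
    target = ivs[:, j]
    others = np.column_stack([np.delete(ivs, j, axis=1), np.ones(len(target))])
    coef, *_ = np.linalg.lstsq(others, target, rcond=None)
    resid = target - others @ coef
    r2 = 1 - resid.var() / target.var()
    return 1 / (1 - r2)

for j in range(ivs.shape[1]):
    v = vif(ivs, j)
    print(f"IV{j + 1}: VIF = {v:.2f}, Tolerance = {1 / v:.2f}")
```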

27
Q

Describe Tolerance

A

Tolerance is the bottom part of the VIF equation:
VIF = 1 / (1 - R2)
Tolerance = 1 - R2

It will always agree with VIF (Tolerance = 1/VIF)

28
Q

Describe the assumption of Independent Errors

A

For any two observations, the residuals should be uncorrelated. Residuals are the unexplained portion of variance (the leftovers)
In other words, what the model does not explain (the residuals) should be random. If we violate this assumption, then the MR model p-value is invalid

29
Q

In what scenario is the assumption of independent error rarely violated

A

Fortunately, independence is rarely an issue for researchers in psychology, since we work with separate people and it is unlikely that their residuals would correlate

30
Q

When would you use the Durbin-Watson d test

A

To assess whether the assumption of independent errors is violated in time-series data. You can use a statistical test called the Durbin-Watson d test to see if you have a big problem.
This test is called the Autocorrelation test in jamovi
Produces a value varying from 0 – 4, where optimal (no issue) is 2
You should treat values below 1 or above 3 as a cause for concern
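For reference, the d statistic itself is simple to compute; here is a minimal Python sketch on invented residuals (jamovi's Autocorrelation test reports the same quantity).

```python
# Sketch: Durbin-Watson d on a vector of residuals.
import numpy as np

rng = np.random.default_rng(2)
residuals = rng.normal(size=200)   # independent errors, so d should be near 2

# d = sum of squared successive differences / sum of squared residuals
d = np.sum(np.diff(residuals) ** 2) / np.sum(residuals ** 2)
print(f"d = {d:.2f}  (scale 0-4; ~2 means no autocorrelation)")
```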

31
Q

Describe the assumption of homoscedasticity

A

At each level of prediction, the variance of the residuals should be constant.
I.e., the height of a residuals plot should be the same all the way from left to right

If we violate this assumption, the p-values of our betas are invalid. On the bright side, the beta values themselves are still accurate

If you see a funnel shape or bowtie, back away slowly

Our human brains tend to overestimate the spread of the cloud, so a single point off to the side can make it look like a funnel shape when, in reality, the majority of the points are behaving randomly

32
Q

Describe the assumption of normally distributed errors

A

We assume the residuals (unexplained variance) are random, normally distributed, and have an overall mean of 0.

If we violate this assumption, the p-values of our betas are again invalid. You can easily work around this problem by using large sample sizes.
Large sample sizes increase confidence, so it’s easy to trust our results are right regardless.
You can also use bootstrapping to overcome issues (in theory, because jamovi does not offer this option)

In jamovi: Under Assumption Checks, add the Q-Q plot of residuals. Look for big swings away from the 45-degree diagonal
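Outside jamovi, the same check can be sketched in Python with scipy and matplotlib (invented residuals; points should hug the diagonal if the errors are normal).

```python
# Sketch: Q-Q plot of residuals against a normal distribution.
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
residuals = rng.normal(size=200)   # stand-in for a model's residuals

stats.probplot(residuals, dist="norm", plot=plt)   # big swings off the line = trouble
plt.title("Q-Q plot of residuals")
plt.show()
```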

33
Q

Name two ways to have invalid p-values of betas

A

either violating homoscedasticity or violating normally distributed errors

34
Q

Name two additional concerns to be aware of while running multiple regression

A
  1. You need a sufficient sample size
  2. You shouldn’t have outliers or, more accurately, influential cases, because we want our line to represent the majority and not be strongly influenced by a few individual cases
35
Q

Why is sample size important

A

For any correlational analysis it’s important to have a reasonably large sample size.
If you don’t, there’s a high probability of getting unstable correlation coefficients (i.e., big changes across samples)

This problem extends to MR, and is amplified because you now have multiple unstable coefficients
General rule of thumb:
Have a bare minimum of 40 participants per predictor, but get double or triple that if you can

36
Q

Is it possible to have no missing predictors

A

We must accept that almost all of the time we will be missing at least one predictor that could impact the relationship observed

Remember that there are two components to the multiple regression model.
1. The overall model, including all variance shared with the DV (R)
2. The individual predictors, showing the unique shared variance with the DV (beta)
Every time you add or remove a predictor variable you change both components, fundamentally
The impact tends to be most evident on the betas
This is just the nature of partial correlations

37
Q

What is Partial Correlation

A

Imagine variance is a circle
For two variables you have two circles, and the shared variance would be the amount of overlap between the circles
This overlap is also the correlation
Which means it’s the beta too

38
Q

What is the relationship between partial correlations and dependence

A

In multiple regression you have at least three variance circles (e.g. cake icing example)

With three variance circles you also get three shared variances

Two of the shared variances are with the DV
One shared variance is among the IVs

Hiding inside the covariances with the DV is the shared variance among the IVs

39
Q

How can multiple regression help us with partial correlations

A

Multiple regression attempts to assign importance to the different predictors, in the context of all the others by partialling the variance.

Beta: The overlap of the IV and DV is examined after discarding any variance shared with the other IVs
This is just the nature of partial correlations

Through Beta we uncover the unique variance that is shared between the IV and the DV – it is, of course, only part of the total effect the IV has on the DV

When one (standardized) Beta is bigger than another, it has a stronger influence. In this case, orange icing does seem to taste better

40
Q

Why is it important that all variables must vary

A

If the IV doesn’t change, then we cannot make any prediction of the DV

If the DV doesn’t change, then R2 cannot be calculated since there is no beta; Y always takes the same value regardless of the IV

When they both vary, we can make predictions by calculating R2 and beta and drawing a line of best fit

41
Q

Do all variables have to be continuous

A

Yes, and no.

More accurately, the difference between smaller and larger numbers must always be meaningful.
Ideally, variables are measured using an interval or ratio scale, but a good ordinal scale works well enough too.

A 2-option categorical variable (coded as 0/1 preferably) also satisfies this requirement
A value of 1 must indicate more of that category than 0

42
Q

Can you use categorical variable in correlational analysis

A

When using a 2-option categorical predictor, the line of best fit simply connects the two group means, and deciding whether the slope is significant is the same as asking whether the means are significantly different

Therefore a significant slope = a significant t-test (see the sketch below)

We cannot do this with more than two categories, because the numbers are assigned arbitrarily: we can change the nature of the relationship just by interchanging the order of the categories
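This equivalence is easy to demonstrate; the sketch below (invented data) fits a regression on a 0/1-coded predictor and runs an independent t-test, and the two p-values agree.

```python
# Sketch: regression slope on a 0/1 dummy variable vs an independent t-test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
group_a = rng.normal(loc=10.0, size=30)
group_b = rng.normal(loc=11.0, size=30)

x = np.concatenate([np.zeros(30), np.ones(30)])   # 0/1 dummy coding
y = np.concatenate([group_a, group_b])

reg = stats.linregress(x, y)
t = stats.ttest_ind(group_a, group_b)             # equal variances assumed
print(reg.pvalue, t.pvalue)                       # identical p-values
```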

43
Q

How can you define outliers

A

An influential outlier is either a single observation (or perhaps a very small number of observations) that doesn’t match the pattern established by the rest of the sample.

Influential outliers are a problem because they increase the error in every conclusion we want to make

Influential outliers are a concern for all regressions, but in a multiple regression we’re especially worried about multivariate outliers.

These are unusual observations based on combinations of variables
“Unusual” has a fuzzy definition, but should certainly not be more than 10% of your sample

44
Q

How can we calculate Multivariate Outliers

A
  1. Mahalanobis Distance: Looking for unusual combinations of predictors
  2. Cook’s Distance: Looking for unusual combinations of all variables

Unlike Tolerance and VIF, these two do not always agree.
Why are they both described as distances?
Both try to identify the cluster of normality (what a typical score looks like) within the variable space, then measure how far each individual observation sits from that cluster

45
Q

Describe Cook’s Distance

A

This distance tends to find fewer outliers, if you follow traditional conventions. It produces a minimum score of 0 and goes up “to infinity, and beyond!”

Values above 1 are strongly indicative of outliers
Values below 1 can still be considered outliers, if they are both relatively large and very unlike the other distance values
The calculation looks at all variables in your regression model, including the dependent variable
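As a sketch of where these numbers come from (not the course's required calculation, since jamovi does this for you), Cook's Distance can be computed from the residuals and leverages of an ordinary regression; the data below are invented, with one influential case planted.

```python
# Sketch: Cook's Distance via the hat matrix,
# D_i = e_i^2 * h_ii / (p * MSE * (1 - h_ii)^2),
# where p counts all estimated coefficients, including the intercept.
import numpy as np

rng = np.random.default_rng(5)
X1, X2 = rng.normal(size=40), rng.normal(size=40)
Y = X1 + X2 + rng.normal(size=40)
Y[0] += 8.0                                    # plant one influential case

X = np.column_stack([X1, X2, np.ones(40)])     # design matrix with intercept
H = X @ np.linalg.inv(X.T @ X) @ X.T           # hat matrix
h = np.diag(H)                                 # leverage of each observation
e = Y - H @ Y                                  # residuals
p = X.shape[1]
mse = np.sum(e ** 2) / (len(Y) - p)

cooks = e ** 2 * h / (p * mse * (1 - h) ** 2)
print(np.argmax(cooks), cooks.max())           # row 0 should stand out
```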

46
Q

Describe Mahalanobis Distance

A

This distance is a multidimensional Z-score based only on the IVs in your model, that also begins with a minimum score of 0 and goes up.

Distance values above but near 1 are quite small here, and not indicative of outliers
The distance values are not directly interpretable; we need to calculate a p-value

The p-values for the distances are calculated by comparing against a χ2 distribution, taking into account the distance itself and the number of predictors in the model (the degrees of freedom)
We want to exclude p-values below .001 (not .05)
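Here is a hedged Python sketch of that procedure (invented data, one unusual row planted): squared Mahalanobis distances on the IVs only, with p-values from a χ2 distribution whose df is the number of predictors.

```python
# Sketch: Mahalanobis distances on the IVs, flagged at p < .001.
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
ivs = rng.normal(size=(100, 3))                # 100 rows, 3 predictors
ivs[0] = [4.0, -4.0, 4.0]                      # plant an unusual combination

mu = ivs.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(ivs, rowvar=False))
diff = ivs - mu
d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)   # squared distances

pvals = stats.chi2.sf(d2, df=ivs.shape[1])
print("Outlier rows:", np.where(pvals < .001)[0])    # exclusion rule from the card
```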

47
Q

How can you identify influential outliers using Mahalanobis Distance

A

All you really need are the row numbers, if provided
If no outliers are found, you will just see a message saying no outliers were found.
If there are outliers, you will get a table giving the Row #, Distance, and p-value for each outlier
As p is always < .001, if you want to see the number you will need to change jamovi’s default p-value format

48
Q

How can you identify influential outliers using Cook’s Distance

A

Look for values above 1 in the table or examine the plot
The numbers shown above the lines in the plot are row numbers, and the height of each line is that row’s Cook’s Distance.
A visual inspection is the easiest way to spot irregularities, where scores are unusual despite not going above 1

49
Q

When should you exclude influential outliers

A

You should look at both the Cook’s Distance and Mahalanobis Distance results and exclude rows of data that were deemed outliers by either method.
In theory, you repeat this process until no new outliers are found

50
Q

What are hierarchical models

A

We don’t always have clear-cut models in mind for our multiple regressions; other times we have very specific predictors in mind and want to test them.

Sometimes we want to go exploring, letting parsimony guide us
Sometimes we want to see how certain variables change the result

51
Q

Name two ways you can build Hierarchical Models

A

There are two ways to make hierarchical models.

  1. Build a model up from nothing by adding predictor variables in stages
  2. Take a complex model and make it simpler by removing predictors

Remember: Every variable you include in a model contributes to the definition of shared variance for that model, affecting the independent contributions of the other variables

The blocks are evaluated separately from each other – only the R2 change is affected by the hierarchy you create

52
Q

What is the logic of Hierarchical Models

A

Whether you are adding or removing predictor variables, you want to see what effect that change had.

As an overall effect, you want to see how R2 changed
You also want to know whether that amount of change was significant

If your model produced a significant change, you would then want to look at how the predictor betas were affected by the change in variables

If the change is not significant and we keep the predictor, we are going against parsimony, which says that simpler assumptions make better models

53
Q

What does the R2 from the VIF and tolerance calculation refer to?

A

VIF takes each IV in turn, and uses it as a DV that’s predicted by the remaining IVs. The R2 of this regression tells us how much dependence there is, so:
VIF = 1 / (1 - R2)

It is not the R2 of the overall model, but the R2 from this new regression among the predictors themselves

54
Q

What important assumption does the multiple regression model make?

A

The multiple regression model makes an assumption of causality: the predictors (X) are treated as impacting the outcome (Y)

55
Q

What happens to your VIF if you add a DV

A

Nothing, since VIF is calculated using only the IVs

56
Q

If you add an IV to your multiple regression analysis, what happens to your variability

A

you change how the variability in the data are interpreted, and hopefully how much is explained

57
Q

If our assumption of independent error is violated, what is compromised

A

our MR model p-value is invalid

58
Q

If our assumption of homoscedasticity is violated, what is compromised

A

the p-values of our betas, but not the betas themselves

59
Q

If our assumption of normally distributed error is violated, what is compromised

A

The p-values of our betas again, but not the betas themselves

60
Q

How can you fix the issues around the normally distributed error

A

Use large sample sizes, which increase confidence in results, or use bootstrapping (in theory)

61
Q

Why is sample size especially important in MR

A

There is a high probability of getting unstable correlation coefficients (big changes across samples). This problem extends to MR, and is amplified because you now have multiple unstable coefficients

62
Q

What is the difference between R and beta

A

R reflects the overall model’s prediction; beta reflects an individual predictor’s contribution

63
Q

What should be the mean of residuals

A

0

64
Q

What is the difference between r and R

A

R is just like r but it’s meant for more than two variables

65
Q

Which statistic is used to describe the proportion of variance explained

A

R2

66
Q

Which statistic is used to describe the strength of association between predictors and outcomes

A

R

67
Q

Which statistic is used to decide whether our model is a good fit

A

F

68
Q

How should you look at Mahalanobis Distance

A

Exclude p-values below .001; the distances themselves are not directly interpretable

69
Q

Which unit is used to calculate Mahalanobis Distance

A

Multidimensional Z-scores based only on the IVs of the model