ANOVA, Correlation, Regression, Multiple Regression, Hierarchical Models, Mediation Flashcards

1
Q

Describe the basic ANOVA

A

The basic ANOVA is used when we have a predictor that is categorical, and an outcome that is continuous.

We look at the p-value of the F statistic to determine whether there is a significant difference

If significant, it tells us there is some difference, somewhere in the means of our conditions
We usually follow up with post-hoc tests of some sort

2
Q

What is the F statistic

A

The F statistic represents the ratio of explained variance to unexplained variance in our model.

F = Goodness of model/Badness of model = Signal/Noise

The F-test can tell whether our multiple regression model is a good one, given the data

3
Q

What is the model we’re testing in ANOVA?

A

It’s the suggestion that there is a difference between the categories
So, we’re essentially comparing the evidence in favour of a difference against the evidence that there’s nothing at all going on

4
Q

How many conditions to do a post-hoc test

A

When we have at least 3 conditions, we can run a variety of post-hoc tests for a basic ANOVA.

When we have only 2 conditions there’s no need for post-hocs (we’re already basically doing a t-test)

5
Q

Name 4 post-hoc tests

A

Least Significant Difference (LSD) (boo!)
same as a t-test
does not adjust the type 1 error

Bonferroni (yikes!)
very strict adjustment
likely to make a type 2 error

Tukey’s Honest Significant Difference (HSD)
not as strict as Bonferroni, but still more restrictive than the LSD test

Dunnett
If we’re only interested in the comparisons against a control:
need to tell jamovi which one is the control condition so all comparisons are made against it
more power to find significant differences

6
Q

Name 3 basic ANOVAs

A

One-way ANOVA: Tests a single predictor against any number of outcomes (all tested independently)

Univariate ANOVA: Tests one or more predictors against a single outcome (just called ANOVA in jamovi)

Repeated Measures ANOVA: Tests one or more within-groups predictors against a single outcome
Requires data to be laid out differently than the other two

7
Q

How many means in t-test vs ANOVAs

A

t-tests compare exactly 2 means

ANOVAs compare 2 or more means

8
Q

Name two types of averages

A

Mean and median are both averages: two different ways to assess what is typical. They are measures of central tendency.

9
Q

Name 3 types of t-tests

A

independent = each row is a different person

paired sample = one row provides more than one mean

one sample t-test = one mean compared to zero
is the thing that I found different from zero?

10
Q

Should ANOVA and correlational analysis agree

A

The ANOVA did not find a significant difference because it was comparing the group means;
the correlational analysis is not comparing means but instead finding the line of best fit. The two do not have to agree, since they are based on different hypotheses.

Even if there is not a major difference from one year to the next, the correlation can pick up on the change over time.

If you have a correlation of 0.170, then r2 (0.170 × 0.170 ≈ 0.029, or about 3%) would be the proportion of variance explained by the variable.

11
Q

How many predictors in correlation versus linear regression

A

The number of predictors is the only difference: a linear regression can have more than one predictor, whereas a correlation assumes there is only one.

12
Q

What is the difference between R from correlation and linear regression in jamovi

A

Using the linear regression option in jamovi, it calculates R2 first and then ‘unsquares’ it. This is important because the resulting value will always be positive, even when the underlying relationship is negative.

13
Q

Which betas are standardized vs unstandardized

A

the beta reported with R is standardized, using Z-scores
used to describe the strength of the relationship

the beta reported for each predictor is unstandardized
used to draw the line of best fit

14
Q

Describe correlations

A

The correlation coefficient (r) is a statistic that represents the strength of the relationship between two continuous variables.

The correlation coefficient gets a p-value attached by making assumptions about how likely it is to find a relationship as strong as the one we observed by chance alone, given the sample size

In this way, significance is determined just like it is for any of the other statistics we’ve discussed

The relationship being described is always a straight line – or the line of best fit

15
Q

What is the basic regression model

A

The regression model is our attempt to use a straight line to represent the relationship between two variables.

The formula for a regression line of best fit can be written out as:
Y = bX + c

The strength of this relationship is the beta coefficient
It describes the slope of the straight line as an unstandardized measure

It’s equivalent to R as a standardized measure, as long as we have only two variables in the model

standardized b = R (only with 2 variables)
c is the intercept, i.e. the point at which the line crosses the vertical axis
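To make the formula concrete, here is a minimal Python sketch (not part of the course materials) that fits Y = bX + c by least squares on made-up data; all values are illustrative.

```python
# A minimal sketch: fitting Y = bX + c by least squares on invented data.
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # predictor
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # outcome

# Classic least-squares formulas:
# b = cov(X, Y) / var(X),  c = mean(Y) - b * mean(X)
b = np.cov(X, Y, ddof=1)[0, 1] / np.var(X, ddof=1)
c = Y.mean() - b * X.mean()
print(f"Y = {b:.3f}X + {c:.3f}")
```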

16
Q

Describe the two kinds of beta

A

Unstandardized
Retains the original units of measurement for the variables
Difficult (or, realistically, often impossible) to compare against other betas in a multiple regression model

Standardized
Converts variables to Z-scores before calculating the correlation
Allows for easy comparison against other betas in a multiple regression model
e.g. is this predictor stronger than that one?
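As a quick check of the distinction, here is a hedged Python sketch (invented data) showing that converting both variables to Z-scores makes the slope equal to the Pearson correlation, which is why the standardized beta equals r in a two-variable model.

```python
# Sketch: the standardized beta from a two-variable regression equals Pearson r.
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

def zscore(v):
    # Convert a variable to Z-scores (mean 0, SD 1)
    return (v - v.mean()) / v.std(ddof=1)

zX, zY = zscore(X), zscore(Y)
beta_std = np.cov(zX, zY, ddof=1)[0, 1] / np.var(zX, ddof=1)  # slope on Z-scores
r = np.corrcoef(X, Y)[0, 1]
print(beta_std, r)   # the two values match
```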

17
Q

Describe the Multiple Regression Model

A

The multiple regression model is essentially just a regression where we add more predictors (more bX terms) to the model.
It can be written out as:
Y = b1X1 + b2X2 + c
Or, more completely, as the Field textbook likes to write it:
Yi = (b0 + b1X1i + b2X2i) + ei

You can have as many (bnXn) combinations as you like, to suit your needs
Each one is a different IV within your model
When you add an IV, you change how the variability in the data are interpreted, and hopefully how much is explained

The multiple regression model makes an assumption of causality: the predictors (X) are treated as impacting the outcome (Y)
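For illustration only, here is a small Python sketch (invented data; jamovi would report the same betas) fitting Y = b1X1 + b2X2 + c by least squares.

```python
# Sketch: a two-predictor multiple regression, Y = b1*X1 + b2*X2 + c.
import numpy as np

rng = np.random.default_rng(0)
X1 = rng.normal(size=50)
X2 = rng.normal(size=50)
Y = 2.0 * X1 - 1.0 * X2 + 0.5 + rng.normal(scale=0.3, size=50)

# Design matrix: one column per IV, plus a column of 1s for the intercept c
design = np.column_stack([X1, X2, np.ones_like(X1)])
(b1, b2, c), *_ = np.linalg.lstsq(design, Y, rcond=None)
print(f"Y = {b1:.2f}*X1 + {b2:.2f}*X2 + {c:.2f}")
```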

18
Q

Describe the Three Kinds Of Variability

A

Total Variability (SSt)
This represents the variability between the observed scores and the most basic statistical model (the overall mean)
observation - mean

Residual Sum of Squares (SSr)
This represents the variability between the observed scores and the line of best fit
Here we're comparing observations against the line of best fit instead of the mean line
observation - line of best fit

Model Sum of Squares (SSm)
This represents the variability between the overall mean model and the line of best fit
If this value is large, then the regression model is better than the mean model
line of best fit - mean

19
Q

How can you calculate the proportion of variance explained

A

Using the first and last of these variability models together, we can calculate the overall proportion of variance explained by a multiple regression model.
R2 = SSm/SSt

R is just like r but it’s meant for more than two variables
SSt is the Total Variability
SSm is the Model Sum of Squares
we lose the direction of the relationship because squaring removes the negative signs
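The decomposition is easy to verify numerically. Below is a hedged sketch (continuing the invented two-predictor example from the earlier block) computing SSt, SSr, and SSm and confirming that R2 = SSm/SSt.

```python
# Sketch: the three sums of squares and R2 = SSm / SSt on invented data.
import numpy as np

rng = np.random.default_rng(0)
X1, X2 = rng.normal(size=50), rng.normal(size=50)
Y = 2.0 * X1 - 1.0 * X2 + 0.5 + rng.normal(scale=0.3, size=50)

design = np.column_stack([X1, X2, np.ones_like(X1)])
coef, *_ = np.linalg.lstsq(design, Y, rcond=None)
Y_hat = design @ coef                      # predictions from the line of best fit

SSt = np.sum((Y - Y.mean()) ** 2)          # total: observations vs mean model
SSr = np.sum((Y - Y_hat) ** 2)             # residual: observations vs best fit
SSm = np.sum((Y_hat - Y.mean()) ** 2)      # model: best fit vs mean model

print(SSm + SSr, SSt)                      # SSm + SSr equals SSt
print("R2 =", SSm / SSt)
```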

20
Q

How can you calculate the f statistic of model fit

A

Using the Residual Sum of Squares and the Model Sum of Squares together, we can calculate the F statistic of model fit.

But two conversions are needed: SS → MS

MSm = SSm / k
MSr = SSr / (N - k - 1)
F = MSm / MSr

F statistics help us understand if the multiple regression model predicts the relationship between our predictor and the outcome better than the mean model.
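Here is a self-contained Python sketch of that calculation, using made-up sums of squares (k = 2 predictors, N = 50 observations); scipy's F distribution supplies the p-value.

```python
# Sketch: F from the mean squares, with illustrative (made-up) inputs.
from scipy import stats

SSm, SSr = 120.0, 30.0     # invented model and residual sums of squares
k, N = 2, 50               # k predictors, N observations

MSm = SSm / k                       # model mean square
MSr = SSr / (N - k - 1)             # residual mean square
F = MSm / MSr                       # signal / noise
p = stats.f.sf(F, k, N - k - 1)     # p-value from the F distribution
print(f"F({k}, {N - k - 1}) = {F:.2f}, p = {p:.3g}")
```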

21
Q

What is the p-value of F statistic telling us

A

The p-value helps us decide whether the regression model is strong enough for us to confidently reject the null hypothesis

22
Q

What can you do from a non-significant R model

A

A non-significant R model can still show us how to simplify the model, regaining power and recovering lost degrees of freedom

23
Q

Name the first 8 assumptions of multiple regression

A
  1. Linearity
  2. No Perfect Multicollinearity
  3. Independent Errors
  4. Homoscedasticity
  5. Normally Distributed Errors
  6. No Missing Predictors
  7. All Variables are Continuous
  8. All Variables Must Vary
24
Q

Describe the assumption of linearity

A

We need to be able to deal with straight lines
In the simplest terms, the outcome variable (DV) should be linearly, not curvilinearly, related to each predictor (IV)

Ideally, the influences of the predictors are additive
That is, each provides an independent prediction of the DV, adding a unique contribution to the equation that doesn’t overlap with the other predictors

25
Q

Describe the assumption of multicollinearity

A

Multicollinearity is when your model includes two or more predictors that are highly correlated.
There are two statistics that can help us detect it: Tolerance and VIF

Tolerance: Ranges from 0 – 1, and a tolerance value below .10 is likely a problem

VIF: Starts at 1 and goes up, values above 10 are likely a problem
to fix it, we must remove one of the predictors to reduce the collinearity

26
Q

Describe VIF

A

The Variance Inflation Factor is calculated by performing a multiple regression using only the IVs (or only the predictors). The actual DV is not included in the analysis, so VIF is the same regardless of your DV

VIF takes each IV in turn, and uses it as a DV that’s predicted by the remaining IVs. The R2 of this regression tells us how much dependence there is, so:
VIF = 1 / (1 - R2)
In other words, the proportion of the variance in one IV that can be explained by the other IVs
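A hedged Python sketch of that procedure (invented data and names): each IV is regressed on the remaining IVs, and VIF = 1/(1 - R2); Tolerance is just its reciprocal.

```python
# Sketch: VIF and Tolerance by regressing each IV on the other IVs.
import numpy as np

rng = np.random.default_rng(1)
iv1 = rng.normal(size=100)
iv2 = 0.9 * iv1 + rng.normal(scale=0.5, size=100)   # deliberately correlated
iv3 = rng.normal(size=100)
ivs = np.column_stack([iv1, iv2, iv3])

def vif(ivs, j):
    # R2 from predicting column j with the remaining IVs (the DV never enters)
    target = ivs[:, j]
    others = np.column_stack([np.delete(ivs, j, axis=1), np.ones(len(target))])
    coef, *_ = np.linalg.lstsq(others, target, rcond=None)
    resid = target - others @ coef
    r2 = 1 - resid.var() / target.var()
    return 1 / (1 - r2)

for j in range(ivs.shape[1]):
    v = vif(ivs, j)
    print(f"IV{j + 1}: VIF = {v:.2f}, Tolerance = {1 / v:.2f}")
```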

27
Q

Describe Tolerance

A

Tolerance is the bottom part of the VIF equation:
VIF = 1 / (1 - R2)
Tolerance = 1 - R2

It will always agree with VIF (Tolerance = 1/VIF)

28
Q

Describe the assumption of Independent Errors

A

For any two observations, the residuals should be uncorrelated. Residuals are the unexplained portion of variance (the leftovers)
In other words, what the model does not explain (the residuals) should be random. If we violate this assumption, then the MR model p-value is invalid

29
Q

In what scenario is the assumption of independent error rarely violated

A

Fortunately, independence is rarely an issue for researchers in psychology, since we work with separate people and it is unlikely that their residuals would correlate

30
Q

When would you use the Durbin-Watson d test

A

To assess whether the assumption of independent errors is violated in time-series data. You can use a statistical test called the Durbin-Watson d test to see if you have a big problem.
This test is called the Autocorrelation test in jamovi
Produces a value varying from 0 – 4, where optimal (no issue) is 2
You should treat values below 1 or above 3 as a cause for concern
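For reference, the d statistic itself is simple to compute; here is a minimal Python sketch on invented residuals (jamovi's Autocorrelation test reports the same quantity).

```python
# Sketch: Durbin-Watson d on a vector of residuals.
import numpy as np

rng = np.random.default_rng(2)
residuals = rng.normal(size=200)   # independent errors, so d should be near 2

# d = sum of squared successive differences / sum of squared residuals
d = np.sum(np.diff(residuals) ** 2) / np.sum(residuals ** 2)
print(f"d = {d:.2f}  (scale 0-4; ~2 means no autocorrelation)")
```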

31
Q

Describe the assumption of homoscedasticity

A

At each level of prediction, the variance of the residuals should be constant.
I.e., the height of a residuals plot should be the same all the way from left to right

If we violate this assumption, the p-values of our betas are invalid. On the bright side, the beta values themselves are still accurate

If you see a funnel shape or bowtie, back away slowly

Our human brains tend to overestimate the spread of the cloud, so a single point off to the side can make it look like a funnel shape when, in reality, the majority of the points are behaving randomly

32
Q

Describe the assumption of normally distributed errors

A

We assume the residuals (unexplained variance) are random, normally distributed, and have an overall mean of 0.

If we violate this assumption, the p-values of our betas are again invalid. You can easily work around this problem by using large sample sizes.
Large sample sizes increase confidence, so it’s easy to trust our results are right regardless.
You can also use bootstrapping to overcome issues (in theory, because jamovi does not offer this option)

In jamovi: Under Assumption Checks, add the Q-Q plot of residuals. Look for big swings away from the 45-degree diagonal
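Outside jamovi, the same check can be sketched in Python with scipy and matplotlib (invented residuals; points should hug the diagonal if the errors are normal).

```python
# Sketch: Q-Q plot of residuals against a normal distribution.
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
residuals = rng.normal(size=200)   # stand-in for a model's residuals

stats.probplot(residuals, dist="norm", plot=plt)   # big swings off the line = trouble
plt.title("Q-Q plot of residuals")
plt.show()
```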

33
Q

Name two ways to have invalid p-values of betas

A

either violating homoscedasticity or violating normally distributed errors

34
Q

Name two additional concerns to be aware of while running multiple regression

A
  1. You need a sufficient sample size
  2. You shouldn’t have outliers or, more accurately, influential cases, because we want our line to represent the majority and not be strongly influenced by a few individual cases
35
Q

Why is sample size important

A

For any correlational analysis it’s important to have a reasonably large sample size.
If you don’t, there’s a high probability of getting unstable correlation coefficients (i.e., big changes across samples)

This problem extends to MR, and is amplified because you now have multiple unstable coefficients
General rule of thumb:
Have a bare minimum of 40 participants per predictor, but get double or triple that if you can

36
Q

Is it possible to have no missing predictors

A

We must accept that almost all of the time we will be missing at least one predictor that could impact the relationship observed

Remember that there are two components to the multiple regression model.
1. The overall model, including all variance shared with the DV (R)
2. The individual predictors, showing the unique shared variance with the DV (beta)
Every time you add or remove a predictor variable you change both components, fundamentally
The impact tends to be most evident on the betas
This is just the nature of partial correlations

37
Q

What is Partial Correlation

A

Imagine variance is a circle
For two variables you have two circles, and the shared variance would be the amount of overlap between the circles
This overlap is also the correlation
Which means it’s the beta too

38
Q

What is the relationship between partial correlations and dependence

A

In multiple regression you have at least three variance circles (e.g. cake icing example)

With three variance circles you also get three shared variances

Two of the shared variances are with the DV
One shared variance is among the IVs

Hiding inside the covariances with the DV is the shared variance among the IVs

39
Q

How can multiple regression help us with partial correlations

A

Multiple regression attempts to assign importance to the different predictors, in the context of all the others by partialling the variance.

Beta: The overlap of the IV and DV is examined after discarding any variance shared with the other IVs
This is just the nature of partial correlations

Through Beta we uncover the unique variance that is shared between the IV and the DV – it is, of course, only part of the total effect the IV has on the DV

When one (standardized) Beta is bigger than another, it has a stronger influence. In this case, orange icing does seem to taste better

40
Q

Why is it important that all variables must vary

A

If the IV doesn’t change, then we cannot make any prediction of the DV

If the DV doesn’t change, then R2 cannot be calculated since there is no beta; Y always takes the same value regardless of the IV

When they both vary, we can make predictions by calculating R2 and beta and drawing a line of best fit

41
Q

Do all variables have to be continuous

A

Yes, and no.

More accurately, the difference between smaller and larger numbers must always be meaningful.
Ideally, variables are measured using an interval or ratio scale, but a good ordinal scale works well enough too.

A 2-option categorical variable (coded as 0/1 preferably) also satisfies this requirement
A value of 1 must indicate more of that category than 0

42
Q

Can you use categorical variable in correlational analysis

A

When using a 2-option categorical predictor, the line of best fit simply connects the two group means, and deciding whether the slope is significant is the same as asking whether the means are significantly different

Therefore a significant slope = a significant t-test (see the sketch below)

We cannot do this with more than two categories, because the numbers are assigned arbitrarily: we can change the nature of the relationship just by interchanging the order of the categories
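This equivalence is easy to demonstrate; the sketch below (invented data) fits a regression on a 0/1-coded predictor and runs an independent t-test, and the two p-values agree.

```python
# Sketch: regression slope on a 0/1 dummy variable vs an independent t-test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
group_a = rng.normal(loc=10.0, size=30)
group_b = rng.normal(loc=11.0, size=30)

x = np.concatenate([np.zeros(30), np.ones(30)])   # 0/1 dummy coding
y = np.concatenate([group_a, group_b])

reg = stats.linregress(x, y)
t = stats.ttest_ind(group_a, group_b)             # equal variances assumed
print(reg.pvalue, t.pvalue)                       # identical p-values
```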

43
Q

How can you define outliers

A

An influential outlier is either a single observation (or perhaps a very small number of observations) that doesn’t match the pattern established by the rest of the sample.

Influential outliers are a problem because they increase the error in every conclusion we want to make

Influential outliers are a concern for all regressions, but in a multiple regression we’re especially worried about multivariate outliers.

These are unusual observations based on combinations of variables
“Unusual” has a fuzzy definition, but should certainly not be more than 10% of your sample

44
Q

How can we calculate Multivariate Outliers

A
  1. Mahalanobis Distance: Looking for unusual combinations of predictors
  2. Cook’s Distance: Looking for unusual combinations of all variables

Unlike Tolerance and VIF, these two do not always agree.
Why are they both described as distances?
Both try to identify the cluster of normality (what a typical score looks like) within the variable space, then measure how far each individual observation sits from that cluster

45
Q

Describe Cook’s Distance

A

This distance tends to find fewer outliers, if you follow traditional conventions. It produces a minimum score of 0 and goes up “to infinity, and beyond!”

Values above 1 are strongly indicative of outliers
Values below 1 can still be considered outliers, if they are both relatively large and very unlike the other distance values
The calculation looks at all variables in your regression model, including the dependent variable
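As a sketch of where these numbers come from (not the course's required calculation, since jamovi does this for you), Cook's Distance can be computed from the residuals and leverages of an ordinary regression; the data below are invented, with one influential case planted.

```python
# Sketch: Cook's Distance via the hat matrix,
# D_i = e_i^2 * h_ii / (p * MSE * (1 - h_ii)^2),
# where p counts all estimated coefficients, including the intercept.
import numpy as np

rng = np.random.default_rng(5)
X1, X2 = rng.normal(size=40), rng.normal(size=40)
Y = X1 + X2 + rng.normal(size=40)
Y[0] += 8.0                                    # plant one influential case

X = np.column_stack([X1, X2, np.ones(40)])     # design matrix with intercept
H = X @ np.linalg.inv(X.T @ X) @ X.T           # hat matrix
h = np.diag(H)                                 # leverage of each observation
e = Y - H @ Y                                  # residuals
p = X.shape[1]
mse = np.sum(e ** 2) / (len(Y) - p)

cooks = e ** 2 * h / (p * mse * (1 - h) ** 2)
print(np.argmax(cooks), cooks.max())           # row 0 should stand out
```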

46
Q

Describe Mahalanobis Distance

A

This distance is a multidimensional Z-score based only on the IVs in your model, that also begins with a minimum score of 0 and goes up.

Distance values above but near 1 are quite small here, and not indicative of outliers
The distance values are not directly interpretable; we need to calculate a p-value

The p-values for the distances are calculated by comparing against a χ2 distribution, taking into account the distance itself and the number of predictors in the model (the degrees of freedom)
We want to exclude p-values below .001 (not .05)
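Here is a hedged Python sketch of that procedure (invented data, one unusual row planted): squared Mahalanobis distances on the IVs only, with p-values from a χ2 distribution whose df is the number of predictors.

```python
# Sketch: Mahalanobis distances on the IVs, flagged at p < .001.
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
ivs = rng.normal(size=(100, 3))                # 100 rows, 3 predictors
ivs[0] = [4.0, -4.0, 4.0]                      # plant an unusual combination

mu = ivs.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(ivs, rowvar=False))
diff = ivs - mu
d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)   # squared distances

pvals = stats.chi2.sf(d2, df=ivs.shape[1])
print("Outlier rows:", np.where(pvals < .001)[0])    # exclusion rule from the card
```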

47
Q

How can you identify influential outliers using Mahalanobis Distance

A

All you really need are the row numbers, if provided
If no outliers are found, you will just see a message saying no outliers were found.
If there are outliers, you will get a table giving the Row #, Distance, and p-value for each outlier
As p is always < .001, if you want to see the number you will need to change jamovi’s default p-value format

48
Q

How can you identify influential outliers using Cook’s Distance

A

Look for values above 1 in the table or examine the plot
The numbers shown above the lines in the plot are row numbers, and the height of each line is that row’s Cook’s Distance.
A visual inspection is the easiest way to spot irregularities, where scores are unusual despite not going above 1

49
Q

When should you exclude influential outliers

A

You should look at both the Cook’s Distance and Mahalanobis Distance results and exclude rows of data that were deemed outliers by either method.
In theory, you repeat this process until no new outliers are found

50
Q

What are hierarchical models

A

We don’t always have clear-cut models in mind for our multiple regressions; other times we have very specific predictors in mind and want to test them.

Sometimes we want to go exploring, letting parsimony guide us
Sometimes we want to see how certain variables change the result

51
Q

Name two ways you can build Hierarchical Models

A

There are two ways to make hierarchical models.

  1. Build a model up from nothing by adding predictor variables in stages
  2. Take a complex model and make it simpler by removing predictors

Remember: Every variable you include in a model contributes to the definition of shared variance for that model, affecting the independent contributions of the other variables

The blocks are evaluated separately from each other – only the R2 change is affected by the hierarchy you create

52
Q

What is the logic of Hierarchical Models

A

Whether you are adding or removing predictor variables, you want to see what effect that change had.

As an overall effect, you want to see how R2 changed
You also want to know whether that amount of change was significant

If your model produced a significant change, you would then want to look at how the predictor betas were affected by the change in variables

If the change is not significant and we keep the predictor, we are going against parsimony, which says that simpler assumptions make better models

53
Q

What does the R2 from the VIF and tolerance calculation refer to?

A

VIF takes each IV in turn, and uses it as a DV that’s predicted by the remaining IVs. The R2 of this regression tells us how much dependence there is, so:
VIF = 1 / (1 - R2)

It is not the R2 of the overall model, but the R2 from this new regression among the predictors themselves

54
Q

What important assumption does the multiple regression model make?

A

The multiple regression model makes an assumption of causality: the predictors (X) are treated as impacting the outcome (Y)

55
Q

What happens to your VIF if you add a DV

A

Nothing, since VIF is calculated using only the IVs

56
Q

If you add an IV to your multiple regression analysis, what happens to your variability

A

you change how the variability in the data are interpreted, and hopefully how much is explained

57
Q

If our assumption of independent error is violated, what is compromised

A

our MR model p-value is invalid

58
Q

If our assumption of homoscedasticity is violated, what is compromised

A

the p-values of our betas, but not the betas themselves

59
Q

If our assumption of normally distributed error is violated, what is compromised

A

The p-values of our betas again, but not the betas themselves

60
Q

How can you fix the issues around the normally distributed error

A

Use large sample sizes, which increase confidence in results, or use bootstrapping (in theory)

61
Q

Why is sample size especially important in MR

A

There is a high probability of getting unstable correlation coefficients (big changes across samples). This problem extends to MR, and is amplified because you now have multiple unstable coefficients

62
Q

What is the difference between R and beta

A

R reflects the overall model’s prediction; beta reflects an individual predictor’s contribution

63
Q

What should be the mean of residuals

A

0

64
Q

What is the difference between r and R

A

R is just like r but it’s meant for more than two variables

65
Q

Which statistic is used to describe the proportion of variance explained

A

R2

66
Q

Which statistic is used to describe the strength of association between predictors and outcomes

A

R

67
Q

Which statistic is used to decide whether our model is a good fit

A

F

68
Q

How should you look at Mahalanobis Distance

A

Exclude p-values below .001; the distances themselves are not directly interpretable

69
Q

Which unit is used to calculate Mahalanobis Distance

A

Multidimensional Z-scores based only on the IVs of the model