Regression, GLMs and beyond Flashcards by Poppy Aves

What test do we use if we want to consider the relationship between continuous predictor and response variables?

Regression
(Or correlation)

What does the line of best fit in least squares regression do?

Minimises the sum of the squared vertical deviations (residuals) of the datapoints from the line
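
A minimal Python sketch of this idea, using the standard closed-form solution for a single predictor. The function names and the five datapoints are made up for illustration; they do not come from the lectures.

```python
def least_squares_line(xs, ys):
    """Return (intercept, gradient) of the least squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Gradient = sum of cross-deviations / sum of squared x-deviations
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    gradient = sxy / sxx
    # The fitted line always passes through (mean_x, mean_y)
    intercept = mean_y - gradient * mean_x
    return intercept, gradient


def sum_squared_deviations(xs, ys, intercept, gradient):
    """Sum of squared vertical distances from each point to the line."""
    return sum((y - (intercept + gradient * x)) ** 2 for x, y in zip(xs, ys))


# Made-up example data
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

intercept, gradient = least_squares_line(xs, ys)
best = sum_squared_deviations(xs, ys, intercept, gradient)
# Shifting the line away from the fitted one can only increase the sum
shifted = sum_squared_deviations(xs, ys, intercept + 0.5, gradient)
print(intercept, gradient, best, shifted)
```

Any other line through the same data, such as the shifted one here, gives a larger sum of squared deviations.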

Should you just calculate the line of best fit without looking at the data?

No. Many different patterns of data return the same line of best fit, so there is no substitute for plotting the data

What does a correlation coefficient tell us?

Tells you about the strength and direction of the (linear) association between two variables

What is the correlation coefficient symbol?

Rho (ρ) for the population value; the estimate calculated from a sample is usually written r

What values can the correlation coefficient take?

Between -1 and 1

What does a correlation coefficient of 1 tell us?

Our data lies along a perfect straight line with a positive gradient

What does a correlation coefficient of -1 tell us?

Our data lies along a perfect straight line with a negative gradient

Does the correlation coefficient tell us anything about the gradient of the line?

Only its sign (positive or negative). The size of the coefficient tells us how closely the datapoints lie along a line, not how steep that line is
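
A small Python sketch of this point, with made-up data: a steep line and a shallow line both give a coefficient of 1, and a falling line gives -1. The helper name pearson_r is an assumption for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson's correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


xs = [1, 2, 3, 4, 5]
steep = [10 * x for x in xs]        # gradient 10
shallow = [0.1 * x for x in xs]     # gradient 0.1
falling = [-3 * x + 2 for x in xs]  # gradient -3

r_steep = pearson_r(xs, steep)
r_shallow = pearson_r(xs, shallow)
r_falling = pearson_r(xs, falling)
print(r_steep, r_shallow, r_falling)
```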

What is Pearson’s correlation for?

Linear relationships between two continuous variables

Non-parametric equivalent of Pearson’s correlation

Spearman’s rank correlation

Can be used when the relationship is not linear, as long as it is monotonic (consistently increasing or consistently decreasing)
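
A rough Python sketch of the difference, assuming no tied values (ties need average ranks, which this simple ranking does not handle). Spearman's coefficient is Pearson's coefficient calculated on the ranks. The data and helper names are made up for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson's correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """Rank from 1 (smallest) to n (largest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(xs, ys):
    """Spearman's rank correlation: Pearson's r applied to the ranks."""
    return pearson_r(ranks(xs), ranks(ys))


# A curved but always-increasing relationship
xs = [1, 2, 3, 4, 5, 6]
ys = [x ** 3 for x in xs]

pearson = pearson_r(xs, ys)
spearman = spearman_rho(xs, ys)
print(pearson, spearman)
```

Here the rank correlation is perfect because y always increases with x, while Pearson's coefficient falls short of 1 because the relationship is curved.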

General Linear Model for a categorical predictor

Y = A0 + (B1, B2, ..., Bk) + e, where k is the number of levels of the predictor

Y = variable you’re predicting
A0 = constant
B terms = effect of categorical predictor variable
e = error (normally distributed)
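
A minimal Python sketch of one way to read these terms. It assumes the coding R uses by default (treatment contrasts), where the constant is the mean of a reference level and each B term is the difference between another level's mean and that reference. The group names and values are made up.

```python
def mean(values):
    return sum(values) / len(values)


# Made-up response values for three levels of a categorical predictor
groups = {
    "control": [4.0, 5.0, 6.0],
    "drug_a": [7.0, 8.0, 9.0],
    "drug_b": [1.0, 2.0, 3.0],
}

reference = "control"
# A0: the constant, here the mean of the reference level (an assumption)
a0 = mean(groups[reference])
# B terms: how far each other level's mean sits from the reference mean
effects = {
    level: mean(values) - a0
    for level, values in groups.items()
    if level != reference
}


def predict(level):
    """Predicted Y for a level: constant plus that level's effect."""
    return a0 + effects.get(level, 0.0)


print(a0, effects, predict("drug_b"))
```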

General Linear Model for continuous predictors

Y = A0 + A1x1 + A2x2 + e

Y = variable you’re predicting
A0 = constant
A1 = gradient of relationship with predictor variable x1
A2 = gradient of relationship with predictor variable x2
e = error (normally distributed)

General Linear Model for both categorical and continuous predictors

Y = A0 + A1x1 + (B1, B2, ...) + e

Y = variable you’re predicting
A0 = constant
A1 = gradient of relationship with predictor variable x1
B terms = effect of categorical predictor variable
e = error (normally distributed)

What is the test statistic for a GLM?

F ratio

F = treatment mean square / error mean square

(explained variation (signal) / unexplained variation (noise))
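
A short Python sketch of the F ratio for the simplest case, a single categorical predictor (one-way layout). The three groups of values are made up for illustration.

```python
def mean(values):
    return sum(values) / len(values)


# Made-up data: three groups of three observations
groups = [
    [4.0, 5.0, 6.0],
    [7.0, 8.0, 9.0],
    [1.0, 2.0, 3.0],
]

all_values = [value for group in groups for value in group]
grand_mean = mean(all_values)

# Explained variation: how far the group means sit from the grand mean
treatment_ss = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
treatment_df = len(groups) - 1
treatment_ms = treatment_ss / treatment_df

# Unexplained variation: scatter of the observations around their group mean
error_ss = sum((value - mean(g)) ** 2 for g in groups for value in g)
error_df = len(all_values) - len(groups)
error_ms = error_ss / error_df

f_ratio = treatment_ms / error_ms
print(treatment_ms, error_ms, f_ratio)
```

A large F ratio means the signal (differences between groups) is large relative to the noise (scatter within groups).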

What does (Intercept) tell us on the R summary output for a regression model?

The y-intercept of the first group (or the group that isn’t mentioned by name lower down on the summary)

The y-intercept is the number under the estimate column on this row

How to find the intercept for the other line?

Look for the variable name data$variable you are looking for

The number in the estimate column on this row is the difference between the y-intercept of that group's line and the (Intercept) value, so add the two together to get that group's y-intercept

What does the middle row of the R output tell us? (data$lnMass in the lecture slides)

The gradient of the relationship between the two variables

In the lecture slides it is the gradient of the relationship between ln(Mass) and ln(brain size) (body mass and brain size)

When there is no interaction, the gradient is the same for both lines
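
A small Python sketch of how the rows combine. The numbers and the row name data$groupB are hypothetical, not taken from the lecture output; only the (Intercept) and data$lnMass labels follow the cards above.

```python
# Hypothetical estimates, as if copied by hand from an R summary table
estimates = {
    "(Intercept)": 1.20,   # y-intercept of the group not named lower down
    "data$lnMass": 0.75,   # gradient, shared by both lines (no interaction)
    "data$groupB": 0.40,   # hypothetical row: difference in y-intercept
}


def intercept_for(group_row=None):
    """Y-intercept of a group's line; None means the unnamed first group."""
    base = estimates["(Intercept)"]
    if group_row is None:
        return base
    return base + estimates[group_row]


def predict(ln_mass, group_row=None):
    """Predicted response: the group's intercept plus the shared gradient."""
    return intercept_for(group_row) + estimates["data$lnMass"] * ln_mass


intercept_first = intercept_for()
intercept_b = intercept_for("data$groupB")
print(intercept_first, intercept_b, predict(2.0), predict(2.0, "data$groupB"))
```

Because the gradient is shared, the two lines stay the same distance apart at every value of ln(Mass).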

How to calculate the amount of variation in the data that we have explained by the model

R^2 = 1 - (residual sum of squares / total sum of squares)

This is because if we calculate how much of the variation the model hasn’t explained, we can subtract this from 1 to find out how much it has explained

The closer R^2 is to 1, the better the model (as more of the variation has been explained by the model)
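
The calculation can be sketched in Python as follows. The observed values and model predictions are made up for illustration.

```python
def r_squared(observed, predicted):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_observed = sum(observed) / len(observed)
    # Variation the model has not explained
    residual_ss = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    # Total variation in the response around its mean
    total_ss = sum((o - mean_observed) ** 2 for o in observed)
    return 1 - residual_ss / total_ss


# Made-up observed values and the values a fitted model predicts for them
observed = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.1, 1.9, 3.2, 3.8, 5.0]

r2 = r_squared(observed, predicted)
# A model that predicts every point exactly explains all of the variation
r2_perfect = r_squared(observed, observed)
print(r2, r2_perfect)
```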

What is interpolation?

Making predictions within the range of the data that we have

Usually a reasonable thing to do

What is extrapolation?

Making predictions outside of the range of the data that we have

Usually meaningless, because we have no evidence that the relationship holds outside the range we observed; be wary of extrapolation
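
A tiny Python sketch for checking which kind of prediction is being made. The function name and the observed x values are made up for illustration.

```python
def prediction_type(x_new, observed_xs):
    """Classify a prediction by whether x_new lies within the observed range."""
    if min(observed_xs) <= x_new <= max(observed_xs):
        return "interpolation"
    return "extrapolation"


# Made-up predictor values from a dataset
observed_xs = [2.0, 3.5, 5.0, 8.0]

inside = prediction_type(4.0, observed_xs)
outside = prediction_type(12.0, observed_xs)
print(inside, outside)
```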

Simpson’s Paradox

A trend seen in the pooled data can disappear or reverse when the data are split into groups, so combining groups can create a relationship which is not the truth (or hide a relationship)

Including the right explanatory factors can help us to understand this relationship and see if the relationship changes when the variables are grouped
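
A made-up Python example of the paradox: within each group the gradient is positive, but pooling the two groups gives a negative gradient. The data and function name are illustrative only.

```python
def gradient(xs, ys):
    """Least squares gradient of y on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Made-up data: group A has low x and high y, group B has high x and low y
group_a_x, group_a_y = [1, 2, 3], [6, 7, 8]
group_b_x, group_b_y = [6, 7, 8], [1, 2, 3]

gradient_a = gradient(group_a_x, group_a_y)
gradient_b = gradient(group_b_x, group_b_y)
# Ignoring the grouping factor and fitting one line to everything
gradient_pooled = gradient(group_a_x + group_b_x, group_a_y + group_b_y)
print(gradient_a, gradient_b, gradient_pooled)
```

Including the group as an explanatory factor recovers the positive relationship that the pooled line hides.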

What happens if our relationship isn’t linear?

Transform the data to make the relationship linear

Use polynomial terms in the GLM

Have to include the linear term and the polynomial term
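
A Python sketch of the first option, transforming the data. It assumes a power-law relationship y = a * x^b, which becomes a straight line after taking logs of both variables: ln(y) = ln(a) + b * ln(x). The values a = 3 and b = 0.75 are made up.

```python
import math


def least_squares_line(xs, ys):
    """Return (intercept, gradient) of the least squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Made-up curved relationship: y = 3 * x^0.75
xs = [1, 2, 4, 8, 16]
ys = [3 * x ** 0.75 for x in xs]

# After log-transforming both variables the relationship is linear
ln_xs = [math.log(x) for x in xs]
ln_ys = [math.log(y) for y in ys]

intercept, slope = least_squares_line(ln_xs, ln_ys)
# The gradient on the log scale is the exponent b; exp(intercept) recovers a
recovered_a = math.exp(intercept)
print(slope, recovered_a)
```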

What if the relationship is not well described by polynomials?

Generalised linear models allow response variables to have errors which are not normally distributed

Logistic regression

A particular form of generalised linear model which uses a binomial distribution; often used for binary data such as survival (survived or died)
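
A minimal Python sketch of how a logistic regression turns a straight-line predictor into a probability between 0 and 1, assuming the usual logit link. The intercept and gradient are hypothetical, not fitted to any data.

```python
import math


def logistic(linear_predictor):
    """Inverse of the logit link: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-linear_predictor))


def survival_probability(x, intercept, gradient):
    """Predicted probability of the binary outcome at predictor value x."""
    return logistic(intercept + gradient * x)


# Hypothetical coefficients on the logit scale
intercept = -2.0
gradient = 0.5

p_low = survival_probability(0.0, intercept, gradient)
p_mid = survival_probability(4.0, intercept, gradient)   # linear predictor is 0
p_high = survival_probability(10.0, intercept, gradient)
print(p_low, p_mid, p_high)
```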

Fixed effects

An explanatory variable where...
- The level of the explanatory variable is meaningful
- We wish to draw inferences about the effects of that particular level of the explanatory variable on the response variable
- If you repeated the experiment, it is possible to repeat exactly the same levels of that variable

Random effects

An explanatory variable where...
- The level of the explanatory variable is not meaningful (difference between being participant 1 and participant 5)
- We do not need to draw inferences about the effect of a specific level of the explanatory variable on the response variable
- If you repeated the experiment, you wouldn't be able to repeat exactly the same levels of the variable, but you would be able to draw new levels (or individuals) from the same population

What does independence of error mean?

Knowing something about the error associated with one datapoint tells you nothing about the error associated with any other datapoint

What is independence of error an assumption of?

General linear models

Mixed models

Allow us to include both random and fixed explanatory variables in our model

By identifying an explanatory variable as _____ we can fit models which accurately account for the different sources of variation in the dataset

Random

What can we use to decide whether an explanatory factor should be included in the model? (to determine the importance of the different effects in mixed models)

Likelihood ratio tests

What do we mean by likelihood?

The probability of observing our data, given the model

What does one likelihood tell us?

Not a lot, but comparing likelihoods can tell us a lot. The model with the higher likelihood is the one that fits the data better
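
A small Python sketch of comparing likelihoods, using a made-up example: 7 survivors out of 10 individuals, and two candidate models that differ only in the survival probability they assume.

```python
import math


def binomial_likelihood(successes, trials, probability):
    """Probability of observing this many successes, given the model."""
    return (
        math.comb(trials, successes)
        * probability ** successes
        * (1 - probability) ** (trials - successes)
    )


# Made-up data: 7 of 10 individuals survived
survivors, total = 7, 10

# Two candidate models for the survival probability
likelihood_half = binomial_likelihood(survivors, total, 0.5)
likelihood_seventy = binomial_likelihood(survivors, total, 0.7)
print(likelihood_half, likelihood_seventy)
```

Neither number means much alone, but the second model makes the observed data more probable than the first.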

How to determine whether an explanatory factor is important...

1. Fit a model with lmer() containing all the explanatory factors of interest
2. Fit a new model with lmer() leaving out one of the explanatory factors
3. Run a likelihood ratio test to determine if there is a significant difference in likelihood between the two models
4. If removing the factor does make a significant difference to the likelihood of the model, that is evidence that the factor is important
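
The test in step 3 can be sketched in Python once the two log-likelihoods are known. The log-likelihood values below are hypothetical, and the sketch assumes the models differ by exactly one parameter, so the test statistic is compared with a chi-squared distribution on 1 degree of freedom. In R, this comparison is typically run by passing both fitted models to anova().

```python
import math


def likelihood_ratio_test_1df(loglik_full, loglik_reduced):
    """Return (test statistic, p-value) for models differing by one parameter."""
    statistic = 2 * (loglik_full - loglik_reduced)
    # Upper tail of a chi-squared distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Hypothetical log-likelihoods for the full model and the model
# with one explanatory factor left out
loglik_full = -100.0
loglik_reduced = -103.0

statistic, p_value = likelihood_ratio_test_1df(loglik_full, loglik_reduced)
print(statistic, p_value)
```

A small p-value says that leaving the factor out makes the model significantly worse, which is evidence that the factor is important.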

Random intercepts model

Assumes that the relationship between the two factors has the same slope for every person, but the intercept of the line can change according to the person

Random slopes and random intercepts model

Assumes that the relationship between the two factors does not have the same slope for each person and the intercept can also change according to the person
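
A Python sketch of the difference between the two models, written as prediction rules rather than fitted models. The fixed effects and the per-person offsets are hypothetical numbers chosen for illustration.

```python
# Hypothetical fixed effects shared by everyone
fixed_intercept = 10.0
fixed_slope = 2.0

# Hypothetical per-person deviations from the shared values
intercept_offsets = {"p1": -1.0, "p2": 3.0}
slope_offsets = {"p1": 0.5, "p2": -0.5}


def random_intercepts(person, x):
    """Each person has their own intercept but the shared slope."""
    return (fixed_intercept + intercept_offsets[person]) + fixed_slope * x


def random_slopes_and_intercepts(person, x):
    """Each person has their own intercept and their own slope."""
    intercept = fixed_intercept + intercept_offsets[person]
    slope = fixed_slope + slope_offsets[person]
    return intercept + slope * x


# Random intercepts: the lines are parallel, so the gap never changes
gap_at_0 = random_intercepts("p2", 0) - random_intercepts("p1", 0)
gap_at_5 = random_intercepts("p2", 5) - random_intercepts("p1", 5)

# Random slopes too: the lines are not parallel, so the gap depends on x
slopes_gap_at_0 = (
    random_slopes_and_intercepts("p2", 0) - random_slopes_and_intercepts("p1", 0)
)
slopes_gap_at_5 = (
    random_slopes_and_intercepts("p2", 5) - random_slopes_and_intercepts("p1", 5)
)
print(gap_at_0, gap_at_5, slopes_gap_at_0, slopes_gap_at_5)
```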

Argument for using the simpler model...

Sometimes we just want to find the simplest model that explains the data. We can examine whether including the random slopes makes a difference to the model and leave them out if they don't

Argument for using the more complicated model...

If we want to test if something really has an effect on something else, it is better to keep the model maximal as this will give the best representation of real life and should give the "best" answer