Regression, GLMs and beyond Flashcards
What test do we use if we want to consider the relationship between continuous predictor and response variables?
Regression
(Or correlation)
What does the line of best fit in least squares regression do?
Minimises the squared deviations of the datapoints from the line
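The idea can be sketched in plain Python (the course uses R, but the maths is the same): the closed-form slope and intercept below are the values that minimise the sum of squared vertical deviations of the points from the line. The data points are made up.

```python
def least_squares(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = sum((x - mean_x)(y - mean_y)) / sum((x - mean_x)^2)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# points lying exactly on y = 1 + 2x are recovered perfectly
a, b = least_squares([0, 1, 2, 3], [1, 3, 5, 7])
print(a, b)  # 1.0 2.0
```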
Should you just calculate the line of best fit without looking at the data?
No: many different patterns of data can return the same line of best fit (Anscombe's quartet is the classic example), so there is no substitute for plotting the data
What does a correlation coefficient tell us?
The strength and direction of the association between two variables
What is the correlation coefficient symbol?
r for a sample correlation; ρ (rho) usually denotes the population value, and is also the standard symbol for Spearman's rank correlation
What values can the correlation coefficient take?
Between -1 and 1
What does a correlation coefficient of 1 tell us?
Our data lies along a perfect straight line with a positive gradient
What does a correlation coefficient of -1 tell us?
Our data lies along a perfect straight line with a negative gradient
Does the correlation coefficient tell us anything about the gradient of the line?
No (apart from its sign): it tells us how tightly the datapoints lie along a line, not how steep that line is
What is Pearson’s correlation for?
Linear relationships between two continuous variables
Non-parametric equivalent of Pearson’s correlation
Spearman’s rank correlation
Can be used when the relationship is not linear
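Both coefficients can be sketched in plain Python (made-up data): Pearson's r measures how close the points lie to a straight line, while Spearman's coefficient is simply Pearson's r computed on the ranks, so it picks up any monotonic relationship.

```python
from math import sqrt

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def ranks(vs):
    # rank 1 = smallest value (no tie handling in this sketch)
    order = sorted(range(len(vs)), key=lambda i: vs[i])
    r = [0] * len(vs)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(xs, ys):
    return pearson(ranks(xs), ranks(ys))

xs = [1, 2, 3, 4, 5]
ys = [x ** 3 for x in xs]   # monotonic but not linear
print(pearson(xs, ys))      # less than 1: the points do not lie on a line
print(spearman(xs, ys))     # 1.0: the relationship is perfectly monotonic
```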
General Linear Model for a categorical predictor
Y = A0 + (B1, B2, … Bk: one B term per level of the predictor) + e
Y = variable you’re predicting
A0 = constant
B terms = effect of categorical predictor variable
e = error (normally distributed)
General Linear Model for continuous predictors
Y = A0 + A1x1 + A2x2 + e
Y = variable you’re predicting
A0 = constant
A1 = gradient of the relationship with predictor variable x1
A2 = gradient of the relationship with predictor variable x2
e = error (normally distributed)
General Linear Model for both categorical and continuous predictors
Y = A0 + A1x1 + (B1, B2….) + e
Y = variable you’re predicting
A0 = constant
A1 = gradient of the relationship with the continuous predictor variable x1
B terms = effect of categorical predictor variable
e = error
What is the test statistic for a GLM?
F ratio
F = treatment mean square / error mean square
(i.e. explained variation (signal) / unexplained variation (noise))
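The calculation behind the F ratio can be sketched in plain Python with a tiny made-up one-way example (two treatment levels), mirroring F = treatment mean square / error mean square:

```python
groups = [[4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]  # made-up data, two treatment levels

all_vals = [v for g in groups for v in g]
grand_mean = sum(all_vals) / len(all_vals)
group_means = [sum(g) / len(g) for g in groups]

# treatment (between-group) sum of squares and mean square: the signal
ss_treat = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
df_treat = len(groups) - 1
ms_treat = ss_treat / df_treat

# error (within-group) sum of squares and mean square: the noise
ss_error = sum((v - m) ** 2 for g, m in zip(groups, group_means) for v in g)
df_error = len(all_vals) - len(groups)
ms_error = ss_error / df_error

F = ms_treat / ms_error
print(F)  # 13.5 for this data: the signal is much larger than the noise
```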
What does (Intercept) tell us on the R summary output for a regression model?
The y-intercept of the reference (baseline) group, i.e. the group whose name does not appear lower down in the summary
The y-intercept is the number under the estimate column on this row
How to find the intercept for the other line?
Look for the row named after the factor level you are interested in (e.g. data$variable followed by the group name)
The number in the Estimate column on that row is the difference between that group’s y-intercept and the reference group’s y-intercept, so add it to (Intercept) to get the other line’s intercept
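The logic behind those summary rows can be sketched in plain Python (hypothetical data, fitted via the normal equations rather than R's lm()): the categorical predictor is dummy-coded 0/1, so the fitted constant is the reference group's y-intercept and the dummy's coefficient is the difference in intercepts between the other group and the reference group.

```python
def solve(A, b):
    # tiny Gauss-Jordan elimination for the normal equations
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

# two parallel lines: reference group y = 1 + 2x, other group y = 4 + 2x
xs    = [0, 1, 2, 0, 1, 2]
dummy = [0, 0, 0, 1, 1, 1]   # 0 = reference group, 1 = other group
ys    = [1, 3, 5, 4, 6, 8]

X = [[1.0, x, d] for x, d in zip(xs, dummy)]   # columns: constant, slope, group
XtX = [[sum(r[i] * r[j] for r in X) for j in range(3)] for i in range(3)]
Xty = [sum(r[i] * y for r, y in zip(X, ys)) for i in range(3)]
a0, a1, b1 = solve(XtX, Xty)
print(a0, a1, b1)  # approx. 1, 2, 3: reference intercept, shared slope, intercept difference
```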
What does the middle row of the R summary output tell us? (data$lnMass in the lecture slides)
The gradient of the relationship between the two variables
In the lecture slides it is the gradient of the relationship between ln(Mass) and ln(brain size) (body mass and brain size)
When there is no interaction, the gradient is the same for both lines
How to calculate the amount of variation in the data that we have explained by the model
R^2 = 1 - residual sum of squares / total sum of squares
This is because if we calculate how much of the variation the model hasn’t explained, we can subtract this from 1 to find out how much it has explained
The closer R^2 is to 1, the better the model (as more of the variation has been explained by the model)
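The R² calculation is short enough to sketch directly in plain Python (observed and fitted values both made up for illustration):

```python
ys    = [2.0, 4.0, 5.0, 7.0]   # observed values
preds = [2.5, 3.5, 5.5, 6.5]   # the model's fitted values for the same points

mean_y = sum(ys) / len(ys)
rss = sum((y - p) ** 2 for y, p in zip(ys, preds))   # unexplained variation
tss = sum((y - mean_y) ** 2 for y in ys)             # total variation
r_squared = 1 - rss / tss
print(r_squared)  # approx. 0.92: the model explains most of the variation
```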
What is interpolation?
Making predictions within the range of the data that we have
Usually a reasonable thing to do
What is extrapolation?
Making predictions outside of the range of the data that we have
Usually meaningless, be wary of extrapolation
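The danger can be demonstrated with a small made-up example in plain Python: fit a straight line to data that is actually curved, then predict inside and outside the observed range of x.

```python
xs = [1.0, 2.0, 3.0]
ys = [x ** 2 for x in xs]   # the true relationship is quadratic

# least-squares straight-line fit
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
intercept = my - slope * mx
predict = lambda x: intercept + slope * x

print(predict(2.5), 2.5 ** 2)    # interpolation: reasonably close to the truth
print(predict(10.0), 10.0 ** 2)  # extrapolation: wildly wrong
```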
Simpson’s Paradox
When data from several groups are pooled, the overall trend can differ from, or even reverse, the trend within each group, creating an apparent relationship that is not real (or hiding a real one)
Including the right explanatory (grouping) factors helps us see whether the relationship changes when the data are split by group
What happens if our relationship isn’t linear?
Transform the data to make the relationship linear
Use polynomial terms in the GLM
Have to include the lower-order (linear) term as well as the higher-order polynomial term
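A key point worth seeing in code (plain Python sketch, made-up data): a model with an x² term is still linear in the coefficients, so it fits by exactly the same least-squares machinery, with the linear and polynomial terms as two columns of the design matrix.

```python
def solve(A, b):
    # tiny Gauss-Jordan elimination for the normal equations
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 6.0, 17.0, 34.0]          # exactly y = 1 + 2x + 3x^2
X = [[1.0, x, x ** 2] for x in xs]   # constant, linear term, quadratic term
XtX = [[sum(r[i] * r[j] for r in X) for j in range(3)] for i in range(3)]
Xty = [sum(r[i] * y for r, y in zip(X, ys)) for i in range(3)]
a0, a1, a2 = solve(XtX, Xty)
print(a0, a1, a2)  # approx. 1, 2, 3 (up to rounding)
```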
What if the relationship is not well described by polynomials?
Generalised linear models allow response variables to have errors which are not normally distributed
Logistic regression
A particular form of generalised linear model that uses a binomial error distribution; often used for binary data, such as survival (alive/dead)
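A minimal logistic-regression sketch in plain Python (made-up data, fitted by simple gradient ascent on the log-likelihood rather than R's glm()): the response is binary (e.g. died = 0, survived = 1), and the logistic curve keeps every predicted probability between 0 and 1.

```python
from math import exp

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]   # e.g. body size (hypothetical)
ys = [0,   0,   1,   0,   1,   1]     # e.g. survived yes/no

def p(x, a, b):
    return 1.0 / (1.0 + exp(-(a + b * x)))  # logistic curve: always in (0, 1)

a = b = 0.0
for _ in range(20000):   # gradient ascent on the binomial log-likelihood
    da = sum(y - p(x, a, b) for x, y in zip(xs, ys))
    db = sum((y - p(x, a, b)) * x for x, y in zip(xs, ys))
    a += 0.01 * da
    b += 0.01 * db

# predicted survival probability is low for small x, high for large x
print(p(1.0, a, b), p(6.0, a, b))
```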
Fixed effects
An explanatory variable where…
- The level of the explanatory variable is meaningful
- We wish to draw inferences about the effects of that particular level of the explanatory variable on the response variable
- If you repeated the experiment, it is possible to repeat exactly the same levels of that variable
Random effects
An explanatory variable where…
- The level of the explanatory variable is not meaningful (difference between being participant 1 and participant 5)
- We do not need to draw inferences about the effect of a specific level of the explanatory variable on the response variable
- If you repeated the experiment, you wouldn’t be able to repeat exactly the same levels of the variable, but you would be able to draw new levels (or individuals) from the same population
What does independence of error mean?
Knowing something about the error associated with one datapoint tells you nothing about the error associated with any other datapoint
What is independence of error an assumption of?
General linear models
Mixed models
Allow us to include both random and fixed explanatory variables in our model
By identifying an explanatory variable as _____ we can fit models which accurately account for the different sources of variation in the dataset
Random
What can we use to decide whether an explanatory factor should be included in the model? (to determine the importance of the different effects in mixed models)
Likelihood ratio tests
What do we mean by likelihood?
The probability of observing our data, given the model
What does one likelihood tell us?
Not a lot, but comparing likelihoods can tell us a lot
The model with the higher likelihood is the better model (it makes the observed data more probable)
How to determine whether an explanatory factor is important…
- Fit a model with lmer() containing all the explanatory factors of interest
- Fit a new model with lmer() leaving out one of the explanatory factors
- Run a likelihood ratio test to determine if there is a significant difference in likelihood between the two models
- If removing the factor does make a significant difference to the likelihood of the model, that is evidence that the factor is important
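The steps above can be sketched in plain Python with ordinary (non-mixed) nested models instead of lmer(), since the likelihood-ratio logic is the same: for least-squares models with normal errors, 2 × (logLik of full model − logLik of reduced model) works out as n·log(RSS_reduced / RSS_full), and is compared to a chi-squared distribution with df equal to the number of parameters dropped. The data here are made up.

```python
from math import log

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]   # made-up data with a clear slope

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n

# full model: y = a0 + a1*x (least-squares fit)
a1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
a0 = my - a1 * mx
rss_full = sum((y - (a0 + a1 * x)) ** 2 for x, y in zip(xs, ys))

# reduced model: y = a0 only (the slope term left out; best fit is the mean)
rss_reduced = sum((y - my) ** 2 for y in ys)

# likelihood-ratio statistic: 2 * difference in log-likelihoods
lrt = n * log(rss_reduced / rss_full)
print(lrt)  # large here, so dropping the slope significantly worsens the model
```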
Random intercepts model
Assumes that the relationship between the two variables has the same slope for everyone, but that the intercept can differ between individuals (e.g. participants)
Random slopes and random intercepts model
Assumes that both the slope and the intercept of the relationship can differ between individuals
Argument for using the simpler model…
Sometimes we just want to find the simplest model that explains the data
We can examine whether including the random slopes makes a difference to the model and leave it out if it doesn’t
Argument for using the more complicated model…
If we want to test if something really has an effect on something else, it is better to keep it maximal as this will give the best representation of real life and should give the “best” answer