Regression, GLMs and beyond Flashcards
What test do we use to examine the relationship between a continuous predictor variable and a continuous response variable?
Regression
(Or correlation)
What does the line of best fit in least squares regression do?
Minimises the sum of the squared vertical deviations (residuals) of the datapoints from the line
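A minimal Python sketch of what "least squares" means, using made-up numbers. The function names fit_line and sum_squared_residuals are illustrative, not from the lecture. It fits the line and then checks that nudging the line in any direction makes the sum of squared deviations larger.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def sum_squared_residuals(xs, ys, intercept, slope):
    """Sum of squared vertical distances from each point to the line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


# Made-up data, roughly y = 2x
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

intercept, slope = fit_line(xs, ys)
best = sum_squared_residuals(xs, ys, intercept, slope)

# Shifting or tilting the fitted line can only increase the squared deviations
shifted = sum_squared_residuals(xs, ys, intercept + 0.5, slope)
tilted = sum_squared_residuals(xs, ys, intercept, slope + 0.1)
print(intercept, slope, best, shifted, tilted)
```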
Should you just calculate the line of best fit without looking at the data?
No. Many different patterns of data return the same line of best fit, so there is no substitute for plotting the data
What does a correlation coefficient tell us?
The strength and direction of the linear relationship between two variables
What is the correlation coefficient symbol?
Rho (ρ) for the population value; r for the value calculated from a sample
What values can the correlation coefficient take?
Any value from -1 to 1, inclusive
What does a correlation coefficient of 1 tell us?
Our data lies along a perfect straight line with a positive gradient
What does a correlation coefficient of -1 tell us?
Our data lies along a perfect straight line with a negative gradient
Does the correlation coefficient tell us anything about the gradient of the line?
Not about its steepness. The sign gives the direction of the gradient, but the size only tells us how closely the datapoints lie along a straight line
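A small Python sketch of this point, on made-up data: lines with very different gradients can share the same correlation coefficient. The function name pearson_r is illustrative.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


xs = [1, 2, 3, 4, 5]
steep = [10 * x for x in xs]      # gradient 10
shallow = [0.1 * x for x in xs]   # gradient 0.1
falling = [-3 * x for x in xs]    # gradient -3

r_steep = pearson_r(xs, steep)
r_shallow = pearson_r(xs, shallow)
r_falling = pearson_r(xs, falling)
print(r_steep, r_shallow, r_falling)
```

The steep and shallow lines both give a coefficient of 1 (up to floating-point rounding), and the falling line gives -1, whatever its steepness.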
What is Pearson’s correlation for?
Linear relationships between two continuous variables
Non-parametric equivalent of Pearson’s correlation
Spearman’s rank correlation
Can be used when the relationship is monotonic (consistently increasing or consistently decreasing) but not linear
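A hedged Python sketch comparing the two coefficients on a made-up curved relationship (y = x cubed). Spearman's coefficient is computed here as Pearson's coefficient applied to the ranks. The ranks helper assumes there are no tied values, which is a simplification.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def ranks(values):
    """Rank values from 1 (smallest) upwards. Assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(xs, ys):
    """Spearman's rank correlation: Pearson's r calculated on the ranks."""
    return pearson_r(ranks(xs), ranks(ys))


# Made-up data: always increasing, but strongly curved
xs = [1, 2, 3, 4, 5, 6]
ys = [x ** 3 for x in xs]

pearson = pearson_r(xs, ys)
spearman = spearman_rho(xs, ys)
print(pearson, spearman)
```

Because the ordering of the points is perfectly preserved, the rank correlation is 1 even though the points do not lie on a straight line and Pearson's coefficient falls short of 1.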
General Linear Model for a categorical predictor
Y = A0 + (B1, B2, ..., Bk) + e, where k is the number of levels of the predictor
Y = variable you’re predicting
A0 = constant
B terms = effect of categorical predictor variable
e = error (normally distributed)
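A minimal Python sketch of the categorical model on made-up data. It assumes reference-level coding, where the first level is absorbed into the constant and every other level gets a B term measured relative to it. That coding is an assumption here; it matches the way the (Intercept) row is read in an R summary, but other codings exist.

```python
# Made-up measurements for a categorical predictor with three levels
data = {
    "control": [4.0, 5.0, 6.0],
    "low": [6.0, 7.0, 8.0],
    "high": [9.0, 10.0, 11.0],
}


def mean(values):
    return sum(values) / len(values)


levels = list(data)
reference = levels[0]

# A0: the constant, taken here as the mean of the reference level
a0 = mean(data[reference])

# B terms: how far each other level's mean sits from the reference mean
b_terms = {level: mean(data[level]) - a0 for level in levels[1:]}

# Fitted value for an observation = A0 + the B term for its level
fitted = {level: a0 + b_terms.get(level, 0.0) for level in levels}

# e: what is left over for each observation after removing its fitted value
errors = {level: [y - fitted[level] for y in data[level]] for level in levels}
print(a0, b_terms, fitted)
```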
General Linear Model for continuous predictors
Y = A0 + A1x1 + A2x2 + e
Y = variable you’re predicting
A0 = constant
A1 = gradient of the relationship with predictor variable x1
A2 = gradient of the relationship with predictor variable x2
e = error (normally distributed)
General Linear Model for both categorical and continuous predictors
Y = A0 + A1x1 + (B1, B2, ...) + e
Y = variable you’re predicting
A0 = constant
A1 = gradient of the relationship with predictor variable x1
B terms = effect of categorical predictor variable
e = error (normally distributed)
What is the test statistic for a GLM?
F ratio
F = treatment mean square / error mean square
(explained variation (signal) / unexplained variation (noise))
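A worked sketch of the F ratio in Python, assuming the simplest case of a single categorical predictor (a one-way layout) and made-up data. Each mean square is a sum of squares divided by its degrees of freedom.

```python
# Made-up data: three groups of three observations
groups = {
    "a": [4.0, 5.0, 6.0],
    "b": [7.0, 8.0, 9.0],
    "c": [10.0, 11.0, 12.0],
}


def mean(values):
    return sum(values) / len(values)


all_values = [v for values in groups.values() for v in values]
grand_mean = mean(all_values)

# Explained variation: how far each group mean is from the grand mean
treatment_ss = sum(
    len(values) * (mean(values) - grand_mean) ** 2 for values in groups.values()
)
treatment_df = len(groups) - 1

# Unexplained variation: how far each observation is from its own group mean
error_ss = sum(
    (v - mean(values)) ** 2 for values in groups.values() for v in values
)
error_df = len(all_values) - len(groups)

treatment_ms = treatment_ss / treatment_df
error_ms = error_ss / error_df
f_ratio = treatment_ms / error_ms
print(treatment_ms, error_ms, f_ratio)
```

With these numbers the treatment mean square is 27 and the error mean square is 1, so F = 27: the signal is much larger than the noise.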
What does (Intercept) tell us on the R summary output for a regression model?
The y-intercept of the line for the first (reference) group, i.e. the group that isn’t mentioned by name lower down in the summary
The y-intercept is the number in the Estimate column on this row
How do we find the intercept for the other line?
Look for the row labelled with the variable name (data$variable) and the group you are looking for
The number in the Estimate column on this row is the difference between the y-intercept of that group’s line and the y-intercept of the first group’s line; add it to the (Intercept) estimate to get that group’s y-intercept
What does the middle row of the R output tell us? (data$lnMass in the lecture slides)
The gradient of the relationship between the two variables
In the lecture slides it is the gradient of the relationship between ln(Mass) and ln(brain size) (body mass and brain size)
When there is no interaction, the gradient is the same for both lines
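The reading of the three summary rows can be checked on a small made-up example. This is a Python sketch, not R output: the data, the column order and the helper solve_least_squares are all hypothetical. Two groups share a gradient of 2, with y-intercepts of 1 and 4, and there is no interaction term.

```python
def solve_least_squares(rows, ys):
    """Solve the normal equations (X'X) b = X'y by Gauss-Jordan elimination."""
    p = len(rows[0])
    # Augmented matrix [X'X | X'y]
    aug = [
        [sum(r[i] * r[j] for r in rows) for j in range(p)]
        + [sum(r[i] * y for r, y in zip(rows, ys))]
        for i in range(p)
    ]
    for col in range(p):
        # Partial pivoting: bring the largest remaining entry onto the diagonal
        pivot = max(range(col, p), key=lambda k: abs(aug[k][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for k in range(p):
            if k != col:
                factor = aug[k][col] / aug[col][col]
                aug[k] = [a - factor * b for a, b in zip(aug[k], aug[col])]
    return [aug[i][p] / aug[i][i] for i in range(p)]


# Made-up, noise-free data: (group, x, y)
observations = [
    ("first", 1, 3), ("first", 2, 5), ("first", 3, 7),      # y = 1 + 2x
    ("second", 1, 6), ("second", 2, 8), ("second", 3, 10),  # y = 4 + 2x
]

# Columns: constant, continuous predictor, indicator for the second group
rows = [[1, x, 1 if group == "second" else 0] for group, x, _ in observations]
ys = [y for _, _, y in observations]

intercept, gradient, group_difference = solve_least_squares(rows, ys)
second_group_intercept = intercept + group_difference
print(intercept, gradient, group_difference, second_group_intercept)
```

The first coefficient plays the role of (Intercept), the second is the shared gradient, and the third is the difference in y-intercepts, so the second group's intercept is recovered by adding the first and third.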
How do we calculate the amount of variation in the data that the model explains?
R^2 = 1 - (residual sum of squares / total sum of squares)
The ratio is the proportion of the variation that the model hasn’t explained, so subtracting it from 1 gives the proportion that it has explained
The closer R^2 is to 1, the better the model (as more of the variation has been explained by the model)
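A short Python sketch of the R^2 calculation for a straight-line fit, on made-up data. The helper fit_line is illustrative.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Made-up data
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

intercept, slope = fit_line(xs, ys)
mean_y = sum(ys) / len(ys)

# Variation left over after fitting the line
residual_ss = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
# Variation in the response before fitting anything
total_ss = sum((y - mean_y) ** 2 for y in ys)

r_squared = 1 - residual_ss / total_ss
print(residual_ss, total_ss, r_squared)
```

Here the residual sum of squares is 2.4 and the total is 6, so the line explains 60% of the variation.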
What is interpolation?
Making predictions within the range of the data that we have
Usually a reasonable thing to do
What is extrapolation?
Making predictions outside of the range of the data that we have
Often unreliable, because the fitted relationship may not hold outside the observed range; be wary of extrapolation
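A hedged illustration in Python of why extrapolation deserves caution. It assumes a made-up "true" relationship, y = x squared, that is only observed for x between 0 and 3, and fits a straight line to those observations.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def truth(x):
    """Hypothetical true relationship, unknown to the person fitting the line."""
    return x ** 2


# Observations cover only the range 0 to 3
xs = [0, 1, 2, 3]
ys = [truth(x) for x in xs]
intercept, slope = fit_line(xs, ys)


def predict(x):
    return intercept + slope * x


# Interpolation: a prediction inside the observed range
interpolation_error = abs(predict(1.5) - truth(1.5))
# Extrapolation: a prediction far outside the observed range
extrapolation_error = abs(predict(10) - truth(10))
print(interpolation_error, extrapolation_error)
```

Inside the observed range the straight line is off by 1.25; at x = 10 it is off by 71, because nothing in the data constrained the line out there.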
Simpson’s Paradox
Combining groups that differ from one another can create a relationship in the pooled data that is not the truth, or hide or even reverse the relationship that exists within each group
Including the right explanatory factors in the model lets us see whether the relationship changes when the data are grouped
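A made-up Python example of the paradox: within each group the gradient is negative, but pooling the groups gives a positive gradient. The group names and values are invented for illustration.

```python
def slope(xs, ys):
    """Gradient of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Two hypothetical groups, each with a falling relationship
group_a_x, group_a_y = [1, 2, 3], [5, 4, 3]
group_b_x, group_b_y = [6, 7, 8], [10, 9, 8]

slope_a = slope(group_a_x, group_a_y)
slope_b = slope(group_b_x, group_b_y)

# Ignoring the grouping and fitting one line to everything
pooled_slope = slope(group_a_x + group_b_x, group_a_y + group_b_y)
print(slope_a, slope_b, pooled_slope)
```

Group b sits higher on both axes than group a, so the pooled line climbs from one cluster to the other even though the trend inside each cluster is downward.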
What happens if our relationship isn’t linear?
Transform the data to make the relationship linear
Use polynomial terms in the GLM
You have to include the linear term as well as the polynomial term (e.g. both x and x^2)
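A Python sketch of a quadratic fit, where the design includes the constant, the linear term and the squared term together. The data are made up and noise-free (y = 1 + 2x + 3x^2), and solve_least_squares is a hypothetical helper written for this example.

```python
def solve_least_squares(rows, ys):
    """Solve the normal equations (X'X) b = X'y by Gauss-Jordan elimination."""
    p = len(rows[0])
    # Augmented matrix [X'X | X'y]
    aug = [
        [sum(r[i] * r[j] for r in rows) for j in range(p)]
        + [sum(r[i] * y for r, y in zip(rows, ys))]
        for i in range(p)
    ]
    for col in range(p):
        # Partial pivoting: bring the largest remaining entry onto the diagonal
        pivot = max(range(col, p), key=lambda k: abs(aug[k][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for k in range(p):
            if k != col:
                factor = aug[k][col] / aug[col][col]
                aug[k] = [a - factor * b for a, b in zip(aug[k], aug[col])]
    return [aug[i][p] / aug[i][i] for i in range(p)]


# Made-up curved data
xs = [-2, -1, 0, 1, 2]
ys = [1 + 2 * x + 3 * x ** 2 for x in xs]

# One column per term: constant, linear term x, polynomial term x^2
rows = [[1, x, x ** 2] for x in xs]

constant, linear, quadratic = solve_least_squares(rows, ys)
print(constant, linear, quadratic)
```

The model is still linear in its coefficients, which is why the same least-squares machinery applies even though the fitted curve is not a straight line.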
What if the relationship is not well described by polynomials?
Use a generalised linear model, which allows the response variable to have errors that are not normally distributed
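For comparison with the general linear models above, a generalised linear model can be sketched as follows. The link function g and the Poisson example for count data are assumptions added here for illustration; they are not taken from the cards.

```latex
% General form: the linear predictor describes a transformed mean of Y
\[
g\big(\mathrm{E}[Y]\big) = A_0 + A_1 x_1 + A_2 x_2
\]

% Example for count data: log link with Poisson errors
\[
\log\big(\mathrm{E}[Y]\big) = A_0 + A_1 x_1,
\qquad Y \sim \mathrm{Poisson}\big(\mathrm{E}[Y]\big)
\]
```

When g is the identity and the errors are normal, this reduces to the general linear model Y = A0 + A1x1 + A2x2 + e.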