Week 8 - Linear models: regression Flashcards
linear regression
Assumption that relationship between variables is linear
Normal distribution
Numerical response variable
method of least squares
The line that fits the data dots best
Least squares regression line: line for which the sum of all squared deviations in Y is smallest.
formula of a line
Y = a + bX
Slope (b): the rate of change in Y per unit of X. The intercept (a) is the value of Y when X = 0.
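A minimal sketch of how a and b might be computed by least squares. The data values and the function name least_squares are made up for illustration and are not part of the flashcards.

```python
# Hypothetical example data (not from the flashcards)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]


def least_squares(x, y):
    """Return (a, b) for the line Y = a + bX with the smallest
    sum of squared deviations in Y."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sum of products of deviations in X and Y
    sp_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    # Sum of squared deviations in X
    ss_x = sum((xi - x_bar) ** 2 for xi in x)
    b = sp_xy / ss_x          # slope
    a = y_bar - b * x_bar     # intercept: the line passes through (x_bar, y_bar)
    return a, b


a, b = least_squares(x, y)
print(f"Y = {a:.2f} + {b:.2f}X")  # prints "Y = 2.20 + 0.60X"
```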
populations and samples
The regression line from a sample estimates the population regression line, which gives the population mean of Y at each value of X
predicted values
Predicted values: the values of Y on the regression line. Each one estimates the mean of Y for all individuals having a given value of X.
residuals
Residual: the difference between a point's measured Y value and the Y value predicted by the regression line
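As a rough illustration, predicted values and residuals could be computed as below. The data are hypothetical, and the variable names are chosen only for this sketch.

```python
# Hypothetical example data (not from the flashcards)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

# Least squares slope (b) and intercept (a)
x_bar = sum(x) / n
y_bar = sum(y) / n
b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
     / sum((xi - x_bar) ** 2 for xi in x))
a = y_bar - b * x_bar

# Predicted value: the Y on the regression line at each X
predicted = [a + b * xi for xi in x]

# Residual: measured Y minus predicted Y
residuals = [yi - pi for yi, pi in zip(y, predicted)]

for xi, yi, pi, ri in zip(x, y, predicted, residuals):
    print(f"X={xi}  Y={yi}  predicted={pi:.1f}  residual={ri:+.1f}")
```

Residuals from a least squares line sum to zero (up to rounding), which is a quick sanity check on the fit.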
standard error of a slope
Measures the uncertainty of the sample slope as an estimate of the population slope
testing hypotheses about a slope
Evaluate whether the population slope equals the null hypothesised value, which is typically zero
A t statistic is used with n - 2 degrees of freedom
t-test for regression slope
Find the best slope - see if it is positive, negative or zero
Work out the standard error of the slope from the mean square residual
Calculate t statistic and compare it to t distribution
See if p value is significant
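The steps above can be sketched as follows, under stated assumptions: the data are hypothetical, the null slope is zero, and significance is judged against the two-sided 5% critical value of t for 3 degrees of freedom (3.182, taken from a standard t table) rather than by computing an exact p value.

```python
import math

# Hypothetical example data (not from the flashcards)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

# Step 1: best (least squares) slope and intercept
x_bar = sum(x) / n
y_bar = sum(y) / n
ss_x = sum((xi - x_bar) ** 2 for xi in x)
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / ss_x
a = y_bar - b * x_bar

# Step 2: standard error of the slope from the mean square residual
df = n - 2
ss_residual = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
ms_residual = ss_residual / df
se_b = math.sqrt(ms_residual / ss_x)

# Step 3: t statistic for the null hypothesis that the slope is beta_0
beta_0 = 0.0
t_stat = (b - beta_0) / se_b

# Step 4: compare with the t distribution.
# Assumption: two-sided alpha = 0.05, so the critical value for df = 3
# is 3.182 (standard t table). This constant only fits this data size.
T_CRITICAL_DF3 = 3.182
significant = abs(t_stat) > T_CRITICAL_DF3

print(f"b = {b:.2f}, SE = {se_b:.3f}, t = {t_stat:.2f}, df = {df}")
print("significant at 5%" if significant else "not significant at 5%")
```

With a statistics library available, an exact p value from the t distribution would normally replace the hard-coded critical value.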
F statistic or anova approach
F test used instead of t
Null hypothesis is slope is 0
Can be used when the test is two-sided and the null hypothesis slope is zero
Does not mean we are using ANOVA, just the ANOVA table for the F statistic
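A small sketch of the F statistic built from the ANOVA table quantities, using hypothetical data. It also checks that, for a single X variable and a null slope of zero, F equals the square of the t statistic, which is why the two tests agree in that case.

```python
import math

# Hypothetical example data (not from the flashcards)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
ss_x = sum((xi - x_bar) ** 2 for xi in x)
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / ss_x
a = y_bar - b * x_bar

# ANOVA table: split the total variation in Y into regression and residual
ss_total = sum((yi - y_bar) ** 2 for yi in y)
ss_residual = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
ss_regression = ss_total - ss_residual

ms_regression = ss_regression / 1        # 1 degree of freedom for the slope
ms_residual = ss_residual / (n - 2)      # n - 2 residual degrees of freedom

f_stat = ms_regression / ms_residual

# t statistic for the same null hypothesis (slope = 0)
t_stat = b / math.sqrt(ms_residual / ss_x)

print(f"F = {f_stat:.2f} with 1 and {n - 2} df; t squared = {t_stat ** 2:.2f}")
```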
using R squared
Use R squared to measure the fraction of the variation in Y that is explained by X in the linear regression
If R squared is close to one, then X predicts most of the variation in Y
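One way R squared might be computed, shown on hypothetical data. The helper name r_squared is an assumption made for this sketch.

```python
def r_squared(x, y):
    """Fraction of the variation in Y explained by the least squares line."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
    a = y_bar - b * x_bar
    ss_total = sum((yi - y_bar) ** 2 for yi in y)
    ss_residual = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    # Explained variation = total variation minus residual variation
    return (ss_total - ss_residual) / ss_total


# Hypothetical example data (not from the flashcards)
noisy = r_squared([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
perfect = r_squared([1, 2, 3, 4, 5], [3, 5, 7, 9, 11])  # exactly Y = 1 + 2X

print(f"noisy data:   R squared = {noisy:.2f}")
print(f"perfect line: R squared = {perfect:.2f}")
```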
assumptions of regression
At each value of X there is a population of possible Y values whose mean lies on the true regression line
At each of the X values the distribution of Y values is normal
The variance of Y values is the same at all values of X
At each value of X the Y measurements represent a random sample from the population of possible Y values
Linear relationship
Residual plots and QQ plots check assumptions
Extreme residuals (outliers) can signal that the normality or equal-variance assumption is violated
what do we do about violations?
Ignore if not too drastic
Transform variables if necessary (log etc)
Transformations can be linearising
If there are 0s in the data set, add 1 to all data points before taking logs, because log(0) is undefined and those points would otherwise be lost
Square root is good for Y that are counts
Arcsine square root is good for Y that are proportions
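The three transformations listed above could be written as small helpers like these. The function names and sample values are hypothetical, and the natural log is an assumption since the cards do not name a base.

```python
import math


def log_plus_one(values):
    """Log transform that tolerates zeros: ln(Y + 1)."""
    return [math.log(v + 1) for v in values]


def square_root(counts):
    """Square root transform, suggested for Y that are counts."""
    return [math.sqrt(c) for c in counts]


def arcsine_square_root(proportions):
    """Arcsine square root transform, for Y that are proportions in [0, 1]."""
    return [math.asin(math.sqrt(p)) for p in proportions]


# Hypothetical values for illustration
print(log_plus_one([0, 9, 99]))
print(square_root([0, 4, 9]))
print(arcsine_square_root([0.0, 0.25, 1.0]))
```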
transformations
Log transformation is easiest
Power and exponential relationships are also common
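As an illustration of a linearising transformation, an exponential relationship Y = c * exp(kX) becomes a straight line when ln(Y) is regressed on X: ln(Y) = ln(c) + kX. The sketch below uses made-up, noise-free data generated with c = 2 and k = 0.5, so the fit should recover those values. A power relationship Y = c * X^k can be handled the same way by taking logs of both X and Y.

```python
import math

# Hypothetical noise-free data following Y = 2 * exp(0.5 * X)
x = [0, 1, 2, 3, 4]
y = [2 * math.exp(0.5 * xi) for xi in x]

# Log transform Y so the relationship becomes linear in X
log_y = [math.log(yi) for yi in y]

# Ordinary least squares on (X, ln Y)
n = len(x)
x_bar = sum(x) / n
ly_bar = sum(log_y) / n
k = (sum((xi - x_bar) * (li - ly_bar) for xi, li in zip(x, log_y))
     / sum((xi - x_bar) ** 2 for xi in x))
intercept = ly_bar - k * x_bar

# Back-transform the intercept to get c on the original scale
c = math.exp(intercept)

print(f"ln(Y) = {intercept:.3f} + {k:.3f}X, so Y = {c:.3f} * exp({k:.3f}X)")
```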
precision of predictions
A regression line can be used to predict either the mean of Y at a given X or a single Y value at a given X
Predictions of the mean are more precise than predictions of a single data point, because a single Y value also varies around the mean
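A sketch comparing the two kinds of precision, using the standard formulas for simple linear regression: the standard error of a predicted mean is sqrt(MS_residual * (1/n + (x0 - x_bar)^2 / SS_x)), and for a single new point an extra 1 is added inside the bracket. The data, the function name and the chosen x0 are hypothetical.

```python
import math


def prediction_standard_errors(x, y, x0):
    """Return (se_mean, se_single) for predictions at X = x0."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_x = sum((xi - x_bar) ** 2 for xi in x)
    b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / ss_x
    a = y_bar - b * x_bar
    ms_residual = (sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
                   / (n - 2))
    # Uncertainty in where the line itself lies at x0
    line_term = 1 / n + (x0 - x_bar) ** 2 / ss_x
    se_mean = math.sqrt(ms_residual * line_term)
    # A single point adds the scatter of individuals around the line
    se_single = math.sqrt(ms_residual * (1 + line_term))
    return se_mean, se_single


# Hypothetical example data (not from the flashcards)
se_mean, se_single = prediction_standard_errors(
    [1, 2, 3, 4, 5], [2, 4, 5, 4, 5], x0=4
)
print(f"SE of predicted mean: {se_mean:.3f}")
print(f"SE of single prediction: {se_single:.3f}")
```

The second value is always larger than the first, which matches the card: the mean of Y is predicted more precisely than an individual Y.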