Linear Regression Flashcards
simple linear regression, multiple linear regression, model selection, diagnostics
What is regression?
A way to study relationships between variables.
What are the two main reasons we’d use regression?
- description and explanation (genuine interest in the nature of the relationship between variables)
- prediction (using variables to predict others)
What are linear regression models?
- contain explanatory variable(s) which help us explain or predict the behaviour of the response variable
- assume constantly increasing or decreasing relationships between each explanatory variable and the response
What structure does a linear model have?
response = intercept + (slope x explanatory variable) + error
yi = β0 + β1xi + εi
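A minimal sketch of simulating data from this model in Python/NumPy; the parameter values and sample size below are illustrative assumptions, not from the notes:

```python
import numpy as np

# Simulate data from y_i = beta0 + beta1 * x_i + eps_i.
# beta0, beta1, sigma and the sample size are illustrative choices.
rng = np.random.default_rng(1)
beta0, beta1, sigma = 2.0, 0.5, 1.0

x = rng.uniform(0, 10, size=50)        # explanatory variable
eps = rng.normal(0, sigma, size=50)    # Normal errors with zero mean
y = beta0 + beta1 * x + eps            # response
```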
What is the intercept of a linear model?
β0
- value of the response variable when the explanatory variable(s) are 0
- where the regression cuts the vertical axis
What is the slope of a linear model?
β1, gradient of the regression line
What is the error term of a linear model?
εi
- not all data follows the relationship exactly
- εi allows for deviations from it
- normally distributed in the y dimension (zero mean, variance is estimated as part of the fitting process)
What is the Least Square (LS) Criterion?
- can be used to fit the regression
- finds parameters that minimise:
Σ (data - model)^2
What is a residual?
The vertical distance between the observed data and the best fit line.
How is the slope estimated?
β1(hat) = (Σ (xi-x̄) * yi) / (Σ (xi-x̄)^2)
x̄ is the mean explanatory variable
How is the intercept estimated?
β0(hat) = y̅ - (β1(hat) * x̄)
x̄ is the mean explanatory variable
y̅ is the mean of the response
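A short sketch of these two closed-form estimators using NumPy (the function name is illustrative):

```python
import numpy as np

def ls_estimates(x, y):
    """Closed-form least-squares estimates for simple linear regression."""
    xbar, ybar = x.mean(), y.mean()
    beta1_hat = np.sum((x - xbar) * y) / np.sum((x - xbar) ** 2)  # slope
    beta0_hat = ybar - beta1_hat * xbar                           # intercept
    return beta0_hat, beta1_hat
```

Applied to the simulated x and y above, the estimates should land close to the true values used to generate the data.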
How is the variance estimate calculated?
s^2 = (1/(n - k - 1))*Σ (yi - yi(hat))^2
n is number of observations, k is number of slope parameters estimated
How do we work out how much of the total observed variation has been explained?
Work out the proportion of unexplained variation and subtract it from 1:
R^2 = 1 - ((Σ(yi - yi(hat))^2)/(Σ(yi - y̅)^2))
R^2 = 1 - (SSerror/SStotal)
numerator: error (residual) sum of squares
denominator: total sum of squares
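A sketch combining the variance estimate and R^2 for a simple linear regression (one slope parameter, so k = 1); the names are illustrative:

```python
import numpy as np

def fit_summary(x, y, beta0_hat, beta1_hat):
    """Variance estimate s^2 and R^2 for a simple linear regression (k = 1)."""
    n, k = len(y), 1
    y_hat = beta0_hat + beta1_hat * x          # fitted values
    ss_error = np.sum((y - y_hat) ** 2)        # error (residual) sum of squares
    ss_total = np.sum((y - y.mean()) ** 2)     # total sum of squares
    s2 = ss_error / (n - k - 1)                # variance estimate
    r2 = 1 - ss_error / ss_total               # proportion of variation explained
    return s2, r2
```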
What is the definition of the best line?
One that minimises the residual sums-of-squares.
What are the main reasons to use multiple covariates?
- description (interest in finding the relationship between the variables)
- prediction (knowledge of some will help us predict others)
What is added to a simple regression model to make it a multiple regression model?
More explanatory variables (of the form βp*xpi).
What model is used for the noise of a multiple regression model?
Normal distribution, 0 mean, variance σ^2.
What are dummy variables?
- switch on (x=1) or off (x=0) depending on level of the factor variable
- the first level acts as the baseline; the rest switch on when applicable (a factor with n levels needs n-1 dummy variables)
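A minimal sketch of dummy coding using pandas, assuming a hypothetical factor with three levels:

```python
import pandas as pd

# Hypothetical factor with three levels; "A" becomes the baseline and two
# dummy variables ("B" and "C") switch on when the corresponding level applies.
group = pd.Series(["A", "B", "C", "B", "A"])
dummies = pd.get_dummies(group, drop_first=True)
print(dummies)  # columns B and C; all-zero rows correspond to the baseline "A"
```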
What is parameter inference?
In order to make general statements about model parameters we can generate ranges of plausible values for these parameters and test “no-relationship” hypotheses.
What test statistic value is used when calculating the confidence intervals for slope parameters?
t(α/2, df=N-P-1)
N: total number of observations
P: number of explanatory variables fitted in the model
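A sketch of computing such an interval, assuming the slope estimate and its standard error are already available (e.g. from fitted-model output); names are illustrative:

```python
from scipy import stats

def slope_ci(beta_hat, se_beta, n_obs, n_covariates, alpha=0.05):
    """CI for a slope parameter using the t(alpha/2, N - P - 1) quantile."""
    df = n_obs - n_covariates - 1
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    return beta_hat - t_crit * se_beta, beta_hat + t_crit * se_beta
```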
What is the null hypothesis for parameter inference?
H0: βp = 0
H1: βp does not equal 0
What is the equation for the adjusted R^2?
Adjusted R^2 = 1 - ((N - 1)*(1 - R^2))/(N - P - 1)
N: total number of observations
P: number of explanatory variables fitted in the model
R^2: squared correlation
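A direct translation of the adjusted R^2 formula (names illustrative):

```python
def adjusted_r2(r2, n_obs, n_covariates):
    """Adjusted R^2 = 1 - (N - 1)(1 - R^2) / (N - P - 1)."""
    return 1 - (n_obs - 1) * (1 - r2) / (n_obs - n_covariates - 1)
```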
What is the standard error for the prediction on xp (xp any value)?
se(y(hat)) = sqrt(MSE * ((1/n) + (((xp - x̄)^2)/(Σ(xi - x̄)^2))))
MSE: mean square error/residual from ANOVA table
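A sketch of the standard-error formula above for a simple linear regression; MSE is passed in (e.g. taken from an ANOVA table):

```python
import numpy as np

def se_fit(x, mse, xp):
    """Standard error of the prediction at xp for a simple linear regression."""
    n, xbar = len(x), x.mean()
    return np.sqrt(mse * (1 / n + (xp - xbar) ** 2 / np.sum((x - xbar) ** 2)))
```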
Why do we want an appropriate number of covariates in our model? What happens if there are too many/too few, or the model is too simple/complex?
too few: we throw away valuable information
non-essential variables: standard errors and p-values tend to be too large
too simple/complex: the model will have poor predictive ability
What happens when collinear variables are put together in a model?
- model is unstable
- inflated se
What are Variance Inflation Factors (VIFs)?
Detect collinearity.
VIF = 1/(1 - R^2)
R^2: from regressing the covariate of interest on the other covariates
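A sketch of computing a VIF by regressing one column of a covariate matrix on the others with ordinary least squares (names illustrative):

```python
import numpy as np

def vif(X, j):
    """VIF for column j of covariate matrix X: regress x_j on the other columns."""
    yj = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(yj)), others])   # add an intercept column
    coef, *_ = np.linalg.lstsq(A, yj, rcond=None)     # least-squares fit
    y_hat = A @ coef
    r2 = 1 - np.sum((yj - y_hat) ** 2) / np.sum((yj - yj.mean()) ** 2)
    return 1 / (1 - r2)
```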
How should variables be removed?
One at a time.
How does p-value based model selection work?
- for covariates with a single associated coefficient, retention can be based on the associated p-value (large p-values suggest omission)
What type of regression models does the F-test work on? What can we use for other models?
Nested models. Can use AIC or BIC on both nested and non-nested models.
What is Akaike’s Information Criterion (AIC)?
The smaller the AIC value, the better the model.
AIC = -2*log-likelihood value + 2P
P: number of est. parameters
log-likelihood: calculated using the est. parameters in the model
What is AICc?
Used when the sample size isn't much larger than the number of parameters in the model.
AICc = AIC + (2P(P + 1))/(N - P - 1)
When N » P, AICc → AIC.
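A direct translation of the AIC and AICc formulas, assuming the maximised log-likelihood is already available:

```python
def aic(log_lik, n_params):
    """AIC = -2 * log-likelihood + 2P."""
    return -2 * log_lik + 2 * n_params

def aicc(log_lik, n_params, n_obs):
    """Small-sample correction: AICc = AIC + 2P(P + 1) / (N - P - 1)."""
    return aic(log_lik, n_params) + 2 * n_params * (n_params + 1) / (n_obs - n_params - 1)
```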
What is BIC?
Differs from AIC by employing a penalty that changes with the sample size (N).
BIC = -2log-likelihood value + log(N)P
What values of BIC represent a better model?
Smaller BIC values.
How are AIC weights calculated?
Δi(AIC) = AICi - minimum AIC
wi(AIC) = exp{-1/2 Δi(AIC)} / (Σk exp{-1/2 Δk(AIC)})
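A sketch of the Akaike-weight calculation for a set of candidate models:

```python
import numpy as np

def aic_weights(aic_values):
    """Akaike weights for a set of candidate models' AIC values."""
    aic_values = np.asarray(aic_values, dtype=float)
    delta = aic_values - aic_values.min()    # Delta_i(AIC)
    rel_lik = np.exp(-0.5 * delta)           # relative likelihoods
    return rel_lik / rel_lik.sum()           # weights sum to 1
```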
What is interaction?
- Similar to 'synergy' in chemistry. A non-additive effect (e.g. A = +10, B = +20, A+B = -10)
- if the interaction term is significant, the p-values associated with the main effects are irrelevant
- interactions should always come last in the sequence of predictors
What values can R^2 take?
Between 0 and 1.
What assumptions do we make about the errors of a linear model?
We assume one Normal distribution provides the (independent) noise.
How do we assess Normality?
- qualitative assessment from plotting (histogram of residuals, QQ-norm plot)
- formal test of normality (Shapiro-Wilk)
What do QQ-norm plots tell us? And how are they formed?
- plot quantiles of two sets of data against one another
- shapes are similar -> get straight line (y=x) -> data normally dist.
- residuals in ascending order, standardised (divide by sd), plotted against normal dist.
How is the ith point on a QQ-norm plot found?
p(i) = i/(n+1)
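A sketch of building the QQ-norm plot coordinates described above, using the plotting positions p(i) = i/(n+1):

```python
import numpy as np
from scipy import stats

def qq_points(residuals):
    """Coordinates for a QQ-norm plot using plotting positions p_i = i / (n + 1)."""
    r = np.sort(residuals) / np.std(residuals, ddof=1)  # ascending, standardised
    n = len(r)
    p = np.arange(1, n + 1) / (n + 1)                   # plotting positions
    theoretical = stats.norm.ppf(p)                     # Normal quantiles
    return theoretical, r                               # plot r against theoretical
```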
What is the Shapiro-Wilk test?
- tests for normality
- H0: data is normally dist.
What is the Breusch-Pagan test?
- a formal test of the constant error variance (homoscedasticity) assumption; a model which satisfies the assumption would produce a residuals-vs-fitted plot with a horizontal trend line
How do we assess independence?
- Durbin-Watson test (H0: uncorrelated errors)
- independence can be violated in ways that cannot be tested (e.g. pseudoreplication)
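A sketch of running the three diagnostic tests from these cards (Shapiro-Wilk, Breusch-Pagan, Durbin-Watson) with SciPy and statsmodels; the simulated data is purely illustrative:

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.stattools import durbin_watson

# Purely illustrative data; in practice use your own response and covariates.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 2 + 0.5 * x + rng.normal(0, 1, 100)

fit = sm.OLS(y, sm.add_constant(x)).fit()

sw_stat, sw_p = stats.shapiro(fit.resid)                         # H0: residuals are Normal
bp_lm, bp_p, _, _ = het_breuschpagan(fit.resid, fit.model.exog)  # H0: constant error variance
dw = durbin_watson(fit.resid)                                    # values near 2 suggest uncorrelated errors
```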
How can we tell which variable in the signal causes non-linearity?
Use partial (residual) plots. These are found by adding the estimated relationship (for the pth predictor, βp*xpi) to the residuals (ri) of the model.
When do we bootstrap (for linear regression models)?
- the distribution of the residuals is poor (clearly non-Normal)
- we are reasonably happy with the signal model
- independence isn't an issue
What values can correlation take?
The correlation coefficient (r) can take values between -1 and 1 (values of -1 and 1 correspond to perfect straight-line relationships).
How is the significance of r calculated?
t = r*sqrt(n - 2) / sqrt(1 - r^2)
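A one-line translation of this test statistic:

```python
import numpy as np

def correlation_t(r, n):
    """t statistic for testing H0: no correlation, with n - 2 degrees of freedom."""
    return r * np.sqrt(n - 2) / np.sqrt(1 - r ** 2)
```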
Causality implies correlation. True/False?
True, but not the other way around (correlation does not imply causation).