Linear Regression Flashcards
simple linear regression, multiple linear regression, model selection, diagnostics
What is regression?
A way to study relationships between variables.
What are the two main reasons we’d use regression?
- description and explanation (genuine interest in the nature of the relationship between variables)
- prediction (using variables to predict others)
What are linear regression models?
- contain explanatory variable(s) which help us explain or predict the behaviour of the response variable
- assume constantly increasing or decreasing relationships between each explanatory variable and the response
What structure does a linear model have?
response = intercept + (slope x explanatory variable) + error
yi = β0 + β1xi + εi
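The structure above can be sketched by simulating data from the model; the parameter values and sample size here are illustrative assumptions, not from the cards:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed illustrative parameter values
beta0, beta1 = 2.0, 0.5                 # intercept and slope
x = np.linspace(0, 10, 50)              # explanatory variable
eps = rng.normal(0, 1.0, size=x.size)   # error term: zero-mean normal noise

# response = intercept + (slope x explanatory variable) + error
y = beta0 + beta1 * x + eps
```

Each yi deviates from the straight line beta0 + beta1*xi only through its error term.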
What is the intercept of a linear model?
β0
- the expected value of the response when the explanatory variables are 0
- where the regression cuts the vertical axis
What is the slope of a linear model?
β1, gradient of the regression line
What is the error term of a linear model?
εi
- not all data follows the relationship exactly
- εi allows for deviations
- normally distributed in the y dimension (zero mean, variance is estimated as part of the fitting process)
What is the Least Square (LS) Criterion?
- can be used to fit the regression
- finds parameters that minimise:
Σ (data - model)^2
What is a residual?
The vertical distance between the observed data and the best fit line.
How is the slope estimated?
β1(hat) = (Σ (xi-x̄) * yi) / (Σ (xi-x̄)^2)
x̄ is the mean explanatory variable
How is the intercept estimated?
β0(hat) = y̅ - (β1(hat) * x̄)
x̄ is the mean explanatory variable
y̅ is the mean of the response
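The slope and intercept formulas above translate directly into code; the toy data here is an assumption for illustration:

```python
import numpy as np

# Toy data (assumed for illustration)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

xbar, ybar = x.mean(), y.mean()

# Slope: beta1_hat = sum((xi - xbar) * yi) / sum((xi - xbar)^2)
beta1_hat = np.sum((x - xbar) * y) / np.sum((x - xbar) ** 2)

# Intercept: beta0_hat = ybar - beta1_hat * xbar
beta0_hat = ybar - beta1_hat * xbar
```

For this data the estimates match what `np.polyfit(x, y, 1)` would return, since both minimise the same least-squares criterion.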
How is the variance estimate calculated?
s^2 = (1/(n - k - 1))*Σ (yi - yi(hat))^2
n is number of observations, k is number of slope parameters estimated
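Continuing with the same assumed toy data, the variance estimate divides the residual sum of squares by n - k - 1 (here k = 1, one slope parameter):

```python
import numpy as np

# Toy data (assumed for illustration)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Least-squares fit (simple regression, so k = 1 slope parameter)
beta1_hat = np.sum((x - x.mean()) * y) / np.sum((x - x.mean()) ** 2)
beta0_hat = y.mean() - beta1_hat * x.mean()
y_hat = beta0_hat + beta1_hat * x   # fitted values yi(hat)

n, k = len(y), 1
# s^2 = (1 / (n - k - 1)) * sum((yi - yi(hat))^2)
s2 = np.sum((y - y_hat) ** 2) / (n - k - 1)
```

The divisor n - k - 1 accounts for the parameters already estimated (intercept plus k slopes), which is why it is smaller than n.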
How do we work out how much of the total observed variation has been explained?
Work out the proportion of unexplained variation and subtract it from 1:
R^2 = 1 - ((Σ(yi - yi(hat))^2)/(Σ(yi - y̅)^2))
R^2 = 1 - (SSerror/SStotal)
numerator: residual (error) sum of squares
denominator: total sum of squares
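The R^2 formula can be checked numerically; the data below is the same assumed toy example used for the slope and intercept:

```python
import numpy as np

# Toy data (assumed for illustration)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Least-squares fit
beta1_hat = np.sum((x - x.mean()) * y) / np.sum((x - x.mean()) ** 2)
beta0_hat = y.mean() - beta1_hat * x.mean()
y_hat = beta0_hat + beta1_hat * x

ss_error = np.sum((y - y_hat) ** 2)       # unexplained (residual) variation
ss_total = np.sum((y - y.mean()) ** 2)    # total variation about the mean
r_squared = 1 - ss_error / ss_total       # R^2 = 1 - SSerror/SStotal
```

A value near 1 means the line explains almost all of the observed variation; near 0 means it explains almost none.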
What is the definition of the best line?
One that minimises the residual sums-of-squares.
What are the main reasons to use multiple covariates?
- description (interest in finding the relationships between such variables)
- prediction (knowledge of some will help us predict others)
What is added to a simple regression model to make it a multiple regression model?
More explanatory variables (of the form βp*xpi).
What model is used for the noise of a multiple regression model?
Normal distribution, 0 mean, variance σ^2.
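A multiple regression with two covariates can be fitted by least squares using a design matrix; the data-generating values here are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

# Two explanatory variables (assumed illustrative data)
n = 100
x1 = rng.uniform(0, 10, n)
x2 = rng.uniform(0, 5, n)
eps = rng.normal(0, 1.0, n)          # noise: normal, zero mean, variance 1
y = 1.0 + 2.0 * x1 - 0.5 * x2 + eps  # true betas: 1.0, 2.0, -0.5

# Design matrix: a column of ones for the intercept, then each covariate
X = np.column_stack([np.ones(n), x1, x2])

# Least-squares estimates of (beta0, beta1, beta2)
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
```

With enough data the estimates land close to the true parameters used to simulate the response.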
What are dummy variables?
- switch on (x=1) or off (x=0) depending on level of the factor variable
- the first level of the factor acts as the baseline; each remaining level gets its own dummy that switches on when applicable (n-1 dummy variables for a factor with n levels)
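Dummy coding can be sketched as follows; the factor variable and its levels are an assumed example:

```python
import numpy as np

# Factor variable with three levels (assumed example)
season = np.array(["spring", "summer", "winter", "summer", "spring"])

# The first level ("spring") acts as the baseline and gets no column;
# each remaining level gets a dummy that is 1 when that level applies, else 0
levels = ["spring", "summer", "winter"]
dummies = np.column_stack([(season == lvl).astype(int) for lvl in levels[1:]])
```

A baseline observation has all dummies switched off, so its effect is absorbed into the intercept; this is also what `pandas.get_dummies(..., drop_first=True)` produces.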
What is parameter inference?
In order to make general statements about model parameters we can generate ranges of plausible values for these parameters and test “no-relationship” hypotheses.