QM LM10 Simple Linear Regression Flashcards
How do you calculate line of best fit?
- Minimise the sum of the squared vertical distances (residuals) between the observations and the line
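A minimal sketch of that calculation (Python with NumPy; the data values are made up for illustration), using the closed-form least-squares formulas for the slope and intercept:

```python
import numpy as np

# Illustrative data (not from the flashcards)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # independent variable
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # dependent variable

# Least-squares estimates: these minimise the sum of squared vertical
# distances (residuals) between the observations and the line.
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()

print(f"line of best fit: y ≈ {intercept:.2f} + {slope:.2f}x")
```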
What does a linear regression assume?
A linear relationship between the dependent and independent variables
What is SST?
- Total sum of squares: the squared distances between each observation of y and the mean of y (i.e., a line at the mean with a slope of 0), totalled across all observations
- It measures the total variation in the dependent variable
- Thinking of the vertical distance from a line to an observation as an error, SST can be split into the part explained by the regression (SSR) and the unexplained part (SSE)
What is the residual error term?
The portion of the dependent variable that cannot be explained by the independent variable (when using a line of best fit)
Y = intercept + (slope coefficient × X) + error term
The intercept and slope coefficient are the regression coefficients (the error term is not)
Y is regressed on x
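A hedged sketch of that model (Python/NumPy, with made-up coefficient values): the observed y is the line b0 + b1·x plus a random error term.

```python
import numpy as np

rng = np.random.default_rng(0)

b0, b1 = 1.5, 0.8                        # regression coefficients (illustrative values)
x = np.linspace(0.0, 10.0, 50)           # independent variable
error = rng.normal(0.0, 0.5, x.size)     # error term: the part of y not explained by x

y = b0 + b1 * x + error                  # the simple linear regression model
# "y is regressed on x" means estimating b0 and b1 from the observed (x, y) pairs.
```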
What is the residual?
The difference between the observed value of y and the value predicted by the fitted line (observed y minus fitted y).
- Or, the portion of y (dependent variable) that cannot be explained by x (independent variable)
When does our value of y equal the intercept?
When x = 0
However, this only makes sense if x = 0 is a meaningful value for the independent variable
For example, in a regression of height on age, age = 0 falls outside any sensible range for the data, so the intercept has no practical interpretation regardless of its value
What are the 4 assumptions behind using a simple linear regression to find a relationship?
- Linearity: the relationship between x and y is linear (e.g. not curved)
- Homoscedasticity: the variance of the error terms is the same for all observations (e.g. not different variances at different times)
- Independence: the (x, y) pairs are independent of each other; one pair should not depend on the next
- Normality: the error terms are normally distributed
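A rough, illustrative sketch (Python/NumPy; simulated data and crude statistics rather than formal tests) of how those four assumptions might be eyeballed from the residuals:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0.0, 10.0, 100)
y = 2.0 + 0.5 * x + rng.normal(0.0, 1.0, x.size)     # simulated data

slope, intercept = np.polyfit(x, y, 1)               # fit the line of best fit
resid = y - (intercept + slope * x)                  # residuals

half = resid.size // 2
# Homoscedasticity: the spread of the residuals should look similar everywhere
print("residual variance, first vs second half:", resid[:half].var(), resid[half:].var())
# Independence: one residual should not predict the next
print("lag-1 residual correlation:", np.corrcoef(resid[:-1], resid[1:])[0, 1])
# Normality: residual skewness should be near zero (a very crude check)
print("residual skewness:", ((resid - resid.mean()) ** 3).mean() / resid.std() ** 3)
# Linearity: plot resid against x; a systematic pattern (e.g. negative for low x,
# positive for high x) suggests the true relationship is not linear.
```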
What would indicate non linearity?
All the error terms for low values of x are negative
And all the error terms for high values of x are positive
Below the line -> above the line
Suggests a nonlinear (curved/polynomial/log/other) relationship
What would indicate serial correlation?
Negative error terms follow negative error terms
Positive error terms follow positive error terms
(runs of points on the same side of the line, rather than points scattered randomly above and below it, e.g. + - + - + -)
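One common way to quantify this, not named in the flashcards themselves, is the Durbin-Watson statistic on the residuals (values near 2 suggest no serial correlation, values near 0 suggest positive serial correlation). A minimal sketch with simulated residuals:

```python
import numpy as np

def durbin_watson(resid):
    """Durbin-Watson statistic: ~2 means no serial correlation,
    toward 0 means positive, toward 4 means negative serial correlation."""
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

# Simulated positively serially correlated residuals: each one follows the last
rng = np.random.default_rng(2)
e = np.zeros(100)
for t in range(1, e.size):
    e[t] = 0.8 * e[t - 1] + rng.normal()

print(durbin_watson(e))                      # well below 2: serial correlation
print(durbin_watson(rng.normal(size=100)))   # close to 2: no serial correlation
```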
What is coefficient of determination?
Measures the fraction of the total variation in the dependent variable that is explained by the independent variable
- It is a goodness of fit measure, but it does not tell us about the significance of the regression equation (which requires factoring in the sample size): it is NOT a statistical test
- Therefore we run an F-test, which compares the explained variation to the unexplained variation (mean square regression over mean square error) and factors in the sample size through the degrees of freedom
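A small sketch (Python/NumPy, simulated data) of the coefficient of determination and the F-statistic built from the sums of squares; the degrees of freedom are where the sample size enters:

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0.0, 10.0, 30)
y = 1.0 + 0.6 * x + rng.normal(0.0, 1.5, x.size)   # simulated data

slope, intercept = np.polyfit(x, y, 1)
y_hat = intercept + slope * x                      # fitted values

sst = np.sum((y - y.mean()) ** 2)                  # total variation
sse = np.sum((y - y_hat) ** 2)                     # unexplained variation
ssr = np.sum((y_hat - y.mean()) ** 2)              # explained variation

r_squared = ssr / sst                              # coefficient of determination
n, k = x.size, 1                                   # one slope coefficient in simple regression
f_stat = (ssr / k) / (sse / (n - k - 1))           # mean square regression / mean square error

print(f"R^2 = {r_squared:.3f}, F = {f_stat:.1f}")
```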
In what case might the relationship between the parameters be linear but the data is not linear?
If there is a fixed percentage change from the previous year
- The data will trace out an exponentially shaped curve
- Here the assumption of linearity is not being violated
- However a linear model may still not be appropriate
When the relationship between parameters is linear but the data is nonlinear, how can you tackle it?
- In this case, linearity has not been violated, but using the raw data, a linear model may not be appropriate
- When you have this scenario you can either change the model to be non linear, or transform the data to be linear
What would indicate serial correlation?
When the error term of the previous observation can predict the next
- I.e., if the last is positive the next is probably positive
- There are consistent stretches where the error terms stay positive for a while or negative for a while
- If there was no serial correlation error terms would randomly appear above and below the regression line across the series of observations
What is the difference between SST, SSE and SSR?
- Total sum of squares (SST) is each observation of y minus the mean of y, squared, then totalled across all observations
- SST can be broken down into the amount that can be explained, and unexplained
- Sum of squared errors (SSE) is the unexplained part
- Regression sum of squares (SSR) is the explained part
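Continuing the same style of sketch (simulated data, names are illustrative), the decomposition SST = SSE + SSR can be verified numerically:

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.linspace(0.0, 10.0, 40)
y = 3.0 + 0.4 * x + rng.normal(0.0, 1.0, x.size)   # simulated data

slope, intercept = np.polyfit(x, y, 1)
y_hat = intercept + slope * x

sst = np.sum((y - y.mean()) ** 2)       # total variation
sse = np.sum((y - y_hat) ** 2)          # unexplained part
ssr = np.sum((y_hat - y.mean()) ** 2)   # explained part

print(np.isclose(sst, sse + ssr))       # True: SST = SSE + SSR
```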
What is one use of a log-linear model?
When the growth rate is constant
- Under such a scenario the level of y (and so the absolute change each period) increases exponentially
- Therefore we put the dependent variable (y) on a log scale, so that ln(y) is linear in x (e.g. time)
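A brief sketch (Python/NumPy, with a made-up 5% growth rate) of the log-linear idea: the raw series grows exponentially, but ln(y) is linear in time, so a simple linear regression of ln(y) on t recovers the growth rate:

```python
import numpy as np

rng = np.random.default_rng(5)
t = np.arange(20)                                               # time periods
y = 100.0 * 1.05 ** t * np.exp(rng.normal(0.0, 0.02, t.size))   # ~5% growth per period

# Regress ln(y) on t: the slope estimates ln(1 + growth rate)
slope, intercept = np.polyfit(t, np.log(y), 1)
print("estimated growth rate ≈", np.exp(slope) - 1)             # close to 0.05
```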