Reading 1: Linear Regression Flashcards
Correlation Equation
Covariance of X and Y / (sample SD of X)(Sample SD of Y)
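As a quick numerical check, here is a minimal NumPy sketch (the x and y arrays are made-up illustration data, not from the reading) that computes the correlation from the sample covariance and sample standard deviations:

```python
import numpy as np

# Hypothetical paired observations, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Sample covariance and sample standard deviations (ddof=1 -> n-1 denominator)
cov_xy = np.cov(x, y, ddof=1)[0, 1]
r = cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))

print(r)                        # correlation from the definition above
print(np.corrcoef(x, y)[0, 1])  # same value from NumPy's built-in
```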
Slope coefficient
Cov(X,Y) / Var(X), i.e., the covariance of X and Y divided by the sample variance (squared sample standard deviation) of X
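The same kind of sketch (same made-up data) for the slope and intercept, checked against NumPy's own least-squares fit:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical data
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)  # slope = Cov(X,Y) / Var(X)
b0 = y.mean() - b1 * x.mean()                        # intercept through the point of means

print(b1, b0)
print(np.polyfit(x, y, 1))  # [slope, intercept] from least squares; should match
```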
Sum of squared errors (SSE)
- The sum of the squared vertical distances between the estimated and actual Y-values is referred to as the sum of squared errors (SSE).
- The regression line is the line that minimizes the SSE. This explains why simple linear regression is frequently referred to as ordinary least squares (OLS) regression, and why the values determined by the estimated regression equation are called least squares estimates.
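To illustrate the "least squares" idea, a short sketch (made-up data) showing that nudging the OLS coefficients away from their estimated values only increases the SSE:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical data
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

def sse(b0, b1):
    """Sum of squared vertical distances between actual and fitted Y-values."""
    return np.sum((y - (b0 + b1 * x)) ** 2)

b1_ols = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
b0_ols = y.mean() - b1_ols * x.mean()

print(sse(b0_ols, b1_ols))        # minimum SSE at the OLS estimates
print(sse(b0_ols, b1_ols + 0.1))  # larger
print(sse(b0_ols + 0.1, b1_ols))  # larger
```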
Linear regression assumptions
- A linear relationship exists between the dependent and the independent variable.
- The variance of the residual term is constant for all observations (homoskedasticity).
- The residual term is independently distributed; that is, the residual for one observation is not correlated with that of another observation (meaning that the paired X and Y observations are independent of each other).
- The residual term is normally distributed.
Heteroskedasticity
- occurs when the variance of the residuals differs across observations.
- For example, the model residuals may be more widely dispersed around higher values of X than around lower values of X. If these observations were chronological, this would suggest that model accuracy has declined over time.
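A minimal simulation sketch (the data-generating process is invented purely to illustrate the idea) in which the residual spread grows with X:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = np.linspace(1, 10, n)
y = 2.0 + 0.5 * x + rng.normal(0, 0.3 * x)  # error standard deviation grows with X

b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)

# Residual dispersion around low vs. high X values
print(resid[x < 5].std(ddof=1))   # smaller spread
print(resid[x >= 5].std(ddof=1))  # larger spread -> heteroskedasticity
```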
Serial Correlation (Independence)
- If the observations (X and Y pairs) are not independent, then the residuals from the model will exhibit serial correlation.
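One common informal check, not spelled out on this card, is the Durbin-Watson statistic computed from the residuals; values near 2 suggest no serial correlation, while values well below 2 suggest positive serial correlation. A rough sketch with simulated residual series:

```python
import numpy as np

def durbin_watson(resid):
    """DW = sum of squared changes in residuals / sum of squared residuals."""
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

rng = np.random.default_rng(1)

# Hypothetical residuals with positive serial correlation (AR(1)-style)
e = np.zeros(200)
for t in range(1, 200):
    e[t] = 0.8 * e[t - 1] + rng.normal()

print(durbin_watson(e))                      # well below 2
print(durbin_watson(rng.normal(size=200)))   # close to 2 (independent residuals)
```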
Total sum of squares (SST)
- measures the total variation in the dependent variable.
- SST is equal to the sum of the squared differences between the actual Y-values and the mean of Y.
- total variation = explained variation + unexplained variation
Regression sum of squares (RSS)
- measures the variation in the dependent variable that is explained by the independent variable.
- RSS is the sum of the squared distances between the predicted Y-values and the mean of Y.
Sum of squared errors (SSE)
- measures the unexplained variation in the dependent variable. It’s also known as the sum of squared residuals or the residual sum of squares.
- SSE is the sum of the squared vertical distances between the actual Y-values and the predicted Y-values on the regression line.
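A short sketch (re-using the same made-up data so the block stands alone) verifying the decomposition SST = RSS + SSE:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical data
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

sst = np.sum((y - y.mean()) ** 2)      # total variation
rss = np.sum((y_hat - y.mean()) ** 2)  # explained variation
sse = np.sum((y - y_hat) ** 2)         # unexplained variation

print(sst, rss + sse)  # total = explained + unexplained
```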
Regression (explained)
- df = 1
- Sum of squares = RSS
- Mean sum of squares
MSR = RSS/k = RSS/1 = RSS
Error (unexplained)
- df = n-2
- Sum of squares = SSE
- Mean squared error
MSE = SSE/(n-2)
Total
- df = n-1
- Sum of squares = SST
Standard Error of Estimate
- The standard deviation of the regression's residuals. The lower the SEE, the better the model fit.
- SEE = √MSE = √(SSE/(n-2))
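A sketch computing the SEE on the same made-up data (two estimated coefficients, so SSE is divided by n - 2):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical data
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)

n = len(y)
mse = np.sum(resid ** 2) / (n - 2)  # SSE / (n - 2)
see = np.sqrt(mse)                  # standard error of estimate

print(see)
```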
Coefficient of Determination (R²)
- Defined as the percentage of the total variation in the dependent variable explained by the independent variable.
- R² = RSS / SST
- For simple linear regression (i.e., with one independent variable), the coefficient of determination, R², may be computed by simply squaring the correlation coefficient, r. In other words, R² = r² for a regression with one independent variable.
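A sketch (made-up data) confirming that R² = RSS/SST matches the squared correlation when there is a single independent variable:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical data
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

sst = np.sum((y - y.mean()) ** 2)
rss = np.sum((y_hat - y.mean()) ** 2)

r2_from_anova = rss / sst                     # R² = RSS / SST
r2_from_corr = np.corrcoef(x, y)[0, 1] ** 2   # r², same value with one regressor

print(r2_from_anova, r2_from_corr)
```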
F statistic
- F-test assesses how well a set of independent variables, as a group, explains the variation in the dependent variable.
- F-statistic is used to test whether at least one independent variable in a set of independent variables explains a significant portion of the variation of the dependent variable.
F = MSR/MSE = (RSS/k) / (SSE/(n-k-1))
- Important: This is always a one-tailed test!
- df(numerator) = k = 1
- df(denominator) = n − k − 1 = n − 2
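A sketch computing the F-statistic and its one-tailed p-value with SciPy (made-up data; k = 1 for simple regression):

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical data
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

n, k = len(y), 1
rss = np.sum((y_hat - y.mean()) ** 2)
sse = np.sum((y - y_hat) ** 2)

msr = rss / k            # mean regression sum of squares
mse = sse / (n - k - 1)  # mean squared error
f_stat = msr / mse

# One-tailed p-value from the F-distribution with (k, n - k - 1) df
p_value = stats.f.sf(f_stat, k, n - k - 1)
print(f_stat, p_value)
```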