1. Correlation And Regression Flashcards
True or false: with the exception of the extremes r=1 or r=-1, we cannot really speak of the strength of the relationship indicated by the correlation coefficient without a statistical test of significance?
True.
Test whether the population correlation between two variables is equal to zero, assuming that the two populations are normally distributed.
H-0: ρ = 0
H-1: ρ ≠ 0
t = [r × (n-2)^(1/2)] / [(1-r^2)^(1/2)]
To make a decision, the calculated test statistic is compared with the critical t-value for the appropriate degrees of freedom (n-2) and level of significance. Bearing in mind that we are conducting a two-tailed test, the decision rule can be stated as:
Reject H-0 if t > +t-critical or t < -t-critical
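A minimal sketch of this test in Python, using numpy and scipy (the sample data are hypothetical, purely for illustration):

```python
import numpy as np
from scipy import stats

# Hypothetical sample data (illustrative only)
x = np.array([1.2, 2.4, 3.1, 4.8, 5.0, 6.3, 7.1, 8.9])
y = np.array([2.1, 3.0, 3.9, 5.2, 5.5, 6.1, 7.4, 8.8])

n = len(x)
r = np.corrcoef(x, y)[0, 1]            # sample correlation coefficient

# Test statistic: t = r * sqrt(n - 2) / sqrt(1 - r^2), with n - 2 df
t_stat = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)

# Two-tailed critical value at the 5% significance level
t_crit = stats.t.ppf(1 - 0.05 / 2, df=n - 2)

print(f"r = {r:.4f}, t = {t_stat:.4f}, critical t = +/-{t_crit:.4f}")
print("Reject H-0" if abs(t_stat) > t_crit else "Fail to reject H-0")
```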
Describe the six assumptions underlying linear regression.
As indicated in the following list, most of the major assumptions pertain to the regression model’s residual term.
- A linear relationship exists between the dependent and the independent variable
- The independent variable is uncorrelated with the residuals
- The expected value of the residual term is zero, E(e)=0
- The variance of the residual term is constant for all observations
- The residual term is independently distributed; that is, the residual for one observation is not correlated with that of another observation
- The residual term is normally distributed
In a linear regression, how is the slope term calculated?
The slope term equals the covariance of X and Y divided by the variance of X:
b1 = Cov(X,Y) / Var(X)
In a linear regression, how is the intercept term calculated?
The intercept term may be expressed as:
b0 = (mean of Y) - b1 × (mean of X)
The intercept equation highlights the fact that the regression line passes through a point with coordinates equal to the means of the independent and dependent variables, the point (mean of X, mean of Y).
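A minimal sketch of both the slope and intercept formulas above in Python, using numpy (the sample data are hypothetical):

```python
import numpy as np

# Hypothetical sample data (illustrative only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

# Slope: b1 = Cov(X, Y) / Var(X); the ddof choice cancels as long as it matches
b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)

# Intercept: b0 = mean(Y) - b1 * mean(X), so the regression line passes
# through the point (mean(X), mean(Y))
b0 = y.mean() - b1 * x.mean()

print(f"b1 = {b1:.4f}, b0 = {b0:.4f}")
# Cross-check against numpy's least-squares fit: returns [slope, intercept]
print(np.polyfit(x, y, 1))
```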
Define ‘standard error of estimate (SEE)’.
The standard error of estimate (SEE) measures the degree of variability of the actual Y-values relative to the estimated Y-values from a regression equation. The SEE gauges the “fit” of the regression line.
The SEE is the standard deviation of the error terms in the regression. As such, SEE is also referred to as the standard error of the residual, or standard error of the regression.
How do you interpret the coefficient of determination (R^2)? Give an example: R^2 = … means…
The coefficient of determination is defined as the percentage of the total variation in the dependent variable explained by the independent variable. For example, an R^2 of 0.63 indicates that the variation of the independent variable explains 63% of the variation in the dependent variable.
- Computing R^2 by simply squaring the correlation coefficient is not appropriate when more than one independent variable is used in the regression.
How do you calculate and interpret a confidence interval for a regression coefficient?
The confidence interval for the regression coefficient, b1, is calculated as:
b1 ± (t-critical × s(b1))
t-critical: the critical two-tailed t-value for the selected confidence level with degrees of freedom equal to the number of sample observations minus 2, n-2.
s(b1): standard error of the regression coefficient. It is highly unlikely that you will have to calculate s(b1) on the exam.
A frequently asked question is whether an estimated slope coefficient is statistically different from zero.
If the confidence interval at the desired level of significance does not include zero, the null is rejected, and the coefficient is said to be statistically significantly different from zero.
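A minimal sketch of this confidence interval in Python, using numpy and scipy (the data are hypothetical; s(b1) is computed with the standard formula SEE / √Σ(x - mean(x))²):

```python
import numpy as np
from scipy import stats

# Hypothetical sample data (illustrative only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.8, 4.1, 5.9, 8.2, 9.7, 12.3])
n = len(x)

b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
b0 = y.mean() - b1 * x.mean()

# Standard error of the slope: s(b1) = SEE / sqrt(sum((x - mean(x))^2))
resid = y - (b0 + b1 * x)
see = np.sqrt(np.sum(resid**2) / (n - 2))
s_b1 = see / np.sqrt(np.sum((x - x.mean())**2))

# 95% confidence interval with n - 2 degrees of freedom
t_crit = stats.t.ppf(0.975, df=n - 2)
lower, upper = b1 - t_crit * s_b1, b1 + t_crit * s_b1
print(f"b1 = {b1:.4f}, 95% CI = [{lower:.4f}, {upper:.4f}]")
print("Significant" if lower > 0 or upper < 0 else "Not significant")
```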
Formulate the null and alternative hypotheses for a test of whether the true slope, b1, is equal to some hypothesized value.
H-0: b1 = hypothesized value
H-1: b1 ≠ hypothesized value
Letting b1’ be the point estimate for b1, the appropriate test statistic with n-2 degrees of freedom is:
t(b1) = (b1’ - b1) / s(b1’)
Reject H-0 if t > +t-critical or t < -t-critical
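A minimal sketch of this slope test in Python (the data are hypothetical; scipy's linregress reports the slope's standard error directly):

```python
import numpy as np
from scipy import stats

# Hypothetical sample data (illustrative only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
y = np.array([2.3, 3.8, 6.1, 7.9, 10.2, 11.8, 14.1])
n = len(x)

res = stats.linregress(x, y)           # slope, intercept, stderr, ...

b1_hyp = 0.0                           # hypothesized slope (H-0: b1 = 0)
t_stat = (res.slope - b1_hyp) / res.stderr
t_crit = stats.t.ppf(0.975, df=n - 2)  # two-tailed, 5% significance

print(f"t = {t_stat:.4f}, critical t = +/-{t_crit:.4f}")
print("Reject H-0" if abs(t_stat) > t_crit else "Fail to reject H-0")
```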
Define SST, RSS, SSE.
- SST (Total Sum of Squares): measures the total variation in the dependent variable. SST is equal to the sum of the squared differences between the actual-Y values and the mean of Y.
- RSS (Regression Sum of Squares): measures the variation in the dependent variable that is explained by the independent variable. RSS is the sum of the squared distances between the predicted Y-values and the mean of Y.
- SSE (Sum of Squared Errors): measures the unexplained variation in the dependent variable. It’s also known as the sum of squared residuals or the residual sum of squares. SSE is the sum of the squared vertical distances between the actual Y-values and the predicted Y-values on the regression line.
SST = RSS + SSE
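A minimal sketch verifying this decomposition in Python, using numpy (hypothetical data):

```python
import numpy as np

# Hypothetical sample data (illustrative only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.8, 8.3, 9.9])

b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x                    # predicted Y-values

sst = np.sum((y - y.mean())**2)        # total variation
rss = np.sum((y_hat - y.mean())**2)    # explained variation
sse = np.sum((y - y_hat)**2)           # unexplained variation

print(f"SST = {sst:.4f}, RSS + SSE = {rss + sse:.4f}")  # identical
```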
Define and calculate MSR and MSE.
- MSR (Mean Regression Sum of Squares) = RSS/k
- MSE (Mean Squared Error) = SSE/(n-k-1)
where k is the number of independent variables and n is the number of observations.
How to calculate R^2?
Recall that computing R^2 by simply squaring the correlation coefficient is not appropriate when more than one independent variable is used in the regression.
Tip: use SST, SSE and RSS.
R^2 = (SST - SSE)/SST
= RSS/SST
= (total variation - unexplained variation)/ total variation
= explained variation/total variation
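A minimal sketch of this calculation in Python, using numpy (hypothetical data; for a simple linear regression the result also equals the squared correlation):

```python
import numpy as np

# Hypothetical sample data (illustrative only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.8, 8.3, 9.9])

b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

sst = np.sum((y - y.mean())**2)
sse = np.sum((y - y_hat)**2)
r_squared = (sst - sse) / sst          # explained / total variation

# Only valid as a cross-check here because there is one independent variable
print(f"R^2 = {r_squared:.4f}, r^2 = {np.corrcoef(x, y)[0, 1]**2:.4f}")
```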
Define and calculate SEE.
SEE is the standard deviation of the regression error terms and is equal to the square root of the mean squared error (MSE):
SEE = (MSE)^(1/2) = (SSE/(n-k-1))^(1/2)
Make sure you recognize the distinction between the sum of squared errors (SSE) and the standard error of estimate (SEE). SSE is the sum of the squared residuals, while SEE is the standard deviation of the residuals.
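A minimal sketch of the SSE-to-SEE calculation in Python, using numpy (hypothetical data):

```python
import numpy as np

# Hypothetical sample data (illustrative only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.9, 4.2, 5.8, 8.1, 10.3, 11.9])
n, k = len(x), 1                       # one independent variable

b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)

sse = np.sum(resid**2)                 # sum of squared errors
mse = sse / (n - k - 1)                # mean squared error
see = np.sqrt(mse)                     # standard error of estimate

print(f"SSE = {sse:.4f}, SEE = {see:.4f}")
```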
Calculate and interpret the F-statistic.
An F-test assesses how well a set of independent variables, as a group, explains the variation in the dependent variable.
In a multiple regression, the F-statistic is used to test whether at least one independent variable in a set of independent variables explains a significant portion of the variation of the dependent variable.
F = MSR/MSE
THIS IS ALWAYS A ONE-TAILED TEST.
What are the degrees of freedom of the numerator and denominator in an F-test?
df(numerator)= k
df(denominator)= n - k - 1
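A minimal sketch of the F-test in Python, using numpy and scipy (hypothetical data; the critical value comes from the one-tailed F distribution with k and n-k-1 degrees of freedom):

```python
import numpy as np
from scipy import stats

# Hypothetical sample data (illustrative only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
y = np.array([2.2, 3.9, 6.0, 8.1, 9.8, 12.2, 13.9])
n, k = len(x), 1                       # one independent variable

b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

rss = np.sum((y_hat - y.mean())**2)
sse = np.sum((y - y_hat)**2)

msr = rss / k                          # df(numerator) = k
mse = sse / (n - k - 1)                # df(denominator) = n - k - 1
f_stat = msr / mse

f_crit = stats.f.ppf(0.95, dfn=k, dfd=n - k - 1)  # one-tailed, 5% level
print(f"F = {f_stat:.4f}, critical F = {f_crit:.4f}")
print("Reject H-0" if f_stat > f_crit else "Fail to reject H-0")
```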