Quant Methods #9 - Correlation and Regression Flashcards
covariance (def) of two random variables
LOS 9.a
a statistical measure of the degree to which two variables move together.
- covariance captures the linear relationship between two variables
- positive covariance : variables tend to move together
- negative covariance: variables tend to move in opposite directions
covXY =
LOS 9.a
covXY = Σ(i=1 to n) (Xi - Xmean)(Yi - Ymean) / (n-1)
where:
- n = sample size
- Xi = ith observation on variable X
- Xmean = mean of X1 to n
- Yi = ith observation on variable Y
- Ymean = mean of Y1 to n
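The sum above can be sketched directly in Python (the data values are hypothetical, purely for illustration):

```python
def sample_covariance(x, y):
    # Sample covariance from the definition: sum of cross-deviations / (n - 1)
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    return sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / (n - 1)

x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]
print(sample_covariance(x, y))  # positive: x and y move together
```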
Why is covariance not very meaningful but correlation coefficient is?
LOS 9.a
Covariance:
- extremely sensitive to the scale of the two variables
- range is -infinity to +infinity
- presented in units that are the product of the units of X and Y, which are hard to interpret
Correlation coefficient converts covariance into a standardized measure that is easier to interpret.
sample correlation coefficient, rXY = ?
LOS 9.a
rXY = covXY / (sX sY)
- sX = sample standard deviation of X
- sY = sample standard deviation of Y
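A minimal Python sketch of the standardization (hypothetical data; perfectly linear, so r comes out at its upper bound):

```python
import statistics

def sample_correlation(x, y):
    # r = sample covariance divided by the product of sample standard deviations
    n = len(x)
    x_mean, y_mean = sum(x) / n, sum(y) / n
    cov = sum((a - x_mean) * (b - y_mean) for a, b in zip(x, y)) / (n - 1)
    return cov / (statistics.stdev(x) * statistics.stdev(y))

x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]
print(sample_correlation(x, y))  # exactly linear data -> r = 1.0
```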
Interpret for rXY (sample correlation coefficient):
r = +1
0 < r < 1
r = 0
-1 < r < 0
r = -1
LOS 9.a
perfect positive linear correlation
positive linear relationship
no linear relationship
negative linear relationship
perfect negative linear correlation
Note that for r = 1 and r = -1 the data points lie exactly on a line, but the slope is not necessarily +1 or -1.
What are the limitations to correlation analysis?
LOS 9.b
- Outliers - can significantly influence computed correlation to give false relationship (or lack thereof)
- Spurious Correlation - appearance of a linear relationship when the data are correlated purely by chance, e.g. stock prices vs. snowfall amounts
- Nonlinear Relationships - does not capture strong nonlinear relationships
How does one test for significance of the population correlation, ρ (rho), of two variables (from the sample correlation results)?
LOS 9.c
Test whether the population correlation between the two variables equals zero, using the following null and alternative hypotheses for a two-tailed test with n-2 degrees of freedom (df):
H0: ρ = 0 versus Ha: ρ != 0
test statistic t = r * sqrt(n-2) / sqrt(1 - r^2)
Then compare computed t with the critical t-value for the appropriate degrees of freedom and level of significance. For a two-tailed test, the decision rule is stated as:
Reject H0 if +tcritical < t or t < -tcritical
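The test statistic is easy to compute by hand or in code; this sketch uses hypothetical numbers (r = 0.5 from n = 30 observations):

```python
import math

def corr_t_stat(r, n):
    # t = r * sqrt(n - 2) / sqrt(1 - r^2), with df = n - 2
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

t = corr_t_stat(0.5, 30)
print(round(t, 3))  # compare with the two-tailed critical t for df = 28
```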
Distinguish between the dependent and independent variables of a linear regression
LOS 9.d
- dependent variable - its variation is explained by the independent variable (e.g. “Y” values), aka explained, endogenous, or predicted variable
- independent variable - explains the variation of the dependent variable (e.g. “X” values), aka explanatory, exogenous, or predicting variable.
Describe the six assumptions underlying linear regression
LOS 9.e
except for #1, it’s all about the residuals!
For X (independent) and Y (dependent) variables:
- linear relationship exists between X and Y
- X is uncorrelated with residuals, e
- The expected value of the residual term is zero: ê = 0, also noted as E(e) = 0
- The variance of the residual term is constant for all observations: E(ei2) = σe2
- The residual term is independently distributed, i.e. residual for each observation is uncorrelated with all others: E(eiej) = 0, j != i
- The residual term is normally distributed
Interpret the linear regression coefficients
LOS 9.e
For linear relationship:
Yi = b0 + b1Xi + ei, i=1…n
the regression line equation is:
^Yi = ^b0 + ^b1Xi , i=1…n ( ^ means “hat” or “estimated”)
- ^b1 = covXY / sX2 ; “slope = cov / variance”; stock’s ß
- ^b0 = Ymean - ^b1 Xmean ; y-intercept; stock’s alpha
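The two formulas above can be sketched in Python (data values are hypothetical and chosen to lie exactly on y = 1 + 2x, so the estimates are easy to check):

```python
def ols_coefficients(x, y):
    # slope b1 = cov(X, Y) / var(X); intercept b0 = Ymean - b1 * Xmean
    n = len(x)
    x_mean, y_mean = sum(x) / n, sum(y) / n
    cov = sum((a - x_mean) * (b - y_mean) for a, b in zip(x, y)) / (n - 1)
    var_x = sum((a - x_mean) ** 2 for a in x) / (n - 1)
    b1 = cov / var_x
    b0 = y_mean - b1 * x_mean
    return b0, b1

x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]  # exactly y = 1 + 2x
print(ols_coefficients(x, y))  # (1.0, 2.0)
```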
Standard error of estimate (def.)
LOS 9.f
Standard error of estimate (SEE) is the standard deviation of the error terms in the regression.
also called:
standard error of the residual
standard error of the regression
SEE measures the degree of variability of the actual Y-values relative to the estimated Y-values from a regression equation.
The SEE gauges the “fit” of the regression line.
The smaller the standard error, the better the fit.
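A minimal sketch, assuming the usual formula SEE = sqrt(SSE / (n - 2)), where SSE is the sum of squared residuals (the data below are hypothetical and lie exactly on the fitted line, so the residuals, and hence SEE, are zero):

```python
import math

def standard_error_of_estimate(x, y, b0, b1):
    # SEE = sqrt( sum of squared residuals / (n - 2) )
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    return math.sqrt(sse / (len(x) - 2))

x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]  # exactly y = 1 + 2x
print(standard_error_of_estimate(x, y, 1.0, 2.0))  # 0.0: a perfect fit
```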
Coefficient of Determination (def.) for simple linear regression
LOS 9.f
Coefficient of determination (R2) is the percentage of the total variation in the dependent variable (Y) explained by the independent variable (X).
For simple linear regression (not for multi-variate regression),
R2 = r2, where
r = sample correlation coefficient
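This equality can be verified numerically: compute R2 as 1 - SSE/SST from a fitted line and compare it with the squared sample correlation (data values are hypothetical):

```python
def r_squared(x, y):
    # R^2 = 1 - SSE/SST for the fitted simple regression line
    n = len(x)
    xm, ym = sum(x) / n, sum(y) / n
    sxx = sum((a - xm) ** 2 for a in x)
    sxy = sum((a - xm) * (b - ym) for a, b in zip(x, y))
    b1 = sxy / sxx
    b0 = ym - b1 * xm
    sse = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))
    sst = sum((b - ym) ** 2 for b in y)
    return 1 - sse / sst

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 3.9, 6.2, 8.1, 9.8]
print(r_squared(x, y))  # equals r^2, the squared sample correlation
```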
Regression coefficient (^b1) confidence interval (equation)
LOS 9.f
^b1 +/- (tc x s^b1), where
tc = critical two-tailed t-value for the selected confidence level for df = n-2
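A sketch of the interval with hypothetical numbers (^b1 = 0.76, s^b1 = 0.33, tc = 2.101 for a 95% confidence level with df = 18):

```python
def slope_confidence_interval(b1_hat, s_b1, t_crit):
    # CI = b1_hat +/- t_crit * s(b1_hat)
    return b1_hat - t_crit * s_b1, b1_hat + t_crit * s_b1

low, high = slope_confidence_interval(0.76, 0.33, 2.101)
print(round(low, 3), round(high, 3))
```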
Test for significance about a population value of a regression coefficient (e.g. b1)
LOS 9.g
Use two-tailed t-test with df = n-2:
tb1 = (^b1 - b1) / s^b1, where
b1 = the hypothesized value.
H0: b1 = 0; Ha: b1 != 0
reject H0 if t < -tc or tc < t, which means that b1 is significantly different from the hypothesized value
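A minimal sketch of the test statistic with hypothetical numbers (^b1 = 0.76, s^b1 = 0.33, testing against the usual H0: b1 = 0):

```python
def slope_t_stat(b1_hat, b1_hyp, s_b1):
    # t = (estimated slope - hypothesized slope) / standard error of the slope
    return (b1_hat - b1_hyp) / s_b1

t = slope_t_stat(0.76, 0.0, 0.33)
print(round(t, 2))  # compare |t| with the critical t for df = n - 2
```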
For a simple linear regression, how does one predict the value of the dependent variable (Y)?
LOS 9.h
^Y = ^b0 + ^b1Xp, where
^Y = predicted value of the dependent variable
Xp = forecasted value of the independent variable
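Prediction is just plugging the forecast Xp into the estimated regression line; a one-line sketch with hypothetical estimates (^b0 = 1.0, ^b1 = 2.0, Xp = 6.0):

```python
def predict_y(b0_hat, b1_hat, x_p):
    # predicted Y = intercept estimate + slope estimate * forecast X value
    return b0_hat + b1_hat * x_p

print(predict_y(1.0, 2.0, 6.0))  # 13.0
```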