Chapter 16: Simple Linear Regression And Correlation Flashcards
Regression analysis
A technique used to predict the value of one variable on the basis of other variables
Requires developing an equation that describes the relationship between the variable to be forecast (dependent variable) and the variables the practitioner believes are related to it (independent variables)
Correlation analysis
Technique used to determine if a relationship exists between two variables
Deterministic models
Equations that allow us to determine the value of the dependent variable from the values of the independent variable
Probabilistic model
Models that include a method to represent the randomness of real-life processes
Starts with a deterministic model and then adds a term to measure the random error of the deterministic component
Error variable
Represented by epsilon
The difference between an actual data point and the point estimated by the model
Accounts for all variables (measurable and immeasurable) that are not part of the model
First-order linear model
Aka simple linear regression model
Aka straight-line model
Includes only one independent variable
Y=B0 + B1x + e
Y= dependent variable x= independent variable B0= y-intercept B1= slope of the line (rise/run) e= error variable
(So y=Mx+B + error variable)
X and y must both be interval data
Coefficients B0 and B1 are population parameters (almost always unknown, so must estimate)
Least squares line coefficients
For y-hat= b0 + b1x
b1= sample covariance of x and y / sample variance of x
b0= sample mean of y - (b1* sample mean of x)
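The two coefficient formulas above can be sketched in Python. This is a minimal illustration; the function name and the sample data used below are made up, not from the text.

```python
def least_squares(x, y):
    """Return (b0, b1) for the least squares line y-hat = b0 + b1*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # sample covariance of x and y
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (n - 1)
    # sample variance of x
    s_xx = sum((xi - mean_x) ** 2 for xi in x) / (n - 1)
    b1 = s_xy / s_xx           # slope = sample covariance / sample variance of x
    b0 = mean_y - b1 * mean_x  # intercept from the sample means
    return b0, b1
```

For example, `least_squares([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])` gives b0 = 2.2 and b1 = 0.6.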
Sample variance
s^2 = (sum of all values of (x - mean x)^2) / (n-1)
Shortcut = (1/(n-1)) * ((sum of all values of x^2) - ((sum of all values of x)^2)/n)
Excel: VAR function
Sample covariance
Sxy = (sum of ((all values of x - mean x) * (all values of y - mean y))) / (n-1)
Shortcut = (1/(n-1)) * ((sum of all values of xy) - ((sum of all values of x * sum of all values of y)/n))
Excel: COVAR function (note: COVAR divides by n, giving the population covariance; multiply by n/(n-1), or use COVARIANCE.S, for the sample covariance)
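A sketch checking that the shortcut formulas agree with the definitional forms (data values below are illustrative):

```python
def sample_variance_shortcut(x):
    """Shortcut: (1/(n-1)) * (sum of x^2 - (sum of x)^2 / n)."""
    n = len(x)
    return (sum(v * v for v in x) - sum(x) ** 2 / n) / (n - 1)

def sample_covariance_shortcut(x, y):
    """Shortcut: (1/(n-1)) * (sum of xy - (sum of x * sum of y) / n)."""
    n = len(x)
    return (sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n) / (n - 1)
```

On x = [1, 2, 3, 4, 5] and y = [2, 4, 5, 4, 5], the shortcuts give a sample variance of 2.5 and a sample covariance of 1.5, matching the definitional calculations.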
Least squares method
Produces a straight line that minimizes the sum of the squared differences between the actual points and the line
Residuals
The deviations between the actual data points and the least squares line (ei)
ei= y(actual) - y-hat (calculated)
Observations of the error variable
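The residual definition above translates directly to code. A minimal sketch (coefficients and data below are illustrative):

```python
def residuals(x, y, b0, b1):
    """e_i = actual y minus fitted y-hat = y_i - (b0 + b1 * x_i)."""
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
```

A property of the least squares line is that its residuals sum to zero; e.g. `residuals([1, 2, 3, 4, 5], [2, 4, 5, 4, 5], 2.2, 0.6)` gives [-0.8, 0.6, 1.0, -0.6, -0.2].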
Sum of squares for error
Minimized sum of squared deviations between observed y and calculated y
SSE
Regression analysis in excel
Type x and y data into two columns (cannot have missing data)
Go to data, data analysis, regression
Input y range and x range
Intercept coefficient is b0 (intercept)
X data coefficient is b1 (slope)
Inferences from least squares line
Coefficients describe only the sample data; they are not yet ready to be used as inferences about the population parameters
The intercept isn't necessarily the value of y when x = 0, just an estimate based on the rest of the data; in general, values of y can't be reliably determined for a value of x outside the range of the sample values
Required conditions for the error variable
1) probability distribution of e is normal
2) the mean of the distribution is 0; that is E(e)=0
3) the standard deviation of e is sigma e, which is a constant regardless of the value of x
1-3: for each value of x, y is a normally distributed random variable whose mean is E(y) = B0 + B1x and whose standard deviation is sigma e
4) the value of e associated with any particular value of y is independent of e associated with any other value of y
Methods to assess the regression model
- Standard error of estimate
- t test of slope
- coefficient of determination
All based on the sum of squares for error
Sum of squares for error
SSE: minimized sum of squared deviation (between the data points and the line defined by the coefficients)
Shortcut calculation of SSE
SSE = (n-1) * (sample variance of y - ((sample covariance of x and y)^2 / sample variance of x))
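As a sketch, the shortcut can be verified against the direct sum of squared residuals (coefficients and data below are illustrative):

```python
def sse_direct(x, y, b0, b1):
    """SSE = sum of (y_i - y-hat_i)^2."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

def sse_shortcut(x, y):
    """SSE = (n-1) * (s_y^2 - s_xy^2 / s_x^2)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    s_xx = sum((xi - mx) ** 2 for xi in x) / (n - 1)
    s_yy = sum((yi - my) ** 2 for yi in y) / (n - 1)
    s_xy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (n - 1)
    return (n - 1) * (s_yy - s_xy ** 2 / s_xx)
```

On x = [1, 2, 3, 4, 5], y = [2, 4, 5, 4, 5] with b0 = 2.2, b1 = 0.6, both routes give SSE = 2.4.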
Standard error of estimate
The standard deviation of the errors indicates fit: if it is large the fit is poor, if it is small the fit is good
Must use sample standard deviation to estimate population
Standard deviation of error variable= square root of (SSE/n-2)
Also standard error value in excel regression statistics
Smallness or largeness of se is judged by comparing it to the sample mean of the dependent variable. If se is small relative to that mean, we can say it is relatively small.
Very useful for comparing models. Not useful as an absolute measure
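A minimal sketch of the formula se = sqrt(SSE / (n-2)):

```python
import math

def standard_error_of_estimate(sse, n):
    """s_e = sqrt(SSE / (n - 2)); n - 2 is the degrees of freedom."""
    return math.sqrt(sse / (n - 2))
```

For example, with SSE = 2.4 and n = 5 (illustrative values), se = sqrt(0.8), roughly 0.894.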
Testing the slope
Horizontal line (slope = 0) implies lack of linear relationship (B1 = slope)
Test of the slope is a hypothesis test where:
H0: B1= 0 (aka, no linear relationship)
H1: B1 =/= 0 (two tail test)
Test statistic for b1
t = (sample slope - population slope) / standard error of sample slope
(Standard error of sample slope = standard error of estimate / square root of ((n-1) * sample variance of the independent variable))
v= n-2
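The test statistic above as a sketch (argument names and the example values are illustrative):

```python
import math

def t_stat_slope(b1, s_e, s_xx, n, beta1=0.0):
    """t = (b1 - beta1) / s_b1, where s_b1 = s_e / sqrt((n-1) * s_x^2).

    b1: sample slope; s_e: standard error of estimate;
    s_xx: sample variance of x; beta1: hypothesized slope (0 under H0).
    """
    s_b1 = s_e / math.sqrt((n - 1) * s_xx)
    return (b1 - beta1) / s_b1
```

With b1 = 0.6, s_e = sqrt(0.8), s_x^2 = 2.5 and n = 5, this gives t of about 2.12, compared against the t distribution with n-2 = 3 degrees of freedom.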
Confidence interval estimator of the population slope (B1)
Sample slope (b1) +/- t(a/2) * standard error of sample slope
v=n-2
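A sketch of the interval; the critical value t(a/2) with n-2 degrees of freedom must come from a t table or software, so it is passed in here (the example uses the standard table value t(0.025, 3) = 3.182):

```python
import math

def slope_ci(b1, s_e, s_xx, n, t_crit):
    """b1 +/- t(a/2) * s_b1, with s_b1 = s_e / sqrt((n-1) * s_x^2)."""
    s_b1 = s_e / math.sqrt((n - 1) * s_xx)
    return b1 - t_crit * s_b1, b1 + t_crit * s_b1
```

With b1 = 0.6, s_e = sqrt(0.8), s_x^2 = 2.5, n = 5 (illustrative values), the 95% interval is roughly (-0.30, 1.50).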
One tail tests
One tail tests can be used to test if there is a positive or negative linear relationship between the variables
H1: B1< 0 looks for a negative linear relationship
H1: B1 > 0 looks for a positive linear relationship
Same test statistic; the p-value is the two-tail p-value divided by 2 (provided the sample slope has the sign that H1 predicts)
Coefficient of determination
Measure of the strength of a linear relationship between variables (how much of the variation in the dependent variable can be explained by variation in the independent variable)
R2 = (s xy)^2 / (s^2 x * s^2 y)
((sample covariance of x and y)^2 / (sample variance of x * sample variance of y))
Or
R2 = 1 - (SSE / (sum of all values of (y - mean of y)^2))
Essentially explained variation/ total variation in y
R square value in excel regression analysis
The higher the value of R2 the better the model fits the data
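A sketch computing R2 both ways, which should agree (data and coefficients below are illustrative):

```python
def r_squared_cov(x, y):
    """R2 = s_xy^2 / (s_x^2 * s_y^2)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    s_xy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    s_xx = sum((a - mx) ** 2 for a in x) / (n - 1)
    s_yy = sum((b - my) ** 2 for b in y) / (n - 1)
    return s_xy ** 2 / (s_xx * s_yy)

def r_squared_sse(x, y, b0, b1):
    """R2 = 1 - SSE / SST = explained variation / total variation in y."""
    n = len(x)
    my = sum(y) / n
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - my) ** 2 for yi in y)
    return 1 - sse / sst
```

On x = [1, 2, 3, 4, 5], y = [2, 4, 5, 4, 5] with b0 = 2.2, b1 = 0.6, both give R2 = 0.6.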
ANOVA table
Part of excel regression analysis: analysis of variance table
Shows sources of variation in y
Regression = SSR = variation in y explained by x
Error (residual) = SSE = variation in y still unexplained
SS = sum of squares MS= mean of squares (ss/df)
F statistic = MSR/MSE (mean of squares regression/mean of squares error)
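The F statistic for the simple regression ANOVA table can be sketched from SSE and the total sum of squares (values below are illustrative):

```python
def anova_f(sse, sst, n):
    """F = MSR / MSE for simple regression (regression df = 1, error df = n-2)."""
    ssr = sst - sse      # regression SS: variation in y explained by x
    msr = ssr / 1        # MS = SS / df; regression df = 1
    mse = sse / (n - 2)  # error df = n - 2
    return msr / mse
```

With SSE = 2.4, SST = 6.0, n = 5, F = 3.6 / 0.8 = 4.5; note this equals the square of the slope t statistic (2.1213^2), as expected in simple regression.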
Cause and effect relationship
Remember: correlation between values of x and y is not necessarily x determining y. Could be an unknown factor determining both. Cannot tell from statistics alone. Need a reasonable theoretical relationship
Sample coefficient of correlation
r = sxy / (sx * sy)
Sample coefficient of correlation = sample covariance of x and y / (sample standard deviation of x * sample standard deviation of y)
Determines whether there is a linear relationship between two variables
Use for observational data with two bivariate normally distributed variables
Test statistic for testing that p (population coefficient of correlation) = 0
t= r(square root of ((n-2)/ (1-r^2)))
V= n-2
Provided variables are bivariate normally distributed
Can also do one tail tests to check for p<0 and p> 0
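A sketch computing r and its test statistic together (data below are illustrative):

```python
import math

def correlation_t(x, y):
    """Return (r, t) where r = s_xy / (s_x * s_y) and
    t = r * sqrt((n-2) / (1 - r^2)) with n-2 degrees of freedom."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    s_xy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    s_x = math.sqrt(sum((a - mx) ** 2 for a in x) / (n - 1))
    s_y = math.sqrt(sum((b - my) ** 2 for b in y) / (n - 1))
    r = s_xy / (s_x * s_y)
    t = r * math.sqrt((n - 2) / (1 - r * r))
    return r, t
```

On x = [1, 2, 3, 4, 5], y = [2, 4, 5, 4, 5], r is about 0.775 and t is about 2.12, the same t as the slope test, since the two tests are equivalent in simple regression.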
Violation of required condition
When the normality requirement is unsatisfied, we can use the Spearman rank correlation coefficient (a nonparametric technique) in place of the t-test of p