HBX- BA - 4 Flashcards
Regression Analysis
One of the most powerful and commonly used statistical tools. Has two primary purposes:
- Used to identify a mathematical relationship between a dependent variable and one or more independent variables.
- Regression analysis can be used to forecast the behavior of the dependent variable and/or to better understand the nature of the relationship between the dependent and the independent variable(s).
What is the regression line?
- The regression line is the line that minimizes the dispersion of points around it, and we measure the accuracy of the regression line by measuring that dispersion.
- (On the graph below) We attribute the difference between the actual data points and the values predicted by the regression line either to relationships between selling price and variables other than house size or to chance alone.
Linear Regression
A specific form of regression analysis that examines the linear relationship between a dependent variable and one or more independent variables. Linear regression analysis identifies the “best fit line,” the line that minimizes the sum of squared error terms between the observed values in the sample and the predicted values that lie on the regression line. This best-fit line is called the regression line.
Single variable linear regression
Single variable linear regression can be seen as an extension of hypothesis testing. We have learned to use hypothesis tests to determine whether or not there is a significant relationship between two variables. Single variable linear regression seeks to identify a linear relationship between two variables.
A single variable regression line can be described by the equation:
ŷ = a + bx
- ŷ is the expected value of y, the dependent variable, for a given value of x.
- a is the y-intercept of the line, the point at which the regression line intersects the vertical axis. This is the value of ŷ when the independent variable, x, is set equal to 0.
- b is the slope, the average change in the dependent variable y as the independent variable x increases by one.
- x is the independent variable, the variable we are using to help us predict or better understand the dependent variable.
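The pieces of ŷ = a + bx can be made concrete with a small calculation. The following is a minimal Python sketch, not the course's method (the course uses Excel). The data values and the helper name fit_line are hypothetical, chosen only for illustration. It uses the standard least-squares formulas for the slope and intercept.

```python
# Hypothetical sample: house sizes (square feet) and selling prices (dollars).
sizes = [1000, 1500, 2000, 2500, 3000]
prices = [270000, 390000, 530000, 650000, 780000]


def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y_hat = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: sum of cross-deviations divided by the sum of squared x-deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    # Intercept: the least-squares line passes through (mean_x, mean_y).
    a = mean_y - b * mean_x
    return a, b


a, b = fit_line(sizes, prices)
print(f"y_hat = {a:.2f} + {b:.2f} * x")
```

For this made-up sample the slope works out to 256 dollars per square foot and the intercept to 12,000 dollars, so each additional square foot raises the expected price by 256 dollars on average.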
Single Variable Linear Regression analysis is used to identify the best fit line between two variables. This analysis builds on two previous concepts we have used to study relationships between two variables:
- Scatter plots, which are useful for visualizing a relationship between two variables.
- The correlation coefficient, a value between -1 and 1 that measures the strength and direction (positive or negative) of the linear relationship between two variables.
As we learned earlier in the course, we typically use Greek letters (like σ) to refer to the “true” parameters associated with a population and Latin letters (like s) to refer to the estimates of those parameters we calculate from sample data. Similarly, we refer to the best fit line we obtain from our sample data as ŷ = a + bx to distinguish it from ŷ = α + βx, the idealized equation that represents the “true” best fit line.
Because the best fit line does not perfectly fit even the population data, we add an error term, ε, to the true equation: y=α+βx+ε.
The error term is the difference between the actual value of y and the expected value of y. That is, ε=y−ŷ.
REVIEW- How to:
- Make a Scatter Plot in Excel
- Find the Correlation Coefficient in Excel
Create a Scatter Plot in Excel
- Insert -> Scatter -> Scatter With Only Markers
- Input Y range (Ex: C1:C11)
- Input X range (Ex: B1:B11)
- Check the “Labels in First Box”
Correlation Coefficient
A measure of the strength of a linear relationship between two variables. The correlation coefficient can range from -1 to +1. A correlation coefficient of -1 indicates a perfect negative linear relationship between two variables, whereas a correlation coefficient of +1 indicates a perfect positive linear relationship. A correlation coefficient of 0 indicates that no linear relationship exists between two variables, though it is possible that a non-linear relationship exists between the two variables.
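The three reference cases (-1, +1, and 0) can be checked numerically. This is a rough Python sketch of the usual Pearson correlation formula. The function name correlation and the tiny data sets are assumptions made for illustration. The last case uses y = x², where a clear non-linear relationship exists but the correlation coefficient is 0.

```python
import math


def correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data sets, one per reference case.
r_positive = correlation([1, 2, 3, 4], [2, 4, 6, 8])      # perfect positive line
r_negative = correlation([1, 2, 3, 4], [8, 6, 4, 2])      # perfect negative line
r_curved = correlation([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])  # y = x**2, not linear

print(r_positive, r_negative, r_curved)
```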
250
Pick two points on the x-axis—let’s say 1,000 and 2,000—and see what the corresponding points are on the y-axis. According to the regression line, the expected selling price of a 1,000 square foot house is approximately $250,000, and for a 2,000 square foot house is approximately $500,000. Therefore, as house size increases by 1,000 square feet, price increases, on average, by approximately $250,000. To find the average change in price as house size increases by one square foot, we divide $250,000 by 1,000. We find that as house size increases by one square foot, price increases, on average, by approximately $250.
Option D
Since there is no obvious linear relationship between the variables, a line that is almost horizontal is most accurate. The line is positioned close to the average y-value.
How to Add the Best Fit Line to a Scatter Plot
Create a Scatter Plot in Excel
- Insert -> Scatter -> Scatter With Only Markers
- Input Y range (Ex: C1:C11)
- Input X range (Ex: B1:B11)
- Check the “Labels in First Box”
- Insert -> Chart Tools -> Layout -> Trendline
- Check the Display Equation box to display the equation of the best fit line
Given the regression equation,
Selling Price= 13,490.45 + 255.36(HouseSize),
which of the following values represents the average change in selling price as house size increases by one square foot?
255.36
255.36 dollars/square foot is the line’s slope, which is equal to the average change in selling price as house size increases by one square foot.
Given the regression equation, Selling Price = 13,490.45 + 255.36(HouseSize),
what value represents the value of HouseSize at which the regression line intersects the horizontal axis?
-52.83 square feet
The regression line intersects the horizontal axis when Selling Price = $0, that is, when House Size = -52.83 square feet: 13,490.45 + 255.36*(-52.83) ≈ $0. (The unrounded value is -52.82914, which rounds to -52.83.)
Given the regression equation, Selling Price = 13,490.45 + 255.36(HouseSize), which of the following values represents the value of SellingPrice at which the regression line intersects the vertical axis?
$13,490.45
13,490.45 is the y-intercept, the value at which the regression line intersects the y-axis. This happens when House Size = 0, giving the equation: Selling Price = 13,490.45+255.36*0 = 13,490.45
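The slope, the horizontal-axis crossing, and the vertical-axis crossing can all be read off the same equation. As a quick check, here is a small Python sketch using the coefficients quoted in these cards. The function name expected_price is a hypothetical helper.

```python
# Coefficients from the regression equation quoted in the cards.
intercept = 13490.45   # dollars
slope = 255.36         # dollars per square foot


def expected_price(house_size):
    """Expected selling price for a given house size in square feet."""
    return intercept + slope * house_size


# Vertical-axis crossing: the expected price when HouseSize = 0.
y_intercept = expected_price(0)

# Horizontal-axis crossing: solve 0 = intercept + slope * size for size.
x_intercept = -intercept / slope

# Slope: the change in expected price for one extra square foot.
change_per_square_foot = expected_price(1001) - expected_price(1000)

print(y_intercept, round(x_intercept, 2), round(change_per_square_foot, 2))
```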
Given the general regression equation, ŷ =a+bx, which of the following describes ŷ? Select all that apply.
- The expected value of y
- The expected value of x
- The independent variable
- The dependent variable
- The value we are trying to predict
- The intercept
- The expected value of y
- The dependent variable
- The value we are trying to predict
How to Forecast in Excel
Using the equation ŷ =a+bx
- Plug in the numbers! (See photo below)
Use Excel’s FORECAST function: (we didn’t use this in the module, but it was mentioned)
=FORECAST(x, known_y’s, known_x’s)
- x is the data point for which you want to predict a value.
- known_y’s is the dependent array or range of data.
- known_x’s is the independent array or range of data.
- In order to use this function we must have the original data. This approach also gives us a point forecast, but does not provide other helpful values that Excel’s regression tool produces.
WHATEVER THE CASE, MAKE SURE THE PREDICTION IS WITHIN THE RANGE OF HISTORICAL DATA OR IT IS NOT A GOOD FORECAST.
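For readers who want to see the same idea outside Excel, here is a minimal Python sketch of a point forecast with a range check. It is an illustration under stated assumptions, not Excel's implementation. The function forecast is a hypothetical helper whose argument order simply mirrors the FORECAST signature above, and the sample data are made up.

```python
# Hypothetical historical data: house sizes (square feet) and prices (dollars).
sizes = [1000, 1500, 2000, 2500, 3000]
prices = [270000, 390000, 530000, 650000, 780000]


def forecast(x, known_ys, known_xs):
    """Return (point_forecast, within_range) from a least-squares line.

    within_range is False when x lies outside the historical x values,
    in which case the forecast should not be trusted.
    """
    n = len(known_xs)
    mean_x = sum(known_xs) / n
    mean_y = sum(known_ys) / n
    sxx = sum((xi - mean_x) ** 2 for xi in known_xs)
    sxy = sum((xi - mean_x) * (yi - mean_y)
              for xi, yi in zip(known_xs, known_ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    within_range = min(known_xs) <= x <= max(known_xs)
    return a + b * x, within_range


inside_value, inside_ok = forecast(2200, prices, sizes)
outside_value, outside_ok = forecast(7000, prices, sizes)
print(inside_value, inside_ok)
print(outside_value, outside_ok)
```

Returning a flag rather than refusing to compute is a design choice made for this sketch. The number is still produced for a 7,000-square-foot house, but the flag marks it as an extrapolation beyond the historical data.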
Prediction Interval
Rather than predicting just a single point, we construct an interval, or range, around the point forecast.
A prediction interval is a range of values constructed around a point forecast. The center of the prediction interval is the point forecast, that is, the expected value of y for a specified value of x. The range of the interval extends above and below the point forecast. The width of the interval is based on the standard error of the regression and the desired level of confidence in the prediction.
- The center of the prediction interval is the point forecast, in the case below about $525,000. The standard error of the regression, in this case about $151,000, is a reasonable but conservative estimate of the forecast’s standard deviation. The standard error of the regression is easily found in a regression output table.
- We have to choose a level of confidence for our prediction interval. A 95% prediction interval would run about two standard deviations above and below the point forecast. To forecast the price of a 2,000-square-foot home, the 95% prediction interval would be about $525,000 plus or minus two times $151,000.
- With this, we are able to say that we are 95% confident that the actual selling price will fall within the prediction interval.
- Since there is greater uncertainty when we forecast further from the mean of the independent variable, we can infer that the prediction interval should be wider as we move away from the average house size. So although the standard error is a reasonable estimate on which to base our range, the actual calculation is more complicated. As we move towards and then beyond the edges of the historical data, the width of the distribution around the point forecast increases. In this case, a 95% prediction interval for the selling price of a 7,000-square-foot home would be much wider than that for a 2,000-square-foot home.
(4.3.2)
The best point forecast for the selling price of a 2,500 square foot house is the expected selling price of a 2,500 square foot home, approximately 13,490 + 255.36(2,500) = $652,000. Given that the standard error of the regression is about $151,000, which of the following would give the BEST estimate for the prediction interval for a 2,500 square foot home with approximately 95% confidence?
$652,000 ± 2($151,000)
A prediction interval is centered at a point forecast, in this case $652,000. The standard error of the regression is multiplied by 2 since we wish to estimate the prediction interval at the 95% confidence level. Note that we are using 2 to approximate the z-value for a 95% prediction interval. The actual z-value corresponding to 95% (for sufficiently large samples) is 1.96.
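The arithmetic behind this estimate is short enough to write out. Here is a small Python sketch using the figures from this card. The function name approx_prediction_interval is a hypothetical helper, and the default multiplier of 2 is the rounded stand-in for 1.96 described above.

```python
def approx_prediction_interval(point_forecast, standard_error, multiplier=2.0):
    """Approximate interval: point forecast plus or minus multiplier * SE."""
    margin = multiplier * standard_error
    return point_forecast - margin, point_forecast + margin


# Figures quoted in the card: forecast of $652,000, standard error of $151,000.
low, high = approx_prediction_interval(652000, 151000)
print(low, high)  # 350000.0 954000.0

# Using 1.96 instead of 2 narrows the interval slightly.
low_z, high_z = approx_prediction_interval(652000, 151000, multiplier=1.96)
print(low_z, high_z)
```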
The standard error of the regression is...
a reasonable but conservative estimate of the forecast’s standard deviation.
How would the width of the actual prediction interval (at a 95% confidence level) for a 3,000 square foot home differ from the width of the actual prediction interval (at a 95% confidence level) for a 2,000 square foot home, given that the average home size is approximately 1,750 square feet?
The width of the actual prediction interval for a 3,000 square foot home would be larger than the width of the prediction interval for a 2,000 square foot home.
Because 3,000 square feet is further from the mean house size (1,750 square feet) than 2,000 square feet, the actual prediction interval at 3,000 square feet will be wider.
The width of the actual prediction interval is based on both the standard error of the regression and the distance from the mean; the actual prediction interval gets wider as the value of the independent variable moves further from the mean of the independent variable.
The image below compares prediction intervals created using both of the methods we have discussed. The red dashed lines show the actual prediction intervals for different house sizes. The blue dashed lines represent our method of estimating the prediction interval using the standard error of the regression. Note that the actual prediction intervals widen as house size moves further from the mean, whereas the estimated prediction intervals do not; they are parallel to the regression line.
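The widening can be reproduced with the standard textbook formula for the standard error of a prediction, s·sqrt(1 + 1/n + (x0 − x̄)²/Sxx), where s is the standard error of the regression. The Python sketch below is only an illustration. The data are hypothetical (chosen so the mean house size is 1,750 square feet), the helper name interval_half_width is made up, and a multiplier of 2 is assumed in place of the exact t-value.

```python
import math

# Hypothetical sample with a mean house size of 1,750 square feet.
sizes = [1000, 1250, 1500, 1750, 2000, 2250, 2500]
prices = [280000, 310000, 420000, 430000, 560000, 570000, 660000]


def interval_half_width(x0, xs, ys, multiplier=2.0):
    """Half-width of an approximate 95% prediction interval at x0."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    a = mean_y - b * mean_x
    # Standard error of the regression: sqrt(SSE / (n - 2)).
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    s = math.sqrt(sse / (n - 2))
    # The last term grows as x0 moves away from the mean of x.
    return multiplier * s * math.sqrt(1 + 1 / n + (x0 - mean_x) ** 2 / sxx)


for size in (1750, 2000, 3000):
    print(size, round(interval_half_width(size, sizes, prices)))
```

The half-width is smallest at the mean house size and grows with distance from it, which is why the interval at 3,000 square feet is wider than the one at 2,000 square feet.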
Yes
3,500 lies well within the range of our historical housing data, so we can feel relatively comfortable with this prediction.
Given the regression equation, SellingPrice = 13,490.45 + 255.36(HouseSize), what do you expect the selling price of a 425 square foot home to be?
What is the vertical distance between a data point and the line?
The Residual Error.
This error is the difference between the observed value and the line’s prediction for the dependent variable. This difference may be due to other factors that influence selling price or just plain chance. Collectively, the residuals for all the data points measure how accurately a line fits a data set.
Variation unexplained by the regression line!
The Sum of Squared Errors, or the Residual Sum of Squares
The amount of variation that is not explained by the regression line. The residual sum of squares is equal to the sum of the squared residuals, that is, the sum of the squared differences between the observed values of the dependent variable and the predicted values of the dependent variable. To calculate the residual sum of squares, subtract the regression sum of squares from the total sum of squares.
For this, we take the square of each distance and then add all of those squared terms together.
A regression line is formally defined as the line that minimizes the sum of squared errors.
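Both statements above, that the residual sum of squares equals the total sum of squares minus the regression sum of squares, and that the regression line minimizes the sum of squared errors, can be checked on a small example. The following Python sketch uses hypothetical data and hypothetical helper names (fit_line, sum_squared_errors). It is meant as an illustration, not as the course's procedure.

```python
# Hypothetical sample: house sizes (square feet) and selling prices (dollars).
sizes = [1000, 1500, 2000, 2500, 3000]
prices = [270000, 390000, 530000, 650000, 780000]


def fit_line(xs, ys):
    """Least-squares intercept a and slope b for y_hat = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return mean_y - b * mean_x, b


def sum_squared_errors(a, b, xs, ys):
    """Square each residual (observed minus predicted) and add them up."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


a, b = fit_line(sizes, prices)
mean_price = sum(prices) / len(prices)

sse = sum_squared_errors(a, b, sizes, prices)               # unexplained
sst = sum((y - mean_price) ** 2 for y in prices)            # total
ssr = sum((a + b * x - mean_price) ** 2 for x in sizes)     # explained

print(sse, sst - ssr)

# Nudging the intercept or the slope away from the fit increases the SSE.
print(sum_squared_errors(a + 1000, b, sizes, prices) > sse)
print(sum_squared_errors(a, b + 1, sizes, prices) > sse)
```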