CHAPTER 6 / QUESTIONS Flashcards
After running a regression, you find that the model yields an SEE of 5,000. Is this a good result? What are the problems with using SEE as a measure of “goodness of fit”?
SEE (the standard error of the estimate) is an absolute measure, so its size alone tells us very little; an SEE of 5,000 may be excellent for million-dollar properties but poor for inexpensive ones. To use the SEE to analyze "goodness of fit", we convert it to a Coefficient of Variation (COV) by dividing the SEE by the mean of the dependent variable (e.g., the mean sale price). The COV tells us how well the model is doing in relative terms (as a percentage). A target COV for a good model is less than 10%.
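The conversion is a one-line calculation. A minimal sketch, where the SEE of 5,000 is taken from the question and the mean sale price of 250,000 is a hypothetical value chosen for illustration:

```python
# Converting an absolute SEE into a relative COV.
# SEE of 5,000 is from the question; the mean price is hypothetical.
see = 5_000           # standard error of the estimate (same units as price)
mean_price = 250_000  # mean of the dependent variable (sale price)

cov = see / mean_price  # coefficient of variation, as a fraction

print(f"COV = {cov:.1%}")  # COV = 2.0%, well under the 10% target
```

With a hypothetical mean of $250,000, the same SEE of 5,000 translates to a COV of only 2%, which would be a very good result.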
Explain the difference between simple linear regression and multiple regression.
Multiple regression includes two or more independent variables; simple linear regression includes only one. Multiple regression is difficult to depict spatially since it involves three or more dimensions, while simple linear regression is two-dimensional and can be readily displayed on a graph. Multiple regression also involves much more complex calculations, which can produce more robust outcomes but makes the results harder for clients and other real estate professionals to understand.
Based on your regression analysis of condominium sales in Nanaimo, you determine: Market Value = $42,000 + $70(living area in sqft) + $5,000(number of bathrooms)
What do the coefficients in the equation represent? How many bathrooms would you expect a 1,000 square-foot condo that sells for $119,500 to have?
The $70 coefficient represents the contribution to value of each square foot of living area, while the $5,000 coefficient represents the value of each additional bathroom. These coefficients can also be used as adjustment factors. Using the equation, a 1,000 sqft condo selling for $119,500 would have 1.5 bathrooms: $119,500 = $42,000 + $70(1,000 sqft) + $5,000(b), which solves to b = 1.5.
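The arithmetic above can be checked by rearranging the equation to isolate the bathroom count. A minimal sketch using the figures from the question:

```python
# Rearranging the condo equation to solve for bathrooms:
# Market Value = 42,000 + 70*(sqft) + 5,000*(bathrooms)
price = 119_500
constant = 42_000
sqft_coef, sqft = 70, 1_000
bath_coef = 5_000

bathrooms = (price - constant - sqft_coef * sqft) / bath_coef
print(bathrooms)  # 1.5
```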
Additive multiple regression includes a major assumption: that the impact of the coefficient for a specific independent variable, e.g., x1, is independent of the impact of the other variables, e.g., x2, x3, x4, etc. In other words, the impact of one independent variable on the dependent variable Y is assumed to be unrelated to changes in any other independent variable. When this assumption turns out to be false, what problem do we have? How can it be overcome?
Multicollinearity. This issue can be identified during initial data exploration. If two independent variables show high correlation, there is a potential for problems in the model. It may be necessary to exclude one or more variables from the model and re-test the regression. The Tolerance and VIF statistics are tests for multicollinearity which can be applied to regression models. A low Tolerance (less than 0.3) and high VIF (greater than 3.333) outcome is a warning sign that multicollinearity exists.
You conduct a regression analysis of detached single family housing prices in Langley, and then use the regression formula to calculate predicted values for your data set and the residuals (actual sales price -predicted value). What kind of results should you expect when you analyze the descriptive statistics for the residuals?
The residuals should have a mean of zero because the regression model finds the line of best fit by minimizing the squared residuals; with a constant term in the model, the positive and negative residuals cancel exactly. The median, however, may be positive or negative depending on the skewness of the distribution of the residuals.
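This mean-zero property can be demonstrated directly. A minimal sketch using synthetic housing data (the coefficients and noise level are invented for illustration):

```python
# OLS residuals average to zero when the model includes a constant.
# Data are synthetic: price is built from sqft plus random noise.
import numpy as np

rng = np.random.default_rng(0)
sqft = rng.uniform(800, 2500, size=50)
price = 42_000 + 70 * sqft + rng.normal(0, 5_000, size=50)

slope, intercept = np.polyfit(sqft, price, 1)   # simple linear regression
residuals = price - (intercept + slope * sqft)  # actual - predicted

print(residuals.mean())  # effectively zero (floating-point noise only)
```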
The first step in testing for multicollinearity is conducted during data-screening where the correlation of each of the independent variables is determined. What other steps can be taken to ensure that multicollinearity is not present in your model?
After creating your model, you should examine both the Tolerance and VIF statistics for each variable, where Tolerance = 1/VIF. If the Tolerance of any of the variables is less than 0.3, or equivalently, the VIF is greater than 3.333, multicollinearity exists and the model should be revised.
1. A high SEE indicates:
- a better result.
- a worse result.
- that multiple regression analysis is not a viable option.
- multicollinearity exists.
Answer: (2)
The lower the SEE, or standard error of the estimate, the better the model result.
2. A high VIF indicates:
- multicollinearity is not present.
- multicollinearity is present.
- the Tolerance is also high.
- the correlation coefficient is significant.
Answer: (2)
A high VIF indicates multicollinearity. VIF and Tolerance are inversely related, so a high VIF would mean a low Tolerance.
3. A COV under 10% indicates:
- a good result.
- a poor result.
- that multiple regression analysis is not a viable option.
- multicollinearity exists.
Answer: (1)
A low COV (coefficient of variation) indicates a good result. Because the COV measures relative error, a higher number indicates less accuracy in the model.
4. Consider the following statistics for two samples of apartment rents versus suite size for rental
apartments in Waterloo:
Dataset A, luxury high-rise concrete construction - R2 of .732 and SEE of 6,000
Dataset B, older 3-storey frame walk-up construction - R2 of .635 and SEE of 8,673
Rents are much higher in Dataset A than Dataset B. In which dataset would a regression equation more accurately predict the apartment rent?
(1) Dataset A since the R2 is higher and SEE lower than dataset B.
(2) Dataset B since the R2 is lower and SEE higher than dataset A.
(3) Both datasets would have equal statistical reliability.
(4) Impossible to determine because the R2 and SEE are based on absolute values, so relative comparisons are not possible.
Answer: (1)
The R2 value is higher in dataset A than dataset B, and the SEE is lower; dataset A therefore both explains more of the variation and does so with less error. SEE is an absolute measure, which makes comparisons difficult without more information about the samples. However, because we know that rents are higher in dataset A, its lower SEE is even more convincing.
Consider the alternative: if the SEE for dataset A was higher, we would not be able to be certain if it was higher because of higher rents or higher because of more error. Because the SEE for dataset A is in fact lower, we can conclude the predictive error is in fact lower.
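Converting each dataset's absolute SEE into a relative COV makes this comparison direct. A minimal sketch where the SEE values come from the question but the mean rents are hypothetical (chosen only to reflect that rents are higher in Dataset A):

```python
# Relative comparison of two SEEs via COV. SEE values are from the
# question; the mean rents are hypothetical illustrative figures.
see_a, mean_rent_a = 6_000, 30_000   # Dataset A (higher rents)
see_b, mean_rent_b = 8_673, 18_000   # Dataset B (lower rents)

cov_a = see_a / mean_rent_a
cov_b = see_b / mean_rent_b
print(f"A: {cov_a:.1%}, B: {cov_b:.1%}")  # A: 20.0%, B: 48.2%
```

With these hypothetical means, Dataset A's relative error is well under half of Dataset B's, reinforcing the conclusion.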
6. What problem might you encounter if you use the regression equation for the Regina data to predict the sale price of a 1,800 square foot rental apartment?
(1) The regression does not account for land size.
(2) The regression line is only based on data up to 1,143 square feet and the relationship may change above this range.
(3) There may not be a causal relationship between the two variables.
(4) There are too many outliers.
Answer: (2)
It is risky to extrapolate relationships beyond the range of the dataset, since the relationship may no longer be linear, or its slope may change. A larger dataset covering the full range should be collected.
7. In the Regina example, the regression equation contained a constant of 17,918. What is another way of expressing this constant?
(1) Minimum condominium price.
(2) If the regression line is graphed, it will intercept the X-axis at square feet = 17,918.
(3) The mean difference between the regression line and all observations.
(4) If the regression line is graphed, it will intercept the Y-axis at $17,918.
Answer: (4)
The constant in the regression equation is the number at which the regression line intercepts the Y-axis.
8. Run a linear regression of Sale Price against Unit#. What can you conclude about the outcome?
(1) The Adjusted R-Squared of 0.026 indicates 97.4% of the variation in sale price is explained by unit number.
(2) A one unit increase in unit number is worth $75 in value.
(3) The small F-statistic provides confidence that the model results are significant.
(4) None of the above.
Answer: (4)
Using unit number to predict sale price produces a poor model, which makes sense intuitively. Only 2.6% of the variation in sale price is explained by unit number. The model indicates a one-unit increase in unit number increases value by $65, not $75. Moreover, the F-statistic is small, reducing confidence in the model results.
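The "poor model" outcome is easy to reproduce. A minimal sketch with synthetic data in which sale price is, by construction, unrelated to unit number, so the R-squared lands near zero:

```python
# Regressing sale price on an unrelated variable (unit number)
# yields a near-zero R-squared. Data are synthetic.
import numpy as np

rng = np.random.default_rng(2)
unit = np.arange(1, 101)                       # unit numbers 1..100
price = rng.normal(250_000, 20_000, size=100)  # prices unrelated to unit

slope, intercept = np.polyfit(unit, price, 1)
pred = intercept + slope * unit
r2 = 1 - np.sum((price - pred) ** 2) / np.sum((price - price.mean()) ** 2)

print(round(r2, 3))  # close to zero: unit number explains almost nothing
```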
9. The standard error of the estimate is a good statistical tool for measuring:
(1) a mathematical expression of the best fit of ordered pairs.
(2) the percentage of variation in Y that can be explained by the regression line.
(3) the amount of dispersion of the observed data around the regression line.
(4) None of the above.
Answer: (3)
The SEE is a measure of the amount of dispersion of the observed data around the regression line. R2 represents the percentage of the variation in Y (the dependent variable) that is explained by the regression equation.