CHAPTER 6 / QUESTIONS Flashcards

2
Q

After running a regression, you find that the model yields an SEE of 5,000. Is this a good result? What are the problems with using SEE as a measure of “goodness of fit”?

A

SEE is an absolute measure, meaning its size alone does not tell us very much. To use SEE to analyze "goodness of fit", we must convert it to a coefficient of variation (COV) by dividing the SEE by the mean of the dependent variable. The COV tells us how well our model is doing in relative terms (as a percentage). A target COV for a good model is less than 10%.
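The conversion above can be sketched in a few lines. The mean sale price used here is a hypothetical figure for illustration only, not a value from the question:

```python
# Converting an absolute SEE into a relative COV.
# mean_price is a hypothetical figure, not from the question.
see = 5000          # standard error of the estimate from the regression
mean_price = 60000  # hypothetical mean of the dependent variable

cov = see / mean_price     # coefficient of variation
print(f"COV = {cov:.1%}")  # under the 10% target for a good model
```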

3
Q

Explain the difference between simple linear regression and multiple regression.

A

Multiple regression includes two or more independent variables; simple linear regression includes only one. Multiple regression is difficult to depict spatially since it involves three or more dimensions, while simple linear regression is in two dimensions and can be readily displayed in a graph. Multiple regression involves much more complex calculations than simple linear regression, which results in more robust outcomes but is more difficult for clients and other real estate professionals to understand.

4
Q

Based on your regression analysis of condominium sales in Nanaimo, you determine: Market Value = $42,000 + $70(living area in sqft) + $5,000(number of bathrooms)

What do the coefficients in the equation represent? How many bathrooms would you expect a 1,000 square-foot condo that sells for $119,500 to have?

A

The $70 coefficient represents the value that each square foot of living area contributes, while the $5,000 coefficient represents the value of each additional bathroom. These coefficients can also be used as adjustment factors. Using the equation, a 1,000 sqft condo selling for $119,500 would have 1.5 bathrooms: $119,500 = $42,000 + $70(1,000) + $5,000(baths), so baths = ($119,500 - $112,000) / $5,000 = 1.5.
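The arithmetic can be checked with a short script using the coefficients from the equation above:

```python
# Solving the regression equation for the number of bathrooms.
price = 119_500
constant = 42_000
area_coef, area = 70, 1_000
bath_coef = 5_000

bathrooms = (price - constant - area_coef * area) / bath_coef
print(bathrooms)  # 1.5
```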

5
Q

Additive multiple regression includes a major assumption: the impact of the coefficient for a specific independent variable, e.g., x1, is independent of the impact of the other variables, e.g., x2, x3, x4, etc. In other words, the impact of one independent variable on the dependent variable Y is assumed to not be related to changes in another independent variable. When this assumption turns out to be false, what problem do we have? How can this issue be overcome?

A

Multicollinearity. This issue can be identified during initial data exploration. If two independent variables show high correlation, there is a potential for problems in the model. It may be necessary to exclude one or more variables from the model and re-test the regression. The Tolerance and VIF statistics are tests for multicollinearity which can be applied to regression models. A low Tolerance (less than 0.3) and high VIF (greater than 3.333) outcome is a warning sign that multicollinearity exists.
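The data-exploration step can be sketched with a simple correlation check. All data below is hypothetical; "area" and "rooms" are constructed to move together, which flags a potential multicollinearity problem before modelling:

```python
# Screening independent variables for high pairwise correlation.
# All data here is hypothetical, generated for illustration.
import numpy as np

rng = np.random.default_rng(0)
area = rng.uniform(600, 1400, size=50)             # living area (sqft)
rooms = area / 300 + rng.normal(0, 0.2, 50)        # closely tracks area
baths = rng.integers(1, 4, size=50).astype(float)  # unrelated to area

r_area_rooms = np.corrcoef(area, rooms)[0, 1]
r_area_baths = np.corrcoef(area, baths)[0, 1]
print(f"area vs rooms: r = {r_area_rooms:.2f}")  # high -> warning sign
print(f"area vs baths: r = {r_area_baths:.2f}")  # low -> acceptable
```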

6
Q

You conduct a regression analysis of detached single family housing prices in Langley, and then use the regression formula to calculate predicted values for your data set and the residuals (actual sale price - predicted value). What kind of results should you expect when you analyze the descriptive statistics for the residuals?

A

The residuals should have a mean of zero because the regression model's function is to find the line of best fit, which minimizes the squared residuals around that line. The median, however, may be positive or negative depending on the skewness in the distribution of predicted values.

7
Q

The first step in testing for multicollinearity is conducted during data-screening where the correlation of each of the independent variables is determined. What other steps can be taken to ensure that multicollinearity is not present in your model?

A

After creating your model, you should examine both the Tolerance and VIF statistics for each variable, where Tolerance = 1/VIF. If the Tolerance of any of the variables is less than 0.3, or equivalently, the VIF is greater than 3.333, multicollinearity exists and the model should be revised.
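The Tolerance and VIF checks can be computed by hand: regress each independent variable on the remaining predictors, then VIF = 1/(1 - R2) and Tolerance = 1/VIF. A sketch with hypothetical data, where two predictors are deliberately near-redundant:

```python
# Computing VIF and Tolerance for each predictor by hand.
# Data is hypothetical; 'useable' is built to nearly duplicate 'gross'.
import numpy as np

def vif(X):
    """Return the VIF for each column of the predictor matrix X."""
    n, k = X.shape
    out = []
    for j in range(k):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])    # add an intercept
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1 - resid.var() / y.var()               # R^2 of x_j on the rest
        out.append(1 / (1 - r2))
    return out

rng = np.random.default_rng(1)
gross = rng.uniform(500, 2000, 100)             # gross rentable area
useable = 0.8 * gross + rng.normal(0, 30, 100)  # nearly redundant
floor = rng.integers(1, 20, 100).astype(float)  # independent of area

vifs = vif(np.column_stack([gross, useable, floor]))
for name, v in zip(["gross", "useable", "floor"], vifs):
    print(f"{name}: VIF = {v:.2f}, Tolerance = {1/v:.3f}")
# gross and useable exceed the VIF > 3.333 warning threshold; floor does not
```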

8
Q

1. A high SEE indicates:

  1. a better result.
  2. a worse result.
  3. that multiple regression analysis is not a viable option.
  4. multicollinearity exists.
A

Answer: (2)

The lower the SEE, or standard error of the estimate, the better the model result.

9
Q

A high VIF indicates:

  1. multicollinearity is not present.
  2. multicollinearity is present.
  3. the Tolerance is also high.
  4. the correlation coefficient is significant.
A

Answer: (2)
A high VIF indicates multicollinearity. VIF and Tolerance are inversely related, so a high VIF would mean a low Tolerance.

10
Q

A COV under 10% indicates:

  1. a good result.
  2. a poor result.
  3. that multiple regression analysis is not a viable option.
  4. multicollinearity exists.
A

Answer: (1)
A low COV (coefficient of variation) indicates a good result. Because COV is an indication of error, a higher number is an indication of less accuracy in the model.

11
Q

4. Consider the following statistics for two samples of apartment rents versus suite size for rental apartments in Waterloo:

Dataset A, luxury high-rise concrete construction: R2 of .732 and SEE of 6,000
Dataset B, older 3-storey frame walk-up construction: R2 of .635 and SEE of 8,673

Rents are much higher in Dataset A than Dataset B. In which dataset would a regression equation more accurately predict the apartment rent?

(1) Dataset A since the R2 is higher and SEE lower than dataset B.

(2) Dataset B since the R2 is lower and SEE higher than dataset A.

(3) Both datasets would have equal statistical reliability.

(4) Impossible to determine because the R2 and SEE are based on absolute values, so relative comparisons are not possible.

A

Answer: (1)

The R2 value is higher in sample A than sample B, and the SEE is lower. Therefore, dataset A appears both to explain more variation and to do so with less error. SEE is an absolute measure, which makes comparisons difficult without more information about the samples. However, because we know that rents are higher in dataset A, its lower SEE is even more convincing.

Consider the alternative: if the SEE for dataset A was higher, we would not be able to be certain if it was higher because of higher rents or higher because of more error. Because the SEE for dataset A is in fact lower, we can conclude the predictive error is in fact lower.

12
Q

6. What problem might you encounter if you use the regression equation for Regina1 data to predict the sale price of a 1,800 square foot rental apartment?

(1) The regression does not account for land size.

(2) The regression line is only based on data up to 1,143 square feet and the relationship may change above this range.

(3) There may not be a causal relationship between the two variables.

(4) There are too many outliers.

A

Answer: (2)

It is risky to extrapolate relationships beyond the dataset since the relationships may no longer be linear, or the linearity may change. A larger dataset should be collected.

13
Q

7. In the Regina1 example, the regression equation contained a constant of 17,918. What is another way of expressing this constant?

(1) Minimum condominium price.

(2) If the regression line is graphed, it will intercept the X-axis at square feet = 17,918.

(3) The mean difference between the regression line and all observations.

(4) If the regression line is graphed, it will intercept the Y-axis at $17,918.

A

Answer: (4)

The constant in the regression equation is the number at which the regression line intercepts the Y-axis.

14
Q

8. Run a linear regression of Sale Price against Unit#. What can you conclude about the outcome?

(1) The Adjusted R-Squared of 0.026 indicates 97.4% of the variation in sale price is explained by unit number.

(2) A one unit increase in unit number is worth $75 in value.

(3) The small F-statistic provides confidence that the model results are significant.

(4) None of the above.

A

Answer: (4)

Using unit number to predict sale price has produced a poor model, which makes sense intuitively. Only 2.6% of the variation in sale price is explained by unit number. The model indicates a one unit increase in unit number increases value by $65, not $75. However, the F-statistic is small, reducing confidence in the model results.

15
Q

9. The standard error of the estimate is a good statistical tool for measuring:

(1) a mathematical expression of the best fit of ordered pairs.

(2) the percentage of variation in Y that can be explained by the regression line.

(3) the amount of dispersion of the observed data around the regression line.

(4) None of the above.

A

Answer: (3)

The SEE is a measure of the amount of dispersion of the observed data around the regression line. R2 represents the percentage of the variation in Y (the dependent variable) that is explained by the regression equation.
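The dispersion measure can be sketched directly: SEE is the square root of the sum of squared residuals divided by the degrees of freedom (n - 2 for simple regression). The data below is hypothetical:

```python
# SEE as dispersion of observations around the fitted line.
# Hypothetical sale price vs living area data with ~6,000 of noise.
import numpy as np

rng = np.random.default_rng(2)
area = rng.uniform(700, 1300, 40)
price = 42_000 + 70 * area + rng.normal(0, 6_000, 40)

slope, intercept = np.polyfit(area, price, 1)
resid = price - (intercept + slope * area)
n = len(price)
see = np.sqrt((resid ** 2).sum() / (n - 2))  # degrees of freedom: n - 2
print(round(see))  # roughly the noise level built into the data
```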

16
Q

10. Consider a model where the dependent variable is sale price and the independent variable is age of building. This resulted in the following regression equation: Y = 100,500 - 960X and an R2 of 0.8. What can you conclude about these results?

(1) Each year adds $960 to value.

(2) Weak negative correlation with 64% of the variation in sale price explained by building age.

(3) Strong negative correlation with 80% of the variation in sale price explained by building age.

(4) A 1 year old building is worth $101,460.

A

Answer: (3)

The negative sign of the regression coefficient indicates negative correlation. The R2 at 0.8 indicates a strong correlation. Each year of age reduces value by $960.
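Evaluating the equation directly shows why option (4) fails: the negative coefficient means age reduces value, so a 1 year old building is worth less than the constant, not more:

```python
# Predicted value from Y = 100,500 - 960X, where X is building age.
def predicted_price(age):
    return 100_500 - 960 * age

print(predicted_price(0))  # 100500 -- a new building (the intercept)
print(predicted_price(1))  # 99540, not the $101,460 claimed in option (4)
```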

17
Q

11. What is the advantage of multiple regression over simple linear regression?

(1) Helps deal with non-linear relationships.

(2) Provides the analyst an opportunity to a…

A

Answer: (3)

Simple linear regression only considers one independent variable. However, in reality, many independent variables may affect the dependent variable.

18
Q

13. The Regina2 dataset has high correlation among all the variables. Does this mean that it will always have a high likelihood of predicting the sale price for any combination of the variables within the model parameters?

(1) Yes, the regression has accounted for virtually all the variation in the dependent variable.

(2) Not always, since there are other factors such as sampling technique, sample size, and COV which should be considered.

(3) Yes, since there is no longer any residual error.

(4) No, since only two of the variables are highly correlated.

A

Answer: (2)

The analyst must consider a variety of factors affecting the model, such as sampling technique or extrapolation issues, before determining the regression equation will produce acceptable predictions.

19
Q

14. Consider the regression statistics for the Regina2 dataset. What would your reaction be if the bathroom variable had a t-statistic of 0.105 and all other statistics for the remaining variables were unchanged?

(1) The bathrooms variable may offer no benefit to the model.

(2) We can no longer be confident that the bathroom coefficient value is correct.

(3) Our confidence in the significance of the bathrooms variable is improved.

(4) Both (1) and (2).

A

Answer: (4)

A t-statistic below the critical value of 2 means we can no longer be confident (at a 95% level) that the variable's coefficient is different from zero. This means we cannot be confident its value is correct, and the variable may offer no benefit to the model.

20
Q

15. Run a boxplot of sale price versus total living area. What can you conclude?

(1) It is difficult to see any clear relationship due to the number of box entries.

(2) A clear correlation between sale price and floor height is evident.

(3) There are sufficient observations for each occurrence of sale price versus total living area.

(4) All of the above.

A

Answer: (1)

The total living area should be transformed into groups so that the relationship between the two variables could be better understood (e.g., fewer boxplots, each with more observations).

21
Q

17. Remove total living area from the model and review the new regression results. Which of the following statements is FALSE?

(1) The Adjusted R-Square decreases to 0.368.

(2) The SEE increases to 11,852.

(3) Floor number’s t-statistic improves, increasing to 1.183.

(4) Number of baths’ t-statistic increases, an improved result.

A

Answer: (3)

When total living area is removed from the model, floor number's t-statistic is reduced to 1.183, but this is a poor result, not an improvement. Because it is less than the critical value of 2, its significance is now questionable. The Sig of .239 also indicates this problem. The other statements are true.

22
Q

18. Consider a dataset with 4 variables: rent, gross rentable area (square feet), useable area (square feet), and floor level. A regression equation has been developed to predict rent using the other three independent variables. The R2 value for the relationship of two independent variables, gross rentable area versus useable area is .832. Would you rely on this model?

(1) No, the model is suspect since it does not contain multicollinearity.

(2) No, since useable area is poorly correlated with rent.

(3) Yes, since the R2 is quite high.

(4) No, variables which demonstrate multicollinearity should not be placed in the same model.

A

Answer: (4)

It would be necessary to exclude one of the variables and re-test the model to determine if the multicollinearity had been removed.

23
Q

19. Consider the unique scenario in which the sale price of single family detached homes is predicted using four variables: total finished area (square feet), lot size (acres), number of fireplaces, and number of bathrooms. Multiple regression analysis can be used to determine the coefficients for each independent variable. Which of the following statements is TRUE?

(1) The independent variable with the largest coefficient will always have the greatest effect on sale price.

(2) The independent variable with the smallest coefficient will always have the least effect on sale price.

(3) The effect of an independent variable’s coefficient depends on its size, but also on the nature of the variable and its unit of measurement.

(4) Total finished area will always have the greatest effect on sale price.

A

Answer: (3)

In general, a larger coefficient value indicates a greater effect. However, this will depend on the units used to measure the coefficient (e.g. acres versus square feet) and the sign (positive or negative). In this example, say the model found living area’s coefficient to be $67.04 per square foot and number of fireplaces’ to be $655. Fireplaces has the larger coefficient, but with only 1 or 2 fireplaces per house, living area very likely has more influence on sale price, even with a much smaller coefficient. Therefore, it is necessary to have a good understanding of the model before reaching conclusions based on the numerical value of the coefficients.
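The fireplace-versus-area comparison can be made concrete by multiplying each coefficient by a typical value of its variable; the figures below repeat the hypothetical coefficients from the answer:

```python
# A coefficient's influence depends on the variable's typical magnitude,
# not just the coefficient's size. Values are hypothetical, per the answer.
area_coef, fireplace_coef = 67.04, 655

typical_area = 1_800    # square feet of living area
typical_fireplaces = 1  # most houses have 1 or 2

area_contrib = area_coef * typical_area
fire_contrib = fireplace_coef * typical_fireplaces
print(round(area_contrib))  # 120672 -- dominates despite smaller coefficient
print(round(fire_contrib))  # 655
```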

24
Q

20. Which of the following statements is TRUE?

(1) You should never remove outliers, as this compromises model results.

(2) Model testing is best carried out on the same sales used in creating the model, for consistency of results.

(3) A high correlation of a model’s predicted values and residuals is a poor result.

(4) All of the above are true.

A

Answer: (3)

A high correlation among a model’s predicted values and residuals may indicate a systematic over- or under-valuation from the model, which is a poor result. For a good quality predictive model, the correlation between the model’s predicted values and residuals should be zero. Option 1 is false, because removing outliers can improve results. However, you must be cautious about over-managing the data — in other words, getting the results you think you should get, versus looking at what the data is actually indicating. Option 2 is false because model testing is ideally carried out on data not used in creating the model.
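The holdout-testing point in option (2)'s correction can be sketched as a simple train/test split on hypothetical data:

```python
# Fit the model on one portion of the sales, test it on the rest.
# All data is hypothetical, generated for illustration.
import numpy as np

rng = np.random.default_rng(3)
area = rng.uniform(700, 1300, 100)
price = 42_000 + 70 * area + rng.normal(0, 5_000, 100)

train, test = slice(0, 80), slice(80, 100)  # 80/20 split
slope, intercept = np.polyfit(area[train], price[train], 1)

resid = price[test] - (intercept + slope * area[test])
mae = np.abs(resid).mean()  # error on sales the model never saw
print(f"holdout mean absolute error: {mae:,.0f}")
```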

25
Q

EVALUATING REGRESSION RESULTS

A

There are six key statistics used in evaluating regression results. Four are measures of goodness of fit and relate to evaluating the predictive accuracy of the equation: the coefficient of determination (R2), the standard error of the estimate (SEE), the coefficient of variation (COV), and the F-statistic.

In different ways, each indicates how well the equation succeeds in predicting sales prices and minimizing errors. The other two statistics, the correlation coefficient (r) and the t-statistic, relate to the importance of individual variables in the model. The statistics we need are in the tables produced above.

26
Q

COEFFICIENT OF DETERMINATION

A

R2 measures how much of the variability in the dependent variable (sale price) is accounted for (or explained) by the regression line. That is, essentially, how good are the estimates of selling price based on this expression involving total square footage of living area.

27
Q

POSSIBLE VALUES OF R2

A

Possible values of R2 range from 0 to 1. When R2 = 0, none of the variation in sales prices is explained by the model. On the other hand, when R2 = 1, all deviations from the average sale price are explained by the regression equation and the sum of the squared errors equals 0. In a one-variable model, this implies that all sales prices lie on a straight line.
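The definition above can be sketched numerically: R2 = 1 - SSE/SST, where SSE is the sum of squared errors around the line and SST the total squared variation around the mean. Data below is hypothetical:

```python
# R^2 as the share of variation explained by the regression line.
# Hypothetical sale price vs living area data.
import numpy as np

rng = np.random.default_rng(4)
area = rng.uniform(700, 1300, 60)
price = 42_000 + 70 * area + rng.normal(0, 8_000, 60)

slope, intercept = np.polyfit(area, price, 1)
pred = intercept + slope * area

sse = ((price - pred) ** 2).sum()          # unexplained variation
sst = ((price - price.mean()) ** 2).sum()  # total variation
r2 = 1 - sse / sst
print(f"R^2 = {r2:.2f}")  # always between 0 and 1
```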

28
Q

R2 ANALYSIS EXAMPLE

A

In our example, we found an R2 of 0.59; this is displayed in the chart above and in the SPSS output.

The R2 statistic measures the percentage of variation in the dependent variable (sale price) explained by the independent variable (living area). If the R2 is 0.59, this means that the regression line is able to explain about 60% of the variation of the sales prices (“variation” refers to the squared differences between sales prices and the average sale price).

In practice, this can be loosely interpreted to mean total living area accounts for about 60% of the purchaser’s decision to buy a specific condo. Or, conversely, total living area determines 60% of the selling price set by the vendor, while 40% is explained by other characteristics or by random variations in price.

These two statements make intuitive sense at the very least - an important result, as common sense is a key factor in analyzing regression results!

29
Q

R2 SHORTCOMINGS

A

The use of R2 has two shortcomings. First, as we add more regression variables, R2 can only increase or stay the same, which can overstate goodness of fit when insignificant variables are included or the number of variables is large relative to the number of sales. Assume that we have regressed sales prices on eighteen independent variables and obtained an R2 of 0.920. Now suppose we re-run the model with a nineteenth variable, number of windows. As long as number of windows has any correlation whatsoever with sale price, R2 will increase to above 0.920.

Fortunately, R2 can be adjusted to account for the number of independent variables, resulting in its sister statistic, adjusted R2 (written R̄2). In the present example, the addition of number of windows as a nineteenth variable will cause adjusted R2 to fall unless the variable makes some minimum contribution to the predictive power of the equation.
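The adjustment can be sketched with the standard formula, adjusted R2 = 1 - (1 - R2)(n - 1)/(n - k - 1), where n is the number of sales and k the number of variables; the sample size of 100 below is hypothetical:

```python
# Adjusted R^2 penalizes adding variables that contribute little.
def adjusted_r2(r2, n, k):
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# 18 variables with R^2 = 0.920, assuming a hypothetical 100 sales:
print(round(adjusted_r2(0.920, 100, 18), 3))   # 0.902
# a 19th near-useless variable nudges R^2 up, but adjusted R^2 falls:
print(round(adjusted_r2(0.9201, 100, 19), 3))  # 0.901
```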

The second shortcoming of R2 (shared also by adjusted R2) is more a matter of care in interpretation. There is no universal critical value of R2; i.e., you cannot say "acceptable results have an R2 of 85%" or any other value. The critical value of the R2 statistic will vary with several factors, and there are several non-mathematical reasons for variations in R2 which make setting a specific target for this statistic inadvisable.

30
Q

WHAT’S MORE IMPORTANT?

R2, STANDARD ESTIMATE, COV

A

In general, improving the standard error and COV is more important than increasing the adjusted R2. That said, you should aim for an adjusted R2 as high as possible and a standard error and COV as low as possible.