BUSI344 / CHAPTER 6 QUESTIONS 2 Flashcards
FOUR MEASURES OF GOODNESS OF FIT
They are the coefficient of determination (R2), the standard error of the estimate (SEE), the coefficient of variation (COV), and the F-Statistic. In different ways, each indicates how well the equation succeeds in predicting sales prices and minimizing errors.
TWO MEASURES THAT RELATE TO THE IMPORTANCE OF INDIVIDUAL VARIABLES
The correlation coefficient (r) and the t-statistic relate to the importance of individual variables in the model.
COEFFICIENT OF DETERMINATION
R2 measures how much of the variability in the dependent variable (sale price) is accounted for (or explained) by the regression line.
That is, it indicates how well selling prices can be estimated from this expression involving total square footage of living area.
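In symbols, using the standard definition (where y_i is an actual sale price, \hat{y}_i is the price predicted by the regression, and \bar{y} is the mean sale price):

R^2 = \frac{\text{explained variation}}{\text{total variation}} = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}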
POSSIBLE VALUES OF R2
Possible values of R2 range from 0 to 1. When R2 = 0, none of the variation in sales prices is explained by the model. On the other hand, when R2 = 1, all deviations from the average sale price are explained by the regression equation and the sum of the squared errors equals 0. In a one-variable model, this implies that all sales prices lie on a straight line.
R2 STATISTIC MEASURES . . . . .
The R2 statistic measures the percentage of variation in the dependent variable (sale price) explained by the independent variable (living area).
INTERPRETATION OF R2 - EXAMPLE
If the R2 is 0.59, this means that the regression line is able to explain about 60% of the variation in the sales prices (“variation” refers to the squared differences between sales prices and the average sale price). In practice, this can be loosely interpreted to mean that total living area accounts for about 60% of the purchaser’s decision to buy a specific condo. Alternatively, total living area determines 60% of the selling price set by the vendor, while the remaining 40% is explained by other characteristics or by random variations in price. These two statements at least make intuitive sense, which is an important result, as common sense is a key factor in analyzing regression results!
R2 HAS TWO SHORTCOMINGS
The use of R2 has two shortcomings. First, as we add more regression variables, R2 can only increase or stay the same, which can overstate goodness of fit when insignificant variables are included or the number of variables is large relative to the number of sales.
The second shortcoming of R2 (shared also by adjusted R2) is more a matter of care in interpretation. There is no universal critical value of R2; i.e., you cannot say “acceptable results have an R2 of 85%” or any other value. The critical value of the R2 statistic will vary with several factors, and there are several non-mathematical reasons for variations in R2 that make setting a specific target for this statistic inadvisable.
ADJUSTED R2
R2 can be adjusted to account for the number of independent variables, resulting in its sister statistic, adjusted R2 (often written R̄2). In the present example, the addition of the number of windows as a nineteenth variable will cause adjusted R2 to fall unless the variable makes some minimum contribution to the predictive power of the equation.
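A common textbook form of the adjustment (the course software may compute it slightly differently) is:

\bar{R}^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - k - 1}

where n is the number of sales and k is the number of independent variables; each added variable must raise R2 by enough to offset the lost degree of freedom, or adjusted R2 falls.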
WHAT’S MORE IMPORTANT IN REGRESSION MODELS?
In regression models, improving the standard error and COV is generally more important than increasing the adjusted R2; nonetheless, you should try to have the adjusted R2 as high as possible and the standard error and COV as low as possible.
MEASURES THE DIFFERENCE BETWEEN REGRESSION LINE AND ACTUAL OBSERVATIONS
The standard error of the estimate (SEE) is one measure of how good the best fit is, in terms of how large the differences are between the regression line and the actual sample observations.
THE SEE MEASURES . . . .
The SEE measures the amount of deviation between actual and predicted sales prices.
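In the usual notation (degrees-of-freedom conventions vary slightly between texts):

SEE = \sqrt{\frac{\sum_i (y_i - \hat{y}_i)^2}{n - k - 1}}

that is, roughly the typical dollar error in the predicted sale prices.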
DISTRIBUTION OF REGRESSION ERRORS
In our example, we found an SEE of $9,556.24. Note that whereas R2 is a percentage figure, the SEE is a dollar figure if the dependent variable is price. Similar to the standard deviation discussion in Lesson 1, assuming the regression errors are normally distributed, approximately 68% of the errors will be $9,556 or less and approximately 95% will be $19,112 or less (see Figure 2.1 in Lesson 2).
NOTE ON R2
In mass appraisal, we often divide properties into sub-groups and develop separate model equations for each, e.g., for each neighbourhood separately.
This reduces the variance among sales prices within each sub-group, and therefore we should not expect MRA to explain as large a percentage of the variation as when one equation is fit to the entire jurisdiction. For example, if one model is developed to estimate sale price for all neighbourhoods in a sales database, there may be $300,000 in variation among the sales prices.
A model that explains 80% of the variation still leaves 20%, or $60,000, unexplained. A model for a single neighbourhood, with only $50,000 of variation in sale price, may have an adjusted R2 of only 60%, but it will produce better estimates of sales prices in that neighbourhood because the unexplained 40% of the variation is only $20,000. The standard error and COV (discussed later) will show this improvement.
PROBLEM WITH USING SEE
The problem with the SEE is that it is an absolute measure: its size does not tell you much in itself, so it can only be used in comparison to other similar models. However, you can create a further statistic from it that shows, in relative terms, how well your particular model is doing. Dividing the SEE by the mean of the dependent variable gives a relative measure called the coefficient of variation, or COV.
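In symbols:

COV = \frac{SEE}{\bar{y}} \times 100\%

where \bar{y} is the mean of the dependent variable (here, the mean sale price).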
EXPRESSING SEE AS A PERCENTAGE
In our example, the SEE is $9,556. This would indicate a good predictive model when mean property values are high, but not when they are low. Expressing the SEE as a percentage of the mean sale price removes this source of confusion.
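A minimal Python sketch of the conversion, assuming a hypothetical mean sale price of $95,000 (the actual mean is not stated in the example):

    # Convert the absolute SEE into a relative COV.
    see = 9556.24            # SEE in dollars, from the example above
    mean_sale_price = 95000  # ASSUMED mean sale price, for illustration only
    cov = see / mean_sale_price * 100
    print(f"COV = {cov:.1f}% of the mean sale price")  # about 10.1%

The same SEE would give a much larger COV in a neighbourhood with a lower mean sale price, which is exactly the source of confusion the relative measure removes.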