BUSI344 / CHAPTER 6 QUESTIONS 2 Flashcards

1
Q

FOUR MEASURES OF GOODNESS OF FIT

A

They are the coefficient of determination (R2), the standard error of the estimate (SEE), the coefficient of variation (COV), and the F-Statistic. In different ways, each indicates how well the equation succeeds in predicting sales prices and minimizing errors.

2
Q

TWO MEASURES THAT RELATE TO THE IMPORTANCE OF INDIVIDUAL VARIABLES

A

The correlation coefficient (r) and the t-statistic relate to the importance of individual variables in the model.

3
Q

COEFFICIENT OF DETERMINATION

A

R2 measures how much of the variability in the dependent variable (sale price) is accounted for (or explained) by the regression line.

That is, it indicates, essentially, how good the estimates of selling price are when based on this expression involving total square footage of living area.

4
Q

POSSIBLE VALUES OF R2

A

Possible values of R2 range from 0 to 1. When R2 = 0, none of the variation in sales prices is explained by the model. On the other hand, when R2 = 1, all deviations from the average sale price are explained by the regression equation and the sum of the squared errors equals 0. In a one-variable model, this implies that all sales prices lie on a straight line.
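As an illustrative sketch (hypothetical data, not the course sample), R2 can be computed directly from the sums of squares for a one-variable model:

```python
import numpy as np

# Hypothetical data: living area (sq. ft.) vs. sale price (dollars)
x = np.array([800.0, 950.0, 1100.0, 1300.0, 1500.0])
y = np.array([60000.0, 68000.0, 75000.0, 88000.0, 99000.0])

# Fit a one-variable regression line y = b0 + b1*x by least squares
b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x

sst = np.sum((y - y.mean()) ** 2)   # total variation about the mean price
sse = np.sum((y - y_hat) ** 2)      # unexplained (squared error) variation
r2 = 1 - sse / sst                  # share of total variation explained

print(round(r2, 3))
```

If every sale price fell exactly on the fitted line, `sse` would be 0 and R2 would equal 1.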

5
Q

R2 STATISTIC MEASURES . . . . .

A

The R2 statistic measures the percentage of variation in the dependent variable (sale price) explained by the independent variable (living area).

6
Q

INTERPRETATION OF R2 - EXAMPLE

A

If the R2 is 0.59, this means that the regression line is able to explain about 60% of the variation of the sales prices (“variation” refers to the squared differences between sales prices and the average sale price). In practice, this can be loosely interpreted to mean total living area accounts for about 60% of the purchaser’s decision to buy a specific condo. Or, conversely, total living area determines 60% of the selling price set by the vendor, while 40% is explained by other characteristics or by random variations in price. These two statements make intuitive sense at the very least - an important result, as common sense is a key factor in analyzing regression results!

7
Q

R2 HAS TWO SHORTCOMINGS

A

The use of R2 has two shortcomings. First, as we add more regression variables, R2 can only increase or stay the same, which can overstate goodness of fit when insignificant variables are included or the number of variables is large relative to the number of sales.

The second shortcoming of R2 (shared also by adjusted R2) is more a matter of care in interpretation. There can be no specified universal critical value of R2; i.e., you cannot say "acceptable results have an R2 of 85%" or any other value. The critical value of the R2 statistic will vary with several factors, and there are several non-mathematical reasons for variations in R2 which make setting a specific target for this statistic inadvisable.

8
Q

ADJUSTED R2

A

R2 can be adjusted to account for the number of independent variables, resulting in its sister statistic, adjusted R2. In the present example, the addition of number of windows as a nineteenth variable will cause adjusted R2 to fall unless the variable makes some minimum contribution to the predictive power of the equation.
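A minimal sketch of the standard adjustment formula, where n is the number of sales and k the number of independent variables (the sample sizes below are assumptions for illustration, not the course data):

```python
def adjusted_r2(r2, n, k):
    """Adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - k - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# With many sales and few variables, the penalty is tiny
print(round(adjusted_r2(0.59, 120, 1), 3))

# With many variables and few sales, the penalty is severe
# (adjusted R2 can even go negative)
print(round(adjusted_r2(0.59, 25, 18), 3))
```

This is why adding an insignificant nineteenth variable lowers adjusted R2 even though plain R2 can only rise or stay the same.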

9
Q

WHAT’S MORE IMPORTANT IN REGRESSION MODELS?

A

In general in regression models, improving the standard error and COV is more important than increasing the adjusted R2, but you should generally try to have the adjusted R2 as high as possible and the standard error and COV as low as possible.

10
Q

MEASURES THE DIFFERENCE BETWEEN REGRESSION LINE AND ACTUAL OBSERVATIONS

A

The standard error of the estimate (SEE) is one measure of how good the best fit is, in terms of how large the differences are between the regression line and the actual sample observations.

11
Q

THE SEE MEASURES . . . .

A

The SEE measures the amount of deviation between actual and predicted sales prices.

12
Q

DISTRIBUTION OF REGRESSION ERRORS

A

In our example, we found an SEE of $9,556.24. Note that whereas R2 is a percentage figure, the SEE is a dollar figure if the dependent variable is price. Similar to the standard deviation discussion in Lesson 1, assuming the regression errors are normally distributed, approximately 68% of the errors will be $9,556 or less and approximately 95% will be $19,112 or less (see Figure 2.1 in Lesson 2).
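A quick simulation sketch of this 68%/95% rule, assuming (as the card does) that the regression errors are normally distributed with standard deviation equal to the SEE:

```python
import numpy as np

see = 9556.24  # standard error of the estimate from the example (dollars)

# Simulate a large sample of regression errors with std. dev. = SEE
rng = np.random.default_rng(0)
errors = rng.normal(0.0, see, size=100_000)

within_one = np.mean(np.abs(errors) <= see)       # ~68% within one SEE
within_two = np.mean(np.abs(errors) <= 2 * see)   # ~95% within two SEEs

print(round(within_one, 2), round(within_two, 2))
```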

13
Q

NOTE ON R2

A

In mass appraisal, we often divide properties into sub-groups and develop separate model equations for each, e.g., for each neighbourhood separately.

This reduces the variance among sales prices in each sub-group, and therefore we should not expect MRA to explain as large a percentage of variation as when one equation is fit to the entire jurisdiction. For example, if one model is developed to estimate sale price for all neighbourhoods in a sales database, there may be $300,000 in variation among the sales prices.

A model that explains 80% of the variation still leaves 20% or $60,000 unexplained. A model for a single neighbourhood, with only $50,000 variation in sale price, may have an adjusted R2 of only 60%, but will produce better estimates of sales prices in that neighbourhood because 40% of the variation is only $20,000. The standard error and COV (discussed later) will show this improvement.

14
Q

PROBLEM WITH USING SEE

A

The problem with the SEE is that it is an absolute measure: its size alone tells you little, so it can only be used in comparison to other similar models. However, you can create a further statistic from it that tells you how well you are doing in relative terms in your particular model. By dividing the SEE by the mean of the dependent variable, you get a relative measure called the coefficient of variation or COV.

15
Q

EXPRESSING SEE AS A PERCENTAGE

A

In our example, the SEE is $9,556. This would indicate a good predictive model when mean property values are high, but not when they are low. Expressing the SEE as a percentage of the mean sale price removes this source of confusion.

16
Q

COEFFICIENT OF VARIATION IS . . .

A

In regression analysis, the coefficient of variation (COV) is the SEE divided by the mean sale price, multiplied by 100 to express it as a percentage.

17
Q

INTERPRETING THE COV

A

The COV is calculated by dividing the SEE (9,556.24) by the mean of the sale prices (76,593.50), yielding 12.48%. In general, for residential models which have sale price as the dependent variable, a COV of approximately 20% is acceptable, while a COV of approximately 10% indicates a very good result. At 12.5%, our model’s COV is acceptably small, but not fantastic. This tells us that total square footage of living area does a fairly good job of predicting sale price, but there is more to sale price than just this one variable (as we would expect!).
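The arithmetic from the card, sketched directly with the example's figures:

```python
see = 9556.24          # standard error of the estimate (dollars)
mean_price = 76593.50  # mean of the sale prices (dollars)

# COV: SEE as a percentage of the mean sale price
cov = see / mean_price * 100
print(round(cov, 2))
```

At roughly 12.5%, this sits between the ~10% "very good" and ~20% "acceptable" benchmarks for residential models.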

18
Q

THE CORRELATION COEFFICIENT

A

The correlation coefficient (r) is the first of two statistics that relate to individual regression variables. As explained in Lesson 1, the correlation coefficient is a measure that indicates the strength of the relationship between two variables. It can take on values from -1.0 to +1.0, ranging from very strong negative correlation to very strong positive correlation, or somewhere in between.

19
Q

SIZE OF SEE

A

If the SEE is small, the observations are tightly scattered around the regression line. If the SEE is large, the observations are widely scattered around the regression line. The smaller the standard error, the better the fit.

20
Q

THE CORRELATION COEFFICIENT MEASURES

A

The correlation coefficient measures how strongly two variables have a straight line relation to each other, but does not give the exact relationship. Two sets of data (x,y) yielding exactly the same regression equation (straight line) may have very different correlation coefficients between x and y.
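An illustrative sketch of this point (hypothetical data): the noise below is constructed to sum to zero and be orthogonal to x, so both datasets share exactly the same best-fit line, yet their correlation coefficients differ sharply.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Dataset 1: points exactly on the line y = 2x + 1
y_tight = 2.0 * x + 1.0

# Dataset 2: same fitted line, but scattered around it
# (noise sums to 0 and is uncorrelated with x, so the fit is unchanged)
y_noisy = 2.0 * x + 1.0 + np.array([3.0, -6.0, 6.0, -6.0, 3.0])

r_tight = np.corrcoef(x, y_tight)[0, 1]
r_noisy = np.corrcoef(x, y_noisy)[0, 1]

print(round(r_tight, 3), round(r_noisy, 3))
```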

21
Q

REGRESSION COEFFICIENTS INDICATE . . .

A

Regression coefficients indicate how variables are related; that is, how many units (dollars) the dependent variable changes when the independent variable changes by one unit (for example, one square foot), with other variables in the equation held constant.

22
Q

T-STATISTIC IS A MEASURE OF . . .

A

The t-statistic is a measure of the significance or importance of a regression variable in explaining differences in the dependent variable (sale price).

23
Q

WHAT IS CONSIDERED A HIGH T VALUE?

A

Generally, if you have plenty of data and want to have a statistical confidence of 95% in your answer, the critical value that the t-statistic must exceed is ±1.96. A t-statistic in excess of ±2.58 indicates that one can be 99% confident that the independent variable is significant in the prediction of sale price.

24
Q

T STATISTIC RULES OF THUMB

A

As a rough rule-of-thumb, modelers often use critical levels of t-statistic over 1.6 (90% confidence) or 2.0 (95% confidence).

A significance level of .10 suggests that one can be at least 90% confident that the variable coefficient is significantly different from 0; in other words, there is less than a 10% probability that the coefficient is equal to zero. If the probability is high that the coefficient is equal to zero, this would indicate that the variable provides no useful information to the model.

A significance level of less than .05 would indicate that the probability of the coefficient being equal to zero is 5% or less, which indicates a reliable result. Normally in mass appraisal work, a significance level of less than .10 is desired, and often .05 or less.
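A minimal sketch of the rule-of-thumb check. The t-statistic is the coefficient divided by its standard error; the standard error here is back-computed from the example's coefficient ($72.08) and t-statistic (13.093), so it is an assumption for illustration.

```python
def t_check(coefficient, std_error, critical_t=2.0):
    """t-statistic = coefficient / its standard error,
    compared to a rule-of-thumb critical value (2.0 ~ 95% confidence)."""
    t = coefficient / std_error
    return t, abs(t) > critical_t

# Standard error of 5.505 is inferred from the example, not given in it
t, significant = t_check(72.08, 5.505)
print(round(t, 2), significant)
```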

25
Q

F-VALUE INFORMATION

A

The F-Statistic (F-value or F-ratio) also provides information as to the “goodness” of the regression.

26
Q

F-STATISTIC SHOWS

A

The F-Statistic shows the overall quality of the regression, as opposed to the usefulness of the individual variables as reported by the t-statistic.

27
Q

F-VALUE IS RELATED TO . . .

A

The F-value is related to the correlation coefficient (r). It measures whether the overall regression relationship is significant; that is, it tests whether the model is useful in representing the sample data. The F-value is a ratio showing the portion of the total variation of the dependent variable that is explained by the regression divided by the remaining variation that is left unexplained by the model.
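The ratio described above can be sketched as follows. The sums of squares here are hypothetical, chosen only so the resulting ratio lands near the example's F of about 171; they are not the course data.

```python
def f_ratio(ssr, sse, n, k):
    """F = mean explained variation / mean unexplained variation,
    where ssr is explained and sse is unexplained variation,
    n = number of observations, k = number of independent variables."""
    return (ssr / k) / (sse / (n - k - 1))

# Hypothetical sums of squares for a one-variable model
print(round(f_ratio(ssr=6.0e9, sse=4.2e9, n=122, k=1), 1))
```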

28
Q

A SMALL F-VALUE

A

If explained variation is small relative to unexplained variation, the regression equation does not fit the data well and the regression results are not considered statistically significant. A small value of F (generally less than 4) leads to acceptance of the hypothesis that the regression relationship is not significant.

29
Q

F-VALUE RULE OF THUMB

A

For a rough rule-of-thumb, modelers often use a critical level of F-statistic > 4 to indicate a statistically significant relationship.

30
Q

INTERPRETATION OF THE F-RATIO

A

Continuing with our example, the F-ratio of 171.422 is quite a bit larger than 4. This indicates that the estimates produced by the regression model provide a better representation of the sample data than the mean of the observations.

In other words, the regression estimates fit the data well and the results are statistically significant. The size of the F-ratio above the critical value of 4 must be viewed with caution. At larger magnitudes, the F-ratio is useful mostly as a relative measure; for example, if two models are identical in all respects other than their F-ratios, the model with the larger F-ratio is probably the better one. The absolute measure of the F-ratio is less meaningful because F-ratios are sensitive to the number of observations and the number of variables in the model. Few observations, together with a relatively large number of variables, will generally produce a low F-ratio.

The large F-ratio in our example is greater than the critical value of 4 and indicates that the estimates produced by the model are better predictors of value than the mean. However, the large number of observations and few variables in the model would be expected to produce a very high F-ratio.

31
Q

CHECKING THE REGRESSION OUTPUT

A

When checking the regression output, the following points are important:

  • the coefficients have the expected sign (positive or negative);
  • the t-statistics are significant, i.e., greater than 1.64 (significance level less than .10);
  • the F-statistic is “large” and the probability provided with the F-statistic should be less than .05;
  • the standard error of the estimate or SEE (also termed the “root mean square error” or RMSE) should be small;
  • the Coefficient of Variation should be small; and
  • the adjusted R2 should be large.
32
Q

SAMPLE INTERPRETATION

A

Overall, our model appears to reasonably approximate sale price:

  • the coefficient is +$72.08, which makes intuitive sense - as living area increases, price increases;
  • the t-statistic is good at 13.093 (significance level is .000);
  • the F-statistic is large at 171 and the associated significance is .000;
  • the SEE of $9,556 is small relative to the $76,593 mean;
  • the COV is 12.48%, which is acceptable but larger than optimal; and
  • the adjusted R2 at 0.59 is reasonably large, but not great, as 40% of variation in sale price remains unexplained.
33
Q

MULTICOLLINEARITY

A

With all of these statistics indicating positive results, it appears we can conclude this is a good model to estimate the selling price of condominiums in this market area. However, there is one more element of the model that needs to be checked before we can make this claim; we must examine for multicollinearity.

When we created the simple regression model between sale price and living area, we were concerned with only the one relationship between sale price and living area. In creating our more complex model, we must consider the relationship between sale price and each of living area, floor number, and bathrooms, but also the relationships among the independent variables - that is, how living area and bathrooms, living area and floor number, and bathrooms and floor number may be related. If any of these three combinations shows significant correlation, then we have multicollinearity in our model.

The existence of high multicollinearity can invalidate an MRA model. This is because the overlap in the variables will cause the MRA process to become “confused” and the values of the coefficients will be inaccurate.

34
Q

MULTICOLLINEARITY II

A

The first part of multicollinearity testing should be done during data screening, prior to running the regression (as will be seen in the following lessons). The second part of this testing should be done after the model is generated. The part that can be done beforehand is the examination of the correlation matrix. As can be seen in the Correlation table included in the regression results, our three variables have the following correlations:

  • living area and bathrooms 0.370
  • living area and floor number -0.090
  • bathrooms and floor number 0.411
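A correlation matrix like the one above can be produced directly from the data; a sketch with simulated (not course) data, where bathrooms is deliberately built to depend partly on living area:

```python
import numpy as np

# Simulated data for the three independent variables (illustrative only)
rng = np.random.default_rng(1)
living_area = rng.normal(1000.0, 200.0, size=200)
floor_number = rng.integers(1, 20, size=200).astype(float)
bathrooms = np.round(living_area / 600.0 + rng.normal(0.0, 0.4, size=200))

# Rows are variables; result is the 3x3 matrix of pairwise correlations
X = np.vstack([living_area, floor_number, bathrooms])
corr = np.corrcoef(X)

print(np.round(corr, 3))
```

Each off-diagonal entry is the r between one pair of variables; the diagonal is always 1.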
35
Q

MULTICOLLINEARITY III

A

Variables with correlations over ±0.500 should be closely examined, although generally only those over ±0.800 will cause problems in an MRA model. At the outset of specifying a model, variables with correlations over ±0.800 should not be placed in the same model. In our case, the correlations are all low enough not to be of concern.

36
Q

MULTICOLLINEARITY IV

A

The second measure for multicollinearity is generated when the model is created, in the Tolerance and VIF (variance inflation factor) statistics. These two statistics measure the same thing, as they are inversely related; that is, Tolerance = 1 / VIF. The Tolerance should be greater than 0.3 (and the VIF less than 3.333).

A variable that has a tolerance value less than the target of 0.3 is considered to show a degree of multicollinearity which can have a serious effect on the value of its coefficient. A modeler must be wary to watch for low tolerance (high VIF) as the coefficients may be inaccurate.

In our case, the tolerances are:

  • living area 0.793
  • floor number 0.763
  • bathrooms 0.664

All are greater than the critical value and indicate no multicollinearity. We can safely conclude that we have produced a good model to estimate the selling price of condominiums in this market area.
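A sketch of how tolerance is computed: a variable's tolerance is 1 minus the R2 from regressing it on the other independent variables, and VIF is its reciprocal. The data below are simulated for illustration, not the course sample.

```python
import numpy as np

def tolerance(X, j):
    """Tolerance of column j: 1 - R2 from regressing X[:, j]
    on the remaining columns (with an intercept)."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    y_hat = A @ coef
    r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
    return 1 - r2

# Simulated variables: x2 partly depends on x1, x3 is independent
rng = np.random.default_rng(2)
x1 = rng.normal(size=300)
x2 = 0.5 * x1 + rng.normal(size=300)
x3 = rng.normal(size=300)
X = np.column_stack([x1, x2, x3])

tol = tolerance(X, 0)
vif = 1.0 / tol   # VIF is the reciprocal of tolerance
print(round(tol, 3), round(vif, 3))
```

With only mild overlap among the variables, the tolerance stays well above the 0.3 threshold, matching the conclusion above.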