CHAPTER 6 / NOTES
As discussed in Lessons 1 and 2, regression is a powerful tool used by many industries and research groups.
Regression can determine if there is a relationship between one thing (called the dependent variable) and one or more other things (called independent variables).
We see applications of this technique regularly reported in the media; e.g., “scientists have determined a relationship between smoking and lung cancer”. In statistical terms, this relationship is called a correlation and can be measured by the correlation coefficient (called r).
A positive correlation between two variables, e.g., smoking and lung cancer, simply means that as one thing increases, the other also increases (or, alternatively, if one decreases, the other also decreases).
A negative correlation means the two items move in opposite directions; i.e., as one thing increases, the other decreases. An example of a negative correlation would be the size of a car’s engine versus the gas mileage – the bigger the engine, the lower the kilometres you can expect per litre of gas.
The closer the correlation coefficient is to +1 or -1, the stronger the relationship. A strong correlation exists when the correlation coefficient is between 0.8 and 1.0 (or -0.8 and -1.0).
For example, there is a perfect positive correlation (+1) between age and year of birth and a perfect negative correlation (-1) between age and life expectancy (or, perhaps, age and enjoyment of rap music).
Building a Simple Linear Regression
The first thing we might do is plot an x-y scatter diagram to visualize what the data looks like. Scatterplots provide an efficient method of examining relationships among quantitative variables. You could graph the data by hand using graph paper and a sharp pencil, but it is much easier to do this using the computer. This scatterplot can be produced in SPSS or Excel. Lesson 2 provided instructions for these programs, but the SPSS instructions will be briefly reviewed below.
Select Graphs → Legacy Dialogs → Scatter/Dot → Simple Scatter → Define
Select SalePrice for the Y-axis and Total_Area for the X-axis.
Select “Use chart specifications from:” and browse to the “RSQ1” template saved in Lesson 2.
Click OK to produce the chart.
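For readers who prefer working in code rather than SPSS, the same scatterplot can be produced in Python. This is only a minimal sketch: the file name condo_sales.csv and its column names are hypothetical stand-ins for the Lesson 2 data set.

```python
# Minimal sketch of the same x-y scatterplot using pandas and matplotlib.
# "condo_sales.csv", "Total_Area", and "SalePrice" are hypothetical names
# standing in for the course data set used in the SPSS steps above.
import pandas as pd
import matplotlib.pyplot as plt

sales = pd.read_csv("condo_sales.csv")

plt.scatter(sales["Total_Area"], sales["SalePrice"], s=15)
plt.xlabel("Total living area (square feet)")
plt.ylabel("Sale price ($)")
plt.title("Sale price versus total living area")
plt.show()
```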
At this point, we are only interested in the Coefficients table. It includes the y-intercept for the regression line equation (17,918.03) as well as the slope (72.08), so we can write our regression line equation for living area and selling price as:
Selling Price = $17,918.03 + ($72.08 × Total Living Area)
This regression equation is the mathematical form of the line we “eyeballed” in the graph of the selling price and total living area.
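For illustration, here is a hedged sketch of how the same one-variable regression could be fit in Python with scipy.stats.linregress; the data file and column names are the same hypothetical ones used above, and with the course data set the intercept and slope should come out near $17,918.03 and $72.08.

```python
# Sketch: fit sale price on total living area and use the fitted line to
# predict a price. File and column names are hypothetical placeholders.
import pandas as pd
from scipy import stats

sales = pd.read_csv("condo_sales.csv")
fit = stats.linregress(sales["Total_Area"], sales["SalePrice"])

print(f"intercept: {fit.intercept:,.2f}")    # y-intercept of the regression line
print(f"slope:     {fit.slope:,.2f}")        # dollars per additional square foot

# Predicted selling price for a hypothetical 800 sq. ft. unit:
area = 800
predicted = fit.intercept + fit.slope * area
print(f"predicted price for {area} sq. ft.: ${predicted:,.0f}")
```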
Evaluating Regression Results
We will look at six key statistics used in evaluating regression results. Four are measures of goodness of fit and relate to evaluation of the predictive accuracy of the equation. They are the coefficient of determination (R²), the standard error of the estimate (SEE), the coefficient of variation (COV), and the F-statistic. In different ways, each indicates how well the equation succeeds in predicting sales prices and minimizing errors. The other two statistics, the correlation coefficient (r) and the t-statistic, relate to the importance of individual variables in the model. The statistics we need are in the tables produced above.
Coefficient of Determination
There are a number of additional measures that can be used to determine how well our regression line predicts the selling price. One of the most common is R², called the coefficient of determination (the correlation coefficient squared). R² measures how much of the variability in the dependent variable (sale price) is accounted for (or explained) by the regression line. That is, essentially, it tells us how good the estimates of selling price are when based on this expression involving total square footage of living area.
Possible values of R² range from 0 to 1. When R² = 0, none of the variation in sales prices is explained by the model. On the other hand, when R² = 1, all deviations from the average sale price are explained by the regression equation and the sum of the squared errors equals 0. In a one-variable model, this implies that all sales prices lie on a straight line.
In our example, we found an R² of 0.59 – this is displayed in the chart above and in the SPSS output. The R² statistic measures the percentage of variation in the dependent variable (sale price) explained by the independent variable (living area). If the R² is 0.59, this means that the regression line is able to explain about 60% of the variation of the sales prices (“variation” refers to the squared differences between sales prices and the average sale price). In practice, this can be loosely interpreted to mean total living area accounts for about 60% of the purchaser’s decision to buy a specific condo. Or, conversely, total living area determines 60% of the selling price set by the vendor, while 40% is explained by other characteristics or by random variations in price. These two statements make intuitive sense at the very least – an important result, as common sense is a key factor in analyzing regression results!
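To make the definition concrete, the sketch below computes R² two equivalent ways for a one-variable model: as 1 minus the ratio of squared errors to total squared variation, and as the squared correlation coefficient. The data file and column names remain the hypothetical ones used earlier.

```python
# Sketch: compute R^2 for the one-variable model by hand.
import numpy as np
import pandas as pd

sales = pd.read_csv("condo_sales.csv")        # hypothetical file name
x = sales["Total_Area"].to_numpy()
y = sales["SalePrice"].to_numpy()

slope, intercept = np.polyfit(x, y, 1)        # least-squares line
predicted = intercept + slope * x

ss_res = np.sum((y - predicted) ** 2)         # squared errors left unexplained
ss_tot = np.sum((y - y.mean()) ** 2)          # total variation around the mean price
r_squared = 1 - ss_res / ss_tot

r = np.corrcoef(x, y)[0, 1]                   # correlation coefficient
print(round(r_squared, 3), round(r ** 2, 3))  # the two values agree in a one-variable model
```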
The use of R² has two shortcomings. First, as we add more regression variables, R² can only increase or stay the same, which can overstate goodness of fit when insignificant variables are included or the number of variables is large relative to the number of sales.
Assume that we have regressed sales prices on eighteen independent variables and obtained an R² of 0.920. Now suppose we re-run the model with a nineteenth variable, number of windows. As long as number of windows has any correlation whatsoever with sale price, R² will increase to above 0.920.
R² can be adjusted to account for the number of independent variables, resulting in its sister statistic, adjusted R². In the present example, the addition of number of windows as a nineteenth variable will cause adjusted R² to fall unless the variable makes some minimum contribution to the predictive power of the equation.
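The adjustment itself is just arithmetic on R², the number of sales n, and the number of independent variables k. The sketch below uses the standard formula with illustrative numbers (not from the course data).

```python
# Sketch: adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)
def adjusted_r_squared(r_squared: float, n: int, k: int) -> float:
    """Adjusted R^2 given R^2, n sales, and k independent variables."""
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

# Illustrative only: a 19th, nearly useless variable nudges R^2 up slightly
# but pulls adjusted R^2 down.
print(adjusted_r_squared(0.9200, n=200, k=18))   # about 0.9120
print(adjusted_r_squared(0.9201, n=200, k=19))   # about 0.9117 (lower)
```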
The second shortcoming of R² (shared also by adjusted R²) is more a matter of care in interpretation. There can be no specified universal critical value of R²; i.e., you cannot say “acceptable results have an R² of 85%” or any other value. The critical value of the R² statistic will vary with several factors and there are several non-mathematical reasons for variations in R² which make setting a specific target for this statistic inadvisable.
In mass appraisal, we often divide properties into sub-groups and develop separate model equations for each, e.g., for each neighbourhood separately. This reduces the variance among sales prices within each sub-group, and therefore we should not expect MRA (multiple regression analysis) to explain as large a percentage of it as when one equation is fit to the entire jurisdiction.
For example, if one model is developed to estimate sale price for all neighbourhoods in a sales database, there may be $300,000 in variation among the sales prices. A model that explains 80% of the variation still leaves 20%, or $60,000, unexplained.
A model for a single neighbourhood, with only $50,000 variation in sale price, may have an adjusted R² of only 60%, but it will produce better estimates of sales prices in that neighbourhood because 40% of the variation is only $20,000. The standard error and COV (discussed later) will show this improvement.
In regression models generally, improving the standard error and COV is more important than increasing the adjusted R², but you should still aim for an adjusted R² that is as high as possible and a standard error and COV that are as low as possible.
Standard Error of the Estimate
The analyst must not only be able to estimate the equation for the regression line, he or she must also be able to measure how well the regression line fits the points. The techniques provided so far enable the analyst to determine a best fit regression line and measure its overall goodness of fit using R².
However, it is also desirable to find out how well the regression equation fits each individual observation. It may be that the best fit line is very accurate at representing the data, or alternatively, if the data points are highly dispersed, the best fit line may be very poor.
The standard error of the estimate (SEE) is one measure of how good the best fit is, in terms of how large the differences are between the regression line and the actual sample observations. The SEE measures the amount of deviation between actual and predicted sales prices. If the SEE is small, the observations are tightly scattered around the regression line. If the SEE is large, the observations are widely scattered around the regression line. The smaller the standard error, the better the fit.
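As a sketch of where the number comes from, the SEE can be recovered from the model's residuals: it is the square root of the sum of squared errors divided by the degrees of freedom (number of sales minus number of variables minus one). File and column names are again hypothetical.

```python
# Sketch: standard error of the estimate (SEE) from the residuals.
import numpy as np
import pandas as pd

sales = pd.read_csv("condo_sales.csv")
x = sales["Total_Area"].to_numpy()
y = sales["SalePrice"].to_numpy()

slope, intercept = np.polyfit(x, y, 1)
residuals = y - (intercept + slope * x)       # actual minus predicted sale price

n, k = len(y), 1                              # k = number of independent variables
see = np.sqrt(np.sum(residuals ** 2) / (n - k - 1))
print(f"SEE: ${see:,.2f}")                    # about $9,556 with the course data
```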
In our example, we found an SEE of $9,556.24. Note that whereas R² is a percentage figure, the SEE is a dollar figure if the dependent variable is price. Similar to the standard deviation discussion in Lesson 1, assuming the regression errors are normally distributed, approximately 68% of the errors will be $9,556 or less and approximately 95% will be $19,112 or less (see Figure 2.1 in Lesson 2).
In general, you want a small SEE relative to the size of the dependent variable – in our case the selling price. Say, for example, you were running several different potential models for estimating sale price with a variety of variables. You could then compare the R² and SEE for each to see which predicts the most variation in selling price with the least associated error.
The SEE is free from the second interpretive shortcoming of R² mentioned above. In other words, whereas R² evaluates the seriousness of the errors indirectly by comparing them with the variation of the sales prices, the SEE evaluates them directly in dollar terms. The problem with the SEE is that it is an absolute measure, meaning its size alone does not tell you much in itself, and thus it can only be used in comparison to other similar models. However, you can create a further statistic from it that tells you how well you are doing in relative terms in your particular model. By dividing the SEE by the mean of the dependent variable, you get a relative measure called the coefficient of variation or COV.
Coefficient of Variation
In our example, the SEE is $9,556. This would indicate a good predictive model when mean property values are high, but not when they are low. Expressing the SEE as a percentage of the mean sale price removes this source of confusion.
In regression analysis, the coefficient of variation (COV) is the SEE divided by the mean sale price and multiplied by 100, expressing the typical error as a percentage of the mean sale price. The formula is the same as that described in Lesson 1, except that the SEE replaces the s (standard deviation).
Most regression software reports the SEE but not the COV, so we have to calculate it manually. Here, the COV is calculated by dividing the SEE (9,556.24) by the mean of the sale prices (76,593.50) and multiplying by 100, yielding 12.48%. In general, for residential models which have sale price as the dependent variable, a COV of approximately 20% is acceptable, while a COV of approximately 10% indicates a very good result.
At 12.5%, our model’s COV is acceptably small, but not fantastic. This tells us that total square footage of living area does a fairly good job of predicting sale price, but there is more to sale price than just this one variable (as we would expect!).
Our COV implies that, given a normal distribution, roughly two-thirds of sales prices lie within 12.5% of their MRA-predicted values.
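Because the COV is not reported by most regression software, it is worth seeing how little work the manual calculation involves; the figures below are the ones quoted from the SPSS output above.

```python
# Sketch: COV = SEE / mean sale price * 100, using the figures reported above.
see = 9_556.24
mean_sale_price = 76_593.50

cov = see / mean_sale_price * 100
print(f"COV: {cov:.2f}%")    # 12.48%
```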
Correlation Coefficient
The correlation coefficient (r) is the first of two statistics that relate to individual regression variables. As explained in Lesson 1, the correlation coefficient is a measure that indicates the strength of the relationship between two variables. It can take on values from -1.0 to +1.0, ranging from very strong negative correlation to very strong positive correlation, or somewhere in between.
In our example, the correlation between sale price and living area is 0.7696 (rounded to 0.77). This is a moderate level of correlation, approaching strong (0.80 is considered strong). So it seems that our simple estimate is doing a pretty good job (based on this sample data, of course). There is a fairly strong positive linear relationship between square feet and sale price. Given the regression coefficient of $72.08, as the number of square feet increases by 1, the estimated sale price increases by $72.08.
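As a final check, the correlation coefficient is straightforward to reproduce in code; with the course data it should come out near 0.77, and squaring it recovers the R² of roughly 0.59 reported earlier. File and column names are the same hypothetical placeholders used above.

```python
# Sketch: Pearson correlation between sale price and living area.
import pandas as pd

sales = pd.read_csv("condo_sales.csv")
r = sales["SalePrice"].corr(sales["Total_Area"])   # Pearson r
print(round(r, 4), round(r ** 2, 2))               # about 0.77 and 0.59
```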