CHAPTER 6 / NOTES
As discussed in Lessons 1 and 2, regression is a powerful tool used by many industries and research groups.
Regression can determine if there is a relationship between one thing (called the dependent variable) and one or more other things (called independent variables).
We see applications of this technique regularly reported in the media; e.g., “scientists have determined a relationship between smoking and lung cancer”. In statistical terms, this relationship is called a correlation and can be measured by the correlation coefficient (called r).
A positive correlation between two variables, e.g., smoking and lung cancer, simply means that as one thing increases, the other also increases (or, alternatively, if one decreases, the other also decreases).
A negative correlation means the two items move in opposite directions; i.e., as one thing increases, the other decreases. An example of a negative correlation would be the size of a car’s engine versus the gas mileage – the bigger the engine, the lower the kilometres you can expect per litre of gas.
The closer the correlation coefficient is to +1 or -1, the stronger the relationship. A strong correlation exists when the correlation coefficient is between 0.8 and 1.0 (or -0.8 and -1.0).
For example, there is a perfect positive correlation (+1) between age and year of birth and a perfect negative correlation (-1) between age and life expectancy (or, perhaps, age and enjoyment of rap music).
Building a Simple Linear Regression
The first thing we might do is plot an x-y scatter diagram to visualize what the data looks like. Scatterplots provide an efficient method of examining relationships among quantitative variables. You could graph the data by hand using graph paper and a sharp pencil, but it is much easier to do this using the computer. This scatterplot can be produced in SPSS or Excel. Lesson 2 provided instructions for these programs, but the SPSS instructions will be briefly reviewed below.
Select Graphs → Legacy Dialogs → Scatter/Dot → Simple Scatter → Define
Select SalePrice for the Y-axis and Total_Area for the X-axis.
Select “Use chart specifications from:” and browse to the “RSQ1” template saved in Lesson 2.
Click OK to produce the chart.
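For readers who prefer working in code rather than SPSS, the same scatterplot can be produced in Python. This is only a minimal sketch: the file name condo_sales.csv and its column names are hypothetical stand-ins for the Lesson 2 data set.

```python
# Minimal sketch of the same x-y scatterplot using pandas and matplotlib.
# "condo_sales.csv", "Total_Area", and "SalePrice" are hypothetical names
# standing in for the course data set used in the SPSS steps above.
import pandas as pd
import matplotlib.pyplot as plt

sales = pd.read_csv("condo_sales.csv")

plt.scatter(sales["Total_Area"], sales["SalePrice"], s=15)
plt.xlabel("Total living area (square feet)")
plt.ylabel("Sale price ($)")
plt.title("Sale price versus total living area")
plt.show()
```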
At this point, we are only interested in the Coefficients table. It includes the y-intercept for the regression line equation (17,918.03) as well as the slope (72.08), so we can write our regression line equation for living area and selling price as:
Selling Price = $17,918.03 + ($72.08 × Total Living Area)
This regression equation is the mathematical form of the line we “eyeballed” in the graph of the selling price and total living area.
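For illustration, here is a hedged sketch of how the same one-variable regression could be fit in Python with scipy.stats.linregress; the data file and column names are the same hypothetical ones used above, and with the course data set the intercept and slope should come out near $17,918.03 and $72.08.

```python
# Sketch: fit sale price on total living area and use the fitted line to
# predict a price. File and column names are hypothetical placeholders.
import pandas as pd
from scipy import stats

sales = pd.read_csv("condo_sales.csv")
fit = stats.linregress(sales["Total_Area"], sales["SalePrice"])

print(f"intercept: {fit.intercept:,.2f}")    # y-intercept of the regression line
print(f"slope:     {fit.slope:,.2f}")        # dollars per additional square foot

# Predicted selling price for a hypothetical 800 sq. ft. unit:
area = 800
predicted = fit.intercept + fit.slope * area
print(f"predicted price for {area} sq. ft.: ${predicted:,.0f}")
```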
Evaluating Regression Results
We will look at six key statistics used in evaluating regression results. Four are measures of goodness of fit and relate to evaluation of the predictive accuracy of the equation. They are the coefficient of determination (R²), the standard error of the estimate (SEE), the coefficient of variation (COV), and the F-statistic. In different ways, each indicates how well the equation succeeds in predicting sales prices and minimizing errors. The other two statistics, the correlation coefficient (r) and the t-statistic, relate to the importance of individual variables in the model. The statistics we need are in the tables produced above.
Coefficient of Determination
There are a number of additional measures that can be used to determine how well our regression line predicts the selling price. One of the most common is R², called the coefficient of determination (the correlation coefficient squared). R² measures how much of the variability in the dependent variable (sale price) is accounted for (or explained) by the regression line. That is, essentially, it tells us how good the estimates of selling price are when based on this expression involving total square footage of living area.
Possible values of R² range from 0 to 1. When R² = 0, none of the variation in sales prices is explained by the model. On the other hand, when R² = 1, all deviations from the average sale price are explained by the regression equation and the sum of the squared errors equals 0. In a one-variable model, this implies that all sales prices lie on a straight line.
In our example, we found an R² of 0.59 – this is displayed in the chart above and in the SPSS output. The R² statistic measures the percentage of variation in the dependent variable (sale price) explained by the independent variable (living area). If the R² is 0.59, this means that the regression line is able to explain about 60% of the variation of the sales prices (“variation” refers to the squared differences between sales prices and the average sale price). In practice, this can be loosely interpreted to mean total living area accounts for about 60% of the purchaser’s decision to buy a specific condo. Or, conversely, total living area determines 60% of the selling price set by the vendor, while 40% is explained by other characteristics or by random variations in price. These two statements make intuitive sense at the very least – an important result, as common sense is a key factor in analyzing regression results!
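To make the definition concrete, the sketch below computes R² two equivalent ways for a one-variable model: as 1 minus the ratio of squared errors to total squared variation, and as the squared correlation coefficient. The data file and column names remain the hypothetical ones used earlier.

```python
# Sketch: compute R^2 for the one-variable model by hand.
import numpy as np
import pandas as pd

sales = pd.read_csv("condo_sales.csv")        # hypothetical file name
x = sales["Total_Area"].to_numpy()
y = sales["SalePrice"].to_numpy()

slope, intercept = np.polyfit(x, y, 1)        # least-squares line
predicted = intercept + slope * x

ss_res = np.sum((y - predicted) ** 2)         # squared errors left unexplained
ss_tot = np.sum((y - y.mean()) ** 2)          # total variation around the mean price
r_squared = 1 - ss_res / ss_tot

r = np.corrcoef(x, y)[0, 1]                   # correlation coefficient
print(round(r_squared, 3), round(r ** 2, 3))  # the two values agree in a one-variable model
```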
The use of R² has two shortcomings. First, as we add more regression variables, R² can only increase or stay the same, which can overstate goodness of fit when insignificant variables are included or the number of variables is large relative to the number of sales.
Assume that we have regressed sales prices on eighteen independent variables and obtained an R² of 0.920. Now suppose we re-run the model with a nineteenth variable, number of windows. As long as number of windows has any correlation whatsoever with sale price, R² will increase to above 0.920.
R² can be adjusted to account for the number of independent variables, resulting in its sister statistic, adjusted R². In the present example, the addition of number of windows as a nineteenth variable will cause adjusted R² to fall unless the variable makes some minimum contribution to the predictive power of the equation.
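The adjustment itself is just arithmetic on R², the number of sales n, and the number of independent variables k. The sketch below uses the standard formula with illustrative numbers (not from the course data).

```python
# Sketch: adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)
def adjusted_r_squared(r_squared: float, n: int, k: int) -> float:
    """Adjusted R^2 given R^2, n sales, and k independent variables."""
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

# Illustrative only: a 19th, nearly useless variable nudges R^2 up slightly
# but pulls adjusted R^2 down.
print(adjusted_r_squared(0.9200, n=200, k=18))   # about 0.9120
print(adjusted_r_squared(0.9201, n=200, k=19))   # about 0.9117 (lower)
```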
The second shortcoming of R² (shared also by adjusted R²) is more a matter of care in interpretation. There can be no specified universal critical value of R²; i.e., you cannot say “acceptable results have an R² of 85%” or any other value. The critical value of the R² statistic will vary with several factors and there are several non-mathematical reasons for variations in R² which make setting a specific target for this statistic inadvisable.
In mass appraisal, we often divide properties into sub-groups and develop separate model equations for each, e.g., for each neighbourhood separately. This reduces the variance among sales prices within each sub-group, and therefore we should not expect MRA (multiple regression analysis) to explain as large a percentage of it as when one equation is fit to the entire jurisdiction.
For example, if one model is developed to estimate sale price for all neighbourhoods in a sales database, there may be $300,000 in variation among the sales prices. A model that explains 80% of the variation still leaves 20%, or $60,000, unexplained.
A model for a single neighbourhood, with only $50,000 variation in sale price, may have an adjusted R² of only 60%, but it will produce better estimates of sales prices in that neighbourhood because 40% of the variation is only $20,000. The standard error and COV (discussed later) will show this improvement.
In regression models generally, improving the standard error and COV is more important than increasing the adjusted R², but you should still aim for an adjusted R² that is as high as possible and a standard error and COV that are as low as possible.
Standard Error of the Estimate
The analyst must not only be able to estimate the equation for the regression line, he or she must also be able to measure how well the regression line fits the points. The techniques provided so far enable the analyst to determine a best fit regression line and measure its overall goodness of fit using R².
However, it is also desirable to find out how well the regression equation fits each individual observation. It may be that the best fit line is very accurate at representing the data, or alternatively, if the data points are highly dispersed, the best fit line may be very poor.
The standard error of the estimate (SEE) is one measure of how good the best fit is, in terms of how large the differences are between the regression line and the actual sample observations. The SEE measures the amount of deviation between actual and predicted sales prices. If the SEE is small, the observations are tightly scattered around the regression line. If the SEE is large, the observations are widely scattered around the regression line. The smaller the standard error, the better the fit.
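As a sketch of where the number comes from, the SEE can be recovered from the model's residuals: it is the square root of the sum of squared errors divided by the degrees of freedom (number of sales minus number of variables minus one). File and column names are again hypothetical.

```python
# Sketch: standard error of the estimate (SEE) from the residuals.
import numpy as np
import pandas as pd

sales = pd.read_csv("condo_sales.csv")
x = sales["Total_Area"].to_numpy()
y = sales["SalePrice"].to_numpy()

slope, intercept = np.polyfit(x, y, 1)
residuals = y - (intercept + slope * x)       # actual minus predicted sale price

n, k = len(y), 1                              # k = number of independent variables
see = np.sqrt(np.sum(residuals ** 2) / (n - k - 1))
print(f"SEE: ${see:,.2f}")                    # about $9,556 with the course data
```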
In our example, we found an SEE of $9,556.24. Note that whereas R² is a percentage figure, the SEE is a dollar figure if the dependent variable is price. Similar to the standard deviation discussion in Lesson 1, assuming the regression errors are normally distributed, approximately 68% of the errors will be $9,556 or less and approximately 95% will be $19,112 or less (see Figure 2.1 in Lesson 2).
In general, you want a small SEE relative to the size of the dependent variable – in our case the selling price. Say, for example, you were running several different potential models for estimating sale price with a variety of variables. You could then compare the R² and SEE for each to see which predicts the most variation in selling price with the least associated error.
The SEE is free from the second interpretive shortcoming of R² mentioned above. In other words, whereas R² evaluates the seriousness of the errors indirectly by comparing them with the variation of the sales prices, the SEE evaluates them directly in dollar terms. The problem with the SEE is that it is an absolute measure, meaning its size alone does not tell you much in itself, and thus it can only be used in comparison to other similar models. However, you can create a further statistic from it that tells you how well you are doing in relative terms in your particular model. By dividing the SEE by the mean of the dependent variable, you get a relative measure called the coefficient of variation or COV.
Coefficient of Variation
In our example, the SEE is $9,556. This would indicate a good predictive model when mean property values are high, but not when they are low. Expressing the SEE as a percentage of the mean sale price removes this source of confusion.
In regression analysis, the coefficient of variation (COV) is the SEE divided by the mean sale price and multiplied by 100, expressing the typical error as a percentage of the mean sale price. The formula is the same as that described in Lesson 1, except that the SEE replaces the s (standard deviation).
Most regression software reports the SEE but not the COV, so we have to calculate it manually. Here, the COV is calculated by dividing the SEE (9,556.24) by the mean of the sale prices (76,593.50) and multiplying by 100, yielding 12.48%. In general, for residential models which have sale price as the dependent variable, a COV of approximately 20% is acceptable, while a COV of approximately 10% indicates a very good result.
At 12.5%, our model’s COV is acceptably small, but not fantastic. This tells us that total square footage of living area does a fairly good job of predicting sale price, but there is more to sale price than just this one variable (as we would expect!).
Our COV implies that, given a normal distribution, roughly two-thirds of sales prices lie within 12.5% of their MRA-predicted values.
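Because the COV is not reported by most regression software, it is worth seeing how little work the manual calculation involves; the figures below are the ones quoted from the SPSS output above.

```python
# Sketch: COV = SEE / mean sale price * 100, using the figures reported above.
see = 9_556.24
mean_sale_price = 76_593.50

cov = see / mean_sale_price * 100
print(f"COV: {cov:.2f}%")    # 12.48%
```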
Correlation Coefficient
The correlation coefficient (r) is the first of two statistics that relate to individual regression variables. As explained in Lesson 1, the correlation coefficient is a measure that indicates the strength of the relationship between two variables. It can take on values from -1.0 to +1.0, ranging from very strong negative correlation to very strong positive correlation, or somewhere in between.
In our example, the correlation between sale price and living area is 0.7696 (rounded to 0.77). This is a moderate level of correlation, approaching strong (0.80 is considered strong). So it seems that our simple estimate is doing a pretty good job (based on this sample data, of course). There is a fairly strong positive linear relationship between square feet and sale price. Given the regression coefficient of $72.08, as the number of square feet increases by 1, the estimated sale price increases by $72.08.
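As a final check, the correlation coefficient is straightforward to reproduce in code; with the course data it should come out near 0.77, and squaring it recovers the R² of roughly 0.59 reported earlier. File and column names are the same hypothetical placeholders used above.

```python
# Sketch: Pearson correlation between sale price and living area.
import pandas as pd

sales = pd.read_csv("condo_sales.csv")
r = sales["SalePrice"].corr(sales["Total_Area"])   # Pearson r
print(round(r, 4), round(r ** 2, 2))               # about 0.77 and 0.59
```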