Lesson 3 Flashcards
You want to “smooth” the data relationship in a scatter plot. How might you do this?
A best-fit trend line is added to the data using an automated least squares regression, which takes a data array with wide variability and “smooths” the result so that the relationship between the variables can be understood clearly and quickly.
What would be the best tool for analyzing the relationship between two continuous variables?
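As a minimal sketch of the least squares trend line, assuming hypothetical unit sizes and sale prices (the numbers and the helper name `fit_line` are illustrative only):

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: unit size (sq ft) and sale price ($000s)
sizes = [600, 750, 800, 950, 1100, 1200]
prices = [310, 365, 410, 450, 540, 575]

slope, intercept = fit_line(sizes, prices)

# The "smoothed" values are the points on the trend line
trend = [intercept + slope * x for x in sizes]
print(slope, intercept)
```

Plotting `trend` over the raw scatter shows the overall direction of the relationship without the point-to-point noise.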
Scatter plots
If the correlation between two variables is strong, is the relationship deemed to be linear?
Not always. Two variables may have high correlation but exhibit a nonlinear relationship.
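A small illustration, using made-up data where y is exactly x squared: the correlation is high, yet the residuals from a straight-line fit show a clear curved pattern.

```python
import math

# Synthetic data: a perfectly nonlinear (quadratic) relationship
xs = list(range(1, 11))
ys = [x ** 2 for x in xs]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
sxx = sum((x - mean_x) ** 2 for x in xs)
syy = sum((y - mean_y) ** 2 for y in ys)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

# Pearson correlation coefficient
r = sxy / math.sqrt(sxx * syy)

# Residuals from the straight-line fit
slope = sxy / sxx
intercept = mean_y - slope * mean_x
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

print(round(r, 3))
print(residuals)
```

The residuals are positive at both ends and negative in the middle, which is the signature of a curve that a straight line cannot capture.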
How can the coefficient of variation be used to determine appropriate units of comparison for real estate data?
COV = Standard deviation / mean
When comparing various combinations of independent and dependent variables, the strongest relationships are those with the lowest COV, indicating that those variables should be tested further.
You hire a consultant to complete a statistical analysis predicting the need for seniors housing in Langley. In reviewing the results, should you focus on the reliability of the forecast in relation to other benchmark data? Or do you need to examine the consultant's interpretation of the underlying data relationships?
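A sketch of the COV comparison, assuming four hypothetical house sales and three candidate units of comparison (all figures are invented for illustration):

```python
from statistics import mean, stdev

def cov(values):
    """Coefficient of variation: standard deviation divided by the mean."""
    return stdev(values) / mean(values)

# Hypothetical sales
prices = [300000, 420000, 510000, 640000]
sizes_sqft = [1000, 1400, 1750, 2100]
bedrooms = [2, 3, 3, 5]

candidates = {
    "total price": prices,
    "price per sq ft": [p / s for p, s in zip(prices, sizes_sqft)],
    "price per bedroom": [p / b for p, b in zip(prices, bedrooms)],
}

covs = {name: cov(values) for name, values in candidates.items()}

# The unit with the lowest COV varies least relative to its mean
best = min(covs, key=covs.get)
print(covs)
print(best)
```

Because COV is unitless, it allows measures on very different scales (dollars, dollars per square foot) to be compared directly.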
Both approaches will be necessary. Data exploration would be necessary to determine the strengths and weaknesses of the statistical analysis. The results may appear reasonable, but end up not adequately supported by the underlying data and analysis.
How could you use visual presentation aids to help a client understand a statistical analysis?
Visual aids, such as graphs, provide the opportunity to simplify complex relationships between data variables so that the key messages about the data become clear.
Real estate terminology is very specialized. A variable describing office building class would be what type of data variable? What possible problems might you experience in relying on this building class variable?
The Building Class variable is an example of an “Ordinal” variable. Each class is related to the other and provides an indication of which class is “better” than another, but not any objective indication of how much “better” one class is in relation to another. The problem with ordinal data variables is that they are often based on subjective interpretation.
Do you agree with the following statement: “The goal of exploratory data analysis is to identify and account for every source of variation in data relationships?”
It is often impossible to account for every source of variation in the relationship between two or more variables. Procedures have been developed in statistical analysis to identify various sources of variation in the relationship between data variables, but there will usually be some random variation that cannot be explained. Data occurrences that do not follow established data relationships are often described as “outliers”.
You are looking at a histogram with a normal distribution. If some data was removed from the dataset, resulting in the median being lower than the mean, how do you think the new histogram would look?
The histogram should be skewed to the right, meaning the data is clumped on the left with a long tail extending right of the median and mean. The extent of the “skewness” would depend on the amount of data removed and the resulting impact on the median and mean. The importance of this point is that the mean, by itself, is not a complete measure of central tendency.
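A quick check of this with made-up numbers: start with a symmetric dataset, remove the lowest values, and compare the mean and median before and after.

```python
from statistics import mean, median

# Synthetic symmetric data: mean and median coincide
symmetric = [10, 20, 20, 30, 30, 30, 40, 40, 50]

# Remove the three lowest observations
trimmed = sorted(symmetric)[3:]

print(mean(symmetric), median(symmetric))
print(mean(trimmed), median(trimmed))
```

After the removal the median sits below the mean, and the remaining values are clumped at the low end with a thinner tail to the right.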
You have a database of recreational lot sales and are forecasting sale price per front foot for a certain size of waterfront lot. How can you account for non-linear data relationships in your forecast?
First, identify the nonlinear data relationships and appropriate units of comparison using data exploration and revelation (graphical analysis). Then convert the data to a linear relationship using logarithms. The data can then be re-tested to determine the strength of the relationship between the logarithmic variables. The coefficient of the independent variable in the regression equation represents the exponential relationship between the two variables.
Consider three sample datasets drawn from the same population of high-rise condo sales in Vancouver. The datasets include a number of variables: sale price, unit size, floor height, view, and parking. The descriptive statistics for each dataset indicate similar mean and median values for sale price per front foot. What steps should you take next in comparing and analyzing the datasets? What should you be attempting to uncover?
Check whether the datasets are indeed comparable or whether they represent disparate markets. Start by graphing each variable to understand its distribution. This is important because a number of statistical measures are based on the assumption that the data follow a normal distribution. Outliers and other anomalies will become apparent through scatter plots and histograms. Recoding data and replotting in a box plot will illustrate patterns in the data and the strength of various relationships between data variables. This initial data exploration is critical if later tasks in model building are to be successful. You may be able to combine the datasets if they appear to be in the same market.
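A sketch of the logarithmic transformation, assuming synthetic data generated from a power law (price per front foot = 5000 × frontage^-0.5). The power law, its constants, and the frontage values are assumptions made purely for illustration.

```python
import math

# Synthetic data from an assumed power law: y = 5000 * x ** -0.5
frontages = [25, 50, 100, 200, 400]
price_per_ff = [5000 * f ** -0.5 for f in frontages]

# Transform both variables with natural logarithms
log_x = [math.log(f) for f in frontages]
log_y = [math.log(p) for p in price_per_ff]

# Least squares fit on the transformed (now linear) data
n = len(log_x)
mean_x = sum(log_x) / n
mean_y = sum(log_y) / n
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(log_x, log_y))
sxx = sum((x - mean_x) ** 2 for x in log_x)
slope = sxy / sxx                    # estimate of the exponent
intercept = mean_y - slope * mean_x  # log of the multiplier

# Forecast for a 150-foot frontage, converted back from logs
forecast = math.exp(intercept + slope * math.log(150))
print(slope, math.exp(intercept), forecast)
```

The fitted slope recovers the exponent and the exponentiated intercept recovers the multiplier; with real sales data the fit would not be exact and the strength of the log-log relationship would need to be re-tested.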
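One way to sketch the outlier screening that a box plot performs visually is the common 1.5 × IQR convention. The rule and the sample values below are assumptions for illustration, not something prescribed by the lesson.

```python
from statistics import quantiles

# Hypothetical sale prices per square foot; the last value is deliberately unusual
price_psf = [400, 420, 450, 460, 480, 500, 520, 950]

# Quartiles (default "exclusive" method of statistics.quantiles)
q1, q2, q3 = quantiles(price_psf, n=4)
iqr = q3 - q1

# Fences beyond which a box plot would draw individual outlier points
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [p for p in price_psf if p < lower_fence or p > upper_fence]
print(q1, q2, q3)
print(outliers)
```

Running the same screen on each of the three datasets gives a quick indication of whether their spreads and anomalies look alike before deciding to combine them.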
If two data variables, say, price per square foot and finished floor space in new housing, had a linear relationship, what would be an easy way of determining the linear regression equation?
Create a scatter plot of the dependent variable (price per square foot) against the independent variable (finished floor space in square feet). From the scatter plot, measure the slope of the regression line and estimate where the line, projected back, intersects the Y-axis at X = 0. This point will be the constant in the equation. The slope will be the regression coefficient. If the regression represents an inverse relationship, the coefficient will be negative.
If the data doesn’t have a linear relationship and strong correlation, what would be the risk of relying on a regression equation where most of the data occurrences were concentrated at one end of the regression with few occurrences at the other end?
The slope of the regression line would be very sensitive to the location of a few data occurrences because of the nature of the least squares calculations. In other words, the coefficient of the independent variable could be dramatically affected by one or two data points, resulting in predictions of the dependent variable that contain high potential error.
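The graphical method can be sketched as follows, assuming two points have been read off the drawn regression line (the coordinates are hypothetical):

```python
# Two points read from the regression line on the scatter plot (hypothetical):
# (finished floor space in sq ft, price per sq ft)
x1, y1 = 1000, 250
x2, y2 = 3000, 200

# Slope = rise over run; this is the regression coefficient
slope = (y2 - y1) / (x2 - x1)

# Project the line back to X = 0 to get the constant
constant = y1 - slope * x1

print(f"price per sq ft = {constant:.1f} + ({slope:.3f}) * floor space")
```

Here the coefficient is negative, reflecting an inverse relationship: price per square foot falls as floor space increases.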
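A sketch of this sensitivity with invented numbers: five observations clustered at low X values, plus a single observation far out at X = 20. Only that one point changes between the two fits.

```python
def ols_slope(xs, ys):
    """Least squares slope of y on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

xs = [1, 2, 3, 4, 5, 20]

# Case A: the distant point lies on the same line as the cluster
ys_a = [10, 11, 12, 13, 14, 29]

# Case B: identical cluster, but the distant point is much lower
ys_b = [10, 11, 12, 13, 14, 15]

slope_a = ols_slope(xs, ys_a)
slope_b = ols_slope(xs, ys_b)
print(slope_a, slope_b)
```

Moving one observation cuts the slope to roughly a fifth of its former value, even though five of the six data points are unchanged.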
Assume you are comparing two datasets for apartment rents in Victoria. One dataset reflects 3-story “walk up” apartment rents in the James Bay community and the other dataset includes similar property rents in another Victoria community, Fernwood. You want to determine the effect of location on rent. Assuming both datasets have a similar structure, what approach would you use to compare the datasets?
Step 1: Explore the descriptive statistics for each dataset by examining the correlation for various combinations of dependent and independent variables. Can the strength of the relationships be clarified by recoding or transforming data variables? Use graphical tools to compare the two datasets. Comparing the two means, assuming the data in both datasets have a similar distribution, will provide a measure of the locational difference.
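A sketch of the comparison of means, using hypothetical monthly rents for each community. The Welch t statistic shown here is one common way to judge the difference relative to the variability; it is an assumption of this example rather than the method named in the lesson.

```python
import math
from statistics import mean, variance

# Hypothetical monthly rents ($) for similar walk-up units
james_bay = [1450, 1500, 1525, 1480, 1550, 1495]
fernwood = [1350, 1400, 1380, 1420, 1365, 1390]

mean_jb = mean(james_bay)
mean_fw = mean(fernwood)

# Difference in means: a first measure of the locational effect
diff = mean_jb - mean_fw

# Welch t statistic: the difference scaled by its standard error
std_err = math.sqrt(variance(james_bay) / len(james_bay)
                    + variance(fernwood) / len(fernwood))
t_stat = diff / std_err

print(mean_jb, mean_fw, diff, t_stat)
```

A large t statistic suggests the gap between the two communities is unlikely to be explained by random variation alone, provided the two datasets really are similar in structure and distribution.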
When exploring data for the first time, preliminary screening is important so you can:
1) seek patterns in the data
2) understand relationships within the data
3) eliminate data you do not need or identify data that seems odd or impossible
4) all of the above
(4) all of the above are key tasks for the initial phase of data exploration