Stats Flashcards
What is an index?
A change compared to a base value expressed in %.
Eg. Today $ / 2006 $ (x100) = 140
Equivalent to a 40% increase over 2006
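A minimal sketch of the index calculation. The function name and the dollar figures are made up for illustration:

```python
def index_value(current: float, base: float) -> float:
    """Index of `current` relative to `base`, with the base set to 100."""
    return current / base * 100

# Hypothetical prices: $350,000 today versus $250,000 in 2006.
idx = index_value(350_000, 250_000)
pct_change = idx - 100  # percentage change over the base year
print(idx, pct_change)
```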
Compare/contrast percentage vs. Index
A percentage compares a value only to its own point of reference. An index standardizes the point of reference.
Absolute vs. Relative frequencies
Absolute frequency is the exact count of observations in each value or range (Eg. 10-20). Relative frequency is that count as a share of the total.
Line graph vs. Histogram
Line graph: used for relative frequencies
Histogram: one bar per range, centered on the midpoint of the range, height showing frequency, touching the neighbouring bars
EQ: mean (what, how to calc)
Sum / # of entries
Gives average value
Calc: [Σ+] after each value, then the [x̄,ȳ] key
EQ: weighted arithmetic mean
Sum of (variable x weight) / sum of weights
Gives average
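A short sketch of both means, using made-up scores and weights. The function names are illustrative only:

```python
def mean(values):
    """Arithmetic mean: sum divided by number of entries."""
    return sum(values) / len(values)

def weighted_mean(values, weights):
    """Sum of (value x weight) divided by the sum of the weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Hypothetical data: the third score counts double.
scores = [70, 80, 90]
weights = [1, 1, 2]
plain = mean(scores)
weighted = weighted_mean(scores, weights)
print(plain, weighted)
```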
Mean vs median
Mean higher than median means bulk of data is lower and there is a right side tail (high potential outliers)
Mean lower than median means bulk of data is higher and there is a left side tail (low potential outliers)
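A small illustration of the right-tail case, using hypothetical sale prices (in $1,000s) with one high outlier:

```python
from statistics import mean, median

# Hypothetical sale prices in $1,000s; 900 is a high outlier.
prices = [200, 210, 220, 230, 900]
avg = mean(prices)    # pulled upward by the outlier
mid = median(prices)  # middle value, not affected by the outlier
print(avg, mid, avg > mid)
```

Here the mean sits above the median, which signals a right side tail.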
When is mode useful?
Non-numeric data. Eg. Fav colours
EQ: Standard deviation
Symbol is σ (looks like an o with a tail)
For each value: subtract the mean and square the result. Sum the squares, divide by n, then take the square root
Gives how tightly clustered the values are. High = flat curve (loose)
Calc: enter values with [Σ+], then the [σx,σy] key
Standard deviation intervals for a normal distribution
1 = 68.27%
2 = 95.45%
3 = 99.73%
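These percentages can be checked with the error function from Python's standard library. This is a sketch; the helper name is made up:

```python
import math

def share_within(k: float) -> float:
    """Share of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(k, round(share_within(k) * 100, 2))
```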
EQ: Variance
Sum of (value - mean)^2 / n
OR std dev squared
Not very useful
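A minimal sketch of variance and standard deviation as described on these cards (dividing by n, i.e. the population form). The data is made up:

```python
import math

def population_variance(values):
    """Sum of (value - mean)^2, divided by n."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

def population_std_dev(values):
    """Square root of the population variance."""
    return math.sqrt(population_variance(values))

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values, mean = 5
var = population_variance(data)
sd = population_std_dev(data)
print(var, sd)
```

The variance is in squared units, while the standard deviation is back in the units of the data.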
EQ: coefficient of variation
SINGLE VARIABLE
= std dev / mean x 100
Shows how big the standard deviation is compared to the mean. Larger means more dispersed or more variable
BIVARIATE
=SEE / mean (dependent variable) x 100
tells us the size of the error as compared to the mean (%)
MULTIVARIATE
=same as bivariate
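A sketch of both versions of the coefficient of variation. The function names, the data, and the SEE and mean sale price are assumptions for illustration:

```python
from statistics import mean, pstdev

def cov_single(values):
    """Single variable: std dev / mean x 100."""
    return pstdev(values) / mean(values) * 100

def cov_regression(see, mean_dependent):
    """Bivariate or multivariate: SEE / mean of dependent variable x 100."""
    return see / mean_dependent * 100

data = [2, 4, 4, 4, 5, 5, 7, 9]         # hypothetical values
single = cov_single(data)
reg = cov_regression(15_000, 300_000)   # hypothetical SEE and mean sale price
print(single, reg)
```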
When is the median preferable to the mean?
- Open-ended frequency group
- Extreme values
EQ: Linear correlation coefficient (r)
Shows the relationship between two variables. Value from -1 to 1. Zero means no linear correlation. |r| > 0.8 strong, |r| < 0.4 weak.
BIVARIATE
Calc: x1 INPUT y1 [Σ+] for each pair, then [x̂,r], SWAP
Cons: cannot describe non-linear relationships
Multivariate counterpart is coefficient of determination (R^2)
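A sketch of r computed by hand, using the usual Pearson formula. The data sets are made up; the second one shows the weakness noted above, where a perfect non-linear relationship still gives r = 0:

```python
import math

def pearson_r(xs, ys):
    """Linear correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

r_line = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])          # perfect straight line
r_curve = pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])  # y = x^2, not linear
print(r_line, r_curve)
```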
EQ: Sum of least squares (S)
BIVARIATE
(Actual - predicted)^2, then sum
Univariate would be variance before dividing by n
How to solve for a bivariate linear equation
x INPUT y [Σ+] for each pair
0 [ŷ,m] gives y-intercept
SWAP gives slope
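The same slope and y-intercept can be worked out by hand with the ordinary least-squares formulas. This is a sketch with made-up data, not the calculator's internal method:

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = intercept + slope * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    intercept = my - slope * mx
    return slope, intercept

# Hypothetical points lying exactly on y = 1 + 2x.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept)
```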
EQ: Standard error of the estimates (Syx) or (SEE)
BIVARIATE is Syx
Equation 11.7
Suggests prediction error. Is in the same unit as the variable. Lower is better
MULTIVARIATE
also known as Root Mean Square Error
Analogous to the standard deviation of the regression errors, with the same % for a normal distribution (68.27%, 95.45%, 99.73%). Shows how scattered the errors are. Low = close
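A sketch of the sum of squared errors and the SEE. Equation 11.7 is not reproduced on this card, so the code assumes the common textbook form, which divides by n - k - 1 (k = number of independent variables, so n - 2 for bivariate). The data is made up:

```python
import math

def sum_squared_errors(actual, predicted):
    """Sum of (actual - predicted)^2."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted))

def standard_error_of_estimate(actual, predicted, k=1):
    """SEE, assuming n - k - 1 degrees of freedom (an assumption, see note above)."""
    n = len(actual)
    return math.sqrt(sum_squared_errors(actual, predicted) / (n - k - 1))

actual = [10, 12, 15, 19]       # hypothetical observed values
predicted = [11, 12, 14, 19]    # hypothetical model predictions
sse = sum_squared_errors(actual, predicted)
see = standard_error_of_estimate(actual, predicted, k=1)
print(sse, see)
```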
How do you turn a non-numerical variable into a number?
Turn each option into a yes(1)/no(0)
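A minimal sketch of the yes(1)/no(0) conversion. The helper name and the `is_` column prefix are made up for illustration:

```python
def to_dummies(values):
    """Turn each distinct option into its own 1/0 column."""
    options = sorted(set(values))
    return [{f"is_{opt}": int(v == opt) for opt in options} for v in values]

# Hypothetical exterior types for three properties.
rows = to_dummies(["brick", "wood", "brick"])
print(rows)
```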
What are the four “goodness of fit” variables?
Coefficient of determination (R^2)
Standard error of the estimate (SEE)
Coefficient of variation (COV)
F-value
What are the two statistics that relate to the importance of individual variables?
Correlation coefficient (r)
t-statistic
EQ: coefficient of determination (R^2) and adjusted R^2
MULTIVARIATE
=correlation coefficient (r) squared
Shows how well the regression model explains the variation in the dependent variable in %. 0 (low) to 1 (high).
Weaknesses:
1. can only go higher with more variables added. Goodness of fit could be overstated by adding many insignificant variables. (Corrected by using adjusted R^2)
2. Every model is different, so no benchmarks for fit.
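A sketch of R^2 and adjusted R^2, assuming the usual definitions: R^2 = 1 - (unexplained variation / total variation), and the adjustment penalizes each added variable (k = number of independent variables). The data is made up:

```python
def r_squared(actual, predicted):
    """1 - SSE / SST."""
    m = sum(actual) / len(actual)
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    sst = sum((a - m) ** 2 for a in actual)
    return 1 - sse / sst

def adjusted_r_squared(r2, n, k):
    """Adjust R^2 for n observations and k independent variables."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

actual = [10, 12, 15, 19]       # hypothetical observed values
predicted = [11, 12, 14, 19]    # hypothetical model predictions
r2 = r_squared(actual, predicted)
adj = adjusted_r_squared(r2, n=len(actual), k=1)
print(round(r2, 4), round(adj, 4))
```

The adjusted figure is always at or below the plain R^2, and the gap widens as variables are added without improving the fit.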
To improve multiple regression model, what should you look at?
First, SEE and COV, then R^2
What are strata and what is the effect on R^2?
Strata are groups made before modeling. Then each group gets a model. Eg. Neighbourhoods.
R^2 may be lower because a large part of the variation is already removed by the strata. What is left to be modeled forms the new basis for R^2
EQ: f-value
(formula, meaning, benchmark, weakness)
= variance explained by regression divided by unexplained by regression
Is the model useful or no more useful than using the mean?
Tests whether the model is no better than the mean. F < 4 = not significant
F > 4 = significant
Sensitive to the number of variables and observations. Many variables with few observations generally give F < 4
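A sketch of the F-value, assuming the usual form where each sum of squares is divided by its degrees of freedom (k for the regression, n - k - 1 for the residuals). The sums of squares are made up:

```python
def f_value(ss_regression, ss_residual, n, k):
    """Explained variance per variable divided by unexplained variance per degree of freedom."""
    explained = ss_regression / k
    unexplained = ss_residual / (n - k - 1)
    return explained / unexplained

# Hypothetical sums of squares for a model with 1 variable and 4 observations.
f = f_value(ss_regression=44, ss_residual=2, n=4, k=1)
print(f, "significant" if f > 4 else "not significant")
```

Raising k while n stays small shrinks both the numerator and the residual degrees of freedom, which is why many variables with few observations tend to give a low F.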
EQ: t-statistic
Confidence. How sure are we that the coefficient is NOT zero? Bigger=better.
Outside of +-2.58 = 99%
Outside of +-1.96 = 95%
Outside of +-1.64 = 90%
Significance level 0.10 means 90% confident it is not zero. 0.05 is reliable.
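A sketch of where the cut-offs come from, plus a t-statistic computed as coefficient divided by its standard error. It assumes a large sample, so the normal distribution stands in for the t-distribution; the coefficient and standard error are made up:

```python
from statistics import NormalDist

def critical_value(confidence: float) -> float:
    """Two-sided cut-off for a given confidence level (large-sample approximation)."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)

def t_statistic(coefficient: float, standard_error: float) -> float:
    """How many standard errors the coefficient is away from zero."""
    return coefficient / standard_error

cutoffs = [round(critical_value(c), 2) for c in (0.90, 0.95, 0.99)]
t = t_statistic(150.0, 50.0)  # hypothetical coefficient and standard error
print(cutoffs, t, abs(t) > critical_value(0.99))
```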
Quick check list regression outputs:
- Coefficients have expected signs
- t-stat outside +-1.64, significance less than 0.10
- f-value greater than 4
- SEE approaching zero
- COV less than 20%, under 10 ideal
- adjusted R^2 close to 1 (above 0.8)
EQ: aggregate ratio
= sum of assessment / sum of sale
Susceptible to sampling error, high outliers.
(Versus mean ASR, which is the sum of (assessment / sale) divided by n and gives all observations equal weight.)
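A sketch contrasting the two ratios on made-up assessments and sale prices. The high-priced sale dominates the aggregate ratio, while the mean ASR weights each sale equally:

```python
def aggregate_ratio(assessments, sales):
    """Sum of assessments divided by sum of sale prices."""
    return sum(assessments) / sum(sales)

def mean_asr(assessments, sales):
    """Average of the individual assessment-to-sale ratios."""
    ratios = [a / s for a, s in zip(assessments, sales)]
    return sum(ratios) / len(ratios)

# Hypothetical sample: the most expensive property is assessed at 80% of its sale price.
assessments = [100_000, 200_000, 400_000]
sales = [100_000, 200_000, 500_000]
agg = aggregate_ratio(assessments, sales)
avg = mean_asr(assessments, sales)
print(round(agg, 4), round(avg, 4))
```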
Explain percentiles and quartiles
Dividing points in a data set. Median is 50th percentile and 2nd quartile
EQ: average absolute deviation (AAD)
=(ASR - median) sum absolute values, divide by n
Gives average spread similar to std dev but with median
EQ: coefficient of dispersion
(Formula, benchmark, weakness)
=AAD / median x 100
Makes comparable across groups
Good is 15 or less
Weakness: Cannot state probability of accuracy of a given assessment
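A sketch of AAD and COD on made-up ASRs, with the COD expressed as a percentage of the median so it can be read against the benchmark of 15 or less:

```python
from statistics import median

def average_absolute_deviation(ratios):
    """Average absolute distance of each ASR from the median ASR."""
    med = median(ratios)
    return sum(abs(r - med) for r in ratios) / len(ratios)

def coefficient_of_dispersion(ratios):
    """AAD as a percentage of the median ASR."""
    return average_absolute_deviation(ratios) / median(ratios) * 100

ratios = [0.8, 0.9, 1.0, 1.1, 1.2]  # hypothetical ASRs
aad = average_absolute_deviation(ratios)
cod = coefficient_of_dispersion(ratios)
print(round(aad, 2), round(cod, 1))
```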
EQ: price related differential (PRD) (formula, interpretation, benchmark)
= mean ASR divided by aggregate ratio
Greater than 1.00 means high-value properties are under-assessed (regressive)
Less than 1.00 means high-value properties are over-assessed (progressive)
Optimum: 0.98 <> 1.03
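A sketch of the PRD on made-up data in which the most expensive property is under-assessed, so the result lands above 1.03:

```python
def price_related_differential(assessments, sales):
    """Mean ASR divided by the aggregate ratio."""
    ratios = [a / s for a, s in zip(assessments, sales)]
    mean_ratio = sum(ratios) / len(ratios)
    aggregate = sum(assessments) / sum(sales)
    return mean_ratio / aggregate

# Hypothetical sample: the $500,000 sale is assessed at only 80%.
assessments = [100_000, 200_000, 400_000]
sales = [100_000, 200_000, 500_000]
prd = price_related_differential(assessments, sales)
print(round(prd, 4), "regressive" if prd > 1.03 else "within or below range")
```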
IAAO standards (ASR, COD)
- ASR 0.90 <> 1.10
- ASR for each stratum should be within 5% of the overall ASR
- SF res should be COD 5 <> 15
- ICI should be COD 5 <> 20
- vacant land COD 5 <> 25
Standard error of the mean (σx̄)
σx̄ = σ (population) / sqrt(n)
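A short sketch of the formula with made-up numbers. Quadrupling the sample size halves the standard error:

```python
import math

def standard_error_of_mean(population_std_dev: float, n: int) -> float:
    """Population standard deviation divided by the square root of the sample size."""
    return population_std_dev / math.sqrt(n)

# Hypothetical population std dev of 2.0 with two sample sizes.
sem_16 = standard_error_of_mean(2.0, 16)
sem_64 = standard_error_of_mean(2.0, 64)
print(sem_16, sem_64)
```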