STAT 102 Midterm Flashcards
Jane Jacobs (add more from lecture 2)
Cities are about the relationships and interactivity between people.
Safety thru eyes on the street.
Wanted mixed use neighborhoods.
Available urban data (add the questions at beginning of lecture)
Crime
Economics
Demographics
Land use
Variable
Characteristic that takes on different values for different individuals.
Categorical variables
Place an individual into one of several groups.
Example: race, gender, types of crime.
Can’t do direct mathematical operation on categorical variables
Continuous variables
Take on numerical values across an entire range.
Examples: height, income, population
Can do direct math with these.
How to visualize categorical data
Bar plot and pie chart.
Bar plot
Best way to visualize categorical data.
X axis is the categories
Y axis is the frequency or relative frequency of each category.
Pie chart
Another way to visualize categorical variables.
Problem is it is harder to see the actual totals of each category and distinguish small categories.
Distribution of a continuous variable
Describes what values a variable takes and how frequently these values occur.
Described in terms of center, spread, shape, outliers.
How to visualize continuous variables
Histogram, box plot, maps, linear regression
Box plots are better for center and spread and identifying outliers
Histograms are better for looking at the shape –>
Skewness and multi-modality
Histogram
Same idea as bar plots, but for continuous variables.
Divide the x axis into bins. Y axis is the frequency or count of each bin
Boxplot
Box contains central 50% of values
Line in middle is median, 50% of values on each side.
Whiskers have rest of distribution except for outliers.
Outliers are suspiciously large or small values (below Q1 - 1.5*IQR or above Q3 + 1.5*IQR)
Left vs right skewed
Skewed toward the side with the longer tail.
Left skewed means long left tail. When left skewed, mean < median.
Right skewed means long right tail. Income is typically right skewed.
When right skewed, mean > median
Mean vs median
Mean used to measure center for approx normal dist.
Sum of data / n
Affected by large outliers and asymmetry so use median if skewed.
Median is the middle value, more robust measure of center. 50th percentile. Use for skewed data.
How to measure spread of dist
Variance, standard deviation, IQR.
use iqr for skewed, variance and SD for approx norm
Variance / standard deviation
Spread for a normal dist.
Variance is the average of the squared deviations of each observation.
SUM (x - mean)^2 / (n-1)
Standard deviation is square root of variance. Where you expect most data to lie.
IQR
used for skewed data. Robust measure of spread.
Q3-Q1
75th percentile - 25th percentile.
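A quick numpy sketch of center and spread for skewed data (the income values here are hypothetical, just to show why the median and IQR are the robust choices):

```python
import numpy as np

# Hypothetical skewed sample (e.g., neighborhood incomes in $1000s)
x = np.array([22, 25, 27, 30, 31, 33, 35, 40, 55, 120])

q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1                       # robust measure of spread: Q3 - Q1

print("median:", np.median(x))      # robust center
print("mean:", round(x.mean(), 1))  # pulled upward by the 120 outlier
print("IQR:", iqr)
```

Note the mean lands well above the median here, exactly the mean > median pattern expected for right-skewed data.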
Log transformation of data
Used to make skewed distributions more normal. Based on the magnitude of the value rather than the actual value.
Log = natural logarithm
Log 10 = log with base 10
Preserves the ordering of the values–> median is still the median. Not the case with mean though.
Scatter plots
The primary way to visualize the relationship between two continuous variables.
Correlation
Value between -1 and 1 that provides the sign and strength of a linear association between 2 variables.
Correlation coefficient –> how much things vary around a line.
Only appropriate for linear relationships.
R
Side by side box plots
Good way to visualize how a continuous variable changes across different categories.
Categories on x axis and numerical count on y axis.
See if differences are significant by using probability modeling.
Random variables
Used to represent an uncertain quantity or data point. Each value of a random variable has a certain probability between 0 and 1
Marginal vs conditional prob
Marginal probabilities are used to model the uncertainty in a single random variable.
P(X=x) is the prob that X will take on the value x
Conditional prob are used to model how the distribution of one random variable changes based on the value of another random variable.
P(X=x | Y=y) is the prob that X will take on the value x given that Y has the value y.
P (below poverty line | children >0) = P (BPL And c>0) / P(c>0)
P(A|B)=P(A&B)/P(B)
Normal distribution
Used to model a continuous distribution in order to obtain probabilities to the left or right of a value or within a certain range. Standard normal has mean 0 and standard deviation 1. Denoted N (mu, sigma)
Standardization and Z score
Standardization is transforming any non-standard normal distribution into a standard one.
If X has a nonstandard N dist, convert to Z –>
Z = (X-mu) / sigma
Example with normal probability:
Say we have a mean of 9.85 and sigma .55
Prob (X<9.43)?
Z = (x - mu) / sigma
= (9.43-9.85)/.55
= -.76
Prob (z<-.76) –> use table or calculator = .22
So, if you were to select a Philadelphia neighborhood, there would be a 22% chance that neighborhood has median income below poverty line.
Can also do reverse process–> how high do you need to be in top 10% of neighborhoods–> find the z score for when p=.9
Z= 1.28 = (X - 9.85)/.55
X = 10.55
Put back on the original scale to $38330
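The worked example above can be checked with scipy in place of a z table (mean and sigma taken from the notes; scipy is an assumed dependency):

```python
from math import exp
from scipy.stats import norm

mu, sigma = 9.85, 0.55   # log median income, from the example above

# P(X < 9.43): standardize, then use the normal CDF instead of a table
z = (9.43 - mu) / sigma  # about -0.76
p = norm.cdf(z)          # about .22

# Reverse process: log income needed to be in the top 10% of neighborhoods
x_cut = mu + norm.ppf(0.90) * sigma  # z* ~ 1.28
dollars = exp(x_cut)                 # back on the original (dollar) scale
```

Using the exact z* instead of the rounded 1.28 gives a dollar cutoff a bit above the $38330 in the notes, which is just rounding.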
Sampling distribution (spread and standard error)
Sampling distribution used if we want to make probability statements about an entire sample of units.
Say we want to take a sample of 10 west Philly neighborhoods. What are the chances that the mean of median income across those 10 neighborhoods is below the poverty line?
Less variation in mean across several hoods than any single hood.
Variance (x bar) = sigma^2 / n
Standard deviation (x bar) = sigma / SQRT (n)
As n increases, spread decreases.
Use sample standard deviation instead of sigma because we don’t know sigma usually –> Standard error (x bar) = s / SQRT (n)
So, for that question, remember X =9.43, mu = 9.85, s = .55, and n = 10
So, Z = (9.43-9.85) / (.55/SQRT 10) = -2.41
P(Z<-2.41)=.008
Much smaller probability!
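The same question for a sample mean, using the standard error sigma/sqrt(n) (values from the notes; scipy assumed):

```python
from math import sqrt
from scipy.stats import norm

mu, sigma, n = 9.85, 0.55, 10
se = sigma / sqrt(n)     # spread of the sample mean shrinks with n

z = (9.43 - mu) / se     # about -2.41
p = norm.cdf(z)          # about .008, much smaller than the .22 for one neighborhood
```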
Central limit theorem
If n is large enough, then the sample mean has an approx normal distribution, no matter what shape the distribution was originally.
Confidence intervals
Used to summarize the uncertainty in any particular value that is calculated from data.
If we view W. philly neighborhoods as a sample from a population of similar neighborhoods across the U.S., we can use our sample mean as an estimate for the entire population of similar neighborhoods. Need an interval around this mean for the likely values of the true population mean.
Confidence interval for a population mean
Ex. Mean of log median incomes with n=10 is 10.23. Standard deviation = .55
(x bar - t* (s/SQRT n), x bar + t* (s/SQRT n))
Interval is centered at the sample mean
Width of the interval is a multiple of the standard error
t* comes from t distribution with (n-1) degrees of freedom.
For our example, t* with (n-1)=9 degrees of freedom is 2.26
10.23 +/- 2.26* (.55/SQRT 10) = (9.84,10.62)
Transform these back to the original scale. (18770,40946)
We are 95% confident that the average median income across all similar neighborhoods in the U.S. is between $18770 and $40946.
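The interval above can be reproduced with scipy's t distribution (sample values from the notes; scipy assumed):

```python
from math import sqrt, exp
from scipy.stats import t

xbar, s, n = 10.23, 0.55, 10
tstar = t.ppf(0.975, df=n - 1)   # ~2.26 for 95% confidence with 9 df
margin = tstar * s / sqrt(n)     # width is a multiple of the standard error

lo, hi = xbar - margin, xbar + margin        # ~ (9.84, 10.62) on the log scale
lo_dollars, hi_dollars = exp(lo), exp(hi)    # back-transform to dollars
```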
Hypothesis testing
Used to establish whether a difference calculated from data is large relative to the uncertainty in that data.
Start with a specific hypothesis –> example:
West Philly neighborhoods have an average median income of $35525. The average median income across all census blocks in the country is $42750. Is this difference substantial enough for us to say that West Philly is poorer than the rest of the country?
Question is if the difference in these values is large enough relative to the variability in the data so that west Philly is significantly less than national average.
Hypothesis test steps
- Form a null/ alternative hypothesis
- Calculate a test statistic
- Calculate the p value
Null vs alternative hypothesis
Null is Ho, an assumption that there is no effect or no difference. Usually when the mean = something
Alternative is Ha, saying there is a difference. Usually if mean doesn’t equal or is >/< than something
Test statistic
Test statistic summarizes the discrepancy between the observed data and Ho
Answers the question–> “How many standard errors is our observed sample variable from the hypothesized value?”
T = (x bar - mu naught) / (s / SQRT n)
X bar is the value from our sample. Mu naught is the value that Ho says mu equals. S is the standard deviation of the sample data. N is the sample size.
P value
Assuming the null hypothesis is true, the p value is the prob of getting a test statistic at least as extreme as our observed test stat
If p< alpha (usually .05), then we reject Ho and conclude Ha.
If p> alpha, then we fail to reject Ho and cannot conclude Ha.
Hypothesis test example:
West Philly neighborhoods have an average median income of 10.23 on the log scale. The average median income across all census blocks in the country is 10.66 on log scale. Is this difference significant enough to say west Philly is poorer than rest of the nation?
S= .55 and n= 10
Ho: mu = 10.66
Ha: mu < 10.66
T = (x bar - mu) / (s/SQRT n) = ((10.23-10.66))/ (.55/SQRT 10) = -2.47
P (T<-2.47) with df = n-1 = 9 = .0178
P < alpha, so we reject Ho and conclude there is a statistically significant difference between our sample mean and our null hypothesis mean. In other words, we conclude that West Philly neighborhoods have, on average, significantly lower median incomes than the US national average.
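The whole test above in a few lines of scipy (values from the example; scipy assumed):

```python
from math import sqrt
from scipy.stats import t

xbar, mu0, s, n = 10.23, 10.66, 0.55, 10   # Ho: mu = 10.66, Ha: mu < 10.66

t_stat = (xbar - mu0) / (s / sqrt(n))      # about -2.47
p = t.cdf(t_stat, df=n - 1)                # one-sided p value, about .018

reject = p < 0.05                          # reject Ho, conclude Ha
```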
One sided vs two sided testing
One sided means Ha is > or <. Two sided is not equal to a value.
If you use a two sided test, need to double the p value because both tails.
More conservative to use a two- sided alternative since a doubled p value makes it harder to reject Ho.
Connection between intervals and tests
If confidence level C is equal to 1 - alpha, then:
A two sided hypothesis test rejects the null hypothesis (mu = mu naught) if our hypothesized value mu naught falls outside the confidence interval for mu.
Ex: we got 95% CI for log of median income based on West Philly to be (9.84,10.62). Since national average is mu naught = 10.66, which is not in this interval, we could have inferred it would be rejected before even doing the hypothesis test.
Type I error
False positive
Reject Ho but Ho is true.
Ex: convicting an innocent person–> Ho says he’s innocent.
We reject Ho (say he’s guilty) but he’s really innocent.
That’s bad.
Probability of type I error = alpha value.
Means that, on average, we will accept 1 false positive for every 20 tests (alpha = .05)
Type 2 error
Fail to reject Ho but Ho is false.
False negative.
He’s actually guilty, but we don’t convict him.
Probability of this is inversely related to alpha
Multiple testing and Bonferroni correction
If we test lots of hypotheses at the same time, we could run into problems. If we tested our W. philly income values separately against income values from 1000 different neighborhoods across the U.S., we would expect 50 false positives with an alpha value of .05
So, we use the Bonferroni correction because that’s a lot of errors.
Use alpha = .05/ m where m is the number of tests being done.
More conservative because harder to reject Ho.
Contingency tables
To graphically summarize the relationship between two categorical variables.
Each cell is one particular combination of both categorical variables.
Marginal and conditional probabilities and sign for association
Marginal prob of poverty –> P (poor) = .137
Conditional prob of poverty –> P(poor | black) = P (poor and black)/P(black) = .072/.449 = .160
The fact that these probabilities are different is evidence there is an association between race and poverty.
If they were the same then there’d be no association, as the given variable wouldn’t influence the prob.
Chi squared test for association
Use to see if association between two categorical variables
Expected counts vs observed counts–>
Table of expected counts by multiplying margins and dividing by total
X^2 = sum of ((observed - expected)^2 / expected)
Ho: no association between variables
Ha: association
(r-1)(c-1) degrees of freedom for the chi squared distribution
On JMP output, use the likelihood ratio chi squared value
Fisher's exact test is an alternative when n is small
One problem is the statistic depends on both the number of observations and the number of categories –> difficult to compare strength of association across different datasets.
So, use Cramer’s V
Cramer’s V
To scale chi squared so it’s on a 0-1 scale
V = SQRT (chi squared / (n * min(r-1, c-1)))
If V = 0, then no association.
Large values mean strong association.
Still need chi squared test to determine if there’s an association.
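Both steps in one scipy sketch, on a made-up 2x2 table (counts are hypothetical, not the Philly data; scipy assumed):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical contingency table: rows = dominant race, cols = poor / not poor
observed = np.array([[40, 210],
                     [15, 235]])

chi2, p, dof, expected = chi2_contingency(observed)  # dof = (r-1)(c-1)

# Cramer's V rescales chi squared onto a 0-1 scale
n = observed.sum()
r, c = observed.shape
v = np.sqrt(chi2 / (n * min(r - 1, c - 1)))
```

The `expected` table is the multiply-margins-divide-by-total table from the notes, so you can eyeball observed vs expected counts directly.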
Summary of association between 2 categorical variables
Contingency table for observed counts
Measure strength of association by using chi squared and Cramer’s V
Chi squared test to see if observed association is stat significant
Significant association between dominant race of neighborhood and poverty level
Comparing two proportions
2 sample z test for prop determines if there is a significant difference btwn proportions.
2 sample z test for prop
For two different groups, test whether or not the two prop are different
Ho: p1=p2
Ha: p1 not equal, >, or < p2
Test stat: Z = (p hat 1 - p hat 2) / SE
SE = SQRT ( (p hat 1(1- p hat 1) / n1) + (p hat 2 (1-p hat 2) / n2))
Used to compare proportions between two specific categories and test for significant difference between those 2 prop.
Use 2 sided z test (multiply p value by 2)
Confidence interval for diff in prop
Conf interval–>
(p hat 1 - p hat 2) +/- Z* SQRT ((p hat 1 * q1 / n1) + (p hat 2 * q2 / n2)), where q = 1 - p hat
Interval centered at the difference between prop
Z* is 1.96 for 95% CI
For the difference in proportion of very poor neighborhoods across race, we see predominantly black neighborhoods are between 2.9% and 7.9% more likely to be in very poor category. CI is (.029,.079)
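A sketch of the two-proportion z test and CI with hypothetical counts (not the actual class data; scipy assumed for the normal CDF):

```python
from math import sqrt
from scipy.stats import norm

# Hypothetical counts of "very poor" neighborhoods in each group
x1, n1 = 54, 449     # group 1
x2, n2 = 11, 551     # group 2

p1, p2 = x1 / n1, x2 / n2
se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

z = (p1 - p2) / se
p_value = 2 * norm.sf(abs(z))    # two-sided: double the tail probability

# 95% CI for the difference in proportions, centered at p1 - p2
lo = (p1 - p2) - 1.96 * se
hi = (p1 - p2) + 1.96 * se
```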
Comparing continuous variable between different categories of categorical variable
Use two sample t test for means when comparing two continuous distributions. Compare the means
Two sample t test for means
When comparing 2 cont distributions, focus on comparing the means
Ho: mu 1 = mu 2
Ha: mu 1 doesn’t equal mu 2
Test stat: T = (X bar 1 - X bar 2) / SE
SE = SQRT ((s1^2/n1) + (s2^2/n2)). Make sure I can use the JMP output to do this
Get the test stat and T dist. Df approx is minimum of (n1, n2) - 1
Test for very poor vs wealthy –> p < .0001
Therefore, reject Ho and conclude very poor neighborhoods have significantly different violent crime than wealthy neighborhoods.
Confidence interval for difference in means
(x bar 1 - x bar 2) +/- t* SQRT ((s1^2/n1) + (s2^2/n2))
Centered at diff between means
T* comes from software. Good approx for df is minimum of n1 and n2, minus 1
95% CI for diff in mean violent crime between very poor and wealthy neighborhoods:
(44.48,99.86)
Very poor neighborhoods have had somewhere between 45 and 100 more violent crimes on average than wealthy neighborhoods.
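The two-sample t test is one call in scipy; this sketch uses simulated crime counts (the group means/SDs are made up, not the class data):

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)

# Hypothetical 10-year violent crime counts for two groups of neighborhoods
very_poor = rng.normal(loc=300, scale=50, size=40)
wealthy = rng.normal(loc=150, scale=40, size=35)

# Welch's two-sample t test (does not assume equal variances)
t_stat, p_value = ttest_ind(very_poor, wealthy, equal_var=False)

# SE from the notes: SQRT(s1^2/n1 + s2^2/n2)
se = np.sqrt(very_poor.var(ddof=1) / len(very_poor)
             + wealthy.var(ddof=1) / len(wealthy))
diff = very_poor.mean() - wealthy.mean()
```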
Summary of comparison across categories
Examined the different types of comparisons you can do between groups or specific categories of a categorical variable.
If we are comparing a binary variable between 2 groups, do a 2 sample z test for prop and confidence interval for diff in prop
If we are comparing cont variable between two groups, do two sample t test of means and conf interval for diff in means.
Comparing two continuous variables
Scatter Plot!
How to describe scatter plot
Direction, linear?, strength
Positive/negative association
Roughly linear
Weak or strong
Covariance between 2 variables
The joint spread of 2 variables is captured by the covariance
The measure of the strength of a linear association between 2 cont. variables
More oval shape suggests association rather than circle of points randomly around the origin.
Depends on the units of both variables.
= 1/(n-1) * sum of (xi - x bar)*(yi - y bar)
Correlation between two variables
Unit-less measure of the strength of a linear association by scaling by the standard deviation of each variable.
r = Cov (X,Y) / (SD(X)*SD(Y))
Range between -1 and 1
0 means no linear relationship, 1 means exactly positive linear
Only used if relationship is roughly linear.
High correlation does not imply a linear relationship (look at residual)
Low correlation implies no linear relation, not no relation.
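Covariance scaled by both standard deviations gives the correlation; a small numpy check (x and y values are hypothetical poverty rates and crime counts):

```python
import numpy as np

x = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6])   # hypothetical poverty rates
y = np.array([60, 95, 130, 150, 210, 230])     # hypothetical crime counts

# r = Cov(X, Y) / (SD(X) * SD(Y)), unit-less and between -1 and 1
cov_xy = np.cov(x, y, ddof=1)[0, 1]
r_manual = cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))

r_builtin = np.corrcoef(x, y)[0, 1]            # same number, one call
```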
Hypothesis testing for a correlation
Testing for a significant linear association is equivalent to testing for a correlation that is significantly different from 0
Ho: rho = 0 vs Ha: rho not equal to 0
Rho is the true (population) correlation
Test stat: T = (r - 0) / SE(r)
Standard error is too complicated to calc by hand
Small p value suggests we reject Ho, there is a significant association between poverty and violent crime.
Least square regression line
Best fit line. Statistical model that represents the relationship between two variables, which we can use for extra insight as well as prediction.
Want the line with the smallest sum of squared residuals.
SSR = sum of (yi - y hat i)^2
Or sum of (yi - (Beta naught + Beta 1 * xi))^2
Sum of squared residuals.
Equation of a linear relationship
Yi = Beta naught + Beta 1 * xi
Beta naught is the intercept
Beta 1 is the slope
They are both regression coefficients
Residuals
Error between observed and estimated Y values from the equation of the line
Y- y hat
Actual - estimated
Slope and intercept best fit line
Slope: b1 = r * sy/sx
Intercept: bo = y bar - b1 * x bar
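The slope/intercept formulas match what least squares gives; a quick numpy check (same hypothetical poverty/crime numbers as before):

```python
import numpy as np

x = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6])   # hypothetical poverty rates
y = np.array([60, 95, 130, 150, 210, 230])     # hypothetical crime counts

r = np.corrcoef(x, y)[0, 1]
b1 = r * np.std(y, ddof=1) / np.std(x, ddof=1)  # slope = r * sy/sx
b0 = y.mean() - b1 * x.mean()                   # intercept = y bar - b1 * x bar

# Same answer as the least-squares fit
slope, intercept = np.polyfit(x, y, 1)
```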
Summary of linear relationships between continuous variables
We graphically explore the relationship between 2 cont variables with a scatter plot, giving us the direction and whether linear or non-linear association.
Measure strength of linear relationship with the correlation and test the correlation for significant association
Best fit line as a mathematical model of the linear association
Interpreting the slope of LSRL
Example:
Poverty on x axis and violent crime on y axis and slope is 290
The slope, b1, is the average change in the Y variable that is associated with a one unit change in the X variable.
Ex: a one unit increase in poverty is associated with an average increase of 290 violent crimes in a neighborhood over the 2006-15 period.
Depends on the scale of the X variable.
A better interpretation would be that a .1 unit increase in poverty is associated with an average increase of 29 violent crimes in a neighborhood
NO CAUSATION. ASSOCIATION
Interpreting the intercept
The intercept, bo, is the average value of the Y variable when the X variable is zero.
There is an average of 40 violent crimes over the 2006-15 period in neighborhoods with a poverty of 0
Caution: the intercept should only be interpreted if the value of X=0 is within range of observed data.
For the education vs violent crime, no cities with education of 0, so don’t interpret the intercept.
Can’t extrapolate!
Prediction with a linear model
Use the best fit line: Y new = bo + b1 * x new
Plug in new x value
We predict 214 violent crimes over a ten year period in a neighborhood with poverty = .6
Don't extrapolate! Needs to stay within our range.
Root mean square error
Interpretation ex:
Poverty on x axis, violent crime y axis. RMSE = 89
The average size of our prediction errors
Or the average size of the residuals
RMSE = SQRT (sum of (y-y hat)^2 / (n-2))
Interpretation:
If we predict violent crime in neighborhoods based on its poverty level, we get, on average, within 89 crimes of the observed violent crimes in those neighborhoods.
Compared with just using the standard deviation –> SD = SQRT (sum of (y - y bar)^2 / (n-1)) = 101.6
We reduced the average size of our errors in predicting crime from 101.6 to 88.7 by forming the LSRL instead of just using the mean to predict.
R squared
Ex:
Interpretation with poverty on x axis and violent crime y axis and r squared = .24
The square of the correlation
The fraction of variation explained by the model
= 1 - sum of squared errors using model / sum of squared errors using mean
Ex:
24% of the variation in violent crime is explained by our linear model.
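RMSE and R squared from a fitted line, and a check that R squared really is the square of the correlation (same hypothetical data):

```python
import numpy as np

x = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6])
y = np.array([60, 95, 130, 150, 210, 230])

b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x
resid = y - y_hat                                # actual - estimated

n = len(x)
rmse = np.sqrt(np.sum(resid**2) / (n - 2))       # average size of prediction errors

# R^2 = 1 - SSE(model) / SSE(mean): fraction of variation explained
r2 = 1 - np.sum(resid**2) / np.sum((y - y.mean())**2)
r = np.corrcoef(x, y)[0, 1]                      # R^2 equals r squared
```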
Summary of interpretation and prediction with linear model
Interpreted slope and intercept
Use model to make predictions
RMSE as average size of our prediction errors
R squared as the fraction of variation explained by linear model.
Simple linear regression (SLR) model
Foundation the same but add the residuals
Yi = Beta naught + Beta 1 * xi + epsilon i
Epsilon i is the errors or the residuals between each point and the line
Simple linear regression model includes a prob model for those errors.
Epsilon ~ Normal (0, sigma ^2 sub epsilon)
Meaning normal with mean at 0 and the standard deviation of residual is the typical size of the error around the line.
Model the distribution of points around the line
Assumptions of the SLR model
- Linearity
Residuals:
- Independence: the residual for one observation is independent of the residual for another observation.
Ex: predicting the stock market over time. If the model is off on Monday, it will likely be off on Tuesday too.
- Equal variance: the residuals all have the same variance, sigma^2 sub epsilon. No matter where I am on the line, observations should be spread about the line on both sides.
- Normally distributed: residuals are normally distributed around the line.
Residual standard deviation
Sigma hat sub epsilon = RMSE = SQRT (sum of (yi - y hat i)^2 / (n-2))
Uncertainty in a simple linear regression model
2 levels of uncertainty:
1. Line itself
We have uncertainty about our estimated slope and intercept. We calculate the best fit line for our observed data but that line may not be exactly correct.
2. Individual points around the line
When we make predictions for individual observation using our line, how much error is there in our predictions?
Uncertainty about the line itself
The least squares slope and intercept are our best estimates of true slope and intercept, but prob not exactly correct
Our uncertainty about the line has 2 consequences:
1. We need to incorporate this uncertainty into predictions we make from the line
2. The uncertainty affects whether or not the linear relationship is significant, since the true slope could be closer to 0
Hypothesis tests in the SLR model
Given uncertainty in estimation of slope and intercept of linear model, how do we know if significant relationship?
Hypothesis test for slope:
Ho: beta 1 (true slope) = 0
Ha: beta 1 not equal to 0
Test stat: compares estimated slope to null hypothesis slope of 0, while accounting for standard error of estimated slope.
T = (b1 - 0) / SE(b1)
SE (b1) = RMSE / (SQRT (n-1) * standard deviation of x)
SE (b1) is given next to the x variable in table under standard error. And the t ratio next to the x variable is our test stat.
Use t distribution with n-2 degrees of freedom
Reject Ho, conclude significant linear relationship between poverty and crime
Confidence intervals for regression coefficients
Conf interval for slope:
(b1 +/- t* (for n-2 dist) * SE (b1))
Same thing for intercept.
For poverty (x axis) vs crime linear model we get slope as (262,318)
A 1 unit increase in poverty is associated with an average increase of between 262 and 318 violent crimes in a neighborhood over the 10 year period.
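scipy's `linregress` packages up the slope test and the inputs for the slope CI (same hypothetical data as earlier; scipy assumed):

```python
import numpy as np
from scipy.stats import linregress, t

x = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6])
y = np.array([60, 95, 130, 150, 210, 230])

fit = linregress(x, y)   # gives slope, intercept, rvalue, pvalue, stderr
# fit.pvalue is the two-sided test of Ho: true slope = 0

# 95% CI for the slope: b1 +/- t* SE(b1), with n-2 degrees of freedom
n = len(x)
tstar = t.ppf(0.975, df=n - 2)
lo = fit.slope - tstar * fit.stderr
hi = fit.slope + tstar * fit.stderr
```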
Confidence bands using JMP
Visualize the uncertainty in the line by adding confidence bands to the scatterplot
shaded fit option
Gives you a shaded band allowing me to summarize confidence in the line. True line anywhere in the band.
Uncertainty in SLR about individual points around the line
When we make predictions for individual observations using our line, how much error is there in our predictions?
Instead of just predicting a single y value for a new x value, the RMSE tells us we expect the observed Y value to fall within about one RMSE of our prediction, on average.
The assumptions of the SLR are relevant:
Residuals are independent, have equal variance, and are normally distributed
So, we can make a 95% CI for the predicted value –> where y hat new is the single expected value we get
(Y hat new) +/- t* (df n-2) * SE (y hat new)
SE (y hat new) = RMSE * SQRT (1 + (1/n) + (x new - x bar)^2 / ((n-1)*sx^2))
The second part is just the uncertainty around the line. Smaller part of the calculation, mostly driven by the RMSE
This gives a prediction band
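The prediction-interval formula above, computed by hand in numpy/scipy (same hypothetical data; scipy assumed):

```python
import numpy as np
from scipy.stats import t

x = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6])
y = np.array([60, 95, 130, 150, 210, 230])

b1, b0 = np.polyfit(x, y, 1)
n = len(x)
resid = y - (b0 + b1 * x)
rmse = np.sqrt(np.sum(resid**2) / (n - 2))
sx = np.std(x, ddof=1)

x_new = 0.45                       # must stay within the observed x range
y_hat_new = b0 + b1 * x_new

# SE for a single new observation; the "1 +" term (i.e., the RMSE) dominates
se_pred = rmse * np.sqrt(1 + 1/n + (x_new - x.mean())**2 / ((n - 1) * sx**2))

tstar = t.ppf(0.975, df=n - 2)
lo, hi = y_hat_new - tstar * se_pred, y_hat_new + tstar * se_pred
```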
Summary of testing and prediction with SLR
SLR model introduced based on our best fit line equation
Examined uncertainty about the line itself: presented hypothesis testing for significance of a slope and CI for the slope and intercept
Uncertainty about individual observations around the line: we presented prediction intervals and prediction bands for individual observations around the line.
How can we determine if our data is non-linear?
Look at the residual plot
Plot the residuals (y-axis) vs the predicted values of y (x-axis)
If we see a strong pattern in the residuals, then it suggests that this model is not the best model.
Non-random pattern in residual shows that the relationship is not linear.
Modeling non-linear relationships
Transform the Y variable or the X variable to make the relationship more linear
Exponential curves / log transformation
Interpret slope and intercept too
Exponential curve is a line between X and log Y
Log y = Beta not + Beta 1 * x + epsilon i
A 1 unit change in X is associated with a 100% * Beta 1 change in Y.
For example, say the slope is -.0000254 for income on x and log violent crime on Y
A one dollar increase in income is associated with a .00254% decrease in violent crime.
Better interpretation: a 1000 dollar increase in income is associated with a 2.54% decrease in violent crime
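The 100% * Beta 1 rule is an approximation; the exact multiplicative change is exp(b1 * delta_x). A quick check with the (hypothetical) slope from the example:

```python
from math import exp

b1 = -0.0000254   # hypothetical slope of log(violent crime) on income in dollars

# Approximation from the notes: 100 * b1 percent change per unit of X
approx_pct_per_1000 = 100 * b1 * 1000            # about -2.54% per $1000

# Exact version: Y is multiplied by exp(b1 * delta_x)
exact_pct_per_1000 = (exp(b1 * 1000) - 1) * 100  # close to, but not exactly, -2.54%
```

The two agree closely here because the slope is small; for larger slopes the exact version is the one to report.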
Log-log transformation and interpretation of slope
Take log of both X and Y if this makes it more linear (look at the scatter plot and compare r squared and RMSE)
Interpretation of slope: a 1% change in X is associated with a Beta 1 % change in Y
Example: log income on X and log violent crime on Y
Slope is -.781
A 1% change in income is associated w a .781% decrease in violent crime.
Looks like a demand curve, downward sloping toward 0
Y vs log (x)
As x increases by 1%, average y increases by about .01 * b1
Diminishing returns
Summary of the non-linearity with the SLR Model
We can fit non-linear relationships by transforming the X or Y variable and then using our SLR framework
Log transformations are the most common, but other types work too
Interp of the slope changes when we transform our X or Y variables.
Equal variance assumption of SLR
Residuals should all have the same variance.
Meaning as x increases, we should see homoskedasticity. No fanning out pattern; residuals stay the same size as x increases
Diagnosing Heteroskedasticity
In our poverty vs violent crime data, we see a fanning out pattern, known as heteroskedasticity
If we look at the residual plot, residuals get much bigger as x and y increase
Consequences of Heteroskedasticity
The RMSE should represent the size of a typical error or residual.
But the RMSE is too large for small values of x, where the residuals are small, and too small for large values of x, where the residuals are large
When we make 95% prediction bands, our prediction bands are too large for small x values and too small for large x values.
Possible solution for heteroskedasticity
Re scale the residuals so they don’t grow with the x variable
Divide the entire equation by x
So it is
Yi/Xi = gamma naught + gamma 1 * (1/xi) + residual
The problem is this now plots Y/X vs 1/X, which is not easily interpreted.
And it doesn’t completely eliminate the problem.
Normal distributed assumption
The residuals should be normally distributed around the line.
Diagnosing non normality
Make a normal quantile plot.
This is a plot to see if the residual histogram matches normal distribution.
The plot should stay within the red bands. If not, then not normal.
Consequences of non normality
If residuals aren’t normally distributed, then we can’t trust that the 95% prediction bands actually contain 95% of the observed y values.
Possible solution for non normality
Log transformation–> log transform the y variable and see if the residuals look more normally distributed.
Only good solution if the relationship between log y and x is still roughly linear. Make sure it doesn’t screw up the fundamental linear relationship.
Independence assumption
The residual for one observation is independent of the residual for any other observation.
Diagnosing dependent residuals
Plot the residuals vs the rows of the data set. We shouldn’t see a pattern.
Autocorrelation is seeing this pattern, which we often see in data with time.
If residuals show autocorrelation (dependence over time), we can model that into our model. But not yet. Wait till ch 27!
Outliers with SLR
Check for outliers, which could have influence
Remove the outlier and see how the line changes. You can justify keeping it in or removing but be sure to explain why.
Summary of violations of SLR
- Non linear
- Heteroskedasticity (unequal variances) in residuals
- Non normality of the residuals
- Dependence in residuals (autocorrelation)
- Outliers