Exam 2 - Regression Flashcards
y-hat = a + bx
The hat on y (y-hat) reminds us that there are deviations about the line and that the values of y specified by the line are PREDICTIONS. a = intercept; b = slope; y-hat = predicted value of y for a given x
Statistical model
An equation that fits the pattern between a response variable and possible explanatory variables, accounting for deviations from the model. In other words, a regression line.
What does the y-intercept tell us?
The value of y when x=0
What does slope tell us?
The change in y for every one-unit increase in x, on average!
As x increases by one unit, what happens to y when the slope is negative?
Y decreases
As x increases by one unit, what happens to y when the slope is positive?
Y increases by b (rise/run) units, on average
b =
Rise (change in y) / run (change in x)
Interpretation of slope: rise/run
For every one-inch increase in height at age 4, height at age 18 increases by 1.15 inches ON AVERAGE
Interpretation of y-intercept
Males who are zero inches tall at age 4 are predicted to be 23 inches tall at age 18
The intercept is the value of y when x = 0
How to predict
- collect data
- plot data
- fit the data with a straight-line equation
- evaluate the equation
- predict
Residuals
Vertical distance between the observed y value and the line, or
The difference between observed y value and y-hat , the value predicted by regression line
Squared prediction error: (residual)^2
(observed y - predicted y)^2 = (y - y-hat)^2
Residuals are squared because the plain sum of the residuals about the least-squares line is zero (positive residuals above the line cancel negative residuals below it)
Positive residuals
Points above the line
Negative residuals
Points below the line
The least-squares regression line is
The line with the smallest sum of squared errors (denoted SSE)
The sum of squared deviations (residuals, errors), SSE, represents
The variation in the observed values of y about the regression line (the unexplained variation). SSE = sum of (residual)^2 = sum of (y - y-hat)^2
Least - squares equation
y-hat = a + bx
Formula for a (intercept)
a = y-bar - b(x-bar)
where y-bar and x-bar are the respective means of y and x
Formula for b(slope)
Slope is a rate of change: the amount of change in y-hat when x increases by 1
b = r(Sy/Sx)
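The two formulas above, b = r(Sy/Sx) and a = y-bar - b(x-bar), can be sketched in a few lines of Python. This is only an illustration: the data values and the helper names pearson_r and least_squares_line are made up here and do not come from the course material.

```python
from statistics import mean, stdev

def pearson_r(xs, ys):
    """Correlation r: sum of products of standardized x and y, divided by n - 1."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)

def least_squares_line(xs, ys):
    """Return (a, b) using b = r * Sy / Sx and a = y-bar - b * x-bar."""
    b = pearson_r(xs, ys) * stdev(ys) / stdev(xs)
    a = mean(ys) - b * mean(xs)
    return a, b

# Hypothetical data, made up for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = least_squares_line(xs, ys)
print(f"y-hat = {a:.2f} + {b:.2f}x")
```

For this made-up data the slope works out to 0.6 and the intercept to 2.2, so the fitted line is y-hat = 2.20 + 0.60x.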
Least-squares regressions line facts
- makes the distances of the data points from the line small ONLY in the y direction
- if we reverse the roles of the two variables, we get a different least-squares regression line
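The second fact can be checked numerically. The sketch below fits y on x and then x on y for the same made-up data. The helper fit_line is hypothetical, and it computes the slope as (sum of cross-deviations) / (sum of squared x-deviations), which is algebraically the same as b = r(Sy/Sx).

```python
from statistics import mean

def fit_line(xs, ys):
    """Least-squares fit of ys on xs; returns (intercept, slope)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    return y_bar - b * x_bar, b

# Hypothetical data, made up for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a_yx, b_yx = fit_line(xs, ys)  # regression of y on x
a_xy, b_xy = fit_line(ys, xs)  # regression of x on y (roles reversed)

# Draw the reversed line on the same axes by solving x = a_xy + b_xy * y for y.
reversed_slope = 1 / b_xy
print(f"y on x: slope {b_yx:.2f}")
print(f"x on y, redrawn as y against x: slope {reversed_slope:.2f}")
```

With this data the y-on-x slope is 0.6, while the reversed line has slope 1.0 when drawn on the same axes, so the two lines are not the same.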
What is the connection between the correlation r and the slope b of the least-squares line?
The slope and r always have the same sign, so both tell us the direction of the relationship. b = r only when Sy = Sx. If r = 0, b = 0; if r < 0, b < 0; if r > 0, b > 0. If we know the sign of r, we know the sign of b, and vice versa.
What b and r have in common
Always have the same sign
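A change of 1 standard deviation in x corresponds to a change of r standard deviations in y-hat.
Because r is never larger than 1 in size, the change in y-hat (in standard deviations) is never more than the change in x (in standard deviations)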
The least squares regression line always passes
Through the point (x-bar, y-bar)
Correlation r describes
The direction and strength of the straight-line relationship
The square of the correlation, r^2, gives us
The percentage (%) of variation in the values of y that is explained by the least-squares regression line
On the chart R-sq=0.6937 or 69.37%
Regression line
Is a straight line that describes how a response variable y changes as an explanatory variable x changes
The least-squares line is a mathematical model used to predict
The value of y for a given x
y-hat = a + bx
Least squares regression line requires that we have
An explanatory variable and a response variable, both quantitative
The least squares regression line of y on x is the line that makes
The sum of the squares of the vertical distances of the data points from the line as small as possible
The least-squares regression line, like any line, has
Slope and intercept
Change in y-hat
Slope b = r(Sy/Sx), where r is the correlation and Sx and Sy are the standard deviations of x and y
When r^2 is close to 0, the regression line
Is not a good model for the data: the scatterplot has a hamburger shape, and almost none of the variation in y is explained by the regression line
When r^2 is close to 1
The regression line should fit the data well: almost 100% of the variation in y is explained by the regression on x
The coefficient of determination r^2
Represents the fraction (%) of the variation in the values of y that is explained by the least-squares regression of y on x.
Regression is a common statistical setting, and least-squares regression is the most common method for
Fitting a regression line to data
Least squares regression line always passes through
The point (x-bar, y-bar), the means of x and y
Residual
Difference between an observed value of the response variable y and the value predicted by the regression line, y-hat
Residual = observed y - predicted y or y-hat
The residuals show
How far the data are from the regression line and how well the line describes the data.
The mean of the least-squares residuals is
Always zero!
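A quick numerical check of this card, and of the fact that the line passes through (x-bar, y-bar), might look like the sketch below. The data are made up, and the slope is computed as (sum of cross-deviations) / (sum of squared x-deviations), which is algebraically the same as b = r(Sy/Sx).

```python
from statistics import mean

# Hypothetical data, made up for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

x_bar, y_bar = mean(xs), mean(ys)
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum((x - x_bar) ** 2 for x in xs)
a = y_bar - b * x_bar

# Residual = observed y - predicted y (y-hat).
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
mean_residual = mean(residuals)

# The value the line predicts at x-bar should equal y-bar.
prediction_at_x_bar = a + b * x_bar

print("residuals:", [round(e, 2) for e in residuals])
print("mean residual (zero up to rounding error):", mean_residual)
print("y-hat at x-bar:", prediction_at_x_bar, "vs y-bar:", y_bar)
```

Positive residuals belong to points above the line and negative residuals to points below it. Floating-point arithmetic may leave the mean residual a tiny distance from exactly zero.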
A residual plot (diagnostic plot)
Is a scatterplot of the residuals against the observed x values (or against the predicted values, y-hat, which lie on the regression line)
If the residual plot shows a uniform scatter of points about the zero line
Above and below with no unusual observations or systematic pattern, then the regression line captures the overall relationship well
Residual plot - curved pattern
Relationship is not linear
Residual plot - megaphone
Increasing or decreasing spread about the line as x increases; when the spread increases, predictions of y will be LESS accurate for larger x's
Individual points with large residuals are
Outliers in the vertical direction
Influential observation
Is an outlier in either the x or y direction which, if removed, would markedly change the value of the slope and y-intercept
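To see what "markedly change" can look like, the sketch below fits a line with and without one point that sits far out in the x direction. The data and the helper name fit_line are made up for illustration.

```python
from statistics import mean

def fit_line(xs, ys):
    """Least-squares fit of ys on xs; returns (intercept, slope)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    return y_bar - b * x_bar, b

# Hypothetical data: the last point (x = 15, y = 3) lies far from the others in x.
xs = [1, 2, 3, 4, 5, 15]
ys = [2, 4, 5, 4, 5, 3]

a_all, b_all = fit_line(xs, ys)                    # fit using every point
a_without, b_without = fit_line(xs[:-1], ys[:-1])  # fit with the last point removed

print(f"with the point:    y-hat = {a_all:.2f} + {b_all:.2f}x")
print(f"without the point: y-hat = {a_without:.2f} + {b_without:.2f}x")
```

In this example the slope is slightly negative with the extra point and 0.6 without it, so that single observation is influential.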
Outlier
An observation that lies outside the overall pattern of the other observations
Ecological correlation
A correlation based on group averages (means) rather than on individuals.
Correlation measures
Direction and strength of the linear relationship between quantitative variables x and y
Regression models
The linear relationship between x and y and can be used to predict a value for the response variable y for a specific value of the explanatory variable x
What is total variation?
Sum of squared deviations about y-bar
What is unexplained variation?
Sum of squared residuals: the variation not explained by the regression line
Regression assumptions:
The relationship between x and y can be modeled by a straight line (residuals show randomness around the line)
Variation in the y's about the line does not depend on the value of x (residuals are similar in size for all x's)
If the residual conditions (assumptions) are met
The residual plot looks like a shoebox: there is no pattern in the residuals
A smile or frown pattern in a residual plot indicates
Non-linear relationship - violation of conditions (assumptions)
A megaphone pattern in a residual plot indicates
Non-constant variation (the variation in y depends on x)
Shoe box residual plot with a point outside indicates
Outlier in either x or y direction
An estimated statistical model
Regression equation
A regression equation is an
Estimated statistical model
r^2 is a measure of how
Successfully the regression explains the variation in the response, y
The sum of squared residuals measures …… Variation
The unexplained
R-sq is 1 minus the fraction of variation in y that is ….
Not explained by x
R-sq = 1 - (unexplained variation / total variation)
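The relationship R-sq = 1 - (unexplained variation / total variation) can be worked through on a small data set. In this sketch the data are made up, unexplained variation is the sum of squared residuals, and total variation is the sum of squared deviations of y about y-bar.

```python
from statistics import mean

# Hypothetical data, made up for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

x_bar, y_bar = mean(xs), mean(ys)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sxx = sum((x - x_bar) ** 2 for x in xs)
b = sxy / sxx
a = y_bar - b * x_bar

# Unexplained variation: sum of squared residuals about the line.
unexplained = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
# Total variation: sum of squared deviations about y-bar.
total = sum((y - y_bar) ** 2 for y in ys)

r_sq = 1 - unexplained / total

# Cross-check: r^2 is also the square of the correlation r.
r = sxy / (sxx * total) ** 0.5
print(f"R-sq from variation: {r_sq:.4f}")
print(f"R-sq from r squared: {r ** 2:.4f}")
```

Here the unexplained variation is 2.4 out of a total of 6, so R-sq is 0.6: the regression line explains 60% of the variation in y for this made-up data.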
Residual plots help us magnify the residuals and identify ….. Sometimes we can see ….. observations and ……, which are much more visible on the residual plot.
Problems.
Unusual observations
Patterns
A residual plot is a ….. of the residuals plotted against the x-values
Scatterplot
Correlations based on ….. rather than on …… can be misleading if they are interpreted to be about individuals
Averages ….. individuals
Removing an influential point from the data set will change …
Slope and y-intercept