Lecture 9 - Regression Depression Flashcards
Why is regression analysis used so often?
It gives a model for predicting the response, either into the future or further along the gradients measured
How is the independence assumption evaluated?
By plotting the residuals against the fitted values, in increasing order
Independence holds when the residuals fluctuate uniformly about 0 (no pattern)
ie. no unbalanced + or - groups
When does a lack of independence with residuals occur?
When adjacent residuals tend to be similar and thus appear to be correlated = autocorrelation
ie. groupings that consistently fall below or above the line
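What is positive autocorrelation (dependence)?
Adjacent residuals tend to share a sign: positive residuals are followed by other positive residuals, and negative by negative
What is negative autocorrelation?
Adjacent residuals tend to alternate in sign: a positive residual is followed by a negative one, and vice versa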
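One way to put a number on autocorrelation is the lag-1 correlation between each residual and the next one. The sketch below is only an illustration: the function name `lag1_autocorrelation` and both residual lists are made up for this card, not taken from the lecture.

```python
def lag1_autocorrelation(residuals):
    """Correlation between each residual and the residual that follows it."""
    n = len(residuals)
    mean = sum(residuals) / n
    # Numerator: products of adjacent deviations from the mean.
    num = sum((residuals[i] - mean) * (residuals[i + 1] - mean) for i in range(n - 1))
    # Denominator: total squared deviation from the mean.
    den = sum((r - mean) ** 2 for r in residuals)
    return num / den


# Hypothetical residuals, listed in order of increasing fitted value.
runs = [1.0, 1.2, 0.8, 1.1, -0.9, -1.1, -1.0, -1.2]         # same-sign groupings
alternating = [1.0, -1.1, 0.9, -1.0, 1.2, -0.8, 1.1, -1.3]  # sign flips every step

r_runs = lag1_autocorrelation(runs)
r_alt = lag1_autocorrelation(alternating)
print("same-sign runs:", round(r_runs, 2))
print("alternating:   ", round(r_alt, 2))
```

Same-sign groupings give a value above 0 and alternating signs give a value below 0; a value near 0 is what independent residuals should produce.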
What is the purpose of regression depression?
To examine linearity between two variables
Determine if a linear relationship exists between two variables
If the R2 value is good, what might still lead us to be concerned with what a scatter plot depicts?
If the scatterplot shows a plateau
Could mean constraints on data
Do we want to fit a curve to the data (polynomial)?
Usually not desirable: fitting a polynomial curve to the data gets complicated and is not very suitable for biology
Sometimes we might transform the data in this situation
What analysis underlies deriving R2 and the decomposition of variability?
A least-squares analysis: fit the line that minimizes the variability of the dependent response variable (y) about the line
What is the Y-bar?
The mean of all observed y values; the fitted line always passes through the point (x-bar, Y-bar), so the variability of the dependent response pivots around it
What is (B) in the decomposition of variability?
linear distance from observed to expected = residual or error or unexplained term
(yi-y-hat-i)
What is yi?
the observed value
What is y-hat-i?
the expected value
What is (C) in the decomposition of variability?
The Model term (y-hat-i - Y-bar): the linear distance from the expected value to the mean of the y values (Y-bar) = model or regression term
What is (A) in the decomposition of variability?
The total deviation (yi - Y-bar), made up of the two components (B) and (C)
where (A) = (B) + (C) = the total variability
What is the total variation equation?
Total variation (A) = Residual (error/unexplained, B) + Model (explained/regression, C)
Why do we sum the squares of the residuals?
To get rid of the negative signs on the differences between the observed yi and the expected y-hat-i
Otherwise the positive and negative residuals would cancel out and add up to 0
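A quick check of the cancelling problem, as a sketch with made-up data: the helper `fit_line` and the x/y values are assumptions for illustration, not lecture data. It fits a least-squares line by hand, then compares the raw sum of residuals with the sum of squared residuals.

```python
def fit_line(x, y):
    """Ordinary least-squares intercept and slope for a single predictor."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical data, invented for this example.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]

intercept, slope = fit_line(x, y)
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

raw_sum = sum(residuals)                      # positives and negatives cancel
squared_sum = sum(r ** 2 for r in residuals)  # squaring keeps every misfit

print("sum of residuals:        ", raw_sum)
print("sum of squared residuals:", squared_sum)
```

The raw sum comes out at zero apart from floating-point rounding, whatever the data, so it says nothing about fit; the squared sum grows with every point that misses the line.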
What is the equation for the total sum of squares (SSt)? And how does this relate to A, B and C?
Total SSt (A) = Residual SSe (B) + Model SSm (C)
What is SSe a measure of?
SSe is a measure of how well the regression line fits the actual data (difference between observed and expected values)
What is SSm a measure of?
SSm is a measure of how different the line y-hat-i is from Y-bar (how different is the slope from 0)
What is the equation for the Coefficient of determination R2?
R2 = SSm (C)/ SSt (A, total)
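The decomposition and R2 can be checked numerically. This is a minimal sketch on made-up data (the helper `fit_line` and the x/y values are assumptions, not from the lecture); it computes the three sums of squares and confirms that SSt = SSe + SSm before forming R2.

```python
def fit_line(x, y):
    """Ordinary least-squares intercept and slope for a single predictor."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Hypothetical data, invented for this example.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]

intercept, slope = fit_line(x, y)
y_bar = sum(y) / len(y)
y_hat = [intercept + slope * xi for xi in x]  # expected values

ss_total = sum((yi - y_bar) ** 2 for yi in y)               # (A) observed vs mean
ss_error = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # (B) observed vs expected
ss_model = sum((yh - y_bar) ** 2 for yh in y_hat)           # (C) expected vs mean

r_squared = ss_model / ss_total

print("SSt:", ss_total)
print("SSe + SSm:", ss_error + ss_model)
print("R2:", r_squared)
```

R2 is the share of the total variability that the line accounts for, so it always falls between 0 and 1.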
What is the Model referring to in an ANOVA table?
The treatment/factor
How do you determine F with the ANOVA?
F=MSm/MSE
or F=MSm/MSres (same thing, different label)
What is the equation for the sum of squares Model (SSm)?
Sum of (y-hat-i - Y-bar)^2
What is the equation for the sum of squares Residual (SSres or SSe)?
Sum of (yi - y-hat-i)^2
What is the equation for the sum of squares total (SSt)?
Sum of (yi - Y-bar)^2
What is the degree of freedom for the Residual/Error?
n-2
What is the degree of freedom for the Total?
n-1
How do you determine the MS (Mean Square)?
The sum of squares divided by the degrees of freedom
SS/df
How do you calculate F?
MSmodel/MSresidual
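The ANOVA table rows can be assembled step by step. A sketch with made-up data follows; the helper `fit_line`, the data and the dictionary layout are assumptions for illustration. The model degrees of freedom are taken as 1, which is the case for simple linear regression with one predictor.

```python
def fit_line(x, y):
    """Ordinary least-squares intercept and slope for a single predictor."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Hypothetical data, invented for this example.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]
n = len(x)

intercept, slope = fit_line(x, y)
y_bar = sum(y) / n
y_hat = [intercept + slope * xi for xi in x]

ss_model = sum((yh - y_bar) ** 2 for yh in y_hat)
ss_error = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))

# Degrees of freedom: 1 for the model (one predictor), n-2 residual, n-1 total.
df_model, df_error, df_total = 1, n - 2, n - 1

ms_model = ss_model / df_model  # MS = SS / df
ms_error = ss_error / df_error
f_stat = ms_model / ms_error    # F = MSmodel / MSresidual

anova = {
    "Model":    {"df": df_model, "SS": ss_model, "MS": ms_model, "F": f_stat},
    "Residual": {"df": df_error, "SS": ss_error, "MS": ms_error},
    "Total":    {"df": df_total, "SS": ss_model + ss_error},
}
for source, row in anova.items():
    print(source, row)
```

A large F means the model mean square is large relative to the residual mean square, which is evidence against the null hypothesis of no relationship.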
What is the H(0): for the ANOVA table for simple linear regression?
H(0): is that there is no relationship between y and x
Cannot use x to predict y
We can usually expect the slope and intercept to not exactly equal 0, so why do we test these hypotheses anyway?
We test the hypothesis anyways to see if the difference from 0 is significant
What are the 3 null hypotheses of the regression coefficients?
1) The intercept is no different from 0 (intercept = constant)
2) The slope is no different from 0 (slope = coefficient of x; in the example, weight)
3) There is no relationship between the response and the predictor (the overall F-test; with a single predictor it is equivalent to the slope test)
There are 2 t-tests in the regression coefficients. Why?
Because each coefficient and its hypothesis are tested separately
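The two t statistics can be sketched by hand. Everything named here (the data, `se_slope`, `se_intercept`) is an assumption for illustration; the standard-error formulas are the usual ones for simple linear regression. The last lines check that, with one predictor, the squared slope t statistic equals the ANOVA F.

```python
import math

# Hypothetical data, invented for this example.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

slope = sxy / sxx
intercept = y_bar - slope * x_bar
y_hat = [intercept + slope * xi for xi in x]

ss_error = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
ss_model = sum((yh - y_bar) ** 2 for yh in y_hat)
ms_error = ss_error / (n - 2)  # residual mean square

# Standard errors of the two coefficients.
se_slope = math.sqrt(ms_error / sxx)
se_intercept = math.sqrt(ms_error * (1 / n + x_bar ** 2 / sxx))

# One t statistic per coefficient, each testing "coefficient = 0" on n-2 df.
t_slope = slope / se_slope
t_intercept = intercept / se_intercept

f_stat = (ss_model / 1) / ms_error

print("t (intercept):", t_intercept)
print("t (slope):    ", t_slope)
print("t_slope^2 vs F:", t_slope ** 2, f_stat)
```

Each t statistic is compared against a t distribution with n-2 degrees of freedom to get its p-value.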
What is done to test the assumption of linearity?
a scatterplot with a linear regression line and 95% confidence intervals plotted
If the data points are only weakly scattered about the regression line (they do not track it closely, or they curve away from it) then a linear regression may not be appropriate
How do you fix the linearity assumption if the points are weakly scattered about the regression line?
Plot a curvilinear relationship (plateau problem?)
What is done to test the assumption of normality?
A normal Q-Q plot is plotted; if the residuals come from a normal distribution, the standardized residuals should fall on the line of the plot. If they veer off the line at the ends, that could indicate heavy tails or skew.
ie. the plot should track the straight line
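The points of a normal Q-Q plot can be built without a plotting library. This is a rough sketch under stated assumptions: the residuals are made up, `qq_points` and `correlation` are hypothetical helpers, the plotting position (i - 0.5)/n is one common choice among several, and the residuals are standardized simply by their own standard deviation rather than with leverage-adjusted formulas.

```python
from statistics import NormalDist, stdev


def qq_points(residuals):
    """Pair theoretical normal quantiles with sorted standardized residuals."""
    n = len(residuals)
    mean = sum(residuals) / n
    scale = stdev(residuals)
    observed = sorted((r - mean) / scale for r in residuals)
    # Theoretical quantiles of a standard normal at positions (i - 0.5) / n.
    theoretical = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, observed))


def correlation(pairs):
    """Pearson correlation of the (theoretical, observed) pairs."""
    n = len(pairs)
    mean_t = sum(t for t, _ in pairs) / n
    mean_o = sum(o for _, o in pairs) / n
    cov = sum((t - mean_t) * (o - mean_o) for t, o in pairs)
    var_t = sum((t - mean_t) ** 2 for t, _ in pairs)
    var_o = sum((o - mean_o) ** 2 for _, o in pairs)
    return cov / (var_t * var_o) ** 0.5


# Hypothetical residuals, invented for this example.
residuals = [-1.3, 0.4, 0.9, -0.2, 1.6, -0.7, 0.1, -1.0, 0.5, -0.3]

points = qq_points(residuals)
qq_corr = correlation(points)

for theoretical, observed in points:
    print(round(theoretical, 2), round(observed, 2))
print("straightness (correlation):", round(qq_corr, 3))
```

Plotting the pairs gives the Q-Q plot; the closer the correlation is to 1, the more closely the points track the straight line.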
What is the data ordered like in a Residual Diagnostics plot (4 in 1)?
The residuals are ordered smallest to largest and standardized (centered on 0); their relative variability must be the same throughout (no patterns)