Midterm Flashcards
SSTO
SSTO=SSE+SSR
SSTO=Observed-Mean
Degrees of freedom = n-1
Distance from the observed points to the mean
SSTO=Σ(Yi−Ȳ)^2
SSTO=Y'(I−(1/n)J)Y (quadratic form)
SSTO=Y’Y- (1/n)Y’JY
SSE
Error Sum of Squares
SSE=Observed - Predicted
Degrees of freedom = n-p
Simple linear regression: degrees of freedom = n-2
Distance from the observed points to the predicted points
SSE=Σ(Yi−Ŷi)^2
SSE=Y’Y-Y’HY
=Y’(I-H)Y
SSR
Regression Sum of Squares
SSR=Predicted-Mean
Degrees of freedom = p-1
Simple linear regression: degrees of freedom = 1
Distance from the fitted (predicted) line to the mean
SSR=Σ(Ŷi−Ȳ)^2
SSR=Y’HY-(1/n)Y’JY
=Y’(H-(1/n)J)Y
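A minimal numpy sketch (made-up Y and X, purely illustrative) of these quadratic forms, confirming the decomposition SSTO = SSE + SSR:

import numpy as np

Y = np.array([3.1, 4.0, 5.2, 6.1, 7.3])
X = np.column_stack([np.ones(5), np.array([1.0, 2.0, 3.0, 4.0, 5.0])])
n = len(Y)
J = np.ones((n, n))                      # n x n matrix of ones
I = np.eye(n)
H = X @ np.linalg.inv(X.T @ X) @ X.T     # hat matrix
SSTO = Y @ (I - J / n) @ Y               # Y'(I - (1/n)J)Y
SSE = Y @ (I - H) @ Y                    # Y'(I - H)Y
SSR = Y @ (H - J / n) @ Y                # Y'(H - (1/n)J)Y
print(SSTO, SSE + SSR)                   # the two quantities agree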
What are the general assumptions of the linear model? How are they assessed?
- Linearity: The response and covariates are related in a linear way
- Assess with scatter plots of Y vs. each X and with residual plots
- Normality: Error terms (Ei) or responses Yi are normally distributed
- Assess with a QQ plot and the Shapiro-Wilk test of the residuals or semistudentized residuals
- Assess with a histogram of the residuals
- E.g., a QQ plot that trends away from the diagonal line at the tails suggests non-normality
- Constant Variance: Variance of the errors is constant, var(Ei)=σ^2
- Quantities to plot:
- residuals (r)
- absolute value of residuals
- residuals squared (r^2)
- semistudentized residuals
- semistudentized residuals squared
- absolute value of semistudentized residuals
- Assess by scatter plot (semistudentized residuals vs. Ŷ, or squared residuals vs. predictors); a megaphone shape implies non-constant variance
- Breusch-Pagan test (heteroscedasticity test; H0: constant variance, Ha: non-constant variance); see the Python sketch after this list
- Independence: Subjects (Errors or responses) are independent
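A rough Python sketch of these checks using statsmodels and scipy on simulated data (variable names and numbers are illustrative, not from the course data):

import numpy as np
import scipy.stats as stats
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(0)
x = rng.uniform(20, 70, 100)                          # hypothetical covariate
y = 0.01 - 0.00004 * x + rng.normal(0, 0.0005, 100)   # hypothetical response
X = sm.add_constant(x)
fit = sm.OLS(y, X).fit()
resid = fit.resid

# Normality: Shapiro-Wilk on the residuals (QQ plot: sm.qqplot(resid, line="45"))
print(stats.shapiro(resid))

# Constant variance: Breusch-Pagan (H0: constant variance)
bp_stat, bp_pvalue, _, _ = het_breuschpagan(resid, X)
print(bp_pvalue)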
MSR
MSR=SSR/(p-1)
Variation that is explained by the fitted regression line
MSE
MSE=SSE/(n-p)
MSE is an estimate of σ^2
√MSE is an estimate of σ
Variation NOT explained by the fitted regression line
F statistic
F=MSR/MSE
If all slope βs = 0, F is expected to be close to 1; if some β ≠ 0, F tends to be greater than 1
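A small statsmodels sketch on simulated data showing that MSR/MSE reproduces the reported overall F statistic (note: statsmodels names SSR "ess" and SSE "ssr"):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.normal(size=50)
y = 2 + 3 * x + rng.normal(size=50)
fit = sm.OLS(y, sm.add_constant(x)).fit()
MSR = fit.ess / fit.df_model        # SSR / (p-1)
MSE = fit.ssr / fit.df_resid        # SSE / (n-p)
print(MSR / MSE, fit.fvalue)        # the two values match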
R2
What does R2=0.1885 mean?
What does adjusted R2=0.1817 mean?
Is Adjusted R2 better than R2?
R2=SSR/SSTO
Coefficient of determination
Proportion of variance of Y explained (linearly) by the variation in X
√R2 with the appropriate sign equals the correlation coefficient (r) between X and Y (only in simple linear regression)
R2=0.1885 means that about 19% of the variation in inverse BP is explained by the predictors in the model
Adjusted R2=0.1817 means that about 18% of the variation in inverse BP is explained by the predictors in the model, after adjusting for the number of variables in the model
Adjusted R2 is better because it accounts for the cost of adding variables to the model
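A short sketch on simulated data computing R2 = SSR/SSTO and adjusted R2 by hand and checking them against statsmodels' built-in values:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 40
x = rng.normal(size=n)
y = 1 + 0.5 * x + rng.normal(size=n)
fit = sm.OLS(y, sm.add_constant(x)).fit()
p = 2                                     # parameters including the intercept
r2 = fit.ess / fit.centered_tss           # R2 = SSR / SSTO
r2_adj = 1 - (1 - r2) * (n - 1) / (n - p)
print(r2, fit.rsquared, r2_adj, fit.rsquared_adj)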
Diagnostics using Raw Residuals
- Linearity
- If linear, the residuals form an even band around 0 in plots of residuals vs. predicted values and vs. each X in the model; symmetry around zero suggests linearity
- A systematic pattern means the relationship is non-linear, or an important predictor was omitted
- Normality of the error terms
- QQ plot of the residuals, and Shapiro-Wilk test
- Constant variance
- Look for a megaphone shape in the residuals
- An increasing/decreasing trend in the absolute value of the residuals, or in the squared residuals, versus the predicted values or versus any X
- Can also be seen on a scatter plot of Y vs. each X
Limitation of R2
- High R2 does not necessarily imply useful prediction (better to look at width of prediction interval)
- High R2 does not necessarily mean good fit
- R2 close to 0 does not mean there is no relationship; it only indicates a weak linear relationship
Box Cox Transformation
Method to find a useful transformation
Choose the best λ
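A minimal scipy sketch on made-up positive, right-skewed data; stats.boxcox returns the transformed values and the maximum-likelihood λ:

import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
y = rng.lognormal(mean=0.0, sigma=0.5, size=200)   # positive, right-skewed Y
y_bc, best_lambda = stats.boxcox(y)                # λ chosen by maximum likelihood
print(best_lambda)                                 # λ near 0 suggests log(Y)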
Modeling Strategy
- Look at data
- Fit preliminary model
- Perform diagnostics (normality, constant variance, linearity)
- Fix problems by Box Cox Transformation
- Fit new model
- Diagnose new model
- Repeat steps 4-6 until all problems are fixed
- Final inference (p-values, CI, etc)
Boxplot
A boxplot is used to describe the symmetry of the data; normality cannot be concluded directly from a boxplot
Collinearity is tested by
Correlation Matrix
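A quick numpy sketch (simulated AGE, BMI, TOTCHOL, with BMI deliberately made correlated with AGE) of a correlation matrix used to screen for collinearity:

import numpy as np

rng = np.random.default_rng(6)
age = rng.uniform(30, 70, 100)
bmi = 0.3 * age + rng.normal(0, 5, 100)              # correlated with age by design
totchol = rng.normal(200, 30, 100)
print(np.corrcoef(np.vstack([age, bmi, totchol])))   # rows = variables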
Outliers can be detected by
Plotting the semistudentized residuals against X or against the predicted Ŷ
Omission of important predictors can be tested by
Plotting residuals against omitted variables
Non linearity can be fixed by
Transformation
- Transform X if error terms are normally distributed and have constant variance
- Transform Y when there is unequal error variance and non-normality of the error terms
- Box Cox Transformation
Non-constant variance can be fixed by
Variance stabilizing transformation
Weighted least squares
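A minimal statsmodels sketch of weighted least squares, assuming for illustration that the error standard deviation grows with x, so weights of 1/x^2 are used:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
x = rng.uniform(1, 10, 100)
y = 2 + 3 * x + rng.normal(0, x, 100)          # error SD grows with x
wls_fit = sm.WLS(y, sm.add_constant(x), weights=1.0 / x**2).fit()
print(wls_fit.params)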
Non independence of the error terms can be fixed by
Adding a time covariate to the model
Omission of important predictors can be fixed by
Adding them
Transformation
- Can linearize non-linear relationships
- Can stabilize non-constant variance
- Can reduce non-normality
Interpret: 1/SBP=0.00985-0.0000416*AGE
For every 1-year increase in age, we expect the mean inverse SBP to decrease by 0.0000416 mmHg^-1 (holding any other variables in the model constant)
What is the Null Hypothesis and Alternative Hypothesis for F test for multiple variable.
Decision Rule
F critical value
F test equation
H0: β1=β2=...=βp-1=0 (all slope coefficients are 0)
H1: Not all 𝛽s are equal to 0
Decision rule:
If F* ≤ F(1-α; p-1, n-p), conclude H0
If F* > F(1-α; p-1, n-p), conclude H1
F critical value= F(1-α;p-1,n-p)
F test equation=MSR/MSE
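A scipy sketch of the decision rule with hypothetical values n = 100, p = 3, F* = 5.4, and α = 0.05:

from scipy import stats

n, p, F_star, alpha = 100, 3, 5.4, 0.05
F_crit = stats.f.ppf(1 - alpha, dfn=p - 1, dfd=n - p)   # F(1-alpha; p-1, n-p)
print(F_crit, "conclude H1" if F_star > F_crit else "conclude H0")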
What is null hypothesis for the Spearman correlation?
H0: There is no monotonic association between the two variables
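A scipy sketch with made-up data; spearmanr returns the Spearman rho and a p-value for this null hypothesis:

from scipy import stats

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2, 1, 4, 3, 7, 8, 6, 9]
rho, pvalue = stats.spearmanr(x, y)
print(rho, pvalue)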
Degree of freedom
df(SSTO) = df(SSR) + df(SSE)
(n-1) = (p-1) + (n-p)
Interpret Breusch-Pagan test results
P=0.3
Null: Constant Variance
Alter: Non-constant Variance
E.g., for AGE with a p-value of 0.3, we fail to reject the null hypothesis at the 0.05 level and conclude there is NOT enough evidence of non-constant variance
E.g., even though BMI has a borderline-significant p-value, the overall test suggests constant variance, so no remedial action is necessary at this point
How to test outlier?
Test for outliers by plotting the semi-studentized residuals (e.g., against X or the predicted Ŷ)
Any observation with a semi-studentized residual beyond ±4 suggests an outlier (see the sketch below)
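A statsmodels sketch on simulated data with one planted outlier, computing semi-studentized residuals ei/√MSE and flagging any beyond ±4:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
x = rng.normal(size=60)
y = 1 + 2 * x + rng.normal(size=60)
y[0] += 10                                       # plant an outlier
fit = sm.OLS(y, sm.add_constant(x)).fit()
semi_stud = fit.resid / np.sqrt(fit.mse_resid)   # e_i / sqrt(MSE)
print(np.where(np.abs(semi_stud) > 4)[0])        # indices of flagged observations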
Write the fitted equation. Dependent variable: SYSBP^-1. Intercept = 0.01276. Independent variables: BMI = -0.00007775, AGE = -0.00003659, TOTCHOL = -0.000001463
SYSBP^(-1)=0.01276-0.00007775(BMI)-0.00003659(AGE)-0.000001463(TOTCHOL)
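A small Python sketch plugging hypothetical covariate values (BMI = 27, AGE = 55, TOTCHOL = 200) into this fitted equation and inverting to get SYSBP on the original scale:

# Coefficients taken from the fitted equation above; covariate values are hypothetical.
def inv_sysbp(bmi, age, totchol):
    return 0.01276 - 0.00007775 * bmi - 0.00003659 * age - 0.000001463 * totchol

pred = inv_sysbp(bmi=27.0, age=55.0, totchol=200.0)
print(pred, 1.0 / pred)    # predicted SYSBP^-1 and SYSBP (mmHg)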