Midterm Flashcards
SSTO
Total Sum of Squares
SSTO=SSE+SSR
SSTO=Observed-Mean
Degrees of freedom = n-1
Distance from observed point to mean
SSTO=Σ(Yi-Ȳ)²
SSTO=Y’(I-(1/n)J)Y (quadratic form)
SSTO=Y’Y-(1/n)Y’JY
SSE
Error Sum of Squares
SSE=Observed - Predicted
Degrees of freedom = n-p
Simple linear regression: degrees of freedom = n-2
Distance from observed point to predicted value
SSE=Σ(Yi-Ŷi)²
SSE=Y’Y-Y’HY
=Y’(I-H)Y
SSR
Regression Sum of Squares
SSR=Predicted-mean
Degrees of freedom = p-1
Simple linear regression: degrees of freedom = 1
Distance from predicted value to mean
SSR=Σ(Ŷi-Ȳ)²
SSR=Y’HY-(1/n)Y’JY
=Y’(H-(1/n)J)Y
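A minimal NumPy sketch (not from the course materials) that checks the quadratic-form identities above on a small made-up dataset; the response vector Y and design matrix X below are illustrative assumptions.

```python
import numpy as np

Y = np.array([3.1, 4.0, 5.2, 6.1, 7.3])              # hypothetical response vector
X = np.column_stack([np.ones(5), [1, 2, 3, 4, 5]])   # design matrix with intercept column

n = len(Y)
I = np.eye(n)
J = np.ones((n, n))                                  # n x n matrix of ones
H = X @ np.linalg.inv(X.T @ X) @ X.T                 # hat matrix

SSTO = Y @ (I - J / n) @ Y                           # Y'(I - (1/n)J)Y
SSE  = Y @ (I - H) @ Y                               # Y'(I - H)Y
SSR  = Y @ (H - J / n) @ Y                           # Y'(H - (1/n)J)Y

assert np.isclose(SSTO, SSE + SSR)                   # SSTO = SSE + SSR
print(SSTO, SSE, SSR)
```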
What are the general assumptions of the linear model? How do you assess them? (A code sketch follows the list below.)
- Linearity: The response and covariates are related in a linear way
- Assess with scatter plots of Y vs X and residual plots
- Normality: Error terms (Ei) or responses Yi are normally distributed
- Assess with a QQ plot and Shapiro-Wilk test of the residuals or semistudentized residuals
- Assess with a histogram of the residuals
- E.g., a QQ plot that trends away from the diagonal line at the tails suggests non-normality
- Constant Variance: Variance of the errors is constant, var(Ei)=σ²
- residuals (r)
- absolute value of residuals
- residuals squared (r²)
- semistudentized residuals
- semistudentized residuals squared
- absolute value of semistudentized residuals
- Assess with scatter plots (semistudentized residuals vs Ŷ, or squared residuals vs predictors); a megaphone shape implies non-constant variance
- Breusch-Pagan test for heteroscedasticity (H0: constant variance, Ha: non-constant variance)
- Independence: Subjects (errors or responses) are independent
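A minimal Python sketch of these checks using statsmodels and scipy; the simulated age-like predictor, the response, and the model below are assumptions made only for illustration.

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(0)
x = rng.uniform(20, 60, 200)                         # hypothetical age-like predictor
y = 0.01 - 0.0001 * x + rng.normal(0, 0.0005, 200)   # hypothetical inverse-BP-like response

X = sm.add_constant(x)
fit = sm.OLS(y, X).fit()

resid = fit.resid
semistud = resid / np.sqrt(fit.mse_resid)            # semistudentized residuals

# Linearity / constant variance: plot resid (or semistud) vs fit.fittedvalues and vs x;
# look for an even band around 0 with no megaphone shape.
# Normality: QQ plot plus the Shapiro-Wilk test of the residuals.
W, p_value = stats.shapiro(resid)
print("Shapiro-Wilk p-value:", p_value)              # large p => no evidence against normality
```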
MSR
SSR/(p-1)
Variation that is explained by the fitted regression line
MSE
SSE/(n-p)
MSE is an estimate of σ²
√MSE is an estimate of σ
Variation NOT explained by the fitted regression line
F statistic
F=MSR/MSE
If all slope coefficients are 0, F ≈ 1; if at least one β ≠ 0, F tends to be > 1
R2
What does R² = 0.1885 mean?
What does adjusted R² = 0.1817 mean?
Is adjusted R² better than R²?
R² = SSR/SSTO
Coefficient of determination
Proportion of the variance of Y explained (linearly) by the variation in X
√R², with the appropriate sign, equals the correlation coefficient (r) between X and Y (only in simple linear regression)
R² = 0.1885 means that about 19% of the variation in inverse BP is explained by the predictors in the model
Adjusted R² = 0.1817 means that about 18% of the variation in inverse BP is explained by the predictors, after adjusting for the number of variables in the model
Adjusted R² is better because it accounts for the cost of adding more variables (see the computation sketch below)
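A small sketch showing how R² and adjusted R² are computed from the sums of squares; the SSE, SSTO, n, and p values are hypothetical numbers chosen so R² comes out to 0.1885 (the adjusted value is only close to the 0.1817 in the example).

```python
# Hypothetical ANOVA quantities (illustrative assumptions)
SSE, SSTO = 81.15, 100.0      # error and total sums of squares
SSR = SSTO - SSE
n, p = 340, 4                 # n observations, p parameters (intercept + 3 predictors)

r2 = SSR / SSTO                                      # R^2 = SSR/SSTO
r2_adj = 1 - (SSE / (n - p)) / (SSTO / (n - 1))      # penalizes extra predictors

print(round(r2, 4), round(r2_adj, 4))                # 0.1885 0.1813
```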
Diagnostics using Raw Residuals
- Linearity
- If linear, the residuals form an even band around 0, symmetric about zero, in plots of residuals vs predicted values and vs each X in the model
- A systematic pattern means non-linearity, or that an important predictor was omitted
- Normality of the error terms
- QQ plot of the residuals, and Shapiro-Wilk test
- Constant variance
- Look for megaphone in residuals
- increasing/decreasing trend in absolute value of residuals or squared residuals versus the predicted values, or versus any X.
- Can also be seen on scatter plot of Y vs each X.
Limitations of R²
- A high R² does not necessarily imply useful prediction (better to look at the width of the prediction interval)
- A high R² does not necessarily mean a good fit
- R² close to 0 does not mean there is no relationship; it only indicates a weak linear relationship
Box Cox Transformation
Method to find a useful transformation
Chooses the best λ from the data
Modeling Strategy
1. Look at the data
2. Fit a preliminary model
3. Perform diagnostics (normality, constant variance, linearity)
4. Fix problems, e.g., with a Box-Cox transformation
5. Fit the new model
6. Diagnose the new model
7. Repeat steps 4-6 until all problems are fixed
8. Final inference (p-values, CIs, etc.)
Boxplot
A boxplot is used to describe the symmetry of the data; you cannot directly conclude normality from a boxplot
Collinearity is tested by
Correlation Matrix
Outliers can be detected by
Plotting the semistudentized residuals against X or the predicted Y
Omission of important predictors can be tested by
Plotting residuals against omitted variables
Non-linearity can be fixed by
Transformation
- Transform X if the error terms are normally distributed and have constant variance
- Transform Y when there is unequal error variance and non-normality of the error terms
- Box Cox Transformation
Non-constant error variance can be fixed by
Variance stabilizing transformation
Weighted least squares
Non-independence of the error terms can be fixed by
Adding a time covariate to the model
Omission of important predictors can be fixed by
Adding them
Transformation
- Can linearize non-linear relationships
- Can stabilize non-constant variance
- Can reduce non-normality
Interpret: 1/SBP=0.00985-0.0000416*AGE
For every 1-year increase in age, we expect the mean inverse SBP to decrease by 0.0000416 mmHg^(-1) (holding any other covariates in the model constant)
What is the Null Hypothesis and Alternative Hypothesis for F test for multiple variable.
Decision Rule
F critical value
F test equation
H0: β1 = β2 = ... = βp-1 = 0
Ha: Not all βs are equal to 0
Decision rule:
If F* ≤ F(1-α; p-1, n-p), conclude H0
If F* > F(1-α; p-1, n-p), conclude Ha
F critical value = F(1-α; p-1, n-p)
F test equation: F* = MSR/MSE (see the sketch below)
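A short scipy sketch of the decision rule; the MSR, MSE, n, and p values are hypothetical.

```python
from scipy import stats

n, p, alpha = 340, 4, 0.05                      # illustrative sample size and parameter count
MSR, MSE = 5.1e-07, 1.96e-08                    # hypothetical mean squares

F_star = MSR / MSE                              # F* = MSR/MSE
F_crit = stats.f.ppf(1 - alpha, p - 1, n - p)   # F(0.95; 3, 336), roughly 2.63
p_value = stats.f.sf(F_star, p - 1, n - p)

print(F_star > F_crit, p_value)                 # True => conclude Ha: not all betas are 0
```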
What is null hypothesis for the Spearman correlation?
H0: The Spearman rank correlation between the two variables is 0 (no monotonic association)
Degree of freedom
df(SSTO) = df(SSR) + df(SSE)
n-1 = (p-1) + (n-p)
Interpret Breusch-Pagan test results
P=0.3
Null: Constant Variance
Alter: Non-constant Variance
E.g., for Age with a p-value of 0.3, we fail to reject the null hypothesis at the 0.05 level and conclude there is NOT enough evidence of non-constant variance
E.g., even though BMI has a borderline significant p-value, the overall test suggests constant variance, so no remedial action is necessary at this point
How to test for outliers?
You can check for outliers by plotting the semi-studentized residuals against X or the predicted values
Any observation with a semi-studentized residual beyond about ±4 suggests an outlier (see the sketch below)
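A tiny NumPy sketch of the ±4 rule; the residuals and MSE below are made up.

```python
import numpy as np

resid = np.array([0.4, -0.2, 2.9, -0.1, 0.3])   # hypothetical raw residuals
MSE = 0.49                                      # hypothetical mean squared error

semistud = resid / np.sqrt(MSE)                 # semi-studentized residuals: e_i / sqrt(MSE)
print(np.where(np.abs(semistud) > 4)[0])        # index 2 is flagged (2.9/0.7 is about 4.1)
```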
Write the equation. Dependent variable: SYSBP^(-1). Intercept = 0.01276. Independent variables: BMI = -0.00007775, AGE = -0.00003659, TOTCHOL = -0.000001463
SYSBP^(-1)=0.01276-0.00007775(BMI)-0.00003659(AGE)-0.000001463(TOTCHOL)
Interpret Shapiro-Wilk test results
p=0.624
Null: Residuals are normal
Alt: Residuals are NOT normal
At the 0.05 level, we fail to reject the null hypothesis and conclude there is NOT enough evidence that the residuals are not normal
What is the decision rule for F(0.95; 3, 336) = 2.631?
Decision Rule
If F* ≤ 2.631, conclude H0
If F* > 2.631, conclude Ha (one predictor: there is a linear relationship; multiple predictors: at least one of the coefficients is non-zero)
Does β0 have an interpretation in this model? Why or why not (no more than 2 sentence answer).
No. A BMI, age, or total cholesterol of 0 does not make clinical sense and is biologically impossible.
The range of the data does not include 0 for any of the independent variables, so we cannot draw inference there.
How to report the significance of EACH coefficient for each predictor in the model?
- State statistic
- State hypotheses
- Interpret the results (e.g., Age: test statistic -5.60, p<0.0001)
- Statistic: t test
- Hypotheses:
H0: βk = 0 (the parameter is zero)
Ha: βk ≠ 0 (the parameter is NOT zero)
- We reject the null hypothesis and conclude that the coefficient of Age is significantly different from 0
Interpret BMI coefficient
SYSBP^(-1)=0.01276-0.00007775(BMI)-0.00003659(AGE)-0.000001463(TOTCHOL)
For every 1-unit increase in BMI, the average inverse systolic blood pressure decreases by 0.00007775 mmHg^(-1), holding age and total serum cholesterol constant.
True or False
Boxplot is designed to assess Normality
False:
A boxplot is designed for examining quantiles and describing the symmetry of the data, so it does not reveal the full shape of the distribution.
Interpret Age coefficient
SYSBP^(-1)=0.01276-0.00007775(BMI)-0.00003659(AGE)-0.000001463(TOTCHOL)
For every 1-year increase in age, the average inverse systolic blood pressure decreases by 0.00003659 mmHg^(-1), holding BMI and total serum cholesterol constant.
Interpret TOTCHOL coefficient
SYSBP^(-1)=0.01276-0.00007775(BMI)-0.00003659(AGE)-0.000001463(TOTCHOL)
For every 1-unit increase in total serum cholesterol, the average inverse systolic blood pressure decreases by 0.000001463 mmHg^(-1), holding BMI and age constant.
Interpret the 90% confidence interval for BMI: (-0.00010325, -0.00005225)
We are 90% confident that the true coefficient of BMI falls between -0.00010325 and -0.00005225
Interpret the 95% confidence interval for AGE: (-0.00004738, -0.00002581)
We are 95% confident that the true coefficient of Age falls between -0.00004738 and -0.00002581
Predict the average expected inverse systolic blood pressure for a person with BMI 25, age 54, and total cholesterol 200, with a 95% confidence interval, and interpret the results.
SYSBP^(-1)=0.01276-0.00007775(BMI)-0.00003659(AGE)-0.000001463(TOTCHOL)
95% confidence interval
Ŷh ± t(1-α/2; n-p)·s{Ŷh}
s²{Ŷh} = MSE·Xh'(X'X)^(-1)Xh = Xh'·s²{b}·Xh
SYSBP^(-1)=0.01276-0.00007775(25)-0.00003659(54)-0.000001463(200)
=0.007914
The predicted average expected INVERSE systolic blood pressure is 0.007914, with a 95% confidence interval of (0.007362, 0.008092) mmHg^(-1)
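A NumPy/scipy sketch of the matrix formula for the confidence interval of the mean response at Xh; the simulated design matrix, coefficients, and error spread are assumptions, so the printed interval will not reproduce the flashcard's numbers.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, p = 340, 4
X = np.column_stack([np.ones(n),
                     rng.normal(26, 4, n),       # BMI-like column (illustrative)
                     rng.normal(50, 9, n),       # age-like column
                     rng.normal(235, 40, n)])    # cholesterol-like column
beta = np.array([0.01276, -0.00007775, -0.00003659, -0.000001463])
Y = X @ beta + rng.normal(0, 0.001, n)           # simulated inverse SBP

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ Y                            # b = (X'X)^(-1) X'Y
MSE = (Y - X @ b) @ (Y - X @ b) / (n - p)

Xh = np.array([1, 25, 54, 200])                  # BMI 25, age 54, cholesterol 200
Yhat_h = Xh @ b
se = np.sqrt(MSE * (Xh @ XtX_inv @ Xh))          # s{Yhat_h}
t_crit = stats.t.ppf(0.975, n - p)               # t(1 - alpha/2; n - p)
print(Yhat_h - t_crit * se, Yhat_h + t_crit * se)
```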
Steps in diagnostics
1. Visualize your data (scatter plots, histograms)
2. Examine the residuals (residual plots and diagnostic tests)
ei
ei = Yi - Ŷi
ei is the best estimate of the error term Ei
Least square method
Find the line through the data that has the smallest sum of squared vertical distances from each observed point to the line
Semi-studentized residual
residual (ei) / √MSE
MSE is taken from the ANOVA table
MSE is a good estimator of ?
σ²
What are the 6 problems to diagnose?
- Non-linear regression function
- Non-constant variance of the error terms (Ei)
- Non-independence of error terms (Ei)
- Identify outliers
- Non-normality of errors (Ei)
- Omission of other important predictors from the model
Histogram
Check normal distribution
BUT the sample size can affect its appearance
qqplot
Plots the sample quantiles against theoretical normal quantiles; if the points lie along the diagonal line, the residuals are approximately normally distributed
Linearity
Check via scatterplot
Plot residual vs predicted
or
Plot residual vs xi
Which one is better to plot?
Plot residual vs Xi
or
Plot residual vs Y^ (predicted value)
Plotting residuals vs Ŷ (predicted values) is preferred for multiple regression
Residuals can be used to check what?
Non-constant variance
- A megaphone shape suggests non-constant variance; no pattern suggests constant variance
Outliers: plot semi-studentized residuals vs predicted Ŷi (outliers are beyond +4 or -4)
Why is it better to check normality last?
It can be affected by many things such as non-constant variance
Non-constant variance of the error terms (Ei)
ei vs X
ei² vs X (it magnifies the relationship and is preferred)
|ei| vs X
Non-independence of error terms
ei vs time or order of data collection (sequence plot)
Identify outliers
semi-stud residuals vs predicted
Breusch-Pagan test (example below)
- Large-sample test
- Assumes the error terms Ei are independent
- Assumes the error terms Ei are normally distributed
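A minimal statsmodels sketch of running the Breusch-Pagan test; the simulated data are an assumption and are generated with constant variance, so a large p-value is expected.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 200)
y = 1.0 + 0.5 * x + rng.normal(0, 1, 200)        # homoscedastic errors (illustrative)

X = sm.add_constant(x)
resid = sm.OLS(y, X).fit().resid

lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(resid, X)
print(lm_pvalue)                                 # large p => no evidence of non-constant variance
```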
When you identify non-linearity, then?
- Add non-linear terms (like X²)
- Use a transformation of Y and/or X
When you identify non-constancy of the error variance, then?
- Variance stabilizing transformation
- Weighted least squares
When you find non-independence of the error terms, then?
- Add a time covariate to the model
When you find that important predictors are omitted, then?
- Add them
When you identify outliers, then?
- Check if they are errors in the data and correct them
- Robust linear regression methods
Transformation can:
- Can linearize non-linear relationships
- Can stabilize non-constant variance
- Can reduce non-normality
Transform X or Y?
- Start by transforming Y
- If you see a non-linear trend (quadratic, logarithmic) in the scatter plot, add terms like X² or log(X) to the model from the start
Box Cox Transformation
Used to find best transformation from the family of power transformations on Y
Chooses the best λ (lambda) from the data (see the sketch below)
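A short scipy sketch of letting Box-Cox pick λ by maximum likelihood; the simulated skewed, positive outcome is an assumption.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
y = rng.lognormal(mean=4.8, sigma=0.2, size=300)   # hypothetical skewed, positive SBP-like data

y_transformed, lam = stats.boxcox(y)               # lambda estimated from the data
print(round(lam, 2))                               # near 0 suggests log(Y); near -1 suggests 1/Y
```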
LOWESS
LOWESS (locally weighted regression scatterplot smoother) is a non-parametric regression curve that can be used to check linearity.
- If the LOWESS curve stays inside the confidence bands of the fitted line, linearity is satisfied
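A brief statsmodels sketch of computing a LOWESS curve to overlay on the scatter plot; the simulated data and the frac smoothing fraction are assumptions.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
x = rng.uniform(0, 10, 150)
y = 2 + 0.8 * x + rng.normal(0, 1, 150)            # roughly linear relationship

smoothed = sm.nonparametric.lowess(y, x, frac=0.5) # columns: sorted x, smoothed y
print(smoothed[:3])
# Overlay `smoothed` on the scatter plot of Y vs X (or on residuals vs fitted);
# a curve that stays close to the fitted straight line supports linearity.
```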
Smoothing technique, Non-parametric regression curves can help:
-Fit the data with a smooth curve
-Does not assume the shape of the curve
Write the Linear, First order regression model
Yi = β0 + β1Xi1 + ... + βp-1Xi,p-1 + Ei
What is the Matrix format?
Y = Xβ + E
E(Y) = Xβ
Var(Y) = σ²I
What is the most mathematically right way to write model?
E(Y) = Xβ
How to write the coefficient estimates in matrix form
b = (X’X)^(-1)X’Y
How to write the variance estimate in matrix form
var(b) = σ²(X’X)^(-1)
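A small NumPy sketch of these two matrix formulas, with MSE standing in for σ²; the simulated X, Y, and true coefficients are assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
Y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(0, 0.3, n)

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ Y                        # b = (X'X)^(-1) X'Y
MSE = (Y - X @ b) @ (Y - X @ b) / (n - X.shape[1])
var_b = MSE * XtX_inv                        # estimated var-cov matrix of b
print(b)                                     # coefficient estimates
print(np.sqrt(np.diag(var_b)))               # their standard errors
```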
Short Multivariate Normal Density Function
Y ~ MVN(μ, Σ)
Y = β0 + β1X1 + β2X2 + E
- How many parameters are in this equation?
- β0?
- β1?
- β2?
- 3 parameters
- β0 is only meaningful if the range of X1 and X2 includes 0; it is the mean response, E(Y), at X1=0 and X2=0
- β1 is the change in the mean response, E(Y), per unit increase in X1 while holding X2 constant
- β2 is the change in the mean response, E(Y), per unit increase in X2 while holding X1 constant
Root MSE
Estimate for σ