Midterm Flashcards

1
Q

SSTO

A

Total Sum of Squares
SSTO=SSE+SSR
Each term: observed value minus the overall mean (Yi−Y̅)
Degrees of freedom=n-1
Measures the distance from each observed point to the mean

SSTO=Σ(Yi−Y̅)^2

SSTO=Y’(I-(1/n)J)Y (quadratic form)
SSTO=Y’Y-(1/n)Y’JY

2
Q

SSE

A

Error Sum of Squares
Each term: observed minus predicted (Yi−Y^i)

Degrees of freedom=n-p
In simple linear regression the degrees of freedom are n-2
Measures the distance from each observed point to its predicted value
SSE=Σ(Yi−Y^i)^2

SSE=Y’Y-Y’HY
=Y’(I-H)Y

3
Q

SSR

A

Regression Sum of Squares
Each term: predicted minus the overall mean (Y^i−Y̅)
Degrees of freedom=p-1

In simple linear regression the degrees of freedom=1
Measures the distance from the fitted line to the mean
SSR=Σ(Y^i−Y̅)^2

SSR=Y’HY-(1/n)Y’JY
=Y’(H-(1/n)J)Y
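
The decomposition SSTO=SSE+SSR can be checked numerically. Below is a minimal sketch in Python for simple linear regression; the x and y values are made up for illustration and are not from the course data.

```python
# Toy data (hypothetical values, for illustration only).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(y)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Least-squares estimates for simple linear regression.
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * xi for xi in x]

ssto = sum((yi - y_bar) ** 2 for yi in y)              # observed vs. mean
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # observed vs. fitted
ssr = sum((fi - y_bar) ** 2 for fi in y_hat)           # fitted vs. mean

# The two printed values agree up to floating-point rounding.
print(ssto, sse + ssr)
```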

4
Q

What are the general assumptions of the linear model? How to assess it?

A
  1. Linearity: The response and covariates are related in a linear way
    - Assess with a scatter plot of Y vs X and with residual plots
  2. Normality: Error terms (Ei), and therefore the responses Yi given X, are normally distributed
    - Assess with a QQ plot and the Shapiro-Wilk test of the residuals or semistudentized residuals
    - Assess with a histogram of the residuals
    E.g., a QQ plot whose tails trend away from the diagonal line suggests non-normality
  3. Constant Variance: Variance of the errors is constant. var(Ei)=σ^2
    Quantities to plot:
    - residuals (r)
    - absolute value of residuals
    - squared residuals (r^2)
    - semistudentized residuals
    - squared semistudentized residuals
    - absolute value of semistudentized residuals
    • Assess with a scatter plot (semistudentized residuals vs Y, or squared residuals vs predictors); a megaphone shape implies non-constant variance
    • Breusch-Pagan test for heteroscedasticity (H0: constant variance, Ha: non-constant variance)
  4. Independence: Subjects (errors or responses) are independent
5
Q

MSR

A

MSR=SSR/(p-1)

Variation that is explained by the fitted regression line

6
Q

MSE

A

MSE=SSE/(n-p)
MSE is an estimate of σ^2
√(MSE) is an estimate of σ
Variation NOT explained by the fitted regression line

7
Q

F statistic

A

F=MSR/MSE

If 𝛽=0, F is expected to be close to 1; if 𝛽≠0, F tends to be greater than 1
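
A short sketch of how MSR, MSE and F fit together for simple linear regression (p=2). The function name `anova_simple` and the toy data are hypothetical, chosen only so the arithmetic is easy to follow by hand.

```python
def anova_simple(x, y):
    """Return (MSR, MSE, F) for a simple linear regression of y on x."""
    n, p = len(y), 2  # p = 2 parameters: intercept and slope
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    ssr = sum((b0 + b1 * xi - y_bar) ** 2 for xi in x)
    msr = ssr / (p - 1)   # regression mean square
    mse = sse / (n - p)   # error mean square, estimates sigma^2
    return msr, mse, msr / mse


# Toy data: SSR = 3.6, SSE = 2.4, so F is approximately 3.6 / 0.8 = 4.5.
msr, mse, f_stat = anova_simple([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(msr, mse, f_stat)
```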

8
Q

R2

What does R2=0.1885 mean?

What does adjusted R2=0.1817 mean?

Is adjusted R2 better than R2?

A

R2=SSR/SSTO
Coefficient of determination
Proportion of the variance of Y explained (linearly) by the variation in X
√R2 with the appropriate sign is equal to the correlation coefficient (r) between X and Y (only in simple linear regression)

R2=0.1885 means about 19% of the variation in inverse BP is explained by the predictors in the model

Adjusted R2=0.1817 means about 18% of the variation in inverse BP is explained by the predictors in the model after adjusting for the number of variables in the model

Adjusted R2 is better for comparing models because it penalizes for the number of variables
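
For reference, a small sketch of both quantities using the standard formulas R2=1-SSE/SSTO and adjusted R2=1-((n-1)/(n-p))(SSE/SSTO). The function names and the numbers in the example call are hypothetical.

```python
def r_squared(sse, ssto):
    """Proportion of total variation explained by the model."""
    return 1 - sse / ssto


def adjusted_r_squared(sse, ssto, n, p):
    """R-squared penalized for the number of parameters p (incl. intercept)."""
    return 1 - ((n - 1) / (n - p)) * (sse / ssto)


# Hypothetical sums of squares: SSE = 2.4, SSTO = 6.0, n = 5, p = 2.
r2 = r_squared(2.4, 6.0)
r2_adj = adjusted_r_squared(2.4, 6.0, n=5, p=2)
print(r2, r2_adj)  # adjusted value is never larger than the raw value
```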

9
Q

Diagnostics using Raw Residuals

A
  1. Linearity
    - If linear, residuals form an even band around 0 across the plots of residuals vs. predicted values and vs. each X in the model. Symmetry around zero indicates linearity
    - A systematic pattern means non-linearity, or that an important predictor was omitted.
  2. Normality of the error terms
    - QQ plot of the residuals, and the Shapiro-Wilk test
  3. Constant variance
    - Look for a megaphone shape in the residuals
    - Look for an increasing/decreasing trend in the absolute value of residuals or squared residuals versus the predicted values, or versus any X.
    - Can also be seen on a scatter plot of Y vs each X.
10
Q

Limitation of R2

A
  1. High R2 does not necessarily imply useful prediction (better to look at width of prediction interval)
  2. High R2 does not necessarily mean good fit
  3. R2 close to 0 does not mean no relationship. It only indicates that the linear relationship is weak; a strong non-linear relationship may still exist
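
Point 3 can be illustrated with a made-up example: Y is determined exactly by X (Y=X^2), yet a straight-line fit explains none of the variation. The data are hypothetical.

```python
# Y depends perfectly on X, but not linearly.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [xi ** 2 for xi in x]
n = len(y)

x_bar, y_bar = sum(x) / n, sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = sxy / sxx                 # fitted slope is 0 for this symmetric data
b0 = y_bar - b1 * x_bar

ssto = sum((yi - y_bar) ** 2 for yi in y)
ssr = sum((b0 + b1 * xi - y_bar) ** 2 for xi in x)
r2 = ssr / ssto

print(b1, r2)  # both are 0: no linear relationship, despite Y = X^2
```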
11
Q

Box Cox Transformation

A

Method to find a useful transformation

Chooses the best λ from the data

12
Q

Modeling Strategy

A
  1. Look at data
  2. Fit preliminary model
  3. Perform diagnostics (normality, constant variance, linearity)
  4. Fix problems by Box Cox Transformation
  5. Fit new model
  6. Diagnose new model
  7. Repeat steps 4-6 until all problems are fixed
  8. Final inference (p-values, CI, etc)
13
Q

Boxplot

A

A boxplot describes the symmetry of the data; you cannot draw a conclusion about normality directly from a boxplot

14
Q

Collinearity is tested by

A

Correlation Matrix

15
Q

Outlier can be tested by

A

Plotting semistudentized residual against X or predicted Y

16
Q

Omission of important predictors can be tested by

A

Plotting residuals against omitted variables

17
Q

Non linearity can be fixed by

A

Transformation

  • Transform X if the error terms are normally distributed and have constant variance
  • Transform Y when the error variance is unequal and the error terms are non-normal
  • Box Cox Transformation
18
Q

Non-constant variance can be fixed by

A

Variance stabilizing transformation

Weighted least squares

19
Q

Non independence of the error terms can be fixed by

A

Adding a time covariate to the model

20
Q

Omission of important predictors can be fixed by

A

Adding them

21
Q

Transformation

A
  1. Can linearize non-linear relationships
  2. Can stabilize non-constant variance
  3. Can reduce non-normality
22
Q

Interpret: 1/SBP=0.00985-0.0000416*AGE

A

For every 1-year increase in age, we expect the mean inverse SBP to decrease by 0.0000416 mmHg^(-1) (holding any other variables in the model constant)
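
To see what the slope means in practice, here is a small sketch that evaluates the fitted equation from this card at two ages one year apart. The function name `inverse_sbp` and the example ages are hypothetical. Taking 1/prediction gives a rough value on the SBP scale, but that back-transformed number is not exactly the mean SBP.

```python
def inverse_sbp(age):
    """Fitted mean of 1/SBP (mmHg^-1) from the card's equation."""
    return 0.00985 - 0.0000416 * age


# One extra year of age changes the prediction by the slope.
diff = inverse_sbp(51) - inverse_sbp(50)

# Rough back-transformation to the SBP scale at age 50 (about 128.7 mmHg).
sbp_50 = 1 / inverse_sbp(50)

print(diff, sbp_50)
```

Because the inverse of SBP decreases with age, SBP itself increases with age.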

23
Q

What is the Null Hypothesis and Alternative Hypothesis for F test for multiple variable.

Decision Rule

F critical value

F test equation

A

H0: 𝛽1=𝛽2=...=𝛽p-1=0
H1: Not all 𝛽s are equal to 0

Decision rule (F* is the computed test statistic):
If F*≤F(1-α;p-1,n-p), conclude H0
If F*>F(1-α;p-1,n-p), conclude H1

F critical value=F(1-α;p-1,n-p)

F test statistic: F*=MSR/MSE

24
Q

What is null hypothesis for the Spearman correlation?

A

H0: There is no monotonic association between the two variables (Spearman correlation=0)

25
Q

Degree of freedom

df(SSTO)=df(SSR)+df(SSE)

A

df(SSTO)=df(SSE)+df(SSR)

n-1=(n-p)+(p-1)

26
Q

Interpret Breusch-Pagan test results

P=0.3

A

Null: Constant variance
Alternative: Non-constant variance

E.g., for Age with p-value 0.3, we fail to reject the null hypothesis at the 0.05 level; there is NOT enough evidence to conclude non-constant variance

E.g., even though BMI has a borderline significant p-value, the overall test suggests constant variance, so no remedial actions are necessary at this point

27
Q

How to test outlier?

A

You can check for outliers with a scatter plot of the semistudentized residuals

Any observation whose semistudentized residual is greater than 4 or less than -4 suggests an outlier.

28
Q
Write equation
Dependent variable: SYSBP^(-1)
Intercept=0.01276
Independent variables
- BMI: -0.00007775
- AGE: -0.00003659
- TOTCHOL: -0.000001463
A

SYSBP^(-1)=0.01276-0.00007775(BMI)-0.00003659(AGE)-0.000001463(TOTCHOL)
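
The fitted equation can be wrapped in a small function for plugging in values. This is only a sketch: the function name `predict_inverse_sbp` and the example inputs (BMI 30, age 40, total cholesterol 180) are hypothetical.

```python
def predict_inverse_sbp(bmi, age, totchol):
    """Fitted mean of SYSBP^(-1) (mmHg^-1) from the card's coefficients."""
    return (0.01276
            - 0.00007775 * bmi
            - 0.00003659 * age
            - 0.000001463 * totchol)


# Hypothetical inputs, for illustration only.
example = predict_inverse_sbp(bmi=30, age=40, totchol=180)
print(example)
```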

29
Q

Interpret Shapiro-Wilk test results

p=0.624

A

Null: Residuals are normal
Alt: Residuals are NOT normal

At the 0.05 level, we fail to reject the null hypothesis; there is NOT enough evidence to conclude that the residuals are not normal

30
Q

What is the decision rule for F(0.95,3,336)=2.631

A

Decision Rule
If F≤2.63149, conclude H0
If F>2.63149, conclude HA (one predictor: there is a linear relationship; multiple predictors: at least one of the coefficients is non-zero)

31
Q

Does β0 have an interpretation in this model? Why or why not (no more than 2 sentence answer).

A

No. A BMI, age, or total cholesterol of 0 makes no clinical sense and is biologically impossible.

The range of the data does not include 0 for any of the independent variables, so we cannot draw inference there.

32
Q

How to report the significance of EACH coefficient for each predictor in the model?

  • State statistic
  • State hypotheses
  • interpret the results (Age, test statistics -5.60, p<0.0001)
A

-Statistic: t test
-Hypotheses
H0: B1=0 (parameter is zero)
HA: B1≠0 (parameter is NOT equal to zero)

  • With t=-5.60 and p<0.0001, we reject the null hypothesis and conclude that the coefficient of age is significantly different from 0
33
Q

Interpret BMI coefficient

SYSBP^(-1)=0.01276-0.00007775(BMI)-0.00003659(AGE)-0.000001463(TOTCHOL)

A

For every 1-unit increase in BMI, the average inverse systolic blood pressure decreases by 0.00007775 mmHg^(-1), holding age and total serum cholesterol constant.

34
Q

True or False

Boxplot is designed to assess Normality

A

False:
A boxplot is designed for examining quantiles and describing the symmetry of the data, so it does not reveal the whole distribution.

35
Q

Interpret Age coefficient

SYSBP^(-1)=0.01276-0.00007775(BMI)-0.00003659(AGE)-0.000001463(TOTCHOL)

A

For every 1-year increase in age, the average inverse systolic blood pressure decreases by 0.00003659 mmHg^(-1), holding BMI and total serum cholesterol constant.

36
Q

Interpret TOTCHOL coefficient

SYSBP^(-1)=0.01276-0.00007775(BMI)-0.00003659(AGE)-0.000001463(TOTCHOL)

A

For every 1-unit increase in total serum cholesterol, the average inverse systolic blood pressure decreases by 0.000001463 mmHg^(-1), holding BMI and age constant.

37
Q
Interpret 90% confidence interval
of BMI (-0.00010325, -0.00005225)
A

We are 90% confident that the true coefficient of BMI falls between -0.00010325 and -0.00005225

38
Q
Interpret 95% confidence interval
of AGE (-0.00004738, -0.00002581)
A

We are 95% confident that the true coefficient of Age falls between -0.00004738 and -0.00002581
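
A quick consistency check on the two intervals, assuming they have the usual symmetric form b ± t·s{b}: the midpoint should recover the coefficient estimate and the half-width is the margin of error. The helper name `ci_summary` is hypothetical.

```python
def ci_summary(lower, upper):
    """Return (midpoint, half-width, excludes_zero) for an interval."""
    estimate = (lower + upper) / 2
    margin = (upper - lower) / 2
    excludes_zero = lower > 0 or upper < 0
    return estimate, margin, excludes_zero


bmi = ci_summary(-0.00010325, -0.00005225)   # 90% interval for BMI
age = ci_summary(-0.00004738, -0.00002581)   # 95% interval for AGE

# Midpoints match the fitted coefficients (-0.00007775 and about -0.00003659).
print(bmi)
print(age)
```

Both intervals exclude 0, which agrees with rejecting H0: coefficient=0 at the matching significance level.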

39
Q

Predict the mean inverse systolic blood pressure for a person with BMI 25, age 54, and cholesterol level 200, give the 95% confidence interval, and interpret the results

SYSBP^(-1)=0.01276-0.00007775(BMI)-0.00003659(AGE)-0.000001463(TOTCHOL)

A

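95% confidence interval for the mean response
Y^h±t(1-α/2;n-p)s{Y^h}
s^2{Y^h}=MSE(Xh’(X’X)^(-1)Xh)=Xh’s^2{b}Xh

SYSBP^(-1)=0.01276-0.00007775(25)-0.00003659(54)-0.000001463(200)

Reported answer: the predicted mean INVERSE systolic blood pressure is 0.007914 with a 95% confidence interval of (0.007362, 0.008092) mmHg^(-1).
Note: plugging the rounded coefficients into the equation gives 0.01276-0.00194375-0.00197586-0.00029260≈0.008548, which does not match the reported 0.007914, so re-check the inputs against the software output.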

40
Q

Steps in diagnostics

A
  1. Visualize your data using Scatter plot, Histogram

2.

41
Q

ei

A

ei=Yi-Y^i

ei is best estimate of Ei

42
Q

Least square method

A

Find the line through the data that has the smallest sum of squared vertical distances from each point to the line
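
A minimal sketch of the idea for simple linear regression: compute the least-squares line, then confirm that nudging the intercept or slope in any direction only increases the sum of squared vertical distances. The data and function names are hypothetical.

```python
def sse_for_line(b0, b1, x, y):
    """Sum of squared vertical distances from the points to the line."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))


def least_squares(x, y):
    """Closed-form least-squares intercept and slope."""
    n = len(y)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0, b1 = least_squares(x, y)          # approximately 2.2 and 0.6
best = sse_for_line(b0, b1, x, y)     # approximately 2.4

# SSE of every nearby line obtained by shifting b0 and/or b1 by 0.1.
nudged = [sse_for_line(b0 + d0, b1 + d1, x, y)
          for d0 in (-0.1, 0.0, 0.1)
          for d1 in (-0.1, 0.0, 0.1)
          if (d0, d1) != (0.0, 0.0)]

print(best, min(nudged))  # every nudged line has a larger SSE
```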

43
Q

Semi-studentized residual

A

Semistudentized residual=ei/√(MSE)

MSE is found in the ANOVA table
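
A sketch of the calculation for simple linear regression, combined with the ±4 rule of thumb used in this deck. The data are made up: 30 points lying exactly on a line, with one response shifted upward by 50. Function names are hypothetical.

```python
import math


def semistudentized_residuals(x, y):
    """Return e_i / sqrt(MSE) for a simple linear regression of y on x."""
    n, p = len(y), 2
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    mse = sum(e ** 2 for e in residuals) / (n - p)
    root_mse = math.sqrt(mse)
    return [e / root_mse for e in residuals]


def flag_outliers(e_star, cutoff=4.0):
    """Indices whose semistudentized residual exceeds the cutoff in size."""
    return [i for i, e in enumerate(e_star) if abs(e) > cutoff]


x = list(range(1, 31))
y = [2.0 * xi for xi in x]
y[14] += 50.0                      # plant one outlier at index 14

e_star = semistudentized_residuals(x, y)
print(flag_outliers(e_star))       # only the planted point is flagged
```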

44
Q

MSE is a good estimator of ?

A

σ2

45
Q

What are the 6 problems to diagnose?

A
  1. Non-linear regression function
  2. Non-constant variance of the error terms (Ei)
  3. Non-independence of error terms (Ei)
  4. Presence of outliers
  5. Non-normality of errors (Ei)
  6. Omission of other important predictors from the model
46
Q

Histogram

A

Checks whether the residuals look normally distributed

BUT sample size can affect its appearance

47
Q

qqplot

A

Plots the sample quantiles against the theoretical normal quantiles; if the points lie on the diagonal reference line, the residuals are approximately normally distributed

48
Q

Linearity

A

Check via scatterplot
Plot residual vs predicted

or

Plot residual vs xi

49
Q

Which one is better to plot?
Plot residual vs Xi
or
Plot residual vs Y^ (predicted value)

A

Plot residual vs Y^ (predicted value) is preferred for multiple regression

50
Q

What can residuals be used to check?

A

Non-constant variance
-No pattern means constant variance; a megaphone shape means non-constant variance

Check for outliers (plot semistudentized residuals vs predicted Y^i; outliers are >4 or <-4)

51
Q

Why is it better to check normality last?

A

Apparent non-normality can be caused by many other problems, such as non-constant variance

52
Q

Non constant variance of the error term (Ei)

A

ei vs X
ei^2 vs X (it magnifies the relationship, so it is preferred)
|ei| vs X

53
Q

Non-independence of error terms

A

ei vs predicted

54
Q

Identify outliers

A

semi-stud residuals vs predicted

55
Q

Breusch Pagan test

A
  • Large sample test
  • Assumes the error terms Ei are independent
  • Assume the error terms Ei are normally distributed
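
A rough sketch of the test for a single predictor, assuming the textbook large-sample form BP=(SSR*/2)/(SSE/n)^2, where SSR* is the regression sum of squares from regressing the squared residuals on X, compared with a chi-square distribution with 1 degree of freedom. Statistical packages may report a studentized variant instead. The data and function names are hypothetical.

```python
import math


def fit_line(x, y):
    """Least-squares intercept and slope for y on x."""
    n = len(y)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


def breusch_pagan_simple(x, y):
    """Return (BP statistic, p-value) for one predictor (1 df)."""
    n = len(y)
    b0, b1 = fit_line(x, y)
    e2 = [(yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y)]
    sse = sum(e2)
    # Regress the squared residuals on x and take that fit's SSR.
    g0, g1 = fit_line(x, e2)
    e2_bar = sse / n
    ssr_star = sum((g0 + g1 * xi - e2_bar) ** 2 for xi in x)
    stat = (ssr_star / 2) / (sse / n) ** 2
    # Upper tail of chi-square with 1 df: P(Z^2 > stat) = erfc(sqrt(stat/2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Made-up data whose scatter grows with x (a megaphone pattern).
x = list(range(1, 13))
y = [3 + 2 * xi + (0.3 * xi if i % 2 == 0 else -0.3 * xi)
     for i, xi in enumerate(x)]
stat, p_value = breusch_pagan_simple(x, y)
print(stat, p_value)  # a small p-value would suggest non-constant variance
```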
56
Q

When you identify Non-linearity, then?

A
  • Add nonlinear terms (like X^2)
  • Use a transformation of Y and/or X

57
Q

When you identify non-constancy of the error variance, then?

A
  • Variance stabilizing transformation
  • Weighted least squares

58
Q

When you found non independence of error terms, then?

A

-Adding time covariate to the model

59
Q

When you found important predictors are omitted, then?

A

-Add them

60
Q

When you identify outlier, then?

A
  • Check if they are errors in the data and correct them
  • Use robust linear regression methods

61
Q

Transformation can:

A
  1. Can linearize non-linear relationships
  2. Can stabilize non-constant variance
  3. Can reduce non-normality
62
Q

Transform X or Y?

A
  1. Start by transforming Y
  2. If you see a non-linear trend (quadratic, logarithmic) in the scatter plot, add terms like X^2 or log(X) to the model from the start
63
Q

Box Cox Transformation

A

Used to find the best transformation from the family of power transformations on Y
Chooses the best λ (lambda) from the data
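
One common way to choose λ is a grid search: standardize the transformed response so that SSEs are comparable across λ, regress it on X, and keep the λ with the smallest SSE. The sketch below assumes that procedure for a single predictor; the data are constructed so that 1/Y is exactly linear in X, and all names are hypothetical.

```python
import math


def sse_of_fit(x, w):
    """SSE from a simple linear regression of w on x."""
    n = len(w)
    x_bar, w_bar = sum(x) / n, sum(w) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (wi - w_bar) for xi, wi in zip(x, w))
    b1 = sxy / sxx
    b0 = w_bar - b1 * x_bar
    return sum((wi - (b0 + b1 * xi)) ** 2 for xi, wi in zip(x, w))


def standardized_box_cox(y, lam):
    """Power transform of y, scaled by the geometric mean of y."""
    n = len(y)
    k2 = math.exp(sum(math.log(yi) for yi in y) / n)  # geometric mean
    if lam == 0:
        return [k2 * math.log(yi) for yi in y]
    k1 = 1 / (lam * k2 ** (lam - 1))
    return [k1 * (yi ** lam - 1) for yi in y]


def best_lambda(x, y, grid):
    """Grid value of lambda giving the smallest SSE."""
    return min(grid, key=lambda lam: sse_of_fit(x, standardized_box_cox(y, lam)))


# Made-up data: 1/Y is exactly linear in X, so lambda = -1 should win.
x = list(range(1, 11))
y = [1 / (0.1 + 0.05 * xi) for xi in x]
grid = [-2, -1, -0.5, 0, 0.5, 1, 2]
print(best_lambda(x, y, grid))
```

A chosen λ of -1 corresponds to the inverse transformation, the same one applied to SBP elsewhere in this deck.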

64
Q

LOWESS

A

LOWESS (locally weighted regression scatter plot smoother) is a non-parametric regression curve that can be used to check linearity.
-If the LOWESS curve falls inside the confidence bands, linearity is satisfied
As a smoothing technique, non-parametric regression curves:
-Fit the data with a smooth curve
-Do not assume the shape of the curve

65
Q

Write the Linear, First order regression model

A

Yi=B0+B1Xi1+...+Bp-1Xi,p-1+Ei

66
Q

What is the Matrix format?

A

Y=XB+E
E(Y)=XB
Var (Y)=σ2I

67
Q

What is the most mathematically right way to write model?

A

E(Y)=XB

68
Q

How to write coefficient estimate using Matrix

A

b=(X’X)^(-1)X’Y
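
A small sketch of the matrix formula using only plain Python lists. Instead of forming the inverse explicitly, it solves the normal equations (X’X)b=X’Y, which gives the same b. The design matrix and helper names are hypothetical; Y is generated without error from coefficients 1, 2 and -1 so the answer is known.

```python
def transpose(a):
    return [list(row) for row in zip(*a)]


def matmul(a, b):
    return [[sum(ai * bj for ai, bj in zip(row, col)) for col in zip(*b)]
            for row in a]


def solve(a, rhs):
    """Solve a z = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(a[i]) + [rhs[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * pv for v, pv in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def ols(X, y):
    """Least-squares coefficients from the normal equations."""
    Xt = transpose(X)
    XtX = matmul(Xt, X)
    Xty = [sum(xi * yi for xi, yi in zip(col, y)) for col in Xt]
    return solve(XtX, Xty)


# First column of ones is the intercept; the other two are predictors.
X = [[1, 1, 2], [1, 2, 1], [1, 3, 4], [1, 4, 3], [1, 5, 6]]
y = [1 + 2 * row[1] - 1 * row[2] for row in X]
b = ols(X, y)
print(b)  # close to [1, 2, -1]
```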

69
Q

How to write variance estimate using Matrix

A

var(b)=σ^2(X’X)^(-1), estimated by s^2{b}=MSE(X’X)^(-1)

70
Q

Short Multivariate Normal Density Function

A

Y~MVN(μ, Σ)

71
Q

Y=B0+B1X1+B2X2+E

  1. How many parameters in this equation?
  2. B0?
  3. B1
  4. B2
A
  1. 3 parameters (B0, B1, B2)
  2. Only meaningful if the ranges of X1 and X2 include 0.
    It is the mean response, E(Y), at X1=0 and X2=0
  3. B1 is change in the mean response, E(Y) per unit increase in X1 while holding X2 constant
  4. B2 is the change in the mean response, E(Y) per unit increase in X2 while holding X1 constant
72
Q

Root MSE

A

Estimate for σ