Chapter 3 - Linear Regression Flashcards
Simple linear regression (form, coefficients, SE, CI)
y_i = beta0 + beta1*x_i + epsilon_i. The estimates betahat0 and betahat1 are chosen to minimize RSS. You can calculate a standard error SE(betahat_j) for each coefficient and an approximate 95% CI: betahat_j +- 2*SE(betahat_j).
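A minimal numpy sketch of the fit, the slope's SE, and the rough 95% CI (synthetic data, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x = rng.uniform(0, 10, n)
y = 2.0 + 0.5 * x + rng.normal(0, 1, n)  # true beta0 = 2, beta1 = 0.5

# least-squares estimates (minimize RSS)
xbar, ybar = x.mean(), y.mean()
betahat1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
betahat0 = ybar - betahat1 * xbar

# residual standard error, then SE(betahat1)
resid = y - (betahat0 + betahat1 * x)
rse = np.sqrt(np.sum(resid ** 2) / (n - 2))
se_beta1 = rse / np.sqrt(np.sum((x - xbar) ** 2))

# approximate 95% CI: betahat1 +- 2 * SE(betahat1)
ci = (betahat1 - 2 * se_beta1, betahat1 + 2 * se_beta1)
```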
What is RSS? TSS?
Residual sum of squares: RSS = sum from i=1 to n of (y_i - yhat_i)^2.
Total sum of squares: the same, but with ybar in place of yhat_i.
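In code (yhat is assumed to come from some fitted model):

```python
import numpy as np

def rss(y, yhat):
    # residual sum of squares: sum of (y_i - yhat_i)^2
    return np.sum((y - yhat) ** 2)

def tss(y):
    # total sum of squares: same, with ybar in place of yhat_i
    return np.sum((y - y.mean()) ** 2)

# note: R^2 = 1 - RSS/TSS is the proportion of variance explained
```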
Hypothesis test
H0: there is no relationship between X and Y (i.e., beta1 = 0). Ha: there is some relationship (i.e., beta1 != 0).
Test statistic for hypothesis test
t = betahat_1 / SE(betahat_1). Under the null hypothesis this has a t-distribution with n-2 degrees of freedom (be careful about conclusions drawn from this).
P value
If the p-value is below the chosen threshold, the observed values are too extreme to occur often under the null hypothesis (this relies on an assumption of homoscedasticity, i.e., constant error standard deviation). It also relies very strongly on the model we are using. A low p-value indicates statistical significance.
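A sketch of the whole test for simple regression (slope_t_test is an illustrative helper name, not from the text):

```python
import numpy as np
from scipy import stats

def slope_t_test(x, y):
    """t-test of H0: beta1 = 0 in simple linear regression."""
    n = len(x)
    xbar, ybar = x.mean(), y.mean()
    betahat1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
    betahat0 = ybar - betahat1 * xbar
    resid = y - (betahat0 + betahat1 * x)
    rse = np.sqrt(np.sum(resid ** 2) / (n - 2))
    se_beta1 = rse / np.sqrt(np.sum((x - xbar) ** 2))
    t = betahat1 / se_beta1
    # two-sided p-value from a t-distribution with n-2 degrees of freedom
    p = 2 * stats.t.sf(abs(t), df=n - 2)
    return t, p
```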
Multiple Linear Regression
Y = beta0 + beta1*X1 + ... + betap*Xp + epsilon. Pad X (an n x p data matrix) with an extra column of ones on the left for the intercept.
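Sketch of the padding (shapes are illustrative):

```python
import numpy as np

X = np.random.default_rng(0).normal(size=(100, 3))  # n = 100 observations, p = 3 predictors
X_design = np.column_stack([np.ones(len(X)), X])    # prepend a column of ones for beta0
```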
Questions answered by multiple linear regression (4)
1) Is at least one of the variables X_i useful for predicting the outcome Y?
2) Which subset of the predictors is most important?
3) How good is a linear model for the data?
4) Given a set of predictors, what is the likely value of Y, and how accurate is that estimate?
How do we estimate the beta_i's?
Minimize RSS. In matrix form the least-squares solution is betahat = (X^T X)^-1 X^T y; the fitted values are yhat = X*betahat = H*y, where H = X(X^T X)^-1 X^T is the hat matrix.
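A sketch of that formula (solving the linear system rather than forming the inverse explicitly, which is numerically preferable):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # intercept column + predictors
y = X @ np.array([1.0, 0.5, -2.0, 0.0]) + rng.normal(0, 1, n)

# betahat = (X^T X)^-1 X^T y, computed via np.linalg.solve
betahat = np.linalg.solve(X.T @ X, X.T @ y)
# equivalent and more robust: betahat, *_ = np.linalg.lstsq(X, y, rcond=None)
```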
Which variables are important?
Consider the null hypothesis that the last q coefficients are 0; RSS0 is the RSS for the reduced model that excludes those variables. Under the null hypothesis the F-statistic has an F distribution. (For the overall test that every coefficient is 0, the null model contains only the intercept.) WARNING: if the number of variables is large, some individual p-values will be small just by chance even when the null hypothesis is true.
F statistic (relationship to hypothesis test)
F = [(RSS0 - RSS)/q] / [RSS/(n-p-1)]; for the overall test this becomes [(TSS - RSS)/p] / [RSS/(n-p-1)]. The t-statistic associated with the i-th predictor is the square root of the F-statistic for the null hypothesis that sets only beta_i = 0. A large F-statistic indicates that at least one of the variables is related to the response (if H0 is true, F ~ 1; if Ha is true, F >> 1).
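A partial-F-test sketch (f_test is an illustrative helper; both design matrices are assumed to include the intercept column):

```python
import numpy as np
from scipy import stats

def f_test(X_full, X_reduced, y):
    """Partial F-test: H0 says the q coefficients dropped from X_full are all 0."""
    def rss(X):
        beta = np.linalg.solve(X.T @ X, X.T @ y)
        r = y - X @ beta
        return r @ r

    n = len(y)
    p = X_full.shape[1] - 1                    # predictors in the full model
    q = X_full.shape[1] - X_reduced.shape[1]   # coefficients set to 0 under H0
    rss_full, rss0 = rss(X_full), rss(X_reduced)
    F = ((rss0 - rss_full) / q) / (rss_full / (n - p - 1))
    p_value = stats.f.sf(F, q, n - p - 1)      # upper tail of the F distribution
    return F, p_value
```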
How many variables are important?
There are many possible subsets of variables (2^p), so instead select a range of models: 1) forward selection: starting from the null model, add variables one at a time, at each step adding the variable that minimizes RSS; 2) backward selection: start from the full model and repeatedly remove the variable with the largest p-value; 3) mixed selection: start from the null model, add variables one at a time (minimizing RSS at each step), and throw out any variable whose p-value rises above a threshold. See the sketch below.
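A greedy forward-selection sketch (forward_selection is an illustrative name; the stopping rule is left to the user):

```python
import numpy as np

def forward_selection(X, y, max_vars):
    """At each step, add the predictor that minimizes RSS."""
    n, p = X.shape
    selected, remaining = [], list(range(p))
    for _ in range(min(max_vars, p)):
        def rss_with(j):
            Xj = np.column_stack([np.ones(n), X[:, selected + [j]]])
            beta, *_ = np.linalg.lstsq(Xj, y, rcond=None)
            r = y - Xj @ beta
            return r @ r
        best = min(remaining, key=rss_with)
        selected.append(best)
        remaining.remove(best)
    return selected  # each prefix of this list is a candidate model
```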
Tuning
Choosing one model from the range produced by a model selection method.
Dealing with categorical or qualitative predictors
For each qualitative predictor: 1) choose a baseline category (e.g., African American); 2) for every other category, define a new 0/1 predictor (X_Asian is 1 if the person is Asian and 0 otherwise); 3) beta_Asian is then the relative effect of being Asian compared to the baseline category.
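A dummy-coding sketch with pandas (column values are the book's example categories):

```python
import pandas as pd

df = pd.DataFrame({"ethnicity": ["Asian", "Caucasian", "African American", "Asian"]})

# drop_first=True drops one level to act as the baseline; here that is
# "African American" (first alphabetically). Each remaining level gets
# its own 0/1 indicator column.
dummies = pd.get_dummies(df["ethnicity"], drop_first=True)
```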
Does order matter? Choice of baseline?
When tuning, yes, order matters.
Model fit and predictions with qualitative predictors are independent of the choice of baseline. However, hypothesis tests on the individual dummy coefficients do depend on that choice (each measures the effect of one category relative to the baseline). Solution: to check whether ethnicity matters at all, use an F-test for the hypothesis that all of its dummy coefficients are 0, which is independent of the coding.
How good are the predictions?
Confidence intervals reflect the uncertainty in betahat. Prediction intervals reflect the uncertainty in betahat plus the irreducible error epsilon, so they are always at least as wide.
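A numpy sketch of both intervals at a new point x0 (intervals is an illustrative helper; X is assumed to include the intercept column, and x0 a leading 1):

```python
import numpy as np
from scipy import stats

def intervals(X, y, x0, alpha=0.05):
    """Confidence and prediction intervals for a new design row x0."""
    n, k = X.shape                       # k = p + 1 columns, including the intercept
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - k)     # estimate of Var(epsilon)
    yhat0 = x0 @ beta
    leverage = x0 @ XtX_inv @ x0
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - k)
    se_conf = np.sqrt(sigma2 * leverage)         # uncertainty in betahat only
    se_pred = np.sqrt(sigma2 * (1 + leverage))   # adds the irreducible error
    conf = (yhat0 - t_crit * se_conf, yhat0 + t_crit * se_conf)
    pred = (yhat0 - t_crit * se_pred, yhat0 + t_crit * se_pred)
    return conf, pred
```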