Chapter 3 - Linear Regression Flashcards
Simple linear regression (form, coefficients, SE, CI)
y_i = beta0 + beta1*x_i + epsilon_i. The estimates betahat0 and betahat1 are chosen to minimize RSS. You can calculate a standard error SE(betahat_j) for each coefficient and an approximate 95% CI: betahat_j +- 2*SE(betahat_j).
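A minimal numpy sketch of the fit, the slope's SE, and the rough 95% CI (synthetic data, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x = rng.uniform(0, 10, n)
y = 2.0 + 0.5 * x + rng.normal(0, 1, n)  # true beta0 = 2, beta1 = 0.5

# least-squares estimates (minimize RSS)
xbar, ybar = x.mean(), y.mean()
betahat1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
betahat0 = ybar - betahat1 * xbar

# residual standard error, then SE(betahat1)
resid = y - (betahat0 + betahat1 * x)
rse = np.sqrt(np.sum(resid ** 2) / (n - 2))
se_beta1 = rse / np.sqrt(np.sum((x - xbar) ** 2))

# approximate 95% CI: betahat1 +- 2 * SE(betahat1)
ci = (betahat1 - 2 * se_beta1, betahat1 + 2 * se_beta1)
```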
What is RSS? TSS?
Residual sum of squares: RSS = sum from i=1 to n of (y_i - yhat_i)^2.
Total sum of squares: the same, but with ybar in place of yhat_i.
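In code (yhat is assumed to come from some fitted model):

```python
import numpy as np

def rss(y, yhat):
    # residual sum of squares: sum of (y_i - yhat_i)^2
    return np.sum((y - yhat) ** 2)

def tss(y):
    # total sum of squares: same, with ybar in place of yhat_i
    return np.sum((y - y.mean()) ** 2)

# note: R^2 = 1 - RSS/TSS is the proportion of variance explained
```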
Hypothesis test
H0: there is no relationship between X and Y (i.e., beta1 = 0). Ha: there is some relationship (i.e., beta1 != 0).
Test statistic for hypothesis test
t = betahat_1 / SE(betahat_1). Under the null hypothesis this has a t-distribution with n-2 degrees of freedom (be careful about conclusions drawn from this).
P value
If the p-value is below the chosen threshold, the observed values are too extreme to occur often under the null hypothesis (this relies on an assumption of homoscedasticity, i.e., constant error standard deviation). It also relies very strongly on the model we are using. A low p-value indicates statistical significance.
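A sketch of the whole test for simple regression (slope_t_test is an illustrative helper name, not from the text):

```python
import numpy as np
from scipy import stats

def slope_t_test(x, y):
    """t-test of H0: beta1 = 0 in simple linear regression."""
    n = len(x)
    xbar, ybar = x.mean(), y.mean()
    betahat1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
    betahat0 = ybar - betahat1 * xbar
    resid = y - (betahat0 + betahat1 * x)
    rse = np.sqrt(np.sum(resid ** 2) / (n - 2))
    se_beta1 = rse / np.sqrt(np.sum((x - xbar) ** 2))
    t = betahat1 / se_beta1
    # two-sided p-value from a t-distribution with n-2 degrees of freedom
    p = 2 * stats.t.sf(abs(t), df=n - 2)
    return t, p
```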
Multiple Linear Regression
Y = beta0 + beta1*X1 + ... + betap*Xp + epsilon. Pad X (an n x p data matrix) with an extra column of ones on the left for the intercept.
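Sketch of the padding (shapes are illustrative):

```python
import numpy as np

X = np.random.default_rng(0).normal(size=(100, 3))  # n = 100 observations, p = 3 predictors
X_design = np.column_stack([np.ones(len(X)), X])    # prepend a column of ones for beta0
```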
Questions answered by multiple linear regression (4)
1) Is at least one of the variables X_i useful for predicting the outcome Y?
2) Which subset of the predictors is most important?
3) How good is a linear model for the data?
4) Given a set of predictors, what is the likely value of Y, and how accurate is that estimate?
How do we estimate the beta_i's?
Minimize RSS. In matrix form the least-squares solution is betahat = (X^T X)^-1 X^T y; the fitted values are yhat = X*betahat = H*y, where H = X(X^T X)^-1 X^T is the hat matrix.
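A sketch of that formula (solving the linear system rather than forming the inverse explicitly, which is numerically preferable):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # intercept column + predictors
y = X @ np.array([1.0, 0.5, -2.0, 0.0]) + rng.normal(0, 1, n)

# betahat = (X^T X)^-1 X^T y, computed via np.linalg.solve
betahat = np.linalg.solve(X.T @ X, X.T @ y)
# equivalent and more robust: betahat, *_ = np.linalg.lstsq(X, y, rcond=None)
```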
Which variables are important?
Consider the null hypothesis that the last q coefficients are 0; RSS0 is the RSS for the reduced model that excludes those variables. Under the null hypothesis the F-statistic has an F distribution. (For the overall test that every coefficient is 0, the null model contains only the intercept.) WARNING: if the number of variables is large, some individual p-values will be small just by chance even when the null hypothesis is true.
F statistic (relationship to hypothesis test)
F = [(RSS0 - RSS)/q] / [RSS/(n-p-1)]; for the overall test this becomes [(TSS - RSS)/p] / [RSS/(n-p-1)]. The t-statistic associated with the i-th predictor is the square root of the F-statistic for the null hypothesis that sets only beta_i = 0. A large F-statistic indicates that at least one of the variables is related to the response (if H0 is true, F ~ 1; if Ha is true, F >> 1).
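A partial-F-test sketch (f_test is an illustrative helper; both design matrices are assumed to include the intercept column):

```python
import numpy as np
from scipy import stats

def f_test(X_full, X_reduced, y):
    """Partial F-test: H0 says the q coefficients dropped from X_full are all 0."""
    def rss(X):
        beta = np.linalg.solve(X.T @ X, X.T @ y)
        r = y - X @ beta
        return r @ r

    n = len(y)
    p = X_full.shape[1] - 1                    # predictors in the full model
    q = X_full.shape[1] - X_reduced.shape[1]   # coefficients set to 0 under H0
    rss_full, rss0 = rss(X_full), rss(X_reduced)
    F = ((rss0 - rss_full) / q) / (rss_full / (n - p - 1))
    p_value = stats.f.sf(F, q, n - p - 1)      # upper tail of the F distribution
    return F, p_value
```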
How many variables are important?
There are many possible subsets of variables (2^p), so instead select a range of models: 1) forward selection: starting from the null model, add variables one at a time, at each step adding the variable that minimizes RSS; 2) backward selection: start from the full model and repeatedly remove the variable with the largest p-value; 3) mixed selection: start from the null model, add variables one at a time (minimizing RSS at each step), and throw out any variable whose p-value rises above a threshold. See the sketch below.
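A greedy forward-selection sketch (forward_selection is an illustrative name; the stopping rule is left to the user):

```python
import numpy as np

def forward_selection(X, y, max_vars):
    """At each step, add the predictor that minimizes RSS."""
    n, p = X.shape
    selected, remaining = [], list(range(p))
    for _ in range(min(max_vars, p)):
        def rss_with(j):
            Xj = np.column_stack([np.ones(n), X[:, selected + [j]]])
            beta, *_ = np.linalg.lstsq(Xj, y, rcond=None)
            r = y - Xj @ beta
            return r @ r
        best = min(remaining, key=rss_with)
        selected.append(best)
        remaining.remove(best)
    return selected  # each prefix of this list is a candidate model
```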
Tuning
Choosing one model from the range produced by a model selection method.
Dealing with categorical or qualitative predictors
For each qualitative predictor: 1) choose a baseline category (e.g., African American); 2) for every other category, define a new 0/1 predictor (X_Asian is 1 if the person is Asian and 0 otherwise); 3) beta_Asian is then the relative effect of being Asian compared to the baseline category.
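A dummy-coding sketch with pandas (column values are the book's example categories):

```python
import pandas as pd

df = pd.DataFrame({"ethnicity": ["Asian", "Caucasian", "African American", "Asian"]})

# drop_first=True drops one level to act as the baseline; here that is
# "African American" (first alphabetically). Each remaining level gets
# its own 0/1 indicator column.
dummies = pd.get_dummies(df["ethnicity"], drop_first=True)
```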
Does order matter? Choice of baseline?
When tuning, yes, order matters.
Model fit and predictions with qualitative predictors are independent of the choice of baseline. However, hypothesis tests on the individual dummy coefficients do depend on that choice (each measures the effect of one category relative to the baseline). Solution: to check whether ethnicity matters at all, use an F-test for the hypothesis that all of its dummy coefficients are 0, which is independent of the coding.
How good are the predictions?
Confidence intervals reflect the uncertainty in betahat. Prediction intervals reflect the uncertainty in betahat plus the irreducible error epsilon, so they are always at least as wide.
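A numpy sketch of both intervals at a new point x0 (intervals is an illustrative helper; X is assumed to include the intercept column, and x0 a leading 1):

```python
import numpy as np
from scipy import stats

def intervals(X, y, x0, alpha=0.05):
    """Confidence and prediction intervals for a new design row x0."""
    n, k = X.shape                       # k = p + 1 columns, including the intercept
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - k)     # estimate of Var(epsilon)
    yhat0 = x0 @ beta
    leverage = x0 @ XtX_inv @ x0
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - k)
    se_conf = np.sqrt(sigma2 * leverage)         # uncertainty in betahat only
    se_pred = np.sqrt(sigma2 * (1 + leverage))   # adds the irreducible error
    conf = (yhat0 - t_crit * se_conf, yhat0 + t_crit * se_conf)
    pred = (yhat0 - t_crit * se_pred, yhat0 + t_crit * se_pred)
    return conf, pred
```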