SRM Chapter 2 Flashcards
SLR
- Simple Linear Regression
- Relationship between two numeric variables
- Parametric
MLR
- Multiple Linear Regression
- Multiple predictors (x’s) used to predict the dependent variable (y).
- Parametric
Residuals
- e = y - y-hat
- For each i
- Want these to be small
- Ordinary least squares fits the model by minimizing the sum of the squared residuals.
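The closed-form OLS estimates can be checked by hand; a minimal sketch in Python, using a tiny made-up dataset (the x and y values are illustrative assumptions, not from the text):

```python
# Closed-form OLS for SLR: b1 = Sxy / Sxx, b0 = y_bar - b1 * x_bar.
# The data below are made up purely for illustration.
x = [1.0, 2.0, 3.0]
y = [1.0, 3.0, 4.0]
n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
b1 = s_xy / s_xx          # slope estimate
b0 = y_bar - b1 * x_bar   # intercept estimate

# Residuals e_i = y_i - y_hat_i; with an intercept they sum to 0,
# and OLS makes the sum of their squares (SSE) as small as possible.
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
sse = sum(e * e for e in residuals)
```

Any other line through these points gives a larger SSE than this fit.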
Partitioning of Variability
- SST = SSR + SSE (total variability = explained + unexplained)
Parameter Estimates
- b0 and b1: estimates of the model parameters B0 and B1
R-squared
- Coefficient of determination
- Portion of variability in the response explained by the predictors
- R-squared = SSR/SST
- Between 0 and 1 (can be read as a percentage).
- Want this to be high
Adjusted R-Squared
- Adjustment for MLR that accounts for the number of predictors
- Does not have to range from 0 to 1.
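Both quantities follow directly from the sums of squares; a minimal sketch, using a tiny made-up dataset (the x and y values are illustrative assumptions):

```python
# R-squared and Adjusted R-squared from the sums of squares.
# Tiny made-up dataset (illustration only); fit SLR by OLS first.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 3.0, 5.0, 4.0, 6.0]
n, p = len(x), 1
x_bar, y_bar = sum(x) / n, sum(y) / n
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / \
    sum((xi - x_bar) ** 2 for xi in x)
b0 = y_bar - b1 * x_bar

sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
sst = sum((yi - y_bar) ** 2 for yi in y)

r2 = 1 - sse / sst                                  # = SSR / SST
adj_r2 = 1 - (sse / (n - p - 1)) / (sst / (n - 1))  # penalizes extra predictors
```

Here Adjusted R-squared comes out below R-squared, as expected whenever p > 0 and the fit is imperfect.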
B0
- Intercept parameter
- Free parameter
- The expected value of y when x is 0.
B1
- Slope parameter
- Free parameter
- For every unit increase in x, y increases by B1 on average.
SLR Model Assumptions (6)
- Yi = B0 + B1xi + ei (a linear function plus error)
- The xi’s are non-random.
- The expected value of ei is 0.
-> So the expected value of Yi is B0 + B1xi.
- The variance of ei is sigma-squared (homoscedasticity: constant across all observations).
-> Because E[ei] = 0, the variance of Yi is also sigma-squared.
- The ei’s are independent across observations.
- The ei’s are normally distributed.
Homoscedasticity
- Variance (sigma-squared) is constant across all observations
b0
- Estimate of B0 to get y-hat
b1
- Estimate of B1 to get y-hat
Method to estimate b0 and b1
- Ordinary least squares/method of least squares
Ordinary Least Squares
- Determines estimates b0, b1
- Optimization equation
- Estimators are unbiased (bias = 0).
MSE
- Mean squared error
- Estimate of sigma-squared
- Denominator is n-2 for SLR (n-p-1 in general)
- Unbiased, so bias is 0.
- Best fit is when MSE is minimized.
RSE
- Residual standard error
- Aka residual standard deviation
- sqrt(MSE)
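The two estimates above can be sketched together; a minimal example, using a tiny made-up dataset (the values are illustrative assumptions):

```python
import math

# MSE = SSE / (n - 2) for SLR; RSE = sqrt(MSE).
# Tiny made-up dataset (illustration only).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 3.0, 5.0, 4.0, 6.0]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / \
    sum((xi - x_bar) ** 2 for xi in x)
b0 = y_bar - b1 * x_bar

sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
mse = sse / (n - 2)       # unbiased estimate of sigma-squared
rse = math.sqrt(mse)      # residual standard error
```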
Design Matrix
- X
- Matrix of predictor values, with a leading column of 1s for the intercept
Hat Matrix
- H
- Aka projection matrix
- H times vector of actual responses = fitted values of response
- In other words, y-hat = H*y
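The identity y-hat = H*y can be verified directly; a minimal sketch with a tiny made-up SLR dataset (values are illustrative assumptions), building the hat matrix H = X(X'X)^(-1)X' by hand:

```python
# Verify y_hat = H y with H = X (X'X)^{-1} X'.
# Tiny made-up SLR dataset (illustration only).
x = [1.0, 2.0, 3.0]
y = [1.0, 3.0, 4.0]
n = len(x)

# Design matrix: first column of 1s (intercept), second column the x's
X = [[1.0, xi] for xi in x]

# X'X for SLR is 2x2, so invert it directly
sx, sxx = sum(x), sum(xi * xi for xi in x)
det = n * sxx - sx * sx
XtX_inv = [[sxx / det, -sx / det], [-sx / det, n / det]]

# Hat matrix H = X (X'X)^{-1} X'
H = [[sum(X[i][a] * XtX_inv[a][b] * X[j][b]
          for a in range(2) for b in range(2))
      for j in range(n)] for i in range(n)]

# Fitted values: y_hat = H y
y_hat = [sum(H[i][j] * y[j] for j in range(n)) for i in range(n)]

# The diagonal entries of H are the leverages; they sum to p + 1 = 2
trace_H = sum(H[i][i] for i in range(n))
```

The trace of H equaling p + 1 is the same identity quoted later for leverages.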
b Matrix
- 2x1 column vector of b0 and b1 (in general, (p+1)x1)
y Matrix
- n x 1 column vector of the actual observed values of y
SSR
- Regression sum of squares
- Amount of variability in y explained by the predictors
SSE
- Error sum of squares
- Aka sum of squared residuals
- Amount of variability in y that cannot be explained by the predictors
SST
- Total sum of squares
- Total variability (both explained and unexplained)
- SST = SSR + SSE
Positive Residual
- Actual observation > (larger than) predicted observation
Negative Residual
- Actual observation < (smaller than) predicted observation
Null model
- Y = B0 + e
- No predictors (x’s)
- No relationship between y and x’s
Do you want R-squared and Adjusted R-squared to be high or low?
- High
- Means more of the variance in y can be explained by the predictor(s).
- Want this to be as high as possible so that the unexplained variance is minimized.
Is R-squared or Adjusted R-squared better for comparing MLR models? Why?
- Adjusted R-squared
- Because R-squared increases as predictors are added, a larger R-squared doesn’t necessarily mean a better model.
- But Adjusted R-squared accounts for the number of predictors so it is a better method of comparison between models.
Two-tailed t Test (Hypothesis Test): What are we testing, and why?
- Test to see whether the slope parameter is 0 (B1 = 0).
- H0: B1 = 0
- H1: B1 ≠ 0
- If true, then there is no relationship between the x’s and y.
- So, we want to reject H0 to say that it’s plausible that there is a linear relationship between x’s and y.
Test Decision (Two-Tailed t Test)
- For significance level a, reject H0 if:
- |t-stat| >= t(a/2, n-2)
- p-value <= a
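The decision rule can be worked through numerically; a minimal sketch with a tiny made-up dataset (the data and the critical value t(0.025, 3) = 3.182, taken from a standard t table, are assumptions for illustration):

```python
import math

# Two-tailed t test for H0: B1 = 0 in SLR.
# Tiny made-up dataset; 3.182 is t_{0.025, 3} from a standard
# t table (a = 0.05, df = n - 2 = 3).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 3.0, 5.0, 4.0, 6.0]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
b0 = y_bar - b1 * x_bar

sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
mse = sse / (n - 2)
se_b1 = math.sqrt(mse / s_xx)      # standard error of the slope
t_stat = b1 / se_b1                # about 3.58 here

t_crit = 3.182                     # t_{a/2, n-2} for a = 0.05
reject_h0 = abs(t_stat) >= t_crit  # True: slope significant at the 5% level
```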
One-Tailed t Test (Hypothesis Test): What are we testing and why?
- Same as two-tailed, but sometimes it’s more appropriate to reject in only one direction
- E.g., a right-tailed test looks for evidence of a positive slope; a left-tailed test for a negative slope.
When do we use a right-tailed t test?
- When we want evidence that the slope is positive (H1: B1 > 0)
When do we use a left-tailed t test?
- When we want evidence that the slope is negative (H1: B1 < 0)
Confidence vs Prediction Interval
- Confidence: range for the mean response at a given predictor value
- Prediction: range for the response of a new observation
- Prediction > Confidence (prediction is always at least as wide as the confidence interval).
Confidence Interval
- Range that estimates the MEAN response
- Narrowest when the predictor value equals the sample mean x-bar
Prediction Interval
- Range that estimates a NEW observation’s response
- Narrowest when the chosen predictor value equals the sample mean of the predictor
Why is the prediction interval at least as wide as the confidence interval?
- Prediction accounts for the variance of the new error term e in addition to the variance of y-hat
- Have to cast a wider net to predict a single new response as opposed to the mean response at that predictor value
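The width difference comes straight from the standard-error formulas; a minimal sketch with a tiny made-up dataset (the data and the choice x0 = 3 are illustrative assumptions):

```python
import math

# Standard errors behind CI vs PI width at predictor value x0:
#   se_mean^2 = MSE * (1/n + (x0 - x_bar)^2 / Sxx)      (mean response)
#   se_pred^2 = MSE * (1 + 1/n + (x0 - x_bar)^2 / Sxx)  (new response)
# The extra "1" is the new error term's variance, so the PI is wider.
# Tiny made-up dataset (illustration only).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 3.0, 5.0, 4.0, 6.0]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
b0 = y_bar - b1 * x_bar
mse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2)

x0 = 3.0  # at the sample mean, both intervals are at their narrowest
se_mean = math.sqrt(mse * (1 / n + (x0 - x_bar) ** 2 / s_xx))
se_pred = math.sqrt(mse * (1 + 1 / n + (x0 - x_bar) ** 2 / s_xx))
```

Note the identity se_pred^2 = MSE + se_mean^2: the prediction variance is the mean-response variance plus the new error's variance.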
Regression Coefficients
- B0, B1,…,Bp (Bj’s).
- B0 is still the intercept
- B1,…,Bp are regression coefficients instead of slope because that no longer makes sense with multiple predictors (x’s).
Added assumption for MLR
- A predictor xj must not be a linear combination of the other predictors
- Because if an xj is a linear combination of other predictors it doesn’t add any additional information about the relationship between x’s and y.
Nested Models
- Models where one model’s predictors are a subset of the other’s
- The smaller model can be obtained from the larger one by dropping predictors
Nested MLRs: p
- p is a measure of flexibility
MLR: relationship between p and SSE
- p and SSE are inversely related
- As flexibility increases, the amount of unexplained variability decreases
MLR: relationship between SSE and R-squared
- As predictors are added:
- Flexibility (p) increases
- SSE (unexplained variability) decreases as more of the variability becomes explained
- R-squared (ratio of explained variability to total variability) increases as more of the variability becomes explained
Formulas for R-squared
= 1 - SSE/SST
(1 - ratio of unexplained variability to total variability)
= SSR/SST
(ratio of explained variability to total variability)
What is Adjusted R-squared relative to R-squared? (Less/greater than)
- Adjusted R-squared should (almost) always be LESS than R-squared
- Because you can think of Adjusted R-squared as a shrunken version of R-squared that removes inflation from the number of added predictors
- Two cases where Adjusted R-squared EQUALS R-squared:
- p = 0 (there are no predictors)
- R-squared = 1 (all of the variance is explained by the predictors, think 100%).
Relationship between correlation coefficient and R-squared for an SLR
|Correlation coefficient| = sqrt(R-squared); the sign of the correlation matches the sign of the slope b1.
When should a predictor be dropped from an MLR?
If p-value > significance level, that variable is insignificant and should be dropped.
How should predictors be dropped from an MLR?
Drop variables for which the p-value > the acceptable significance level. Drop one at a time (because p-values may change after a variable is dropped), starting with the highest p-value exceeding the significance level.
How do you find the degrees of freedom for an MLR?
number of observations - (number of predictors + 1)
(Add one to the number of predictors for the intercept term B0.)
For an MLR how do you decide whether a coefficient is statistically different from 0?
- Find the degrees of freedom as: # observations - (# predictors + 1)
- Should be given significance level (a) - if the test is two-tailed, divide by 2.
- Find the value on the t-table that corresponds with the df and the significance level.
- Anything that has a t-statistic (absolute value) less than the value on the t-table is not statistically different from 0.
What does the F-test examine?
The significance of all predictors collectively.
The hypothesis being tested (H0) is all of the coefficients = 0. If the p-value is greater than the significance level, then all of the coefficients (Bi’s) are not statistically different from 0 and their respective xi’s should be removed from the model.
R output for p-value of a variable
Pr(>|t|)
What is the hypothesis tested by the F-test?
H0: B1 = … = Bp = 0
If the p-value of the F-test is greater than the significance level, then we fail to reject H0. This means that the Bj’s (coefficients) are not jointly statistically different from 0, so the predictors collectively add no significant explanatory power.
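The global F statistic is computed from the sums of squares; a minimal sketch with a tiny made-up dataset (illustrative assumptions only):

```python
# Global F statistic: F = (SSR / p) / (SSE / (n - p - 1)).
# For SLR (p = 1) this equals the square of the slope's t statistic.
# Tiny made-up dataset (illustration only).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 3.0, 5.0, 4.0, 6.0]
n, p = len(x), 1
x_bar, y_bar = sum(x) / n, sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
b0 = y_bar - b1 * x_bar

sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
sst = sum((yi - y_bar) ** 2 for yi in y)
ssr = sst - sse

f_stat = (ssr / p) / (sse / (n - p - 1))  # compare to F_{p, n-p-1}
```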
MLR violations/issues (9)
- Misspecified model equation
- Residuals with non-zero averages
- Heteroscedasticity
- Dependent errors
- Non-normal errors
- Multicollinearity
- Outliers
- High leverage points
- High dimensions
Explain the issue/violation of 1. misspecified model equation
Assuming f looks like
Y = B0 + B1x1 + B2x2 + … + Bpxp + e
e.g. if you attempt to fit a linear relationship to something that has a higher-order polynomial relationship
More generally, the issue is knowing when linear regression is appropriate at all.
Explain the issue/violation of 2. residuals with non-zero averages
Residuals are how we quantify/approximate the irreducible error.
Since the irreducible error is assumed to have a mean of 0, the residuals should have an average of 0 as well.
If the average of the residuals is far from 0 there is something wrong with the model (this is not a violation but a symptom that points out that there is a violation).
How do you check violation/issue 2. residuals with non-zero averages?
Group the residuals by similar fitted values (y-hat) and check that each group’s average is close to 0.
(Note that the overall average of all residuals is always 0 by construction of OLS, so it is uninformative on its own.)
Explain the issue/violation of 3. heteroscedasticity
Recall homoscedasticity = the variance of e is constant across all observations.
Heteroscedasticity is when the variance of e is not constant across observations, i.e. there is more than one variance parameter (sigma-squared).
Problems:
- Unreliable MSE
- Coefficient estimators (B-hats) don’t have the smallest variance (but they are still unbiased)
Explain the issue/violation of 4. dependent errors
When you wrongly assume e’s are independent across observations:
- Get underestimated se’s
- CI and PI will be narrower
- p-values will be smaller
- May pick wrong/non-optimal regression coefficient estimates (B-hats)
Explain the issue/violation of 5. non-normal errors
If the error terms (e’s) don’t follow a normal distribution, we can’t perform hypothesis tests because we can’t say that estimators follow a t- or F-distribution.
Explain the issue/violation of 6. multicollinearity
When a predictor is or is close to being a linear combination of other predictors.
We get:
- Unstable estimates of regression coefficients (bj’s): many different coefficient combinations produce nearly the same SSE
- This leads to larger se’s so it’s harder to reject H0 for t-tests
It does not affect:
- y-hat
- reliability of MSE
- F-test results
Explain the issue/violation of 7a. outliers
Outlier: observation with extreme residual (y - y-hat, actual - predicted). This inflates the SSE.
Explain the issue/violation of 7b. high leverage points
High leverage point: observation with weird predictor values (x’s) (any one predictor value might be normal but all together they are strange).
Explain the issue/violation of 8. high dimensions
High-dimensional data is when p (number of predictors) is too large. This is relative to n (number of observations).
Linear regression is meant for datasets with n much greater than p.
Issues of high-dimensionality:
- Overfitting
Curse of dimensionality
Quantity of predictors (p) dilutes the quality of data (information becomes sparse) when spread across a small number of observations.
*Note that this only happens with MLRs because SLRs only have one predictor.
High-dimensionality: what happens when n <= p+1?
When the number of observations is less than or equal to the number of parameters (p + 1):
- Overfitting
- The fitted equation will predict the responses perfectly
- No degrees of freedom w/ error
- Unreasonably low SSE
Leverage
How much an observation’s response value influences its own fitted value.
Observation = i
Predictors = x’s
Leverage = hi
Leverage formula
hi = (standard error of y-hat)^2/MSE
Frees text rule of thumb for determining if something is a high leverage point
- If hi > [3(p+1)]/n for an observation.
- Leverage is between 0 and 1 so no absolute value needed.
What issue is happening when an SLR model produces an inverted u-shape for the residual plot?
The model is poor because it is likely missing a key predictor.
- This is because we have a quadratic plot for something that should be linear, so there should probably be a square of an explanatory variable included as a predictor.
- We can’t attribute this to a homoscedasticity violation, because the clear trend in the residual plot points to a misspecified model instead.
Standard error formula
se(bj) = sqrt(Var-hat[bj])
What plot can we use to tell if the distribution of the residuals is shaped similarly to a normal distribution?
qq plot
How do we completely eliminate multicollinearity?
Use only orthogonal (think perpendicular) predictors. This way we can ensure they are not linear combinations of one another.
What are ways to mitigate multicollinearity? (2)
- Using only orthogonal predictors will completely eliminate multicollinearity.
- Dropping/combining predictors that have high variance inflation (this reduces the possibility of approximate linear relationships btw predictors).
Bounds for leverage (hi)
- Between 1/n and 1
- All hi’s sum to p+1
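The bounds and the sum identity can be checked with the SLR leverage formula h_i = 1/n + (x_i - x_bar)^2/Sxx; a minimal sketch with made-up x values:

```python
# Leverage for SLR: h_i = 1/n + (x_i - x_bar)^2 / Sxx.
# Checks 1/n <= h_i <= 1 and sum(h_i) = p + 1.
# Tiny made-up x values (illustration only).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
n, p = len(x), 1
x_bar = sum(x) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)

h = [1 / n + (xi - x_bar) ** 2 / s_xx for xi in x]

# Frees rule of thumb: flag observation i if h_i > 3(p + 1)/n
threshold = 3 * (p + 1) / n
flagged = [i for i, hi in enumerate(h) if hi > threshold]
```

Here the observation at x = x_bar attains the minimum leverage 1/n, and no observation is flagged.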
Cook’s distance
Combines effects of outliers and leverage
When do we consider an observation to be an outlier?
When the standardized residual is greater than 2 or 3 (absolute value).
When do we consider an observation to be a high leverage point?
When its leverage is greater than 3x the average leverage.
When do we consider something to be an influential point?
When its Cook’s distance is large relative to the other observations’ (a common rule of thumb flags Di > 1). Note that Cook’s distance is nonnegative with no fixed upper bound; it is leverage, not Cook’s distance, that ranges from 1/n to 1.
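A minimal sketch of the Cook’s distance computation, using the common formula D_i = (e_i^2 / ((p+1) MSE)) * (h_i / (1 - h_i)^2) and a tiny made-up dataset (illustrative assumptions only):

```python
# Cook's distance combines the residual (outlier-ness) and leverage:
#   D_i = (e_i^2 / ((p + 1) * MSE)) * (h_i / (1 - h_i)^2)
# Tiny made-up dataset (illustration only).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 3.0, 5.0, 4.0, 6.0]
n, p = len(x), 1
x_bar, y_bar = sum(x) / n, sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
b0 = y_bar - b1 * x_bar

e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
mse = sum(ei ** 2 for ei in e) / (n - p - 1)
h = [1 / n + (xi - x_bar) ** 2 / s_xx for xi in x]

cooks_d = [(ei ** 2 / ((p + 1) * mse)) * (hi / (1 - hi) ** 2)
           for ei, hi in zip(e, h)]
```

Here the fourth observation has the largest D (a moderately large residual combined with above-average leverage), even though it has neither the largest residual nor the largest leverage alone.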
How can we handle outliers? (3)
- Include it but add a comment until we can do more data analysis.
- Delete it from the dataset (if it’s incorrect data collection).
- Create a binary variable that indicates whether or not the observation is an outlier (this deals with observations where there isn’t a specific reason for them being outliers).
How can you tell if something is heteroscedastic? What does the graph of residuals vs. fitted values look like?
- Recall heteroscedasticity is when the error variance is not constant across all observations. This makes the spread of the residuals vary, since residual = actual - predicted approximates the error term.
Examples of what the graph looks like:
- Residuals have a varying spread from 0
- Spread increases with larger fitted values
How can you tell if data is non-normal? What does the graph of residuals vs. fitted values look like?
- Residuals are not evenly distributed or symmetric, just all over the place
- Might be several weirdly large/small residuals indicating a right/left skew
How can we tell if there’s multicollinearity? What do R-squared and t-stats look like?
- Large R-squared value:
Recall that R-squared tells us how much of the variance in y is explained by the model, and collinear predictors can jointly explain a lot of that variance.
- Small t-statistics:
Recall that t-stat = b-hatj / se(bj) (estimated coefficient / its standard error), and multicollinearity inflates standard errors, so the t-stats are smaller than they should be.
- Note: we need these two conditions together. The model fits well overall (high R-squared), but the small t-stats make it hard to reject H0 (that a coefficient is statistically different from 0), so we can’t say which individual predictors have a relationship with the response variable (y).
Studentized residual
- Residual/estimate of its standard deviation
- Should be realized from t-distribution (regular residual should be realized from normal distribution)
- Unitless (so comparable across diff contexts)
Variance inflation factor for a predictor uncorrelated with all other predictors
1
(Think of inflation factor as a multiplier so since there is no correlation the variance is multiplied by 1 i.e. no effect)
What does a large Breusch-Pagan test statistic indicate?
Heteroscedasticity
(it suggests the variance of errors is not constant across all observations)
Frees text rule of thumb for determining if something is an outlier
If the observation’s standardized residual is greater than 2.
Note that you should use the absolute value.
What is the variance inflation factor (VIF)?
Measure of how much the variance of a regression coefficient is inflated because of multicollinearity.
VIF = 1 means no correlation (remember to think of it as a multiplier)
VIF > 1 means there is correlation, this is a symptom of multicollinearity
VIF > 10 means severe multicollinearity (Frees)
What is a suppressor variable? How does it relate to multicollinearity?
A predictor that increases the importance of another predictor.
If there is multicollinearity you might think that information provided by a variable is ALWAYS redundant because it’s a linear combination of another variable.
This is not the case because a suppressor variable is an exception.
What is the formula for tolerance (think in relation to VIF)?
Tolerance is the reciprocal of VIF
Tolerance = 1/VIF
What is the rule of thumb for severe multicollinearity?
- If VIF is greater than 5 or 10
- Equivalently, if tolerance is less than 0.1 or 0.2
- Recall that tolerance is the reciprocal of VIF
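The VIF/tolerance computation can be sketched for two predictors; a minimal example with two made-up, nearly collinear predictors (illustrative assumptions only):

```python
# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor
# x_j on the other predictors; tolerance is its reciprocal.
# Two made-up, nearly collinear predictors (illustration only).
x1 = [1.0, 2.0, 3.0, 4.0]
x2 = [2.0, 4.0, 6.0, 9.0]   # close to 2 * x1
n = len(x1)
m1, m2 = sum(x1) / n, sum(x2) / n

# Regress x2 on x1 and get R_j^2 for x2
b1 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2)) / \
    sum((a - m1) ** 2 for a in x1)
b0 = m2 - b1 * m1
sse = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x1, x2))
sst = sum((b - m2) ** 2 for b in x2)
r2_j = 1 - sse / sst

vif = 1 / (1 - r2_j)   # about 89 here: severe multicollinearity
tolerance = 1 / vif    # well below the 0.1 / 0.2 rule of thumb
```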
When looking at a graph of x plotted against y for observations, including a line of best fit, how do you tell if something is an outlier? How do you tell if something is a high leverage point?
Outlier: if the observation is far from the line of best fit.
High leverage point: if the x-value of the observation is unlikely (different/far from the other x values). *remember: “unusual in the horizontal direction”
Is the total sum of squares affected by adding/removing variables from the model?
No. The total sum of squares is a function of the observed y values only, so it has nothing to do with which predictors are in the model. It remains unchanged.
Units for studentized and standardized residuals
Both are unitless/dimensionless.
Which is better at capturing observations with unusually large residuals? (Standardized or studentized residuals)
Studentized.
For the standardized residual, a large e also inflates the MSE in its denominator, so the two can partially cancel and mask the outlier. The studentized residual estimates the standard deviation without using observation i, avoiding this cancellation.
Leverages are a diagonal element of what?
The hat matrix: H = X(X'X)^(-1)X'
What does a good residual plot look like?
Random scatter, no discernible pattern
Parsimony
The idea that a simpler model is preferred over a more complex one that doesn’t substantially improve on it (i.e., doesn’t provide much more information).
How many model equations are there for a model with g predictors?
2^g
Data snooping
Using the same dataset for both developing (training) and evaluating (testing) a model. This can lead to overfitting.
Centered variable
Result of subtracting the sample mean from a variable
Scaled variable
Result of dividing a variable by its unbiased sample sd
Standardized variable
Result of centering then scaling a variable.
1. Start with the variable
2. Subtract the sample mean from it
3. Divide it by its unbiased sample standard deviation
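The three steps above can be sketched directly (the data values are made up for illustration):

```python
import statistics

# Center, scale, and standardize a variable (made-up values).
data = [2.0, 4.0, 6.0, 8.0]
mean = statistics.mean(data)   # sample mean
sd = statistics.stdev(data)    # unbiased (n - 1) sample sd

centered = [v - mean for v in data]             # subtract the mean
scaled = [v / sd for v in data]                 # divide by the sd
standardized = [(v - mean) / sd for v in data]  # center, then scale
```

A standardized variable always has sample mean 0 and unbiased sample standard deviation 1.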
What is ridge regression through a Bayesian lens?
- Posterior mode for B under a GAUSSIAN prior.
- Prior belief that the coefficients are randomly distributed about 0.
What is lasso regression through a Bayesian lens?
- Posterior mode for B under a DOUBLE-EXPONENTIAL prior.
- Prior belief that many of the coefficients are exactly 0.