SRM Chapter 2 Flashcards

1
Q

SLR

A
  • Simple Linear Regression
  • Relationship between two numeric variables
  • Parametric
2
Q

MLR

A
  • Multiple Linear Regression
  • Multiple predictors (x’s) used to predict the dependent variable (y).
  • Parametric
3
Q

Residuals

A
  • ei = yi - yi-hat
  • One residual for each observation i
  • Want these to be small
  • This is achieved by ordinary least squares, which minimizes the sum of the squared residuals.
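As a sketch of how this works, the closed-form OLS fit and its residuals can be computed directly (numpy; the data here are made-up toy values):

```python
import numpy as np

# Made-up toy data: y is roughly linear in x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Closed-form OLS estimates: b1 = Sxy / Sxx, b0 = ybar - b1 * xbar
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

# Residual for each observation i: e_i = y_i - yhat_i
residuals = y - (b0 + b1 * x)

# With an intercept in the model, OLS residuals sum to (numerically) zero
print(abs(residuals.sum()) < 1e-10)   # True
```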
4
Q

Partitioning of Variability

A
  • Total variability in y is split into explained and unexplained pieces
  • SST = SSR + SSE
5
Q

Parameter Estimates

A
  • b0 and b1: estimates of the parameters B0 and B1
  • Obtained by ordinary least squares
6
Q

R-squared

A
  • Coefficient of determination
  • Portion of variability in the response explained by the predictors
  • R-squared = SSR/SST
  • Between 0 and 1 (can be interpreted as a percentage).
  • Want this to be high
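A small numeric check of the partition SST = SSR + SSE and of R-squared = SSR/SST (numpy; made-up toy data):

```python
import numpy as np

# Made-up toy data and its OLS fit
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

sst = np.sum((y - y.mean()) ** 2)      # total variability
ssr = np.sum((y_hat - y.mean()) ** 2)  # explained variability
sse = np.sum((y - y_hat) ** 2)         # unexplained variability

print(np.isclose(sst, ssr + sse))      # True: SST = SSR + SSE
print(round(ssr / sst, 4))             # R-squared, close to 1 here
```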
7
Q

Adjusted R-Squared

A
  • Adjustment for MLR that accounts for the number of predictors
  • Does not have to range from 0 to 1.
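The usual formula is Adjusted R-squared = 1 - (1 - R-squared)(n - 1)/(n - p - 1). A tiny sketch of the penalty (the example numbers are arbitrary):

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Arbitrary example: same R^2, increasing numbers of predictors
print(adjusted_r2(0.80, n=20, p=3))    # ~0.7625, below R^2
print(adjusted_r2(0.80, n=20, p=10))   # ~0.5778, penalized harder
```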
8
Q

B0

A
  • Intercept parameter
  • Free parameter
  • Equal to the expected value of y when x is 0.
9
Q

B1

A
  • Slope parameter
  • Free parameter
  • For every unit increase in x, the expected value of y increases by B1.
10
Q

SLR Model Assumptions (6)

A
  1. Yi = B0 + B1Xi + ei
    (linear function plus error)
  2. xi’s are non-random
  3. Expected value of ei is 0.
    -> so the expected value of Yi is B0 + B1Xi (since E[ei] = 0).
  4. Variance of ei is sigma-squared.
    -> Because E[ei] = 0, the variance of Yi is also sigma-squared.
    -> Also homoscedasticity (variance constant across all observations).
  5. ei’s are independent across observations.
  6. ei’s are normally distributed.
11
Q

Homoscedasticity

A
  • Variance (sigma-squared) is constant across all observations
12
Q

b0

A
  • Estimate of B0 to get y-hat
13
Q

b1

A
  • Estimate of B1 to get y-hat
14
Q

Method to estimate b0 and b1

A
  • Ordinary least squares/method of least squares
15
Q

Ordinary Least Squares

A
  • Determines estimates b0, b1
  • Optimization equation
  • Estimators are unbiased (bias = 0).
16
Q

MSE

A
  • Mean squared error
  • Estimate of sigma-squared
  • Denominator is n-2
  • Unbiased, so bias is 0.
  • Best fit is when MSE is minimized.
17
Q

RSE

A
  • Residual standard error
  • Aka residual standard deviation
  • sqrt(MSE)
18
Q

Design Matrix

A
  • X
  • Rows are observations; columns hold the predictor values, with a leading column of 1s for the intercept
19
Q

Hat Matrix

A
  • H
  • Aka projection matrix
  • H times vector of actual responses = fitted values of response
  • In other words, y-hat = H*y
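A minimal check of this identity with numpy, using made-up toy data (H is built from the design matrix, and its diagonal gives the leverages):

```python
import numpy as np

# Made-up toy data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

X = np.column_stack([np.ones_like(x), x])   # design matrix: column of 1s + x
H = X @ np.linalg.inv(X.T @ X) @ X.T        # hat matrix H = X(X'X)^-1 X'
y_hat = H @ y                               # fitted values: y-hat = H y

b = np.linalg.inv(X.T @ X) @ X.T @ y        # b = (X'X)^-1 X' y
print(np.allclose(y_hat, X @ b))            # True: same fitted values
print(np.isclose(np.trace(H), 2.0))         # True: leverages sum to p + 1
```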
20
Q

b Matrix

A
  • 2x1 column vector of b0 and b1
  • b = (X-transpose X)^-1 X-transpose y
21
Q

y Matrix

A
  • Matrix for actual observed values of y
22
Q

SSR

A
  • Regression sum of squares
  • Amount of variability in y explained by the predictors (SSR/SST gives the proportion)
23
Q

SSE

A
  • Error sum of squares
  • Aka sum of squared residuals
  • Amount of variability in y that cannot be explained by the predictors
24
Q

SST

A
  • Total sum of squares
  • Total variability (both explained and unexplained)
  • SST = SSR + SSE
25
Q

Positive Residual

A
  • Actual observation > (larger than) predicted observation
26
Q

Negative Residual

A
  • Actual observation < (smaller than) predicted observation
27
Q

Null model

A
  • Y = B0 + e
  • No predictors (x’s)
  • No relationship between y and x’s
28
Q

Do you want R-squared and Adjusted R-squared to be high or low?

A
  • High
  • Means more of the variance in y can be explained by the predictor(s).
  • Want this to be as high as possible so that the unexplained variance is minimized.
29
Q

Is R-squared or Adjusted R-squared better for comparing MLR models? Why?

A
  • Adjusted R-squared
  • Because R-squared increases as predictors are added, a larger R-squared doesn’t necessarily mean a better model.
  • Adjusted R-squared accounts for the number of predictors, so it is a better basis for comparing models.
30
Q

Two-tailed t Test (Hypothesis Test): What are we testing, and why?

A
  • Test to see whether the slope parameter is 0 (B1 = 0).
  • H0: B1 = 0
  • H1: B1 ≠ 0
  • If true, then there is no relationship between the x’s and y.
  • So, we want to reject H0 to say that it’s plausible that there is a linear relationship between x’s and y.
31
Q

Test Decision (Two-Tailed t Test)

A
  • For significance level α, reject H0 if:
  • |t-stat| ≥ t(α/2, n-2), or equivalently
  • p-value ≤ α
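As a sketch (made-up toy data; scipy assumed available): the test statistic is b1 divided by its standard error, compared against a t-distribution with n - 2 degrees of freedom:

```python
import numpy as np
from scipy import stats

# Made-up toy data: test H0: B1 = 0 vs H1: B1 != 0
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(x)

Sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
b0 = y.mean() - b1 * x.mean()

resid = y - (b0 + b1 * x)
mse = np.sum(resid ** 2) / (n - 2)          # estimate of sigma^2, denominator n-2
se_b1 = np.sqrt(mse / Sxx)                  # standard error of the slope estimate

t_stat = b1 / se_b1
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)   # two-tailed p-value

alpha = 0.05
reject = p_value <= alpha                   # equivalently |t| >= t(alpha/2, n-2)
print(reject)                               # True
```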
32
Q

One-Tailed t Test (Hypothesis Test): What are we testing and why?

A
  • Same as two-tailed, but sometimes it’s more appropriate to have only one rejection region
  • E.g., looking for evidence that the slope between x and y is positive (or negative).
33
Q

When do we use a right-tailed t test?

A
  • When the alternative is one-sided in the positive direction (H1: B1 > 0)
34
Q

When do we use a left-tailed t test?

A
  • When the alternative is one-sided in the negative direction (H1: B1 < 0)
35
Q

Confidence vs Prediction Interval

A
  • Confidence: range for the mean response (across all observations)
  • Prediction: range for the response of a new observation
  • Prediction > Confidence (prediction is always at least as wide as the confidence interval).
36
Q

Confidence Interval

A
  • Range that estimates the MEAN response
  • Narrowest at the sample mean of the predictor (x = x-bar)
37
Q

Prediction Interval

A
  • Range that estimates a NEW observation’s response
  • Narrowest when the chosen predictor value equals the sample mean of the predictor
38
Q

Why is the prediction interval at least as wide as the confidence interval?

A
  • The prediction interval accounts for the variance of the error term in addition to the variance of y-hat
  • Have to cast a wider net to predict a new single response as opposed to the mean response over all observations
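A sketch of the two standard errors (made-up toy data; the prediction one adds the error variance via the leading 1 inside the square root):

```python
import numpy as np
from scipy import stats

# Made-up toy data: compare interval widths at a chosen x0
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(x)

Sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
b0 = y.mean() - b1 * x.mean()
mse = np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2)

x0 = 4.0                                  # arbitrary new predictor value
t_crit = stats.t.ppf(0.975, df=n - 2)     # 95% two-sided critical value

# CI for the mean response: no extra "1 +" term
se_mean = np.sqrt(mse * (1 / n + (x0 - x.mean()) ** 2 / Sxx))
# PI for a new observation: adds the error variance (the leading 1)
se_pred = np.sqrt(mse * (1 + 1 / n + (x0 - x.mean()) ** 2 / Sxx))

print(se_pred > se_mean)   # True: the prediction interval is always wider
```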
39
Q

Regression Coefficients

A
  • B0, B1,…,Bp (Bj’s).
  • B0 is still the intercept
  • B1,…,Bp are regression coefficients instead of slope because that no longer makes sense with multiple predictors (x’s).
40
Q

Added assumption for MLR

A
  • No predictor xj may be a linear combination of the other predictors
  • Because a predictor that is a linear combination of the others adds no new information about the relationship between the x’s and y.
41
Q

Nested Models

A
  • Models where one model’s predictors are a subset of the other’s
  • The smaller model can be obtained from the larger one by dropping predictors
42
Q

Nested MLRs: p

A
  • p is a measure of flexibility
43
Q

MLR: relationship between p and SSE

A
  • p and SSE are inversely related
  • As flexibility increases, the amount of unexplained variability decreases
44
Q

MLR: relationship between SSE and R-squared

A
  • As predictors are added:
  • Flexibility (p) increases
  • SSE (unexplained variability) decreases as more of the variability becomes explained
  • R-squared (ratio of explained variability to total variability) increases as more of the variability becomes explained
45
Q

Formulas for R-squared

A

= 1 - SSE/SST
(1 - ratio of unexplained variability to total variability)

= SSR/SST
(ratio of explained variability to total variability)

46
Q

What is Adjusted R-squared relative to R-squared? (Less/greater than)

A
  • Adjusted R-squared is (almost) always LESS than R-squared
  • Because you can think of Adjusted R-squared as a shrunken version of R-squared that removes the inflation from adding predictors
  • Two cases where Adjusted R-squared EQUALS R-squared:
  • p = 0 (there are no predictors)
  • R-squared = 1 (all of the variance is explained by the predictors, think 100%).
47
Q

Relationship between correlation coefficient and R-squared for an SLR

A

Correlation coefficient = ±sqrt(R-squared), with the same sign as the slope b1.

48
Q

When should a predictor be dropped from an MLR?

A

If p-value > significance level, that variable is insignificant and should be dropped.

49
Q

How should predictors be dropped from an MLR?

A

Drop variables for which the p-value > the acceptable significance level. Drop one at a time (because p-values may change after a variable is dropped), starting with the highest p-value exceeding the significance level.

50
Q

How do you find the degrees of freedom for an MLR?

A

number of observations - (number of predictors + 1)

(Add one to the number of predictors because of the intercept term.)

51
Q

For an MLR how do you decide whether a coefficient is statistically different from 0?

A
  1. Find the degrees of freedom as: # observations - (# predictors + 1)
  2. Should be given significance level (a) - if the test is two-tailed, divide by 2.
  3. Find the value on the t-table that corresponds with the df and the significance level.
  4. Anything that has a t-statistic (absolute value) less than the value on the t-table is not statistically different from 0.
52
Q

What does the F-test examine?

A

The significance of all predictors collectively.

The hypothesis being tested (H0) is that all of the slope coefficients equal 0. If the p-value is greater than the significance level, we fail to reject H0: the coefficients (Bi’s) are not statistically different from 0 as a group, and their respective xi’s do not collectively improve the model.
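A sketch of the overall F-test on simulated data (numpy + scipy; the data-generating coefficients are made up):

```python
import numpy as np
from scipy import stats

# Simulated MLR with made-up true coefficients: F-test of H0: B1 = ... = Bp = 0
rng = np.random.default_rng(1)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta_true = np.array([1.0, 2.0, 0.0, -1.5])
y = X @ beta_true + rng.normal(size=n)

b, *_ = np.linalg.lstsq(X, y, rcond=None)   # OLS fit
y_hat = X @ b
sse = np.sum((y - y_hat) ** 2)
sst = np.sum((y - y.mean()) ** 2)
ssr = sst - sse

f_stat = (ssr / p) / (sse / (n - p - 1))    # F statistic, df = (p, n - p - 1)
p_value = stats.f.sf(f_stat, p, n - p - 1)
print(p_value < 0.05)   # True: the predictors are jointly significant here
```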

53
Q

R output for p-value of a variable

A

Pr(>|t|)

54
Q

What is the hypothesis tested by the F-test?

A

H0: B1 = … = Bp = 0

If the p-value of the F-test is greater than the significance level, then we fail to reject H0. This means that the Bi’s (coefficients) are not statistically different from 0 as a group, and their corresponding xi’s should be removed from the model.

55
Q

MLR violations/issues (9)

A
  1. Misspecified model equation
  2. Residuals with non-zero averages
  3. Heteroscedasticity
  4. Dependent errors
  5. Non-normal errors
  6. Multicollinearity
  7. Outliers
  8. High leverage points
  9. High dimensions
56
Q

Explain the issue/violation of 1. misspecified model equation

A

Assuming f looks like
Y = B0 + B1x1 + B2x2 + … + Bpxp + e

e.g. if you attempt to fit a linear relationship to something that has a higher-order polynomial relationship

More generally just knowing when linear regression is appropriate or not.

57
Q

Explain the issue/violation of 2. residuals with non-zero averages

A

Residuals are how we quantify/approximate the irreducible error.

Since the irreducible error is assumed to have a mean of 0, the residuals should have an average of 0 as well.

If the average of the residuals is far from 0 there is something wrong with the model (this is not a violation but a symptom that points out that there is a violation).

58
Q

How do you check violation/issue 2. residuals with non-zero averages?

A

For a bunch of residuals for observations with a similar y-hat, check their averages and they should each be close to 0.

(Note that averaging all of them together won’t produce 0).

59
Q

Explain the issue/violation of 3. heteroscedasticity

A

Recall homoscedasticity = the variance of e is constant across all observations.

Heteroscedasticity is when the variance of e is not constant across all observations, i.e. there is more than one variance parameter (sigma-squared).

Problems:
- Unreliable MSE
- Coefficient estimators (B-hats) don’t have the smallest variance (but they are still unbiased)

60
Q

Explain the issue/violation of 4. dependent errors

A

When you wrongly assume e’s are independent across observations:
- Get underestimated se’s
- CI and PI will be narrower
- p-values will be smaller
- May pick wrong/non-optimal regression coefficient estimates (B-hats)

61
Q

Explain the issue/violation of 5. non-normal errors

A

If the error terms (e’s) don’t follow a normal distribution, we can’t rely on the usual hypothesis tests, because we can’t say that the test statistics follow a t- or F-distribution.

62
Q

Explain the issue/violation of 6. multicollinearity

A

When a predictor is or is close to being a linear combination of other predictors.

We get:
- Unstable estimates of the regression coefficients (bj’s): many different coefficient combinations produce nearly the same SSE.
- This leads to larger se’s so it’s harder to reject H0 for t-tests

It does not affect:
- y-hat
- reliability of MSE
- F-test results

63
Q

Explain the issue/violation of 7a. outliers

A

Outlier: observation with extreme residual (y - y-hat, actual - predicted). This inflates the SSE.

64
Q

Explain the issue/violation of 7b. high leverage points

A

High leverage point: observation with weird predictor values (x’s) (any one predictor value might be normal but all together they are strange).

65
Q

Explain the issue/violation of 8. high dimensions

A

High-dimensional data is when p (number of predictors) is too large. This is relative to n (number of observations).

Linear regression is meant for datasets with n much greater than p.

Issues of high-dimensionality:
- Overfitting

66
Q

Curse of dimensionality

A

Quantity of predictors (p) dilutes the quality of data (information becomes sparse) when spread across a small number of observations.

*Note that this only happens with MLRs because SLRs only have one predictor.

67
Q

High-dimensionality: what happens when n <= p+1?

A

When the number of observations is less than or equal to the number of predictors plus one:

  • Overfitting
  • The fitted equation will predict the responses perfectly
  • No degrees of freedom for error
  • Unreasonably low SSE
68
Q

Leverage

A

How much an observation influences the prediction of the response.

Observation = i
Predictors = x’s
Leverage = hi

69
Q

Leverage formula

A

hi = (standard error of yi-hat)^2 / MSE

70
Q

Frees text rule of thumb for determining if something is a high leverage point

A
  • If hi > [3(p+1)]/n for an observation.
  • Leverage is between 0 and 1 so no absolute value needed.
70
Q

What issue is happening when an SLR model produces an inverted u-shape for the residual plot?

A

The model is poor because it is likely missing a key predictor.

  • This is because we have a quadratic plot for something that should be linear, so there should probably be a square of an explanatory variable included as a predictor.
  • It isn’t clearly a homoscedasticity violation: the clear trend in the residual plot points to a misspecified model instead.
71
Q

Standard error formula

A

se(bj) = sqrt(Var-hat[bj])

72
Q

What plot can we use to tell if the distribution of the residuals is shaped similarly to a normal distribution?

A

qq plot

73
Q

How do we completely eliminate multicollinearity?

A

Use only orthogonal (think perpendicular) predictors. This way we can ensure they are not linear combinations of one another.

74
Q

What are ways to mitigate multicollinearity? (2)

A
  • Using only orthogonal predictors will completely eliminate multicollinearity.
  • Dropping/combining predictors that have high variance inflation (this reduces the possibility of approximate linear relationships btw predictors).
75
Q

Bounds for leverage (hi)

A
  • Between 1/n and 1
  • All hi’s sum to p+1
76
Q

Cook’s distance

A

Combines effects of outliers and leverage

77
Q

When do we consider an observation to be an outlier?

A

When the standardized residual is greater than 2 or 3 (absolute value).

78
Q

When do we consider an observation to be a high leverage point?

A

When its leverage is greater than 3x the average leverage.

79
Q

When do we consider something to be an influential point?

A

When its Cook’s distance is noticeably larger than that of the other observations (the bounds 1/n to 1 apply to leverage; Cook’s distance has no fixed upper bound).

80
Q

How can we handle outliers? (3)

A
  1. Include it but add a comment until we can do more data analysis.
  2. Delete it from the dataset (if it’s incorrect data collection).
  3. Create a binary variable that indicates whether or not the observation is an outlier (this deals with observations where there isn’t a specific reason for them being outliers).
81
Q

How can you tell if something is heteroscedastic? What does the graph of residuals vs. fitted values look like?

A
  • Recall heteroscedasticity is when the error variance is not constant across all observations. This makes the residuals behave strangely, since residual = actual - predicted.

Examples of what the graph looks like:
- Residuals have a varying spread from 0
- Spread increases with larger fitted values

82
Q

How can you tell if data is non-normal? What does the graph of residuals vs. fitted values look like?

A
  • Residuals are not evenly distributed or symmetric, just all over the place
  • Might be several weirdly large/small residuals indicating a right/left skew
83
Q

How can we tell if there’s multicollinearity?
What do R-squared and t-stats look like?

A
  • Large R-squared value:
    Recall that R-squared tells us how much of the variance in y is explained by the model. A very high value can occur because a linear combination among predictors amplifies their apparent effects (and can lead to overfitting).
  • Small t-statistics:
    Recall that t-stat = b-hatj/sebj (estimated coefficient/its standard error).
    Also recall that multicollinearity inflates standard error… so the t-stat will be smaller than it should be.
  • Note: need these two conditions together. This explains that the model does well (high R-squared) but since the t-stats are small it’s harder to reject H0 (a coefficient is statistically different from 0) so we can’t really say if their respective predictors have a relationship with the response variable (y).
84
Q

Studentized residual

A
  • Residual/estimate of its standard deviation
  • Should be realized from t-distribution (regular residual should be realized from normal distribution)
  • Unitless (so comparable across diff contexts)
85
Q

Variance inflation factor for a predictor uncorrelated with all other predictors

A

1
(Think of inflation factor as a multiplier so since there is no correlation the variance is multiplied by 1 i.e. no effect)

86
Q

What does a high Breusch-Pagan test indicate?

A

Heteroscedasticity
(it suggests the variance of errors is not constant across all observations)

87
Q

Frees text rule of thumb for determining if something is an outlier

A

If the observation’s standardized residual is greater than 2.

Note that you should use the absolute value.
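A sketch of the rule (numpy; the data are made up with one planted outlier, and the standardized residual is computed as e_i / sqrt(MSE(1 - h_i)), one common convention):

```python
import numpy as np

# Made-up data: y = 2x exactly, except a planted outlier at the last point
x = np.arange(1.0, 11.0)
y = 2.0 * x
y[-1] = 30.0

n = len(x)
X = np.column_stack([np.ones(n), x])
H = X @ np.linalg.inv(X.T @ X) @ X.T    # hat matrix
e = y - H @ y                           # residuals
mse = np.sum(e ** 2) / (n - 2)
h = np.diag(H)                          # leverages

r = e / np.sqrt(mse * (1 - h))          # standardized residuals
print(np.where(np.abs(r) > 2)[0])       # [9]: only the planted outlier is flagged
```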

88
Q

What is the variance inflation factor (VIF)?

A

Measure of how much the variance of a regression coefficient is inflated because of multicollinearity.

VIF = 1 means no correlation (remember to think of it as a multiplier)

VIF > 1 means there is correlation, this is a symptom of multicollinearity

VIF > 10 means severe multicollinearity (Frees)
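A sketch of computing VIF directly from its definition (numpy; the simulated predictors are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)   # nearly collinear with x1
x3 = rng.normal(size=n)              # independent of the others

def vif(target, others):
    """VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from
    regressing predictor x_j on the other predictors."""
    X = np.column_stack([np.ones(len(target))] + others)
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    resid = target - X @ beta
    r2 = 1 - resid @ resid / np.sum((target - target.mean()) ** 2)
    return 1 / (1 - r2)

print(vif(x1, [x2, x3]) > 10)   # True: severe multicollinearity
print(vif(x3, [x1, x2]) < 2)    # True: roughly uncorrelated, VIF near 1
```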

89
Q

What is a suppressor variable? How does it relate to multicollinearity?

A

A predictor that increases the importance of another predictor.

If there is multicollinearity you might think that information provided by a variable is ALWAYS redundant because it’s a linear combination of another variable.

This is not the case because a suppressor variable is an exception.

90
Q

What is the formula for tolerance (think in relation to VIF)?

A

Tolerance is the reciprocal of VIF
Tolerance = 1/VIF

91
Q

What is the rule of thumb for severe multicollinearity?

A
  • If VIF is greater than 5 or 10
  • Equivalently, if tolerance is less than 0.1 or 0.2
  • Recall that tolerance is the reciprocal of VIF
92
Q

When looking at a graph of x plotted against y for observations, including a line of best fit, how do you tell if something is an outlier? How do you tell if something is a high leverage point?

A

Outlier: if the observation is far from the line of best fit.

High leverage point: if the x-value of the observation is unlikely (different/far from the other x values). *remember: “unusual in the horizontal direction”

93
Q

Is the total sum of squares affected by adding/removing variables from the model?

A

No. The total sum of squares is a function of the observed values -> has nothing to do with the underlying variables. So it remains unchanged.

94
Q

Units for studentized and standardized residuals

A

Both are unitless/dimensionless.

95
Q

Which is better at capturing observations with unusually large residuals? (Standardized or studentized residuals)

A

Studentized.

For the standardized residual, both e and the MSE will be really large (the outlier inflates the MSE), and since e is in the numerator and the MSE is in the denominator, they can partially cancel each other out.

96
Q

Leverages are a diagonal element of what?

A

The hat matrix: H = X(X-transpose X)^-1 X-transpose

97
Q

What does a good residual plot look like?

A

Random scatter, no discernible pattern

98
Q

Parsimony

A

The idea that a simpler model is preferred over a more complex model that doesn’t substantially improve on it (i.e. doesn’t provide much more information).

99
Q

How many model equations are there for a model with g predictors?

A

2^g

100
Q

Data snooping

A

Using the same dataset for both developing (training) and evaluating (testing) a model. This can lead to overfitting.

101
Q

Centered variable

A

Result of subtracting the sample mean from a variable

102
Q

Scaled variable

A

Result of dividing a variable by its unbiased sample sd

103
Q

Standardized variable

A

Result of centering then scaling a variable.
1. Start with the variable
2. Subtract the sample mean from it
3. Divide it by its unbiased sample standard deviation
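The three transformations in a small numpy sketch (the example values are arbitrary; ddof=1 gives the unbiased sample sd):

```python
import numpy as np

# Arbitrary example variable
v = np.array([2.0, 4.0, 6.0, 8.0])

centered = v - v.mean()                  # subtract the sample mean
scaled = v / v.std(ddof=1)               # divide by the unbiased sample sd
standardized = centered / v.std(ddof=1)  # center, then scale

print(standardized.mean())               # ~0
print(standardized.std(ddof=1))          # ~1
```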

104
Q

What is ridge regression through a Bayesian lens?

A
  • Posterior mode for B under a GAUSSIAN prior.
  • Prior belief that the coefficients are randomly distributed about 0.
105
Q

What is lasso regression through a Bayesian lens?

A
  • Posterior mode for B under a DOUBLE-EXPONENTIAL prior.
  • Prior belief that many of the coefficients are exactly 0.