Linear Regression Flashcards

1
Q

What does ‘strength of a relationship’ in regression refer to?

A

It is an indication of how well one can predict the response variable (e.g., sales) from the predictor (e.g., advertising budget). A strong relationship implies high predictive accuracy, whereas a weak relationship implies a prediction only slightly better than random guessing.

2
Q

Definition: Simple Linear Regression

A

A linear model with one predictor (X) used to predict an outcome (Y).

3
Q

What does the symbol ‘≈’ mean in a regression/statistical context?

A

It can be read as ‘is approximately modeled as,’ indicating an approximate relationship rather than an exact equality.

4
Q

sales ≈ β0 + β1 × TV

What does β0 represent in this equation?

A

Intercept

5
Q

Definition: Intercept (β0)

A

Represents the predicted value of Y when X=0.

6
Q

sales ≈ β0 + β1 × TV

What does β1 represent in this equation?

A

Slope

7
Q

Definition: Slope (β1)

A

Represents the average change in Y for a one-unit increase in X.

8
Q

sales ≈ β0 + β1 × TV

What are terms used to refer to β0 and β1 collectively?

A
  • Coefficients
  • Parameters
9
Q

Ordinary Least Squares (OLS) Estimation

A

A method to estimate β0 and β1 by minimizing the sum of squared residuals.

10
Q

Residual (eᵢ)

A

eᵢ = yᵢ - ŷᵢ

The difference between an observed value (Y) and the model’s fitted value (Ŷ).

11
Q

Residual sum of squares (RSS) equation

Include simple form and full form

A
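
Simple form: RSS = e₁² + e₂² + ⋯ + eₙ²

Full form: RSS = Σᵢ ( yᵢ - β̂₀ - β̂₁xᵢ )²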
12
Q

Equation for slope (β1)

A
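
β̂₁ = Σᵢ (xᵢ - x̄)(yᵢ - ȳ) / Σᵢ (xᵢ - x̄)²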
13
Q

Equation for intercept (β0)

A
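
β̂₀ = ȳ - β̂₁x̄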
14
Q

Definition: least squares coefficient estimates

A

They are the intercept and slope estimates chosen to minimize the sum of squared residuals (differences between observed and predicted values), providing the best linear fit to the data under the least squares criterion.

15
Q

Best-Fit Line

A

The linear function Ŷ = β̂₀ + β̂₁X that minimizes the sum of squared residuals.

16
Q

Interpretation of β1 in Linear Regression

A

Indicates how much Y is expected to change when X increases by one unit, holding other factors constant (if any).

17
Q

Assumption: Linearity

A

Y is assumed to be linearly related to X.

18
Q

Assumption: Independence of Errors

A

The residuals are assumed to be uncorrelated with one another.

19
Q

Assumption: Exogeneity

A

The error term ε is assumed to be independent of (uncorrelated with) the predictor X.

20
Q

Assumption: Homoscedasticity

A

The variance of residuals is constant across all values of X.

21
Q

Assumption: Normality of Errors

A

Residuals are assumed to follow a normal distribution (especially important for inference).

22
Q

Definition: Population regression line

A

The population regression line is the true (but typically unknown) underlying linear relationship between X and Y.

23
Q

Definition: least squares line

A

The least squares line is our estimated linear relationship based on a specific sample of data.

24
Q

What is the distinction between the least squares line and population regression line?

A

Different samples yield slightly different least squares lines, but the population regression line remains fixed (and unobserved).

25
Q

What is an unbiased estimator in statistics?

A

An unbiased estimator is one whose expected value equals the true parameter across many samples, meaning it does not systematically over- or under-estimate the parameter.

26
Q

Are least squares estimates unbiased?

A

Yes. If we repeatedly draw different samples and compute the least squares estimates, the average of those estimates will equal the true coefficients. Hence they do not systematically over- or under-estimate the true parameters.

27
Q

What is the standard error of an estimator?

A

It is a measure of the estimator’s variability—how far the estimator (e.g., a sample mean or a regression coefficient) is likely to deviate from the true parameter value on average.

28
Q

Formula for the variance of the sample mean μ̂

A

Var(μ̂) = SE(μ̂)² = σ² / n

29
Q

Var(μ̂) = SE(μ̂)² = σ² / n

What are the conditions under which the variance of the sample mean μ̂ holds?

A

Independent and identically distributed (i.i.d.) with finite variance

30
Q

Var(μ̂) = SE(μ̂)² = σ² / n

What does the variance equation for the sample mean μ̂ tell us?

A

The variability of the sample mean decreases as sample size grows

31
Q

Formula for Var(β̂₀) in simple linear regression

A

SE(β̂₀)² = σ² * [ 1/n + ( x̄² / Σᵢ (xᵢ - x̄)² ) ]

32
Q

Formula for Var(β̂₁) in simple linear regression

A

SE(β̂₁)² = σ² / Σᵢ (xᵢ - x̄)²

33
Q

SE(β̂₁)² = σ² / Σᵢ (xᵢ - x̄)²

What does the variance equation for β̂₁ in simple linear regression tell us?

A

SE(β̂₁) is smaller when the xᵢ are more spread out; intuitively we have more leverage to estimate a slope when this is the case

34
Q

What is a confidence interval?

A

It is a range of values that, with a specified level of confidence (e.g. 95%), is expected to contain the true (but unknown) parameter.

35
Q

What is the approximate 95% confidence interval for β₁ in simple linear regression? How is it approximate?

A

β̂₁ ± 2 · SE(β̂₁). (Strictly speaking, we use the t-distribution quantile with n−2 degrees of freedom, but 2 is a close approximation.)

36
Q

What is the null hypothesis (H₀) when testing for a relationship between X and Y?

A

H₀: β₁ = 0 (no relationship between X and Y).

37
Q

What is the alternative hypothesis (Hₐ) when testing for a relationship between X and Y?

A

Hₐ: β₁ ≠ 0 (some relationship between X and Y).

38
Q

What distribution does the test statistic, (β̂₁ - 0) / SE(β̂₁), follow when testing for a relationship between X and Y?

A

t-distribution with n - 2 degrees of freedom.

39
Q

How do we compute the t-statistic for β₁ when testing for a relationship between X and Y?

A

t = (β̂₁ - 0) / SE(β̂₁), which measures how many standard deviations β̂₁ is away from zero.

40
Q

What does the p-value represent in the context of testing for a relationship between X and Y?

A

It is the probability of observing a test statistic at least as extreme as the one observed, assuming the null hypothesis is true. A small p-value suggests that β₁ ≠ 0.

41
Q

How do we typically decide to reject H₀?

A

If the p-value is below a chosen significance level (e.g., 0.05), we reject H₀ and conclude there is likely a relationship between X and Y.

42
Q

Using Linear Models for Inference

A

We can make statements about the relationship between X and Y (e.g., whether β₁ ≠ 0) based on statistical tests.

43
Q

Definition: Residual Standard Error (RSE)

A

An estimate of the standard deviation of the error terms in a regression model, measuring how far observed values typically deviate from the true regression line.

44
Q

Formula: RSE

Simple linear regression

A

RSE = sqrt( (1/(n - 2)) * Σ( yᵢ - ŷᵢ )² ), where n is the number of observations.
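
A minimal R sketch (assuming lm.fit is a fitted simple linear regression, as in the lab cards later in this deck) showing the RSE computed from this formula and as reported by summary():

n <- length(residuals(lm.fit))
sqrt(sum(residuals(lm.fit)^2) / (n - 2))  # RSE from the formula
summary(lm.fit)$sigma                     # reported by R as the residual standard error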

45
Q

Definition: Total Sum of Squares (TSS)

A

Represents the total variability in the response variable Y before regression.

46
Q

Formula: TSS

A

TSS = Σ( yᵢ - ȳ )²

47
Q

Include equation

Definition: Residual Sum of Squares (RSS)

A

RSS = Σ( yᵢ - ŷᵢ )². It measures the variability in Y left unexplained by the regression model.

48
Q

Definition: R² Statistic

A

R² measures the proportion of variability in Y that is explained by the model; it always lies between 0 and 1.

49
Q

Formula: R²

A

R² = 1 - (RSS / TSS). It compares unexplained variability to total variability in the data.
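
A minimal R sketch (again assuming a fitted model lm.fit) computing R² from RSS and TSS; the result should match summary(lm.fit)$r.squared:

y <- model.response(model.frame(lm.fit))  # observed response values
rss <- sum(residuals(lm.fit)^2)           # unexplained variability
tss <- sum((y - mean(y))^2)               # total variability
1 - rss / tss                             # R-squared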

50
Q

Why might we use R² instead of RSE?

A

R² is a scale-free measure of the proportion of variance in the response explained by the model, always lying between 0 and 1. RSE, in contrast, is on the scale of Y and can be harder to interpret across different contexts.

51
Q

What is considered a ‘good’ R²?

A

It depends on the context and field of application. In some physical sciences, values near 1 might be realistic. In many social or biological settings, much lower R² values (e.g. 0.1 or 0.2) may still be considered informative.

52
Q

Definition: Correlation between X and Y

Include a verbal explanation of how it’s computed

A

A measure of the linear relationship between X and Y, computed as the covariance of X and Y divided by the product of their standard deviations.

53
Q

Formula: Correlation between X and Y

A

Cor(X,Y) = (∑(xᵢ - x̄)(yᵢ - ȳ)) / √[∑(xᵢ - x̄)² * ∑(yᵢ - ȳ)²]
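
A small R illustration with hypothetical vectors x and y; the hand computation agrees with the built-in cor():

x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 6)
sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))  # correlation from the formula
cor(x, y)                                            # built-in equivalent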

54
Q

Relationship: R² and correlation in simple linear regression

A

In a simple linear regression with one predictor, R² equals the square of the correlation between X and Y.

55
Q

Definition: F-Statistic (for regression)

A

A ratio of explained variance to unexplained variance, used to test whether at least one predictor is significantly related to the response.

56
Q

What is Multiple Linear Regression (MLR)?

A

A statistical technique for modeling the relationship between one response (dependent) variable and multiple predictor (independent) variables, using a linear function.

57
Q

What is the general MLR model equation?

A

Y = β₀ + β₁X₁ + β₂X₂ + … + βₚXₚ + ε, where Y is the response, Xᵢ are predictors, βᵢ are unknown coefficients, and ε is the error term.

58
Q

How are the coefficients in MLR typically estimated?

A

By minimizing the Residual Sum of Squares (RSS) = Σ(yᵢ - ŷᵢ)², where ŷᵢ is the model’s predicted value for observation i.

59
Q

Y = β₀ + β₁X₁ + β₂X₂ + … + βₚXₚ + ε

What does βⱼ represent in MLR?

A

βⱼ represents the average change in the response Y for a one-unit increase in Xⱼ, holding all other predictors constant.

60
Q

Why might multiple predictors be used instead of just one?

A

Including additional relevant predictors often improves predictions and reveals more nuanced relationships, controlling for the effects of other variables.

61
Q

Why might a predictor appear significant when analyzed alone but not in a multiple regression?

A

Because in simple regression we do not control for the effects of other predictors. Once we include additional variables, the apparent significance can disappear if the predictor’s effect was actually due to correlation with those other predictors.

62
Q

What are some important questions we may seek to answer with MLR?

4 questions

A
  1. Is at least one of the predictors X₁, X₂, …, Xₚ useful in predicting the response?
  2. Do all the predictors help to explain Y, or is only a subset of the predictors useful?
  3. How well does the model fit the data?
  4. Given a set of predictor values, what response value should we predict, and how accurate is our prediction?
63
Q

Definition: RSE in MLR

A

An estimate of the standard deviation of the error terms, measuring how far observed values typically deviate from the fitted regression hyperplane. It quantifies the average unexplained variability per observation.

64
Q

Formula: RSE in MLR

A

RSE = √[ RSS / (n - p - 1) ], where p is the number of predictors and n is the sample size

65
Q

Definition: Multiple R²

A

It is the proportion of variability in the response Y explained by the model. It ranges from 0 to 1.

66
Q

Formula: Multiple R²

A

R² = 1 - (RSS / TSS)

67
Q

Why might we use Adjusted R² instead of R² in MLR?

A

R² can increase by simply adding more predictors, even if they are only marginally useful. Adjusted R² penalizes for extra predictors, preventing misleadingly high R² values.

68
Q

Formula: Adjusted R²

A

Adjusted R² = 1 - [ (RSS / (n - p - 1)) / (TSS / (n - 1)) ]
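
A minimal R sketch (assuming lm.fit is a fitted multiple regression) applying this formula; the result should match summary(lm.fit)$adj.r.squared:

y <- model.response(model.frame(lm.fit))
n <- length(y)
p <- length(coef(lm.fit)) - 1             # number of predictors, excluding the intercept
rss <- sum(residuals(lm.fit)^2)
tss <- sum((y - mean(y))^2)
1 - (rss / (n - p - 1)) / (tss / (n - 1)) # adjusted R-squared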

69
Q

What is the overall F-test in MLR?

A

A hypothesis test checking whether at least one of the predictors has a non-zero coefficient. H₀: all βⱼ = 0 vs. Hₐ: at least one βⱼ ≠ 0.

70
Q

F = [ (TSS - RSS)/p ] / [ RSS/(n - p - 1) ]

What does F-statistic in MLR tell you?

A

A large F suggests that the model with predictors explains significantly more variance than a model with no predictors.

71
Q

Formula: F-statistic in MLR

A

F = [ (TSS - RSS)/p ] / [ RSS/(n - p - 1) ]
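
In R (a sketch, assuming lm.fit is a fitted multiple regression), the overall F-statistic and its two degrees of freedom are stored in the model summary:

summary(lm.fit)$fstatistic  # value, numdf (p), dendf (n - p - 1)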

72
Q

What distribution does the F-statistic follow in multiple linear regression?

A

Under the classical assumptions (normal errors, etc.), the F-statistic follows an F-distribution with p and n−p−1 degrees of freedom.

73
Q

The F-statistic in multiple linear regression follows an F distribution under the assumption that the errors ε_i have a normal distribution. Does this still hold if the errors are not perfectly normal?

A

Yes, if the sample size n is sufficiently large, the F-statistic is approximately F-distributed due to asymptotic robustness, even if the errors deviate from normality.

74
Q

What is a partial F-test in multiple linear regression?

A

It compares a ‘full’ model (with all predictors) to a ‘reduced’ model (omitting a subset of q predictors), determining whether those q predictors significantly improve the fit.

75
Q

What is the formula for the partial F-statistic?

A

F = [ (RSS₀ - RSS) / q ] / [ RSS / (n - p - 1) ], where RSS₀ is the RSS of the reduced model, RSS is the RSS of the full model, and p is the total number of predictors in the full model.

76
Q

How do we interpret the partial F-test result?

A

A large F (with a small p-value) indicates that dropping the q predictors increases the residual error enough to conclude those predictors matter. If F is near 1, there’s little evidence that the omitted predictors improve the model.

77
Q

What do the individual t-tests check in MLR?

A

They test whether each coefficient βⱼ is significantly different from zero, holding the other predictors constant.

78
Q

What does p-value mean in the context of each predictor’s t-test?

A

It is the probability, under the null hypothesis (βⱼ = 0), of observing a test statistic at least as extreme as the one computed from the data.

79
Q

Why do we look at the overall F-statistic rather than just individual t-tests?

A

Because with many predictors (p large), some t-tests may be significant by chance (false positives). The F-statistic adjusts for the number of predictors, so the probability of incorrectly rejecting H₀ remains at the chosen significance level (e.g. 5%), regardless of how many predictors there are.

80
Q

When can the usual F-statistic not be used in multiple linear regression?

A

If the number of predictors (p) exceeds the number of observations (n), we cannot fit the model using ordinary least squares — there are not enough degrees of freedom — and thus we cannot perform the usual F-test. Specialized high-dimensional methods are required instead.

81
Q

What are the three classical approaches for variable selection in MLR?

A
  1. Forward selection
  2. Backward selection
  3. Mixed selection
82
Q

What is forward selection?

A

A stepwise approach that starts with the null model (only an intercept), then adds one predictor at a time — whichever reduces RSS the most — until a stopping criterion is reached.

83
Q

What is backward selection?

A

A stepwise approach that begins with all predictors in the model and removes the predictor with the largest p-value at each step, continuing until a stopping criterion is met.

84
Q

What is mixed (stepwise) selection?

A

A combination of forward and backward selection. Start with no predictors, adding the best one at a time (like forward), but also remove any predictors that have become insignificant (like backward), iterating until no more improvements can be made.

85
Q

What does ‘controlling for other variables’ mean?

A

In MLR, the coefficient βⱼ reflects the effect of Xⱼ on Y after accounting for (holding constant) all other included predictors.

86
Q

How do we interpret a negative coefficient for a predictor in MLR?

A

It indicates that, after holding other predictors constant, increases in that predictor are associated with a decrease in the response.

87
Q

When do we reject the null hypothesis in the overall F-test?

A

If the F-statistic is sufficiently large (or the corresponding p-value is sufficiently small), indicating that at least one predictor is significantly related to Y.

88
Q

What is the difference between the overall F-test and individual t-tests?

A

The F-test checks if any predictor is relevant, while t-tests check if each specific predictor’s coefficient differs from zero, given the others in the model.

89
Q

What two primary metrics are used to assess model fit in multiple linear regression?

A

The Residual Standard Error (RSE) and R² (proportion of variance explained).

90
Q

Why does R² always increase (or stay the same) when new predictors are added?

A

Because adding predictors can only reduce (or leave unchanged) the Residual Sum of Squares (RSS), thereby increasing R²—even if those predictors are only weakly related to the response.

91
Q

How can adding a predictor sometimes increase RSE even though RSS decreases?

A

RSE = √[RSS / (n - p - 1)]. Adding a predictor decreases both RSS and the denominator n - p - 1; if the drop in RSS is proportionally smaller than the drop in n - p - 1, the ratio grows and RSE increases.

92
Q

What does the 3D plot of TV, radio, and sales suggest?

A

It indicates a non-linear pattern in the residuals, implying that a simple linear model may underestimate sales in certain regions (e.g., where budgets are split), suggesting possible interaction or synergy between TV and radio advertising.

93
Q

What is meant by a ‘synergy’ or ‘interaction effect’ between predictors?

A

An effect in which combining multiple predictors (e.g., TV and radio advertising) yields a greater (or different) impact on the response than the sum of their individual effects alone.

94
Q

What are the three main sources of uncertainty in multiple regression predictions?

A

1) Uncertainty in the coefficient estimates (reducible error).
2) Model bias if the linear form is not exactly correct.
3) Irreducible error due to random variation in the outcome.

95
Q

How does a confidence interval differ from a prediction interval in MLR?

A

A confidence interval targets the average response for given predictor values, while a prediction interval encompasses the possible range for a single future observation. Prediction intervals are always wider.

96
Q

Why do we call the linear model an approximation of reality?

A

Real relationships can be more complex or nonlinear. The chosen linear form introduces ‘model bias’ if it doesn’t capture all the underlying structure.

97
Q

Why is the prediction interval wider than the confidence interval?

A

Because the prediction interval includes both uncertainty in estimating the mean response (reducible error) and the additional variability of an individual outcome (irreducible error).

98
Q

How do confidence intervals for βⱼ in MLR differ from simple linear regression?

A

They use the same logic (estimate ± critical value × SE), but the SE takes into account correlations among predictors and the degrees of freedom for MLR.

99
Q

What is multicollinearity in MLR?

A

A situation where two or more predictors are highly correlated, making it difficult to distinguish their individual effects on the response.

100
Q

Why is multicollinearity problematic?

A

It inflates the variance of the coefficient estimates, leading to unstable estimates and wider confidence intervals (less precision).

101
Q

What is the role of qualitative (categorical) predictors in MLR?

A

They are included via dummy (indicator) variables that take on values 0/1, allowing the model to estimate different intercepts for each category.

102
Q

What is one-hot encoding?

A

A method for handling qualitative (categorical) predictors by creating separate indicator (dummy) variables for each category, each taking values of 0 or 1.
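
A small R illustration with a hypothetical region factor; model.matrix() shows the indicator columns R would create, with one level (East) absorbed into the intercept as the baseline:

df <- data.frame(region = factor(c("East", "South", "West", "East")))
model.matrix(~ region, data = df)  # columns: (Intercept), regionSouth, regionWest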

103
Q

Given that y_i is the credit card balance for an individual, and region is encoded with dummy variables for South and West (East as the baseline), what can β_0 be interpreted as?

A

The average credit card balance for individuals from the East.

104
Q

Given that y_i is the credit card balance, with the same region dummy coding (East as the baseline), what can β_1 be interpreted as?

A

The difference in the average balance between people from the South versus the East.

105
Q

Given that y_i is the credit card balance, with the same region dummy coding (East as the baseline), what can β_2 be interpreted as?

A

The difference in the average balance between those from the West versus the East.

106
Q

Why is there always one fewer dummy variable than the number of levels for a categorical variable?

A

Because one category serves as the ‘baseline’ (or reference), and the remaining categories each get a dummy variable (1/0). Having a dummy for every category would cause perfect multicollinearity.

107
Q

How can we test whether a categorical variable with multiple levels (e.g., region) is related to the response?

A

We perform an F-test of the joint hypothesis that all corresponding dummy coefficients (e.g., β₁ = β₂ = 0) are zero. This test does not depend on which category is chosen as the baseline.

108
Q

What is the additive assumption in linear models?

A

It states that each predictor’s effect on the response is independent of the other predictors, so the impact of one predictor does not change based on the value of another predictor.

109
Q

What is an interaction term in MLR?

A

An additional predictor created by multiplying two predictors (e.g., X₁ × X₂), allowing the effect of one predictor to depend on the level of another.

110
Q

How do you interpret an interaction term?

A

If an interaction is significant, it means the relationship between one predictor and Y changes depending on the value of another predictor.

111
Q

What is a ‘main effect’ in a regression model?

A

It’s the direct effect of a single predictor on the response, not accounting for interaction terms with other predictors.

112
Q

What is the hierarchical principle in regression?

A

It states that if an interaction (or higher-order) term is included in a model, then the corresponding lower-order (main) effects should also be included, even if they appear statistically insignificant.

113
Q

Why follow the hierarchical principle?

A

Because including an interaction X₁×X₂ without the main effects X₁ and X₂ confounds the interpretation. The interaction term can absorb the baseline effect of X₁ or X₂, and its coefficient becomes misleading. Keeping main effects clarifies the unique impact of the interaction.

114
Q

What happens when we add an interaction between a qualitative variable (e.g., student status) and a quantitative variable (e.g., income)?

A

It allows each group (e.g., students vs. non-students) to have not only its own intercept but also its own slope with respect to the quantitative variable, rather than forcing parallel lines.

115
Q

What does the model look like for an interaction between one binary dummy (student) and a numeric predictor (income)? Response variable is balance.

A

balance = β₀ + β₁ × income + β₂ × student + β₃ × (income × student). For students: (β₀ + β₂) + (β₁ + β₃) × income; for non-students: β₀ + β₁ × income.
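
A sketch of fitting this model in R, assuming the Credit data set from the ISLR2 package (variables Balance, Income, and the factor Student):

library(ISLR2)
lm(Balance ~ Income * Student, data = Credit)  # expands to Income + Student + Income:Student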

116
Q

What is the ‘linearity assumption’ in linear regression?

A

The change in the response Y associated with a one-unit change in Xj is constant, regardless of the value of Xj.

117
Q

What is polynomial regression?

A

A way to capture non-linear relationships by including polynomial terms (e.g., X², X³) of a predictor in a linear model. The model remains ‘linear’ in parameters, but can curve with respect to X.

118
Q

Why might we add a squared (X²) term to a regression?

A

If the data suggests a curved (non-linear) relationship between X and Y, including X² can significantly improve the fit by allowing the model to bend.

119
Q

Does adding polynomial terms still produce a linear model?

A

Yes. Even though the predictors include X² or X³, the model is linear in the coefficients (β’s), so standard linear regression software can fit it.

120
Q

What potential pitfall arises from adding too many polynomial terms?

A

The model can become overly ‘wiggly’ and may overfit the data, adding complexity without a genuine improvement in predictive or explanatory power.

121
Q

What are the most common problems when fitting a linear regression model?

A
  1. Non-linearity of the response-predictor relationships
  2. Correlation of error terms
  3. Non-constant variance of error terms
  4. Outliers
  5. High-leverage points
  6. Collinearity
122
Q

Most common problems when fitting linear regression model

What is the non-linearity issue?

A

Linear regression assumes a straight-line relationship between predictors and the response. If the true relationship is curved (non-linear), the model’s predictions and inferences can be inaccurate.

123
Q

What is a residual plot?

A
  • In simple linear regression, plot the residuals (eᵢ = yᵢ - ŷᵢ) against the predictor values xᵢ.
  • In multiple regression, plot the residuals against the predicted (fitted) values ŷᵢ.
124
Q

Most common problems when fitting linear regression model

How can we detect non-linearity in a regression model?

A

By examining residual plots. If a clear pattern (e.g., a U-shape) appears in the residuals versus fitted values, that suggests the linear model is missing a non-linear component.

125
Q

Most common problems when fitting linear regression model

What steps can be taken if we detect non-linearity?

A

One simple approach is to include polynomial (e.g., X²) or other transformations (e.g., log X, √X) of the predictors. More advanced non-linear models can also be used.

126
Q

Most common problems when fitting linear regression model

What does ‘correlation of error terms’ in linear regression mean?

A

It means the residuals are not independent—there is some systematic relationship among them, often seen in time series or clustered data.

127
Q

Most common problems when fitting linear regression model

Why is correlated error structure a problem?

A

Because many standard tests (e.g., t-tests, F-tests) assume independent errors. Correlated errors lead to incorrect, low/narrow estimates of standard errors, confidence intervals, and p-values.

128
Q

Most common problems when fitting linear regression model

How can we detect correlated errors?

A

By plotting residuals against time or their lagged values, or using specific tests like the Durbin–Watson test for autocorrelation.

129
Q

Most common problems when fitting linear regression model

What is meant by ‘non-constant variance’ of error terms?

A

Also called heteroscedasticity, it occurs when the spread (variance) of the residuals changes for different fitted values, violating the usual linear model assumption that Var(εᵢ) = σ².

130
Q

Most common problems when fitting linear regression model

How can we detect heteroscedasticity?

A

By examining residual plots: if residuals increase or decrease systematically with fitted values (e.g., funnel shape), that suggests non-constant variance.

131
Q

Most common problems when fitting linear regression model

What are common remedies for heteroscedasticity?

A

Transform the response using a concave function like log(Y) or √Y, or use weighted least squares, which gives lower weight to observations with higher variance.

132
Q

Most common problems when fitting linear regression model

Why is non-constant variance problematic?

A

Standard errors, confidence intervals, and p-values from ordinary least squares become unreliable if the variance of the errors is not constant.

133
Q

Most common problems when fitting linear regression model

What is an outlier in linear regression?

A

A data point whose observed value is far from the value predicted by the model, resulting in a large residual compared to other observations.

134
Q

Most common problems when fitting linear regression model

Why can an outlier be problematic even if it doesn’t dramatically change the slope?

A

Because a single extreme point can inflate the Residual Standard Error (RSE) and affect confidence intervals and p-values, potentially distorting inferences about the model.

135
Q

Most common problems when fitting linear regression model

How do we identify outliers?

A

By examining residual plots or studentized residuals. A studentized residual (residual divided by its estimated standard error) greater than about ±3 is often considered outlying.

136
Q

Most common problems when fitting linear regression model

What are possible actions if an outlier is identified?

A

1) Check for data-entry errors or measurement anomalies. 2) Remove it if it’s clearly erroneous. 3) Keep it if it’s a valid data point and consider whether it indicates missing predictors or model mis-specification.

137
Q

Most common problems when fitting linear regression model

What are high-leverage points in linear regression?

A

Observations whose predictor values (X’s) are unusual or far from the bulk of the data. They can have a large influence on the fitted model, even if their residuals aren’t large.

138
Q

Most common problems when fitting linear regression model

Why are high-leverage points potentially problematic?

A

Because they can disproportionately affect the regression coefficients. A single high-leverage observation can pull the fitted line or plane toward itself, distorting results.

139
Q

Most common problems when fitting linear regression model

Are high-leverage points always outliers?

A

No. A high-leverage point can have a small residual if the model is forced to pass near it. Conversely, an outlier has a large residual but might not have unusual X-values.

140
Q

Most common problems when fitting linear regression model

How can we detect high-leverage points?

A

By calculating leverage scores. Observations with hᵢ significantly larger than the average leverage (p+1)/n are considered high leverage.

141
Q

Most common problems when fitting linear regression model

Equation: Leverage statistic (hᵢ) for simple linear regression

A

hᵢ = 1/n + ( (xᵢ - x̄)² / Σⱼ (xⱼ - x̄)² )

142
Q

Most common problems when fitting linear regression model

Why is it important to check both residuals and leverage?

A

Because outliers are identified via large residuals, while high-leverage points have unusual predictor values. A data point can be high leverage, an outlier, both, or neither.

143
Q

Most common problems when fitting linear regression model

What is collinearity (multicollinearity) in linear regression?

A

It refers to predictors that are highly correlated with each other, making it hard to determine their individual effects on the response.

144
Q

Most common problems when fitting linear regression model

Why is collinearity problematic?

A

Because it inflates the standard errors of the coefficient estimates, potentially making significant predictors appear insignificant and leading to unstable estimates.

145
Q

Most common problems when fitting linear regression model

How can we detect collinearity?

A

By examining the correlation matrix among predictors or by calculating Variance Inflation Factors (VIF). Large VIF values (e.g., > 5 or 10) suggest serious multicollinearity.

146
Q

Most common problems when fitting linear regression model

What is the Variance Inflation Factor (VIF)?

A

A measure of how much the variance of a coefficient is inflated due to collinearity with other predictors.

147
Q

Most common problems when fitting linear regression model

Equation: Variance Inflation Factor (VIF)

A

VIFᵢ = 1 / (1 - Rᵢ²), where Rᵢ² is the R² from regressing predictor i on the other predictors.

148
Q

Most common problems when fitting linear regression model

What strategies can address collinearity?

A

Remove or combine highly correlated predictors.

149
Q

Most common problems when fitting linear regression model

Does collinearity always ruin the model?

A

Not necessarily. The model can still predict well, but interpreting individual coefficients becomes difficult if their estimates have large standard errors due to high collinearity.

150
Q

What method(s) can you use to answer the question: ‘Is there a relationship between sales and advertising budget?’

A

Fit a multiple regression model and use the overall F-test to check whether at least one advertising coefficient differs from zero.

151
Q

What method(s) can you use to answer the question: ‘How strong is the relationship (between sales and advertising budget)?’

A

Look at the Residual Standard Error (RSE) to gauge the average prediction error, and the R² statistic to see what fraction of the variance in sales is explained by the advertising budget. A lower RSE and higher R² both indicate a stronger relationship.

152
Q

What method(s) can you use to answer the question: ‘Which media are associated with sales?’

A

Fit a multiple linear regression model including all media as predictors, then check each predictor’s t-statistic and p-value. Predictors with low p-values are significantly related to sales.

153
Q

What method(s) can you use to answer the question: ‘How large is the association between each medium and sales?’

A

Construct confidence intervals for each medium’s regression coefficient (βᵢ) in a multiple linear regression model. The size and position of these intervals relative to zero indicate how large (and significant) each medium’s effect is.

154
Q

What method(s) can you use to answer the question: ‘How accurately can we predict future sales?’

A

Use the fitted regression model to generate either a confidence interval for the mean response (if predicting the average) or a prediction interval (if predicting an individual outcome). Prediction intervals are wider because they account for the irreducible error term.

155
Q

What method(s) can you use to answer the question: ‘Is the relationship linear?’

A

Create and inspect residual plots to see if there is a systematic pattern (indicating non-linearity). If a pattern emerges, consider adding polynomial or transformed predictors to handle non-linear effects.

156
Q

What method(s) can you use to answer the question: ‘Is there synergy among the advertising media?’

A

Include an interaction term (e.g., TV × radio) in a multiple regression model, then check if the coefficient (and its p-value) is significant. A significant interaction term suggests synergy among the media.

157
Q

What is the difference between parametric and non-parametric methods in regression?

A

Parametric methods (like linear regression) assume a functional form for f(X), with a fixed number of parameters. Non-parametric methods (like K-Nearest Neighbors) do not assume a specific form.

158
Q

What is K-Nearest Neighbors (KNN) regression?

A

A non-parametric technique that predicts a new observation’s response by averaging the responses of its K closest training points in predictor space.

159
Q

How is the prediction f̂(x₀) computed in KNN regression?

A

f̂(x₀) = (1/K) Σ_{xᵢ ∈ N₀} yᵢ, where N₀ is the set of the K training observations closest to x₀.
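
A minimal base-R sketch of this average for one numeric predictor (hypothetical data and helper name):

knn_predict <- function(x, y, x0, K = 3) {
  neighbors <- order(abs(x - x0))[1:K]  # indices of the K training points closest to x0
  mean(y[neighbors])                    # average their responses
}
knn_predict(x = c(1, 2, 3, 5, 8), y = c(2, 3, 4, 6, 9), x0 = 4, K = 3)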

160
Q

How does K affect the bias-variance trade-off in KNN?

A

A small K yields more flexible fits (low bias, but high variance). A large K yields smoother fits (higher bias, but lower variance), diluting local idiosyncrasies.

161
Q

In what setting will a parametric approach such as least squares linear regression outperform a non-parametric approach such as KNN regression?

A

The parametric approach will outperform the nonparametric approach if the parametric form that has been selected is close to the true form of f.

162
Q

When is it helpful to use KNN rather than a linear model?

A

When the true relationship is highly non-linear or too complex for a simple parametric form. KNN can adapt more flexibly to such data if enough observations are available.

163
Q

What is the main advantage of KNN regression over linear regression?

A

It can capture complex, non-linear relationships without specifying a model form. Linear regression may miss these if the linear (or polynomial) form is too restrictive.

164
Q

What are 2 disadvantages of KNN regression compared to linear regression?

A
  • KNN can underperform in higher-dimensional problems due to the curse of dimensionality. As the number of predictors grows, data become sparse and points end up far from each other, making it hard to find truly ‘nearby neighbors.’
  • KNN provides less interpretability — there are no explicit coefficients to explain predictor effects.
165
Q

How do you fit a linear model in R with medv as the response and lstat as the predictor using the Boston data?

A

lm.fit <- lm(medv ~ lstat, data = Boston)

166
Q

How do you extract the coefficients of a linear regression model in R?

A

coef(lm.fit)

167
Q

How do you calculate confidence intervals for the coefficients of a linear regression model in R?

A

confint(lm.fit)

168
Q

How do you produce confidence intervals for new data in R using a linear regression model?

A

predict(lm.fit, newdata, interval = 'confidence')

169
Q

How do you produce prediction intervals for new data in R using a linear regression model?

A

predict(lm.fit, newdata, interval = 'prediction')

170
Q

How do you plot the data with the linear model fit in R for lstat vs. medv?

A
plot(Boston$lstat, Boston$medv)
abline(lm.fit)
171
Q

How do you display the standard diagnostic plots for a linear model in R?

A
par(mfrow = c(2, 2))
plot(lm.fit)
172
Q

How do you plot residuals vs. fitted values in R?

A

plot(predict(lm.fit), residuals(lm.fit))

173
Q

How do you plot studentized residuals vs. predicted values in R?

A

plot(predict(lm.fit), rstudent(lm.fit))

174
Q

How do you plot leverage statistics in R?

A

plot(hatvalues(lm.fit))

175
Q

How do you fit a linear model in R with medv as the response and lstat and age as predictors using the Boston data?

A

lm.fit <- lm(medv ~ lstat + age, data = Boston)

176
Q

How do you fit a linear model in R with medv as the response and all other variables as predictors using the Boston data?

A

lm.fit <- lm(medv ~ ., data = Boston)

177
Q

How do you calculate the variance inflation factor (VIF) in R?

A
library(car)
vif(lm.fit)
178
Q

How do you fit a linear model in R with medv as the response and all other variables except age as predictors using the Boston data?

A

lm.fit1 <- lm(medv ~ . - age, data = Boston)

179
Q

How do you modify an existing R model using the update() function?

A

Use update() with a new formula that references the old formula. For example, update(lm.fit, ~ . - age) removes the age predictor while keeping all other terms.

180
Q

Example interaction between lstat and age

What does the colon (:) syntax do for interactions in R?

A

Using lstat:age includes only the interaction term between lstat and age (no main effects).

181
Q

Example interaction between lstat and age

What does the star (*) syntax do for interactions in R?

A

Using lstat * age expands to lstat + age + lstat:age, meaning it includes both main effects and the interaction.

182
Q

How do you fit a linear model in R with medv as the response and lstat as the predictor, with a quadratic lstat term, using the Boston data?

A

lm.fit2 <- lm(medv ~ lstat + I(lstat^2), data = Boston)

183
Q

What is the purpose of the anova() function when comparing nested linear models in R?

A

It performs a hypothesis test to see if the more complex model significantly improves the fit compared to the simpler (nested) model.

184
Q

How do you compare two nested models with anova() in R?

A

Call anova(model1, model2) where model1 is the simpler model and model2 is the extended model. The function returns an F-statistic and p-value for the comparison.

185
Q

How do you include higher-order polynomial terms in a linear model in R without manually specifying each power?

A

Use the poly() function. For example: lm(y ~ poly(x, 5)) fits a 5th-order polynomial in x.

186
Q

What is the difference between poly(x, 3) and poly(x, 3, raw = TRUE)?

A

poly(x, 3) uses orthogonal polynomials (less correlation, more stable estimates), while raw = TRUE produces raw powers of x (x, x², x³). Both yield the same fitted values but have different coefficient estimates.

187
Q

How do you fit a linear model in R with medv as the response and the logarithm of lstat as the predictor using the Boston data?

A

lm.fit <- lm(medv ~ log(lstat), data = Boston)

188
Q

How does R handle qualitative variables in a linear regression model by default?

A

R automatically creates dummy variables for each factor level (except the baseline), allowing regression coefficients to compare each category to the baseline.

189
Q

What does the contrasts() function do in R?

A

It shows (and can set) the coding scheme for factor variables (i.e., how factor levels map to dummy variables). For example, contrasts(Carseats$ShelveLoc) displays the dummy coding for the ShelveLoc factor in the Carseats data.