Linear Regression Flashcards

1
Q

What is Linear Regression?

A

Predicting a quantitative response variable (Y) from a single predictor variable (X).

The coefficients are estimated with the Least Squares Method, which minimizes the Residual Sum of Squares (RSS).
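
A minimal sketch of the idea, assuming NumPy and a made-up toy data set (neither comes from the flashcards): fit Y = b0 + b1*X by least squares and compute the RSS that the method minimizes.

import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Closed-form least-squares estimates for Y = b0 + b1*X
b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
b0 = Y.mean() - b1 * X.mean()

residuals = Y - (b0 + b1 * X)
rss = np.sum(residuals ** 2)   # the quantity least squares minimizes
print(b0, b1, rss)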

2
Q

What is the Mean Squared Error (MSE)?

A

The RSS averaged over all the data points: MSE = RSS / n.

3
Q

Which statistic do we use in the case of Linear regression to check the significance of the predictor variable?

List out Ho and Ha.

A

t-Statistic

Ho: B1 = 0 (no relationship b/w “X” and “Y”).
Ha: B1 ≠ 0 (some relationship b/w “X” and “Y”).
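
A hedged sketch with statsmodels (the library choice and the simulated data are my assumptions; the flashcard only names the t-test): read off the slope's t-statistic and p-value.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=100)
Y = 2.0 + 1.5 * X + rng.normal(size=100)

results = sm.OLS(Y, sm.add_constant(X)).fit()
print(results.tvalues[1], results.pvalues[1])  # t-statistic and p-value for the slope
# A small p-value rejects Ho: B1 = 0 in favor of Ha: B1 != 0.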

4
Q

What questions should be asked before applying Linear Regression?

A
  1. Is there a relationship between the response and the predictor variable?
  2. If yes, how strong is the relationship?
  3. Is the relationship linear?
5
Q

What is the Error term (e) in the equation of linear regression?

A
  1. The true relationship may not be linear.
  2. Measurement error.

The error term captures what the model misses and is assumed to be independent of “X”.

6
Q

What is the meaning of a zero R2?

A

A value near zero indicates that the model is not able to explain any of the variability in the response.
This happens when SSR = SST (so R2 = 1 - SSR/SST = 0); the fit is only as good as taking the mean value of the response variable, without using the predictor variable at all.

7
Q

What is the significance of R2 in context of linear regression?

A

The R2 statistic is a measure of the linear relationship between X and Y, just like r.

In simple linear regression it can be shown that R2 = r2.

r = Corr(X,Y)

8
Q

Why can’t we fit several simple linear regressions instead of one multiple linear regression?

A
  1. It is unclear how to make a single prediction by combining several separate simple linear regressions.
  2. The predictor variables may be correlated with each other, but fitting separate simple linear regressions ignores the other predictors, effectively assuming they are completely independent of each other.
9
Q

Which statistic do we use in the case of Multiple Linear regression to check the significance of the predictor variable?

List out Ho and Ha.

A

F-Statistic

Ho: B1 = B2 = … = Bp = 0
Ha: at least one Bj ≠ 0

10
Q

What should the value of the F-statistic be if there is no relationship between the response and the predictor variables in multiple linear regression?

A

Value close to 1

11
Q

What should the value of the F-statistic be if there is a relationship between the response and the predictor variables in multiple linear regression?

A

Value greater than 1

12
Q

Given the individual p values for each variable, why do we need to look at the overall F-statistic?

A

Because when the number of predictor variables (p) is large, each one has roughly a 5% chance of showing a p-value below 0.05 purely by chance, so with many predictors it becomes very likely that at least one appears significant even when none truly is. The F-statistic does not suffer from this problem, because it adjusts for the number of predictors.

13
Q

What happens if the number of predictor variables is greater than the number of data points?

A

In that case we cannot fit multiple linear regression by least squares, and we cannot use the F-statistic.

14
Q

What are the ways of doing variable selection in the context of linear regression?

A
  1. Forward Selection - Start with the null model (no predictors). Repeatedly add the predictor that results in the lowest RSS, until a stopping rule is reached.
  2. Backward Selection - Start with all the predictors. Repeatedly remove the predictor with the largest p-value, until a stopping rule is reached.
  3. Mixed Selection - Start with the null model and add predictors as in forward selection, removing any predictor whose p-value grows too large; alternate these forward and backward steps until a stopping rule is reached. (A scikit-learn sketch follows this list.)
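
A hedged sketch with scikit-learn's SequentialFeatureSelector (my choice of library; note that it adds or removes features based on cross-validated score rather than RSS or p-values, so it is a close relative of the procedures above rather than an exact implementation).

from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=200, n_features=10, n_informative=4, random_state=0)

# Forward: start from nothing and grow; backward: start from everything and prune.
forward = SequentialFeatureSelector(LinearRegression(), n_features_to_select=4,
                                    direction="forward").fit(X, y)
backward = SequentialFeatureSelector(LinearRegression(), n_features_to_select=4,
                                     direction="backward").fit(X, y)
print(forward.get_support(), backward.get_support())  # boolean masks of the kept features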
15
Q

What is the relationship between R2 and correlation in multiple linear regression?

A

R2 = [Corr(Y, Y_pred)]^2, i.e. the squared correlation between the observed and fitted values of the response.

16
Q

Why do we need variable selection, given that R2 increases when more variables are added, even if the variables are only weakly associated with the response?

A

To prevent overfitting

17
Q

How can RSE increase when an extra variable is added to the model, given that RSS decreases?

A

A model with more variables can have a higher RSE if the decrease in RSS is small relative to the increase in p (the number of predictor variables):

RSE = Sqrt[RSS/(n-p-1)]

18
Q

What is the additive assumption of linear regression model?

A

The effect of a change in predictor Xj on the response Y is independent of the values of the other predictors.

19
Q

What is the linear assumption of linear regression model?

A

The change in the response Y due to a one-unit change in Xj is constant, regardless of the value of Xj.

20
Q

What is synergy effect or interaction effect?

A

The effect of one predictor on the response depends on the value of another predictor.

With synergy, the combined effect of two variables is more than the sum of their individual effects.

21
Q

If we are including an interaction term in a model, should we include the main term also, if the p-value for the main term is not significant?

A

Yes. By the hierarchical principle, if an interaction term is included, the corresponding main effects should also be included, even when their p-values are not significant.

22
Q

Explain polynomial regression in context of multiple linear regression.

A

Polynomial regression is a special case of multiple linear regression in which the additional predictors are polynomial functions (X2, X3, …) of an original predictor; the model remains linear in the coefficients.

23
Q

How to identify nonlinearity in the data?
(Assumption 1)

A

Residual Plots

For SLR - Residuals vs Predictor
For MLR - Residuals vs Fitted Value

There should not be any pattern evident in the residual plot.
An ideal residual plot shows no trend in the residuals, no outliers, and no changing variance across the fitted values.
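
A hedged sketch of a residual plot (matplotlib/scikit-learn and the simulated data are my assumptions): a clear pattern in residuals vs fitted values signals nonlinearity.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(200, 1))
y = 3 + 0.5 * X[:, 0] ** 2 + rng.normal(size=200)   # truly nonlinear relationship

model = LinearRegression().fit(X, y)
fitted = model.predict(X)
residuals = y - fitted

plt.scatter(fitted, residuals)   # MLR: residuals vs fitted values
plt.axhline(0, color="red")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()                       # a U-shaped pattern here indicates nonlinearity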

24
Q

What to do when nonlinearity is present in the data? (In the context of regression)

A

Apply a nonlinear transformation to the predictors, such as log(X), X2, or sqrt(X).
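
A minimal sketch of such transformations, assuming NumPy/scikit-learn and a made-up predictor column:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[1.0], [2.0], [4.0], [8.0]])

X_log = np.log(X)    # log(X)
X_sqrt = np.sqrt(X)  # sqrt(X)
X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)  # [X, X^2]
print(X_poly)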

25
Q

Heteroscedasticity

A

Non-constant variance in the residuals.

The variance of the error term is not constant across observations.

26
Q

How to tackle heteroscedasticity?

A

Transform the response Y using a concave function such as log(Y) or sqrt(Y).

Weighted Least Squares.
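
A hedged sketch of both remedies with statsmodels (the library, the simulated data, and the particular weights are my assumptions):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = rng.uniform(1, 10, 200)
y = 2 + 3 * x + rng.normal(scale=0.5 * x, size=200)   # error spread grows with x
X = sm.add_constant(x)

log_fit = sm.OLS(np.log(y), X).fit()                   # remedy 1: model log(Y)
wls_fit = sm.WLS(y, X, weights=1.0 / x**2).fit()       # remedy 2: down-weight noisy points
print(log_fit.params, wls_fit.params)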

27
Q

Outlier

A

Outliers are data points that deviate significantly from the expected patterns or values.

In regression, outliers have unusual (extreme) values of the response variable given their predictor values.

They can arise due to measurement errors, data entry mistakes, or genuine extreme values.

28
Q

High Leverage Point

A

Leverage points are data points that have a significant impact on the estimated regression coefficients. These points can distort the regression line.

Unlike outliers, leverage points have extreme values in their predictors.

29
Q

How to identify outliers?

A

Box plot
Residual Plots
Scatter Plots

30
Q

How to detect leverage point?

A

Leverage statistic
Cook’s Distance

31
Q

Studentized Residuals

A

Computed by dividing each residual by its estimated standard error (std. deviation).

For externally studentized residuals, each data point i is deleted in turn and the regression model is re-estimated with the remaining data points; the residual for point i and its standard deviation are then computed from that fit.

Observations with |studentized residual| > 3 are possible outliers.
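
A hedged sketch with statsmodels (library choice, simulated data, and the injected outlier are my assumptions):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = rng.normal(size=100)
y = 1 + 2 * x + rng.normal(size=100)
y[0] += 8                                    # inject an outlier

results = sm.OLS(y, sm.add_constant(x)).fit()
student = results.get_influence().resid_studentized_external
print(np.where(np.abs(student) > 3)[0])      # indices of possible outliers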

32
Q

What is collinearity?

A

A situation in which two or more predictor variables are closely related to each other.

33
Q

What problems does multicollinearity pose to the model?

A

Multicollinearity reduces the accuracy of the estimates of the regression coefficients: they become unstable, less reliable, and difficult to interpret, which reduces the interpretability of the model.

It also reduces the power of the hypothesis test for the significance of a predictor.

34
Q

How to detect collinearity?

A

Correlation matrix of the predictors (look for pairs with high absolute correlation).

35
Q

What is Multicollinearity?

A

Collinearity that exists among three or more independent variables simultaneously, even when no single pair of variables has a particularly high correlation.

36
Q

How to detect Multicollinearity?

A

Variance Inflation Factor (VIF) - It is computed by taking each predictor variable in turn and regressing it against every other predictor.

VIF = 1/(1 - R2),
where R2 measures how well that independent variable is explained by the other variables.

VIF = 1: no multicollinearity
VIF > 5: high multicollinearity
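
A hedged sketch with statsmodels' variance_inflation_factor (my library choice; the data is simulated so that x1 and x2 are nearly collinear):

import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(4)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)      # nearly collinear with x1
x3 = rng.normal(size=200)
# Include a constant so each auxiliary regression has an intercept.
X = pd.DataFrame({"const": 1.0, "x1": x1, "x2": x2, "x3": x3})

vif = {col: variance_inflation_factor(X.values, i)
       for i, col in enumerate(X.columns) if col != "const"}
print(vif)                                     # x1 and x2 should show VIF >> 5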

37
Q

How to tackle Multicollinearity?

A
  1. Drop the variable iteratively (start with the variable having the largest VIF).
  2. Combine the collinear variable together into a single predictor.
  3. Use a dimensionality reduction technique (such as PCA).
38
Q

What are the two ways in which you can get the best fit line for SLR?

A
  1. OLS - closed-form solution
  2. Gradient Descent - iterative (non-closed-form) solution
A sketch of both follows.
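
A hedged sketch of both routes, assuming NumPy and simulated data (the learning rate and iteration count are illustrative):

import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(size=200)
y = 4 + 3 * x + rng.normal(size=200)
X = np.column_stack([np.ones_like(x), x])      # design matrix [1, x]

# 1. OLS: closed-form solution of the normal equations
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# 2. Gradient descent on the MSE loss
beta = np.zeros(2)
lr = 0.1
for _ in range(2000):
    grad = 2 / len(y) * X.T @ (X @ beta - y)   # gradient of the mean squared error
    beta -= lr * grad

print(beta_ols, beta)                          # both approach [4, 3]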
39
Q

Advantage & Disadvantage of MAE

A

Advantage:
1. Interpretable Unit (Same as response)
2. Robust to outliers

Disadvantage:
1. Not differentiable at 0

40
Q

Advantage & Disadvantage of MSE

A

Advantage:
1. Differentiable at 0

Disadvantage:
1. Not robust to outliers
2. Unit is not directly interpretable (squared units of the response)

41
Q

Advantage & Disadvantage of RMSE

A

Advantage:
1. Differentiable at 0
2. Interpretable unit (same as the response)

Disadvantage:
1. Not so robust to outliers

42
Q

Can R2 be negative?

A

Yes, on the test data.

Since the model is built on training data and tested on test data, SSR may exceed SST.
In the context of linear regression, SST is computed around the mean of the response values; if the fitted line predicts the test data worse than simply using that mean, SSR > SST and R2 = 1 - SSR/SST becomes negative.

43
Q

Disadvantage of R2 score

A

It doesn’t account for the number of features: R2 never decreases (and usually increases) as more features are added.

This suggests that a model with more features is always better than a model with fewer features, but that’s not always the case.

44
Q

Why should we choose Gradient descent over OLS for multiple regression?

A

This is because the OLS closed-form solution for multiple regression requires inverting the p x p matrix X^T X, which costs roughly O(p^3) with standard algorithms (about O(p^2.373) with the fastest known ones) and becomes expensive when the number of features is large; gradient descent avoids this inversion.

45
Q

Deviance

A

A goodness-of-fit statistic.

It is a generalization of the idea of using the sum of squares of residuals (SSR) in OLS to cases where model-fitting is achieved by maximum likelihood.

46
Q

How are the coefficients affected in case of Ridge regression?

A

As α increases, the coefficients shrink towards zero but never become exactly zero.

47
Q

How are the coefficients affected in case of Lasso regression?

A

As α increases, the coefficients shrink towards zero and eventually some become exactly zero.

48
Q

Are all coefficients affected equally in the case of Ridge regression?

A

All coefficients are shrunk by roughly the same proportion.
Therefore, coefficients with larger values are reduced more in absolute terms.

49
Q

Are all coefficients affected equally in the case of Lasso regression?

A

Each coefficient shrinks towards zero by roughly a constant amount, so smaller coefficients reach zero first.

50
Q

How does bias-variance trade-off happen in Ridge and Lasso?

A

As α increases,
bias increases and
variance decreases.

Choose α near the point where the bias and variance curves intersect (perhaps just before the intersection).

51
Q

Why is Ridge regression called so?

A

Ridge regression eliminates the ridge that collinearity forms in the likelihood function.

52
Q

When to use Ridge and Lasso?

A

When the goal is feature selection and a more interpretable model - use Lasso.

When the goal is to reduce the impact of less important features while still keeping all of them - use Ridge.

53
Q

What is Ridge and Lasso regression?

A

Ridge and Lasso are regularization techniques that add a penalty term to the loss function of linear regression, shrinking the coefficients of the predictor variables.

In Ridge, the penalty term is the sum of the squared coefficient values multiplied by a tuning parameter (lambda).

In Lasso, the penalty term is the sum of the absolute coefficient values multiplied by the tuning parameter (lambda).
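
A hedged sketch with scikit-learn (my library choice; note that sklearn calls the tuning parameter alpha rather than lambda, and the data is simulated):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)   # penalty: alpha * sum(coef**2)
lasso = Lasso(alpha=1.0).fit(X, y)   # penalty: alpha * sum(|coef|)

print(np.sum(ridge.coef_ == 0))      # 0  -- ridge shrinks but does not zero out
print(np.sum(lasso.coef_ == 0))      # >0 -- lasso produces a sparse model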

54
Q

Why does lasso regularization create sparsity while ridge does not?

A

In the closed-form expression for a coefficient (e.g. with orthonormal predictors):

In lasso, the λ term is subtracted in the numerator (soft-thresholding), so a coefficient can be pushed exactly to zero.
In ridge, the λ term sits in the denominator, so the coefficient is scaled down but never reaches exactly zero.

55
Q

What is Elastic net regression?

A

A combination of ridge and lasso.

L = MSE + a*||w||_1 + b*||w||_2^2

λ = a + b
l1_ratio = a/(a+b)
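
A hedged sketch with scikit-learn's ElasticNet (my library choice; sklearn's exact penalty differs from the a/b form above by constant factors, but alpha plays the role of the overall strength λ and l1_ratio the share of the L1 penalty):

from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet

X, y = make_regression(n_samples=200, n_features=10, n_informative=3, random_state=0)

enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)   # 50/50 mix of L1 and L2
print(enet.coef_)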

56
Q

Best Subset selection

A

Fit all models with one predictor, all models with two predictors, all models with three predictors, and so on.

Total = 2^p models.

For each number of predictors, the model with the lowest RSS is selected. Then, among these best models of different sizes, the final choice is made using AIC, BIC, Cp, or Adjusted R2.

Computationally expensive.

57
Q

Forward stepwise selection

A

Fits 1 + p(p+1)/2 models.

Starts with the null model; predictors are added one at a time (at each step, the addition that improves the fit the most).

Computationally efficient.
The best model is not guaranteed, since we are not exploring all options.
Can also be applied when n < p.

58
Q

Backward stepwise selection

A

Fits 1 + p(p+1)/2 models.

Starts with the full model; the least useful predictor is removed iteratively (at each step, keep the reduced model with the lowest RSS or highest R2).

The single best model is then selected using AIC, BIC, Adjusted R2, or cross-validation error.

Computationally efficient.
The best model is not guaranteed, since we are not exploring all options.

59
Q

Why training set RSS and R2 cannot be used to evaluate the performance of a model?

A

Because every additional predictor variable decreases the training RSS and increases the training R2, even when the variable is unrelated to the response; selecting models on these metrics therefore favors overfitting.

60
Q

Mallow’s Cp

A

It adds a penalty to the training RSS in order to adjust for the fact that the training error tends to underestimate the test error.

This penalty increases as the number of predictors in the model increases.

The best model is the one with the lowest Cp.

61
Q

AIC

A

AIC = -2 * log(L) + 2 * k
Finds a model that maximizes the likelihood of the data while taking into account the number of parameters used.
By incorporating both the likelihood (measures how well the model fits the data) and the number of parameters, AIC strikes a balance between model fit and complexity.

62
Q

BIC

A

BIC = -2 * log(L) + k * log(n)
It imposes a stronger penalty for model complexity than AIC, and a heavier penalty (than Cp) on models with many variables.
Therefore, BIC tends to favor simpler models than AIC does.

63
Q

Adjusted R2

A

1 - [RSS/(n-d-1)]/[TSS/(n-1)]

The intuition behind the adjusted R2 is that once all the correct variables have been included in the model, adding additional noise variables leads to only a very small decrease in RSS while increasing d, so [RSS/(n-d-1)] increases and, consequently, the adjusted R2 decreases.

Therefore, the model with the largest adjusted R2 will have all the correct variables and no noise variables.

64
Q

Why should validation and cross validation be preferred over model performance metrics like AIC, BIC, adjusted R2 for model evaluation?

A

This is because validation gives a direct estimate of the test error, rather than making assumptions in order to provide an indirect estimate.

Metrics like AIC and BIC penalize the number of parameters and tend to favor models with fewer parameters.

65
Q

Is shrinkage penalty also applied to the intercept term?

A

No.

The goal is to shrink the association of each predictor with the response, whereas the intercept is simply the mean value of the response when all the predictors equal 0.

66
Q

As α increases, can an individual coefficient increase?

A

Yes.

While the overall size of the coefficients decreases as α increases, an individual coefficient may increase.

67
Q

Are regularization techniques like ridge and lasso applied before standardization?

A

No, after standardization.

The same variable measured on a different scale would receive a different coefficient and therefore a different share of the penalty, because the penalty is computed from all of the coefficients together; standardizing first puts every predictor on the same scale so the penalty treats them comparably.
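
A hedged sketch of the usual workflow with a scikit-learn Pipeline (my library choice; the data is simulated):

from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=5, random_state=0)

# The scaler is fit first, so the penalty sees every predictor on the same scale.
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0)).fit(X, y)
print(model.named_steps["ridge"].coef_)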

68
Q

What is the major disadvantage of ridge regression?

A

The model still includes all p predictors. This may not be a problem for prediction accuracy, but it can make the model harder to interpret.

69
Q

When is ridge regression expected to perform better than lasso?

A

When the response is a function of many predictors and all of them are significant.

70
Q

PCR

A

It involves constructing M principal components and using these components as the predictors in a linear regression model fit using least squares.
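
A hedged sketch of PCR as a pipeline (my library choice; M = 3 components and the simulated data are arbitrary illustrations):

from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=10, random_state=0)

# Standardize, project onto M = 3 principal components, then fit by least squares.
pcr = make_pipeline(StandardScaler(), PCA(n_components=3), LinearRegression()).fit(X, y)
print(pcr.predict(X[:5]))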

71
Q

Since PCR uses fewer features in the model, can it be called a feature selection technique?

A

No, because each component is a linear combination of all p original predictors, so no predictor is actually excluded.

72
Q

Do we need to standardize the predictor before applying PCR?

A

Yes, standardization ensures that all variables are on the same scale.

73
Q

Is PCR an unsupervised technique?

A

No,
PCR is considered a supervised technique because it uses the principal components (obtained through an unsupervised method) to perform a regression task.

74
Q

Partial least square

A

PLS is a supervised alternative to PCR.
It is a dimension-reduction method that identifies linear combinations of the original features and then fits a linear model to them using the least squares method.

Unlike PCR, PLS uses the response when constructing the components: each predictor's weight is set proportional to its simple least-squares coefficient, which is in turn proportional to the correlation between the response and that predictor.
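
A hedged sketch with scikit-learn's PLSRegression (my library choice; the number of components and the simulated data are illustrative):

from sklearn.cross_decomposition import PLSRegression
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=200, n_features=10, random_state=0)

pls = PLSRegression(n_components=3).fit(X, y)   # components are built using y
print(pls.predict(X[:5]).ravel())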

75
Q

What is the problem with the high dimensional data?

A

When the number of predictors approaches or exceeds the number of observations, a standard technique such as least squares will fit the training data perfectly (zero residuals) regardless of whether the predictors are truly related to the response, i.e. it drastically overfits.

76
Q

Assumptions of Linear Regression

A
  1. Linear relationship b/w response and each predictor variable.
  2. No Multicollinearity
  3. Normality of residual
  4. Homoscedasticity
  5. No autocorrelation of residuals
77
Q

Why is multicollinearity bad?

A

Y = B0 + B1X1 + B2X2 + … + BpXp

Each coefficient describes the relationship between the response and that predictor variable while the other predictor variables are held constant.
But in the presence of multicollinearity, if two or more variables are related, changing one variable changes the others as well, so the coefficient of that predictor variable is no longer a true measure of the linear relationship b/w the response and that predictor variable.

78
Q

What is Homoscedasticity?

A

Homo - Same
Scedasticity - Scatter (spread)

Having the same scatter

Checked by plotting the residuals against the fitted values (Y_pred).

79
Q

ACF of the residuals

A

Ideally, the ACF of the residuals shows no significant correlation at any lag.

80
Q

How to check Normality of residuals?
(Assumption 3)

A
  1. Q-Q Plot
  2. Violin Plot
  3. Histogram
  4. Jarque-Bera test
  5. Shapiro-Wilk test
(A SciPy sketch of the last two follows.)
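
A hedged sketch of the last two tests with SciPy (my library choice; the residuals here are simulated stand-ins):

import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
residuals = rng.normal(size=300)         # stand-in for model residuals

print(stats.shapiro(residuals))          # Shapiro-Wilk: statistic, p-value
print(stats.jarque_bera(residuals))      # Jarque-Bera: statistic, p-value
# Large p-values fail to reject normality of the residuals.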
81
Q

Is it always important to remove multicollinearity?

A
  • When you care about how much each individual feature (rather than a group of features) affects the target variable, removing multicollinearity may be a good option.
  • If multicollinearity does not involve the features you are interested in, then it may not be a problem.