Linear Regression Flashcards

1
Q

What is Linear Regression?

A

Predicting a quantitative response variable (Y) from a single predictor variable (X).

The coefficients are estimated with the Least Squares Method, which minimizes the Residual Sum of Squares (RSS).
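
A minimal sketch of the idea, assuming NumPy and a made-up toy data set (neither comes from the flashcards): fit Y = b0 + b1*X by least squares and compute the RSS that the method minimizes.

import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Closed-form least-squares estimates for Y = b0 + b1*X
b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
b0 = Y.mean() - b1 * X.mean()

residuals = Y - (b0 + b1 * X)
rss = np.sum(residuals ** 2)   # the quantity least squares minimizes
print(b0, b1, rss)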

2
Q

What is the Mean Squared Error (MSE)?

A

The RSS averaged over all the data points: MSE = RSS / n.

3
Q

Which statistic do we use in the case of Linear regression to check the significance of the predictor variable?

List out Ho and Ha.

A

t-Statistic

Ho: B1 = 0 (no relationship b/w “X” and “Y”).
Ha: B1 ≠ 0 (some relationship b/w “X” and “Y”).
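
A hedged sketch with statsmodels (the library choice and the simulated data are my assumptions; the flashcard only names the t-test): read off the slope's t-statistic and p-value.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=100)
Y = 2.0 + 1.5 * X + rng.normal(size=100)

results = sm.OLS(Y, sm.add_constant(X)).fit()
print(results.tvalues[1], results.pvalues[1])  # t-statistic and p-value for the slope
# A small p-value rejects Ho: B1 = 0 in favor of Ha: B1 != 0.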

4
Q

What questions should be asked before applying Linear Regression?

A
  1. Is there a relationship between the response and the predictor variable?
  2. If yes, how strong is the relationship?
  3. Is the relationship linear?
5
Q

What is the Error term (e) in the equation of linear regression?

A
  1. The true relationship may not be linear.
  2. Measurement error.

The error term captures what the model misses and is assumed to be independent of “X”.

6
Q

What is the meaning of a zero R2?

A

A value near zero indicates that the model is not able to explain any of the variability in the response.
This happens when SSR = SST (so R2 = 1 - SSR/SST = 0); the fit is only as good as taking the mean value of the response variable, without using the predictor variable at all.

7
Q

What is the significance of R2 in context of linear regression?

A

The R2 statistic is a measure of the linear relationship between X and Y, just like r.

In simple linear regression it can be shown that R2 = r2.

r = Corr(X,Y)

8
Q

Why can’t we fit several simple linear regressions instead of one multiple linear regression?

A
  1. It is unclear how to make a single prediction by combining several separate simple linear regressions.
  2. The predictor variables may be correlated with each other, but fitting separate simple linear regressions ignores the other predictors, effectively assuming they are completely independent of each other.
9
Q

Which statistic do we use in the case of Multiple Linear regression to check the significance of the predictor variable?

List out Ho and Ha.

A

F-Statistic

Ho: B1 = B2 = … = Bp = 0
Ha: at least one Bj ≠ 0

10
Q

What should the value of the F-statistic be if there is no relationship between the response and the predictor variables in multiple linear regression?

A

Value close to 1

11
Q

What should the value of the F-statistic be if there is a relationship between the response and the predictor variables in multiple linear regression?

A

Value greater than 1

12
Q

Given the individual p values for each variable, why do we need to look at the overall F-statistic?

A

Because when the number of predictor variables (p) is large, each one has roughly a 5% chance of showing a p-value below 0.05 purely by chance, so with many predictors it becomes very likely that at least one appears significant even when none truly is. The F-statistic does not suffer from this problem, because it adjusts for the number of predictors.

13
Q

What happens if the number of predictor variables is greater than the number of data points?

A

In that case we cannot fit multiple linear regression by least squares, and we cannot use the F-statistic.

14
Q

What are the ways of doing variable selection in the context of linear regression?

A
  1. Forward Selection - Start with the null model (no predictors). Repeatedly add the predictor that results in the lowest RSS, until a stopping rule is reached.
  2. Backward Selection - Start with all the predictors. Repeatedly remove the predictor with the largest p-value, until a stopping rule is reached.
  3. Mixed Selection - Start with the null model and add predictors as in forward selection, removing any predictor whose p-value grows too large; alternate these forward and backward steps until a stopping rule is reached. (A scikit-learn sketch follows this list.)
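
A hedged sketch with scikit-learn's SequentialFeatureSelector (my choice of library; note that it adds or removes features based on cross-validated score rather than RSS or p-values, so it is a close relative of the procedures above rather than an exact implementation).

from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=200, n_features=10, n_informative=4, random_state=0)

# Forward: start from nothing and grow; backward: start from everything and prune.
forward = SequentialFeatureSelector(LinearRegression(), n_features_to_select=4,
                                    direction="forward").fit(X, y)
backward = SequentialFeatureSelector(LinearRegression(), n_features_to_select=4,
                                     direction="backward").fit(X, y)
print(forward.get_support(), backward.get_support())  # boolean masks of the kept features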
15
Q

What is the relationship between R2 and correlation in multiple linear regression?

A

R2 = [Corr(Y, Y_pred)]^2, i.e. the squared correlation between the observed and fitted values of the response.

16
Q

Why do we need variable selection, given that R2 increases when more variables are added, even if the variables are only weakly associated with the response?

A

To prevent overfitting

17
Q

How can RSE increase when an extra variable is added to the model, given that RSS decreases?

A

A model with more variables can have a higher RSE if the decrease in RSS is small relative to the increase in p (the number of predictor variables):

RSE = Sqrt[RSS/(n-p-1)]

18
Q

What is the additive assumption of linear regression model?

A

The effect of a change in predictor Xj on the response Y is independent of the values of the other predictors.

19
Q

What is the linear assumption of linear regression model?

A

The change in the response Y due to a one-unit change in Xj is constant, regardless of the value of Xj.

20
Q

What is synergy effect or interaction effect?

A

The effect of one predictor on the response depends on the value of another predictor.

With synergy, the combined effect of two variables is more than the sum of their individual effects.

21
Q

If we are including an interaction term in a model, should we include the main term also, if the p-value for the main term is not significant?

A

Yes. By the hierarchical principle, if an interaction term is included, the corresponding main effects should also be included, even when their p-values are not significant.

22
Q

Explain polynomial regression in context of multiple linear regression.

A

Polynomial regression is a special case of multiple linear regression in which the additional predictors are polynomial functions (X2, X3, …) of an original predictor; the model remains linear in the coefficients.

23
Q

How to identify nonlinearity in the data?
(Assumption 1)

A

Residual Plots

For SLR - Residuals vs Predictor
For MLR - Residuals vs Fitted Value

There should not be any pattern evident in the residual plot.
An ideal residual plot shows no trend in the residuals, no outliers, and no changing variance across the fitted values.
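
A hedged sketch of a residual plot (matplotlib/scikit-learn and the simulated data are my assumptions): a clear pattern in residuals vs fitted values signals nonlinearity.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(200, 1))
y = 3 + 0.5 * X[:, 0] ** 2 + rng.normal(size=200)   # truly nonlinear relationship

model = LinearRegression().fit(X, y)
fitted = model.predict(X)
residuals = y - fitted

plt.scatter(fitted, residuals)   # MLR: residuals vs fitted values
plt.axhline(0, color="red")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()                       # a U-shaped pattern here indicates nonlinearity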

24
Q

What to do when nonlinearity is present in the data? (In the context of regression)

A

Apply a nonlinear transformation to the predictors, such as log(X), X2, or sqrt(X).
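
A minimal sketch of such transformations, assuming NumPy/scikit-learn and a made-up predictor column:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[1.0], [2.0], [4.0], [8.0]])

X_log = np.log(X)    # log(X)
X_sqrt = np.sqrt(X)  # sqrt(X)
X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)  # [X, X^2]
print(X_poly)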

25
Q

Heteroscedasticity

A

Non-constant variance in the residuals.

The variance of the error term is not constant across observations.

26
Q

How to tackle heteroscedasticity?

A

Transform the response Y using a concave function such as log(Y) or sqrt(Y).

Weighted Least Squares.
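
A hedged sketch of both remedies with statsmodels (the library, the simulated data, and the particular weights are my assumptions):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = rng.uniform(1, 10, 200)
y = 2 + 3 * x + rng.normal(scale=0.5 * x, size=200)   # error spread grows with x
X = sm.add_constant(x)

log_fit = sm.OLS(np.log(y), X).fit()                   # remedy 1: model log(Y)
wls_fit = sm.WLS(y, X, weights=1.0 / x**2).fit()       # remedy 2: down-weight noisy points
print(log_fit.params, wls_fit.params)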

27
Q

Outlier

A

Outliers are data points that deviate significantly from the expected patterns or values.

In regression, outliers have unusual (extreme) values of the response variable given their predictor values.

They can arise due to measurement errors, data entry mistakes, or genuine extreme values.

28
Q

High Leverage Point

A

Leverage points are data points that have a significant impact on the estimated regression coefficients. These points can distort the regression line.

Unlike outliers, leverage points have extreme values in their predictors.

29
Q

How to identify outliers?

A

Box plot
Residual Plots
Scatter Plots

30
Q

How to detect leverage point?

A

Leverage statistic
Cook’s Distance

31
Q

Studentized Residuals

A

Computed by dividing each residual by its estimated standard error (std. deviation).

For externally studentized residuals, each data point i is deleted in turn and the regression model is re-estimated with the remaining data points; the residual for point i and its standard deviation are then computed from that fit.

Observations with |studentized residual| > 3 are possible outliers.
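
A hedged sketch with statsmodels (library choice, simulated data, and the injected outlier are my assumptions):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = rng.normal(size=100)
y = 1 + 2 * x + rng.normal(size=100)
y[0] += 8                                    # inject an outlier

results = sm.OLS(y, sm.add_constant(x)).fit()
student = results.get_influence().resid_studentized_external
print(np.where(np.abs(student) > 3)[0])      # indices of possible outliers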

32
Q

What is collinearity?

A

A situation in which two or more predictor variables are closely related to each other.

33
Q

What problems does multicollinearity pose to the model?

A

Multicollinearity reduces the accuracy of the estimates of the regression coefficients: they become unstable, less reliable, and difficult to interpret, which reduces the interpretability of the model.

It also reduces the power of the hypothesis test for the significance of a predictor.

34
Q

How to detect collinearity?

A

Correlation matrix of the predictors (look for pairs with high absolute correlation).

35
Q

What is Multicollinearity?

A

Collinearity that exists among three or more independent variables simultaneously, even when no single pair of variables has a particularly high correlation.

36
Q

How to detect Multicollinearity?

A

Variance Inflation Factor (VIF) - It is computed by taking each predictor variable in turn and regressing it against every other predictor.

VIF = 1/(1 - R2),
where R2 measures how well that independent variable is explained by the other variables.

VIF = 1: no multicollinearity
VIF > 5: high multicollinearity
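
A hedged sketch with statsmodels' variance_inflation_factor (my library choice; the data is simulated so that x1 and x2 are nearly collinear):

import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(4)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)      # nearly collinear with x1
x3 = rng.normal(size=200)
# Include a constant so each auxiliary regression has an intercept.
X = pd.DataFrame({"const": 1.0, "x1": x1, "x2": x2, "x3": x3})

vif = {col: variance_inflation_factor(X.values, i)
       for i, col in enumerate(X.columns) if col != "const"}
print(vif)                                     # x1 and x2 should show VIF >> 5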

37
Q

How to tackle Multicollinearity?

A
  1. Drop the variable iteratively (start with the variable having the largest VIF).
  2. Combine the collinear variable together into a single predictor.
  3. Use a dimensionality reduction technique (such as PCA).
38
Q

What are the two ways in which you can get the best fit line for SLR?

A
  1. OLS - closed-form solution
  2. Gradient Descent - iterative (non-closed-form) solution
A sketch of both follows.
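
A hedged sketch of both routes, assuming NumPy and simulated data (the learning rate and iteration count are illustrative):

import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(size=200)
y = 4 + 3 * x + rng.normal(size=200)
X = np.column_stack([np.ones_like(x), x])      # design matrix [1, x]

# 1. OLS: closed-form solution of the normal equations
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# 2. Gradient descent on the MSE loss
beta = np.zeros(2)
lr = 0.1
for _ in range(2000):
    grad = 2 / len(y) * X.T @ (X @ beta - y)   # gradient of the mean squared error
    beta -= lr * grad

print(beta_ols, beta)                          # both approach [4, 3]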
39
Q

Advantage & Disadvantage of MAE

A

Advantage:
1. Interpretable Unit (Same as response)
2. Robust to outliers

Disadvantage:
1. Not differentiable at 0

40
Q

Advantage & Disadvantage of MSE

A

Advantage:
1. Differentiable at 0

Disadvantage:
1. Not robust to outliers
2. Unit is not directly interpretable (squared units of the response)

41
Q

Advantage & Disadvantage of RMSE

A

Advantage:
1. Differentiable at 0
2. Interpretable unit (same as the response)

Disadvantage:
1. Not so robust to outliers

42
Q

Can R2 be negative?

A

Yes, on the test data.

Since the model is built on training data and tested on test data, SSR may exceed SST.
In the context of linear regression, SST is computed around the mean of the response values; if the fitted line predicts the test data worse than simply using that mean, SSR > SST and R2 = 1 - SSR/SST becomes negative.

43
Q

Disadvantage of R2 score

A

It doesn’t account for the number of features: R2 never decreases (and usually increases) as more features are added.

This suggests that a model with more features is always better than a model with fewer features, but that’s not always the case.

44
Q

Why should we choose Gradient descent over OLS for multiple regression?

A

This is because the OLS closed-form solution for multiple regression requires inverting the p x p matrix X^T X, which costs roughly O(p^3) with standard algorithms (about O(p^2.373) with the fastest known ones) and becomes expensive when the number of features is large; gradient descent avoids this inversion.

45
Q

Deviance

A

A goodness-of-fit statistic.

It is a generalization of the idea of using the sum of squares of residuals (SSR) in OLS to cases where model-fitting is achieved by maximum likelihood.

46
Q

How are the coefficients affected in case of Ridge regression?

A

As α increases, the coefficients shrink towards zero but never become exactly zero.

47
Q

How are the coefficients affected in case of Lasso regression?

A

As α increases, the coefficients shrink towards zero and eventually some become exactly zero.

48
Q

Are all coefficients affected equally in the case of Ridge regression?

A

All coefficients are shrunk by roughly the same proportion.
Therefore, coefficients with larger values are reduced more in absolute terms.

49
Q

Are all coefficients affected equally in the case of Lasso regression?

A

Each coefficient shrinks towards zero by roughly a constant amount, so smaller coefficients reach zero first.

50
Q

How does bias-variance trade-off happen in Ridge and Lasso?

A

As α increases,
bias increases and
variance decreases.

Choose α near the point where the bias and variance curves intersect (perhaps just before the intersection).

51
Q

Why is Ridge regression called so?

A

Ridge regression eliminates the ridge that collinearity forms in the likelihood function.

52
Q

When to use Ridge and Lasso?

A

When the goal is feature selection and a more interpretable model - use Lasso.

When the goal is to reduce the impact of less important features while still keeping all of them - use Ridge.

53
Q

What is Ridge and Lasso regression?

A

Ridge and Lasso are regularization techniques that add a penalty term to the loss function of linear regression, shrinking the coefficients of the predictor variables.

In Ridge, the penalty term is the sum of the squared coefficient values multiplied by a tuning parameter (lambda).

In Lasso, the penalty term is the sum of the absolute coefficient values multiplied by the tuning parameter (lambda).
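
A hedged sketch with scikit-learn (my library choice; note that sklearn calls the tuning parameter alpha rather than lambda, and the data is simulated):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)   # penalty: alpha * sum(coef**2)
lasso = Lasso(alpha=1.0).fit(X, y)   # penalty: alpha * sum(|coef|)

print(np.sum(ridge.coef_ == 0))      # 0  -- ridge shrinks but does not zero out
print(np.sum(lasso.coef_ == 0))      # >0 -- lasso produces a sparse model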

54
Q

Why does lasso regularization create sparsity while ridge does not?

A

In the closed-form expression for a coefficient (e.g. with orthonormal predictors):

In lasso, the λ term is subtracted in the numerator (soft-thresholding), so a coefficient can be pushed exactly to zero.
In ridge, the λ term sits in the denominator, so the coefficient is scaled down but never reaches exactly zero.

55
Q

What is Elastic net regression?

A

A combination of ridge and lasso.

L = MSE + a*||w||_1 + b*||w||_2^2

λ = a + b
l1_ratio = a/(a+b)
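
A hedged sketch with scikit-learn's ElasticNet (my library choice; sklearn's exact penalty differs from the a/b form above by constant factors, but alpha plays the role of the overall strength λ and l1_ratio the share of the L1 penalty):

from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet

X, y = make_regression(n_samples=200, n_features=10, n_informative=3, random_state=0)

enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)   # 50/50 mix of L1 and L2
print(enet.coef_)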

56
Q

Best Subset selection

A

Fit all models with one predictor, all models with two predictors, all models with three predictors, and so on.

Total = 2^p models.

For each number of predictors, the model with the lowest RSS is selected. Then, among these best models of different sizes, the final choice is made using AIC, BIC, Cp, or Adjusted R2.

Computationally expensive.

57
Q

Forward stepwise selection

A

Fits 1 + p(p+1)/2 models.

Starts with the null model; predictors are added one at a time (at each step, the addition that improves the fit the most).

Computationally efficient.
The best model is not guaranteed, since we are not exploring all options.
Can also be applied when n < p.

58
Q

Backward stepwise selection

A

Fits 1 + p(p+1)/2 models.

Starts with the full model; the least useful predictor is removed iteratively (at each step, keep the reduced model with the lowest RSS or highest R2).

The single best model is then selected using AIC, BIC, Adjusted R2, or cross-validation error.

Computationally efficient.
The best model is not guaranteed, since we are not exploring all options.

59
Q

Why training set RSS and R2 cannot be used to evaluate the performance of a model?

A

Because every additional predictor variable decreases the training RSS and increases the training R2, even when the variable is unrelated to the response; selecting models on these metrics therefore favors overfitting.

60
Q

Mallow’s Cp

A

It adds a penalty to the training RSS in order to adjust for the fact that the training error tends to underestimate the test error.

This penalty increases as the number of predictors in the model increases.

The best model is the one with the lowest Cp.

61
Q

AIC

A

AIC = -2 * log(L) + 2 * k
Finds a model that maximizes the likelihood of the data while taking into account the number of parameters used.
By incorporating both the likelihood (measures how well the model fits the data) and the number of parameters, AIC strikes a balance between model fit and complexity.

62
Q

BIC

A

BIC = -2 * log(L) + k * log(n)
It imposes a stronger penalty for model complexity than AIC, and a heavier penalty (than Cp) on models with many variables.
Therefore, BIC tends to favor simpler models than AIC does.

63
Q

Adjusted R2

A

1 - [RSS/(n-d-1)]/[TSS/(n-1)]

The intuition behind the adjusted R2 is that once all the correct variables have been included in the model, adding additional noise variables leads to only a very small decrease in RSS while increasing d, so [RSS/(n-d-1)] increases and, consequently, the adjusted R2 decreases.

Therefore, the model with the largest adjusted R2 will have all the correct variables and no noise variables.

64
Q

Why should validation and cross validation be preferred over model performance metrics like AIC, BIC, adjusted R2 for model evaluation?

A

This is because validation gives a direct estimate of the test error, rather than making assumptions in order to provide an indirect estimate.

Metrics like AIC and BIC penalize the number of parameters and tend to favor models with fewer parameters.

65
Q

Is shrinkage penalty also applied to the intercept term?

A

No.

The goal is to shrink the association of each predictor with the response, whereas the intercept is simply the mean value of the response when all the predictors equal 0.

66
Q

As α increases, can an individual coefficient increase?

A

Yes.

While the overall size of the coefficients decreases as α increases, an individual coefficient may increase.

67
Q

Are regularization techniques like ridge and lasso applied before standardization?

A

No, after standardization.

The same variable measured on a different scale would receive a different coefficient and therefore a different share of the penalty, because the penalty is computed from all of the coefficients together; standardizing first puts every predictor on the same scale so the penalty treats them comparably.
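
A hedged sketch of the usual workflow with a scikit-learn Pipeline (my library choice; the data is simulated):

from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=5, random_state=0)

# The scaler is fit first, so the penalty sees every predictor on the same scale.
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0)).fit(X, y)
print(model.named_steps["ridge"].coef_)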

68
Q

What is the major disadvantage of ridge regression?

A

The model still includes all p predictors. This may not be a problem for prediction accuracy, but it can make the model harder to interpret.

69
Q

When is ridge regression expected to perform better than lasso?

A

When the response is a function of many predictors and all of them are significant.

70
Q

PCR

A

It involves constructing M principal components and using these components as the predictors in a linear regression model fit using least squares.
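
A hedged sketch of PCR as a pipeline (my library choice; M = 3 components and the simulated data are arbitrary illustrations):

from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=10, random_state=0)

# Standardize, project onto M = 3 principal components, then fit by least squares.
pcr = make_pipeline(StandardScaler(), PCA(n_components=3), LinearRegression()).fit(X, y)
print(pcr.predict(X[:5]))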

71
Q

Since PCR uses fewer features in the model, can it be called a feature selection technique?

A

No, because each component is a linear combination of all p original predictors, so no predictor is actually excluded.

72
Q

Do we need to standardize the predictor before applying PCR?

A

Yes, standardization ensures that all variables are on the same scale.

73
Q

Is PCR an unsupervised technique?

A

No,
PCR is considered a supervised technique because it uses the principal components (obtained through an unsupervised method) to perform a regression task.

74
Q

Partial least square

A

PLS is a supervised alternative to PCR.
It is a dimension-reduction method that identifies linear combinations of the original features and then fits a linear model to them using the least squares method.

Unlike PCR, PLS uses the response when constructing the components: each predictor's weight is set proportional to its simple least-squares coefficient, which is in turn proportional to the correlation between the response and that predictor.
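
A hedged sketch with scikit-learn's PLSRegression (my library choice; the number of components and the simulated data are illustrative):

from sklearn.cross_decomposition import PLSRegression
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=200, n_features=10, random_state=0)

pls = PLSRegression(n_components=3).fit(X, y)   # components are built using y
print(pls.predict(X[:5]).ravel())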

75
Q

What is the problem with the high dimensional data?

A

When the number of predictors approaches or exceeds the number of observations, a standard technique such as least squares will fit the training data perfectly (zero residuals) regardless of whether the predictors are truly related to the response, i.e. it drastically overfits.

76
Q

Assumptions of Linear Regression

A
  1. Linear relationship b/w response and each predictor variable.
  2. No Multicollinearity
  3. Normality of residual
  4. Homoscedasticity
  5. No autocorrelation of residuals
77
Q

Why is multicollinearity bad?

A

Y = B0 + B1X1 + B2X2 + … + BpXp

Each coefficient describes the relationship between the response and that predictor variable while the other predictor variables are held constant.
But in the presence of multicollinearity, if two or more variables are related, changing one variable changes the others as well, so the coefficient of that predictor variable is no longer a true measure of the linear relationship b/w the response and that predictor variable.

78
Q

What is Homoscedasticity?

A

Homo - Same
Scedasticity - Scatter (spread)

Having the same scatter

Checked by plotting the residuals against the fitted values (Y_pred).

79
Q

ACF of the residuals

A

Ideally, the ACF of the residuals shows no significant correlation at any lag.

80
Q

How to check Normality of residuals?
(Assumption 3)

A
  1. Q-Q Plot
  2. Violin Plot
  3. Histogram
  4. Jarque-Bera test
  5. Shapiro-Wilk test
(A SciPy sketch of the last two follows.)
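
A hedged sketch of the last two tests with SciPy (my library choice; the residuals here are simulated stand-ins):

import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
residuals = rng.normal(size=300)         # stand-in for model residuals

print(stats.shapiro(residuals))          # Shapiro-Wilk: statistic, p-value
print(stats.jarque_bera(residuals))      # Jarque-Bera: statistic, p-value
# Large p-values fail to reject normality of the residuals.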
81
Q

Is it always important to remove multicollinearity?

A
  • When you care about how much each individual feature (rather than a group of features) affects the target variable, removing multicollinearity may be a good option.
  • If multicollinearity does not involve the features you are interested in, then it may not be a problem.