Assumptions of linear regression Flashcards

1
Q

Linearity - What is it?

A

Linearity in linear regression means the relationship between the independent variables (predictors) and the dependent variable (target) is linear: a one-unit change in a predictor changes the expected value of the target by a constant amount (its coefficient), regardless of the predictor's level.

2
Q

Linearity - Why is it needed?

A

Linearity is needed because ordinary least squares fits a straight-line (hyperplane) relationship by construction. If the true relationship is non-linear, the fitted coefficients are biased estimates of the true effects and the model systematically over- or under-predicts across parts of the predictor range.

3
Q

Linearity - What if it fails?

A

If linearity fails, the model will underfit the data, leading to poor predictions. The model may miss important non-linear patterns in the data.

4
Q

Linearity - How to detect it?

A

Detect linearity by plotting the dependent variable against each independent variable (scatterplots). Look for non-linear patterns or curves in the data.
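
A minimal sketch of the scatterplot check (assuming a pandas DataFrame df with a target column "y" and predictor columns "x1" and "x2"; the names are hypothetical):

import matplotlib.pyplot as plt

for col in ["x1", "x2"]:
    plt.scatter(df[col], df["y"], alpha=0.5)   # hypothetical column names
    plt.xlabel(col)
    plt.ylabel("y")
    plt.title(f"y vs. {col}: look for curvature")
    plt.show()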

5
Q

Linearity - Remedies

A

Transform predictors or the target variable using non-linear transformations like log, square root, or polynomial terms. Example: Use np.log(X) or X**2 in Python.
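
A minimal sketch of such transformations (assuming a 1-D NumPy array X of positive predictor values and a target array y; both are hypothetical):

import numpy as np
import statsmodels.api as sm

X_log = sm.add_constant(np.log(X))                     # log transform (X must be positive)
X_poly = sm.add_constant(np.column_stack([X, X**2]))   # add a quadratic term

model_log = sm.OLS(y, X_log).fit()
model_poly = sm.OLS(y, X_poly).fit()
print(model_log.rsquared, model_poly.rsquared)         # compare fits informally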

6
Q

Multicollinearity - What is it?

A

Multicollinearity occurs when two or more independent variables in a regression model are highly correlated, making it difficult to isolate their individual effects on the target.

7
Q

Multicollinearity - Why is it needed?

A

The assumption of no (or low) multicollinearity is needed so that each predictor contributes unique information to the model. When predictors are highly correlated, the estimated coefficients become unstable and hard to interpret.

8
Q

Multicollinearity - What if it fails?

A

If the assumption fails (i.e., multicollinearity is present), the coefficient estimates have inflated standard errors and can swing wildly with small changes in the data, leading to unreliable p-values and confidence intervals.

9
Q

Multicollinearity - How to detect it?

A

Detect multicollinearity using the Variance Inflation Factor (VIF). A VIF greater than 5 or 10 indicates high multicollinearity.
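
A minimal sketch of the VIF check (assuming a pandas DataFrame X containing only numeric predictor columns; hypothetical):

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X_const = sm.add_constant(X)                           # add an intercept column first
vif = pd.Series(
    [variance_inflation_factor(X_const.values, i) for i in range(X_const.shape[1])],
    index=X_const.columns,
)
print(vif)                                             # ignoring the constant, predictors above ~5-10 are suspect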

10
Q

Multicollinearity - Remedies

A

Remove or combine correlated variables. Use dimensionality reduction techniques like PCA. Example: from statsmodels.stats.outliers_influence import variance_inflation_factor to calculate VIF.
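
A minimal sketch of both remedies (assuming the predictor DataFrame X and the vif Series from the detection card; hypothetical):

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Option 1: drop the predictor with the highest VIF and refit.
X_reduced = X.drop(columns=[vif.drop("const").idxmax()])

# Option 2: replace the correlated predictors with a few principal components.
X_scaled = StandardScaler().fit_transform(X)
X_pca = PCA(n_components=2).fit_transform(X_scaled)    # the number of components is a modelling choice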

11
Q

Homoscedasticity - What is it?

A

Homoscedasticity means the residuals (errors) in a regression model have constant variance across all levels of the predictor variables.

12
Q

Homoscedasticity - Why is it needed?

A

It is needed so that the ordinary standard errors, confidence intervals, and p-values are reliable. Heteroscedasticity does not bias the OLS coefficient estimates themselves, but it makes them inefficient and the usual inference invalid.

13
Q

Homoscedasticity - What if it fails?

A

If homoscedasticity fails (heteroscedasticity), the model’s standard errors become unreliable, leading to misleading p-values and confidence intervals.

14
Q

Homoscedasticity - How to detect it?

A

Check it by plotting residuals against fitted (predicted) values: a cone or funnel shape indicates heteroscedasticity. Confirm with a statistical test such as the Breusch-Pagan test.
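
A minimal sketch of both checks (assuming a fitted statsmodels OLS result named model, as in the earlier cards; hypothetical):

import matplotlib.pyplot as plt
from statsmodels.stats.diagnostic import het_breuschpagan

plt.scatter(model.fittedvalues, model.resid, alpha=0.5)   # a funnel/cone shape suggests heteroscedasticity
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()

lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(model.resid, model.model.exog)
print(lm_pvalue)                                           # a small p-value suggests heteroscedasticity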

15
Q

Homoscedasticity - Remedies

A

Transform the dependent variable (e.g., log(y)), use Weighted Least Squares (WLS), or apply robust standard errors. Example: model = sm.OLS(y, X).fit(cov_type='HC3').
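
A minimal sketch of these remedies (assuming a positive-valued target y and a design matrix X that already includes a constant; hypothetical):

import numpy as np
import statsmodels.api as sm

robust_model = sm.OLS(y, X).fit(cov_type="HC3")    # heteroscedasticity-robust (HC3) standard errors
log_model = sm.OLS(np.log(y), X).fit()             # or stabilize the variance by modelling log(y)
print(robust_model.summary())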

16
Q

Normality of Residuals - What is it?

A

Normality of residuals means the residuals (errors) of the regression model are approximately normally distributed. The assumption matters mainly for inference in small samples; in large samples the central limit theorem makes the usual test statistics approximately valid even without it.

17
Q

Normality of Residuals - Why is it needed?

A

It is needed for valid hypothesis tests (t-tests, F-tests), confidence intervals, and prediction intervals, particularly when the sample size is small.

18
Q

Normality of Residuals - What if it fails?

A

If normality fails, hypothesis tests (e.g., t-tests, F-tests) and prediction intervals become unreliable in small samples; the coefficient estimates themselves remain unbiased.

19
Q

Normality of Residuals - How to detect it?

A

Detect normality using a Q-Q plot (quantile-quantile plot) or statistical tests like the Shapiro-Wilk test.
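
A minimal sketch of both checks (assuming a fitted statsmodels OLS result named model; hypothetical):

import matplotlib.pyplot as plt
import statsmodels.api as sm
from scipy.stats import shapiro

sm.qqplot(model.resid, line="45", fit=True)        # points should lie close to the reference line
plt.show()

stat, p_value = shapiro(model.resid)
print(p_value)                                     # a small p-value suggests non-normal residuals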

20
Q

Normality of Residuals - Remedies

A

Transform the dependent variable (e.g., log(y)), use non-linear models, or increase sample size. Example: from scipy.stats import shapiro for the Shapiro-Wilk test.
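
A minimal sketch of the log-transform remedy (assuming a positive-valued target y and a design matrix X with a constant; hypothetical):

import numpy as np
import statsmodels.api as sm
from scipy.stats import shapiro

log_model = sm.OLS(np.log(y), X).fit()             # model log(y) instead of y (y must be positive)
stat, p_value = shapiro(log_model.resid)
print(p_value)                                     # re-check residual normality after the transform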

21
Q

No Autocorrelation - What is it?

A

No autocorrelation means the residuals of the regression model are not correlated with each other, especially in time-series data.

22
Q

No Autocorrelation - Why is it needed?

A

It is needed to ensure that the residuals are independent, which is crucial for valid hypothesis testing and reliable predictions.

23
Q

No Autocorrelation - What if it fails?

A

If the assumption fails (i.e., the residuals are autocorrelated), the model's standard errors are typically underestimated, leading to overly optimistic p-values and confidence intervals.

24
Q

No Autocorrelation - How to detect it?

A

Detect autocorrelation using the Durbin-Watson test or by plotting residuals over time (e.g., a time-series plot).
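
A minimal sketch of both checks (assuming a fitted statsmodels OLS result named model on time-ordered observations; hypothetical):

import matplotlib.pyplot as plt
from statsmodels.stats.stattools import durbin_watson

print(durbin_watson(model.resid))                  # ~2 = little autocorrelation; <2 positive; >2 negative

plt.plot(model.resid)                              # residuals in time order; look for runs or waves
plt.xlabel("Observation (time order)")
plt.ylabel("Residual")
plt.show()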

25
Q

No Autocorrelation - Remedies

A

Use lagged variables, include time-based features, or switch to models designed for time-series data (e.g., ARIMA). Example: from statsmodels.stats.stattools import durbin_watson.
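
A minimal sketch of the lagged-variable remedy (assuming a time-ordered pandas DataFrame df with target column "y" and predictor column "x1"; the names are hypothetical):

import statsmodels.api as sm

df["y_lag1"] = df["y"].shift(1)                        # previous period's value of the target
df_lagged = df.dropna()                                # drop the first row created by the shift
X_lag = sm.add_constant(df_lagged[["x1", "y_lag1"]])
lag_model = sm.OLS(df_lagged["y"], X_lag).fit()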