Assumptions of linear regression Flashcards
Linearity - What is it?
Linearity in linear regression means the relationship between the independent variables (predictors) and the dependent variable (target) is linear. This implies that a change in a predictor leads to a proportional change in the target.
Linearity - Why is it needed?
Linearity is needed because linear regression assumes a linear relationship between predictors and the target. If this assumption is violated, the model will produce inaccurate predictions.
Linearity - What if it fails?
If linearity fails, the model will underfit the data, leading to poor predictions. The model may miss important non-linear patterns in the data.
Linearity - How to detect it?
Detect linearity by plotting the dependent variable against each independent variable (scatterplots). Look for non-linear patterns or curves in the data.
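A minimal sketch of this check, using hypothetical data where the target is linear in one predictor but quadratic in the other (column names x1, x2, y are made up for illustration):

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical data: y is linear in x1 but quadratic in x2
rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.normal(size=100), "x2": rng.normal(size=100)})
df["y"] = 2 * df["x1"] + df["x2"] ** 2 + rng.normal(scale=0.5, size=100)

# Scatter the target against each predictor; a visible curve (y vs x2 here)
# suggests the linearity assumption is violated for that predictor
fig, axes = plt.subplots(1, 2, figsize=(8, 3))
for ax, col in zip(axes, ["x1", "x2"]):
    ax.scatter(df[col], df["y"], s=10)
    ax.set_xlabel(col)
    ax.set_ylabel("y")
plt.tight_layout()
plt.show()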
Linearity - Remedies
Transform predictors or the target variable using non-linear transformations like log, square root, or polynomial terms. Example: use np.log(X) or X**2 in Python.
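A rough sketch of these remedies on hypothetical data; scikit-learn's LinearRegression and PolynomialFeatures are used here purely for illustration:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical positive predictor with a logarithmic relationship to the target
rng = np.random.default_rng(0)
X = rng.uniform(1, 10, size=(100, 1))
y = np.log(X[:, 0]) + rng.normal(scale=0.1, size=100)

X_log = np.log(X)                                        # log-transform a predictor
X_poly = PolynomialFeatures(degree=2).fit_transform(X)   # add X and X**2 terms

model = LinearRegression().fit(X_log, y)   # linear fit on the transformed feature
print(model.score(X_log, y))               # close to 1.0 once the curvature is modelled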
Multicollinearity - What is it?
Multicollinearity occurs when two or more independent variables in a regression model are highly correlated, making it difficult to isolate their individual effects on the target.
Multicollinearity - Why is it needed?
Strictly speaking, the assumption is the absence of multicollinearity: each predictor should contribute unique information to the model. When predictors overlap heavily, the coefficients become unstable and hard to interpret.
Multicollinearity - What if it fails?
If this assumption fails (i.e., predictors are highly correlated), the coefficients' standard errors become inflated, leading to unreliable p-values and confidence intervals.
Multicollinearity - How to detect it?
Detect multicollinearity using the Variance Inflation Factor (VIF). A VIF greater than 5 or 10 indicates high multicollinearity.
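A minimal sketch of a VIF check with statsmodels, on hypothetical data where x1 and x2 are nearly collinear:

import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
X = pd.DataFrame({"x1": x1,
                  "x2": x1 + rng.normal(scale=0.1, size=200),   # nearly a copy of x1
                  "x3": rng.normal(size=200)})

X_const = add_constant(X)   # add an intercept column before computing VIFs
vifs = pd.Series([variance_inflation_factor(X_const.values, i)
                  for i in range(1, X_const.shape[1])], index=X.columns)
print(vifs)   # x1 and x2 show very large VIFs; x3 stays close to 1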
Multicollinearity - Remedies
Remove or combine correlated variables, or use dimensionality-reduction techniques like PCA. Example: from statsmodels.stats.outliers_influence import variance_inflation_factor to calculate VIF.
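One of these remedies, sketched below on hypothetical data: replace correlated predictors with uncorrelated principal components (scikit-learn's PCA is used here as an example implementation):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
X = np.column_stack([x1, x1 + rng.normal(scale=0.1, size=200), rng.normal(size=200)])

X_scaled = StandardScaler().fit_transform(X)   # PCA is scale-sensitive, so standardise first
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)            # uncorrelated components replace the raw predictors
print(pca.explained_variance_ratio_)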
Homoscedasticity - What is it?
Homoscedasticity means the residuals (errors) in a regression model have constant variance across all levels of the predictor variables.
Homoscedasticity - Why is it needed?
It is needed so that standard errors, confidence intervals, and p-values are reliable. With non-constant variance the OLS coefficients remain unbiased, but they are no longer efficient and the usual standard errors are wrong.
Homoscedasticity - What if it fails?
If homoscedasticity fails (heteroscedasticity), the model’s standard errors become unreliable, leading to misleading p-values and confidence intervals.
Homoscedasticity - How to detect it?
Detect homoscedasticity by plotting residuals vs. predicted values. Look for patterns like cones or funnels. Use statistical tests like the Breusch-Pagan test.
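A minimal sketch combining the residual plot and the Breusch-Pagan test, on hypothetical data where the noise grows with the predictor:

import matplotlib.pyplot as plt
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(0)
x = rng.uniform(1, 10, size=200)
X = sm.add_constant(x)
y = 3 * x + rng.normal(scale=x, size=200)   # error variance grows with x

model = sm.OLS(y, X).fit()
plt.scatter(model.fittedvalues, model.resid, s=10)   # funnel shape suggests heteroscedasticity
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()

lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(model.resid, model.model.exog)
print(lm_pvalue)   # small p-value -> reject constant variance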
Homoscedasticity - Remedies
Transform the dependent variable (e.g., log(y)), use Weighted Least Squares (WLS), or apply robust standard errors. Example: model = sm.OLS(y, X).fit(cov_type='HC3').
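A rough sketch of the robust-standard-error and WLS remedies on hypothetical heteroscedastic data (the 1/x**2 weights are an assumed weighting scheme for illustration):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.uniform(1, 10, size=200)
X = sm.add_constant(x)
y = 3 * x + rng.normal(scale=x, size=200)   # error variance grows with x

robust = sm.OLS(y, X).fit(cov_type="HC3")      # keep OLS coefficients, correct the standard errors
wls = sm.WLS(y, X, weights=1.0 / x**2).fit()   # downweight the high-variance observations
print(robust.bse)
print(wls.params)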
Normality of Residuals - What is it?
Normality of residuals means the residuals (errors) of the regression model are approximately normally distributed. This matters most for small samples; with large samples, inference is approximately valid even when the residuals are not normal.
Normality of Residuals - Why is it needed?
It is needed for valid hypothesis testing (e.g., p-values and confidence intervals) and to ensure the model’s predictions are reliable.
Normality of Residuals - What if it fails?
If normality fails, hypothesis tests (e.g., t-tests, F-tests) and prediction intervals become unreliable, particularly in small samples; the coefficient estimates themselves remain unbiased.
Normality of Residuals - How to detect it?
Detect normality using a Q-Q plot (quantile-quantile plot) or statistical tests like the Shapiro-Wilk test.
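A minimal sketch of both checks on the residuals of a fitted statsmodels OLS model, using hypothetical data:

import matplotlib.pyplot as plt
import numpy as np
import statsmodels.api as sm
from scipy.stats import shapiro

rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(200, 1)))
y = 2 * X[:, 1] + rng.normal(size=200)
model = sm.OLS(y, X).fit()

sm.qqplot(model.resid, line="45", fit=True)   # points hugging the line suggest normal residuals
plt.show()

stat, p_value = shapiro(model.resid)   # small p-value -> residuals deviate from normality
print(stat, p_value)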
Normality of Residuals - Remedies
Transform the dependent variable (e.g., log(y)), use non-linear models, or increase the sample size. Example: from scipy.stats import shapiro for the Shapiro-Wilk test.
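A sketch of the log-transform remedy, assuming a strictly positive, right-skewed target (hypothetical data):

import numpy as np
import statsmodels.api as sm
from scipy.stats import shapiro

rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(200, 1)))
y = np.exp(1 + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=200))   # skewed, positive target

model_log = sm.OLS(np.log(y), X).fit()   # model the log of the target instead of y itself
print(shapiro(model_log.resid))          # residuals of the log model are much closer to normal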
No Autocorrelation - What is it?
No autocorrelation means the residuals of the regression model are not correlated with each other, especially in time-series data.
No Autocorrelation - Why is it needed?
It is needed to ensure that the residuals are independent, which is crucial for valid hypothesis testing and reliable predictions.
No Autocorrelation - What if it fails?
If this assumption fails (i.e., the residuals are autocorrelated), the model's standard errors are typically underestimated, leading to overly optimistic p-values and confidence intervals.
No Autocorrelation - How to detect it?
Detect autocorrelation using the Durbin-Watson test or by plotting residuals over time (e.g., a time-series plot).
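A minimal sketch of both checks on hypothetical time-ordered data with random-walk errors. A Durbin-Watson statistic near 2 indicates no autocorrelation; values near 0 or 4 suggest positive or negative autocorrelation:

import matplotlib.pyplot as plt
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(0)
t = np.arange(200)
y = 0.5 * t + np.cumsum(rng.normal(size=200))   # random-walk errors are strongly autocorrelated
X = sm.add_constant(t.astype(float))

model = sm.OLS(y, X).fit()
plt.plot(t, model.resid)   # long runs above/below zero indicate autocorrelation
plt.xlabel("Time")
plt.ylabel("Residual")
plt.show()

print(durbin_watson(model.resid))   # well below 2 for these data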
No Autocorrelation - Remedies
Use lagged variables, include time-based features, or switch to models designed for time-series data (e.g., ARIMA). Example: from statsmodels.stats.stattools import durbin_watson.
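A sketch of the lagged-variable remedy on the same kind of hypothetical series: add a one-step lag of the target as an extra predictor and re-check the Durbin-Watson statistic.

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(0)
t = np.arange(200)
y = pd.Series(0.5 * t + np.cumsum(rng.normal(size=200)))   # autocorrelated target

df = pd.DataFrame({"t": t.astype(float), "y_lag1": y.shift(1), "y": y}).dropna()
X = sm.add_constant(df[["t", "y_lag1"]])
model = sm.OLS(df["y"], X).fit()
print(durbin_watson(model.resid))   # typically much closer to 2 than the model without the lag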