Linear Regression & Regression Diagnostics Flashcards
Systematically introducing and reinforcing key concepts in Linear Regression, Regression Diagnostics, and Statistical Modeling
What is statistical modeling?
A process of using mathematical models to represent real-world data relationships.
It helps understand and predict outcomes.
Define inferential modeling.
The use of statistical models to draw inferences about a population and the relationships between its variables based on a sample, rather than purely to predict outcomes.
Common in regression analysis.
What is linear regression?
A statistical method used to model the relationship between a dependent variable and one or more independent variables.
With a single predictor, the equation is y = β₀ + β₁X + ε.
What is simple linear regression?
A linear regression model with only one independent variable.
Example: Predicting house price based on square footage.
What is multiple linear regression?
A linear regression model with two or more independent variables.
Example: Predicting sales based on advertising spend and product price.
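A minimal sketch of fitting such a model with StatsModels; the simulated data and the column names ad_spend, price, and sales are hypothetical, chosen to match the example above:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Simulated data with hypothetical columns: ad_spend, price, sales.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "ad_spend": rng.uniform(1, 10, 100),
    "price": rng.uniform(5, 20, 100),
})
df["sales"] = 3 + 2.5 * df["ad_spend"] - 0.8 * df["price"] + rng.normal(0, 1, 100)

X = sm.add_constant(df[["ad_spend", "price"]])  # adds the intercept column (β0)
y = df["sales"]

model = sm.OLS(y, X).fit()   # ordinary least squares fit
print(model.summary())       # coefficients, R², F-test, p-values, Durbin-Watson, etc.
```

Later sketches in this deck reuse df, X, y, and the fitted model from this example.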
What does the slope in linear regression represent?
The rate at which the dependent variable changes with respect to the independent variable.
A positive slope indicates a positive relationship.
Fill in the blank:
The y-intercept (β₀) represents __________.
The predicted value of y when x = 0.
Often not meaningful when x = 0 lies outside the range of the observed data.
What is the main assumption of simple linear regression?
That there is a linear relationship between the dependent variable and the independent variable.
Checked using scatter plots.
What does the error term (ε) in regression represent?
The difference between the observed and predicted values.
Its estimates from the fitted model are known as residuals.
What is the coefficient of determination (R²)?
A measure of how well the model explains the variance in the dependent variable.
R² values range from 0 to 1.
What is Adjusted R²?
A modified R² that adjusts for the number of predictors in the model.
Unlike R², it penalizes adding unnecessary predictors.
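As a sketch, assuming the fitted model from the earlier OLS example, both quantities are available as attributes:

```python
# R² and Adjusted R² of the fitted OLS result from the earlier sketch.
print(model.rsquared)      # proportion of variance in y explained by the model
print(model.rsquared_adj)  # R² penalized for the number of predictors
```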
What is residual analysis in regression?
The process of analyzing the differences between observed and predicted values.
Helps detect model issues.
What is a normal Q-Q plot used for?
Checking if residuals follow a normal distribution.
Points falling close to the 45-degree reference line suggest normality.
What is homoscedasticity?
When residuals have constant variance.
Checked using residual vs. fitted plots.
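A minimal sketch of both diagnostic plots, assuming the fitted model from the earlier example:

```python
import matplotlib.pyplot as plt
import statsmodels.api as sm

resid = model.resid          # residuals of the earlier OLS fit
fitted = model.fittedvalues  # predicted values

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(fitted, resid)   # residual vs. fitted: look for constant spread
axes[0].axhline(0, color="red")
axes[0].set_xlabel("Fitted values")
axes[0].set_ylabel("Residuals")

sm.qqplot(resid, line="45", fit=True, ax=axes[1])  # normal Q-Q plot of residuals
plt.tight_layout()
plt.show()
```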
What is heteroscedasticity?
When residual variance is not constant.
Indicates a violation of homoscedasticity.
How can you fix heteroscedasticity?
Common remedies include:
- transforming the dependent variable (e.g., log or square root)
- using weighted least squares regression (both sketched below).
Non-constant variance can distort regression results.
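A sketch of both remedies, reusing the hypothetical df, X, and y from the earlier example; the 1/ad_spend weights are purely illustrative:

```python
import numpy as np
import statsmodels.api as sm

# 1) Weighted least squares: down-weight observations assumed to have larger error variance.
wls_model = sm.WLS(y, X, weights=1.0 / df["ad_spend"]).fit()

# 2) Variance-stabilizing transformation (only applicable when y is strictly positive).
if (y > 0).all():
    log_model = sm.OLS(np.log(y), X).fit()
```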
What is multicollinearity?
When two or more independent variables are highly correlated.
Leads to unstable coefficient estimates.
How do you detect multicollinearity?
Using Variance Inflation Factor (VIF).
VIF > 5 or 10 suggests high multicollinearity.
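A sketch using StatsModels' variance_inflation_factor on the design matrix X from the earlier example (the constant column's VIF can be ignored):

```python
from statsmodels.stats.outliers_influence import variance_inflation_factor

vifs = {col: variance_inflation_factor(X.values, i)
        for i, col in enumerate(X.columns)}
print(vifs)  # predictor VIFs above ~5–10 suggest multicollinearity
```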
How can you fix multicollinearity?
Remove one of the correlated variables, combine them with PCA, or use ridge regression.
Keeping highly correlated features can distort model interpretation.
What is the normality assumption in regression?
Residuals should be normally distributed.
Checked using Q-Q plots or Shapiro-Wilk test.
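A sketch of the Shapiro-Wilk test on the residuals of the earlier fitted model (the Q-Q plot sketch appears above):

```python
from scipy.stats import shapiro

stat, p_value = shapiro(model.resid)
print(p_value)  # p < 0.05 suggests the residuals deviate from normality
```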
What is the Durbin-Watson test used for?
Detecting autocorrelation in regression residuals.
A value close to 2 suggests no autocorrelation.
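A sketch using StatsModels' durbin_watson on the residuals of the earlier fit (the statistic also appears in model.summary()):

```python
from statsmodels.stats.stattools import durbin_watson

dw = durbin_watson(model.resid)
print(dw)  # ~2: no autocorrelation; <2: positive; >2: negative autocorrelation
```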
What is Cook’s Distance?
A measure to identify influential outliers.
Points with Cook’s Distance >1 may be problematic.
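A sketch computing Cook's Distance from the earlier fitted model via its influence measures:

```python
influence = model.get_influence()      # OLS influence diagnostics
cooks_d, _ = influence.cooks_distance  # distances and their p-values
print((cooks_d > 1).sum())             # number of potentially influential points
```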
What is Ridge Regression?
A regression technique that adds an L2 penalty to shrink coefficients.
Helps in handling multicollinearity.
What is LASSO Regression?
A regression technique that adds an L1 penalty, shrinking some coefficients to zero.
Helps with feature selection.
What is Elastic Net Regression?
A mix of Ridge and LASSO regression.
Uses both L1 and L2 penalties.
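A sketch of all three penalized regressions with Scikit-learn, reusing the hypothetical df and y; the alpha values are illustrative, not tuned:

```python
from sklearn.linear_model import Ridge, Lasso, ElasticNet

features = df[["ad_spend", "price"]]  # scikit-learn fits the intercept itself

ridge = Ridge(alpha=1.0).fit(features, y)                    # L2 penalty shrinks coefficients
lasso = Lasso(alpha=0.1).fit(features, y)                    # L1 penalty can zero coefficients out
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(features, y)  # blend of L1 and L2
print(ridge.coef_, lasso.coef_, enet.coef_)
```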
What is the purpose of polynomial regression?
To model a non-linear relationship using higher-degree terms.
Example: y = β₀ + β₁X + β₂X² + ε.
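A sketch of a quadratic fit with StatsModels, using the hypothetical ad_spend column from the earlier example as X:

```python
import numpy as np
import statsmodels.api as sm

x = df["ad_spend"]
X_poly = sm.add_constant(np.column_stack([x, x**2]))  # columns: const, X, X²
poly_model = sm.OLS(y, X_poly).fit()
print(poly_model.params)  # β0, β1, β2
```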
When should you use logistic regression instead of linear regression?
When predicting a binary outcome.
Example: Predicting yes/no responses.
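A sketch of a logistic regression with StatsModels; the binary label purchased is fabricated from the earlier simulated data purely for illustration:

```python
import statsmodels.api as sm

# Hypothetical 0/1 outcome derived from the simulated data.
purchased = (df["sales"] > df["sales"].median()).astype(int)

logit_model = sm.Logit(purchased, X).fit()  # models P(purchased = 1)
print(logit_model.summary())
```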
What is cross-validation in regression?
A method to assess how well the model generalizes to new data.
Common technique: k-fold cross-validation.
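A sketch of 5-fold cross-validation with Scikit-learn on the hypothetical data:

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

features = df[["ad_spend", "price"]]
scores = cross_val_score(LinearRegression(), features, y, cv=5, scoring="r2")
print(scores.mean())  # average out-of-sample R² across the 5 folds
```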
How can overfitting be prevented in regression?
By using regularization techniques like Ridge or LASSO.
Overfitting leads to poor generalization.
What is the main goal of regression diagnostics?
To verify if model assumptions are met.
Helps improve model reliability.
What is the F-test used for in regression?
To check if at least one predictor variable is significantly contributing to the model.
A low p-value (<0.05) suggests significance.
What does a small p-value for a regression coefficient mean?
That the predictor variable is significantly contributing to the model.
Typically, p < 0.05 is considered significant.
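Both the overall F-test and the per-coefficient p-values are available on the fitted StatsModels result from the earlier sketch:

```python
print(model.fvalue, model.f_pvalue)  # overall F-statistic and its p-value
print(model.pvalues)                 # p-value for each coefficient, including the intercept
```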
What is mean squared error (MSE)?
The average squared difference between actual and predicted values.
Lower MSE means better model performance.
What is root mean squared error (RMSE)?
The square root of MSE, measuring average prediction error in the same units as the dependent variable.
More interpretable than MSE.
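A sketch computing both metrics for the earlier fit, using Scikit-learn's mean_squared_error:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

mse = mean_squared_error(y, model.fittedvalues)  # average squared prediction error
rmse = np.sqrt(mse)                              # same units as the dependent variable
print(mse, rmse)
```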
What does it mean if residuals are large?
The model’s predictions are not very accurate.
Large residuals suggest potential outliers or a poor model fit.
What is the main assumption of the least squares method?
That the errors have zero mean, constant variance, and are uncorrelated; normality is additionally assumed for inference.
Used for estimating regression coefficients.
Why should you be cautious about extrapolating in linear regression?
The model is only valid within the range of observed data.
Predictions outside this range may be unreliable.
What is Bayesian regression?
A type of regression that incorporates prior beliefs using probability distributions.
Used when dealing with small datasets or uncertainty.
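One sketch of this idea is Scikit-learn's BayesianRidge, which places priors on the coefficients and returns predictive uncertainty; the choice of model and the reuse of the earlier data are illustrative:

```python
from sklearn.linear_model import BayesianRidge

features = df[["ad_spend", "price"]]
bayes = BayesianRidge().fit(features, y)
mean_pred, std_pred = bayes.predict(features, return_std=True)  # predictive mean and std
```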
What is robust regression?
A regression method that reduces the influence of outliers.
More resistant to outliers and to violations of the normality assumption.
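A sketch using StatsModels' RLM, which applies Huber weighting by default to down-weight outliers; it reuses the earlier design matrix X and response y:

```python
import statsmodels.api as sm

robust_model = sm.RLM(y, X).fit()  # robust linear model with Huber weighting
print(robust_model.params)
```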
What is the main advantage of using StatsModels for regression?
Provides detailed statistical summaries and diagnostics.
Offers richer statistical inference output (p-values, confidence intervals, diagnostic tests) than Scikit-learn.