Linear Regression & Regression Diagnostics Flashcards
Systematically introduce and reinforce key concepts in Linear Regression, Regression Diagnostics, and Statistical Modeling
What is statistical modeling?
A process of using mathematical models to represent real-world data relationships.
It helps understand and predict outcomes.
Define inferential modeling.
The use of statistical techniques to make predictions or inferences about a population based on a sample.
Common in regression analysis.
What is linear regression?
A statistical method used to model the relationship between a dependent variable and one or more independent variables.
With a single predictor, the equation is y = β₀ + β₁x + ε.
What is simple linear regression?
A linear regression model with only one independent variable.
Example: Predicting house price based on square footage.
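As a minimal sketch of the house-price example, the closed-form least-squares estimates for one predictor can be computed directly. The data values and the helper name `fit_simple_ols` are hypothetical, chosen only for illustration.

```python
# Hypothetical data: square footage and sale price (in thousands of dollars).
sqft = [1000.0, 1500.0, 2000.0, 2500.0, 3000.0]
price = [200.0, 290.0, 410.0, 500.0, 600.0]

def fit_simple_ols(x, y):
    """Return (intercept, slope) that minimize the sum of squared residuals."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Slope = covariance of x and y divided by variance of x.
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sxy / sxx
    # The fitted line always passes through the point of means.
    intercept = y_mean - slope * x_mean
    return intercept, slope

b0, b1 = fit_simple_ols(sqft, price)
predicted_price = b0 + b1 * 1800.0  # prediction for an 1800 sq ft house
```

Here `b1` is the estimated change in price per additional square foot, and `b0` is the fitted value at zero square feet.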
What is multiple linear regression?
A linear regression model with two or more independent variables.
Example: Predicting sales based on advertising spend and product price.
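One possible way to fit the sales example is with NumPy's least-squares solver. The data below are made up and generated without noise from assumed coefficients (50, 2, -3), so the fit should recover them; real data would not fit exactly.

```python
import numpy as np

# Hypothetical predictors: advertising spend and product price.
ad_spend = np.array([10.0, 15.0, 20.0, 25.0, 30.0, 35.0])
unit_price = np.array([5.0, 4.5, 5.5, 4.0, 5.0, 4.5])

# Assumed, noise-free relationship used only to generate illustrative data.
sales = 50.0 + 2.0 * ad_spend - 3.0 * unit_price

# Design matrix: a column of ones for the intercept, then one column per predictor.
X = np.column_stack([np.ones_like(ad_spend), ad_spend, unit_price])

# Least-squares solution; coef holds [intercept, ad_spend effect, unit_price effect].
coef, *_ = np.linalg.lstsq(X, sales, rcond=None)
```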
What does the slope in linear regression represent?
The expected change in the dependent variable for a one-unit increase in the independent variable.
A positive slope indicates a positive relationship.
Fill in the blank:
The y-intercept (β₀) represents __________.
The predicted value of y when x = 0.
Not meaningful when x = 0 lies outside the range of the observed data.
What is the main assumption of simple linear regression?
That there is a linear relationship between the dependent and independent variable.
Checked using scatter plots.
What does the error term (ε) in regression represent?
The deviation of an observed value from the true (unknown) regression line.
Residuals, the differences between observed and predicted values, are its sample estimates.
What is the coefficient of determination (R²)?
A measure of how well the model explains the variance in the dependent variable.
For a least-squares fit with an intercept, R² values range from 0 to 1.
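A short sketch of the usual definition, R² = 1 - SS_res / SS_tot. The observed and fitted values are hypothetical, and `r_squared` is an illustrative helper name.

```python
def r_squared(y, y_hat):
    """Share of the variance in y explained by the fitted values y_hat."""
    y_mean = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # unexplained variation
    ss_tot = sum((yi - y_mean) ** 2 for yi in y)              # total variation
    return 1.0 - ss_res / ss_tot

# Hypothetical observed values and model predictions.
observed = [3.0, 5.0, 7.0, 9.0]
fitted = [2.5, 5.5, 7.0, 9.0]
r2 = r_squared(observed, fitted)
```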
What is Adjusted R²?
A modified R² that adjusts for the number of predictors in the model.
Unlike R², it penalizes adding unnecessary predictors.
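The standard adjustment can be sketched as below, where n is the number of observations and p the number of predictors (excluding the intercept). The input values are hypothetical.

```python
def adjusted_r_squared(r2, n_obs, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Hypothetical model: R^2 of 0.90 from 50 observations and 5 predictors.
adj_r2 = adjusted_r_squared(0.90, n_obs=50, n_predictors=5)

# Same R^2 with more predictors gives a lower adjusted value.
adj_r2_more = adjusted_r_squared(0.90, n_obs=50, n_predictors=10)
```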
What is residual analysis in regression?
The process of analyzing the differences between observed and predicted values.
Helps detect model issues.
What is a normal Q-Q plot used for?
Checking if residuals follow a normal distribution.
Points falling close to the straight reference line suggest normality.
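The coordinates behind a normal Q-Q plot can be sketched with the standard library alone. This assumes the common plotting positions (i - 0.5) / n and standardized residuals; the residual values and the name `qq_points` are hypothetical.

```python
from statistics import NormalDist, mean, stdev

def qq_points(residuals):
    """Return (theoretical quantile, sample quantile) pairs for a normal Q-Q plot."""
    n = len(residuals)
    m, s = mean(residuals), stdev(residuals)
    # Standardize so that normal residuals line up with the line y = x.
    sample = sorted((r - m) / s for r in residuals)
    # Assumed plotting positions: (i - 0.5) / n for i = 1..n.
    theoretical = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, sample))

# Hypothetical residuals from some fitted model.
points = qq_points([0.3, -1.2, 0.8, -0.1, 1.5, -0.7, 0.2, -0.4])
```

Plotting each pair as a point, with the theoretical quantile on the horizontal axis, gives the Q-Q plot.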
What is homoscedasticity?
When residuals have constant variance.
Checked using residual vs. fitted plots.
What is heteroscedasticity?
When residual variance is not constant.
Indicates a violation of homoscedasticity.
How can you fix heteroscedasticity?
Common remedies:
- transform the dependent variable (log, square root)
- use weighted least squares regression
Non-constant variance makes standard errors and significance tests unreliable.
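A minimal sketch of weighted least squares, implemented by rescaling each row by the square root of its weight and then solving an ordinary least-squares problem. The data are simulated, and the weights 1 / x² rest on the stated assumption that the error variance is proportional to x²; in practice the weights have to be chosen or estimated from the data.

```python
import numpy as np

def weighted_least_squares(X, y, weights):
    """Minimize sum_i w_i * (y_i - X_i b)^2 by scaling rows with sqrt(w_i)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    sqrt_w = np.sqrt(np.asarray(weights, dtype=float))
    coef, *_ = np.linalg.lstsq(X * sqrt_w[:, None], y * sqrt_w, rcond=None)
    return coef

# Simulated data whose noise spread grows with x (heteroscedastic by design).
rng = np.random.default_rng(0)
x = np.linspace(1.0, 10.0, 50)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5 * x)
X = np.column_stack([np.ones_like(x), x])

# Assumption: error variance is proportional to x**2, so weights are 1 / x**2.
wls_coef = weighted_least_squares(X, y, 1.0 / x**2)

# Equal weights reduce to ordinary least squares.
ols_coef = weighted_least_squares(X, y, np.ones_like(x))
```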
What is multicollinearity?
When two or more independent variables are highly correlated.
Leads to unstable coefficient estimates.
How do you detect multicollinearity?
Using Variance Inflation Factor (VIF).
VIF > 5 or 10 suggests high multicollinearity.
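The VIF of predictor j is 1 / (1 - R²ⱼ), where R²ⱼ comes from regressing predictor j on the other predictors. A rough NumPy sketch follows; the simulated data and the helper name `vif` are assumptions for illustration.

```python
import numpy as np

def vif(X):
    """Variance Inflation Factor per column of X (predictors only, no intercept column)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    factors = []
    for j in range(p):
        target = X[:, j]
        # Regress predictor j on an intercept plus all other predictors.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        ss_res = resid @ resid
        ss_tot = ((target - target.mean()) ** 2).sum()
        # 1 / (1 - R^2) simplifies to ss_tot / ss_res.
        factors.append(ss_tot / ss_res)
    return np.array(factors)

# Simulated predictors: x2 is almost a copy of x1, x3 is unrelated.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.1 * rng.normal(size=200)
x3 = rng.normal(size=200)
vifs = vif(np.column_stack([x1, x2, x3]))
```

In this setup the first two entries of `vifs` should be large and the third close to 1.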
How can you fix multicollinearity?
Remove one of the correlated variables, apply PCA, or use ridge regression.
Keeping highly correlated features can distort model interpretation.
What is the normality assumption in regression?
Residuals should be normally distributed.
Checked using Q-Q plots or Shapiro-Wilk test.
What is the Durbin-Watson test used for?
Detecting autocorrelation in regression residuals.
The statistic ranges from 0 to 4; a value close to 2 suggests no autocorrelation.
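The statistic itself is simple to compute: the sum of squared successive differences of the residuals divided by the sum of squared residuals. The residual sequences below are hypothetical extremes chosen to show both directions.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for residuals ordered in time."""
    diffs = sum((residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals)))
    total = sum(e ** 2 for e in residuals)
    return diffs / total

# Residuals that flip sign every step: negative autocorrelation, statistic above 2.
dw_alternating = durbin_watson([1.0, -1.0, 1.0, -1.0])

# Residuals that stay on one side for long runs: positive autocorrelation, statistic below 2.
dw_persistent = durbin_watson([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
```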
What is Cook’s Distance?
A measure of how much the fitted values change when a single observation is removed; used to identify influential points.
Points with Cook’s Distance > 1 may be problematic.
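A sketch of the usual formula, Dᵢ = eᵢ² / (p · s²) · hᵢᵢ / (1 - hᵢᵢ)², where hᵢᵢ is the leverage from the hat matrix, p the number of fitted parameters, and s² the residual mean square. The data are hypothetical, with the last point deliberately placed far from the others.

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's Distance per observation for an OLS fit; X must include the intercept column."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    # Hat matrix maps observed y to fitted y; its diagonal holds the leverages.
    hat = X @ np.linalg.pinv(X.T @ X) @ X.T
    leverage = np.diag(hat)
    residuals = y - hat @ y
    mse = residuals @ residuals / (n - p)
    return residuals**2 / (p * mse) * leverage / (1.0 - leverage) ** 2

# Hypothetical data: nine points on a line plus one distant, off-trend point.
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 20], dtype=float)
y = 2.0 * x + 1.0
y[-1] = 10.0
X = np.column_stack([np.ones_like(x), x])

distances = cooks_distance(X, y)
most_influential = int(np.argmax(distances))  # index of the most influential point
```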
What is Ridge Regression?
A regression technique that adds an L2 penalty to shrink coefficients.
Helps in handling multicollinearity.
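A minimal closed-form sketch, assuming the common convention of centering the data so the intercept is not penalized: coefficients solve (XᵀX + λI)β = Xᵀy. The simulated, nearly collinear predictors and the name `ridge_fit` are illustrative assumptions.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Return (intercept, coefficients) for ridge regression with L2 penalty strength lam."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean  # centering leaves the intercept unpenalized
    p = X.shape[1]
    coef = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
    intercept = y_mean - x_mean @ coef
    return intercept, coef

# Simulated data with two nearly collinear predictors.
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 + 0.05 * rng.normal(size=100)
X = np.column_stack([x1, x2])
y = 1.0 + 2.0 * x1 + 2.0 * x2 + rng.normal(scale=0.5, size=100)

ols_intercept, ols_coef = ridge_fit(X, y, lam=0.0)       # lam = 0 is plain OLS
ridge_intercept, ridge_coef = ridge_fit(X, y, lam=10.0)  # shrunken coefficients
```

Comparing `np.linalg.norm(ridge_coef)` with `np.linalg.norm(ols_coef)` shows the shrinkage.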
What is LASSO Regression?
A regression technique that adds an L1 penalty, shrinking some coefficients to zero.
Helps with feature selection.
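One way to see why the L1 penalty produces exact zeros is cyclic coordinate descent with soft-thresholding. The sketch below assumes the objective (1 / (2n)) · ‖y - Xβ‖² + λ‖β‖₁ on centered data; the simulated data, function names, and fixed iteration count are illustrative choices, not a production solver.

```python
import numpy as np

def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; values within the threshold become zero."""
    return np.sign(value) * max(abs(value) - threshold, 0.0)

def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Coefficients minimizing (1/(2n)) * ||y - X b||^2 + lam * ||b||_1 on centered data."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    n, p = Xc.shape
    coef = np.zeros(p)
    col_scale = (Xc**2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            # Residual with feature j's current contribution added back in.
            partial_resid = yc - Xc @ coef + Xc[:, j] * coef[j]
            rho = Xc[:, j] @ partial_resid / n
            coef[j] = soft_threshold(rho, lam) / col_scale[j]
    return coef

# Simulated data: only the first of three features affects y.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=100)

lasso_coef = lasso_coordinate_descent(X, y, lam=0.5)
selected = [j for j in range(X.shape[1]) if lasso_coef[j] != 0.0]
```

The indices in `selected` are the features the penalty kept.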