Linear Regression & Regression Diagnostics Flashcards

Systematically introduces and reinforces key concepts in Linear Regression, Regression Diagnostics, and Statistical Modeling

1
Q

What is statistical modeling?

A

A process of using mathematical models to represent real-world data relationships.

It helps understand and predict outcomes.

2
Q

Define inferential modeling.

A

The use of statistical techniques to make predictions or inferences about a population based on a sample.

Common in regression analysis.

3
Q

What is linear regression?

A

A statistical method used to model the relationship between a dependent variable and one or more independent variables.

For a single predictor, the equation is y = β₀ + β₁X + ε.
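
A minimal sketch of fitting this equation with StatsModels (covered at the end of this deck); the data is synthetic and the variable names are illustrative:

  import numpy as np
  import statsmodels.api as sm

  # Synthetic data: y depends linearly on x plus random noise
  rng = np.random.default_rng(0)
  x = rng.uniform(0, 10, 100)
  y = 2.0 + 1.5 * x + rng.normal(0, 1, 100)

  X = sm.add_constant(x)         # adds the intercept column (β₀)
  model = sm.OLS(y, X).fit()     # estimates β₀ and β₁ by least squares
  print(model.params)            # [intercept, slope]
  print(model.summary())         # full regression table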

4
Q

What is simple linear regression?

A

A linear regression model with only one independent variable.

Example: Predicting house price based on square footage.

5
Q

What is multiple linear regression?

A

A linear regression model with two or more independent variables.

Example: Predicting sales based on advertising spend and product price.

6
Q

What does the slope in linear regression represent?

A

The expected change in the dependent variable for a one-unit increase in the independent variable.

A positive slope indicates a positive relationship.

7
Q

Fill in the blank:

The y-intercept (β₀) represents __________.

A

The predicted value of y when x = 0.

It is not always meaningful, for example when x = 0 lies outside the range of the observed data.

8
Q

What is the main assumption of simple linear regression?

A

That there is a linear relationship between the dependent variable and the independent variable.

Checked using scatter plots.

9
Q

What does the error term (ε) in regression represent?

A

The deviation of an observed value from the value given by the true regression line, i.e. variation in y not explained by the model.

In a fitted model, the estimated errors are called residuals.

10
Q

What is the coefficient of determination (R²)?

A

A measure of how well the model explains the variance in the dependent variable.

R² values range from 0 to 1.

11
Q

What is Adjusted R²?

A

A modified R² that adjusts for the number of predictors in the model.

Unlike R², it penalizes adding unnecessary predictors.
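
Both values can be read directly off a fitted StatsModels result; this sketch assumes the fitted model from the earlier OLS example is still named model:

  # model is the fitted OLS result from the earlier sketch
  print(model.rsquared)       # R², share of variance explained
  print(model.rsquared_adj)   # Adjusted R², penalizes extra predictors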

12
Q

What is residual analysis in regression?

A

The process of analyzing the differences between observed and predicted values.

Helps detect model issues.

13
Q

What is a normal Q-Q plot used for?

A

Checking if residuals follow a normal distribution.

Points falling close to a straight reference (45-degree) line suggest normality.
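
One way to draw this, assuming a fitted StatsModels result named model as in the earlier sketch:

  import matplotlib.pyplot as plt
  import statsmodels.api as sm

  # model is the fitted OLS result from the earlier sketch
  sm.qqplot(model.resid, fit=True, line="45")   # standardized residuals vs. normal quantiles
  plt.show()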

14
Q

What is homoscedasticity?

A

When residuals have constant variance.

Checked using residual vs. fitted plots.
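
A minimal residuals-vs-fitted sketch with matplotlib, again assuming a fitted StatsModels result named model:

  import matplotlib.pyplot as plt

  # model is the fitted OLS result from the earlier sketch
  plt.scatter(model.fittedvalues, model.resid, alpha=0.6)
  plt.axhline(0, color="red", linestyle="--")
  plt.xlabel("Fitted values")
  plt.ylabel("Residuals")
  plt.show()   # an even band around zero suggests constant variance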

15
Q

What is heteroscedasticity?

A

When residual variance is not constant.

Indicates a violation of homoscedasticity.

16
Q

How can you fix heteroscedasticity?

A

Apply a variance-stabilizing transformation to the dependent variable or change the estimation method:

  • log or square-root transformation of y
  • weighted least squares (WLS) regression

Non-constant variance distorts standard errors and makes hypothesis tests unreliable.
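
A hedged sketch of both remedies on synthetic heteroscedastic data; the WLS weights here assume the error variance grows with x and would normally be estimated from the data:

  import numpy as np
  import statsmodels.api as sm

  # Synthetic data whose error spread grows with x (heteroscedastic)
  rng = np.random.default_rng(4)
  x = rng.uniform(1, 10, 200)
  y = 10.0 + 3.0 * x + rng.normal(0, 0.5 * x)   # noise sd proportional to x
  X = sm.add_constant(x)

  log_fit = sm.OLS(np.log(y), X).fit()               # remedy 1: log transform (y > 0)
  wls_fit = sm.WLS(y, X, weights=1.0 / x**2).fit()   # remedy 2: weights ∝ 1 / Var(error)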

17
Q

What is multicollinearity?

A

When two or more independent variables are highly correlated.

Leads to unstable coefficient estimates.

18
Q

How do you detect multicollinearity?

A

Using Variance Inflation Factor (VIF).

VIF > 5 or 10 suggests high multicollinearity.
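
A short sketch computing VIFs with StatsModels on synthetic predictors, two of which are deliberately near-duplicates:

  import numpy as np
  import pandas as pd
  import statsmodels.api as sm
  from statsmodels.stats.outliers_influence import variance_inflation_factor

  # Synthetic predictors; x2 is nearly a copy of x1
  rng = np.random.default_rng(1)
  x1 = rng.normal(size=200)
  x2 = x1 + rng.normal(scale=0.1, size=200)
  x3 = rng.normal(size=200)
  X = sm.add_constant(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))

  for i, name in enumerate(X.columns):
      if name != "const":
          print(name, variance_inflation_factor(X.values, i))   # large VIFs for x1, x2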

19
Q

How can you fix multicollinearity?

A

Remove one of the correlated variables, combine them with principal component analysis (PCA), or use ridge regression.

Keeping highly correlated features can distort model interpretation.

20
Q

What is the normality assumption in regression?

A

Residuals should be normally distributed.

Checked using Q-Q plots or Shapiro-Wilk test.
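
A minimal Shapiro-Wilk check with SciPy, assuming a fitted StatsModels result named model as before:

  from scipy import stats

  # model is the fitted OLS result from the earlier sketch
  stat, p_value = stats.shapiro(model.resid)
  print(p_value)   # p < 0.05 suggests the residuals deviate from normality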

21
Q

What is the Durbin-Watson test used for?

A

Detecting autocorrelation in regression residuals.

A value close to 2 suggests no autocorrelation.
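
StatsModels exposes this statistic directly; a short sketch, again assuming a fitted result named model:

  from statsmodels.stats.stattools import durbin_watson

  # model is the fitted OLS result from the earlier sketch
  print(durbin_watson(model.resid))   # ≈2 none, <2 positive, >2 negative autocorrelation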

22
Q

What is Cook’s Distance?

A

A measure to identify influential outliers.

Points with Cook’s Distance >1 may be problematic.
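
One way to extract it from a fitted StatsModels result (named model here by assumption):

  import numpy as np

  # model is the fitted OLS result from the earlier sketch
  cooks_d, _ = model.get_influence().cooks_distance   # distances and p-values
  print(np.where(cooks_d > 1)[0])                      # indices of influential points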

23
Q

What is Ridge Regression?

A

A regression technique that adds an L2 penalty to shrink coefficients.

Helps in handling multicollinearity.

24
Q

What is LASSO Regression?

A

A regression technique that adds an L1 penalty, shrinking some coefficients to zero.

Helps with feature selection.

25
Q

What is Elastic Net Regression?

A

A mix of Ridge and LASSO regression.

Uses both L1 and L2 penalties.
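
A side-by-side sketch of the three penalized models with scikit-learn; the alpha and l1_ratio values are arbitrary and would normally be tuned by cross-validation, and X_train / y_train are assumed to already exist:

  from sklearn.linear_model import Ridge, Lasso, ElasticNet

  # X_train, y_train are assumed to be an existing training set
  ridge = Ridge(alpha=1.0).fit(X_train, y_train)                     # L2 penalty
  lasso = Lasso(alpha=0.1).fit(X_train, y_train)                     # L1 penalty
  enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X_train, y_train)   # L1 + L2 mix

  print(lasso.coef_)   # some coefficients are shrunk exactly to zero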

26
Q

What is the purpose of polynomial regression?

A

To model a non-linear relationship using higher-degree terms.

Example: y = β₀ + β₁X + β₂X² + ε.
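
A minimal sketch of fitting this quadratic form with scikit-learn on synthetic data:

  import numpy as np
  from sklearn.linear_model import LinearRegression
  from sklearn.preprocessing import PolynomialFeatures

  # Synthetic quadratic relationship
  rng = np.random.default_rng(2)
  x = rng.uniform(-3, 3, 200).reshape(-1, 1)
  y = 1.0 + 2.0 * x.ravel() + 0.5 * x.ravel() ** 2 + rng.normal(0, 1, 200)

  X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(x)   # [x, x²]
  model = LinearRegression().fit(X_poly, y)
  print(model.intercept_, model.coef_)   # estimates of β₀, β₁, β₂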

27
Q

When should you use logistic regression instead of linear regression?

A

When predicting a binary outcome.

Example: Predicting yes/no responses.
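
A minimal logistic-regression sketch with scikit-learn on synthetic binary data:

  import numpy as np
  from sklearn.linear_model import LogisticRegression

  # Synthetic binary outcome driven by one feature
  rng = np.random.default_rng(3)
  x = rng.normal(size=(200, 1))
  p = 1 / (1 + np.exp(-(0.5 + 2.0 * x.ravel())))   # true success probability
  y = rng.binomial(1, p)                           # 0/1 labels

  clf = LogisticRegression().fit(x, y)
  print(clf.predict_proba(x[:5]))   # predicted class probabilities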

28
Q

What is cross-validation in regression?

A

A method to assess how well the model generalizes to new data.

Common technique: k-fold cross-validation.
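
A short k-fold sketch with scikit-learn, assuming an existing feature matrix X and target vector y:

  import numpy as np
  from sklearn.linear_model import LinearRegression
  from sklearn.model_selection import KFold, cross_val_score

  # X, y are assumed to be an existing feature matrix and target vector
  cv = KFold(n_splits=5, shuffle=True, random_state=0)
  scores = cross_val_score(LinearRegression(), X, y, cv=cv,
                           scoring="neg_mean_squared_error")
  print(np.sqrt(-scores).mean())   # average out-of-fold RMSE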

29
Q

How can overfitting be prevented in regression?

A

By using regularization techniques like Ridge or LASSO.

Overfitting leads to poor generalization.

30
Q

What is the main goal of regression diagnostics?

A

To verify if model assumptions are met.

Helps improve model reliability.

31
Q

What is the F-test used for in regression?

A

To check if at least one predictor variable is significantly contributing to the model.

A low p-value (<0.05) suggests significance.

32
Q

What does a small p-value for a regression coefficient mean?

A

That the predictor variable is significantly contributing to the model.

Typically, p < 0.05 is considered significant.

33
Q

What is mean squared error (MSE)?

A

The average squared difference between actual and predicted values.

Lower MSE means better model performance.

34
Q

What is root mean squared error (RMSE)?

A

The square root of MSE, measuring average prediction error in the same units as the dependent variable.

More interpretable than MSE.
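
A minimal sketch with scikit-learn and NumPy, assuming arrays y_true and y_pred of actual and predicted values:

  import numpy as np
  from sklearn.metrics import mean_squared_error

  # y_true, y_pred are assumed arrays of actual and predicted values
  mse = mean_squared_error(y_true, y_pred)
  rmse = np.sqrt(mse)   # back in the units of the dependent variable
  print(mse, rmse)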

35
Q

What does it mean if residuals are large?

A

The model’s predictions are not very accurate.

Large residuals suggest potential outliers or a poor model fit.

36
Q

What is the main assumption of the least squares method?

A

That the errors have mean zero and constant variance; normality of the errors is additionally assumed for exact inference on the coefficients.

Used for estimating regression coefficients.

37
Q

Why should you be cautious about extrapolating in linear regression?

A

The model is only valid within the range of observed data.

Predictions outside this range may be unreliable.

38
Q

What is Bayesian regression?

A

A type of regression that incorporates prior beliefs using probability distributions.

Used when dealing with small datasets or uncertainty.

39
Q

What is robust regression?

A

A regression method that reduces the influence of outliers.

More resistant to violations of normality and homoscedasticity.
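
A minimal robust-regression sketch using StatsModels' RLM with a Huber loss, reusing y and X from the earlier OLS example (an assumption):

  import statsmodels.api as sm

  # y, X as in the earlier OLS sketch; Huber loss down-weights large residuals
  robust_fit = sm.RLM(y, X, M=sm.robust.norms.HuberT()).fit()
  print(robust_fit.params)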

40
Q

What is the main advantage of using StatsModels for regression?

A

Provides detailed statistical summaries and diagnostics.

Offers richer statistical inference (p-values, confidence intervals, diagnostic tests) than Scikit-learn, which is geared toward prediction.