Regression Flashcards
Regression
A model (a straight line, in the simple case) that predicts a value of the
dependent variable (the outcome) given any value of the independent variable(s) (the predictors)
R^2 (R-squared)
Measure of how much of the total variance is accounted for by the regression model.
R-squared = SSM / SST = 1 - (SSR / SST)
Adjusted R-squared
Adjusted R-squared penalizes models for number of predictors, computed as:
1 - ((1 - R^2) * (N - 1)) / (N - k - 1)
…where N = number of data points, k = number of predictors.
Considered a conservative alternative to R-squared.
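A minimal numpy sketch of both formulas, assuming y (observed values), y_hat (model predictions), and k (number of predictors) as inputs; these names are illustrative, not from the cards:

import numpy as np

def r_squared(y, y_hat):
    # R^2 = SSM / SST = 1 - SSR / SST
    ss_total = np.sum((y - y.mean()) ** 2)    # SST: deviation of the data from the mean
    ss_resid = np.sum((y - y_hat) ** 2)       # SSR: deviation of the data from the model
    return 1 - ss_resid / ss_total

def adjusted_r_squared(y, y_hat, k):
    # Penalizes R^2 for the number of predictors k
    n = len(y)
    r2 = r_squared(y, y_hat)
    return 1 - ((1 - r2) * (n - 1)) / (n - k - 1)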
Model Sum of Squares = SSM
The improvement of the linear model over the mean of the data points:
the sum of squared deviations of each value predicted by the model from the mean.
Residual Sum of Squares = SSR
The sum of squared deviations between the values predicted by the model and the actual data points (the error left over after fitting the model)
Total Sum of Squares = SST
The sum of squared deviations of the data points from the mean (the error of the null model); SST = SSM + SSR
F-statistic
Measure of whether the improvement of the model over the mean is large relative to the residual error
F = MSM / MSR, where MSM = SSM / k and MSR = SSR / (N - k - 1), with k = number of predictors and N = number of data points
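A small numpy/scipy sketch of this ratio built from the sums of squares defined above; the inputs y, y_hat, and k are again assumed purely for illustration:

import numpy as np
from scipy import stats

def f_statistic(y, y_hat, k):
    # F = MSM / MSR with df_model = k and df_residual = N - k - 1
    n = len(y)
    ss_model = np.sum((y_hat - y.mean()) ** 2)   # SSM
    ss_resid = np.sum((y - y_hat) ** 2)          # SSR
    f = (ss_model / k) / (ss_resid / (n - k - 1))
    p = stats.f.sf(f, k, n - k - 1)              # upper-tail p-value
    return f, p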
What is multiple regression? (with continuous predictors)
A model that predicts a single (continuous) outcome variable, y, from multiple continuous predictors, x1, x2, ..., xk.
How can I compute multiple regression with continuous variables?
To compute multiple regression with continuous variables, you’ll fit a regression model where you predict one continuous outcome (dependent variable) using two or more continuous predictors (independent variables).
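One common way to do this in Python is ordinary least squares with statsmodels; the data below are made up purely for illustration:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)                               # first continuous predictor
x2 = rng.normal(size=100)                               # second continuous predictor
y = 2.0 + 1.5 * x1 - 0.7 * x2 + rng.normal(size=100)    # continuous outcome

X = sm.add_constant(np.column_stack([x1, x2]))          # design matrix with an intercept column
model = sm.OLS(y, X).fit()                              # ordinary least squares fit of y ~ x1 + x2
print(model.summary())                                  # coefficients, R^2, adjusted R^2, F-test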
Null model = intercept only model
It has only an intercept, which equals the mean of the outcome variable; it is the baseline the regression model is compared against
Assumptions of linear regression
(1) Outcome variable must be continuous (at least at the interval level)
(2) No multicollinearity (i.e., no linear relationship between 2 or more predictors)
(3) Linearity of residuals (i.e., linear relationship between predicted values & residuals)
(4) Normality of residuals (residuals are random and normally distributed with mean 0)
(5) Homoscedasticity (variance of residuals is the same for all data points)
(6) No influential cases (outliers)
(7) Independence of residuals / observations (checks for several of these assumptions are sketched below)
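A minimal sketch of how a few of these assumptions might be checked with statsmodels and scipy on a made-up dataset; the thresholds mentioned are conventional rules of thumb, not definitive:

import numpy as np
import statsmodels.api as sm
from scipy import stats
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(1)
X = sm.add_constant(rng.normal(size=(100, 2)))             # intercept + two made-up predictors
y = X @ np.array([2.0, 1.5, -0.7]) + rng.normal(size=100)
model = sm.OLS(y, X).fit()

resid = model.resid                                        # observed - predicted
_, shapiro_p = stats.shapiro(resid)                        # (4) normality: a large p-value is reassuring
dw = durbin_watson(resid)                                  # (7) independence: values near 2 are reassuring
print(f"Shapiro-Wilk p = {shapiro_p:.3f}, Durbin-Watson = {dw:.2f}")
# (3) and (5) are usually judged from a residuals-vs-fitted plot:
# the cloud should be flat (linearity) and evenly spread (homoscedasticity).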
A residual
Residual = Observed data point – predicted data point
Multicollinearity
Multicollinearity occurs in regression analysis when two or more independent variables are highly correlated with each other. This means they contain overlapping information about the dependent variable, which makes it difficult for the model to estimate the unique effect of each predictor accurately.
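A common screen for multicollinearity is the variance inflation factor (VIF); a sketch with statsmodels on deliberately correlated made-up predictors (VIFs well above roughly 5-10 are usually read as problematic):

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(2)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.1, size=100)       # deliberately almost identical to x1
X = sm.add_constant(np.column_stack([x1, x2]))

# One VIF per column of the design matrix (including the constant)
for name, i in zip(["const", "x1", "x2"], range(X.shape[1])):
    print(f"{name}: VIF = {variance_inflation_factor(X, i):.2f}")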
Homoscedasticity
In regression, homoscedasticity refers to the idea that the variance of the residuals (differences between observed and predicted values) should be consistent across all levels of the independent variable(s)
Heteroscedasticity
Heteroscedasticity in regression occurs when the variability of the residuals (the differences between observed and predicted values) is not constant across all levels of the independent variable(s). In other words, the spread or “scatter” of residuals changes as the values of the predictor variable(s) change.
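One formal check is the Breusch-Pagan test from statsmodels; a sketch on made-up data where the residual spread grows with the predictor (a small p-value points to heteroscedasticity):

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(3)
x = rng.uniform(1, 10, size=200)
y = 2.0 + 0.5 * x + rng.normal(scale=x, size=200)   # residual spread grows with x
X = sm.add_constant(x)
model = sm.OLS(y, X).fit()

# Tests whether the residual variance depends on the predictors
lm_stat, lm_p, f_stat, f_p = het_breuschpagan(model.resid, X)
print(f"Breusch-Pagan LM p-value = {lm_p:.4f}")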