LEC 11b Multiple Linear Regression Flashcards
Assumptions of multiple linear regression (5)
- The observations are independent of one another
- For any specified values of x, the distribution of the y values is normal
- For any set of values of x, the variance is constant (equal variance)
- There is little or no multicollinearity among the independent variables
eg weight and BMI are highly correlated
- The relationship among the variables is represented by the equation y = alpha + beta1(x1) + … + betak(xk)
alpha
- y intercept
- mean value of y when all independent variables = 0
beta
- slope
- change in the mean value of y that corresponds to a one-unit change in x(i)
- after controlling for all other independent variables (keeping the values constant)
Multiple linear regression model dimension
- multidimensional (a plane or hyperplane, no longer a straight line)
How to find the best-fitting model?
Method of least squares
- the model with the smallest residual sum of squares
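The least-squares criterion can be sketched in plain Python for the one-predictor case; the closed-form slope and intercept formulas are standard, and the data values here are made up for illustration:

```python
# Minimal sketch: closed-form least-squares fit for one predictor,
# then the residual sum of squares that the method minimises.
# Illustrative data only.

def fit_least_squares(x, y):
    """Return (alpha, beta) that minimise the residual sum of squares."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    beta = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) \
        / sum((xi - mean_x) ** 2 for xi in x)
    alpha = mean_y - beta * mean_x   # fitted line passes through (mean_x, mean_y)
    return alpha, beta

def residual_sum_of_squares(x, y, alpha, beta):
    return sum((yi - (alpha + beta * xi)) ** 2 for xi, yi in zip(x, y))

x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.0, 9.8]
alpha, beta = fit_least_squares(x, y)
rss = residual_sum_of_squares(x, y, alpha, beta)
```

The best-fitting model is the one whose coefficients make `rss` as small as possible; with multiple predictors the same criterion is solved with matrix algebra (or a stats package) rather than these scalar formulas.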
Can nominal variables be incorporated into regression model?
Yes, using dummy variables
Dummy variables
- categories of the nominal variables are identified using numbers
- numerical values that do not have any quantitative meaning
- coded as 0 or 1
For nominal variable, if there are k categories, what is the number of dummy variables needed?
k-1
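A sketch of k-1 dummy coding (the category list here is illustrative; the omitted first category acts as the reference level):

```python
# Sketch: a nominal variable with k categories becomes k-1 dummy variables.
# The reference category is coded as all zeros.

def dummy_encode(value, categories):
    """Return k-1 indicator values; categories[0] is the reference level."""
    return [1 if value == c else 0 for c in categories[1:]]

blood_types = ["O", "A", "B", "AB"]      # k = 4 categories -> 3 dummies
ref = dummy_encode("O", blood_types)     # reference: all zeros
other = dummy_encode("B", blood_types)   # one indicator set to 1
```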
How to evaluate goodness-of-fit of regression model
- coefficient of determination (R^2)
- use adjusted R^2 when comparing models that contain different numbers of independent variables
Coefficient of determination (R^2) (3)
- can be interpreted as the proportion of variability among the observed values of y that is explained by the linear regression model containing the set of independent variables
- range from 0 to 1
- always increases with the inclusion of more independent variables
Adjusted R^2 (3)
- increases when the inclusion of an independent variable improves the ability to predict y
- decreases when the inclusion of an independent variable does not improve the ability to predict y
- cannot be directly interpreted as the proportion of variability among the observed values of y that is explained by the linear regression model
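The two measures can be computed as follows; the formula adjusted R^2 = 1 - (1 - R^2)(n - 1)/(n - k - 1) is standard, with n observations and k independent variables:

```python
# Sketch: R^2 and adjusted R^2, where n = number of observations
# and k = number of independent variables in the model.

def r_squared(y, y_hat):
    """Proportion of variability in y explained by the fitted values."""
    mean_y = sum(y) / len(y)
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))   # residual SS
    tss = sum((yi - mean_y) ** 2 for yi in y)               # total SS
    return 1 - rss / tss

def adjusted_r_squared(r2, n, k):
    """Penalise model complexity so extra predictors must earn their keep."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)
```

For example, a model with R^2 = 0.9, n = 20 and k = 3 has adjusted R^2 = 0.88125, slightly below R^2 because of the complexity penalty.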
Multiple linear regression
- describes the linear relationship between the dependent variable (Y) and more than 1 independent variable (continuous, ordinal or nominal)
Y = alpha + beta1(x1) + … + betak(xk)
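Prediction from the fitted equation is just the weighted sum above; the coefficient values here are illustrative:

```python
# Sketch: evaluating Y = alpha + beta1*x1 + ... + betak*xk
# for hypothetical fitted coefficients.

def predict(alpha, betas, xs):
    """Y = alpha + beta1*x1 + ... + betak*xk."""
    return alpha + sum(b * x for b, x in zip(betas, xs))

# e.g. alpha = 1.0, beta1 = 2.0, beta2 = 0.5, evaluated at x1 = 3, x2 = 4:
y_hat = predict(1.0, [2.0, 0.5], [3.0, 4.0])   # 1 + 6 + 2 = 9.0
```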
Assumptions of Simple linear regression vs Multiple linear regression
Simple linear regression
1. There is linear relationship between the variables
Y = alpha + beta(x)
2. The observations are independent of one another
3. For any specified values of X, the distribution of the Y values is normal
4. For any set of values of X, the variance is constant (equal variance)
Multiple linear regression
1. The relationship among the variables is represented by the equation
Y = alpha + beta1(x1) + … + betak(xk)
2. The observations are independent of one another
3. For any specified values of x, the distribution of the y values is normal
4. For any set of values of x, the variance is constant (equal variance)
5. There is little or no multicollinearity among the independent variables (not highly correlated)
eg weight and BMI are highly correlated
Why use adjusted R^2 rather than R^2 when assessing the best-fitting linear regression model for multiple variables (2)
- accounts for the added complexity of a model
- an additional independent variable will always increase R^2, hence it is more meaningful to look at adjusted R^2
Model selection types (2)
- Forward selection
- independent variables are added one at a time, starting with the predictor that has the highest correlation with the dependent variable
- Backward selection
- all independent variables are entered into the equation at once, then deleted one at a time
- often preferred
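The first step of forward selection can be sketched as picking the candidate most correlated with the dependent variable; the variable names and data below are made up, and real software then re-evaluates the remaining candidates at each subsequent step:

```python
# Sketch: step one of forward selection -- among candidate predictors,
# pick the one with the highest absolute correlation with y.
# Illustrative data only.

def correlation(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def first_forward_pick(candidates, y):
    """candidates: dict mapping predictor name -> list of values."""
    return max(candidates, key=lambda name: abs(correlation(candidates[name], y)))

y = [3, 5, 7, 9]
candidates = {"x1": [1, 2, 3, 4],    # perfectly correlated with y
              "x2": [4, 1, 3, 2]}    # weakly correlated with y
chosen = first_forward_pick(candidates, y)   # "x1" enters the model first
```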