Multiple Linear Regression / Model Selection Flashcards
multivariable modeling considerations
1. Which variables should be included
2. Which observations should be included
3. What variable transformations should be done
4. What variables might be confounders
5. What variables might represent effect modifiers (i.e., require interactions between variables)
6. Whether there are multicollinearity problems
7. Whether there is a sample size / sparse data problem
8. How missing values will be handled
9. Consideration of overfitting
before any multivariable modeling
1. Select variables of interest
2. Define categories, sometimes more than once, for a given variable
3. Examine univariate distributions
4. Examine bivariate distributions
5. Perform univariate analysis for the primary association of interest, with each potential confounder / effect modifier (steps 3-5 are sketched in code after this list)
6. Rethink variables and categories
7. Perform multivariable analysis for the primary association of interest with different combinations of potential confounders / effect modifiers
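A minimal sketch of steps 3-5 in Python with pandas and statsmodels; the file name and the column names (y for the outcome, x1 for the primary exposure, x2 for a potential confounder) are assumptions for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

# Hypothetical dataset: outcome 'y', primary exposure 'x1',
# potential confounder 'x2' (file and column names are assumptions)
df = pd.read_csv("data.csv")

# Step 3: univariate distributions
print(df.describe())            # numeric summaries for each variable
df.hist(figsize=(8, 6))         # histogram of each numeric column

# Step 4: bivariate distributions
print(df.corr(numeric_only=True))               # pairwise correlations
pd.plotting.scatter_matrix(df, figsize=(8, 8))  # all pairwise scatterplots
plt.show()

# Step 5: univariate (crude) analysis of the primary association
crude = smf.ols("y ~ x1", data=df).fit()
print(crude.summary())
```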
multiple linear regression
Models the relationship between one dependent variable and two or more independent variables as a linear function
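In standard notation, the model with k predictors is:

```latex
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k + \varepsilon
```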
multiple regression model
- Estimate the multiple linear regression equation
- Test overall significance of the model
- Test significance of each independent variable
- Test relative importance of each independent variable
- Select the best model
- Use the model for prediction and estimation (see the sketch after this list)
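A minimal sketch of these steps with statsmodels; the data file and the column names y, x1, x2 are hypothetical.

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("data.csv")  # hypothetical file

# Estimate the multiple linear regression equation
model = smf.ols("y ~ x1 + x2", data=df).fit()

# Overall model significance (F test), per-coefficient t tests, R^2
print(model.summary())
print(model.fvalue, model.f_pvalue)  # overall F test directly

# Use the model for prediction and estimation
new = pd.DataFrame({"x1": [1.0], "x2": [2.0]})  # hypothetical values
print(model.predict(new))
```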
interpretation of regression model
• b0 is the estimate of β0, the intercept of the regression line and the average Y value when all predictors are zero.
• bk is the estimate of βk, one of the partial regression coefficients or 'slopes' of the regression line
o Represents the change in Y for a unit change in Xk with the other predictors held constant (illustrated below)
o i.e., βk is the average slope across all subgroups created by the levels of the other predictors
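A hypothetical illustration (all numbers made up): if the fitted equation were

```latex
\hat{Y} = 10 + 2.5\,X_1 - 0.8\,X_2
```

then b1 = 2.5 says that a one-unit increase in X1, with X2 held constant, is associated with a 2.5-unit increase in the average of Y.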
testing model significance
Tests whether there is a linear relationship between all X variables taken together and Y
Use the F test statistic to test overall significance
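In standard notation, with k predictors and n observations, where SSR is the regression (explained) sum of squares and SSE the error sum of squares, the F statistic is the ratio of explained to unexplained mean squares:

```latex
F = \frac{MSR}{MSE} = \frac{SSR / k}{SSE / (n - k - 1)}
```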
model hypotheses null and alternative
o H0: β1 = β2 = … = βk = 0
• Null: all beta coefficients are zero (excluding the intercept). If there is any relationship between Y and any of the predictors, this will not hold; with real datasets this hypothesis is almost always rejected immediately (often not even worth reporting that we reject it)
o Ha: at least one beta coefficient is not 0
• At least one X variable is related to Y
linear regression assumptions
o Mean of distribution of error is 0
o Distribution of error has constant variance
o Distribution of error is normal
o Errors are independent
to evaluate whether regression assumptions hold
we estimate the errors. These estimated errors are called residuals.
residual calculation
difference between an observed value of Y and the estimated mean based on the associated X value(s).
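In symbols, the residual for observation i is:

```latex
e_i = y_i - \hat{y}_i
```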
residuals useful for
o Diagnostics–techniques for checking assumptions of the regression model
o Understanding the variation in Y that is unexplained by the regression model
o Identifying possible outliers
Residual analysis
• Graphical analysis of residuals (see the sketch after this list)
o Plot residuals vs. Xi values
o Plot residuals vs. predicted values (ŷ)
o Plot histogram or stem-and-leaf of residuals
o Q-Q plot of residuals
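A minimal sketch of these four plots with matplotlib and statsmodels, reusing the hypothetical file and column names from the earlier sketches.

```python
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf

df = pd.read_csv("data.csv")                   # hypothetical file
model = smf.ols("y ~ x1 + x2", data=df).fit()  # as in the earlier sketch

resid = model.resid           # residuals
fitted = model.fittedvalues   # predicted values (y-hat)

fig, axes = plt.subplots(2, 2, figsize=(9, 7))

# Residuals vs. a predictor (x1 is a hypothetical column name)
axes[0, 0].scatter(df["x1"], resid)
axes[0, 0].set(xlabel="x1", ylabel="residual")

# Residuals vs. predicted values: look for curvature or fanning
axes[0, 1].scatter(fitted, resid)
axes[0, 1].set(xlabel="fitted value", ylabel="residual")

# Histogram of residuals: check rough normality
axes[1, 0].hist(resid, bins=20)
axes[1, 0].set(xlabel="residual")

# Q-Q plot of residuals against the normal distribution
sm.qqplot(resid, line="45", fit=True, ax=axes[1, 1])

plt.tight_layout()
plt.show()
```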
what if diagnostic plots indicate a problem
• Change the model:
o Add or remove variables
o Transform variables or recode categorical variables
o Remove outliers (but be careful!)
• Use a different analytic approach
occam’s razor
- Occam’s Razor: the principle that the simplest explanation is the most plausible unless there is evidence that a more complicated explanation is necessary
- For regression: you want a model with the smallest number of simple predictors that explains the observed data
R squared
Proportion of variation in Y 'explained' by all X variables taken together. Always increases (or stays the same) when a new X variable is added to the model.
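In terms of sums of squares (SST = SSR + SSE, total = explained + unexplained):

```latex
R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}
```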
simply maximizing R squared will
lead to models that vastly overfit the data
model building
• Use specified X variables (chosen based on understanding of the problem and data)
• Stepwise regression
o Computer selects the X variable most highly correlated with Y
o Continues to add or remove variables depending on SSE (forward selection or backward elimination)
forward selection
involves starting with no variables in the model, testing the addition of each variable using a chosen model comparison criterion, adding the variable (if any) that improves the model the most, and repeating this process until none improves the model.
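A minimal sketch of forward selection with statsmodels. The source leaves the comparison criterion open; AIC is used here as one common choice, and all names are illustrative.

```python
import statsmodels.formula.api as smf

def forward_select(df, response, candidates):
    """Greedy forward selection minimizing AIC (the criterion is an
    assumption; any model comparison criterion could be substituted)."""
    remaining = list(candidates)
    selected = []
    best_aic = smf.ols(f"{response} ~ 1", data=df).fit().aic  # intercept-only
    while remaining:
        # Score the model that results from adding each remaining variable
        scores = [
            (smf.ols(f"{response} ~ " + " + ".join(selected + [c]),
                     data=df).fit().aic, c)
            for c in remaining
        ]
        aic, best = min(scores)
        if aic >= best_aic:   # no addition improves the model: stop
            break
        best_aic = aic
        selected.append(best)
        remaining.remove(best)
    return selected

# e.g. forward_select(df, "y", ["x1", "x2", "x3"])
```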
backward elimination
Involves starting with all candidate variables, testing the deletion of each variable using a chosen model comparison criterion, deleting the variable (if any) whose removal improves the model the most, and repeating this process until no further improvement is possible
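The mirror-image sketch, again using AIC as the assumed criterion.

```python
import statsmodels.formula.api as smf

def backward_eliminate(df, response, candidates):
    """Greedy backward elimination minimizing AIC (criterion is an
    assumption, as in the forward-selection sketch)."""
    selected = list(candidates)
    best_aic = smf.ols(f"{response} ~ " + " + ".join(selected),
                       data=df).fit().aic
    while len(selected) > 1:
        # Score the model that results from deleting each variable in turn
        scores = [
            (smf.ols(f"{response} ~ " + " + ".join(v for v in selected if v != c),
                     data=df).fit().aic, c)
            for c in selected
        ]
        aic, worst = min(scores)
        if aic >= best_aic:   # no deletion improves the model: stop
            break
        best_aic = aic
        selected.remove(worst)
    return selected

# e.g. backward_eliminate(df, "y", ["x1", "x2", "x3"])
```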
multicollinearity
- High correlation between X variables
- Coefficients measure combined effect
- Leads to unstable coefficients and potentially misleading conclusions
- Example: BMI and weight in the same model
detecting multicollinearity
• Examine the correlation matrix
o Correlations between pairs of X variables are greater than their correlations with the Y variable
• Look at scatterplots between all pairs of variables (see the sketch below)
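A minimal sketch of both checks with pandas; the file and column names are hypothetical.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("data.csv")    # hypothetical file
cols = ["y", "x1", "x2", "x3"]  # hypothetical column names

# Correlation matrix: look for X-X correlations that rival or
# exceed the X-Y correlations
print(df[cols].corr(numeric_only=True))

# Scatterplots between all pairs of variables
pd.plotting.scatter_matrix(df[cols], figsize=(8, 8))
plt.show()
```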
remedies for multicollinearity
Eliminate one of the correlated X variables