Multiple Linear Regression / Model Selection Flashcards
multivariable modeling considerations
o 1. Which variables should be included
o 2. Which observations should be included
o 3. What variable transformations should be done
o 4. What variables might be confounders
o 5. What variables might represent effect modifiers (i.e., require interactions between variables)
o 6. Whether there are multicollinearity problems
o 7. Whether there is a sample size / sparse data problem
o 8. How missing values will be handled
o 9. Consideration of overfitting
before any multivariable modeling
- 1. Select variables of interest
- 2. Define categories, sometimes more than once, for a given variable
- 3. Examine univariate distributions
- 4. Examine bivariate distributions
- 5. Perform univariate analysis for the primary association of interest, with each potential confounder / effect modifier
- 6. Rethink variables and categories
- 7. Perform multivariable analysis for the primary association of interest with different combinations of potential confounders / effect modifiers
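A minimal sketch of the pre-modeling steps 3 and 4 above (examining univariate and bivariate distributions) using pandas; the dataset, column names, and values are all hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical dataset: an outcome, an exposure, and a continuous covariate.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(20, 80, size=100),
    "exposure": rng.choice(["low", "high"], size=100),
    "outcome": rng.normal(size=100),
})

# Step 3: univariate distributions.
print(df["age"].describe())            # summary statistics for a continuous variable
print(df["exposure"].value_counts())   # category counts for a categorical variable

# Step 4: bivariate distributions (outcome summarized by exposure group).
group_means = df.groupby("exposure")["outcome"].mean()
print(group_means)
```

In practice these numeric summaries would be paired with histograms and scatterplots before rethinking variables and categories (steps 5-7).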
multiple linear regression
The relationship between one dependent variable and two or more independent variables is modeled as a linear function
multiple regression model
- Estimate multiple linear regression equation;
- Test overall significance of model
- Test significance of each independent variable
- Test relative importance of each independent variable
- Select best model
- Use model for prediction and estimation
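The first and last steps above (estimating the regression equation and using it for prediction) can be sketched with NumPy's least-squares solver; the data and coefficient values are hypothetical:

```python
import numpy as np

# Hypothetical data: Y generated from two predictors X1, X2 plus noise.
rng = np.random.default_rng(0)
n = 50
X1 = rng.normal(size=n)
X2 = rng.normal(size=n)
Y = 2.0 + 1.5 * X1 - 0.5 * X2 + rng.normal(scale=0.3, size=n)

# Design matrix with a leading column of ones for the intercept b0.
X = np.column_stack([np.ones(n), X1, X2])

# Ordinary least squares via lstsq (numerically stable form of (X'X)^-1 X'Y).
b, _, _, _ = np.linalg.lstsq(X, Y, rcond=None)
print("b0, b1, b2 =", b)      # estimates of beta0, beta1, beta2

# Use the fitted equation for prediction at a new observation.
y_hat_new = b @ np.array([1.0, 0.5, -1.0])
print("predicted Y at X1=0.5, X2=-1.0:", y_hat_new)
```

In applied work a library such as statsmodels would also report the significance tests listed above; this sketch shows only the estimation and prediction steps.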
interpretation of regression model
• b0 is the estimate of β0, the intercept of the regression line and the average Y value when all predictors are zero.
• bk is the estimate of βk, one of the partial regression coefficients, or ‘slopes’, of the regression line
o represents the change in Y for a unit change in Xk with the other predictors held constant.
o i.e., βk is the average slope for Xk across the subgroups defined by the levels of the other predictors
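The "other predictors held constant" interpretation can be checked numerically: for two observations that differ by one unit in a single predictor, the predicted difference equals that predictor's partial slope. The coefficient values here are hypothetical:

```python
import numpy as np

# Illustrative fitted coefficients (hypothetical values): b0, b1, b2.
b = np.array([2.0, 1.5, -0.5])

# Two observations differing only by one unit in X1 (X2 held constant at 7.0).
x_a = np.array([1.0, 3.0, 7.0])   # leading 1 multiplies the intercept
x_b = np.array([1.0, 4.0, 7.0])

# The difference in predictions equals b1, the partial slope for X1.
print(x_b @ b - x_a @ b)          # -> 1.5
```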
testing model significance
Tests whether there is a linear relationship between all X variables together and Y
Use the F test statistic to test overall significance
model hypotheses null and alternative
o H0: β1 = β2 = … = βk = 0
• Null: all slope coefficients are zero (the intercept is excluded). If there is any relationship between Y and any of the predictors, this will not be true; with a real dataset this hypothesis is almost always rejected immediately (often not even worth reporting).
o Ha: At least one β coefficient is not 0
• At least one X variable is related to Y
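A sketch of the overall F test, computed from the sums of squares of an OLS fit; the data are hypothetical, and the p-value uses scipy.stats.f:

```python
import numpy as np
from scipy import stats

# Hypothetical data with a genuine linear signal through X1.
rng = np.random.default_rng(1)
n, k = 40, 2                       # n observations, k predictors
X1 = rng.normal(size=n)
X2 = rng.normal(size=n)
Y = 1.0 + 0.8 * X1 + rng.normal(scale=0.5, size=n)

X = np.column_stack([np.ones(n), X1, X2])
b, _, _, _ = np.linalg.lstsq(X, Y, rcond=None)
y_hat = X @ b

# Partition the variation in Y: total = regression (SSR) + error (SSE).
ss_total = np.sum((Y - Y.mean()) ** 2)
sse = np.sum((Y - y_hat) ** 2)
ssr = ss_total - sse

# F = (SSR/k) / (SSE/(n-k-1)) tests H0: beta1 = ... = betak = 0.
F = (ssr / k) / (sse / (n - k - 1))
p_value = stats.f.sf(F, k, n - k - 1)
print("F =", F, "p =", p_value)
```

With a real signal present, as here, the p-value is tiny, which is why rejecting this null is rarely informative on its own.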
linear regression assumptions
o Mean of distribution of error is 0
o Distribution of error has constant variance
o Distribution of error is normal
o Errors are independent
to evaluate whether regression assumptions hold
we estimate the errors. These estimated errors are called residuals.
residual calculation
difference between an observed value of Y and the estimated mean based on the associated X value(s).
residuals useful for
o Diagnostics–techniques for checking assumptions of the regression model
o Understanding the variation in Y that is unexplained by the regression model
o Identifying possible outliers
Residual analysis
• Graphical analysis of residuals
o Plot residuals vs. Xi Values
o Plot residuals vs. predicted values (ŷ)
o Plot histogram or stem-and-leaf of residuals
o Q-Q plot of residuals
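The plots above have numeric counterparts that can be computed directly from the residuals. A sketch with hypothetical well-behaved data (the Q-Q check uses scipy.stats.probplot, which returns the correlation of the ordered residuals with normal quantiles):

```python
import numpy as np
from scipy import stats

# Hypothetical data satisfying the regression assumptions.
rng = np.random.default_rng(2)
n = 200
x = rng.uniform(0, 10, size=n)
y = 3.0 + 2.0 * x + rng.normal(scale=1.0, size=n)

X = np.column_stack([np.ones(n), x])
b, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ b
residuals = y - y_hat              # observed Y minus estimated mean

# Mean residual is ~0 by construction when the model has an intercept.
print("mean residual:", residuals.mean())

# Any pattern of residuals vs. fitted values suggests misspecification;
# with OLS the linear correlation is 0 up to rounding.
r_fitted = np.corrcoef(residuals, y_hat)[0, 1]
print("corr(residual, fitted):", r_fitted)

# Q-Q correlation near 1 is consistent with normally distributed errors.
(_, _), (_, _, r_qq) = stats.probplot(residuals)
print("Q-Q correlation:", r_qq)
```

These numbers summarize the plots but do not replace them; a curved residual-vs-fitted pattern or funnel-shaped spread is far easier to see graphically.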
what if diagnostic plots indicate a problem
• Change the model:
o Add or remove variables
o Transform variables or recode categorical variables
o Remove outliers (but be careful!)
• Use a different analytic approach
occam’s razor
- Occam’s Razor: the principle that the simplest explanation is the most plausible unless there is evidence that a more complicated explanation is necessary
- For regression: you want a model with the smallest number of simple predictors that explains the observed data
R squared
Proportion of variation in Y ‘explained’ by all X variables taken together. R² always increases (or stays the same) when a new X variable is added to the model.
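The claim that R² never decreases when a predictor is added can be demonstrated by fitting nested models, one of which includes a predictor that is pure noise; all data here are hypothetical:

```python
import numpy as np

def r_squared(X, y):
    """R^2 for an OLS fit of y on design matrix X (intercept column included)."""
    b, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    return 1 - resid @ resid / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(3)
n = 30
x1 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 + rng.normal(size=n)

X_small = np.column_stack([np.ones(n), x1])
# Add a predictor that is pure noise, unrelated to y.
X_big = np.column_stack([X_small, rng.normal(size=n)])

r2_small = r_squared(X_small, y)
r2_big = r_squared(X_big, y)
print(r2_small, r2_big)   # r2_big >= r2_small even though the new X is noise
```

This is why raw R² is a poor model-selection criterion and favors overfitting; adjusted R² penalizes the added predictor instead.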