Lecture 6 - Prediction models Flashcards
What is multiple linear regression used for?
To estimate the relationship between a quantitative dependent variable and two or more independent variables using a straight line.
Generate a multiple linear regression formula for the following question: what is the (joint) influence of sex, age, and BMI on cholesterol?
mean cholesterol = b0 + b1*sex + b2*age + b3*BMI
* b1: effect of sex (female=0, male=1) on cholesterol for subjects with same age and BMI.
* b2: effect of age (years) on cholesterol for subjects with the same sex and BMI.
* b3: effect of BMI (kg/m2) on cholesterol for subjects with the same sex and age.
When running the multiple linear regression in SPSS for the following formula: “mean cholesterol = b0 + b1*sex + b2*age + b3*BMI”, the regression coefficients (b0 = 0.408; b1 = 0.506; b2 = 0.042; b3 = 0.108), 95% CIs and p-values can be used to answer the research question.
Calculate the predicted mean cholesterol level for a 30-year-old female (sex = 0) with a BMI of 25.
mean cholesterol = 0.408 + 0.506*0 + 0.042*30 + 0.108*25 = 4.368
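The arithmetic on this card can be checked with a few lines of Python. This is only a sketch: the coefficients are the ones quoted on the card, while the function name `predict_cholesterol` is made up for illustration and is not part of any SPSS output.

```python
# Coefficients b0..b3 as quoted on the flashcard.
B0, B_SEX, B_AGE, B_BMI = 0.408, 0.506, 0.042, 0.108

def predict_cholesterol(sex, age, bmi):
    """Predicted mean cholesterol; sex is coded female=0, male=1."""
    return B0 + B_SEX * sex + B_AGE * age + B_BMI * bmi

# 30-year-old female with a BMI of 25.
female_30_bmi25 = predict_cholesterol(sex=0, age=30, bmi=25)
print(round(female_30_bmi25, 3))  # 4.368
```

Changing `sex` to 1 while keeping age and BMI fixed raises the prediction by exactly b1 = 0.506, which is what "effect of sex for subjects with the same age and BMI" means.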
When running a multiple linear regression model, one of the outputs is the R-square. What is R-square and how is it calculated?
R-square is a statistical measure that represents the proportion of the variance in the dependent variable that is explained by the independent variable(s) in a regression model. It is calculated by dividing the variance explained by the regression model by the total variance of the outcome. Therefore, R-square is always between 0 and 1.
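A minimal sketch of this calculation, assuming a least-squares fit with an intercept (so that explained variation equals total minus residual variation). The four observed and predicted values are hypothetical and only serve to make the numbers easy to follow.

```python
def r_squared(observed, predicted):
    """R-square = explained sum of squares / total sum of squares."""
    mean_obs = sum(observed) / len(observed)
    ss_total = sum((y - mean_obs) ** 2 for y in observed)
    ss_resid = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    ss_explained = ss_total - ss_resid  # holds for least squares with intercept
    return ss_explained / ss_total

# Hypothetical data: predictions are the group means of (4, 5) and (6, 7).
observed = [4.0, 5.0, 6.0, 7.0]
predicted = [4.5, 4.5, 6.5, 6.5]
r2 = r_squared(observed, predicted)
print(round(r2, 2))
```

Here the total sum of squares is 5.0 and the residual sum of squares is 1.0, so the model explains 4.0 out of 5.0.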
How to tackle the following research question:
Does a model with sex, age and BMI predict cholesterol better than a model with only sex?
* Model 1: mean cholesterol = b0 + b1*sex + b2*age + 0*BMI
* Model 2: mean cholesterol = b0 + b1*sex + b2*age + b3*BMI
- In order to answer this question, the models must be nested: one model uses the same variables (and cases) as the other but specifies at least one additional parameter to be estimated.
- Subsequently, an F-test can be used to determine whether there is a statistically significant difference in fit between the smaller model (model 1) and the larger model in which it is nested (model 2). A significant p-value (below the significance level) for the F-test indicates that the additional variables in model 2 provide a better fit than the model without them. Because R2 never decreases when variables are added, an increase in R2 alone is not sufficient evidence; the F-test shows whether the increase is larger than expected by chance.
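The F statistic for comparing two nested linear models can be sketched from their residual sums of squares (RSS). This is an illustration under stated assumptions: the function name and all numbers are hypothetical, and the p-value itself would still come from the F distribution with the returned degrees of freedom (statistical packages report it directly).

```python
def nested_f_statistic(rss_small, rss_large, n_params_small, n_params_large, n_cases):
    """F statistic for a smaller model nested in a larger one.

    Parameter counts include the intercept.
    Returns (F, numerator df, denominator df).
    """
    df_num = n_params_large - n_params_small   # number of added parameters
    df_den = n_cases - n_params_large          # residual df of the larger model
    f_stat = ((rss_small - rss_large) / df_num) / (rss_large / df_den)
    return f_stat, df_num, df_den

# Hypothetical values: model 1 has 3 parameters (b0, b1, b2),
# model 2 has 4 (adds b3 for BMI), fitted on the same 104 cases.
f_stat, df_num, df_den = nested_f_statistic(
    rss_small=120.0, rss_large=100.0,
    n_params_small=3, n_params_large=4, n_cases=104,
)
print(f_stat, df_num, df_den)
```

A large F means the drop in residual variation per added parameter is large relative to the remaining unexplained variation in model 2.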
What assumptions need to be met in order to perform a multiple linear regression analysis?
- Linearity: a linear relationship between the dependent variable and each of the independent variables.
- Multivariate normality: residuals are normally distributed.
- No multicollinearity: independent variables are not highly correlated with each other.
- Homoscedasticity: the variance of the errors is equal/similar across the values of the independent variables.
- Normal distribution of e: the errors e (the differences between observed and predicted values) are normally distributed.
Describe for the following assumption of the multiple linear regression model how the assumption is analysed.
* Linearity
Scatterplots
Describe for the following assumption of the multiple linear regression model how the assumption is analysed.
* Multivariate normality
Q-Q plot of entire model
Describe for the following assumption of the multiple linear regression model how the assumption is analysed.
* No multicollinearity
- Correlation matrix with the use of Pearson’s bivariate correlations among all independent variables. The magnitude of the correlation coefficients should be less than 0.8.
- Variance Inflation Factor (VIF): the VIFs of the linear regression indicate the degree to which the variances of the regression estimates are inflated due to multicollinearity. VIF values higher than 10 indicate that multicollinearity is a problem.
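Both checks can be illustrated with a small sketch. It assumes NumPy is available and uses the standard definition VIF_j = 1 / (1 - R2_j), where R2_j comes from regressing independent variable j on all the others. The helper name `vif` and the data are hypothetical.

```python
import numpy as np

def vif(X):
    """VIF for each column of X (rows = cases, columns = independent variables)."""
    X = np.asarray(X, dtype=float)
    n_cases, n_vars = X.shape
    factors = []
    for j in range(n_vars):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n_cases), others])  # add intercept
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        resid = y - design @ coef
        ss_resid = float(resid @ resid)
        ss_total = float(np.sum((y - y.mean()) ** 2))
        factors.append(ss_total / ss_resid)  # equals 1 / (1 - R2_j)
    return factors

# Hypothetical data: two independent variables with Pearson r = 0.8.
X = np.array([[1, 2], [2, 1], [3, 4], [4, 3], [5, 5]])
r = np.corrcoef(X[:, 0], X[:, 1])[0, 1]
vifs = vif(X)
print(round(r, 2), [round(v, 2) for v in vifs])
```

With only two independent variables, each VIF reduces to 1 / (1 - r^2), so a correlation of 0.8 corresponds to a VIF of about 2.78, well below the cut-off of 10.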
Describe for the following assumption of the multiple linear regression model how the assumption is analysed.
* Homoscedasticity
Plot of standardized residuals vs. predicted values shows whether points are equally distributed across all values of the independent variables. There should be no clear pattern in the distribution.
Describe for the following assumption of the multiple linear regression model how the assumption is analysed.
* Normal distribution of e
Plot a Q-Q plot for observed and predicted values for each variable used in the model.
To answer the following question: ‘How much of the variance in the development of CHD is explained by BMI, age, alcohol and smoking behavior’, which measures can be used?
- Area Under the ROC Curve (AUC) (the higher the better)
- Nagelkerke’s R2
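As an illustration of the first measure, the AUC can be read as the proportion of (case, non-case) pairs in which the case received the higher predicted probability, with ties counted as half. The sketch below assumes that interpretation; the function name and the five predicted probabilities are hypothetical.

```python
def auc(predicted_probs, outcomes):
    """AUC as the share of case/non-case pairs ranked correctly (ties = 0.5)."""
    cases = [p for p, y in zip(predicted_probs, outcomes) if y == 1]
    non_cases = [p for p, y in zip(predicted_probs, outcomes) if y == 0]
    score = 0.0
    for p_case in cases:
        for p_non in non_cases:
            if p_case > p_non:
                score += 1.0
            elif p_case == p_non:
                score += 0.5
    return score / (len(cases) * len(non_cases))

# Hypothetical predicted probabilities of CHD and observed outcomes (1 = CHD).
probs = [0.9, 0.8, 0.3, 0.6, 0.2]
chd = [1, 1, 1, 0, 0]
auc_value = auc(probs, chd)
print(round(auc_value, 3))
```

In this example 5 of the 6 pairs are ranked correctly. An AUC of 0.5 corresponds to chance-level discrimination and 1.0 to perfect discrimination.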
If you are testing which of two (multiple linear regression) models is a better fit, you need to use a nested model (a model that uses the same variables (and cases) as another model but specifies at least one additional parameter to be estimated). Subsequently, the Likelihood Ratio test can be used to determine the goodness of fit of the two models based on the ratio of their likelihood. Explain how the Likelihood Ratio can be used to determine the goodness of fit of a model.
- The Likelihood Ratio test can be found under ‘Omnibus Tests of Coefficients and Model Summary’ and uses the Chi2-test to see if there is a significant difference between the Log-likelihoods of the baseline model (model 1) and the new model (model 2).
- The ‘Model Summary’ gives the -2 Log likelihood (-2LL). If model 2 has a significantly reduced -2LL compared to the baseline, it suggests that the new model explains more of the variance in the outcome.
What analysis needs to be performed to answer the following question: “Is the model with BMI, age, alcohol and smoking behavior better than the model with BMI and age in predicting development of CHD?”
* Model 1: ln(odds CHD) = b0 + b1*BMI + b2*age + 0*alcohol + 0*smoking
* Model 2: ln(odds CHD) = b0 + b1*BMI + b2*age + b3*alcohol + b4*smoking
In order to answer this question, it is important that the models are nested. Subsequently, the Likelihood Ratio test can be used to determine the goodness of fit of the two models based on the ratio of their likelihood.
* The Likelihood Ratio test can be found under ‘Omnibus Tests of Coefficients and Model Summary’ and uses the Chi2-test to see if there is a significant difference between the Log-likelihoods of the baseline model (model 1) and the new model (model 2).
* The ‘Model Summary’ gives the -2 Log likelihood (-2LL). If model 2 has a significantly reduced -2LL compared to the baseline, it suggests that the new model explains more of the variance in the outcome.
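The test statistic is the difference in -2LL between the two models, compared against a Chi2 distribution whose degrees of freedom equal the number of added parameters. Model 2 adds two parameters (alcohol and smoking), so df = 2, and for that specific case the Chi2 p-value has the closed form exp(-x/2). The sketch below relies on that special case; the -2LL values are hypothetical, and for other degrees of freedom a statistics library would be needed.

```python
import math

def likelihood_ratio_test_df2(neg2ll_model1, neg2ll_model2):
    """Likelihood Ratio test for exactly two added parameters (df = 2).

    Returns (chi2 statistic, p-value).
    """
    chi2_stat = neg2ll_model1 - neg2ll_model2   # reduction in -2LL
    p_value = math.exp(-chi2_stat / 2)          # Chi2 upper tail, valid for df = 2 only
    return chi2_stat, p_value

# Hypothetical -2LL values from the 'Model Summary' of each model.
chi2_stat, p_value = likelihood_ratio_test_df2(neg2ll_model1=250.0, neg2ll_model2=240.0)
print(chi2_stat, round(p_value, 4))
```

With these numbers the reduction in -2LL is 10 on 2 degrees of freedom, which is significant at the 0.05 level, so model 2 would be preferred.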
Which model is used when there are multiple, potentially interacting covariates?
A multiple Cox regression model
* ln(h(t)) = ln(h0(t)) + b1*x1 + b2*x2 + … + bk*xk
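Because the baseline hazard h0(t) appears in the formula for every subject, it cancels when two subjects are compared, leaving a hazard ratio of exp(sum of bk * difference in xk). The sketch below shows that calculation; the coefficients (for smoking and age) and the two subjects are hypothetical and not taken from the lecture.

```python
import math

def hazard_ratio(coefs, x_a, x_b):
    """Hazard ratio of subject A versus subject B under a Cox model.

    ln(h_a(t)) - ln(h_b(t)) = sum_k b_k * (x_ak - x_bk); ln(h0(t)) cancels.
    """
    log_hr = sum(b * (a - c) for b, a, c in zip(coefs, x_a, x_b))
    return math.exp(log_hr)

# Hypothetical coefficients: b1 = 0.7 (smoking, 0/1), b2 = 0.05 (age in years).
coefs = [0.7, 0.05]
subject_a = [1, 60]   # smoker, 60 years old
subject_b = [0, 50]   # non-smoker, 50 years old
hr = hazard_ratio(coefs, subject_a, subject_b)
print(round(hr, 2))
```

Comparing two subjects who differ by one unit in a single covariate gives exp(bk), the hazard ratio for that covariate adjusted for the others.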