Lecture 6 - Prediction models Flashcards

Question 1

Q

What is multiple linear regression used for?

Answer

A

To estimate the relationship between a quantitative dependent variable and two or more independent variables using a straight line.

Question 2

Q

Generate a multiple linear regression formula for the following question: what is the (joint) influence of sex, age, and BMI on cholesterol?

Answer

A

mean cholesterol = b0 + b1*sex + b2*age + b3*BMI
* b1: effect of sex (female=0, male=1) on cholesterol for subjects with same age and BMI.
* b2: effect of age (years) on cholesterol for subjects with the same sex and BMI.
* b3: effect of BMI (kg/m2) on cholesterol for subjects with the same sex and age.

Question 3

Q

When a multiple linear regression is run in SPSS for the formula “mean cholesterol = b0 + b1*sex + b2*age + b3*BMI”, the regression coefficients (b0 = 0.408, b1 = 0.506, b2 = 0.042, b3 = 0.108), 95% CIs and p-values can be used to answer the research question.
Calculate the predicted mean cholesterol level for a 30-year-old female (sex = 0) with a BMI of 25.

Answer

A

mean cholesterol = 0.408 + 0.506*0 + 0.042*30 + 0.108*25 = 4.368
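
The arithmetic in this answer can be checked with a short Python sketch. The function and argument names are illustrative assumptions; the coefficients are the ones quoted in the question.

```python
def predict_cholesterol(sex, age, bmi, b0=0.408, b1=0.506, b2=0.042, b3=0.108):
    """Predicted mean cholesterol from the fitted regression equation.

    sex: 0 = female, 1 = male; age in years; bmi in kg/m2.
    Default coefficients are those quoted in the flashcard.
    """
    return b0 + b1 * sex + b2 * age + b3 * bmi


# Female (sex = 0), 30 years old, BMI of 25
predicted = predict_cholesterol(sex=0, age=30, bmi=25)
print(round(predicted, 3))  # 4.368
```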

Question 4

Q

When running a multiple linear regression model, one of the outputs is the R-square. What is R-square and how is it calculated?

Answer

A

R-square is a statistical measure that represents the proportion of the variance in the dependent variable that is explained by the independent variable(s) in a regression model. It is calculated by dividing the variance explained by the regression model by the total variance of the outcome. Therefore, R-square is always between 0 and 1.
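
As a minimal sketch of this calculation in Python: for a least-squares model with an intercept, the explained variation equals the total variation minus the residual variation, so R-square can be computed as 1 - SS_residual / SS_total. The function name and the numbers below are made up for illustration.

```python
def r_squared(observed, predicted):
    """Proportion of the outcome's variation explained by the predictions."""
    mean_obs = sum(observed) / len(observed)
    # Total variation of the outcome around its mean
    ss_total = sum((y - mean_obs) ** 2 for y in observed)
    # Variation left unexplained by the model
    ss_residual = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    return 1 - ss_residual / ss_total


observed = [4.0, 5.0, 6.0, 7.0]             # illustrative values only
print(r_squared(observed, observed))         # perfect predictions give 1.0
print(r_squared(observed, [5.5] * 4))        # predicting the mean gives 0.0
```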

Question 5

Q

How to tackle the following research question:

Does a model with sex, age and BMI predict cholesterol better than a model with only sex and age?
* Model 1: mean cholesterol = b0 + b1*sex + b2*age + 0*BMI
* Model 2: mean cholesterol = b0 + b1*sex + b2*age + b3*BMI

Answer

A

In order to answer this question, the models must be nested: model 2 uses the same variables (and cases) as model 1 but specifies at least one additional parameter to be estimated.
An F-test can then be used to determine whether there is a statistically significant difference in fit between the smaller model (model 1) and the extended model (model 2). A significant p-value (below the significance level) for the F-test indicates that the additional variables in model 2 provide a better fit than the model without them. The increase in R2 from model 1 to model 2 shows how much extra variance the additional variables explain.
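
A minimal sketch of the F statistic for comparing two nested linear models, written in terms of their R2 values. The function name, argument names and example numbers are assumptions for illustration; turning F into a p-value additionally requires the F distribution with (extra_params, n - k_large - 1) degrees of freedom, which is not shown here.

```python
def f_change(r2_small, r2_large, n, k_large, extra_params):
    """F statistic for the improvement of the larger nested model.

    r2_small, r2_large: R2 of the smaller and the larger model
    n: number of cases; k_large: number of predictors in the larger model
    extra_params: number of predictors added to the smaller model
    """
    gained = (r2_large - r2_small) / extra_params      # explained variance gained per added parameter
    unexplained = (1 - r2_large) / (n - k_large - 1)   # unexplained variance per residual degree of freedom
    return gained / unexplained


# Illustrative numbers only: adding one predictor raises R2 from 0.10 to 0.20
f_value = f_change(r2_small=0.10, r2_large=0.20, n=104, k_large=3, extra_params=1)
print(f_value)
```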

Question 6

Q

What assumptions need to be met in order to perform a multiple linear regression analysis?

Answer

A

Linearity: a linear relationship between the dependent variable and each of the independent variables.
Multivariate normality: residuals are normally distributed.
No multicollinearity: independent variables are not highly correlated with each other.
Homoscedasticity: the variance of error is equal/similar across the values of the independent variables.
Normal distribution of the errors e (the differences between observed and predicted values).

Question 7

Q

Describe for the following assumption of the multiple linear regression model how the assumption is analysed.
* Linearity

Answer

A

Scatterplots

Question 8

Q

Describe for the following assumption of the multiple linear regression model how the assumption is analysed.
* Multivariate normality

Answer

A

Q-Q plot of entire model

Question 9

Q

Describe for the following assumption of the multiple linear regression model how the assumption is analysed.
* No multicollinearity

Answer

A

Correlation matrix with the use of Pearson’s bivariate correlations among all independent variables. The magnitude of the correlation coefficients should be less than 0.8.
Variance Inflation Factor (VIF): the VIFs of the linear regression indicate the degree that the variances in the regression estimates are increased due to multicollinearity. VIF values higher than 10 indicate that multicollinearity is a problem.
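
As a sketch of how a VIF is obtained: the VIF of predictor j is 1 / (1 - R2_j), where R2_j is the R-square from regressing predictor j on all other predictors. The function name below is an assumption for illustration.

```python
def variance_inflation_factor(r_squared_j):
    """VIF for one predictor, given the R-square from regressing it on the others."""
    return 1.0 / (1.0 - r_squared_j)


# A predictor unrelated to the others is not inflated at all
print(variance_inflation_factor(0.0))   # 1.0
# R2_j = 0.9 corresponds to the common cut-off of about 10
print(variance_inflation_factor(0.9))
```

With only two predictors, R2_j is simply the squared Pearson correlation between them, so a correlation of 0.8 corresponds to a VIF of about 2.8.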

Question 10

Q

Describe for the following assumption of the multiple linear regression model how the assumption is analysed.
* Homoscedasticity

Answer

A

Plot of standardized residuals vs. predicted values shows whether points are equally distributed across all values of the independent variables. There should be no clear pattern in the distribution.

Question 11

Q

Describe for the following assumption of the multiple linear regression model how the assumption is analysed.
* Normal distribution of e

Answer

A

Plot a Q-Q plot for observed and predicted values for each variable used in the model.

Question 12

Q

To answer the following question: ‘How much of the variance in the development of CHD is explained by BMI, age, alcohol and smoking behavior’, which measures can be used?

Answer

A

Area Under the ROC Curve (AUC) (the higher the better)
Nagelkerke’s R2
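
The AUC can be read as the probability that a randomly chosen subject who develops CHD gets a higher predicted probability than a randomly chosen subject who does not. A minimal sketch of that pairwise definition follows; the function name and the data are made up for illustration.

```python
def auc(outcomes, probabilities):
    """Area under the ROC curve via pairwise comparison (ties count as half)."""
    cases = [p for y, p in zip(outcomes, probabilities) if y == 1]
    non_cases = [p for y, p in zip(outcomes, probabilities) if y == 0]
    concordant = 0.0
    for p_case in cases:
        for p_non_case in non_cases:
            if p_case > p_non_case:
                concordant += 1.0
            elif p_case == p_non_case:
                concordant += 0.5
    return concordant / (len(cases) * len(non_cases))


# Illustrative data only: 1 = developed CHD, 0 = did not
outcomes = [0, 0, 1, 1]
probabilities = [0.1, 0.4, 0.35, 0.8]
print(auc(outcomes, probabilities))  # 0.75
```

An AUC of 0.5 means the model discriminates no better than chance; 1.0 means perfect discrimination.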

Question 13

Q

If you are testing which of two (multiple linear regression) models is a better fit, you need to use a nested model (a model that uses the same variables (and cases) as another model but specifies at least one additional parameter to be estimated). Subsequently, the Likelihood Ratio test can be used to determine the goodness of fit of the two models based on the ratio of their likelihood. Explain how the Likelihood Ratio can be used to determine the goodness of fit of a model.

Answer

A

The Likelihood Ratio test can be found under ‘Omnibus Tests of Coefficients and Model Summary’ and uses the Chi2-test to see if there is a significant difference between the Log-likelihoods of the baseline model (model 1) and the new model (model 2).
The ‘Model Summary’ gives the -2 Log likelihood (-2LL). If model 2 has a significantly reduced -2LL compared to the baseline, it suggests that the new model explains more of the variance in the outcome.

Question 14

Q

What analysis needs to be performed to answer the following question: “Is the model with BMI, age, alcohol and smoking behavior better than the model with BMI and age in predicting development of CHD?”
* Model 1: ln(odds CHD) = b0 + b1*BMI + b2*age + 0*alcohol + 0*smoking
* Model 2: ln(odds CHD) = b0 + b1*BMI + b2*age + b3*alcohol + b4*smoking

Answer

A

In order to answer this question, it is important that the models are nested. Subsequently, the Likelihood Ratio test can be used to determine the goodness of fit of the two models based on the ratio of their likelihood.
* The Likelihood Ratio test can be found under ‘Omnibus Tests of Coefficients and Model Summary’ and uses the Chi2-test to see if there is a significant difference between the Log-likelihoods of the baseline model (model 1) and the new model (model 2).
* The ‘Model Summary’ gives the -2 Log likelihood (-2LL). If model 2 has a significantly reduced -2LL compared to the baseline, it suggests that the new model explains more of the variance in the outcome.
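
A minimal sketch of the Likelihood Ratio test for this comparison. Model 2 adds two parameters (alcohol and smoking), so the test statistic is compared with a Chi2 distribution with 2 degrees of freedom, whose upper-tail probability has the closed form exp(-x/2). The function name and the -2LL values are made up for illustration; for another number of added parameters a general Chi2 distribution is needed.

```python
import math


def likelihood_ratio_test_df2(neg2ll_baseline, neg2ll_new):
    """Likelihood Ratio test for a new model with exactly 2 extra parameters.

    Both arguments are -2 Log likelihood values as reported in a model summary.
    """
    chi2 = neg2ll_baseline - neg2ll_new   # reduction in -2LL
    p_value = math.exp(-chi2 / 2)         # upper tail of Chi2 with 2 df
    return chi2, p_value


# Illustrative values only
chi2, p_value = likelihood_ratio_test_df2(neg2ll_baseline=250.0, neg2ll_new=240.0)
print(chi2, p_value)
```

A p-value below the chosen significance level suggests that adding alcohol and smoking improves the fit.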

Question 15

Q

Which model is used when there are multiple, potentially interacting covariates?

Answer

A

A multiple Cox regression model
* ln(h(t)) = ln(h0(t)) + b1*x1 + b2*x2 + … + bk*xk
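
Because the baseline hazard h0(t) appears in every subject's hazard, it cancels when two subjects are compared, which leaves a hazard ratio that depends only on the coefficients and the covariate differences. A minimal sketch, with illustrative names and made-up values:

```python
import math


def hazard_ratio(coefficients, x_a, x_b):
    """Hazard of subject A relative to subject B under a Cox model.

    ln(h_A(t)) - ln(h_B(t)) = sum of b_k * (x_Ak - x_Bk); h0(t) cancels.
    """
    log_hr = sum(b * (a - c) for b, a, c in zip(coefficients, x_a, x_b))
    return math.exp(log_hr)


# Illustrative: a single binary covariate with coefficient ln(2)
print(hazard_ratio([math.log(2)], [1], [0]))
```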

Question 16

Q

What should you use to determine the variance explained by a multiple Cox regression model and to compare two nested models?

Answer

A

Variance: generalized R2
Nested models: Likelihood Ratio test

Question 17

Q

There are three procedures for variable selection for a multiple regression model. Name and explain these.

Answer

A

Forward: begin with a null model and add variables sequentially (each time adding the variable with the lowest p-value).
Backward: begin with a full model and remove variables sequentially (each time removing the variable with the highest p-value).
Stepwise: extends the forward procedure by allowing variables to be removed again once they are no longer significant.
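
The forward procedure can be sketched as a simple loop. This is an outline under stated assumptions, not a full implementation: `p_value_of` is a hypothetical callable that refits the model with the already selected variables plus one candidate and returns that candidate's p-value, and the entry threshold `alpha` of 0.05 is an assumed default.

```python
def forward_selection(candidates, p_value_of, alpha=0.05):
    """Forward variable selection.

    candidates: names of the variables that may enter the model
    p_value_of(variable, selected): hypothetical callable returning the
        p-value of `variable` when added to the model containing `selected`
    alpha: assumed entry threshold
    """
    selected = []                  # start from the null model
    remaining = list(candidates)
    while remaining:
        p_values = {v: p_value_of(v, selected) for v in remaining}
        best = min(p_values, key=p_values.get)   # candidate with the lowest p-value
        if p_values[best] >= alpha:
            break                  # no remaining candidate is significant
        selected.append(best)
        remaining.remove(best)
    return selected
```

Backward selection mirrors this loop: start with all variables and repeatedly drop the one with the highest p-value until every remaining variable is significant.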

(17 cards)