Lecture 9 Flashcards
What are the two ways to control confounding effects?
Design-based methods: Randomization!
* ‘Strong control’ of ‘any confounding’
* Causal claims (IV → DV) can be made
Model-based methods: GLM
* Add potential confounders to GLM as IVs
* Strong assumptions: you’ve collected all confounding variables.
* Causal claims (IV → DV or DV → IV) cannot be made.
* Still, it is more robust than a Pearson correlation computed without adjusting for confounders
What are residuals?
The remainder of the DV after the covariate effect has been removed
What is partial correlation?
The Pearson correlation between the residuals (the ‘noise’ parts) of the IV and the DV, computed and tested after the confounding effect has been removed (‘regressed out’) from both. If the original IV–DV correlation was driven by the confounder, it becomes insignificant after this adjustment
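As a sketch of this idea (hypothetical simulated data; all variable names are illustrative), the partial correlation is the correlation of the two residual series after regressing the confounder out of both IV and DV:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Toy scenario: a confounder z drives both the IV and the DV,
# so they correlate even though neither causes the other.
z = rng.normal(size=n)                 # confounder
iv = 2.0 * z + rng.normal(size=n)
dv = 3.0 * z + rng.normal(size=n)

def residuals(y, x):
    """Residuals of y after regressing out x (with an intercept)."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

r_raw = np.corrcoef(iv, dv)[0, 1]                # inflated by the confounder
r_partial = np.corrcoef(residuals(iv, z),
                        residuals(dv, z))[0, 1]  # 'noise' parts only
```

Here `r_raw` is large purely because of the shared confounder, while `r_partial` is near zero once z is regressed out of both variables.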
Q: Can we just put all possible confounders in GLM?
Conceptually, yes, as long as you have a large enough sample size. Adding 100 covariates would be okay if you have 5000 subjects, but problematic if you have 150 subjects (loss of power, overfitting, …). That’s why design-based control (randomization) is so useful. More importantly, use your domain knowledge to pre-select potentially critical confounders.
Q: If I know all confounding variables, then does GLM/partial correlation control these perfectly?
Still no, because you ‘assume’ that the confounders affect the DV and IV only linearly, which isn’t necessarily true.
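A small simulated illustration of this limitation (toy data; names are made up): when the confounder acts through its square, linearly regressing it out removes essentially nothing, and a spurious correlation survives the ‘perfect’ adjustment:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000

# The confounder z acts on both variables through z**2 (nonlinearly).
z = rng.normal(size=n)
iv = z ** 2 + rng.normal(size=n)
dv = z ** 2 + rng.normal(size=n)

def residuals(y, x):
    """Residuals of y after a *linear* regression on x."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

# cov(z**2, z) = 0 for standard normal z, so the linear adjustment
# removes almost none of the confounding.
r_adjusted = np.corrcoef(residuals(iv, z), residuals(dv, z))[0, 1]
```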
What is R^2?
R2 is the coefficient of determination: the squared correlation between the DV and the predicted DV.
It is interpreted as the proportion of variability of the DV explained by the IVs (0 ≤ R2 ≤ 1). As more IVs are added, the value of R2 gets closer to 1.
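A minimal numpy sketch (toy data) showing that the two definitions above — squared correlation with the prediction, and proportion of variability explained — agree for an OLS fit with an intercept:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
x = rng.normal(size=n)
y = 1.5 * x + rng.normal(size=n)       # toy IV -> DV data

# OLS fit with an intercept.
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta

# Definition 1: squared correlation between DV and predicted DV.
r2_corr = np.corrcoef(y, y_hat)[0, 1] ** 2
# Definition 2: proportion of DV variability explained.
r2_var = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
```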
What is Overfitting?
Overfitting is a phenomenon in which a complex (statistical) model fits the observed data so closely that it predicts unobserved data poorly: the model is good for the data, not for the population.
* In multiple linear regression, adding any covariates would lead to an increase in R2
* An increase in R2 does not guarantee a better model for unobserved data
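The first bullet can be demonstrated directly (simulated data; the extra covariates are pure noise with no relationship to the DV, yet training R2 never decreases):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 60
x = rng.normal(size=n)
y = x + rng.normal(size=n)

def train_r2(X, y):
    """Training-set R^2 of an OLS fit (X already contains the intercept)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    y_hat = X @ beta
    return 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

X_small = np.column_stack([np.ones(n), x])
# 20 pure-noise covariates with no relationship to the DV.
X_big = np.column_stack([X_small, rng.normal(size=(n, 20))])

r2_small = train_r2(X_small, y)
r2_big = train_r2(X_big, y)   # never lower than r2_small on the training data
```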
What is the difference between R^2 and adjusted R^2?
R2 increases with additions of any IVs (even when there is no relationship with DV). Adjusted R2 adjusts for the total number of IVs in the model.
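A common formula for adjusted R2 (with n observations and p IVs, excluding the intercept) is 1 − (1 − R2)(n − 1)/(n − p − 1). A quick sketch of the penalty, using made-up numbers:

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 for n observations and p IVs (excluding the intercept)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Same raw fit (R^2 = 0.50), different model sizes: more IVs, bigger penalty.
few = adjusted_r2(0.50, n=100, p=2)    # ~0.49
many = adjusted_r2(0.50, n=100, p=30)  # ~0.28
```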
What do information criteria consider? What value of an IC makes a better model?
Information criteria consider both (i) goodness of fit and (ii) the number of IVs (model complexity).
The lower the IC, the better the model
What are the two types of IC?
AIC and BIC
When is AIC preferred?
AIC: Akaike Information Criterion - Preferred when you care more about predictive performances
When is BIC preferred?
BIC: Bayesian Information Criterion - Preferred when you care more about variable selection
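For a Gaussian linear model, both criteria can be written (up to an additive constant) in terms of the residual sum of squares. A sketch with made-up RSS values showing how BIC's heavier per-parameter penalty (ln n instead of 2, larger once n > ~7) can flip the model choice:

```python
import numpy as np

def aic_bic(rss, n, k):
    """Gaussian linear-model AIC/BIC up to an additive constant.
    rss: residual sum of squares, n: sample size, k: number of parameters."""
    aic = n * np.log(rss / n) + 2 * k
    bic = n * np.log(rss / n) + k * np.log(n)
    return aic, bic

# Made-up numbers: the bigger model fits better (lower RSS) but uses 7 more
# parameters. AIC rewards the improved fit; BIC's ln(n) penalty does not.
aic_small, bic_small = aic_bic(rss=50.0, n=100, k=3)
aic_big, bic_big = aic_bic(rss=41.0, n=100, k=10)
# AIC prefers the bigger model here, while BIC prefers the smaller one.
```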
What are the biased inferences of p-value-based variable-selection methods?
The resulting R2 is inflated.
It does not control false positives.
These methods suffer from the issue of “double dipping”: the p-value is used twice, once for variable selection and once again for inference
Define cross-validation
Cross-validation: a technique that partitions the original data into several non-overlapping sub-datasets and evaluates the model’s predictive performance.
The key idea of CV is that your test data is never touched before you fit your model to the training data. For each fold, fit each model on the training set and evaluate its performance on the test set. Compute the average performance across folds for each model to decide which model is better.
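A minimal k-fold CV sketch in numpy (hypothetical data and models; real code would typically use scikit-learn's KFold instead):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)

def cv_mse(X, y, k=5):
    """k-fold CV: average test-set MSE of an OLS fit over non-overlapping folds."""
    idx = rng.permutation(len(y))
    errors = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)       # the test fold is never seen in fitting
        beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        errors.append(np.mean((y[fold] - X[fold] @ beta) ** 2))
    return float(np.mean(errors))

# Compare a plausible model against one padded with 10 noise IVs;
# the model with the lower average test MSE across folds is preferred.
X1 = np.column_stack([np.ones(n), x])
X2 = np.column_stack([X1, rng.normal(size=(n, 10))])
mse_simple, mse_padded = cv_mse(X1, y), cv_mse(X2, y)
```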