Lecture 9 Flashcards
What are the two ways to control confounding effects?
Design-based methods: Randomization!
* ‘Strong control’ of ‘any confounding’
* Causal claims (IV → DV) can be made
Model-based methods: GLM
* Add potential confounders to GLM as IVs
* Strong assumptions: you’ve collected all confounding variables.
* Causal claims (IV → DV or DV → IV) cannot be made.
* Still, it is more robust than a Pearson correlation computed without adjusting for confounders
What are residuals?
The remainder of the DV after the covariate effect has been removed
What is partial correlation?
The Pearson correlation between the residuals (the ‘noise’ parts) of the IV and the DV, computed and tested after the confounding effect has been removed (‘regressed out’) from both. If the original IV–DV correlation was driven by the confounder, it becomes insignificant after this adjustment
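As a sketch of this idea (hypothetical simulated data; all variable names are illustrative), the partial correlation is the correlation of the two residual series after regressing the confounder out of both IV and DV:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Toy scenario: a confounder z drives both the IV and the DV,
# so they correlate even though neither causes the other.
z = rng.normal(size=n)                 # confounder
iv = 2.0 * z + rng.normal(size=n)
dv = 3.0 * z + rng.normal(size=n)

def residuals(y, x):
    """Residuals of y after regressing out x (with an intercept)."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

r_raw = np.corrcoef(iv, dv)[0, 1]                # inflated by the confounder
r_partial = np.corrcoef(residuals(iv, z),
                        residuals(dv, z))[0, 1]  # 'noise' parts only
```

Here `r_raw` is large purely because of the shared confounder, while `r_partial` is near zero once z is regressed out of both variables.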
Q: Can we just put all possible confounders in GLM?
Conceptually, yes, as long as you have a large enough sample size. Adding 100 covariates would be okay if you have 5000 subjects, but problematic if you have 150 subjects (loss of power, overfitting, …). That’s why design-based control (randomization) is so useful. More importantly, use your domain knowledge to pre-select potentially critical confounders.
Q: If I know all confounding variables, then does GLM/partial correlation control these perfectly?
Still no, because you ‘assume’ that the confounders affect the DV and IV only linearly, which isn’t necessarily true.
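A small simulated illustration of this limitation (toy data; names are made up): when the confounder acts through its square, linearly regressing it out removes essentially nothing, and a spurious correlation survives the ‘perfect’ adjustment:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000

# The confounder z acts on both variables through z**2 (nonlinearly).
z = rng.normal(size=n)
iv = z ** 2 + rng.normal(size=n)
dv = z ** 2 + rng.normal(size=n)

def residuals(y, x):
    """Residuals of y after a *linear* regression on x."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

# cov(z**2, z) = 0 for standard normal z, so the linear adjustment
# removes almost none of the confounding.
r_adjusted = np.corrcoef(residuals(iv, z), residuals(dv, z))[0, 1]
```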
What is R^2?
R2 is the coefficient of determination: the squared correlation between the DV and the predicted DV.
It is interpreted as the proportion of variability of the DV explained by the IVs (0 ≤ R2 ≤ 1). As more IVs are added, the value of R2 gets closer to 1.
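A minimal numpy sketch (toy data) showing that the two definitions above — squared correlation with the prediction, and proportion of variability explained — agree for an OLS fit with an intercept:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
x = rng.normal(size=n)
y = 1.5 * x + rng.normal(size=n)       # toy IV -> DV data

# OLS fit with an intercept.
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta

# Definition 1: squared correlation between DV and predicted DV.
r2_corr = np.corrcoef(y, y_hat)[0, 1] ** 2
# Definition 2: proportion of DV variability explained.
r2_var = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
```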
What is Overfitting?
Overfitting is a phenomenon in which a complex (statistical) model fits the observed data so closely that it predicts unobserved data poorly: the model is good for the data, not for the population.
* In multiple linear regression, adding any covariates would lead to an increase in R2
* An increase in R2 does not guarantee a better model for unobserved data
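The first bullet can be demonstrated directly (simulated data; the extra covariates are pure noise with no relationship to the DV, yet training R2 never decreases):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 60
x = rng.normal(size=n)
y = x + rng.normal(size=n)

def train_r2(X, y):
    """Training-set R^2 of an OLS fit (X already contains the intercept)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    y_hat = X @ beta
    return 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

X_small = np.column_stack([np.ones(n), x])
# 20 pure-noise covariates with no relationship to the DV.
X_big = np.column_stack([X_small, rng.normal(size=(n, 20))])

r2_small = train_r2(X_small, y)
r2_big = train_r2(X_big, y)   # never lower than r2_small on the training data
```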
What is the difference between R^2 and adjusted R^2?
R2 increases with additions of any IVs (even when there is no relationship with DV). Adjusted R2 adjusts for the total number of IVs in the model.
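A common formula for adjusted R2 (with n observations and p IVs, excluding the intercept) is 1 − (1 − R2)(n − 1)/(n − p − 1). A quick sketch of the penalty, using made-up numbers:

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 for n observations and p IVs (excluding the intercept)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Same raw fit (R^2 = 0.50), different model sizes: more IVs, bigger penalty.
few = adjusted_r2(0.50, n=100, p=2)    # ~0.49
many = adjusted_r2(0.50, n=100, p=30)  # ~0.28
```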
What do information criteria consider? What value of an IC makes a better model?
Information criteria consider both (i) goodness of fit and (ii) the number of IVs (model complexity).
The lower the IC, the better the model
What are the two types of IC?
AIC and BIC
When is AIC preferred?
AIC: Akaike Information Criterion - Preferred when you care more about predictive performances
When is BIC preferred?
BIC: Bayesian Information Criterion - Preferred when you care more about variable selection
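For a Gaussian linear model, both criteria can be written (up to an additive constant) in terms of the residual sum of squares. A sketch with made-up RSS values showing how BIC's heavier per-parameter penalty (ln n instead of 2, larger once n > ~7) can flip the model choice:

```python
import numpy as np

def aic_bic(rss, n, k):
    """Gaussian linear-model AIC/BIC up to an additive constant.
    rss: residual sum of squares, n: sample size, k: number of parameters."""
    aic = n * np.log(rss / n) + 2 * k
    bic = n * np.log(rss / n) + k * np.log(n)
    return aic, bic

# Made-up numbers: the bigger model fits better (lower RSS) but uses 7 more
# parameters. AIC rewards the improved fit; BIC's ln(n) penalty does not.
aic_small, bic_small = aic_bic(rss=50.0, n=100, k=3)
aic_big, bic_big = aic_bic(rss=41.0, n=100, k=10)
# AIC prefers the bigger model here, while BIC prefers the smaller one.
```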
What are the biased inferences of p-value-based variable-selection methods?
The resulting R2 is inflated.
It does not control false positives.
These methods suffer from the issue of “double dipping”: the p-value is used twice, once for variable selection and once again for inference
Define cross-validation
Cross-validation: a technique that partitions the original data into several non-overlapping sub-datasets and evaluates the model’s predictive performance.
The key idea of CV is that your test data is never touched before you fit your model to the training data. For each fold, fit each model on the training set and evaluate its performance on the test set. Compute the average performance across folds for each model to decide which model is better.
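A minimal k-fold CV sketch in numpy (hypothetical data and models; real code would typically use scikit-learn's KFold instead):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)

def cv_mse(X, y, k=5):
    """k-fold CV: average test-set MSE of an OLS fit over non-overlapping folds."""
    idx = rng.permutation(len(y))
    errors = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)       # the test fold is never seen in fitting
        beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        errors.append(np.mean((y[fold] - X[fold] @ beta) ** 2))
    return float(np.mean(errors))

# Compare a plausible model against one padded with 10 noise IVs;
# the model with the lower average test MSE across folds is preferred.
X1 = np.column_stack([np.ones(n), x])
X2 = np.column_stack([X1, rng.normal(size=(n, 10))])
mse_simple, mse_padded = cv_mse(X1, y), cv_mse(X2, y)
```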