Lecture 9 Flashcards

1
Q

What are the two ways to control confounding effects?

A

Design-based methods: randomization!
* ‘Strong control’ of ‘any confounding’
* Causal claims (IV → DV) can be made

Model-based methods: GLM
* Add potential confounders to the GLM as IVs
* Strong assumption: you’ve collected all confounding variables.
* Causal claims (IV → DV or DV → IV) cannot be made.
* Still, it is more robust than a Pearson correlation computed without adjusting for confounders (see the sketch below)
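A minimal sketch of the model-based approach on simulated data (all variable names and effect sizes are hypothetical, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
z = rng.normal(size=n)            # confounder
x = 0.8 * z + rng.normal(size=n)  # IV, partly driven by the confounder
y = 0.8 * z + rng.normal(size=n)  # DV, driven by the confounder only

# Unadjusted model: y ~ 1 + x (picks up a spurious x effect)
X_unadj = np.column_stack([np.ones(n), x])
b_unadj, *_ = np.linalg.lstsq(X_unadj, y, rcond=None)

# Adjusted model: y ~ 1 + x + z (confounder added to the GLM as an IV)
X_adj = np.column_stack([np.ones(n), x, z])
b_adj, *_ = np.linalg.lstsq(X_adj, y, rcond=None)

print("x coefficient, unadjusted:", b_unadj[1])  # biased away from 0
print("x coefficient, adjusted:  ", b_adj[1])    # close to 0
```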

2
Q

What are residuals?

A

The remainder of the DV after the covariate effect has been removed.

3
Q

What is partial correlation?

A

Compute and test the Pearson correlation between the ‘noise’ parts of the DV and the IV, i.e., their residuals after the confounder is regressed out.

After removing (‘regressing out’) the confounding effect from both the IV and the DV and testing their correlation, an association that was driven by the confounder becomes insignificant (see the sketch below).
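A minimal sketch of partial correlation on simulated data (variable names are hypothetical; scipy is assumed to be available):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 200
z = rng.normal(size=n)      # confounder
x = z + rng.normal(size=n)  # IV
y = z + rng.normal(size=n)  # DV

def residualize(v, z):
    """Residuals of v after regressing v on an intercept and z."""
    Z = np.column_stack([np.ones(len(z)), z])
    beta, *_ = np.linalg.lstsq(Z, v, rcond=None)
    return v - Z @ beta

r_raw, p_raw = stats.pearsonr(x, y)  # inflated by the shared confounder z
r_par, p_par = stats.pearsonr(residualize(x, z), residualize(y, z))
print(f"raw:     r = {r_raw:.2f}, p = {p_raw:.3g}")  # significant
print(f"partial: r = {r_par:.2f}, p = {p_par:.3g}")  # typically not
```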

4
Q

Can we just put all possible confounders in the GLM?

A

Conceptually, yes, as long as you have a large enough sample size. Adding 100 covariates would be okay if you had 5,000 subjects, but problematic if you had 150 subjects (loss of power, overfitting, …). That’s why design-based control (randomization) is so useful. More important still is to use your domain knowledge to pre-select the potentially critical confounders.

5
Q

If I know all confounding variables, does the GLM/partial correlation control them perfectly?

A

Still no, because you ‘assume’ that the confounders affect the DV and the IV only linearly, which isn’t necessarily true.

6
Q

What is R²?

A

R² is the coefficient of determination: the squared correlation between the observed DV and the predicted DV. It quantifies how much of the variability in the data the model explains.

It is interpreted as the proportion of the variability of the DV explained by the IVs (0 ≤ R² ≤ 1). As more IVs are added, R² gets closer to 1 (see the sketch below).
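A minimal sketch verifying the two equivalent readings of R² for an OLS fit with an intercept (data and names are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)

X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta

# Reading 1: squared correlation between DV and predicted DV
r2_corr = np.corrcoef(y, y_hat)[0, 1] ** 2

# Reading 2: proportion of DV variability explained, 1 - SSE/SST
r2_var = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

print(r2_corr, r2_var)  # agree up to floating-point error
```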

7
Q

What is overfitting?

A

Overfitting is a phenomenon in which a complex (statistical) model fits the observed data so closely that it fails to predict unobserved data: the model is good for the sample, not the population.
* In multiple linear regression, adding any covariate leads to an increase in R²
* An increase in R² does not guarantee a better model for unobserved data (see the sketch after this list)
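A minimal sketch of the first bullet, fitting pure-noise covariates to a DV they have no relationship with (sample size and covariate counts are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50
y = rng.normal(size=n)  # DV unrelated to every IV below
X = np.ones((n, 1))     # start from an intercept-only model

for k in range(1, 41):
    X = np.column_stack([X, rng.normal(size=n)])  # add one pure-noise IV
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    y_hat = X @ beta
    r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
    if k % 10 == 0:
        print(f"{k:2d} noise IVs: R^2 = {r2:.2f}")  # climbs toward 1
```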

8
Q

What is the difference between R² and adjusted R²?

A

R² increases with the addition of any IV (even one that has no relationship with the DV). Adjusted R² adjusts for the total number of IVs in the model, so it can decrease when a useless IV is added (see the sketch below).
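The standard adjustment, written as a small helper (the numbers in the usage lines are made up for illustration):

```python
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1),
    with n observations and p IVs (excluding the intercept)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# A useless 4th IV that nudges R^2 from 0.50 to 0.51 still
# lowers adjusted R^2:
print(adjusted_r2(0.50, n=30, p=3))  # ~0.442
print(adjusted_r2(0.51, n=30, p=4))  # ~0.432
```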

9
Q

What do information criteria consider? What IC value indicates a better model?

A

Information criteria consider both (i) goodness of fit and (ii) the number of IVs (model complexity).

The lower the IC, the better the model (formulas sketched below).
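The usual definitions, with k estimated parameters, n observations, and maximized log-likelihood log L̂ (standard formulas, written as small helpers):

```python
import numpy as np

def aic(log_lik: float, k: int) -> float:
    """AIC = 2k - 2 log(L-hat): fit plus a penalty of 2 per parameter."""
    return 2 * k - 2 * log_lik

def bic(log_lik: float, k: int, n: int) -> float:
    """BIC = k log(n) - 2 log(L-hat): a harsher penalty once n > ~7."""
    return k * np.log(n) - 2 * log_lik
```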

10
Q

What are the two types of IC?

A

AIC and BIC

11
Q

When is AIC preferred?

A

AIC (Akaike Information Criterion): preferred when you care more about predictive performance.

12
Q

When is BIC preferred?

A

BIC (Bayesian Information Criterion): preferred when you care more about variable selection.

13
Q

What biased inferences result from p-value-based variable selection methods?

A

* The resulting R² is inflated.
* False positives are not controlled.
* These methods suffer from “double dipping”: the p-value is used twice, once for variable selection and once again for inference.

14
Q

Define cross-validation.

A

Cross-validation (CV): a technique that partitions the original data into several non-overlapping subsets and evaluates the model’s predictive performance.

The key idea of CV is that your test data is never touched before you fit your model to the training data. For each fold, fit each model using the training set and evaluate its performance on the test set. Compute the average performance across folds for each model to decide which model is better (see the sketch below).
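A minimal 5-fold sketch using scikit-learn’s KFold and LinearRegression (the library choice and the simulated data are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(4)
X = rng.normal(size=(150, 3))
y = X @ np.array([1.0, 0.5, 0.0]) + rng.normal(size=150)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train_idx, test_idx in kf.split(X):
    # Fit on the training folds only; the test fold stays untouched
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))  # test-set R^2

print("mean test R^2 across folds:", np.mean(scores))
```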
