Model Selection Flashcards
Prediction & explanation
If the aim is to predict future outcomes:
- variables in the model = covariates
- predictive model
- automatic variable selection is allowed
If the aim is to explain / find causal relationships:
- variables in the model = explanatory variables
- explanatory model
- NO automatic variable selection strategies allowed
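AIC or AICc or BIC
AICc is AIC with a correction for small n
AICc should be used when n/p < 40 (n = sample size, p = number of parameters)
But it is also the safer choice in general, because it converges to AIC as n grows
BIC penalizes complexity more strongly than AIC (log(n) per parameter instead of 2, so stronger once n >= 8)
In R:
AICc() (e.g. package MuMIn)
stepAIC() (package MASS)
-> coefficients of models chosen this way should not be interpreted, because model selection may lead to biased parameter estimates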
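A minimal sketch of how the three criteria relate, using base R only. The data are simulated and the variable names are made up for illustration; the AICc line uses the usual small-sample correction AIC + 2k(k+1)/(n-k-1), which is assumed to be what `AICc()` reports.

```r
# Simulated data (assumption: not part of the flashcards)
set.seed(1)
n  <- 30
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 + rnorm(n)

fit <- lm(y ~ x1 + x2)

# Number of estimated parameters: intercept + 2 slopes + residual variance
k <- attr(logLik(fit), "df")

aic  <- AIC(fit)                               # -2 logLik + 2 * k
bic  <- BIC(fit)                               # -2 logLik + log(n) * k
aicc <- aic + (2 * k * (k + 1)) / (n - k - 1)  # small-sample correction

c(AIC = aic, AICc = aicc, BIC = bic)
```

Here n/k = 30/4 is far below 40, so the correction term is not negligible; with large n it shrinks towards 0 and AICc approaches AIC.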
Workflow for an explanatory, confirmatory model
- clear hypothesis
- select x according to a priori knowledge
- formulate only one or a few models before analysing the data
Confirmatory vs. Exploratory
Confirmatory:
- Clear hypothesis & a priori selection of regressors for y.
- No variable selection!
- Allowed to interpret the results and draw quantitative conclusions.
Exploratory:
- Build whatever model you want, but the results should only be used to generate new hypotheses, a.k.a. “speculations”.
- Clearly label the results as “exploratory”.
How many variables to include in the model?
No more than n/10 parameters (rule of thumb)
Otherwise: risk of overfitting
(A categorical variable with k = 3 levels already uses up 2 parameters)
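A small sketch of how the parameter count can be checked against the n/10 rule of thumb. The data frame and its column names are hypothetical; the point is only that a 3-level factor contributes two columns to the design matrix.

```r
# Hypothetical data: one numeric covariate and one factor with 3 levels
set.seed(2)
n   <- 60
dat <- data.frame(
  y     = rnorm(n),
  x     = rnorm(n),
  group = factor(rep(c("a", "b", "c"), length.out = n))
)

# Design matrix with default treatment contrasts
X <- model.matrix(y ~ x + group, data = dat)
colnames(X)   # "(Intercept)" "x" "groupb" "groupc"

n_slopes <- ncol(X) - 1   # parameters excluding the intercept
n_slopes                  # 3: one for x, two for the 3-level factor
n_slopes <= n / 10        # TRUE here, since 3 <= 6
```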
Collinearity of covariates
A covariate is collinear if it can be written (almost exactly) as a linear combination of the others
-> e.g. x1 = x2 -> slope coefficients cannot be uniquely determined
-> standard errors too large
-> p-values too large
How to detect collinearity
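Variance inflation factor (VIF)
VIF_j = 1 / (1 - R_j^2), where R_j^2 is the R^2 from regressing covariate x_j on all other covariates
If R_j^2 is close to 1 -> VIF_j is large -> high collinearity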
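The VIF formula can be sketched by hand in base R: regress each covariate on all the others and plug the resulting R^2 into 1 / (1 - R^2). The data are simulated for illustration, with x2 constructed as a near copy of x1. In practice a ready-made function such as `vif()` from the car package is commonly used instead.

```r
# Simulated covariates (assumption: illustration only)
set.seed(3)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)   # almost a copy of x1
x3 <- rnorm(n)                  # unrelated to x1 and x2

X <- data.frame(x1, x2, x3)

vif_by_hand <- sapply(names(X), function(j) {
  # Regress covariate j on all other covariates and take the R^2
  others <- setdiff(names(X), j)
  r2 <- summary(lm(reformulate(others, response = j), data = X))$r.squared
  1 / (1 - r2)
})

vif_by_hand   # large for x1 and x2, close to 1 for x3
```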
What to do against collinearity
- Avoid it
- do not include variables with an unacceptably high VIF
- be aware of it
- interpret results with care
-> Note: in a predictive model collinearity is less of a problem -> AIC-based selection tends to drop redundant variables
What to preregister for explanatory models
- which transformations I would try
- which model simplifications will be considered
- how I deal with outliers
- how I treat missing values
- how I treat collinear variables
-> analyse the data following this protocol: fit the model and check whether the assumptions are met
- if the assumptions are not met, adapt the model as outlined in the protocol
- interpret model coefficients and p-values properly
-> any additional analyses: exploratory
Post hoc/ a posteriori variable selection
AIC
BIC
dredge() (package MuMIn)
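A minimal sketch of post hoc selection with base R's `step()`, used here as a stand-in for the stepwise tools named on this card. The data are simulated: only x1 truly affects y, while x2 to x5 are pure noise. Setting `k = log(n)` switches the penalty from AIC to BIC.

```r
# Simulated data (assumption: illustration only)
set.seed(5)
n   <- 100
dat <- data.frame(matrix(rnorm(n * 5), n, 5))
names(dat) <- paste0("x", 1:5)
dat$y <- 1 + 2 * dat$x1 + rnorm(n)

full <- lm(y ~ ., data = dat)

# Backward elimination; trace = 0 suppresses the step-by-step printout
sel_aic <- step(full, direction = "backward", trace = 0)              # AIC penalty
sel_bic <- step(full, direction = "backward", trace = 0, k = log(n))  # BIC penalty

formula(sel_aic)
formula(sel_bic)
```

The selected formulas depend on the random draw, so noise variables can survive by chance. This is one reason such results count as exploratory.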
Problem with collinearity
The standard errors of the parameter estimates are too large -> thus p-values are too large (conservative)
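The inflation of standard errors can be seen in a small simulation. This is a sketch with made-up data: the same slope for x1 is estimated once alone and once together with a near copy x2.

```r
# Simulated data (assumption: illustration only)
set.seed(4)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)   # nearly collinear with x1
y  <- 1 + 2 * x1 + rnorm(n)

fit_alone     <- lm(y ~ x1)
fit_collinear <- lm(y ~ x1 + x2)

# Standard error of the x1 slope in each model
se_alone     <- coef(summary(fit_alone))["x1", "Std. Error"]
se_collinear <- coef(summary(fit_collinear))["x1", "Std. Error"]

c(alone = se_alone, collinear = se_collinear)
```

The standard error of x1 grows roughly by the factor sqrt(VIF) once x2 is added, which is what pushes the p-values up.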