L4 - Other Regression Models, Data Splits, Comparing Models Flashcards
How can we improve on the simple linear regression model?
By using polynomial regression, which gives the model a greater capacity to learn from the data.
When using polynomial regression, what do we need to be careful of?
We must ensure that the model is not overfit.
We want to learn from the data, not learn the data.
What is the ‘Order of Polynomial’?
The highest power of the predictor variable that appears in the polynomial regression equation (i.e. the degree of the polynomial).
What is the relationship between polynomial regression complexity and the order of polynomial?
As the order increases, so does the model complexity.
Is polynomial regression linear?
Yes.
What determines whether the model is non-linear or not?
The order of the coefficients (Beta), not of the predictor: as long as each Beta appears only to the first power, the model is linear, regardless of the powers of x.
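A minimal sketch of this point, using numpy and synthetic data (the data, the order-2 choice, and the coefficient values are assumptions for illustration): the design matrix contains powers of x, but the fit is an ordinary least-squares problem that is linear in Beta.

```python
import numpy as np

# Hypothetical 1-D data generated from an order-2 polynomial plus noise
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 30)
y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(scale=0.1, size=x.shape)

# Order-2 polynomial regression: design matrix with columns [1, x, x^2]
X = np.vander(x, N=3, increasing=True)

# Linear in the coefficients Beta, so ordinary least squares solves it
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)  # approximately [1, 2, -3]
```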
What is the Bayesian Information Criterion (BIC)?
A criterion for comparing the performance of versions of a model fitted with different numbers of parameters.
E.g. Model A with 2 parameters vs. Model A with 4 parameters.
What does BIC measure?
Both model complexity (the number of parameters) and model error (how well the model fits the data).
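For reference, the standard definition (not stated on the card) for a model with k parameters, n data points and maximised likelihood L-hat is given below; the second line is the special case for least-squares regression with Gaussian errors.

```latex
\mathrm{BIC} = k \ln n \;-\; 2 \ln \hat{L}
\qquad
\text{(Gaussian errors)}\quad
\mathrm{BIC} = n \ln\!\left(\frac{\mathrm{RSS}}{n}\right) + k \ln n + \text{const}
```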
Using BIC, how can we establish the best performing model?
The best performing version of the model will be the one with the lowest BIC score, i.e. the lowest point on the BIC graph.
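A hedged sketch of this selection loop in Python, using the Gaussian-error form of BIC above. The synthetic data, the range of orders tried, and the helper name bic_for_order are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 50)
y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(scale=0.1, size=x.shape)
n = len(x)

def bic_for_order(order):
    # Fit a polynomial of the given order by least squares
    X = np.vander(x, N=order + 1, increasing=True)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    k = order + 1  # number of fitted coefficients (intercept included)
    return n * np.log(rss / n) + k * np.log(n)

# Compare versions of the model with different orders; lowest BIC wins
scores = {order: bic_for_order(order) for order in range(1, 8)}
best = min(scores, key=scores.get)
print(best, scores[best])
```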
What is the purpose of test / train splits?
Partitions the dataset into training and test subsets. We train the model on the training set and analyse its performance on the test set.
This indicates how well the model will perform on unseen data.
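A minimal sketch with scikit-learn, assuming it is installed; the synthetic data and the 80/20 split ratio are assumptions, not from the card.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Hold out 20% of the data as the test (holdout) set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
print(model.score(X_test, y_test))  # R^2 on unseen data
```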
What is another name for the test set?
Holdout set.
What are the 2 main issues with conducting train / test splits?
Data Leakage - Test set data can leak into the training set and influence the training of the model, e.g. duplicate data appearing in both subsets.
Weak Model - If the model performs poorly, we have to re-split the data and train again.
What is the purpose of Cross-Validation?
Protect the model against overfitting.
How does K-fold Cross-Validation work?
- Shuffle the dataset and split it into K groups.
- For each group, use it as the test set and the remainder of the data as the training set.
- Perform train-test on all K groups.
- Summarise model performance based on the model error across the folds (see the sketch below).
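A short sketch of the procedure above using scikit-learn's KFold; K=5, the synthetic data, and mean squared error as the summary metric are all assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
errors = []
for train_idx, test_idx in kf.split(X):
    # Each group takes a turn as the test set; the rest is the training set
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    errors.append(mean_squared_error(y[test_idx], model.predict(X[test_idx])))

# Summarise model performance across the K folds
print(np.mean(errors), np.std(errors))
```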
What are the 2 sampling methods for balancing datasets?
Random - Can lead to unequal group sizes.
Stratified - Ensures cross-validation is a close approximation of the generalisation error (the two are contrasted in the sketch below).
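A small sketch contrasting the two samplers, assuming a binary-labelled, imbalanced dataset and scikit-learn: StratifiedKFold preserves the class proportions in every fold, whereas a plain shuffled KFold may not.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical imbalanced labels: 90 of class 0, 10 of class 1
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((100, 1))  # features are irrelevant to the split itself

splitters = [
    ("random", KFold(n_splits=5, shuffle=True, random_state=0)),
    ("stratified", StratifiedKFold(n_splits=5, shuffle=True, random_state=0)),
]
for name, splitter in splitters:
    # Fraction of class 1 in each test fold
    fractions = [y[test].mean() for _, test in splitter.split(X, y)]
    print(name, np.round(fractions, 2))
```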