How to validate model quality Flashcards
Train-Test Split
Train-Test Split Evaluation
The train-test split is a technique for evaluating the performance of a machine learning algorithm.
It applies to any supervised learning algorithm, for both classification and regression problems.
What happens in a train-test split
The procedure involves taking a dataset and dividing it into two subsets.
The first subset is used to fit the model and is referred to as the training dataset.
The second subset is not used to train the model; instead, the input element of the dataset is provided to the model, then predictions are made and compared to the expected values. This second dataset is referred to as the test dataset.
Train Dataset: Used to fit the machine learning model.
Test Dataset: Used to evaluate the fitted machine learning model.
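The procedure above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: the helpers `split_rows` and `fit_line`, the toy data, and the choice of mean absolute error as the comparison metric are all hypothetical and chosen only for this example.

```python
import random


def split_rows(rows, test_fraction=0.2, seed=0):
    """Shuffle rows and split them into (train, test) lists (hypothetical helper)."""
    shuffled = rows[:]  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def fit_line(rows):
    """Least-squares fit of y = slope * x + intercept on (x, y) pairs."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Toy regression data (an assumption for illustration): y = 3x + 2 plus noise.
rng = random.Random(42)
data = [(x, 3 * x + 2 + rng.gauss(0, 1)) for x in range(100)]

# 1. Divide the dataset into two subsets.
train, test = split_rows(data, test_fraction=0.2, seed=1)

# 2. Fit the model on the training dataset only.
slope, intercept = fit_line(train)

# 3. Feed only the test inputs to the model, then compare the
#    predictions with the expected values.
predictions = [slope * x + intercept for x, _ in test]
expected = [y for _, y in test]
mae = sum(abs(p - e) for p, e in zip(predictions, expected)) / len(test)

print(f"train={len(train)} test={len(test)} test MAE={mae:.3f}")
```

The test outputs are never shown to the model during fitting; they are used only in the final comparison step.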
Objective of train-test split
The objective is to estimate the performance of the machine learning model on new data, that is, data not used to train the model.
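A small sketch of why the estimate has to come from data the model has not seen. The model here is a deliberately simple one that memorises its training rows (a 1-nearest-neighbour lookup); the data, the `predict` function, and the `mean_abs_error` helper are assumptions made for illustration only.

```python
import random

# Toy data (an assumption for illustration): y = 2x plus noticeable noise.
rng = random.Random(0)
data = [(x, 2 * x + rng.gauss(0, 5)) for x in range(50)]
rng.shuffle(data)
train, test = data[:40], data[40:]


def predict(x):
    """Return the y of the training row whose x is closest to the input."""
    nearest = min(train, key=lambda row: abs(row[0] - x))
    return nearest[1]


def mean_abs_error(rows):
    """Average absolute gap between predictions and expected values."""
    return sum(abs(predict(x) - y) for x, y in rows) / len(rows)


train_error = mean_abs_error(train)  # the model has memorised these rows
test_error = mean_abs_error(test)    # these rows are new to the model

print(f"error on training data: {train_error:.3f}")
print(f"error on test data:     {test_error:.3f}")
```

Because every training input finds itself as its nearest neighbour, the error on the training data is zero, which says nothing about how the model behaves on new examples. Only the error on the test data gives a usable estimate.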
When do we use a train-test split
The split mirrors how we expect to use the model in practice: we fit it on available data with known inputs and outputs, then make predictions on new examples for which we do not have the expected output or target values.
The train-test procedure is appropriate when there is a sufficiently large dataset available.
How to Configure the Train-Test Split
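The procedure has one main configuration parameter: the size of the train and test sets. This is most commonly expressed as a fraction between 0 and 1 (equivalently, a percentage between 0% and 100%) for either the train or the test dataset. For example, a test fraction of 0.2 leaves 80% of the data for training.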
You must choose a split percentage that meets your project’s objectives with considerations that include:
- Computational cost in training the model.
- Computational cost in evaluating the model.
- Training set representativeness.
- Test set representativeness.
Common split percentages include:
Train: 80%, Test: 20%
Train: 67%, Test: 33%
Train: 50%, Test: 50%
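To make the configuration concrete, the sketch below turns a test fraction into row counts for the three splits listed above. The helper `split_sizes`, the dataset size of 1000 rows, and the choice to round the test count to the nearest integer are assumptions for illustration; a library may round differently.

```python
def split_sizes(n_samples, test_fraction):
    """Return (n_train, n_test) for a given test fraction (hypothetical helper)."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    n_test = round(n_samples * test_fraction)  # rounding rule is an assumption
    return n_samples - n_test, n_test


# The three common splits, applied to an assumed dataset of 1000 rows.
for test_fraction in (0.20, 0.33, 0.50):
    n_train, n_test = split_sizes(1000, test_fraction)
    print(f"test fraction {test_fraction:.2f}: train={n_train}, test={n_test}")
```

For 1000 rows this gives 800/200, 670/330, and 500/500 train/test rows respectively. In scikit-learn, the `train_test_split` function takes the same kind of value through its `test_size` (or `train_size`) parameter.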
Training and Test Data in Python Machine Learning
A machine learning model works with a dataset in two stages: training and testing.
We usually split the data roughly 80% for the training stage and 20% for the testing stage.
Under supervised learning in Python, we split the dataset into training data and test data.