Chapter 4: Training Models Flashcards
What does a “closed-form equation” refer to in the context of training machine learning models?
This is an equation that directly computes the model parameters that best fit the training data, i.e. the parameters that minimise the cost function over the training set.
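For linear regression, this closed-form solution is the Normal Equation, $\hat{\boldsymbol{\theta}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}$. A minimal NumPy sketch, using made-up toy data:

```python
import numpy as np

# Toy data: y ≈ 4 + 3x plus Gaussian noise (values are purely illustrative)
rng = np.random.default_rng(42)
X = 2 * rng.random((100, 1))
y = 4 + 3 * X[:, 0] + rng.standard_normal(100)

X_b = np.c_[np.ones((100, 1)), X]  # add x0 = 1 (the bias feature) to every instance
theta_best = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y  # the Normal Equation

print(theta_best)  # roughly [4, 3]
```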
What does “gradient descent” refer to in the context of training machine learning models?
Gradient Descent is an iterative optimisation method that gradually tweaks the model parameters to minimise the cost function over the training set, rather than computing them directly.
Simply, how does a linear model, such as linear regression, make a prediction?
By simply computing the weighted sum of the input features, plus a constant called the bias term (intercept).
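In equation form (a standard formulation, with $\theta_0$ the bias term and $\theta_1, \dots, \theta_n$ the feature weights for the $n$ input features $x_1, \dots, x_n$):

$$\hat{y} = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n$$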
What is a cost (loss) function?
A cost (loss) function measures how well or poorly a given model fits the training data. The aim in training is to minimise this function.
What are examples of common models and their cost functions?
- Linear regression: mean squared error (MSE) cost function.
- Logistic regression: log-loss.
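For reference, the standard forms of these two cost functions over $m$ training instances are:

$$\mathrm{MSE}(\boldsymbol{\theta}) = \frac{1}{m}\sum_{i=1}^{m}\left(\boldsymbol{\theta}^\top \mathbf{x}^{(i)} - y^{(i)}\right)^2$$

$$J(\boldsymbol{\theta}) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log\hat{p}^{(i)} + \left(1-y^{(i)}\right)\log\left(1-\hat{p}^{(i)}\right)\right]$$

where $\hat{p}^{(i)}$ is the predicted probability that instance $i$ belongs to the positive class.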
Why might the performance/business metrics used to evaluate a model differ from the cost function used to train it? Give an example.
Good training cost functions are chosen to be easy to optimise, whereas performance metrics should be as close to the final business objective as possible. Ideally, the training cost function is both easy to optimise and strongly correlated with the performance metric.
An example of this is logistic regression:
- Trained using log-loss, which is easy to minimise.
- Evaluated using precision and recall, which relate more closely to business metrics (see the sketch below).
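A minimal scikit-learn sketch of this split between training objective and evaluation metrics (the synthetic dataset is purely illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data, for illustration only
X, y = make_classification(n_samples=1000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)  # trained by minimising log-loss

y_pred = clf.predict(X_test)
print("log-loss: ", log_loss(y_test, clf.predict_proba(X_test)))  # training objective
print("precision:", precision_score(y_test, y_pred))              # business-facing metric
print("recall:   ", recall_score(y_test, y_pred))                 # business-facing metric
```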
How does gradient descent work?
- Measures the gradient of the cost function with respect to the parameter vector theta, starting from a (typically random) initial point.
- Takes a step in the direction of the descending gradient.
- Repeats until it converges to a point where the gradient is (close to) zero, i.e. a minimum (see the sketch below).
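A minimal batch gradient descent sketch for linear regression with the MSE cost (the learning rate and iteration count are arbitrary illustrative choices):

```python
import numpy as np

def batch_gradient_descent(X_b, y, eta=0.1, n_iterations=1000):
    """X_b must already include a leading column of ones (the bias feature)."""
    m = len(y)
    theta = np.random.randn(X_b.shape[1])  # random initialisation
    for _ in range(n_iterations):
        # Gradient of the MSE cost with respect to theta
        gradients = (2 / m) * X_b.T @ (X_b @ theta - y)
        theta -= eta * gradients  # step in the descending direction
    # In practice the loop often stops early, when the gradient's norm
    # drops below some tolerance, rather than at an exact zero.
    return theta
```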
What does the learning rate refer to in gradient descent?
The learning rate determines the size of the steps the algorithm takes at each iteration.
How does a learning rate that is too small affect performance?
- Too small: The algorithm will have to go through many iterations to converge.
How does a learning rate that is too large affect performance?
- Too big: The algorithm may jump across the valley and end up on the other side, possibly at a point with higher loss than before. This can make the algorithm diverge and fail to find a good solution.
What issues can be encountered with gradient descent?
- The random initialisation point may cause the algorithm to converge to a local minimum rather than the global minimum.
- The random initialisation point may land on a plateau of the cost function, causing the algorithm to take a very long time to reach the global minimum, or to give up before reaching it.
What is meant by a convex function?
A function where a line segment connecting any two points on the curve never lies below the curve. This implies that there are no local minima, only a single global minimum.
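Formally (a standard definition): $f$ is convex if, for all $x_1, x_2$ in its domain and all $\lambda \in [0, 1]$,

$$f\left(\lambda x_1 + (1-\lambda)x_2\right) \le \lambda f(x_1) + (1-\lambda) f(x_2).$$

The MSE cost function of linear regression is convex.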
In terms of gradient descent, what are the implications of a convex function?
- Gradient descent is guaranteed to approach the global minimum arbitrarily closely, given enough time and a suitably sized learning rate.
How does the scale of features affect the cost function?
- Same scale: The cost function has the shape of a round bowl; gradient descent can head straight towards the centre (the minimum) and converge quickly.
- Different scales: The cost function has the shape of an elongated bowl, and gradient descent will initially travel in a direction almost orthogonal to the direction of the global minimum, making convergence very slow. Hence, features should be scaled before training (see the sketch below).
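A minimal scikit-learn sketch of scaling features before gradient-based training (the choice of SGDRegressor and its parameters is illustrative):

```python
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# StandardScaler gives every feature zero mean and unit variance, so the cost
# function is closer to a round bowl and gradient descent converges faster.
model = make_pipeline(StandardScaler(), SGDRegressor(max_iter=1000, random_state=42))
# model.fit(X_train, y_train)  # X_train / y_train are hypothetical training data
```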
When using polynomial regression, how can you determine which polynomial degree is optimal for the model?
- Use cross-validation to compute the generalisation error for different polynomial degrees and compare degrees for over/underfitting (see the sketch below).
- Use a learning curve plot.
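A minimal sketch of the cross-validation approach, comparing polynomial degrees on made-up quadratic data (degrees and data are illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Toy quadratic data: y ≈ 0.5x² + x + 2 plus noise
rng = np.random.default_rng(42)
X = 6 * rng.random((100, 1)) - 3
y = 0.5 * X[:, 0] ** 2 + X[:, 0] + 2 + rng.standard_normal(100)

for degree in (1, 2, 10):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    scores = cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=5)
    # Lower held-out MSE is better; degree 1 underfits, degree 10 overfits
    print(degree, -scores.mean())
```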
What is a learning curve?
Learning curves plot training and validation scores for models trained and evaluated using cross-validation, with increasing training set sizes.
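A minimal sketch of generating the underlying numbers with scikit-learn's learning_curve (the data and training sizes are illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve

# Hypothetical regression data
rng = np.random.default_rng(42)
X = 2 * rng.random((100, 1))
y = 4 + 3 * X[:, 0] + rng.standard_normal(100)

train_sizes, train_scores, valid_scores = learning_curve(
    LinearRegression(), X, y,
    train_sizes=np.linspace(0.1, 1.0, 10), cv=5,
    scoring="neg_mean_squared_error",
)
# Plot -train_scores.mean(axis=1) and -valid_scores.mean(axis=1)
# against train_sizes to read off under/overfitting.
```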
How would you interpret this learning curve plot?
Interpretation: The model is underfitting; both curves have reached a plateau, they are close together, and the error is fairly high.
Justification:
Training error:
- The model fits the training data perfectly when the training set is very small (e.g. just one or two instances).
- As new instances are added (the training set grows), the training error rises because the model can no longer fit every instance perfectly.
- The training error reaches a plateau, at which point adding more instances neither improves nor worsens it.
Validation error:
- When the model is trained on few data points, the model does not generalise well on the validation set.
- The model begins to generalise slightly better as training instances increase, resulting in a decreasing validation error.
- However, the model is too simple to fit the data well, so the validation error also reaches a plateau, close to the training error.
How would you interpret this learning curve plot?
Interpretation: The model is overfitting; there is a clear gap between the curves, meaning the model performs significantly better on the training set than on the validation set.