Chapter 4: Training Models Flashcards

1
Q

What does a “closed-form equation” refer to in the context of training machine learning models?

A

This is an equation that directly computes the model parameters that best fit the training data, i.e. the parameters that minimise the cost function over the training set.
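
The Normal Equation for linear regression is the classic example. A minimal sketch on synthetic data:

```python
import numpy as np

# A minimal sketch of the Normal Equation, the closed-form solution
# that minimises the MSE cost for linear regression.
rng = np.random.default_rng(42)
X = 2 * rng.random((100, 1))                    # one synthetic feature
y = 4 + 3 * X + rng.standard_normal((100, 1))   # y = 4 + 3x + noise

X_b = np.c_[np.ones((100, 1)), X]               # prepend x0 = 1 for the bias term
theta_best = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y
print(theta_best)                               # close to [[4.], [3.]]
```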

2
Q

What does “gradient descent” refer to in the context of training machine learning models?

A

Gradient Descent is an iterative optimisation method that gradually tweaks the model parameters to minimise the cost function over the training set, rather than computing them directly.

3
Q

Simply, how does a linear model, such as linear regression, make a prediction?

A

By simply computing the weighted sum of the input features, plus a constant called the bias term (intercept).
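
A minimal sketch of this computation; the parameter and feature values here are illustrative:

```python
import numpy as np

# A minimal sketch of a linear model's prediction: the weighted sum of
# the features plus a bias term. The values below are illustrative.
theta = np.array([4.0, 3.0, -1.5])  # [bias, weight_1, weight_2]
x = np.array([1.0, 2.0, 0.5])       # x0 = 1 pairs with the bias term
y_pred = theta @ x                  # 4 + 3*2 - 1.5*0.5 = 9.25
print(y_pred)
```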

4
Q

What is a cost (loss) function?

A

A loss function is a function that describes how well/poorly a given model fits the data. The aim in training is to minimise this function.

5
Q

What are examples of common models and their cost functions?

A
  • Linear Regression: Mean Squared Error Cost function.
  • Logistic Regression: Log-loss.
6
Q

Why might the performance/business metrics used to evaluate models differ from the cost functions used to train them? Give an example.

A

Generally, good cost functions are the easiest to optimise, whereas performance metrics should be as close to the final business objective as possible. A good training cost function should be easy to optimise and strongly correlated with the performance metric.

An example of this is logistic regression:
- Trained using log-loss, which is easy to minimise.
- Evaluated using precision and recall, which are closer to the business objective.

7
Q

How does gradient descent work?

A
  • Measures the gradient of the cost function, with respect to the parameter vector theta, starting from a (usually random) initial point.
  • Takes a step in the direction of the descending gradient.
  • Terminates once it has found a point where the gradient is zero, i.e. a minimum (see the sketch below).
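
A minimal sketch of batch gradient descent for linear regression with the MSE cost, on synthetic data:

```python
import numpy as np

# A minimal batch gradient descent sketch for linear regression (MSE cost).
rng = np.random.default_rng(42)
X = 2 * rng.random((100, 1))
y = 4 + 3 * X + rng.standard_normal((100, 1))
X_b = np.c_[np.ones((100, 1)), X]    # prepend x0 = 1 for the bias term

eta = 0.1                            # learning rate
n_epochs = 1000
m = len(X_b)

theta = rng.standard_normal((2, 1))  # random initialisation
for _ in range(n_epochs):
    gradients = 2 / m * X_b.T @ (X_b @ theta - y)  # gradient of the MSE cost
    theta -= eta * gradients                       # step down the slope
print(theta)                         # close to [[4.], [3.]]
```
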
8
Q

What does the learning rate refer to in gradient descent?

A

The learning rate is the size of the steps the algorithm makes per iteration.

9
Q

How does a too-small learning rate affect performance?

A
  • Too small: The algorithm will have to go through many iterations to converge.
10
Q

How does a too-large learning rate affect performance?

A
  • Too big: The algorithm may jump across the valley and end up on the other side, possibly at a higher loss than before. This can result in the algorithm diverging and failing to find a good solution.
11
Q

What issues can be encountered in gradient descent?

A
  • The random initialisation point causes the algorithm to converge to a local minimum, rather than a global minimum.
  • The random initialisation point is in a cost-function plateau, causing the algorithm to take a long time to converge to the global minimum, or give up before reaching it.
12
Q

What is meant by a convex function?

A

A function where a line segment connecting any two points on the curve is never below the curve. This implies that there are no local minima, only a single global minimum.

13
Q

In terms of gradient descent, what are the implications of a convex function?

A
  • Gradient Descent is guaranteed to approach arbitrarily closely to the global minimum, given enough time and a correctly sized learning rate.
14
Q

How does the scale of features affect the cost function?

A
  • Same scale: The cost function has the shape of a bowl, so gradient descent can head straight towards the centre (the minimum) and converge quickly.
  • Different scales: The cost function has the shape of an elongated bowl, and gradient descent will first travel in a direction almost orthogonal to the direction of the global minimum. This makes convergence very slow.
15
Q

When using a polynomial regression, how can you determine which polynomial degree is optimal for the model?

A
  • Use cross-validation to compute the generalisation error for different polynomial degrees and compare these for over/underfitting (see the sketch below).
  • Use a learning curve plot.
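
A minimal sketch of the cross-validation approach, assuming scikit-learn and a synthetic quadratic dataset:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# A sketch: score polynomial regressions of increasing degree with
# cross-validation to spot under/overfitting. The data is synthetic.
rng = np.random.default_rng(42)
X = 6 * rng.random((100, 1)) - 3
y = 0.5 * X[:, 0] ** 2 + X[:, 0] + 2 + rng.standard_normal(100)

for degree in (1, 2, 10):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    scores = cross_val_score(model, X, y, cv=5,
                             scoring="neg_mean_squared_error")
    print(degree, -scores.mean())  # degree 2 should score best here
```
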
16
Q

What is a learning curve?

A

Learning curves plot training and validation scores for models trained and evaluated using cross-validation, with increasing training set sizes.
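
A minimal sketch using scikit-learn's learning_curve helper; the estimator and dataset here are illustrative:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve

# A sketch of plotting learning curves; the data is synthetic.
rng = np.random.default_rng(42)
X = 6 * rng.random((100, 1)) - 3
y = 0.5 * X[:, 0] ** 2 + X[:, 0] + 2 + rng.standard_normal(100)

sizes, train_scores, valid_scores = learning_curve(
    LinearRegression(), X, y, cv=5,
    train_sizes=np.linspace(0.1, 1.0, 10),
    scoring="neg_root_mean_squared_error")

plt.plot(sizes, -train_scores.mean(axis=1), "r-+", label="train")
plt.plot(sizes, -valid_scores.mean(axis=1), "b-", label="valid")
plt.xlabel("training set size"); plt.ylabel("RMSE"); plt.legend()
plt.show()
```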

17
Q

How would you interpret this learning curve plot?

A

Interpretation: The model is underfitting; both curves have reached a plateau, and they are close together and fairly high.

Justification:
Training error:
- The model fits the training data perfectly when the sample size is very small (one or two points).
- As new instances are added (the training sample grows), the training error increases because the model can no longer fit every instance exactly.
- The training error reaches a plateau, at which point adding new instances no longer changes performance much.

Validation error:
- When the model is trained on only a few data points, it does not generalise well, so the validation error starts out high.
- The model generalises slightly better as training instances are added, so the validation error decreases.
- However, the model is not a good fit for the data, so the validation error also settles at a (high) plateau.

18
Q

How would you interpret this learning curve plot?

A

Interpretation: Model overfitting, there is a gap between the curves. This means that the model performs significantly better on the training set than the validation set.

19
Q

What does bias refer to in machine learning? Will a high-bias model be likely to under or over fit?

A

This part of the generalisation error is due to wrong assumptions, such as assuming that the data is linear when it is actually quadratic.

A high-bias model is most likely to underfit the training data.

20
Q

What does variance refer to in machine learning? Will a high-variance model be likely to under or over fit?

A

This part is due to the model’s excessive sensitivity to small variations in the training data.

A model with many degrees of freedom (such as a high-degree polynomial model) is likely to have high variance and thus overfit the training data.

21
Q

What does irreducible error refer to in machine learning?

A

This part is due to the noisiness of the data itself. The only way to reduce this part of the error is to clean up the data (e.g., fix the data sources, such as broken sensors, or detect and remove outliers).

22
Q

What is the bias/variance trade-off?

A

Increasing model complexity will reduce bias but increase variance, and vice versa. These two characteristics of the model have to be traded off when considering performance.

23
Q

What does regularisation refer to? List two common models and how they can be regularised.

A

Regularisation is a way of constraining a model to reduce overfitting.

Linear model: Constrain the weights of the model.
Polynomial model: Reduce the degree of the polynomial.

24
Q

What is Ridge Regression?

A

Ridge Regression refers to a linear regression model which has a regularisation term added to the cost function that forces the model to keep the weights as small as possible.
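
A minimal sketch with scikit-learn's Ridge; the data and alpha value here are illustrative:

```python
import numpy as np
from sklearn.linear_model import Ridge

# A sketch of Ridge regression: alpha controls how strongly the
# weights are shrunk towards zero. The data is synthetic.
rng = np.random.default_rng(42)
X = 2 * rng.random((50, 1))
y = 4 + 3 * X[:, 0] + rng.standard_normal(50)

ridge = Ridge(alpha=1.0)  # alpha=0 would recover plain linear regression
ridge.fit(X, y)
print(ridge.intercept_, ridge.coef_)
```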

25
Q

What is LASSO regression?

A

Least Absolute Shrinkage and Selection Operator (LASSO) Regression is the same as Ridge Regression, except that its regularisation term uses the L1 norm of the weight vector rather than the (squared) L2 norm used by Ridge.

26
Q

What is an important characteristic of LASSO regression?

A

It tends to eliminate the weights of the least important features by setting them to zero. LASSO regression therefore automatically performs feature selection and outputs a sparse model with few non-zero weights.

27
Q

What is Elastic Net Regression?

A

Elastic Net Regression is a middle ground between Ridge and LASSO, where the regularisation term on the cost function is the weighted sum of the Ridge and LASSO regularisation terms.

A mix ratio r is used to control the weight of each regularisation term:
  • r = 0: Elastic Net is equivalent to Ridge.
  • r = 1: Elastic Net is equivalent to LASSO.
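
A minimal sketch comparing the three in scikit-learn (its ElasticNet exposes the mix ratio as l1_ratio); the data and alpha values are illustrative:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

# A sketch comparing the three regularised linear models on the same
# synthetic data; LASSO and Elastic Net can zero out weights entirely.
rng = np.random.default_rng(42)
X = rng.random((100, 5))
y = 3 * X[:, 0] + rng.standard_normal(100) * 0.1  # only feature 0 matters

for model in (Ridge(alpha=0.1), Lasso(alpha=0.1),
              ElasticNet(alpha=0.1, l1_ratio=0.5)):
    model.fit(X, y)
    print(type(model).__name__, np.round(model.coef_, 2))
```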

28
Q

When deciding which variety of regression to use, what should be considered?

A
  • It is always preferable to have some degree of regularisation, therefore Ridge should be the default.
  • If you suspect that there are only a few useful features, Elastic Net or LASSO are preferable as they reduce the feature weights of useless features to zero.
  • Elastic Net is generally preferred over LASSO as LASSO may behave erratically when:
    1) The number of features is greater than the number of training instances.
    2) There are strongly correlated features.
29
Q

In the context of model regularisation, what does Early Stopping refer to?

A

Early Stopping refers to the process whereby an iterative learning algorithm is stopped as soon as the validation error reaches a minimum.

30
Q

What is the Early Stopping Method?

A

1) As epochs go by and the model learns, the error on the validation set goes down.
2) The model reaches a point where the validation error stops decreasing and starts increasing. This is a sign of overfitting.
3) Early Stopping halts training at the point where the validation error is at its minimum (see the sketch below).
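
A minimal sketch of this loop, assuming scikit-learn's SGDRegressor and a synthetic dataset; the hyperparameter values are illustrative:

```python
from copy import deepcopy
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# A sketch of early stopping: train one epoch at a time and keep a
# copy of the model with the lowest validation error seen so far.
rng = np.random.default_rng(42)
X = 6 * rng.random((200, 1)) - 3
y = 0.5 * X[:, 0] ** 2 + X[:, 0] + 2 + rng.standard_normal(200)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=42)

sgd = SGDRegressor(penalty=None, eta0=0.002, random_state=42)
best_error, best_model = float("inf"), None
for epoch in range(500):
    sgd.partial_fit(X_train, y_train)   # one epoch of training
    error = mean_squared_error(y_valid, sgd.predict(X_valid))
    if error < best_error:
        best_error, best_model = error, deepcopy(sgd)
```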

31
Q

What is a logistic regression used for?

A

Classifying whether a given instance is a member of a certain class or not.

32
Q

How does a logistic regression determine if an instance is in a certain class or not?

A

A logistic regression estimates the probability that an instance belongs to a particular class, given its features. If the estimated probability is above a given threshold (typically 50%), then the model predicts that the instance is a member of the positive class.

33
Q

How does the logistic regression estimate the probability of belonging to a given class?

A

The logistic regression computes a weighted sum of the input features (plus a bias term), but instead of outputting the result directly as a linear regression does, it outputs the logistic of the result, which corresponds to the estimated probability of belonging to the class.

34
Q

What is the logistic function?

A

A logistic function, or sigmoid function, is an S-shaped function that outputs a number between 0 and 1.
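
In equation form, sigma(t) = 1 / (1 + e^(-t)). A minimal sketch:

```python
import numpy as np

# A minimal sketch of the logistic (sigmoid) function:
# sigma(t) = 1 / (1 + exp(-t)), which maps any real t into (0, 1).
def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

print(sigmoid(np.array([-10.0, 0.0, 10.0])))  # ~[0.0000454, 0.5, 0.99995]
```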

35
Q

How is a logistic regression trained to best fit the data?

A

The objective of training is to set the parameter vector, theta, so that the model estimates high probabilities for positive instances and low probabilities for negative instances.

36
Q

What is the cost function for a single training instance in a logistic regression?

A
  • Positive instances (y = 1): -log(p̂) is minimised.
  • Negative instances (y = 0): -log(1 - p̂) is minimised.
Here p̂ is the model's estimated probability; a worked example follows below.
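
The two cases combine into a single expression. A minimal sketch, with illustrative probability values:

```python
import numpy as np

# A sketch of the per-instance cost: -log(p_hat) if y == 1,
# -log(1 - p_hat) if y == 0, combined into one expression.
def instance_cost(y, p_hat):
    return -(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

print(instance_cost(1, 0.9))  # ~0.105: confident and correct -> low cost
print(instance_cost(1, 0.1))  # ~2.303: confident and wrong  -> high cost
```
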
37
Q

What is the cost function for the logistic regression across the whole dataset named and what is its form?

A

Log loss. This is the average of the single-instance cost function across all training instances.

38
Q

What assumption does using log-loss as cost function rely on?

A

That instances follow a Gaussian distribution around the mean of their class. The more the data deviates from this assumption, the more biased the model will be.

39
Q

Is log-loss minimised using a closed-form equation or using gradient descent?

A

There is no known closed form equation for minimising log-loss, therefore gradient descent must be used. Luckily, it is a convex function.

40
Q

What does a decision boundary refer to?

A

A decision boundary is the set of points where the estimated probabilities of belonging to two (or more) classes are equal (e.g. both 50%).

41
Q

How can logistic regressions be regularised?

A

Using L1 and L2 penalties on the cost function.

42
Q

What is softmax regression?

A

A variety of logistic regression that can support multiple classes directly.
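
A minimal sketch of the softmax function itself, which turns per-class scores into probabilities that sum to 1; the scores here are illustrative:

```python
import numpy as np

# A minimal softmax sketch: exponentiate each class score, then
# normalise so the outputs form a probability distribution.
def softmax(scores):
    exps = np.exp(scores - scores.max())  # subtract max for numerical stability
    return exps / exps.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # ~[0.66, 0.24, 0.10]
```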

43
Q

What is the cost function used for softmax regression?

A

The cost function used for softmax regression is the cross entropy.

44
Q

What is cross-entropy?

A

Cross entropy is a function that measures the difference between the true distribution of a label and the model's predicted probability distribution.

45
Q

How is cross entropy used as cost function?

A
  • A probability is computed for each class label.
  • These probabilities are compared to the true distribution of the classes (e.g. 100% for the correct class in a supervised task) using the cross-entropy equation.
  • Training then finds the set of model parameters that minimises this function (see the sketch below).
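
A minimal sketch of cross entropy for a single instance, assuming a one-hot true distribution; the probabilities here are illustrative:

```python
import numpy as np

# A sketch of cross entropy for one instance: -sum over classes of
# true_prob * log(predicted_prob). With a one-hot target this reduces
# to -log of the probability assigned to the correct class.
def cross_entropy(true_dist, pred_dist):
    return -np.sum(true_dist * np.log(pred_dist))

true_dist = np.array([0.0, 1.0, 0.0])       # one-hot: class 1 is correct
pred_dist = np.array([0.2, 0.7, 0.1])       # model's predicted probabilities
print(cross_entropy(true_dist, pred_dist))  # ~0.357 = -log(0.7)
```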