Module 2: Chapter 7 - Supervised Learning - Model Estimation Flashcards
What is Non-Linear Least Squares?
OLS can be used in situations where the underlying model is linear in the parameters, but this does not apply to many machine learning models, such as neural networks. In these cases, a more flexible approach is needed. Nonlinear least squares (NLS) is an approach that can be used when the model is nonlinear, and it works using the same principles as OLS – i.e., by minimizing the residual sum of squares (RSS).
But in this case, $\hat{y}_i = f(x_1, x_2, \ldots, x_m; \mathbf{w})$, where f can be any nonlinear function of the m explanatory variables or features, denoted $x_i$, and the corresponding parameters are denoted $w_i$ (also known as weights in the case of neural networks). Similarly, the mean squared error (MSE) can also be calculated. Because the relationship between the features and the output could in principle take any form, it is often not possible to derive a set of closed-form solutions to this minimization problem. Therefore, NLS usually uses a numerical approach to finding the optimal parameter estimates.
Which steps are taken to identify the optimal solution of NLS?
(1) Begin with a set of initial values for the parameters – these could either be randomly generated or ‘best preliminary guesses.’
(2) Evaluate the objective function (RSS or MSE).
(3) Modify the parameter estimates and re-evaluate the objective function.
(4) If the improvement in the objective function is below a pre-specified threshold, then stop looking and report the current parameter values as the chosen ones. If not, return to step 3.
The third of the above steps is the crucial aspect; usually, a gradient descent algorithm is employed, which is discussed in a later sub-section after a preliminary discussion of hill climbing.
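For concreteness, here is a minimal sketch of NLS estimation in Python, using SciPy's curve_fit routine as the numerical optimizer; the exponential model, the synthetic data, and the starting values are purely illustrative assumptions:

```python
import numpy as np
from scipy.optimize import curve_fit

# Illustrative nonlinear model: f(x; w) = w0 * exp(-w1 * x) + w2
def f(x, w0, w1, w2):
    return w0 * np.exp(-w1 * x) + w2

# Synthetic training data (assumed for illustration only)
rng = np.random.default_rng(0)
x = np.linspace(0, 5, 50)
y = f(x, 2.0, 1.3, 0.5) + 0.05 * rng.standard_normal(x.size)

# Step 1: initial parameter guesses; steps 2-4 (evaluate the RSS,
# adjust the estimates, check convergence) are handled internally
# by the numerical optimizer.
w_init = [1.0, 1.0, 0.0]
w_hat, _ = curve_fit(f, x, y, p0=w_init)

rss = np.sum((y - f(x, *w_hat)) ** 2)  # objective at the optimum
print(w_hat, rss)
```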
What is Hill Climbing?
A simple form of optimizer for estimating the parameters of a nonlinear model is hill climbing. This involves starting with initial guesses of each parameter and then making small changes in both directions to each parameter one-at-a-time. The aim is to maximize the value of an objective function (for instance, increasing the value of a likelihood function or increasing the negative of the RSS) until no further improvement in its value is observed.
Hill climbing is very straightforward because it does not require the calculation of derivatives, and therefore it can be applied to non-differentiable functions. It is also simple to implement, and for this reason it is sometimes termed a "heuristic optimizer."
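A minimal sketch of a coordinate-wise hill climber, assuming a toy linear model with the negative RSS as the objective to maximize (the function names, step size, and data are illustrative):

```python
import numpy as np

def hill_climb(objective, w_init, step=0.01, tol=1e-8, max_iter=10_000):
    """Coordinate-wise hill climbing: maximize `objective` by nudging
    one parameter at a time, trying small moves in both directions."""
    w = np.asarray(w_init, dtype=float)
    best = objective(w)
    for _ in range(max_iter):
        improved = False
        for i in range(w.size):          # one parameter at a time
            for delta in (step, -step):  # both directions
                trial = w.copy()
                trial[i] += delta
                val = objective(trial)
                if val > best + tol:
                    w, best, improved = trial, val, True
        if not improved:                 # no move helped: stop
            break
    return w, best

# Example: maximize the negative RSS of a line fit (assumed toy data)
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 2.9, 5.1, 7.0])
neg_rss = lambda w: -np.sum((y - (w[0] + w[1] * x)) ** 2)
w_opt, _ = hill_climb(neg_rss, w_init=[0.0, 0.0])
print(w_opt)  # close to intercept ~1, slope ~2
```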
What are the disadvantages of Hill Climbing?
(1) Of all the optimization techniques available, hill climbing is the most susceptible to getting stuck in local optima.
(2) Convergence to the optimal solution can be very slow.
(3) Only one parameter can be adjusted at a time, meaning that it is easy for the algorithm to miss optimal parameter combinations, particularly for complex and highly interconnected models.
What is the Gradient Descent Method?
A popular numerical procedure for parameter estimation is known as gradient descent.
In this method, the objective function, for example the residual sum of squares, is minimized. Suppose that all the parameters to be estimated are stacked into a single vector $\mathbf{W}$; the objective function is then known as the loss function and is denoted $L(\mathbf{W})$. At each iteration, the algorithm chooses the path of steepest descent ("slope"), i.e., the direction that reduces the value of the loss function the most. The method works similarly when the maximum likelihood method is used, in which case the negative of the log-likelihood function is minimized.
Note that the adjustment to weight $w_i$ takes place in the opposite direction to the gradient (or "slope"), hence the negative sign in the updating equation (7.8): $w_i \leftarrow w_i - \eta \, \partial L / \partial w_i$, where $\eta$ is the learning rate.
To avoid the iterations going into infinite cycles, a maximum number of iterations is assigned to the algorithm. In machine learning parlance, each iteration, comprising calculating the loss function and its gradient and then adjusting the weights using the whole training data sample, is known as an epoch.
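A minimal sketch of batch gradient descent for a least-squares loss, assuming toy data; each pass of the loop is one epoch, and the update line implements the negative-gradient rule discussed above:

```python
import numpy as np

# Batch gradient descent for linear least squares, L(w) = RSS(w).
# Each epoch uses the whole training sample to compute the gradient,
# then moves every weight in the direction of steepest descent:
#     w <- w - eta * dL/dw        (the negative sign in Eq. 7.8)
def batch_gradient_descent(X, y, eta=0.01, max_epochs=5_000, tol=1e-10):
    w = np.zeros(X.shape[1])
    prev_loss = np.inf
    for epoch in range(max_epochs):   # cap to avoid infinite cycling
        resid = y - X @ w
        loss = resid @ resid          # RSS over the full sample
        grad = -2.0 * X.T @ resid     # dL/dw for the RSS loss
        w -= eta * grad               # Eq. 7.8 update
        if prev_loss - loss < tol:    # improvement below threshold
            break
        prev_loss = loss
    return w

# Toy data (assumed): a column of ones adds an intercept weight
X = np.column_stack([np.ones(4), [0.0, 1.0, 2.0, 3.0]])
y = np.array([1.0, 2.9, 5.1, 7.0])
print(batch_gradient_descent(X, y))  # close to [1, 2]
```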
What alternative ways exist to implement the gradient descent approach?
We usually use the entire training data sample and minimize the loss function with respect to all of it, which is known as batch gradient descent.
A slightly less common alternative is to apply gradient descent to one data point at a time, selected at random from the training set. This is known as stochastic gradient descent.
Alternatively, applying it to subsets of the training data is known as mini-batch gradient descent.
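The three variants differ only in how much of the training sample feeds each weight update. A minimal sketch with assumed toy data, showing mini-batch updates with the batch and stochastic alternatives left as comments:

```python
import numpy as np

def gd_update(w, X_part, y_part, eta):
    """One gradient step on whichever slice of the data is supplied."""
    resid = y_part - X_part @ w
    return w + eta * 2.0 * X_part.T @ resid / len(y_part)

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.standard_normal(100)])
y = X @ np.array([1.0, 2.0]) + 0.1 * rng.standard_normal(100)
w = np.zeros(2)
eta, batch_size = 0.1, 10

for epoch in range(200):
    idx = rng.permutation(len(y))            # shuffle each epoch
    # Batch: one update from the whole sample
    #   w = gd_update(w, X, y, eta)
    # Stochastic: one update per randomly chosen observation
    #   for i in idx: w = gd_update(w, X[i:i+1], y[i:i+1], eta)
    # Mini-batch: one update per small random subset
    for start in range(0, len(y), batch_size):
        part = idx[start:start + batch_size]
        w = gd_update(w, X[part], y[part], eta)

print(w)  # close to [1, 2]
```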
What is the advantage of using stochastic gradient descent?
The benefit of stochastic gradient descent is that it does not require the entire dataset to be loaded into memory simultaneously, which reduces the required computational resources compared with the batch approach. However, because the weights are updated after the algorithm sees each new data point, convergence on the optimal values will require more iterations and will be less smooth.
What is the hyperparameter η used for in the gradient descent approach?
The parameter η is a hyperparameter that defines how much adjustment to the weights takes place. This hyperparameter must be chosen judiciously: if it is too small, each iteration will yield only a modest improvement in the loss function, resulting in slow movement towards the optimum and requiring many iterations in total. On the other hand, if η is too large, there is a danger of overshooting and an erratic path towards the optimal $w_i^*$: the algorithm can overshoot the minimum, oscillate around it, or fail to converge at all.
For batch gradient descent, η is fixed a priori, but for stochastic gradient descent, the learning rate can be made a diminishing function of the number of epochs that have already occurred so that the weight updating slows down as the algorithm comes closer to the optimum. In other words, we could employ dynamic learning, which entails starting with a larger η to get close to the optimal solution faster, but to then reduce η as the learning proceeds to avoid overshooting.
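A minimal sketch of such a diminishing learning-rate schedule; the functional form and the constants are illustrative assumptions, not a prescribed rule:

```python
# Dynamic learning: eta shrinks as epochs accumulate, so weight
# updates slow down as the algorithm nears the optimum.
eta0, decay = 0.1, 0.01

def learning_rate(epoch):
    return eta0 / (1.0 + decay * epoch)   # diminishing in epochs

for epoch in (0, 10, 100, 1000):
    print(epoch, round(learning_rate(epoch), 4))
# 0 -> 0.1, 10 -> 0.0909, 100 -> 0.05, 1000 -> 0.0091
```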
What is a problem associated with the gradient descent approach?
One important issue occurs when the function does not have a single, global optimum but rather a series of local optima or an extended plateau away from the true optimum, as illustrated in the right panel of Figure 7.3. In such circumstances, the optimizer can get stuck at a local optimum or on a plateau and never reach the optimal solution $w_i^*$.
What is backpropagation?
Determining the optimal weights in a neural network model is particularly challenging because, even with a single hidden layer, the output is a function of a function. It is like running a logistic regression on the output from another logistic regression.
Therefore, a technique known as backpropagation is used along with gradient descent to determine the weights in neural network models.
How does backpropagation work?
The backpropagation algorithm involves starting on the right-hand side of the neural network (i.e., beginning with the output layer) and then successively working backward through the layers to update the weight estimates at each iteration. This begins by calculating the errors (actual – fitted values) for each target data point; these errors are then "assigned" to each of the weights in the layer before it.
Gradient descent can then be applied to the weights to calculate improved values. A new output-layer error is determined via a feedforward pass of the feature values through the network with the updated weights, and the process continues. The derivatives are computed starting from the output layer and moving backward, with an application of the chain rule (hence the name backpropagation). The algorithm stops when convergence is achieved – that is, when updating the weights no longer reduces the cost function (or reduces it by a trivially small amount). The key to backpropagation is to consider each layer separately rather than trying to do all the computation in a single step, because breaking it down in this way greatly simplifies the mathematics.
What are the steps to identify the optimal weights in an ANN?
(1) Generate initial guesses (usually at random) for all the weights, including the biases.
(2) Given these weights, feedforward the values of the inputs to calculate the values at each neuron and then finally the value of the outputs. This is done separately for each of the N data points in the training sample.
(3) Once the fitted values of the outputs are determined, the error, which is the difference between the network output and the actual value, can be calculated for each observation.
(3.a) If this is the first iteration, proceed to step 4.
(3.b) If the residual sum of squares is below a particular threshold or has not improved much since the previous iteration, or the number of iterations has reached a pre-specified maximum value, stop and fix the weights at their current values. Otherwise, proceed to step 4.
(4) During the backward pass, the gradient descent method is used to calculate improved values of the weights. In this process the error is propagated through the network, and the weights are updated to minimize the loss function. Return to step 2 and run through a further iteration (see the sketch below).
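A minimal sketch of these four steps for a one-hidden-layer network with sigmoid activations and a squared-error loss; the XOR data, layer sizes, learning rate, and stopping threshold are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Toy data: XOR, assumed purely for demonstration
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([[0.], [1.], [1.], [0.]])

# Step 1: random initial weights, zero biases
W1, b1 = rng.standard_normal((2, 4)), np.zeros(4)
W2, b2 = rng.standard_normal((4, 1)), np.zeros(1)
eta = 2.0

for epoch in range(10_000):                # pre-specified maximum
    # Step 2: feedforward pass, layer by layer
    H = sigmoid(X @ W1 + b1)               # hidden-layer values
    out = sigmoid(H @ W2 + b2)             # network outputs

    # Step 3: error for each observation; stop if already small
    err = out - y
    if np.mean(err ** 2) < 1e-3:
        break

    # Step 4: backward pass, applying the chain rule per layer
    d_out = err * out * (1 - out)          # output layer
    d_hid = (d_out @ W2.T) * H * (1 - H)   # error propagated backward

    W2 -= eta * H.T @ d_out                # gradient descent updates
    b2 -= eta * d_out.sum(axis=0)
    W1 -= eta * X.T @ d_hid
    b1 -= eta * d_hid.sum(axis=0)

print(out.round(2))  # fitted outputs; convergence depends on the seed
```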
Why do we include a momentum term and how can it help with local minima?
Sometimes a momentum term is added to the optimizer, which increases the effective learning rate if the previous change in the weights and the current change are in the same direction, but reduces it if they are in opposite directions.
The benefit of this approach is that it speeds up convergence but at the same time reduces the probability of overshooting the optimal parameter values.
The momentum rate μ can be chosen between 0 and 1; it controls how much of the previous weight change is retained in the next iteration.
This works by ‘overshooting’ the target, which helps prevent the algorithm from getting stuck in local minima.
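A minimal sketch of the momentum update, with the symbols as described above and a one-dimensional quadratic loss assumed purely for illustration:

```python
# Momentum update (v is the retained 'velocity' of the previous
# weight change):
#   v <- mu * v - eta * gradient
#   w <- w + v
def momentum_step(w, v, grad, eta=0.01, mu=0.9):
    v = mu * v - eta * grad     # same-direction gradients build v up
    return w + v, v             # sign flips damp it instead

# Usage on the quadratic loss L(w) = (w - 3)^2, gradient 2(w - 3)
w, v = 0.0, 0.0
for _ in range(200):
    w, v = momentum_step(w, v, grad=2.0 * (w - 3.0))
print(round(w, 3))  # converges toward 3.0
```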
What issue does the term "vanishing gradient" refer to?
Deep networks are plagued by another computational issue: the so-called vanishing gradient. Loosely, the problem can be understood as follows. Backpropagation is an application of the chain rule, which entails the multiplication of several derivatives (as many as there are layers in the network). When the derivatives are numbers between 0 and 1 (always the case for the logistic function, whose derivative is at most 0.25), their product tends to become very small very quickly. The opposite problem is an exploding gradient, where the product of the derivatives becomes larger and larger.
When one of these problems emerges, the only way to find the optimum is by using an extremely large number of small updates, which of course makes the learning very slow.
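A minimal numeric illustration of the shrinking product: even when each logistic derivative takes its maximum value of 0.25 (at z = 0), multiplying one factor per layer collapses the gradient geometrically with depth:

```python
import math

sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))
d_sigmoid = lambda z: sigmoid(z) * (1.0 - sigmoid(z))  # at most 0.25

# One derivative factor per layer, each at its maximum value
for depth in (1, 5, 10, 20):
    print(depth, d_sigmoid(0.0) ** depth)
# 1 -> 0.25, 5 -> ~9.8e-04, 10 -> ~9.5e-07, 20 -> ~9.1e-13
```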
What solutions exist to tackle vanishing/exploding gradients?
(1) An appropriate choice of activation function. In recent years, the use of the logistic (sigmoid) function as an activation function for hidden layers has been abandoned in favor of, for instance, the ReLU function, which is less prone to the vanishing gradient problem.
(2) Batch normalization. This consists of adding "normalization layers" between the hidden layers. The normalization works in a fashion similar to what was discussed in Chapter 1, where the features were normalized by subtracting the mean and dividing by the standard deviation. Here, it is the new inputs originating from the hidden layers that are normalized (a minimal sketch appears after this list).
(3) Specific network architectures. Some network architectures have been developed to be resistant to the vanishing or exploding gradient problem. An example is the long short-term memory (LSTM) network.
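A minimal sketch of the normalization step inside such a layer, assuming a small matrix of hidden-layer activations (the trainable scale and shift parameters of full batch normalization are omitted):

```python
import numpy as np

def batch_norm(H, eps=1e-5):
    """Standardize hidden-layer activations across the mini-batch."""
    mean = H.mean(axis=0)             # per-neuron mean over the batch
    std = H.std(axis=0)               # per-neuron standard deviation
    return (H - mean) / (std + eps)   # eps guards against division by 0

# Assumed activations: a batch of 3 observations, 2 hidden neurons
H = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])
print(batch_norm(H).round(2))  # each column now has mean 0, std ~1
```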