3_Reducing Loss Flashcards
An Iterative Approach
Iterative learning might remind you of the “Hot and Cold” kids’ game for finding a hidden object like a thimble. In this game, the “hidden object” is the best possible model. You start with a wild guess (“The value of w1 is 0.”) and wait for the system to tell you what the loss is. Then you try another guess (“The value of w1 is 0.5.”) and see what the loss is. Aha, you’re getting warmer. Actually, if you play this game right, you’ll usually be getting warmer. The real trick to the game is finding the best possible model as efficiently as possible.
(Figure: an iterative approach to training a model.)
The “model” takes one or more features as input and returns one prediction (y′) as output. To simplify, consider a model that takes one feature and returns one prediction:
y′ = b + w1x1
What initial values should we set for b and w1? For linear regression problems, it turns out that the starting values aren’t important. We could pick random values, but we’ll just take the following trivial values instead:
- b = 0
- w1 = 0
Suppose that the first feature value is 10. Plugging that feature value into the prediction function yields:
y′ = 0 + 0 ⋅ 10 = 0
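As a quick sketch of that prediction function in Python (the names here are illustrative, not from any library):

```python
def predict(x1, b=0.0, w1=0.0):
    """Linear model with one feature: y' = b + w1 * x1."""
    return b + w1 * x1

print(predict(10))  # 0.0, since b and w1 both start at 0
```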
The “Compute Loss” part of the diagram is the loss function that the model will use. Suppose we use the squared loss function. The loss function takes in two input values:
- y′: The model’s prediction for features x.
- y: The correct label corresponding to features x.
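As a concrete sketch, squared loss for a single example could look like this (the name `squared_loss` is just for illustration):

```python
def squared_loss(y_prime, y):
    """Squared loss for one example: (prediction - label) squared."""
    return (y_prime - y) ** 2

# If the model predicts 0 but the label is 5, the loss is 25.
print(squared_loss(0.0, 5.0))  # 25.0
```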
At last, we’ve reached the “Compute parameter updates” part of the diagram. It is here that the machine learning system examines the value of the loss function and generates new values for b and w1. For now, just assume that this mysterious box devises new values, and that the machine learning system then re-evaluates all the features against all the labels, yielding a new value for the loss function, which in turn yields new parameter values. The learning continues iterating until the algorithm discovers the model parameters with the lowest possible loss. Usually, you iterate until overall loss stops changing or at least changes extremely slowly. When that happens, we say that the model has converged.
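This loop can be sketched in Python. A minimal sketch, assuming squared loss and treating the update rule as a black box (the `compute_parameter_updates` argument is a hypothetical stand-in for the mysterious box; gradient descent, described next, is one way to implement it):

```python
def train(features, labels, compute_parameter_updates, tolerance=1e-6):
    """Iterate until overall loss stops changing, i.e. the model converges."""
    b, w1 = 0.0, 0.0  # trivial starting values
    previous_loss = float("inf")
    while True:
        # Re-evaluate all features against all labels with the current parameters.
        loss = sum((b + w1 * x - y) ** 2 for x, y in zip(features, labels)) / len(labels)
        if abs(previous_loss - loss) < tolerance:
            return b, w1  # converged
        previous_loss = loss
        b, w1 = compute_parameter_updates(b, w1, features, labels)
```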
Gradient Descent
Suppose we had the time and the computing resources to calculate the loss for all possible values of w1. For the kind of regression problems we’ve been examining, the resulting plot of loss vs. w1 will always be convex. In other words, the plot will always be bowl-shaped, kind of like this:
(Figure: a convex, bowl-shaped plot of loss vs. w1.)
Convex problems have only one minimum; that is, only one place where the slope is exactly 0. That minimum is where the loss function converges.
Calculating the loss function for every conceivable value of w1 over the entire data set would be an inefficient way of finding the convergence point. Let’s examine a better mechanism—very popular in machine learning—called gradient descent.
The first stage in gradient descent is to pick a starting value (a starting point) for w1. The starting point doesn’t matter much; therefore, many algorithms simply set w1 to 0 or pick a random value. The following figure shows that we’ve picked a starting point slightly greater than 0:
(Figure: a starting point for w1 chosen slightly greater than 0 on the loss curve.)
The gradient descent algorithm then calculates the gradient of the loss curve at the starting point. Here, the gradient of the loss is equal to the derivative (slope) of the curve and tells you which way is “warmer” or “colder.” When there are multiple weights, the gradient is a vector of partial derivatives with respect to the weights.
Note that a gradient is a vector, so it has both of the following characteristics:
- a direction
- a magnitude
The gradient always points in the direction of steepest increase in the loss function. The gradient descent algorithm takes a step in the direction of the negative gradient in order to reduce loss as quickly as possible.
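For the one-feature model with squared loss, those partial derivatives have a simple closed form. A minimal sketch, assuming mean squared error over the data set (the function name `gradients` is just for illustration):

```python
def gradients(b, w1, features, labels):
    """Partial derivatives of mean squared error with respect to b and w1."""
    n = len(labels)
    errors = [(b + w1 * x) - y for x, y in zip(features, labels)]
    grad_b = 2.0 / n * sum(errors)                              # d(loss)/db
    grad_w1 = 2.0 / n * sum(e * x for e, x in zip(errors, features))  # d(loss)/dw1
    return grad_b, grad_w1
```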
To determine the next point along the loss function curve, the gradient descent algorithm moves from the starting point by some fraction of the gradient’s magnitude, in the direction of the negative gradient.
(When performing gradient descent, we generalize the above process to tune all the model parameters simultaneously. For example, to find the optimal values of both w1 and the bias b, we calculate the gradients with respect to both w1 and b. Next, we modify the values of w1 and b based on their respective gradients. Then we repeat these steps until we reach minimum loss.)
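Putting the pieces together, here is a minimal sketch of that loop for both b and w1. It reuses the hypothetical `gradients` function sketched above, and the `learning_rate` scalar anticipates the next section:

```python
def gradient_descent(features, labels, learning_rate=0.01, steps=1000):
    """Repeatedly step b and w1 in the negative gradient direction."""
    b, w1 = 0.0, 0.0
    for _ in range(steps):
        grad_b, grad_w1 = gradients(b, w1, features, labels)
        b -= learning_rate * grad_b    # step opposite the gradient
        w1 -= learning_rate * grad_w1
    return b, w1
```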
Learning Rate
As noted, the gradient vector has both a direction and a magnitude. Gradient descent algorithms multiply the gradient by a scalar known as the learning rate (also sometimes called step size) to determine the next point. For example, if the gradient magnitude is 2.5 and the learning rate is 0.01, then the gradient descent algorithm will pick the next point 0.025 away from the previous point.
Hyperparameters are the knobs that programmers tweak in machine learning algorithms. Most machine learning programmers spend a fair amount of time tuning the learning rate. If you pick a learning rate that is too small, learning will take too long.
Conversely, if you specify a learning rate that is too large, the next point will perpetually bounce haphazardly across the bottom of the well like a quantum mechanics experiment gone horribly wrong.
There is a Goldilocks learning rate for every regression problem. The Goldilocks value is related to how flat the loss function is. If you know that the gradient of the loss function is small, then you can safely try a larger learning rate, which compensates for the small gradient and results in a larger step size.
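One way to see the too-small, Goldilocks, and too-large regimes is to run a fixed number of steps on a toy convex loss such as loss(w1) = (w1 - 3)^2, whose gradient is 2 * (w1 - 3). This is purely illustrative:

```python
def descend(learning_rate, steps=20, w1=0.0):
    """Minimize loss(w1) = (w1 - 3)**2 starting from w1 = 0."""
    for _ in range(steps):
        w1 -= learning_rate * 2 * (w1 - 3)  # step opposite the gradient
    return w1

print(descend(0.001))  # too small: after 20 steps, w1 has barely moved toward 3
print(descend(0.1))    # about right: w1 ends up close to 3
print(descend(1.1))    # too large: w1 overshoots farther each step and diverges
```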
Stochastic Gradient Descent
In gradient descent, a batch is the total number of examples you use to calculate the gradient in a single iteration. So far, we’ve assumed that the batch has been the entire data set. When working at Google scale, data sets often contain billions or even hundreds of billions of examples. Furthermore, Google data sets often contain huge numbers of features. Consequently, a batch can be enormous. A very large batch may cause even a single iteration to take a very long time to compute.
A large data set with randomly sampled examples probably contains redundant data. In fact, redundancy becomes more likely as the batch size grows. Some redundancy can be useful to smooth out noisy gradients, but enormous batches tend not to carry much more predictive value than large batches.
What if we could get the right gradient on average for much less computation? By choosing examples at random from our data set, we could estimate (albeit noisily) a big average from a much smaller one. Stochastic gradient descent (SGD) takes this idea to the extreme: it uses only a single example (a batch size of 1) per iteration. Given enough iterations, SGD works but is very noisy. The term “stochastic” indicates that the one example comprising each batch is chosen at random.
Mini-batch stochastic gradient descent (mini-batch SGD) is a compromise between full-batch iteration and SGD. A mini-batch is typically between 10 and 1,000 examples, chosen at random. Mini-batch SGD reduces the amount of noise in SGD but is still more efficient than full-batch.
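A minimal sketch of mini-batch SGD for the one-feature model, assuming squared loss; setting batch_size to 1 recovers plain SGD, and setting it to the full data set size recovers full-batch gradient descent:

```python
import random

def minibatch_sgd(features, labels, learning_rate=0.01, batch_size=100, steps=1000):
    """Each step estimates the gradient from a random mini-batch of examples."""
    b, w1 = 0.0, 0.0
    indices = list(range(len(labels)))
    for _ in range(steps):
        # Sample a mini-batch of example indices at random, without replacement.
        batch = random.sample(indices, min(batch_size, len(indices)))
        errors = [(b + w1 * features[i]) - labels[i] for i in batch]
        grad_b = 2.0 / len(batch) * sum(errors)
        grad_w1 = 2.0 / len(batch) * sum(e * features[i] for e, i in zip(errors, batch))
        b -= learning_rate * grad_b
        w1 -= learning_rate * grad_w1
    return b, w1
```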
To simplify the explanation, we focused on gradient descent for a single feature. Rest assured that gradient descent also works on feature sets that contain multiple features.