Ch4 Training a Digit Classifier End-of-Chapter Questions Flashcards
What is a mini-batch?
A small group of inputs and labels gathered together in two arrays. A gradient descent step updates the weights using this batch (rather than the whole dataset at once).
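A minimal sketch of slicing a toy dataset into mini-batches (the data here is an assumption for illustration; in practice fastai's DataLoader does this, with shuffling):

```python
import torch

# Hypothetical toy dataset: 10 inputs with one feature each, plus labels.
xs = torch.arange(10).float().unsqueeze(1)
ys = torch.arange(10).float() * 2

batch_size = 4
# Each mini-batch is a pair of arrays: a slice of inputs and the matching labels.
batches = [(xs[i:i + batch_size], ys[i:i + batch_size])
           for i in range(0, len(xs), batch_size)]

for xb, yb in batches:
    pass  # one gradient descent step would be taken per (xb, yb) pair
```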
What is a “forward pass”?
Applying the model to some input and computing the predictions.
How is a greyscale image represented on a computer? How about a color image?
Greyscale images are represented by a matrix of pixel intensities, with 0 representing white and 255 representing black in the convention used in this chapter (many image formats use the reverse: 0 for black, 255 for white). Shades of grey are numbers in between. A color image is represented by a set of 3 matrices, each representing the pixels’ intensity of red, green, and blue.
What is a rank-3 tensor?
It’s a tensor with 3 axes (dimensions).
What is the difference between tensor rank and shape? How do you get the rank from the shape?
Tensor rank refers to the number of axes in a tensor. Tensor shape refers to the length of each axis. You can get the rank from the shape by using the length function: len(tensorname.shape)
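A quick sketch of rank vs. shape on a made-up tensor:

```python
import torch

t = torch.zeros(2, 3, 4)  # a rank-3 tensor
print(t.shape)            # shape: the length of each axis
print(len(t.shape))       # rank: the number of axes -> 3
print(t.ndim)             # PyTorch also exposes the rank directly as .ndim
```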
What are RMSE and L1 norm?
RMSE - root mean square error. Take the mean of the squared differences, then take the square root of the result. It penalizes bigger mistakes more heavily than the L1 norm does.
L1 norm - mean absolute difference. Take the mean of the absolute values of the differences.
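Both can be computed in a couple of lines (the predictions and targets here are made up):

```python
import torch

preds   = torch.tensor([0.2, 0.8, 0.4])
targets = torch.tensor([0.0, 1.0, 1.0])

l1   = (preds - targets).abs().mean()          # L1 norm: mean absolute difference
rmse = ((preds - targets) ** 2).mean().sqrt()  # RMSE: mean of squares, then sqrt
```

PyTorch also provides these as `F.l1_loss` and `F.mse_loss` (take the square root of the latter for RMSE).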
What is broadcasting?
When performing a mathematical operation between tensors of different ranks, broadcasting conceptually expands the tensor with the smaller rank so it has the same shape as the one with the larger rank (without actually copying any data), so the operation can be applied elementwise.
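For example, adding a rank-1 tensor to a rank-2 tensor:

```python
import torch

m = torch.ones(3, 3)            # rank-2 tensor, shape (3, 3)
v = torch.tensor([1., 2., 3.])  # rank-1 tensor, shape (3,)

# v is treated as if it were expanded to shape (3, 3); no data is copied.
result = m + v
print(result.shape)  # torch.Size([3, 3]) -- each row of m gained [1, 2, 3]
```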
What is SGD?
Stochastic Gradient Descent. It is an iterative optimization algorithm that starts from a random point on a function and travels down its slope in steps until it reaches a minimum of that function.
Why does SGD use mini-batches?
Calculating the loss over the whole dataset would take a very long time for each step, while calculating it on a single item would give an imprecise, unstable gradient. Instead, SGD calculates the loss over a portion of the data items at a time, which speeds up each step while still giving a useful estimate of the gradient.
What are the 7 steps for SGD in machine learning?
7 Steps for SGD:
- Initialize the weights
- Make predictions using model with these weights
- Calculate the loss based on these predictions
- Calculate the gradient, which measures for each weight how changing that weight would change the loss.
- Step (change) all the weights based on that calculation
- Go back to step 2 and repeat the process
- Stop the training process
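The steps above can be sketched as a training loop. This is a minimal illustration on an assumed toy problem (a linear model fit with MSE loss), not the chapter's MNIST example:

```python
import torch

torch.manual_seed(42)  # for reproducibility of the random initialization

xs = torch.linspace(0, 1, 20)
ys = 3 * xs + 0.5                         # made-up "true" labels

def model(x, params):
    w, b = params
    return w * x + b

def mse(preds, targets):
    return ((preds - targets) ** 2).mean()

params = torch.randn(2).requires_grad_()  # step 1: initialize the weights
lr = 0.1

for _ in range(200):
    preds = model(xs, params)             # step 2: make predictions
    loss = mse(preds, ys)                 # step 3: calculate the loss
    loss.backward()                       # step 4: calculate the gradients
    with torch.no_grad():
        params -= lr * params.grad        # step 5: step the weights
        params.grad.zero_()
    # step 6: the loop repeats from step 2
# step 7: training stops after the chosen number of iterations
```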
How do we initialize the weights in a model?
Use random numbers.
What is loss?
Loss is a value that represents how well (or badly) our model is doing; it is the quantity SGD tries to minimize.
Why can’t we always use a high learning rate?
A learning rate that is too high results in large steps that may miss the minimum loss, leading to the loss getting worse or bouncing around rather than converging at the minimum.
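A toy illustration of this, using the assumed function f(x) = x² (whose gradient is 2x, so the update is `x -= lr * 2 * x`):

```python
def descend(lr, steps=10, x=1.0):
    """Run a few gradient descent steps on f(x) = x**2, starting at x."""
    for _ in range(steps):
        x -= lr * 2 * x   # gradient of x**2 is 2*x
    return x

print(descend(0.1))   # moderate rate: x shrinks toward the minimum at 0
print(descend(1.1))   # too high: each step overshoots and x bounces outward
```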
What is a gradient?
A gradient is a derivative of the loss with respect to a parameter of the model.
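PyTorch can compute gradients automatically: mark a parameter with `requires_grad`, compute the loss, then call `backward()`. A toy loss is assumed here:

```python
import torch

w = torch.tensor(3.0, requires_grad=True)
loss = (w - 1) ** 2   # toy loss; its derivative w.r.t. w is 2*(w - 1)
loss.backward()       # populates w.grad with the gradient
print(w.grad)         # tensor(4.)
```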
Why can’t we use accuracy as a loss function?
The gradient of a function is its slope: how much the value of the function changes divided by how much we changed the input:
(ynew - yold)/(xnew - xold)
The problem is that a small change in weight (x) isn’t likely to cause the prediction to change, so (ynew - yold) will almost always be 0, i.e., the gradient is 0 almost everywhere. Thus, a small change in weight will often not change the accuracy at all. If the gradient is 0, the model can’t learn from that step. We need a function that can show differences from small changes in weights.
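This can be seen concretely (the predictions here are made up; 0.5 is assumed as the decision threshold):

```python
import torch

targets = torch.tensor([1., 0., 1.])
preds_a = torch.tensor([0.90, 0.2, 0.7])
preds_b = torch.tensor([0.91, 0.2, 0.7])  # a tiny change in one prediction

# Accuracy only changes when a prediction crosses the 0.5 threshold,
# so this small change leaves it identical -> zero gradient.
acc_a = ((preds_a > 0.5).float() == targets).float().mean()
acc_b = ((preds_b > 0.5).float() == targets).float().mean()

# A smooth loss (here, mean absolute distance from the target) does respond.
loss_a = (preds_a - targets).abs().mean()
loss_b = (preds_b - targets).abs().mean()
print(acc_a.item() == acc_b.item())    # True: accuracy is unchanged
print(loss_a.item() == loss_b.item())  # False: the loss moved slightly
```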