Ch4 Training a Digit Classifier End-of-Chapter Questions Flashcards

1
Q

What is a mini-batch?

A

A small group of inputs and labels gathered together in two arrays. The weights are updated with one gradient descent step per mini-batch (rather than once per epoch over the whole dataset).

2
Q

What is a “forward pass”?

A

Applying the model to some input and computing the predictions.

3
Q

How is a greyscale image represented on a computer? How about a color image?

A

Greyscale images are represented by a matrix of pixel values; in the MNIST images used in this chapter, 0 represents white and 255 represents black, with shades of grey as the numbers in between. A color image is represented by three such matrices (red, green, and blue channels), each giving the intensity of that color for every pixel.
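
A minimal sketch of what this looks like in code, assuming a hypothetical image file 'digit.png' and that PIL, NumPy, and PyTorch are available:

  from PIL import Image
  import numpy as np
  import torch

  img = Image.open('digit.png')                     # 'digit.png' is a hypothetical path
  grey = torch.tensor(np.array(img.convert('L')))   # greyscale: one matrix of pixel values
  print(grey.shape)                                 # e.g. torch.Size([28, 28])
  rgb = torch.tensor(np.array(img.convert('RGB')))  # color: one value per channel per pixel
  print(rgb.shape)                                  # e.g. torch.Size([28, 28, 3])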

4
Q

What is a rank-3 tensor?

A

It's a tensor with 3 axes or dimensions (for example, the chapter's stacked training images: number of images x height x width).

5
Q

What is the difference between tensor rank and shape? How do you get the rank from the shape?

A

Tensor rank refers to the number of axes in a tensor. Tensor shape refers to the length of each axis. You can get the rank from the shape by taking its length: len(tensorname.shape) (equivalently, tensorname.ndim).
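
For example, a quick check in PyTorch:

  import torch

  t = torch.zeros(2, 3, 4)   # a rank-3 tensor
  print(t.shape)             # torch.Size([2, 3, 4]) -- the length of each axis
  print(len(t.shape))        # 3 -- the rank
  print(t.ndim)              # 3 -- PyTorch also exposes the rank directly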

6
Q

What are RMSE and L1 norm?

A

RMSE (root mean squared error): take the mean of the squared differences, then take the square root of the result. It penalizes bigger mistakes more heavily than the L1 norm does.

L1 norm (mean absolute difference): take the mean of the absolute values of the differences.
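
A small sketch computing both on made-up values:

  import torch

  preds   = torch.tensor([0.2, 0.8, 0.6])
  targets = torch.tensor([0.0, 1.0, 1.0])

  l1   = (preds - targets).abs().mean()        # L1 norm: mean absolute difference
  rmse = ((preds - targets)**2).mean().sqrt()  # RMSE: square, mean, then square root
  print(l1, rmse)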

7
Q

What is broadcasting?

A

When doing a mathematical operation between tensors of different ranks, broadcasting automatically expands the tensor with the smaller rank to have the same shape as the larger one, so the operation can be applied elementwise. PyTorch does this without actually copying any data in memory.
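
For example, adding a rank-1 tensor to a rank-2 tensor:

  import torch

  m = torch.tensor([[1., 2., 3.],
                    [4., 5., 6.]])   # shape (2, 3)
  v = torch.tensor([10., 20., 30.])  # shape (3,)

  print(m + v)   # v is broadcast across each row: [[11., 22., 33.], [14., 25., 36.]]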

8
Q

What is SGD?

A

Stochastic Gradient Descent. It is an iterative optimization algorithm: starting from a random point on a function, it repeatedly steps downhill along the function's slope until it reaches (approximately) a minimum.

9
Q

Why does SGD use mini-batches?

A

Calculating the loss over the whole dataset for every update would take a very long time, while calculating it on a single item would give an imprecise, unstable gradient. SGD instead calculates the loss over a mini-batch of items at a time, which reduces the number of required calculations per step while still giving a useful gradient estimate (and lets the GPU process many items in parallel).

10
Q

What are the 7 steps for SGD in machine learning?

A

7 Steps for SGD (a runnable sketch follows the list):

  1. Initialize the weights.
  2. Make predictions using the model with these weights.
  3. Calculate the loss based on these predictions.
  4. Calculate the gradient, which measures for each weight how changing that weight would change the loss.
  5. Step (change) all the weights based on that calculation.
  6. Go back to step 2 and repeat the process.
  7. Stop the training process (once the model is good enough or you run out of time).
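
A minimal runnable sketch of the 7 steps, fitting a quadratic to some made-up data (the data, the function f, and the loss mse are illustrative assumptions, not the chapter's exact code):

  import torch

  x = torch.linspace(-2, 2, 20)
  y = 3 * x**2 + 0.5 * x + 1 + torch.randn(20) * 0.1   # made-up noisy targets

  def f(x, params):
      a, b, c = params
      return a * x**2 + b * x + c

  def mse(preds, targets):
      return ((preds - targets)**2).mean()

  params = torch.randn(3).requires_grad_()   # 1. initialize the weights
  lr = 1e-2
  for i in range(100):                       # 6. go back to step 2 and repeat
      preds = f(x, params)                   # 2. make predictions
      loss = mse(preds, y)                   # 3. calculate the loss
      loss.backward()                        # 4. calculate the gradients
      with torch.no_grad():
          params -= params.grad * lr         # 5. step the weights
          params.grad.zero_()
  print(params)                              # 7. stop, here simply after a fixed number of steps
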
11
Q

How do we initialize the weights in a model?

A

Use random numbers.

12
Q

What is loss?

A

Loss is a value that represents how well (or how badly) our model is doing.

13
Q

Why can't we always use a high learning rate?

A

A learning rate that is too high results in steps so large that they can overshoot the minimum, causing the loss to get worse or to bounce around rather than converge.

14
Q

What is a gradient?

A

A gradient is a derivative of the loss with respect to a parameter of the model.

15
Q

Why can’t we use accuracy as a loss function?

A

The gradient of a function is its slope: how much the value of the function changes divided by how much we changed the input,

(y_new - y_old) / (x_new - x_old)

The problem is that a small change in a weight (x) is rarely enough to flip any prediction from wrong to right, so the change in accuracy (y_new - y_old) is almost always 0. The gradient is therefore 0 almost everywhere, and the model can't learn from a zero gradient. We need a loss function whose value changes at least a little whenever the weights change a little.

16
Q

What is special about the shape of the sigmoid function?

A

It looks like an “S”. It can take any input value, positive or negative, and always outputs a value between 0 and 1. It is also smooth and monotonically increasing, which makes it easier for SGD to find meaningful gradients.
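
For example:

  import torch

  x = torch.tensor([-6., -1., 0., 1., 6.])
  print(torch.sigmoid(x))   # tensor([0.0025, 0.2689, 0.5000, 0.7311, 0.9975])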

17
Q

What is the function to calculate new weights using a learning rate?

A

w -= w.grad * lr

18
Q

What does the DataLoader class do?

A

It takes in a dataset, shuffles it on every epoch and creates mini-batches. It returns an iterator over the batches.
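
A short sketch, mirroring the chapter's example (assumes fastai is installed; any indexable collection can serve as the dataset):

  from fastai.data.load import DataLoader

  coll = range(15)
  dl = DataLoader(coll, batch_size=5, shuffle=True)
  print(list(dl))   # three shuffled mini-batches of 5 items each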

19
Q

Write pseudocode showing the basic steps taken in each epoch for SGD.

A

for each mini-batch (features, targets) in the training data:

  • preds = model(features)
  • loss = loss_function(preds, targets)
  • calculate the gradients for each parameter
  • update the parameters by subtracting gradient * learning_rate
  • reset gradients to zero for each parameter

Each epoch goes through all the minibatches.

20
Q

Create a function that, if passed the two arguments [1,2,3,4] and 'abcd', returns [(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')]. What is special about that output data structure (as related to PyTorch Datasets)?

A

def pairs(a, b): return list(zip(a, b))
pairs([1,2,3,4], 'abcd')   # [(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')]

This is the format of a Dataset in PyTorch: a collection that contains tuples of independent and dependent variables.

21
Q

What does view do in PyTorch?

A

It changes the shape of a tensor without changing its contents (the total number of elements must stay the same).
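
For example:

  import torch

  t = torch.arange(6)         # tensor([0, 1, 2, 3, 4, 5]), shape (6,)
  print(t.view(2, 3))         # same six values, now 2 rows x 3 columns
  print(t.view(-1, 3).shape)  # -1 asks PyTorch to infer that axis from the rest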

22
Q

What are the bias parameters in a neural network? Why do we need them?

A

Bias parameters are the constants added after multiplying inputs by weights (y = x@w + b). If we only used x*weights, the output would always be zero when x = 0, no matter how we change the weights. Adding a bias gives the function the extra flexibility it needs.
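
A tiny sketch (shapes and values are made up for illustration):

  import torch

  x = torch.randn(4, 3)   # a mini-batch of 4 inputs with 3 features each
  w = torch.randn(3, 1)   # weights
  b = torch.randn(1)      # bias
  preds = x @ w + b       # without b, an all-zero input would always produce 0
  print(preds.shape)      # torch.Size([4, 1])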

23
Q

What does the @ operator do in Python?

A

Matrix multiplication. For tensors, a @ b is equivalent to torch.matmul(a, b).

24
Q

What does the backward method do?

A

It calculates the gradients of the loss with respect to the model parameters (backpropagation) and stores them in each parameter's .grad attribute.

25
Q

Why do we have to zero the gradients?

A

backward will add the calculated gradients to any gradients that are already stored. If you want to calculate gradients starting from zero, you have to zero out any gradients stored in the parameters.
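
A tiny sketch showing both the accumulation and the reset:

  import torch

  x = torch.tensor(3.).requires_grad_()
  (x * x).backward()
  print(x.grad)       # tensor(6.)  -- d(x^2)/dx at x = 3
  (x * x).backward()
  print(x.grad)       # tensor(12.) -- the new gradient was added to the stored one
  x.grad.zero_()      # reset before the next backward pass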

26
Q

What information do we have to pass to Learner (5 key things)?

A

The DataLoaders (training and validation data), the model, the optimization function, the loss function, and the metrics to print.
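
A sketch of the call, following the chapter's MNIST example (dls, mnist_loss, and batch_accuracy are assumed to be defined earlier, as in the chapter):

  from fastai.vision.all import *

  learn = Learner(dls,                     # 1. the DataLoaders (training + validation data)
                  nn.Linear(28*28, 1),     # 2. the model
                  opt_func=SGD,            # 3. the optimization function
                  loss_func=mnist_loss,    # 4. the loss function
                  metrics=batch_accuracy)  # 5. the metrics to print
  learn.fit(10, lr=0.1)                    # learning rate chosen for illustration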

27
Q

What is ReLU?

A

Rectified Linear Unit. It's an activation function that replaces every negative number with zero and leaves positive numbers unchanged.

28
Q

What is an activation function?

A

It's a nonlinear function placed between the linear layers of a neural network. Without it, a stack of linear layers would collapse into a single equivalent linear layer; the nonlinearity is what lets the network approximate more complex functions.

29
Q

What's the difference between F.relu and nn.ReLU?

A

They compute the same thing, but F.relu is a plain function, while nn.ReLU is a PyTorch module (a class you instantiate), which is what you use when defining a model with nn.Sequential.
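
For example, both in use (the layer sizes follow the chapter's simple_net):

  import torch
  from torch import nn
  import torch.nn.functional as F

  x = torch.tensor([-2., 0., 3.])
  print(F.relu(x))              # tensor([0., 0., 3.]) -- plain function call

  simple_net = nn.Sequential(   # nn.ReLU() is a module, so it can be placed in a model
      nn.Linear(28*28, 30),
      nn.ReLU(),
      nn.Linear(30, 1),
  )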

30
Q

The universal approximation theorem shows that any function can be approximated as closely as needed using just one nonlinearity. So why do we normally use more?

A

With more layers we can use smaller matrices and far fewer parameters than a single huge layer would need, so deeper models get better results in practice, train more quickly, and use less memory.

31
Q

What is a “backward pass”?

A

Computing the gradients of the loss with respect to all model parameters

32
Q

What is a gradient descent step?

A

Taking a step in the direction opposite to the gradients to make the model parameters a little bit better.

33
Q

What is the learning rate?

A

The size of the step we take when applying SGD to update the parameters of the model.