Ch4 Training a Digit Classifier End-of-Chapter Questions Flashcards

1
Q

What is a mini-batch?

A

A small group of inputs and labels gathered together in two arrays. The weights are updated with one gradient descent step per mini-batch (rather than once per epoch over the whole dataset).

2
Q

What is a “forward pass”?

A

Applying the model to some input and computing the predictions.

3
Q

How is a greyscale image represented on a computer? How about a color image?

A

Greyscale images are represented by a matrix of pixel values; in the MNIST images used in this chapter, 0 represents white and 255 represents black, with shades of grey as the numbers in between. A color image is represented by three such matrices (red, green, and blue channels), each giving the intensity of that color for every pixel.
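
A minimal sketch of what this looks like in code, assuming a hypothetical image file 'digit.png' and that PIL, NumPy, and PyTorch are available:

  from PIL import Image
  import numpy as np
  import torch

  img = Image.open('digit.png')                     # 'digit.png' is a hypothetical path
  grey = torch.tensor(np.array(img.convert('L')))   # greyscale: one matrix of pixel values
  print(grey.shape)                                 # e.g. torch.Size([28, 28])
  rgb = torch.tensor(np.array(img.convert('RGB')))  # color: one value per channel per pixel
  print(rgb.shape)                                  # e.g. torch.Size([28, 28, 3])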

4
Q

What is a rank-3 tensor?

A

It's a tensor with 3 axes or dimensions (for example, the chapter's stacked training images: number of images x height x width).

5
Q

What is the difference between tensor rank and shape? How do you get the rank from the shape?

A

Tensor rank refers to the number of axes in a tensor. Tensor shape refers to the length of each axis. You can get the rank from the shape by taking its length: len(tensorname.shape) (equivalently, tensorname.ndim).
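
For example, a quick check in PyTorch:

  import torch

  t = torch.zeros(2, 3, 4)   # a rank-3 tensor
  print(t.shape)             # torch.Size([2, 3, 4]) -- the length of each axis
  print(len(t.shape))        # 3 -- the rank
  print(t.ndim)              # 3 -- PyTorch also exposes the rank directly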

6
Q

What are RMSE and L1 norm?

A

RMSE (root mean squared error): take the mean of the squared differences, then take the square root of the result. It penalizes bigger mistakes more heavily than the L1 norm does.

L1 norm (mean absolute difference): take the mean of the absolute values of the differences.
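
A small sketch computing both on made-up values:

  import torch

  preds   = torch.tensor([0.2, 0.8, 0.6])
  targets = torch.tensor([0.0, 1.0, 1.0])

  l1   = (preds - targets).abs().mean()        # L1 norm: mean absolute difference
  rmse = ((preds - targets)**2).mean().sqrt()  # RMSE: square, mean, then square root
  print(l1, rmse)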

7
Q

What is broadcasting?

A

When doing a mathematical operation between tensors of different ranks, broadcasting automatically expands the tensor with the smaller rank to have the same shape as the larger one, so the operation can be applied elementwise. PyTorch does this without actually copying any data in memory.
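
For example, adding a rank-1 tensor to a rank-2 tensor:

  import torch

  m = torch.tensor([[1., 2., 3.],
                    [4., 5., 6.]])   # shape (2, 3)
  v = torch.tensor([10., 20., 30.])  # shape (3,)

  print(m + v)   # v is broadcast across each row: [[11., 22., 33.], [14., 25., 36.]]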

8
Q

What is SGD?

A

Stochastic Gradient Descent. It is an iterative optimization algorithm: starting from a random point on a function, it repeatedly steps downhill along the function's slope until it reaches (approximately) a minimum.

9
Q

Why does SGD use mini-batches?

A

Calculating the loss over the whole dataset for every update would take a very long time, while calculating it on a single item would give an imprecise, unstable gradient. SGD instead calculates the loss over a mini-batch of items at a time, which reduces the number of required calculations per step while still giving a useful gradient estimate (and lets the GPU process many items in parallel).

10
Q

What are the 7 steps for SGD in machine learning?

A

7 Steps for SGD (a runnable sketch follows the list):

  1. Initialize the weights.
  2. Make predictions using the model with these weights.
  3. Calculate the loss based on these predictions.
  4. Calculate the gradient, which measures for each weight how changing that weight would change the loss.
  5. Step (change) all the weights based on that calculation.
  6. Go back to step 2 and repeat the process.
  7. Stop the training process (once the model is good enough or you run out of time).
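
A minimal runnable sketch of the 7 steps, fitting a quadratic to some made-up data (the data, the function f, and the loss mse are illustrative assumptions, not the chapter's exact code):

  import torch

  x = torch.linspace(-2, 2, 20)
  y = 3 * x**2 + 0.5 * x + 1 + torch.randn(20) * 0.1   # made-up noisy targets

  def f(x, params):
      a, b, c = params
      return a * x**2 + b * x + c

  def mse(preds, targets):
      return ((preds - targets)**2).mean()

  params = torch.randn(3).requires_grad_()   # 1. initialize the weights
  lr = 1e-2
  for i in range(100):                       # 6. go back to step 2 and repeat
      preds = f(x, params)                   # 2. make predictions
      loss = mse(preds, y)                   # 3. calculate the loss
      loss.backward()                        # 4. calculate the gradients
      with torch.no_grad():
          params -= params.grad * lr         # 5. step the weights
          params.grad.zero_()
  print(params)                              # 7. stop, here simply after a fixed number of steps
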
11
Q

How do we initialize the weights in a model?

A

Use random numbers.

12
Q

What is loss?

A

Loss is a value that represents how well (or how badly) our model is doing.

13
Q

Why can't we always use a high learning rate?

A

A learning rate that is too high results in steps so large that they can overshoot the minimum, causing the loss to get worse or to bounce around rather than converge.

14
Q

What is a gradient?

A

A gradient is a derivative of the loss with respect to a parameter of the model.

15
Q

Why can’t we use accuracy as a loss function?

A

The gradient of a function is its slope: how much the value of the function changes divided by how much we changed the input,

(y_new - y_old) / (x_new - x_old)

The problem is that a small change in a weight (x) is rarely enough to flip any prediction from wrong to right, so the change in accuracy (y_new - y_old) is almost always 0. The gradient is therefore 0 almost everywhere, and the model can't learn from a zero gradient. We need a loss function whose value changes at least a little whenever the weights change a little.

16
Q

What is special about the shape of the sigmoid function?

A

It looks like an “S”. It can take any input value, positive or negative, and always outputs a value between 0 and 1. It is also smooth and monotonically increasing, which makes it easier for SGD to find meaningful gradients.
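
For example:

  import torch

  x = torch.tensor([-6., -1., 0., 1., 6.])
  print(torch.sigmoid(x))   # tensor([0.0025, 0.2689, 0.5000, 0.7311, 0.9975])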

17
Q

What is the function to calculate new weights using a learning rate?

A

w -= w.grad * lr

18
Q

What does the DataLoader class do?

A

It takes in a dataset, shuffles it on every epoch and creates mini-batches. It returns an iterator over the batches.
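
A short sketch, mirroring the chapter's example (assumes fastai is installed; any indexable collection can serve as the dataset):

  from fastai.data.load import DataLoader

  coll = range(15)
  dl = DataLoader(coll, batch_size=5, shuffle=True)
  print(list(dl))   # three shuffled mini-batches of 5 items each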

19
Q

Write pseudocode showing the basic steps taken in each epoch for SGD.

A

for each mini-batch (features, targets) in the training data:

  • preds = model(features)
  • loss = loss_function(preds, targets)
  • calculate the gradients for each parameter
  • update the parameters by subtracting gradient * learning_rate
  • reset gradients to zero for each parameter

Each epoch goes through all the minibatches.

20
Q

Create a function that, if passed the two arguments [1,2,3,4] and 'abcd', returns [(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')]. What is special about that output data structure (as related to PyTorch Datasets)?

A

def pairs(a, b): return list(zip(a, b))
pairs([1,2,3,4], 'abcd')   # [(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')]

This is the format of a Dataset in PyTorch: a collection that contains tuples of independent and dependent variables.

21
Q

What does view do in PyTorch?

A

It changes the shape of a tensor without changing its contents (the total number of elements must stay the same).
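
For example:

  import torch

  t = torch.arange(6)         # tensor([0, 1, 2, 3, 4, 5]), shape (6,)
  print(t.view(2, 3))         # same six values, now 2 rows x 3 columns
  print(t.view(-1, 3).shape)  # -1 asks PyTorch to infer that axis from the rest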

22
Q

What are the bias parameters in a neural network? Why do we need them?

A

Bias parameters are the constants added after multiplying inputs by weights (y = x@w + b). If we only used x*weights, the output would always be zero when x = 0, no matter how we change the weights. Adding a bias gives the function the extra flexibility it needs.
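
A tiny sketch (shapes and values are made up for illustration):

  import torch

  x = torch.randn(4, 3)   # a mini-batch of 4 inputs with 3 features each
  w = torch.randn(3, 1)   # weights
  b = torch.randn(1)      # bias
  preds = x @ w + b       # without b, an all-zero input would always produce 0
  print(preds.shape)      # torch.Size([4, 1])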

23
Q

What does the @ operator do in Python?

A

Matrix multiplication. For tensors, a @ b is equivalent to torch.matmul(a, b).

24
Q

What does the backward method do?

A

It calculates the gradients of the loss with respect to the model parameters (backpropagation) and stores them in each parameter's .grad attribute.

25
Q

Why do we have to zero the gradients?

A

backward will add the calculated gradients to any gradients that are already stored. If you want to calculate gradients starting from zero, you have to zero out any gradients stored in the parameters.
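
A tiny sketch showing both the accumulation and the reset:

  import torch

  x = torch.tensor(3.).requires_grad_()
  (x * x).backward()
  print(x.grad)       # tensor(6.)  -- d(x^2)/dx at x = 3
  (x * x).backward()
  print(x.grad)       # tensor(12.) -- the new gradient was added to the stored one
  x.grad.zero_()      # reset before the next backward pass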

26
Q

What information do we have to pass to Learner (5 key things)?

A

The DataLoaders (training and validation data), the model, the optimization function, the loss function, and the metrics to print.
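
A sketch of the call, following the chapter's MNIST example (dls, mnist_loss, and batch_accuracy are assumed to be defined earlier, as in the chapter):

  from fastai.vision.all import *

  learn = Learner(dls,                     # 1. the DataLoaders (training + validation data)
                  nn.Linear(28*28, 1),     # 2. the model
                  opt_func=SGD,            # 3. the optimization function
                  loss_func=mnist_loss,    # 4. the loss function
                  metrics=batch_accuracy)  # 5. the metrics to print
  learn.fit(10, lr=0.1)                    # learning rate chosen for illustration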

27
Q

What is ReLU?

A

Rectified Linear Unit. It's an activation function that replaces every negative number with zero and leaves positive numbers unchanged.

28
Q

What is an activation function?

A

It's a nonlinear function placed between the linear layers of a neural network. Without it, a stack of linear layers would collapse into a single equivalent linear layer; the nonlinearity is what lets the network approximate more complex functions.

29
Q

What's the difference between F.relu and nn.ReLU?

A

They compute the same thing, but F.relu is a plain function, while nn.ReLU is a PyTorch module (a class you instantiate), which is what you use when defining a model with nn.Sequential.
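
For example, both in use (the layer sizes follow the chapter's simple_net):

  import torch
  from torch import nn
  import torch.nn.functional as F

  x = torch.tensor([-2., 0., 3.])
  print(F.relu(x))              # tensor([0., 0., 3.]) -- plain function call

  simple_net = nn.Sequential(   # nn.ReLU() is a module, so it can be placed in a model
      nn.Linear(28*28, 30),
      nn.ReLU(),
      nn.Linear(30, 1),
  )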

30
Q

The universal approximation theorem shows that any function can be approximated as closely as needed using just one nonlinearity. So why do we normally use more?

A

With more layers we can use smaller matrices and far fewer parameters than a single huge layer would need, so deeper models get better results in practice, train more quickly, and use less memory.

31
Q

What is a “backward pass”?

A

Computing the gradients of the loss with respect to all model parameters

32
Q

What is a gradient descent step?

A

Taking a step in the direction opposite to the gradients to make the model parameters a little bit better.

33
Q

What is the learning rate?

A

The size of the step we take when applying SGD to update the parameters of the model.