Lecture 2 Flashcards
What are feed-forward networks?
A “feed-forward network” in machine learning refers to a type of artificial neural network where data flows in only one direction, from the input layer through any hidden layers to the output layer, without any information looping back to earlier layers.
What can we say about neural networks with one hidden layer?
If there are enough hidden units, they can approximate any continuous function arbitrarily accurately.
What is the proof for saying a neural network with one hidden layer is a universal approximator?
Hand-waving proof rather than mathematical proof (looked at regression).
We can approximate a curve with lots of step functions.
Assuming we have the data, what do we need to set for a NN model?
- Network architecture (connections etc.)
- Weights and biases
What error function do we use for regression?
MSE
[See flashcards]
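For reference, the standard sum-of-squares form (assuming the usual notation, with network output z(x_n; w) and targets t_n; the flashcard may use a different convention):

E(w) = \frac{1}{2} \sum_n \big( z(x_n; w) - t_n \big)^2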
How do we get the best set of weights for regression?
Minimising the error function (MSE) with respect to the weights.
Why does the error function for linear regression have an analytical solution?
The MSE is quadratic in the weights.
For a neural network, by contrast, we can still differentiate to look for the minimum, but the resulting equations can't be solved analytically to find the weights.
What is the relationship between the outputs and inputs in a neural network?
In a neural network, the outputs z_i are highly nonlinear functions of the inputs x_i.
How do the error functions for linear regression and neural networks compare?
The linear regression error is a simple quadratic function with a single minimum.
The error surface of a neural network has many local minima; different sets of weights correspond to these different minima. There is no analytical solution for the best weights in a neural network.
As there is no analytical formula for the best weights in a neural network, how do we find them?
Numerically.
How do we find the weights?
Begin with a “guess” for the weights, and change them in steps, decreasing the error each time.
When do we know we have reached a minima of the error function?
When the weights stop changing (when steps no longer decrease the error).
What is the formula for going from step w(T) to w(T+1)?
[See flashcard]
With each step, we want to make a small change to the weights.
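The flashcard presumably shows the standard gradient-descent update; in the notation used here (with v the learning rate), it reads:

w^{(T+1)} = w^{(T)} - v \, \frac{\partial E}{\partial w} \Big|_{w^{(T)}}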
When do we have the biggest decrease in error?
When the change in w points in the same direction as the negative of the derivative of the error with respect to the weights - the cosine between them is then 1, making the decrease in error as big as possible.
Change in w proportional to -dE(w)/dw
What is the parameter v?
The learning rate.
This controls how quickly we decrease the error.
We have to tune the learning rate to control the optimisation.
Describe how v affects the model.
If v is too large, we may overshoot the minimum.
If v is too small, we might never get there (training is too slow).
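A minimal sketch of this effect on a toy quadratic error E(w) = w² (an illustrative example, not from the lecture), whose minimum is at w = 0:

```python
# Gradient descent on the toy error E(w) = w^2, with gradient dE/dw = 2w.
# The minimum is at w = 0; the learning rate v controls the behaviour.
def descend(v, w=1.0, steps=5):
    for _ in range(steps):
        w = w - v * 2 * w  # w(T+1) = w(T) - v * dE/dw
    return w

print(descend(v=0.01))  # too small: w barely moves towards 0
print(descend(v=0.4))   # reasonable: w shrinks towards 0 each step
print(descend(v=1.1))   # too large: each step overshoots and w diverges
```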
If we are close to a local minimum, what happens?
We move towards this local minimum, rather than the global minimum.
How do we find the global minimum?
1 - train several networks with random weights. We will find multiple minima which we can compare.
2 - use a minimisation method that may let us escape local minima (eg by “hopping over” local minima)
3 - using intuition to make an informed decision
Why may it be useful not to use the entire data set to change the weights (batch learning)?
If there is redundancy in the data (repetitions), we are doing more work than we need to.
How could we update the weights instead of batch learning?
We could randomly choose a training point i.
Each step now minimises the error for one training point, rather than the total error.
We choose points at random (a stochastic method) - the randomness can produce unexpected moves, which can help us get away from local minima.
The error for the chosen point decreases, but the total error may sometimes increase, which allows us to escape from local minima.
Rather than batch learning or sequential learning, what do we use in practice?
Explain this.
Mini-batch learning - this is somewhere between batch and sequential learning. It has the benefits of both (sequential learning - the stochasticity/randomness of training points; batch learning - using more information per step).
Each time weights are updated, choose a subset of the training set (mini-batch) randomly. We then repeat this with different batches.
This has the benefit that we only need to calculate the derivatives of the error with respect to the weights, and don't need to solve for w analytically. The derivatives are costly to calculate naively - we use back-propagation to avoid this cost.
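A minimal numpy sketch of one epoch of mini-batch updates (grad_E is a hypothetical function returning dE/dw on one batch; all names are illustrative):

```python
import numpy as np

# grad_E(w, X_batch, t_batch) is a hypothetical function returning dE/dw
# evaluated on one mini-batch; w, X, t are the weights, inputs and targets.
def minibatch_epoch(w, X, t, grad_E, v=0.01, batch_size=32):
    idx = np.random.permutation(len(X))        # visit the data in random order
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]  # a randomly chosen mini-batch
        w = w - v * grad_E(w, X[batch], t[batch])
    return w
```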
Why can we never be sure we have reached a global minimum?
In theory there can be an infinite number of minima.
What is back propagation?
Backpropagation is a powerful algorithm in deep learning, primarily used to train artificial neural networks, particularly feed-forward networks. It works iteratively, minimising the cost function by adjusting weights and biases.
“Cost function” - measures the difference between the network’s predictions and the actual target values. A measure of how well the network is performing.
What does inputting xi into the network and getting the output zi involve?
The forward propagation of information through the network.
This is how we make predictions with a neural network.
What does back propagation allow for?
The derivatives of the error to be found by working backwards through the network - once we know activations and node outputs.
Derivatives such as dE/dw_6 require many applications of the chain rule - computed naively for every weight, this would be very computationally expensive.
Briefly, what is the algorithm for updating weights with back propagation?
- Take a set of inputs and find the node outputs (y) and network outputs (forward propagation)
- Calculate the error function E(w)
- Calculate the derivatives (back propagation - to get deltas)
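A minimal numpy sketch of these three steps, assuming a network with one sigmoid hidden layer, a linear output, and the sum-of-squares error (the architecture, variable names, and the omission of biases are simplifications for illustration):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def backprop_step(x, t, W1, W2, v=0.01):
    # 1. Forward propagation: node outputs y, network output z.
    y = sigmoid(W1 @ x)                     # hidden node outputs
    z = W2 @ y                              # linear output layer

    # 2. The error function E(w).
    E = 0.5 * np.sum((z - t) ** 2)

    # 3. Back propagation: deltas, then derivatives dE/dw.
    delta2 = z - t                          # final delta (linear output)
    delta1 = y * (1 - y) * (W2.T @ delta2)  # recursion via the chain rule

    W2 = W2 - v * np.outer(delta2, y)       # dE/dW2 = outer(delta2, y)
    W1 = W1 - v * np.outer(delta1, x)       # dE/dW1 = outer(delta1, x)
    return W1, W2, E
```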
What do we need for backward propagation?
To get the delta of the final output node.
What do we use to get the derivatives of errors with respect to weights?
Deltas and outputs of nodes (y)
What are the two equations associated with back propagation?
[See flashcard]
These formulas arise because of the chain rule.
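In common notation (δ_j the delta of node j, a_j its activation, h the activation function, y_i the output of node i), the standard pair is:

\frac{\partial E}{\partial w_{ji}} = \delta_j \, y_i \qquad \delta_j = h'(a_j) \sum_k w_{kj} \, \delta_k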
What is a recursion relation?
A recursion relation is a mathematical expression that defines each term of a sequence in terms of the previous terms.
What is the principle of back propagation?
Once you have the delta for a node further down the network, you can work backwards to get the deltas of the previous nodes.
Even without carrying out the full back propagation or difficult differentiation, what do we know by finding the delta of the final layer?
We know how we should update our final layer weights - using residuals and outputs of the last hidden nodes (see flashcard)
How do we find how a weight within a hidden layer appears in the output?
Chain rule
In our case, why is the final delta easy to calculate?
We have a linear activation function on the final layer.
What is the delta for the final layer given by (with linear activation function)?
The sum of the residuals.
This is largely the reason why we chose the sum-of-squares error function to go with this activation function - it allows for a simple expression for the delta.
When we have a sigmoid function on the hidden layers, what is delta given by?
[See flashcard]
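Assuming the standard notation above: since the sigmoid satisfies h'(a_j) = y_j(1 - y_j), the recursion presumably takes the form:

\delta_j = y_j (1 - y_j) \sum_k w_{kj} \, \delta_k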
What is the vanishing gradient problem?
The vanishing gradient problem is a challenge that emerges during backpropagation when the derivatives or slopes of the activation functions become progressively smaller as we move backward through the layers of a neural network. This phenomenon is particularly prominent in deep networks with many layers, hindering the effective training of the model.
For very large positive or negative activations, the derivative (gradient) of the activation function tends to zero. This means that when you propagate backwards, the delta for a given node involves multiplication by values close to zero. Therefore the weights will barely update.
When do we experience the vanishing gradient problem?
When using the sigmoidal activation function - its output is bounded between 0 and 1, and its derivative tends to zero for large positive or negative activations.
What function does not cause gradients to vanish?
The ReLU() function - rectified linear unit activation function.
This activation function does not saturate in both directions: sigmoid functions saturate for both big and small activations, whereas ReLU only saturates in one direction and is therefore less prone to the vanishing gradient problem.
Why does the vanishing gradient problem become apparent during back propagation?
Because of the recursion relation
Describe the ReLU() function.
[See flashcards] - graphs etc.
Non-linearity comes about at small (negative) activations, where the output is clipped to zero.
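For reference, the standard definition and derivative:

\mathrm{ReLU}(a) = \max(0, a), \qquad \mathrm{ReLU}'(a) = \begin{cases} 1 & a > 0 \\ 0 & a < 0 \end{cases}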
What are advantages of the ReLU?
- Can get around the vanishing gradient problem as it only saturates in one direction
- Quicker to compute than sigmoid (sigmoid has an exponential which is costly to evaluate)
What is the “dying ReLU” problem?
The dying ReLU problem occurs when ReLU neurons in a neural network become inactive and only output zero. This happens when neurons receive negative input, which can cause the network to stop learning.
Occurs if too many activations are negative.
How can you solve the “dying ReLU” problem?
By using the leaky ReLU function.
This looks similar to the ReLU, where positive activations are passed through unchanged. However, there is a small positive gradient for negative values (rather than a perfectly horizontal line). This means it does not saturate, ie the output and derivatives never go to zero.
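A minimal numpy sketch of both functions (the negative-side slope of 0.01 is a common default, not a value from the lecture):

```python
import numpy as np

def relu(a):
    # Saturates in one direction: zero output and zero gradient for a < 0.
    return np.maximum(0.0, a)

def leaky_relu(a, slope=0.01):
    # Small positive gradient for negative activations, so the output and
    # derivative never go exactly to zero and units cannot "die".
    return np.where(a > 0, a, slope * a)
```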
Why would we not use the sum-of-squares error function (MSE) for neural networks?
We don’t have to - we cannot find an analytical solution for neural networks, so there is no need to use this error function.
The error function can be considered a hyperparameter of the model, just as much as the architecture and the choice of activations.
Why would we not choose the sum-of-squares error as the perceptron error function?
The sum of squares is not continuous or differentiable for the perceptron, because of the step activation function.
With a single node, the error function depends on weights in a discontinuous way.
What error function is good for regression problems?
The sum-of-squares error
What error function is good for classification problems?
The cross-entropy error function.
[See flashcard]
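The flashcard presumably shows the standard binary cross-entropy, with targets t_n ∈ {0, 1} and network outputs z_n ∈ (0, 1):

E = -\sum_n \big[ t_n \ln z_n + (1 - t_n) \ln(1 - z_n) \big]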
When is the cross-entropy error zero?
When all training data points are correctly classified.
When is the cross-entropy error infinite?
When a training data point is completely misclassified (eg the network outputs z = 0 when the target is t = 1) - the log term then diverges.
If we use the cross-entropy error function with a sigmoidal activation function, what delta for the final node do we get?
z_i - t_i
What is one reason to use the sigmoidal activation function over the ReLU activation function?
The sigmoidal activation function pairs well with the cross-entropy error function to give a simple expression of delta for the final node, which we can then use to propagate backwards.
What is an adaptive method?
In machine learning, an “adaptive method” refers to an algorithm that can automatically adjust its parameters, like the learning rate, based on the data it encounters during training, allowing it to adapt and improve its performance as it receives new information.
We can modify v at each step.
How would we modify v if the error increases?
Decrease v - we likely overshot a minimum.
How would we modify v if the error barely changes?
v may be too small; we need to increase it.
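A minimal sketch of this heuristic (the factors 0.5 and 1.1 and the tolerance are illustrative choices, not values from the lecture):

```python
def adapt_learning_rate(v, E_new, E_old, tol=1e-4):
    if E_new > E_old:             # error increased: we likely overshot a minimum
        return v * 0.5            # decrease v
    if abs(E_new - E_old) < tol:  # error barely changes: v may be too small
        return v * 1.1            # increase v
    return v
```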
What is overfitting?
Overfitting is a machine learning problem that occurs when a model is too closely trained to a specific set of data, making it unable to make accurate predictions for new data.
The more parameters (weights and biases) we have, the more closely the network can fit the data, but this can come at the cost of not being able to make good predictions for new data. The network is too flexible.
We do not want to fit the training data perfectly at the expense of generalisation.
What is a flexible model?
One with lots and lots of parameters.
A model with the capability to adapt to complex patterns in data; its high capacity to learn is achieved through a high number of parameters.
Networks that are too flexible give extremely accurate answers for training data, but poor answers for new data.
What can our intuition tell us about the function to use to fit data?
In real-world problems, functions are usually quite smooth.
If we fit the data perfectly with a function, we can get quite high-order polynomials with associated weights. What can we add to improve this?
Add some extra information (telling us which weights we want to keep or not) to help regularise the model.
What is regularisation?
Regularisation in machine learning is a technique that prevents overfitting by reducing the complexity of a model.
It adds in extra information about the problem, penalising solutions that are overfitted.
For linear regression, what term is added to the error function?
σ²|w|²
(σ - sigma)
What does adding the term σ²|w|² to the error function achieve?
This extra term in the loss function gets big if the weights are very big.
It avoids oscillatory solutions where the weights alternate between large positive and negative numbers, ie it makes the solution smoother.
σ² indicates how important this information is. If σ² is 0, we don't care about the weights being large. If σ² is big, the only thing that matters is that the weights shouldn't be too large.
What are the benefits of regularisation?
- You can still analytically get the best weights by minimising the error function (differentiating and setting to 0)
- We derived the equation (adding σ²|w|²) using intuition, but you can also get the same thing using Bayesian statistics
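The first benefit can be made concrete: assuming the error is the sum-of-squares over the data plus σ²|w|², setting its derivative to zero gives the standard ridge-regression solution (in matrix notation, with design matrix X and target vector t):

w^* = (X^\top X + \sigma^2 I)^{-1} X^\top t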
What are three ways to avoid overfitting?
1 - early stopping (easy)
2 - dropout (more difficult but more elegant)
3 - Tikhonov or weight decay regularisation (almost no one uses)
Describe the early stopping method.
When we look at the learning curves of training and test data, we may see that the error of the training curve continues to decrease over time. However, we may see that the test curve begins to increase at a certain point, which indicates overfitting.
At this point, where it begins to increase, we could stop training the model, and use it as is before you overfit it.
This is an easy, accessible method as you always have access to the learning curves.
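A minimal sketch, with the training and evaluation steps passed in as hypothetical callables (the patience threshold is an illustrative choice):

```python
def train_with_early_stopping(train_one_epoch, validation_error, get_weights,
                              set_weights, patience=5, max_epochs=1000):
    """Stop when the validation error has not improved for `patience`
    consecutive epochs, then restore the best weights seen."""
    best_error, best_weights, since_best = float("inf"), None, 0
    for epoch in range(max_epochs):
        train_one_epoch()              # training error keeps decreasing
        err = validation_error()       # error on held-out data
        if err < best_error:           # still generalising: remember weights
            best_error, best_weights, since_best = err, get_weights(), 0
        else:
            since_best += 1            # validation error rising: overfitting
            if since_best >= patience:
                break                  # stop and use the model as it was
    set_weights(best_weights)
```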
When you update weights, the error decreases on the training set. Why is this guaranteed?
Gradient descent - each step moves the weights in the direction of the negative gradient, which decreases the training error (for a small enough learning rate).
What can be said if the error on the testing/validation set starts to increase?
We are overfitting.
Describe the dropout method.
One way to prevent overfitting is to train an ensemble of networks with different architectures, show them all the same training set, and average over their predictions. Each one overfits differently (some too high, some too low), so the overfitting is hopefully washed out by averaging.
However, training an ensemble of networks requires more computational effort. eg 100x more costly if they all have similar architecture / number of nodes and weights.
The dropout method simulates an ensemble of neural networks, without actually having to have an ensemble of neural networks. Each member is based on a starting network - this will be the network we want to make predictions with. We train the model over and over, each time updating the weights and dropping out some of the nodes/inputs.
We have probability p for dropping out a unit (1 - p = the keep probability). Each time we update the weights, nodes / inputs are dropped by a dropout probability of p. When a node is dropped, the respective connections are also dropped (these weights will not be updated).
All in all, this means that for each run, we are updating the weights on a slightly different network. In this way, we are making a single network act like a different network each time. We are training most, but not all of the nodes each time.
Following training using the drop out method, how do we make predictions?
We want to use the original (full) network to make predictions. At prediction time, the hidden units would receive more activation than during training - because not all the nodes (and therefore their connections) were present each time during training, the nodes received slightly lower activation then than they will in the full network.
If p is the probability that a unit was dropped (and its weights not updated) during training, we multiply the activation by (1 - p) when making predictions to compensate.
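A minimal numpy sketch of this convention for a single hidden layer (drop probability p during training, scaling by the keep probability 1 - p at prediction time; the ReLU choice is illustrative):

```python
import numpy as np

def hidden_layer(x, W, p=0.5, training=True):
    y = np.maximum(0.0, W @ x)                # node outputs (ReLU, for example)
    if training:
        mask = np.random.rand(*y.shape) >= p  # drop each unit with probability p
        return y * mask                       # dropped nodes output zero
    return y * (1 - p)                        # prediction: scale by keep probability
```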
Describe Tikhonov / weight decay regularisation.
Applying a regularisation (similar to σ²|w|² for regression) - ie adding a term to the error function to regularise it.
There are an infinite number of possible neural networks. How can we begin to consider choosing the best architecture for our network?
- Symmetries
- Time correlations
- Dimensionality reductions etc.
For our simple feed-forward fully connected network, we may not know in advance what the best architecture is for the problem of interest, and we would need to test different architectures. But as the possibilities are endless, we cannot try them all.
- As a general rule, deeper networks are better
Which are better, deeper networks or shallower networks?
Deeper networks are better - better to increase the number of layers rather than the number of nodes in a single layer. This is shown to decrease the error rate.
Why are deeper networks better if we said that a single hidden layer is enough to give a universal approximator?
This is only the case if there are enough nodes.
Generally, deeper networks need fewer units and less data. The further into the network information is propagated, the more complex the combinations of features become, and the finer the detail - ie the network is modular.
The first layer of a NN takes into account the coarse detail and combines features together; the second layer combines combinations of features, and so on. We get a richer, more complex function out due to this modularity.