Lecture 2 Flashcards
What are feed-forward networks?
A “feed-forward network” in machine learning refers to a type of artificial neural network where data flows in only one direction, from the input layer through any hidden layers to the output layer, without any information looping back to earlier layers.
What can we say about neural networks with one hidden layer?
If there are enough hidden units, they can approximate any continuous function arbitrarily accurately.
What is the proof for saying a neural network with one hidden layer is a universal approximator?
Hand-waving proof rather than mathematical proof (looked at regression).
We can approximate a curve with lots of step functions.
Assuming we have the data, what do we need to set for a NN model?
- Network architecture (connections etc.)
- Weights and biases
What error function do we use for regression?
MSE
[See flashcards]
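For reference, the standard sum-of-squares form (assuming the usual notation, with network output z(x_n; w) and targets t_n; the flashcard may use a different convention):

E(w) = \frac{1}{2} \sum_n \big( z(x_n; w) - t_n \big)^2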
How do we get the best set of weights for regression?
Minimising the error function (MSE) with respect to the weights.
Why does the error function for linear regression have an analytical solution?
The MSE is quadratic in the weights.
For a neural network, by contrast, we can still differentiate to look for the minimum, but the resulting equations can't be solved analytically to find the weights.
What is the relationship between the outputs and inputs in a neural network?
In a neural network, the outputs z_i are highly nonlinear functions of the inputs x_i.
How do the error functions for linear regression and neural networks compare?
The linear regression error is a simple quadratic function with a single minimum.
The error surface of a neural network has many local minima; different sets of weights correspond to these different minima. There is no analytical solution for the best weights in a neural network.
As there is no analytical formula for the best weights in a neural network, how do we find them?
Numerically.
How do we find the weights?
Begin with a “guess” for the weights, and change them in steps, decreasing the error each time.
When do we know we have reached a minima of the error function?
When the weights stop changing (when steps no longer decrease the error).
What is the formula for going from step w(T) to w(T+1)?
[See flashcard]
With each step, we want to make a small change to the weights.
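The flashcard presumably shows the standard gradient-descent update; in the notation used here (with v the learning rate), it reads:

w^{(T+1)} = w^{(T)} - v \, \frac{\partial E}{\partial w} \Big|_{w^{(T)}}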
When do we have the biggest decrease in error?
When the change in w points in the same direction as the negative of the derivative of the error with respect to the weights - the cosine between them is then 1, making the decrease in error as big as possible.
Change in w proportional to -dE(w)/dw
What is the parameter v?
The learning rate.
This controls how quickly we decrease the error.
We have to tune the learning rate to control the optimisation.
Describe how v affects the model.
If v is too large, we may overshoot the minimum.
If v is too small, we might never get there (training is too slow).
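A minimal sketch of this effect on a toy quadratic error E(w) = w² (an illustrative example, not from the lecture), whose minimum is at w = 0:

```python
# Gradient descent on the toy error E(w) = w^2, with gradient dE/dw = 2w.
# The minimum is at w = 0; the learning rate v controls the behaviour.
def descend(v, w=1.0, steps=5):
    for _ in range(steps):
        w = w - v * 2 * w  # w(T+1) = w(T) - v * dE/dw
    return w

print(descend(v=0.01))  # too small: w barely moves towards 0
print(descend(v=0.4))   # reasonable: w shrinks towards 0 each step
print(descend(v=1.1))   # too large: each step overshoots and w diverges
```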
If we are close to a local minimum, what happens?
We move towards this local minimum, rather than the global minimum.
How do we find the global minimum?
1 - train several networks with random weights. We will find multiple minima which we can compare.
2 - use a minimisation method that may let us escape local minima (eg by “hopping over” local minima)
3 - using intuition to make an informed decision
Why may it be useful not to use the entire data set to change the weights (batch learning)?
If there is redundancy in the data (repetitions), we are doing more work than we need to.
How could we update the weights instead of batch learning?
We could randomly choose a training point i.
Each step now minimises the error for one training point, rather than the total error.
We choose points at random (a stochastic method) - the randomness can produce unexpected moves, which can help us get away from local minima.
The error for the chosen point decreases, but the total error may sometimes increase, which allows us to escape from local minima.
Rather than batch learning or sequential learning, what do we use in practice?
Explain this.
Mini-batch learning - this is somewhere between batch and sequential learning. It has the benefits of both (sequential learning - the stochasticity/randomness of training points; batch learning - using more information per step).
Each time weights are updated, choose a subset of the training set (mini-batch) randomly. We then repeat this with different batches.
This has the benefit that we only need to calculate the derivatives of the error with respect to the weights, and don't need to solve for w analytically. The derivatives are costly to calculate naively - we use back-propagation to avoid this cost.
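A minimal numpy sketch of one epoch of mini-batch updates (grad_E is a hypothetical function returning dE/dw on one batch; all names are illustrative):

```python
import numpy as np

# grad_E(w, X_batch, t_batch) is a hypothetical function returning dE/dw
# evaluated on one mini-batch; w, X, t are the weights, inputs and targets.
def minibatch_epoch(w, X, t, grad_E, v=0.01, batch_size=32):
    idx = np.random.permutation(len(X))        # visit the data in random order
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]  # a randomly chosen mini-batch
        w = w - v * grad_E(w, X[batch], t[batch])
    return w
```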
Why can we never be sure we have reached a global minimum?
In theory there can be an infinite number of minima.
What is back propagation?
Backpropagation is a powerful algorithm in deep learning, primarily used to train artificial neural networks, particularly feed-forward networks. It works iteratively, minimising the cost function by adjusting weights and biases.
“Cost function” - measures the difference between the network’s predictions and the actual target values. A measure of how well the network is performing.
What does inputting xi into the network and getting the output zi involve?
The forward propagation of information through the network.
This is how we make predictions with a neural network.
What does back propagation allow for?
The derivatives of the error to be found by working backwards through the network - once we know activations and node outputs.
Derivatives such as dE/dw_6 require many applications of the chain rule - computed naively for every weight, this would be very computationally expensive.
Briefly, what is the algorithm for updating weights with back propagation?
- Take a set of inputs and find the node outputs (y) and network outputs (forward propagation)
- Calculate the error function E(w)
- Calculate the derivatives (back propagation - to get deltas)
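A minimal numpy sketch of these three steps, assuming a network with one sigmoid hidden layer, a linear output, and the sum-of-squares error (the architecture, variable names, and the omission of biases are simplifications for illustration):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def backprop_step(x, t, W1, W2, v=0.01):
    # 1. Forward propagation: node outputs y, network output z.
    y = sigmoid(W1 @ x)                     # hidden node outputs
    z = W2 @ y                              # linear output layer

    # 2. The error function E(w).
    E = 0.5 * np.sum((z - t) ** 2)

    # 3. Back propagation: deltas, then derivatives dE/dw.
    delta2 = z - t                          # final delta (linear output)
    delta1 = y * (1 - y) * (W2.T @ delta2)  # recursion via the chain rule

    W2 = W2 - v * np.outer(delta2, y)       # dE/dW2 = outer(delta2, y)
    W1 = W1 - v * np.outer(delta1, x)       # dE/dW1 = outer(delta1, x)
    return W1, W2, E
```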
What do we need for backward propagation?
To get the delta of the final output node.
What do we use to get the derivatives of errors with respect to weights?
Deltas and outputs of nodes (y)
What are the two equations associated with back propagation?
[See flashcard]
These formulas arise because of the chain rule.
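In common notation (δ_j the delta of node j, a_j its activation, h the activation function, y_i the output of node i), the standard pair is:

\frac{\partial E}{\partial w_{ji}} = \delta_j \, y_i \qquad \delta_j = h'(a_j) \sum_k w_{kj} \, \delta_k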
What is a recursion relation?
A recursion relation is a mathematical expression that defines each term of a sequence in terms of the previous terms.
What is the principle of back propagation?
Once you have the delta for a node further down the network, you can work backwards to get the deltas of the previous nodes.
Even without carrying out the full back propagation or difficult differentiation, what do we know by finding the delta of the final layer?
We know how we should update our final layer weights - using residuals and outputs of the last hidden nodes (see flashcard)
How do we find how a weight within a hidden layer appears in the output?
Chain rule
In our case, why is the final delta easy to calculate?
We have a linear activation function on the final layer.
What is the delta for the final layer given by (with linear activation function)?
The sum of the residuals.
This is largely the reason why we chose the sum-of-squares error function to go with this activation function - it allows for a simple expression for the delta.
When we have a sigmoid function on the hidden layers, what is delta given by?
[See flashcard]
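Assuming the standard notation above: since the sigmoid satisfies h'(a_j) = y_j(1 - y_j), the recursion presumably takes the form:

\delta_j = y_j (1 - y_j) \sum_k w_{kj} \, \delta_k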
What is the vanishing gradient problem?
The vanishing gradient problem is a challenge that emerges during backpropagation when the derivatives or slopes of the activation functions become progressively smaller as we move backward through the layers of a neural network. This phenomenon is particularly prominent in deep networks with many layers, hindering the effective training of the model.
For very large positive or negative activations, the derivative (gradient) of the activation function tends to zero. This means that when you propagate backwards, the delta for a given node involves multiplication by values close to zero. Therefore the weights will barely update.
When do we experience the vanishing gradient problem?
When using the sigmoidal activation function - its output is bounded between 0 and 1, and its derivative tends to zero for large positive or negative activations.
What function does not cause gradients to vanish?
The ReLU() function - rectified linear unit activation function.
This activation function does not saturate in both directions: sigmoid functions saturate for both big and small activations, whereas ReLU only saturates in one direction and is therefore less prone to the vanishing gradient problem.
Why does the vanishing gradient problem become apparent during back propagation?
Because of the recursion relation
Describe the ReLU() function.
[See flashcards] - graphs etc.
Non-linearity comes about at small (negative) activations, where the output is clipped to zero.
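For reference, the standard definition and derivative:

\mathrm{ReLU}(a) = \max(0, a), \qquad \mathrm{ReLU}'(a) = \begin{cases} 1 & a > 0 \\ 0 & a < 0 \end{cases}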
What are advantages of the ReLU?
- Can get around the vanishing gradient problem as it only saturates in one direction
- Quicker to compute than sigmoid (sigmoid has an exponential which is costly to evaluate)
What is the “dying ReLU” problem?
The dying ReLU problem occurs when ReLU neurons in a neural network become inactive and only output zero. This happens when neurons receive negative input, which can cause the network to stop learning.
Occurs if too many activations are negative.
How can you solve the “dying ReLU” problem?
By using the leaky ReLU function.
This looks similar to the ReLU, where positive activations are passed through unchanged. However, there is a small positive gradient for negative values (rather than a perfectly horizontal line). This means it does not saturate, ie the output and derivatives never go to zero.
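A minimal numpy sketch of both functions (the negative-side slope of 0.01 is a common default, not a value from the lecture):

```python
import numpy as np

def relu(a):
    # Saturates in one direction: zero output and zero gradient for a < 0.
    return np.maximum(0.0, a)

def leaky_relu(a, slope=0.01):
    # Small positive gradient for negative activations, so the output and
    # derivative never go exactly to zero and units cannot "die".
    return np.where(a > 0, a, slope * a)
```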
Why would we not use the sum-of-squares error function (MSE) for neural networks?
We don’t have to - we cannot find an analytical solution for neural networks, so there is no need to use this error function.
The error function can be considered a hyperparameter of the model, just as much as the architecture and the choice of activations.
Why would we not choose the sum-of-squares error as the perceptron error function?
The sum of squares is not continuous or differentiable for the perceptron, because of the step activation function.
With a single node, the error function depends on weights in a discontinuous way.
What error function is good for regression problems?
The sum-of-squares error
What error function is good for classification problems?
The cross-entropy error function.
[See flashcard]
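The flashcard presumably shows the standard binary cross-entropy, with targets t_n ∈ {0, 1} and network outputs z_n ∈ (0, 1):

E = -\sum_n \big[ t_n \ln z_n + (1 - t_n) \ln(1 - z_n) \big]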
When is the cross-entropy error zero?
When all training data points are correctly classified.
When is the cross-entropy error infinite?
When a training data point is completely misclassified (eg the network outputs z = 0 when the target is t = 1) - the log term then diverges.
If we use the cross-entropy error function with a sigmoidal activation function, what delta for the final node do we get?
z_i - t_i
What is one reason to use the sigmoidal activation function over the ReLU activation function?
The sigmoidal activation function pairs well with the cross-entropy error function to give a simple expression of delta for the final node, which we can then use to propagate backwards.
What is an adaptive method?
In machine learning, an “adaptive method” refers to an algorithm that can automatically adjust its parameters, like the learning rate, based on the data it encounters during training, allowing it to adapt and improve its performance as it receives new information.
We can modify v at each step.
How would we modify v if the error increases?
Decrease v - we likely overshot a minimum.
How would we modify v if the error barely changes?
v may be too small; we need to increase it.
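A minimal sketch of this heuristic (the factors 0.5 and 1.1 and the tolerance are illustrative choices, not values from the lecture):

```python
def adapt_learning_rate(v, E_new, E_old, tol=1e-4):
    if E_new > E_old:             # error increased: we likely overshot a minimum
        return v * 0.5            # decrease v
    if abs(E_new - E_old) < tol:  # error barely changes: v may be too small
        return v * 1.1            # increase v
    return v
```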
What is overfitting?
Overfitting is a machine learning problem that occurs when a model is too closely trained to a specific set of data, making it unable to make accurate predictions for new data.
The more parameters (weights and biases) we have, the more closely the network can fit the data, but this can come at the cost of not being able to make good predictions for new data. The network is too flexible.
We do not want to fit the training data perfectly at the expense of generalisation.
What is a flexible model?
One with lots and lots of parameters.
A model with the capability to adapt to complex patterns in data; its high capacity to learn is achieved through a high number of parameters.
Networks that are too flexible give extremely accurate answers for training data, but poor answers for new data.
What can our intuition tell us about the function to use to fit data?
In real-world problems, functions are usually quite smooth.
If we fit the data perfectly with a function, we can get quite high-order polynomials with associated weights. What can we add to improve this?
Add some extra information (telling us which weights we want to keep or not) to help regularise the model.
What is regularisation?
Regularisation in machine learning is a technique that prevents overfitting by reducing the complexity of a model.
It adds in extra information about the problem, penalising solutions that are overfitted.
For linear regression, what term is added to the error function?
σ²|w|²
(σ - sigma)
What does adding the term σ²|w|² to the error function achieve?
This extra term in the loss function gets big if the weights are very big.
It avoids oscillatory solutions where the weights alternate between large positive and negative numbers, ie it makes the solution smoother.
σ² indicates how important this information is. If σ² is 0, we don't care about the weights being large. If σ² is big, the only thing that matters is that the weights shouldn't be too large.
What are the benefits of regularisation?
- You can still analytically get the best weights by minimising the error function (differentiating and setting to 0)
- We derived the equation (adding σ²|w|²) using intuition, but you can also get the same thing using Bayesian statistics
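The first benefit can be made concrete: assuming the error is the sum-of-squares over the data plus σ²|w|², setting its derivative to zero gives the standard ridge-regression solution (in matrix notation, with design matrix X and target vector t):

w^* = (X^\top X + \sigma^2 I)^{-1} X^\top t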
What are three ways to avoid overfitting?
1 - early stopping (easy)
2 - dropout (more difficult but more elegant)
3 - Tikhonov or weight decay regularisation (almost no one uses)
Describe the early stopping method.
When we look at the learning curves of training and test data, we may see that the error of the training curve continues to decrease over time. However, we may see that the test curve begins to increase at a certain point, which indicates overfitting.
At this point, where it begins to increase, we could stop training the model, and use it as is before you overfit it.
This is an easy, accessible method as you always have access to the learning curves.
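A minimal sketch, with the training and evaluation steps passed in as hypothetical callables (the patience threshold is an illustrative choice):

```python
def train_with_early_stopping(train_one_epoch, validation_error, get_weights,
                              set_weights, patience=5, max_epochs=1000):
    """Stop when the validation error has not improved for `patience`
    consecutive epochs, then restore the best weights seen."""
    best_error, best_weights, since_best = float("inf"), None, 0
    for epoch in range(max_epochs):
        train_one_epoch()              # training error keeps decreasing
        err = validation_error()       # error on held-out data
        if err < best_error:           # still generalising: remember weights
            best_error, best_weights, since_best = err, get_weights(), 0
        else:
            since_best += 1            # validation error rising: overfitting
            if since_best >= patience:
                break                  # stop and use the model as it was
    set_weights(best_weights)
```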
When you update weights, the error decreases on the training set. Why is this guaranteed?
Gradient descent - each step moves the weights in the direction of the negative gradient, which decreases the training error (for a small enough learning rate).
What can be said if the error on the testing/validation set starts to increase?
We are overfitting.
Describe the dropout method.
One way to prevent overfitting is to train an ensemble of networks with different architectures, show them all the same training set, and average over their predictions. Each one overfits differently (some too high, some too low), so the overfitting is hopefully washed out by averaging.
However, training an ensemble of networks requires more computational effort. eg 100x more costly if they all have similar architecture / number of nodes and weights.
The dropout method simulates an ensemble of neural networks, without actually having to have an ensemble of neural networks. Each member is based on a starting network - this will be the network we want to make predictions with. We train the model over and over, each time updating the weights and dropping out some of the nodes/inputs.
We have probability p for dropping out a unit (1 - p = the keep probability). Each time we update the weights, nodes / inputs are dropped by a dropout probability of p. When a node is dropped, the respective connections are also dropped (these weights will not be updated).
All in all, this means that for each run, we are updating the weights on a slightly different network. In this way, we are making a single network act like a different network each time. We are training most, but not all of the nodes each time.
Following training using the drop out method, how do we make predictions?
We want to use the original (full) network to make predictions. At prediction time, the hidden units would receive more activation than during training - because not all the nodes (and therefore their connections) were present each time during training, the nodes received slightly lower activation then than they will in the full network.
If p is the probability that a unit was dropped (and its weights not updated) during training, we multiply the activation by (1 - p) when making predictions to compensate.
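A minimal numpy sketch of this convention for a single hidden layer (drop probability p during training, scaling by the keep probability 1 - p at prediction time; the ReLU choice is illustrative):

```python
import numpy as np

def hidden_layer(x, W, p=0.5, training=True):
    y = np.maximum(0.0, W @ x)                # node outputs (ReLU, for example)
    if training:
        mask = np.random.rand(*y.shape) >= p  # drop each unit with probability p
        return y * mask                       # dropped nodes output zero
    return y * (1 - p)                        # prediction: scale by keep probability
```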
Describe Tikhonov / weight decay regularisation.
Applying a regularisation (similar to σ²|w|² for regression) - ie adding a term to the error function to regularise it.
There are an infinite number of possible neural networks. How can we begin to consider choosing the best architecture for our network?
- Symmetries
- Time correlations
- Dimensionality reductions etc.
For our simple feed-forward fully connected network, we may not know in advance what the best architecture is for the problem of interest, and we would need to test different architectures. But as the possibilities are endless, we cannot try them all.
- As a general rule, deeper networks are better
Which are better, deeper networks or shallower networks?
Deeper networks are better - better to increase the number of layers rather than the number of nodes in a single layer. This is shown to decrease the error rate.
Why are deeper networks better if we said that a single hidden layer is enough to give a universal approximator?
This is only the case if there are enough nodes.
Generally, deeper networks need fewer units and less data. The further into the network information is propagated, the more complex the combinations of features become, and the finer the detail - ie the network is modular.
The first layer of a NN takes into account the coarse detail and combines features together; the second layer combines combinations of features, and so on. We get a richer, more complex function out due to this modularity.