Perceptron & Backpropagation Flashcards

1
Q

How does the lecturer compare a brain to a computer?

A

Brains compute actions from perception
=> hidden states (info stuck in the brain that influences what you're going to do): goals, plans
=> learning: the brain is flexible, there is learning going on
▪ outcome dependent (reinforcement/supervised; 'if I go to bed at the right time I'll get a good night's sleep')
▪ outcome independent (unsupervised/self-supervised)

2
Q

Describe how spiking neurons are modelled

A

Early models: spikes (McCulloch–Pitts)

An input affects the firing rate, which can be an analogue value between 0 and 200 Hz.

h = hidden layer, something that you can't observe. A hidden layer is located between the input and output of the algorithm; in it, the function applies weights to the inputs and directs them through an activation function as the output.

n neurons in the hidden layer, n in the next layer –> a total of n^2 connections (weights) between the two layers

3
Q

What functions can the transfer function take?

A

Linear, hyperbolic tangent, etc.

4
Q

What is the purpose of the activation and transfer function?

A

Activation functions work on some threshold value: once that value is crossed, the signal is triggered. The transfer function is used to translate input signals to output signals.

5
Q

Give a very simplified generalised model of an artificial neural network

A

O = f(x1, x2)
(Output is a function of input variables)

6
Q

What is meant by a deep neural network?

A

A network with many hidden layers

7
Q

Name five identification processes in a deep neural network

A
  1. Identify pixel values
  2. Identify edges
  3. Identify combinations of edges
  4. Identify features
  5. Identify combinations of features
8
Q

Give an example of objective functions in real life

A

The ultimate objective function = survive and procreate

but it's not just that

9
Q

How do these identification processes compare to physiology?

A

some neurons become sensitive to edges

some neurons are sensitive to parts of the face

10
Q

Describe the relationship between parameters and fit

A

Too few parameters is insufficient and does not capture the data well enough; too many parameters fits the data too well (overfitting) and does not describe the underlying, generalisable relationship. As an analogy, a line on a graph could perfectly outline the shape in an image of an elephant, but this would not generalise to other images of elephants.

11
Q

What is a loss function?

A

A loss function is a function that compares the target and predicted output values; measures how well the neural network models the training data. When training, we aim to minimise this loss between the predicted and target outputs.

12
Q

What does the architecture of a network concern?

A

How a certain architecture with certain learning rules achieves certain objective functions.

13
Q

Where does neurological preprocessing take place? How are the parameters updated?

A

The retina does fantastic preprocessing, and its parameters are fixed (they are not updated by learning).

14
Q

What is visual processing mostly based on?

A

Visual processing is mostly based on contrast

15
Q

What are the learning rules of an algorithm; what methods do they employ?

A

They use a gradient and a learning update rule. The loss function is the method of evaluating how well your machine-learning algorithm models your data set.

16
Q

As a summary what does neural learning combine?

A

Architecture, loss function(s), learning rules

17
Q

Give two examples of these neural learning rules

A

Cortical columns (architecture), plasticity rules

18
Q

Learning rules get neural networks to do useful stuff. Give some examples of useful stuff

A

=> Identify digits in images (MNIST)
=> Translate Chinese to Dutch
=> Recommend movies based on past ratings (Netflix prize)
=> Model cognitive processes: attention, perception, etc.

19
Q

What is the function of learning algorithms?

A

Finding a suitable set of parameters (weights)

20
Q

Describe the maths behind a simple neuron in a neural network

A

s = b + Σ wᵢxᵢ
(s = the activation)
wᵢ are the weights and b is the bias in the network.
f is termed the activation function.
f(s) is the output of the neural network.
f(s) = 1 if s > 0; f(s) = 0 if s ≤ 0
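
A minimal Python sketch of this unit (the function names and example weights are mine, not from the lecture):

```python
# Minimal sketch of the neuron above: s = b + sum(w_i * x_i),
# followed by a hard threshold f(s) = 1 if s > 0 else 0.
def neuron(x, w, b):
    s = b + sum(w_i * x_i for w_i, x_i in zip(w, x))  # weighted sum plus bias
    return 1 if s > 0 else 0                          # step activation f(s)

# Example: two inputs with weights 1.0 and bias -1.5 implements AND.
print(neuron([1, 1], [1.0, 1.0], -1.5))  # -> 1
print(neuron([1, 0], [1.0, 1.0], -1.5))  # -> 0
```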

21
Q

Describe how this neuron works

A

This model can work on non-boolean values where each input connection gets associated with a weight. Here the function calculates the weighted sum and based on the threshold value provided, it gives a binary output.

22
Q

Describe the initial layer of a perceptron with a 28x28 pixel image

A

With a given input image of 28x28 pixels, a perceptron can attempt to output which number is shown in the image. This translates to 784 'neurons', each with its own activation, which roughly corresponds to how 'bright' the corresponding pixel is, ranging from black at 0 to white at 1.

23
Q

What would the outcome layer consist of?

A

The last column, the output layer, comprises the ten possible digits (0–9) the network could categorise the input into. Each of these 'neurons' also has an activation relating to how likely that digit is given the input.

24
Q

What do the layers in between consist of? Describe how they work

A

The layers between these two are called hidden layers. This network can have, e.g., two layers of 16 neurons each, but that is a somewhat arbitrary choice here. The activations of one layer (e.g. input layer, hidden layer) determine the activations in the next layer (e.g. hidden layer, output layer). This network has already been trained to recognise digits. This means that if you feed it an image, the different activations in each pixel cause a very specific pattern of activation in the next layer, which gives some pattern to the one after it, which gives a specific pattern to the output layer. The output neuron with the most activation is then selected as what the image represents.

25
Q

What do we hope the hidden layers are doing?

A

What we hope the hidden layers are doing is something akin to how our own vision works: piecing together various components to classify what it is we're seeing.

26
Q

What might we hope the last hidden layer is doing?

A

9 and 8 have the same top component but a different bottom component; 4 comprises three lines. We might hope the last hidden layer comprises these components. Any loop in the top frame of the image might activate the top neuron in the hidden layer, and this would cause increased activation in the output neurons which encode numbers with this feature. Going from the third layer to the last layer then just requires learning which combination of components corresponds to which digit.

27
Q

What might we hope the first hidden layer is doing?

A

Recognising a loop can also run into some problems. One reasonable way to do this would be to first recognise the various little edges that make it up. Similarly, a line is just a long edge, or a pattern of several small edges. This could be what the first hidden layer does: the initial image activates 8–10 specific little edges, which in turn activate the upper loop and a long vertical line, which in turn activate the number 9 in the output.

28
Q

How reasonable are these ideas?

A

Whether this is what our final network actually does is a different story; this is the goal, however, and a useful way to think about it. This can also be expanded into thinking about how networks might break down more complex images; even beyond image recognition, there are a lot of tasks that break down into layers of abstraction. For example, parsing speech involves taking raw audio and picking out distinct sounds, which combine to make syllables, which combine to form words, which combine to make up phrases and more abstract thoughts.

29
Q

How do these neurons encode certain patterns?

A

For these neurons to encode certain patterns (edges etc.) they are assigned weights. It can be helpful to think of these weights in a little grid of their own, with green showing positive values, red showing negative values, and brightness showing their strength.

The activation of each input neuron is then multiplied by its corresponding weight and a weighted sum is calculated. This can be thought of as overlaying the activations on this weight grid in order to determine how much they correspond with each other.

30
Q

How can activation and inactivation be used?

A

In a rectangle of activation, for example, there might be a 3x8 grid of positive weights with negative weights on either side. This would mean that the sum is largest when the middle pixels are bright but the surrounding pixels are darker (e.g. the top line of a 7).

31
Q

What is the role of the function then in this process?

A

When you compute sums like this you can come out with any number, but we want numbers between 0 and 1. We therefore commonly use a function to squash the output into the range 0 to 1.

32
Q

Describe a function commonly used for this

A

A common function which does this is the sigmoid function, also known as the logistic curve. This function converts very negative inputs to values close to 0 and very positive inputs to values close to 1, and steadily increases around an input of 0. The activation of the neuron is therefore a measure of how positive the relevant weighted sum is.
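
A minimal sketch of the sigmoid in Python, with a few values to show the squashing:

```python
import math

def sigmoid(s):
    """Logistic curve: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-s))

print(sigmoid(-10))  # ~0.000045  (very negative -> close to 0)
print(sigmoid(0))    # 0.5        (steady increase around 0)
print(sigmoid(10))   # ~0.99995   (very positive -> close to 1)
```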

33
Q

Why might we want to input bias in our model?

A

You may only want a particular neuron to become meaningfully active, or fire, when the weighted sum exceeds a certain threshold. In other words, you want a bias for inactivity. To do this we simply add a negative number to the weighted sum before plugging it into the sigmoid function (e.g. f(w1a1 + … + wnan − 10)). This additional number is called the bias.
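
A sketch of a biased neuron, assuming the example bias of −10 from this card (names and example values are mine):

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def activation(a, w, bias=-10.0):
    # f(w1*a1 + ... + wn*an - 10): with a bias of -10, the weighted sum
    # must exceed 10 before the neuron becomes meaningfully active.
    return sigmoid(sum(wi * ai for wi, ai in zip(w, a)) + bias)

print(activation([1.0, 1.0], [2.0, 3.0]))  # weighted sum 5  -> ~0.007 (inactive)
print(activation([1.0, 1.0], [8.0, 7.0]))  # weighted sum 15 -> ~0.993 (active)
```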

34
Q

Therefore what do the weights and bias tell you?

A

The weights tell you which pixel pattern this neuron in the second layer is picking up on and the bias is how high the weighted sum needs to be before the neuron starts becoming meaningfully active. This is carried out for every single neuron in the hidden layer.

35
Q

How many parameters would this model have?

A

That is 784 weights per individual neuron in the first hidden layer, with each neuron having its own bias: 784x16 weights and 16 biases from the first layer to the second. The other layers have their own weights and biases associated with them too. In total this network has 784x16 + 16x16 + 16x10 weights and 16 + 16 + 10 biases: 13,002 parameters which can be tweaked to make the network perform in different ways.
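
The count can be verified with a short calculation (a sketch; the layer sizes are the ones from this card):

```python
layers = [784, 16, 16, 10]  # input, two hidden layers, output

# Each weight matrix has n_in * n_out entries; every non-input neuron has a bias.
weights = sum(n_in * n_out for n_in, n_out in zip(layers, layers[1:]))
biases = sum(layers[1:])

print(weights, biases, weights + biases)  # 12960 42 13002
```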

36
Q

The actual function here is obviously quite difficult to write out. How can we make it more compact?

A

A more notationally compact way to present it is to organise all the activations from one layer into a column vector. Then organise all of the weights as a matrix, where each row corresponds to the connections between one layer and a particular neuron in the next layer. Taking the weighted sum of the activations in the first layer according to these weights then corresponds to one of the terms in the matrix–vector product. Instead of adding each bias independently, we organise the biases into a vector and add that vector to the matrix–vector product. As a final step we 'wrap' a sigmoid around the outside, which represents applying the sigmoid to each component of the resulting vector. This lets us communicate the full transition of activations from one layer to the next in an extremely tight and neat little expression: a' = σ(Wa + b).
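
A sketch of one such layer transition in numpy (the sizes and random values are illustrative only):

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def layer(a_prev, W, b):
    # a_next = sigmoid(W @ a_prev + b): each row of W holds the weights
    # from every neuron in the previous layer to one next-layer neuron.
    return sigmoid(W @ a_prev + b)

rng = np.random.default_rng(0)
a0 = rng.random(784)                       # input activations (pixel brightnesses)
W1 = rng.standard_normal((16, 784))        # weights from layer 0 to layer 1
b1 = rng.standard_normal(16)               # one bias per hidden neuron
a1 = layer(a0, W1, b1)                     # 16 hidden activations in (0, 1)
```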

37
Q

The goal is to show the algorithm a bunch of training data with labels of the category it is supposed to belong to and it will adjust those (e.g 13,000) weights and biases to improve its performance on the training data. Hopefully this layered structure will mean that it can generalise what it learns beyond the training data. This can be tested using novel test data and assessing how well it can classify that test data (number correct/ total).

What does the ‘learning’ here boil down to?

A

This 'learning' is essentially calculus and comes down to finding the minimum of a particular function. Remember, we are thinking of each 'neuron' as connected to every neuron in the previous layer, and the weights are like the strengths of those connections.

38
Q

The values for these are first assigned randomly, so at first the network performs pretty horribly.

How do we train the computer from here?

A

To train the computer based on supervised learning, you have to implement a cost function to teach it the difference between its output and the intended output.

39
Q

Describe how this could look for our previously described perceptron for a training example

A

In our previously described perceptron, a network fed an image of a 3 would have the intended output of an activation of 1 for the 3 neuron and 0 for the other neurons. You then add up the squared differences between the intended activation of each output neuron and the actual activation (e.g. … + (0.88 − 1)^2 + (0.72 − 0)^2 + …). This sum is the cost of a particular training example. The sum is small when the network confidently classifies the image correctly (little activation on the incorrect neurons) and large when the network doesn't seem to know what it's doing.
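
A sketch of this per-example cost (the function name is mine; the 10 outputs and one-hot target follow the card):

```python
def example_cost(output, target_digit):
    # Sum of squared differences between actual and intended activations;
    # the intended output is 1.0 for the correct digit's neuron, 0.0 elsewhere.
    target = [1.0 if i == target_digit else 0.0 for i in range(10)]
    return sum((o - t) ** 2 for o, t in zip(output, target))
```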

40
Q

describe how this works when integrating all the training data

A

You can then calculate the average cost over all the training data. This average cost can be a metric for how effective of a classifier it is.

41
Q

This is quite complex: the input of our simple perceptron is 784 numbers (pixels), the output is 10 numbers, and the parameters consist of 13,002 weights and biases. The cost function is a layer of complexity on top of that: it takes these 13,002 weights and biases as its input and outputs a single number (the cost) describing how bad those weights and biases are. Its value depends on the network's behaviour over all of the tens of thousands of training examples typical of big-data datasets.

How can we imagine this in a more simple format?

A

To make this simpler, instead of imagining a function with 13,000 inputs, imagine a function with one number as an input and one as an output. We can sometimes find a value for an input that minimises the value of a function through calculus explicitly e.g when a function is a curve with a single minimum.

42
Q

This is not always feasible with really complicated functions however, especially not our super complicated neural network function. Describe a more flexible tactic

A

A more flexible tactic is to start at any input and figure out which direction you should ‘step’ to make that output lower. Specifically, if you can find the slope of the function ‘where you are’, then shift to the left if that slope is positive and to the right if that slope is negative. If you do this repeatedly, at each point checking the new slope and taking the appropriate step, you’re going to approach some local minimum of the function.
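
A minimal sketch of this slope-following loop, assuming a toy function whose derivative we know:

```python
def gradient_descent_1d(df, x, lr=0.1, steps=100):
    # Repeatedly check the slope where you are and step against it:
    # left when the slope is positive, right when it is negative.
    for _ in range(steps):
        x -= lr * df(x)
    return x

# Example: f(x) = (x - 3)^2 has slope f'(x) = 2(x - 3); its minimum is at x = 3.
print(gradient_descent_1d(lambda x: 2 * (x - 3), x=0.0))  # ~3.0
```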

43
Q

Describe finding the minimum using a common metaphor

A

A common metaphor for this is a ball rolling down a hill. There are many possible valleys you might land in depending on what random input you start in. There’s no guarantee the local minimum you land in is going to be the smallest possible solution for the cost function.

44
Q

How easy is it to find local and global minima?

A

Finding these local minima is pretty doable; finding the global minimum of something as complex as the cost function can be crazy difficult.

45
Q

How can you prevent ‘overshooting’ a minimum?

A

If you make your step sizes proportional to the slope, then when your slope is flattening out towards the minimum, your steps get smaller and smaller. This prevents you from overshooting.

46
Q

How can we use a more complex example to conceptualise gradient descent a bit better?

A

To increase the complexity a bit, imagine a function with two inputs and one output. You might think of the input space as the x,y plane and the cost function as being graphed as a z surface above it. Instead of asking about the slope of the function, you have to ask which direction you should step in this input space so as to decrease the output of the function more quickly. In other words, what is the downhill direction?

47
Q

What form of maths can help us with this problem and how?

A

In multivariate calculus, the gradient of a function gives you the direction of steepest ascent: the direction you should go to increase the function most quickly. Taking the negative of that gradient gives you the step that decreases the function most quickly. Even more than that, the length of this gradient vector is an indication of just how steep that steepest slope is. The algorithm for minimising the function is to compute this gradient, take a small step downhill (in the direction of the negative gradient) and just repeat that continuously.

48
Q

Relate this example to the cost function

A

This is the same idea for a function which has 13,000 inputs instead of just two. Imagine organising all 13k weights and biases of the network into a giant column vector. The negative gradient of the cost function is just a vector: some direction inside this insanely huge input space that tells you which nudges to all of those numbers are going to cause the most rapid decrease in the cost function. Changing the weights and biases to decrease it means making the output of the network on each piece of training data look less like a random array of 10 values and more like the actual decision we want it to make.

49
Q

Minimising the cost function improves performance on which training sample(s)?

A

Remember, this cost function is an average over all the training data, so minimising it improves the performance on all of those samples.

50
Q

What is the algorithm for computing these gradients?

A

The algorithm for computing these gradients efficiently is essentially the ‘heart’ of how neural networks learn and is called back-propagation.

51
Q

Why do neurons have continuous activations rather than binary activations as in biological neurons?

A

A consequence of the concept of minimising a cost function is that it is important for this cost function to have a nice smooth output (as in the z surface), so that we can find the local minimum by taking small steps downhill. This is why artificial neurons have continuous activations rather than being either active or inactive in a binary way as in biological neurons.

52
Q

What is this process of repeatedly nudging an input of a function called?

A

This process of repeatedly nudging an input of a function by some multiple of a negative gradient is called gradient descent. It’s a way to converge towards some local minimum of a cost function.

53
Q

Although the image of the function of two inputs is useful to conceptualise this, this is simply used because nudges in a 13,000 dimensional space are more difficult to visualise. Give a non-spatial way to think about this

A

Each component of the negative gradient tells us two things: the sign tells us whether the corresponding coordinate of the input vector should move up or down, and the relative magnitudes of the components tell you which changes matter more. The adjustment of one weight might matter a lot more to the cost function (and therefore to performance on the training data) than some other weight. So a way you can think of the gradient vector of this huge cost function is that it encodes the relative importance of each weight and bias: which parameters' changes induce the most valuable change.

54
Q

Think for example if you calculate the gradient of a function as [3,1]. What would this mean in terms of the x,y plane?

A

This essentially translates to that direction on the x,y plane (a line with a rise of 1 and a run of 3, i.e. slope 1/3) being the direction of the steepest slope of the surface above it. It is essentially saying that changes to one variable (e.g. x in the function (3/2)x^2 + (1/2)y^2) have three times the impact of changes to the other variable.
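
A tiny numeric check of this example (the function is from the card; evaluating at the point (1, 1) is my assumption, chosen so the gradient comes out as [3, 1]):

```python
# f(x, y) = (3/2)x^2 + (1/2)y^2, so the gradient is (3x, y).
# At (1, 1) this gives (3, 1): changes in x have three times
# the impact on f that changes in y do.
def grad(x, y):
    return (3 * x, y)

print(grad(1.0, 1.0))  # (3.0, 1.0)
```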

55
Q

You can visualise the pixel patterns the neurons in the hidden layer picked up on. Describe how the network in the example assigned weights. Was it how you expected?

A

The network did not pick out clear, interpretable shapes and lines as we had expected, but instead showed almost random-looking weight images with only loose patterns in the middle. The narrator stated that the network, through its weights and biases, found a nice local minimum that, despite successfully classifying most images, doesn't really pick up on the patterns we might have hoped for.

56
Q

How could you interpret the networks methods of detecting digits compared to our own?

A

You can input a random pixel image and it will confidently give an answer; it knows how to recognise a 5 but not how to draw one. A lot of this is because it is such a tightly constrained training setup: the entire test set consists of clearly defined, unmoving digits centred in a tiny grid, and the cost function gives the network no incentive to be anything but certain in its decisions.

57
Q

Backpropagation is a method of computing the crazy complicated gradient previously described. How can you think of the gradient vector, other than as a direction in 13k-dimensional space?

A

You can think of the magnitude of each component as telling you how sensitive the cost function is to each weight and bias. Say, for example, you go through the backpropagation process and the component associated with one connection's weight comes out as 3.20 while the component for another connection's weight is 0.10. This can be interpreted as: changing the first weight changes the value of the cost function 32 times more than an equal-sized change to the second weight; the cost is 32 times more sensitive to changes in the first weight than in the second.

58
Q

Take, for example, an image of a 2 fed into the network at a point when the network is not well trained yet, so the outputs look pretty random. How can we correct the activations?

A

The activations cannot be adjusted directly, only the weights and biases. It is helpful, though, to track the adjustments we wish to make to that output layer. Since we want the network to classify the image as a 2, we want the 2 neuron's activation to be nudged up and the others to be nudged down. The sizes of the nudges should be proportional to how far each current value is from its target value: the further the 2 neuron's activation is from 1, the more we want to nudge it up, and the further the others are from 0, the more we want to nudge them down.

59
Q

Focusing on just this neuron whose activation we wish to increase, how is the activation defined?

A

This activation is defined as a certain weighted sum of all of the activations of the previous layer plus a bias which is all plugged into something like the sigmoid function or a ReLU.

60
Q

Therefore in what different ways can you increase the activation of the neuron

A

There are therefore three different avenues that could 'team up' to increase that activation: you can increase the bias (b), you can increase the weights (wi), and you can change the activations (ai) of the previous layer.

61
Q

Do all of the weights have the same level of influence? What does this mean?

A

The weights have different levels of influence: those coming from the brightest neurons of the previous layer have the strongest effect, as they are multiplied by larger activation values. So increasing a given weight is more or less effective depending on the activation of the neuron it comes from.

62
Q

What aspect of this process of gradient descent is similar to Hebbian learning?

A

Remember, when we talk about gradient descent we don't just care about whether each component should be nudged up or down; we care about which changes matter most. This is somewhat aligned with Hebbian learning in biological neurons: those which fire together wire together. The biggest increases in weight and strengthening of connections happen between the neurons which are most active and the ones which we wish to become more active. We can therefore increase wi in proportion to ai.

63
Q

Describe the third way we can help increase this neurons activation

A

The third way we can help increase this neuron's activation is by changing all the activations in the previous layer: if everything connected to the 2 neuron with a positive weight got brighter, and everything with a negative weight got dimmer, the 2 neuron would become more active. As with the weight changes, you get the most effect by seeking changes that are proportional to the corresponding weights. Of course, we cannot directly influence those activations; we only have control over the weights and biases.

64
Q

What restricts our ability to freely determine weights to detect a two?

A

Remember, we not only want the 2 neuron to become more active, but all the other neurons in the last layer to become less active, and each of those output neurons has its own goals for what should happen to the second-to-last layer. The desires of the 2 neuron are therefore added to the desires of all the other output neurons, in proportion to the corresponding weights and to how much each of those neurons needs to change.

65
Q

Therefore summarise the process of backpropagation

A

By adding all of these desired effects you basically get a series of nudges which you want to happen to this second last layer. Once you have those, you can carry out the same process to the weights and biases which determine those values. This process is repeated backwards through the network. Remember this is all just how a single training example nudges each of those weights and biases. You go through each step of this backprop routine for each example, recording how each of them would like to change the weights and the biases. You then average together those desired changes. This collection of the averaged nudges to each weight and bias is, loosely speaking, the negative gradient of the cost function or something proportional to it.

66
Q

How fast is this process of backpropagation?

A

In practice, it takes computers an extremely long time to add up the influence of every training example of every gradient descent step.

67
Q

What is done in practice to shorten the time spent computing backprop?

A

What's done in practice instead is that you randomise your training data and assign it to 'mini-batches', say each having 100 examples. You then compute a step according to each mini-batch. This won't be the actual gradient of the cost function, which depends on all of the training data, so it's not the most efficient step downhill; each mini-batch does give a good approximation, however, and it results in a significant computational speed-up. If you were to visualise this on the relevant cost surface, it would look more like a winding route of quick steps down to the eventual solution rather than a carefully calculated route going in the exact downhill direction.
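
A sketch of mini-batch updates; here grad_fn is a hypothetical stand-in for whatever computes the gradient estimate from a batch:

```python
import random

def sgd(data, params, grad_fn, lr=0.01, batch_size=100, epochs=10):
    # Shuffle the training data, split it into mini-batches, and take one
    # (approximate) downhill step per batch instead of per full dataset.
    for _ in range(epochs):
        random.shuffle(data)
        for i in range(0, len(data), batch_size):
            batch = data[i:i + batch_size]
            g = grad_fn(params, batch)  # gradient estimate from this batch only
            params = [p - lr * gi for p, gi in zip(params, g)]
    return params
```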

68
Q

What is this form of gradient descent called and what does it require?

A

This technique is referred to as stochastic gradient descent. For this entire process to work you need big databases.

69
Q

How do you compute the error?

A

error = y − ŷ (target value minus predicted value):

y   ŷ   error
0   0    0
0   1   -1
1   0    1
1   1    0

70
Q

What function can the perceptron without hidden layers not compute? What does it need for this?

A

The XOR function: it is not linearly separable (no single line separates the classes), so it needs a hidden layer, e.g. units encoding two OR-like functions (OR and NAND) whose outputs feed an AND at the output neuron (multiple decision lines). The XOR truth table:

x1  x2  ŷ
0   0   0
0   1   1
1   0   1
1   1   0
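
A sketch of one standard construction (OR and NAND hidden units feeding an AND output); the specific weights are my choice, not from the slides:

```python
def step(s):
    return 1 if s > 0 else 0

def xor(x1, x2):
    h1 = step(x1 + x2 - 0.5)    # hidden unit 1: OR  (first decision line)
    h2 = step(1.5 - x1 - x2)    # hidden unit 2: NAND (second decision line)
    return step(h1 + h2 - 1.5)  # output unit: AND of the two hidden units

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, xor(a, b))  # -> 0, 1, 1, 0
```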

71
Q

How can you prevent overshooting? (slightly different in the slides)

A

If you change too little it will take forever. With gradient descent, if you make the steps proportional to the slope, you take big steps when far away from the (local) minimum (large error), and the steps get smaller and smaller as you approach it (smaller error) –> prevents overshooting.

72
Q

What name does he give to stochastic learning?

A

Perturbation learning; it seems to convey the same idea anyway, though I'm not sure.

73
Q

What kind of feedback do the following types of learning have?
Hebbian
Perturbation
Backpropagation
Backprop-like learning with feedback network

A

Hebbian: no feedback (simple Hebbian learning cannot make meaningful changes to a given synapse, because it does not consider that synapse's downstream effect on the network output)

Perturbation: scalar feedback

Backpropagation: Vector feedback

Backprop-like learning with feedback network: Vector feedback

74
Q

What does scalar feedback tell you vs vector feedback?

A

Scalar feedback: "right or wrong". Perturbation methods measure the change in error caused by random perturbations to neural activities (node perturbation) or synapse strengths (weight perturbation) and use this measured change as a global scalar reinforcement signal that controls whether a proposed perturbation is accepted or rejected.

Vector feedback: "tells you what's wrong". The backprop algorithm instead computes the synapse update required to most quickly reduce the error. In backprop, vector error signals are delivered backwards along the original path of influence for a neuron.
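
A minimal sketch of weight perturbation with the accept/reject scalar signal described above (the function names and Gaussian noise scale are assumptions):

```python
import random

def weight_perturbation_step(weights, loss_fn, sigma=0.01):
    # Scalar feedback: perturb all weights at random, measure the change
    # in error, and use that single number to accept or reject the change.
    perturbed = [w + random.gauss(0.0, sigma) for w in weights]
    if loss_fn(perturbed) < loss_fn(weights):  # global "right or wrong" signal
        return perturbed                        # keep helpful perturbations
    return weights                              # reject harmful ones
```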

75
Q

What is the relationship of these methods and precision of synaptic changes?

A

The more precise the synaptic changes, the faster the error is reduced; imprecise methods take much longer.

In order of increasing precision: weight perturbation (changing the weights one by one and seeing how the output changes), node perturbation, backpropagation approximations, backpropagation.

76
Q

Give the activation of an output neuron in his notation

A

k = postsynaptic, j = presynaptic
h = output, a = activation, w = weights

h_k = f(a_k)
a_k = Σ_j (h_j w_jk)

77
Q

How do we compute and implement the error in a neuron according to this notation?

A

The error is calculated at the end of the network:
E = ½ Σ_l (t_l − ŷ_l)², with l being an output neuron

e_k = Σ_l (δ_l W_lk)
with δ being the error signal propagated backwards

δ_k = e_k f′(a_k)

78
Q

How would you update the weights?

A

ΔW_ij = −η (∂E/∂W_ij) = η h_i δ_j
η = the learning rate
where δ_j = e_j f′(a_j) = (Σ_k δ_k W_kj) f′(a_j)

So δ_j = the sum, over the neurons k that neuron j projects to, of the backpropagated error δ_k times the weight from j to k, all multiplied by the derivative of the activation function at a_j.

ΔW_ij = learning rate × output of neuron i × δ_j
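
A sketch of these updates for one hidden layer in numpy, using the card's notation (layers i -> j -> k, with k the output); the sign convention assumes the error e is defined as target minus output:

```python
import numpy as np

def f(a):        # activation function, here the sigmoid
    return 1.0 / (1.0 + np.exp(-a))

def f_prime(a):  # derivative of f at the pre-activation a
    return f(a) * (1.0 - f(a))

def backprop_step(h_i, W_ij, W_jk, t, eta=0.1):
    # Forward pass: a_j = sum_i h_i W_ij, h_j = f(a_j); likewise for layer k.
    a_j = W_ij @ h_i
    h_j = f(a_j)
    a_k = W_jk @ h_j
    h_k = f(a_k)
    # Output error signal: delta_k = e_k f'(a_k), with e_k = t_k - h_k.
    delta_k = (t - h_k) * f_prime(a_k)
    # Backpropagate: e_j = sum_k delta_k W_jk, then delta_j = e_j f'(a_j).
    delta_j = (W_jk.T @ delta_k) * f_prime(a_j)
    # Updates: Delta W = -eta dE/dW = eta * (presynaptic output) * delta.
    W_jk += eta * np.outer(delta_k, h_j)
    W_ij += eta * np.outer(delta_j, h_i)
    return W_ij, W_jk
```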

79
Q

How many layers can this propagation be applied to?

A

Generalizes to any number of layers; the only limits are your computer's RAM and your time.

80
Q

What are the two goal functions of the brain?

A

Measure how much the learned features have to do with one output or with both outputs.

Separation of dorsal vs ventral streams –> neurons are expected to become specialised early on
=> if the goal functions are thought to be similar –> both neurons help each other (no separate streams)

81
Q

What are meant by related and unrelated goal functions?

A

A way of investigating whether we get streams when you have objective functions that are related or unrelated:

The network classifies a word/text and an image that are shown simultaneously; the image and text can be related or unrelated (Robin image: 'bird'; Robin image: 'lip').

Unrelated: expect early specialisation; neurons in the network form streams (they don't help each other)
:: some neurons respond to text recognition and some respond to image recognition
:: they become two streams

Related: expect neurons to specialise only at a later stage –> unspecialised, one stream; the neurons help each other

82
Q

What is problematic about backpropagation? (4)

A
  1. Phases: a forward phase where the output is computed and a backward phase where errors are computed
    :: can't really relate that to biology, maybe oscillations? But no idea
  2. Weight symmetry: the forward weights need to be mirrored to go back
  3. One neuron but two types of signals
    :: one related to learning, the other to processing ('probably')
  4. Supervised learning: relies heavily on supervised learning, which is very rare in biology
83
Q

What are some solutions to these problems with backpropagation? (3)

A

A separate feedback network (in addition to the forward pass) that calculates the delta/error and communicates it to the corresponding neuron in the forward pass
Or: keep the feedback weights fixed

Target propagation: learn the function that needs to be back-propagated

Reinforcement-learning-based feedback

84
Q

Describe how you could implement a separate feedback network

A

The output projects to another network which activates backwards. There are connections between these error neurons and neurons in the feedforward network.

85
Q

How does a separate feedback network solve previous issues?

A

More biologically plausible because we have separate neurons that encode forward activity and error. There is no learning in this feedback network: the feedback weights are fixed, so the error is computed 'wrongly' (feedback alignment). The feedforward pass compensates for this, and therefore it works.

86
Q

Describe how feedback alignment works and what it solved

A

Remarkably, networks with fixed random feedback weights learn to approximately align their feedforward synaptic weights to their feedback weights. In a display of neural pragmatism, the fake error derivatives computed using the random feedback weights cause updates to the feedforward weights that make the true error derivatives closer to the fake derivatives. This surprising phenomenon, called “feedback alignment”, suggests that feedback connections do not need to be symmetric to their feedforward counterparts in order to deliver information that can be used for fast and effective weight updates

87
Q

What new problem is presented with this network?

A

Need a learning rule for this

88
Q

What is this new learning rule?

A

Target propagation

:: If desired output is known you can use a function to tell you what the output of the previous layer should be

:: instead of calculating the error function directly –> use an inverse function

:: you can learn the correct targets for the previous layer via a function g

From article:
We propagate activity forward through successive layers of a network to produce a predicted output. Then we propagate an output target backwards through inverse functions (i.e. via feedback connections) that are learned through layer-wise autoencoding of the forward layers. This backward propagated target induces hidden activity targets that should have been realised by the network. In other words, if the network had achieved these hidden activities during feedforward propagation, then it would have produced the correct output. The direction in the activity space between the feedforward activity and the feedback activity indicates the direction in which the neurons’ activities should move in order to improve performance on the data. Learning proceeds by updating forward weights to minimize these local layer-wise activity differences, and it can be shown that under certain conditions the updates computed using these layer-wise activity differences approximate those that would have been prescribed by backprop.

89
Q

What problems does target propagation solve?

A

Each layer gets its own local target to achieve.

The weight symmetry problem is solved, but the phases are still an issue.

The algorithm provides a compelling example of how locally generated activity differences can be used to drive learning updates for multi-layer networks

90
Q

How do these DTP networks perform?

A

Difference target propagation effectively trains multi-layer neural networks on classification tasks such as MNIST and CIFAR, and it learns in a fraction of the time required by algorithms that use weight or node perturbation to update weights.

Recent work shows that straightforward implementations of DTP do not perform as well as backprop on the ImageNet task with large convolutional networks.


91
Q

Name another example of biologically plausible deep learning

A

Reinforcement learning with feedback

92
Q

How does RL w/ feedback work?

A

It implements a simultaneous feedback network with interneurons.

Tag which neurons helped you make an action, and then you get a reward-prediction error.

Tag all the neurons involved in the feedback path –> you know these neurons are the ones involved in that particular action.

Change the weights according to this.

93
Q

What are benefits to RL w/ feedback?

A

You can get exact error backpropagation, one sample at a time.

It is almost equal to EBP (error backpropagation), also for big problems, and it is only a bit slower.