Perceptron & Backpropagation Flashcards

1
Q

How does the lecturer compare a brain to a computer?

A

Brains compute actions from perception
=> hidden states (info stuck in the brain that influences what you're going to do): goals, plans
=> learning: the brain is flexible, there is learning going on
▪ outcome dependent (reinforcement/supervised; 'if I go to bed at the right time I'll get a good night's sleep')
▪ outcome independent (unsupervised/self-supervised)

2
Q

Describe how spiking neurons are modelled

A

Early models: spikes (McCulloch–Pitts)

An input affects the firing rate, which can be an analogue value between 0 and 200 Hz.

h = hidden layer, something that you can't observe. A hidden layer is located between the input and output of the algorithm; in it, the function applies weights to the inputs and directs them through an activation function as the output.

n neurons in the hidden layer, n in the next layer –> a total of n^2 connections (weights) between the two layers

3
Q

What functions can the transfer function take?

A

Linear, hyperbolic tangent, etc.

4
Q

What is the purpose of the activation and transfer function?

A

Activation functions work on some threshold value: once that value is crossed, the signal is triggered. The transfer function is used to translate input signals to output signals.

5
Q

Give a very simplified generalised model of an artificial neural network

A

O = f(x1, x2)
(Output is a function of input variables)

6
Q

What is meant by a deep neural network?

A

A network with many hidden layers

7
Q

Name five identification processes in a deep neural network

A
  1. Identify pixel values
  2. Identify edges
  3. Identify combinations of edges
  4. Identify features
  5. Identify combinations of features
8
Q

Give an example of objective functions in real life

A

The ultimate objective function = survive and procreate

but it's not just that

9
Q

How do these identification processes compare to physiology?

A

some neurons become sensitive to edges

some neurons are sensitive to parts of the face

10
Q

Describe the relationship between parameters and fit

A

Too few parameters is insufficient and does not capture the data well enough; too many parameters fits the data too well (overfitting) and does not describe the underlying, generalisable relationship. As an analogy, a line on a graph could perfectly outline the shape in an image of an elephant, but this would not generalise to other images of elephants.

11
Q

What is a loss function?

A

A loss function is a function that compares the target and predicted output values; measures how well the neural network models the training data. When training, we aim to minimise this loss between the predicted and target outputs.

12
Q

What does the architecture of a network concern?

A

How a certain architecture with certain learning rules achieves certain objective functions.

13
Q

Where does neurological preprocessing take place? How are the parameters updated?

A

The retina does fantastic preprocessing, and its parameters are fixed (they are not updated by learning).

14
Q

What is visual processing mostly based on?

A

Visual processing is mostly based on contrast

15
Q

What are the learning rules of an algorithm; what methods do they employ?

A

They use a gradient and a learning update rule. The loss function is the method of evaluating how well your machine-learning algorithm models your data set.

16
Q

As a summary what does neural learning combine?

A

Architecture, loss function(s), learning rules

17
Q

Give two examples of these neural learning rules

A

Cortical columns (architecture), plasticity rules

18
Q

Learning rules get neural networks to do useful stuff. Give some examples of useful stuff

A

=> Identify digits in images (MNIST)
=> Translate Chinese to Dutch
=> Recommend movies based on past ratings (Netflix prize)
=> Model cognitive processes: attention, perception, etc.

19
Q

What is the function of learning algorithms?

A

Finding a suitable set of parameters (weights)

20
Q

Describe the maths behind a simple neuron in a neural network

A

s = b + Σ wᵢxᵢ
(s = the activation)
wᵢ are the weights and b is the bias in the network.
f is termed the activation function.
f(s) is the output of the neural network.
f(s) = 1 if s > 0; f(s) = 0 if s ≤ 0
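
A minimal Python sketch of this unit (the function names and example weights are mine, not from the lecture):

```python
# Minimal sketch of the neuron above: s = b + sum(w_i * x_i),
# followed by a hard threshold f(s) = 1 if s > 0 else 0.
def neuron(x, w, b):
    s = b + sum(w_i * x_i for w_i, x_i in zip(w, x))  # weighted sum plus bias
    return 1 if s > 0 else 0                          # step activation f(s)

# Example: two inputs with weights 1.0 and bias -1.5 implements AND.
print(neuron([1, 1], [1.0, 1.0], -1.5))  # -> 1
print(neuron([1, 0], [1.0, 1.0], -1.5))  # -> 0
```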

21
Q

Describe how this neuron works

A

This model can work on non-boolean values where each input connection gets associated with a weight. Here the function calculates the weighted sum and based on the threshold value provided, it gives a binary output.

22
Q

Describe the initial layer of a perceptron with a 28x28 pixel image

A

With a given input image of 28x28 pixels, a perceptron can attempt to output which number is shown in the image. This translates to 784 'neurons', each with its own activation, which roughly corresponds to how 'bright' the corresponding pixel is, ranging from black at 0 to white at 1.

23
Q

What would the outcome layer consist of?

A

The last column, the output layer, comprises the ten possible digits (0–9) the network could categorise the input into. Each of these 'neurons' also has an activation relating to how likely that digit is given the input.

24
Q

What do the layers in between consist of? Describe how they work

A

The layers between these two are called hidden layers. This network can have, e.g., two layers of 16 neurons each, but that is a somewhat arbitrary choice here. The activations of one layer (e.g. input layer, hidden layer) determine the activations in the next layer (e.g. hidden layer, output layer). This network has already been trained to recognise digits. This means that if you feed it an image, the different activations in each pixel cause a very specific pattern of activation in the next layer, which gives some pattern to the one after it, which gives a specific pattern to the output layer. The output neuron with the most activation is then selected as what the image represents.

25
Q

What do we hope the hidden layers are doing?

A

What we hope the hidden layers are doing is something akin to how our own vision works: piecing together various components to classify what it is we're seeing.

26
Q

What might we hope the last hidden layer is doing?

A

9 and 8 have the same top component but a different bottom component; 4 comprises three lines. We might hope the last hidden layer comprises these components. Any loop in the top frame of the image might activate the top neuron in the hidden layer, and this would cause increased activation in the output neurons which encode numbers with this feature. Going from the third layer to the last layer then just requires learning which combination of components corresponds to which digit.

27
Q

What might we hope the first hidden layer is doing?

A

Recognising a loop can also run into some problems. One reasonable way to do this would be to first recognise the various little edges that make it up. Similarly, a line is just a long edge, or a pattern of several small edges. This could be what the first hidden layer does: the initial image activates 8–10 specific little edges, which in turn activate the upper loop and a long vertical line, which in turn activate the number 9 in the output.

28
Q

How reasonable are these ideas?

A

Whether this is what our final network actually does is a different story; this is the goal, however, and a useful way to think about it. This can also be expanded into thinking about how networks might break down more complex images; even beyond image recognition, there are a lot of tasks that break down into layers of abstraction. For example, parsing speech involves taking raw audio and picking out distinct sounds, which combine to make syllables, which combine to form words, which combine to make up phrases and more abstract thoughts.

29
Q

How do these neurons encode certain patterns?

A

For these neurons to encode certain patterns (edges etc.) they are assigned weights. It can be helpful to think of these weights in a little grid of their own, with green showing positive values, red showing negative values, and brightness showing their strength.

The activation of each input neuron is then multiplied by its corresponding weight and a weighted sum is calculated. This can be thought of as overlaying the activations on this weight grid in order to determine how much they correspond with each other.

30
Q

How can activation and inactivation be used?

A

In a rectangle of activation, for example, there might be a 3x8 grid of positive weights with negative weights on either side. This would mean that the sum is largest when the middle pixels are bright but the surrounding pixels are darker (e.g. the top line of a 7).

31
Q

What is the role of the function then in this process?

A

When you compute sums like this you can come out with any number, but we want numbers between 0 and 1. We therefore commonly use a function to squash the output into the range 0 to 1.

32
Q

Describe a function commonly used for this

A

A common function which does this is the sigmoid function, also known as the logistic curve. This function converts very negative inputs to values close to 0 and very positive inputs to values close to 1, and steadily increases around an input of 0. The activation of the neuron is therefore a measure of how positive the relevant weighted sum is.
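
A minimal sketch of the sigmoid in Python, with a few values to show the squashing:

```python
import math

def sigmoid(s):
    """Logistic curve: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-s))

print(sigmoid(-10))  # ~0.000045  (very negative -> close to 0)
print(sigmoid(0))    # 0.5        (steady increase around 0)
print(sigmoid(10))   # ~0.99995   (very positive -> close to 1)
```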

33
Q

Why might we want to input bias in our model?

A

You may only want a particular neuron to become meaningfully active, or fire, when the weighted sum exceeds a certain threshold. In other words, you want a bias for inactivity. To do this we simply add a negative number to the weighted sum before plugging it into the sigmoid function (e.g. f(w1a1 + … + wnan − 10)). This additional number is called the bias.
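
A sketch of a biased neuron, assuming the example bias of −10 from this card (names and example values are mine):

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def activation(a, w, bias=-10.0):
    # f(w1*a1 + ... + wn*an - 10): with a bias of -10, the weighted sum
    # must exceed 10 before the neuron becomes meaningfully active.
    return sigmoid(sum(wi * ai for wi, ai in zip(w, a)) + bias)

print(activation([1.0, 1.0], [2.0, 3.0]))  # weighted sum 5  -> ~0.007 (inactive)
print(activation([1.0, 1.0], [8.0, 7.0]))  # weighted sum 15 -> ~0.993 (active)
```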

34
Q

Therefore what do the weights and bias tell you?

A

The weights tell you which pixel pattern this neuron in the second layer is picking up on and the bias is how high the weighted sum needs to be before the neuron starts becoming meaningfully active. This is carried out for every single neuron in the hidden layer.

35
Q

How many parameters would this model have?

A

That is 784 weights per individual neuron in the first hidden layer, with each neuron having its own bias: 784x16 weights and 16 biases from the first layer to the second. The other layers have their own weights and biases associated with them too. In total this network has 784x16 + 16x16 + 16x10 weights and 16 + 16 + 10 biases: 13,002 parameters which can be tweaked to make the network perform in different ways.
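
The count can be verified with a short calculation (a sketch; the layer sizes are the ones from this card):

```python
layers = [784, 16, 16, 10]  # input, two hidden layers, output

# Each weight matrix has n_in * n_out entries; every non-input neuron has a bias.
weights = sum(n_in * n_out for n_in, n_out in zip(layers, layers[1:]))
biases = sum(layers[1:])

print(weights, biases, weights + biases)  # 12960 42 13002
```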

36
Q

The actual function here is obviously quite difficult to write out. How can we make it more compact?

A

A more notationally compact way to present it is to organise all the activations from one layer into a column vector. Then organise all of the weights as a matrix, where each row corresponds to the connections between one layer and a particular neuron in the next layer. Taking the weighted sum of the activations in the first layer according to these weights then corresponds to one of the terms in the matrix–vector product. Instead of adding each bias independently, we organise the biases into a vector and add that vector to the matrix–vector product. As a final step we 'wrap' a sigmoid around the outside, which represents applying the sigmoid to each component of the resulting vector. This lets us communicate the full transition of activations from one layer to the next in an extremely tight and neat little expression: a' = σ(Wa + b).
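
A sketch of one such layer transition in numpy (the sizes and random values are illustrative only):

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def layer(a_prev, W, b):
    # a_next = sigmoid(W @ a_prev + b): each row of W holds the weights
    # from every neuron in the previous layer to one next-layer neuron.
    return sigmoid(W @ a_prev + b)

rng = np.random.default_rng(0)
a0 = rng.random(784)                       # input activations (pixel brightnesses)
W1 = rng.standard_normal((16, 784))        # weights from layer 0 to layer 1
b1 = rng.standard_normal(16)               # one bias per hidden neuron
a1 = layer(a0, W1, b1)                     # 16 hidden activations in (0, 1)
```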

37
Q

The goal is to show the algorithm a bunch of training data with labels of the category it is supposed to belong to and it will adjust those (e.g 13,000) weights and biases to improve its performance on the training data. Hopefully this layered structure will mean that it can generalise what it learns beyond the training data. This can be tested using novel test data and assessing how well it can classify that test data (number correct/ total).

What does the ‘learning’ here boil down to?

A

This 'learning' is essentially calculus and comes down to finding the minimum of a particular function. Remember, we are thinking of each 'neuron' as connected to every neuron in the previous layer, and the weights are like the strengths of those connections.

38
Q

The values for these are first assigned randomly, so at first the network performs pretty horribly.

How do we train the computer from here?

A

To train the computer based on supervised learning, you have to implement a cost function to teach it the difference between its output and the intended output.

39
Q

Describe how this could look for our previously described perceptron for a training example

A

In our previously described perceptron, a network fed an image of a 3 would have the intended output of an activation of 1 for the 3 neuron and 0 for the other neurons. You then add up the squared differences between the intended activation of each output neuron and the actual activation (e.g. … + (0.88 − 1)^2 + (0.72 − 0)^2 + …). This sum is the cost of a particular training example. The sum is small when the network confidently classifies the image correctly (little activation on the incorrect neurons) and large when the network doesn't seem to know what it's doing.
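
A sketch of this per-example cost (the function name is mine; the 10 outputs and one-hot target follow the card):

```python
def example_cost(output, target_digit):
    # Sum of squared differences between actual and intended activations;
    # the intended output is 1.0 for the correct digit's neuron, 0.0 elsewhere.
    target = [1.0 if i == target_digit else 0.0 for i in range(10)]
    return sum((o - t) ** 2 for o, t in zip(output, target))
```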

40
Q

describe how this works when integrating all the training data

A

You can then calculate the average cost over all the training data. This average cost can be a metric for how effective of a classifier it is.

41
Q

This is quite complex: the input of our simple perceptron is 784 numbers (pixels), the output is 10 numbers, and the parameters consist of 13,002 weights and biases. The cost function is a layer of complexity on top of that: it takes these 13,002 weights and biases as its input and outputs a single number (the cost) describing how bad those weights and biases are. Its value depends on the network's behaviour over all of the tens of thousands of training examples typical of big-data datasets.

How can we imagine this in a more simple format?

A

To make this simpler, instead of imagining a function with 13,000 inputs, imagine a function with one number as an input and one as an output. We can sometimes find a value for an input that minimises the value of a function through calculus explicitly e.g when a function is a curve with a single minimum.

42
Q

This is not always feasible with really complicated functions however, especially not our super complicated neural network function. Describe a more flexible tactic

A

A more flexible tactic is to start at any input and figure out which direction you should ‘step’ to make that output lower. Specifically, if you can find the slope of the function ‘where you are’, then shift to the left if that slope is positive and to the right if that slope is negative. If you do this repeatedly, at each point checking the new slope and taking the appropriate step, you’re going to approach some local minimum of the function.
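
A minimal sketch of this slope-following loop, assuming a toy function whose derivative we know:

```python
def gradient_descent_1d(df, x, lr=0.1, steps=100):
    # Repeatedly check the slope where you are and step against it:
    # left when the slope is positive, right when it is negative.
    for _ in range(steps):
        x -= lr * df(x)
    return x

# Example: f(x) = (x - 3)^2 has slope f'(x) = 2(x - 3); its minimum is at x = 3.
print(gradient_descent_1d(lambda x: 2 * (x - 3), x=0.0))  # ~3.0
```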

43
Q

Describe finding the minimum using a common metaphor

A

A common metaphor for this is a ball rolling down a hill. There are many possible valleys you might land in depending on what random input you start in. There’s no guarantee the local minimum you land in is going to be the smallest possible solution for the cost function.

44
Q

How easy is it to find local and global minima?

A

Finding these local minima is pretty doable; finding the global minimum of something as complex as the cost function can be crazy difficult.

45
Q

How can you prevent ‘overshooting’ a minimum?

A

If you make your step sizes proportional to the slope, then when your slope is flattening out towards the minimum, your steps get smaller and smaller. This prevents you from overshooting.

46
Q

How can we use a more complex example to conceptualise gradient descent a bit better?

A

To increase the complexity a bit, imagine a function with two inputs and one output. You might think of the input space as the x,y plane and the cost function as being graphed as a z surface above it. Instead of asking about the slope of the function, you have to ask which direction you should step in this input space so as to decrease the output of the function more quickly. In other words, what is the downhill direction?

47
Q

What form of maths can help us with this problem and how?

A

In multivariate calculus, the gradient of a function gives you the direction of steepest ascent: the direction you should go to increase the function most quickly. Taking the negative of that gradient gives you the step that decreases the function most quickly. Even more than that, the length of this gradient vector is an indication of just how steep that steepest slope is. The algorithm for minimising the function is to compute this gradient, take a small step downhill (in the direction of the negative gradient) and just repeat that continuously.

48
Q

Relate this example to the cost function

A

This is the same idea for a function which has 13,000 inputs instead of just two. Imagine organising all 13k weights and biases of the network into a giant column vector. The negative gradient of the cost function is just a vector: some direction inside this insanely huge input space that tells you which nudges to all of those numbers are going to cause the most rapid decrease in the cost function. Changing the weights and biases to decrease it means making the output of the network on each piece of training data look less like a random array of 10 values and more like the actual decision we want it to make.

49
Q

Minimising the cost function improves performance on which training sample(s)?

A

Remember, this cost function is an average over all the training data, so minimising it improves the performance on all of those samples.

50
Q

What is the algorithm for computing these gradients?

A

The algorithm for computing these gradients efficiently is essentially the ‘heart’ of how neural networks learn and is called back-propagation.

51
Q

Why do neurons have continuous activations rather than binary activations as in biological neurons?

A

A consequence of the concept of minimising a cost function is that it is important for this cost function to have a nice smooth output (as in the z surface), so that we can find the local minimum by taking small steps downhill. This is why artificial neurons have continuous activations rather than being either active or inactive in a binary way as in biological neurons.

52
Q

What is this process of repeatedly nudging an input of a function called?

A

This process of repeatedly nudging an input of a function by some multiple of a negative gradient is called gradient descent. It’s a way to converge towards some local minimum of a cost function.

53
Q

Although the image of the function of two inputs is useful to conceptualise this, this is simply used because nudges in a 13,000 dimensional space are more difficult to visualise. Give a non-spatial way to think about this

A

Each component of the negative gradient tells us two things: the sign tells us whether the corresponding coordinate of the input vector should move up or down, and the relative magnitudes of the components tell you which changes matter more. The adjustment of one weight might matter a lot more to the cost function (and therefore to performance on the training data) than some other weight. So a way you can think of the gradient vector of this huge cost function is that it encodes the relative importance of each weight and bias: which parameters' changes induce the most valuable change.

54
Q

Think for example if you calculate the gradient of a function as [3,1]. What would this mean in terms of the x,y plane?

A

This essentially translates to that direction on the x,y plane (a line with a rise of 1 and a run of 3, i.e. slope 1/3) being the direction of the steepest slope of the surface above it. It is essentially saying that changes to one variable (e.g. x in the function (3/2)x^2 + (1/2)y^2) have three times the impact of changes to the other variable.
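
A tiny numeric check of this example (the function is from the card; evaluating at the point (1, 1) is my assumption, chosen so the gradient comes out as [3, 1]):

```python
# f(x, y) = (3/2)x^2 + (1/2)y^2, so the gradient is (3x, y).
# At (1, 1) this gives (3, 1): changes in x have three times
# the impact on f that changes in y do.
def grad(x, y):
    return (3 * x, y)

print(grad(1.0, 1.0))  # (3.0, 1.0)
```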

55
Q

You can visualise the pixel patterns the neurons in the hidden layer picked up on. Describe how the network in the example assigned weights. Was it how you expected?

A

The network did not pick out clear, interpretable shapes and lines as we had expected, but instead showed almost random-looking weight images with only loose patterns in the middle. The narrator stated that the network, through its weights and biases, found a nice local minimum that, despite successfully classifying most images, doesn't really pick up on the patterns we might have hoped for.

56
Q

How could you interpret the networks methods of detecting digits compared to our own?

A

You can input a random pixel image and it will confidently give an answer; it knows how to recognise a 5 but not how to draw one. A lot of this is because it is such a tightly constrained training setup: the entire test set consists of clearly defined, unmoving digits centred in a tiny grid, and the cost function gives the network no incentive to be anything but certain in its decisions.

57
Q

Backpropagation is a method of computing the crazy complicated gradient previously described. How can you think of the gradient vector, other than as a direction in 13k-dimensional space?

A

You can think of the magnitude of each component as telling you how sensitive the cost function is to each weight and bias. Say, for example, you go through the backpropagation process and the component associated with one connection's weight comes out as 3.20 while the component for another connection's weight is 0.10. This can be interpreted as: changing the first weight changes the value of the cost function 32 times more than an equal-sized change to the second weight; the cost is 32 times more sensitive to changes in the first weight than in the second.

58
Q

Take, for example, an image of a 2 fed into the network at a point when the network is not well trained yet, so the outputs look pretty random. How can we correct the activations?

A

The activations cannot be adjusted directly, only the weights and biases. It is helpful, though, to track the adjustments we wish to make to that output layer. Since we want the network to classify the image as a 2, we want the 2 neuron's activation to be nudged up and the others to be nudged down. The sizes of the nudges should be proportional to how far each current value is from its target value: the further the 2 neuron's activation is from 1, the more we want to nudge it up, and the further the others are from 0, the more we want to nudge them down.

59
Q

Focusing on just this neuron whose activation we wish to increase, how is the activation defined?

A

This activation is defined as a certain weighted sum of all of the activations of the previous layer plus a bias which is all plugged into something like the sigmoid function or a ReLU.

60
Q

Therefore in what different ways can you increase the activation of the neuron

A

There are therefore three different avenues that could 'team up' to increase that activation: you can increase the bias (b), you can increase the weights (wi), and you can change the activations (ai) of the previous layer.

61
Q

Do all of the weights have the same level of influence? What does this mean?

A

The weights have different levels of influence: those coming from the brightest neurons of the previous layer have the strongest effect, as they are multiplied by larger activation values. So increasing a given weight is more or less effective depending on the activation of the neuron it comes from.

62
Q

What aspect of this process of gradient descent is similar to Hebbian learning?

A

Remember, when we talk about gradient descent we don't just care about whether each component should be nudged up or down; we care about which changes matter most. This is somewhat aligned with Hebbian learning in biological neurons: those which fire together wire together. The biggest increases in weight and strengthening of connections happen between the neurons which are most active and the ones which we wish to become more active. We can therefore increase wi in proportion to ai.

63
Q

Describe the third way we can help increase this neurons activation

A

The third way we can help increase this neuron's activation is by changing all the activations in the previous layer: if everything connected to the 2 neuron with a positive weight got brighter, and everything with a negative weight got dimmer, the 2 neuron would become more active. As with the weight changes, you get the most effect by seeking changes that are proportional to the corresponding weights. Of course, we cannot directly influence those activations; we only have control over the weights and biases.

64
Q

What restricts our ability to freely determine weights to detect a two?

A

Remember, we not only want the 2 neuron to become more active, but all the other neurons in the last layer to become less active, and each of those output neurons has its own goals for what should happen to the second-to-last layer. The desires of the 2 neuron are therefore added to the desires of all the other output neurons, in proportion to the corresponding weights and to how much each of those neurons needs to change.

65
Q

Therefore summarise the process of backpropagation

A

By adding all of these desired effects you basically get a series of nudges which you want to happen to this second last layer. Once you have those, you can carry out the same process to the weights and biases which determine those values. This process is repeated backwards through the network. Remember this is all just how a single training example nudges each of those weights and biases. You go through each step of this backprop routine for each example, recording how each of them would like to change the weights and the biases. You then average together those desired changes. This collection of the averaged nudges to each weight and bias is, loosely speaking, the negative gradient of the cost function or something proportional to it.

66
Q

How fast is this process of backpropagation?

A

In practice, it takes computers an extremely long time to add up the influence of every training example of every gradient descent step.

67
Q

What is done in practice to shorten the time spent computing backprop?

A

What's done in practice instead is that you randomise your training data and assign it to 'mini-batches', say each having 100 examples. You then compute a step according to each mini-batch. This won't be the actual gradient of the cost function, which depends on all of the training data, so it's not the most efficient step downhill; each mini-batch does give a good approximation, however, and it results in a significant computational speed-up. If you were to visualise this on the relevant cost surface, it would look more like a winding route of quick steps down to the eventual solution rather than a carefully calculated route going in the exact downhill direction.
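
A sketch of mini-batch updates; here grad_fn is a hypothetical stand-in for whatever computes the gradient estimate from a batch:

```python
import random

def sgd(data, params, grad_fn, lr=0.01, batch_size=100, epochs=10):
    # Shuffle the training data, split it into mini-batches, and take one
    # (approximate) downhill step per batch instead of per full dataset.
    for _ in range(epochs):
        random.shuffle(data)
        for i in range(0, len(data), batch_size):
            batch = data[i:i + batch_size]
            g = grad_fn(params, batch)  # gradient estimate from this batch only
            params = [p - lr * gi for p, gi in zip(params, g)]
    return params
```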

68
Q

What is this form of gradient descent called and what does it require?

A

This technique is referred to as stochastic gradient descent. For this entire process to work you need big databases.

69
Q

How do you compute the error?

A

error = y − ŷ (target value minus predicted value):

y   ŷ   error
0   0    0
0   1   -1
1   0    1
1   1    0

70
Q

What function can the perceptron without hidden layers not compute? What does it need for this?

A

The XOR function: it is not linearly separable (no single line separates the classes), so it needs a hidden layer, e.g. units encoding two OR-like functions (OR and NAND) whose outputs feed an AND at the output neuron (multiple decision lines). The XOR truth table:

x1  x2  ŷ
0   0   0
0   1   1
1   0   1
1   1   0
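
A sketch of one standard construction (OR and NAND hidden units feeding an AND output); the specific weights are my choice, not from the slides:

```python
def step(s):
    return 1 if s > 0 else 0

def xor(x1, x2):
    h1 = step(x1 + x2 - 0.5)    # hidden unit 1: OR  (first decision line)
    h2 = step(1.5 - x1 - x2)    # hidden unit 2: NAND (second decision line)
    return step(h1 + h2 - 1.5)  # output unit: AND of the two hidden units

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, xor(a, b))  # -> 0, 1, 1, 0
```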

71
Q

How can you prevent overshooting? (slightly different in the slides)

A

If you change too little it will take forever. With gradient descent, if you make the steps proportional to the slope, you take big steps when far away from the (local) minimum (large error), and the steps get smaller and smaller as you approach it (smaller error) –> prevents overshooting.

72
Q

What name does he give to stochastic learning?

A

Perturbation learning; it seems to convey the same idea anyway, though I'm not sure.

73
Q

What kind of feedback do the following types of learning have?
Hebbian
Perturbation
Backpropagation
Backprop-like learning with feedback network

A

Hebbian: no feedback (simple Hebbian learning cannot make meaningful changes to a given synapse, because it does not consider that synapse's downstream effect on the network output)

Perturbation: scalar feedback

Backpropagation: Vector feedback

Backprop-like learning with feedback network: Vector feedback

74
Q

What does scalar feedback tell you vs vector feedback?

A

Scalar feedback: "right or wrong". Perturbation methods measure the change in error caused by random perturbations to neural activities (node perturbation) or synapse strengths (weight perturbation) and use this measured change as a global scalar reinforcement signal that controls whether a proposed perturbation is accepted or rejected.

Vector feedback: "tells you what's wrong". The backprop algorithm instead computes the synapse update required to most quickly reduce the error. In backprop, vector error signals are delivered backwards along the original path of influence for a neuron.
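
A minimal sketch of weight perturbation with the accept/reject scalar signal described above (the function names and Gaussian noise scale are assumptions):

```python
import random

def weight_perturbation_step(weights, loss_fn, sigma=0.01):
    # Scalar feedback: perturb all weights at random, measure the change
    # in error, and use that single number to accept or reject the change.
    perturbed = [w + random.gauss(0.0, sigma) for w in weights]
    if loss_fn(perturbed) < loss_fn(weights):  # global "right or wrong" signal
        return perturbed                        # keep helpful perturbations
    return weights                              # reject harmful ones
```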

75
Q

What is the relationship of these methods and precision of synaptic changes?

A

The more precise the synaptic changes, the faster the error is reduced; imprecise methods take much longer.

In order of increasing precision: weight perturbation (changing the weights one by one and seeing how the output changes), node perturbation, backpropagation approximations, backpropagation.

76
Q

Give the activation of an output neuron in his notation

A

k = postsynaptic, j = presynaptic
h = output, a = activation, w = weights

h_k = f(a_k)
a_k = Σ_j (h_j w_jk)

77
Q

How do we compute and implement the error in a neuron according to this notation?

A

The error is calculated at the end of the network:
E = ½ Σ_l (t_l − ŷ_l)², with l being an output neuron

e_k = Σ_l (δ_l W_lk)
with δ being the error signal propagated backwards

δ_k = e_k f′(a_k)

78
Q

How would you update the weights?

A

ΔW_ij = −η (∂E/∂W_ij) = η h_i δ_j
η = the learning rate
where δ_j = e_j f′(a_j) = (Σ_k δ_k W_kj) f′(a_j)

So δ_j = the sum, over the neurons k that neuron j projects to, of the backpropagated error δ_k times the weight from j to k, all multiplied by the derivative of the activation function at a_j.

ΔW_ij = learning rate × output of neuron i × δ_j
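
A sketch of these updates for one hidden layer in numpy, using the card's notation (layers i -> j -> k, with k the output); the sign convention assumes the error e is defined as target minus output:

```python
import numpy as np

def f(a):        # activation function, here the sigmoid
    return 1.0 / (1.0 + np.exp(-a))

def f_prime(a):  # derivative of f at the pre-activation a
    return f(a) * (1.0 - f(a))

def backprop_step(h_i, W_ij, W_jk, t, eta=0.1):
    # Forward pass: a_j = sum_i h_i W_ij, h_j = f(a_j); likewise for layer k.
    a_j = W_ij @ h_i
    h_j = f(a_j)
    a_k = W_jk @ h_j
    h_k = f(a_k)
    # Output error signal: delta_k = e_k f'(a_k), with e_k = t_k - h_k.
    delta_k = (t - h_k) * f_prime(a_k)
    # Backpropagate: e_j = sum_k delta_k W_jk, then delta_j = e_j f'(a_j).
    delta_j = (W_jk.T @ delta_k) * f_prime(a_j)
    # Updates: Delta W = -eta dE/dW = eta * (presynaptic output) * delta.
    W_jk += eta * np.outer(delta_k, h_j)
    W_ij += eta * np.outer(delta_j, h_i)
    return W_ij, W_jk
```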

79
Q

How many layers can this propagation be applied to?

A

Generalizes to any number of layers; the only limits are your computer's RAM and your time.

80
Q

What are the two goal functions of the brain?

A

Measure how much the learned features have to do with one output or with both outputs.

Separation of dorsal vs ventral streams –> neurons are expected to become specialised early on
=> if the goal functions are thought to be similar –> both neurons help each other (no separate streams)

81
Q

What are meant by related and unrelated goal functions?

A

A way of investigating whether we get streams when you have objective functions that are related or unrelated:

The network classifies a word/text and an image that are shown simultaneously; the image and text can be related or unrelated (Robin image: 'bird'; Robin image: 'lip').

Unrelated: expect early specialisation; neurons in the network form streams (they don't help each other)
:: some neurons respond to text recognition and some respond to image recognition
:: they become two streams

Related: expect neurons to specialise only at a later stage –> unspecialised, one stream; the neurons help each other

82
Q

What is problematic about backpropagation? (4)

A
  1. Phases: a forward phase where the output is computed and a backward phase where errors are computed
    :: can't really relate that to biology, maybe oscillations? But no idea
  2. Weight symmetry: the forward weights need to be mirrored to go back
  3. One neuron but two types of signals
    :: one related to learning, the other to processing ('probably')
  4. Supervised learning: relies heavily on supervised learning, which is very rare in biology
83
Q

What are some solutions to these problems with backpropagation? (3)

A

A separate feedback network (in addition to the forward pass) that calculates the delta/error and communicates it to the corresponding neuron in the forward pass
Or: keep the feedback weights fixed

Target propagation: learn the function that needs to be back-propagated

Reinforcement-learning-based feedback

84
Q

Describe how you could implement a separate feedback network

A

The output projects to another network which activates backwards. There are connections between these error neurons and neurons in the feedforward network.

85
Q

How does a separate feedback network solve previous issues?

A

More biologically plausible because we have separate neurons that encode forward activity and error. There is no learning in this feedback network: the feedback weights are fixed, so the error is computed 'wrongly' (feedback alignment). The feedforward pass compensates for this, and therefore it works.

86
Q

Describe how feedback alignment works and what it solved

A

Remarkably, networks with fixed random feedback weights learn to approximately align their feedforward synaptic weights to their feedback weights. In a display of neural pragmatism, the fake error derivatives computed using the random feedback weights cause updates to the feedforward weights that make the true error derivatives closer to the fake derivatives. This surprising phenomenon, called “feedback alignment”, suggests that feedback connections do not need to be symmetric to their feedforward counterparts in order to deliver information that can be used for fast and effective weight updates

87
Q

What new problem is presented with this network?

A

Need a learning rule for this

88
Q

What is this new learning rule?

A

Target propagation

:: If desired output is known you can use a function to tell you what the output of the previous layer should be

:: instead of calculating the error function directly –> use an inverse function

:: you can learn the correct targets for the previous layer via a function g

From article:
We propagate activity forward through successive layers of a network to produce a predicted output. Then we propagate an output target backwards through inverse functions (i.e. via feedback connections) that are learned through layer-wise autoencoding of the forward layers. This backward propagated target induces hidden activity targets that should have been realised by the network. In other words, if the network had achieved these hidden activities during feedforward propagation, then it would have produced the correct output. The direction in the activity space between the feedforward activity and the feedback activity indicates the direction in which the neurons’ activities should move in order to improve performance on the data. Learning proceeds by updating forward weights to minimize these local layer-wise activity differences, and it can be shown that under certain conditions the updates computed using these layer-wise activity differences approximate those that would have been prescribed by backprop.

89
Q

What problems does target propagation solve?

A

Each layer gets its own local target to achieve.

The weight symmetry problem is solved, but the phases are still an issue.

The algorithm provides a compelling example of how locally generated activity differences can be used to drive learning updates for multi-layer networks

90
Q

How do these DTP networks perform?

A

Difference target propagation effectively trains multi-layer neural networks on classification tasks such as MNIST and CIFAR, and it learns in a fraction of the time required by algorithms that use weight or node perturbation to update weights.

Recent work shows that straightforward implementations of DTP do not perform as well as backprop on the ImageNet task with large convolutional networks.


91
Q

Name another example of biologically plausible deep learning

A

Reinforcement learning with feedback

92
Q

How does RL w/ feedback work?

A

It implements a simultaneous feedback network with interneurons.

Tag which neurons helped you make an action, and then you get a reward-prediction error.

Tag all the neurons involved in the feedback path –> you know these neurons are the ones involved in that particular action.

Change the weights according to this.

93
Q

What are benefits to RL w/ feedback?

A

You can get exact error backpropagation, one sample at a time.

It is almost equal to EBP (error backpropagation), also for big problems, and it is only a bit slower.