Perceptron & Backpropagation Flashcards
How does the lecturer compare a brain to a computer?
Brains compute actions from perception
=> hidden states (info stuck in the brain that influences what you’re going to do): goals, plans
=> learning: Brain is flexible, learning going on
▪outcome dependent (reinforcement/supervised; ‘if I go to bed at the right time I’ll get a good night’s sleep’)
▪outcome independent (unsupervised/self-supervised)
Describe how spiking neurons are modelled
early models: spikes (McCulloch Pitts)
An input affects the firing rate which can be an analogue value between 0 and 200 Hz
h = hidden layer, something that you can’t observe. A hidden layer is located between the input and output of the algorithm; it applies weights to its inputs and directs them through an activation function to produce its output.
n neurons in hidden layer, n in next layer -> total of n^2 connections (weights) between the two layers
What functions can the transfer function take?
Linear, hyperbolic tangent etc.
What is the purpose of the activation and transfer function?
An activation function works on some threshold value: once that value is crossed, the signal is triggered. A transfer function is used to translate input signals into output signals.
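A minimal sketch of the distinction in Python (the particular functions and the threshold of 0 are illustrative choices, not from the lecture):

```python
import numpy as np

def transfer_linear(s):
    # linear transfer function: passes the input signal straight through as output
    return s

def transfer_tanh(s):
    # hyperbolic tangent transfer function: squashes the input into (-1, 1)
    return np.tanh(s)

def activation_threshold(s, threshold=0.0):
    # activation function: the neuron only "fires" once the threshold is crossed
    # (threshold of 0 is an assumed example value)
    return 1.0 if s > threshold else 0.0
```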
Give a very simplified generalised model of an artificial neural network
O = f(x1, x2)
(Output is a function of input variables)
What is meant by a deep neural network?
A network with many hidden layers
Name five identification processes in a deep neural network
- Identify pixel values
- Identify edges
- Identify combinations of edges
- Identify features
- Identify combinations of features
Give an example of objective functions irl
ultimate objective function = survive and procreate
but it’s not just that
How do these identification processes compare to physiology?
some neurons become sensitive to edges
some neurons are sensitive to parts of the face
Describe the relationship between parameters and fit
Too few parameters are insufficient and do not capture the data well enough; too many parameters fit the data too well, overfitting it rather than describing the underlying, generalisable relationship. As an analogy of a line on a graph: the line could perfectly outline the shape in an image of an elephant, but this would not generalise to other images of elephants.
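A quick way to see this is to fit polynomials of different degrees to a few noisy points; the data, degrees and noise level below are made-up illustration values:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 8)
y = 2 * x + rng.normal(scale=0.1, size=x.size)   # roughly linear data with a little noise

underfit = np.polyfit(x, y, deg=0)   # too few parameters: a flat line misses the trend
good_fit = np.polyfit(x, y, deg=1)   # about right: captures the underlying line
overfit  = np.polyfit(x, y, deg=7)   # too many parameters: passes through every noisy point
```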
What is a loss function?
A loss function is a function that compares the target and predicted output values; it measures how well the neural network models the training data. When training, we aim to minimise this loss between the predicted and target outputs.
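For example, a mean squared error loss (one common choice; the lecture does not commit to a specific loss) compares predicted and target outputs like this:

```python
import numpy as np

def mse_loss(predicted, target):
    # mean squared error: average squared difference between prediction and target
    return np.mean((np.asarray(predicted) - np.asarray(target)) ** 2)

# training aims to make this number as small as possible (example values)
print(mse_loss([0.9, 0.1, 0.0], [1.0, 0.0, 0.0]))  # small loss: prediction close to target
print(mse_loss([0.1, 0.8, 0.1], [1.0, 0.0, 0.0]))  # larger loss: prediction far from target
```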
What does the architecture of a network concern?
How does a certain architecture with certain rules achieve certain objective functions?
Where does neurological preprocessing take place? How are the parameters updated?
The retina does fantastic preprocessing, and its parameters are fixed (not updated by learning)
What is visual processing mostly based on?
Visual processing is mostly based on contrast
What are the learning rules of an algorithm; what methods do they employ?
They use a gradient and a learning update rule. The loss function is a method of evaluating how well your machine learning algorithm models your data set
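A minimal sketch of such a gradient-based update rule on a single weight, assuming a squared-error loss (the learning rate and values are made up):

```python
# gradient-descent style learning rule: move the weight against the gradient of the loss
learning_rate = 0.1   # assumed example value
weight = 0.5          # assumed starting weight

x, target = 2.0, 3.0                    # one made-up training example
prediction = weight * x
grad = 2 * (prediction - target) * x    # d(loss)/d(weight) for a squared-error loss
weight = weight - learning_rate * grad  # the learning update rule
```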
As a summary what does neural learning combine?
Architecture, loss function(s), learning rules
Give two examples of these neural learning rules
Cortical columns (architecture), plasticity rules
Learning rules get neural networks to do useful stuff. Give some examples of useful stuff
=> Identify digits in images (MNIST)
=> Translate Chinese-to-Dutch
=> Recommend movies based on past ratings (Netflix prize)
=> Modeling cognitive processes:
Attention, perception, etc.
What is the function of learning algorithms?
Finding a suitable set of parameters (weights)
Describe the maths behind a simple neuron in a neural network
s = b + Σᵢ wᵢxᵢ
where s is the activation, xᵢ are the inputs, wᵢ are the weights and b is the bias.
f is termed the activation function, and f(s) is the output of the neural network:
f(s) = 1 if s > 0; f(s) = 0 if s ≤ 0
Describe how this neuron works
This model can work on non-boolean values where each input connection gets associated with a weight. Here the function calculates the weighted sum and based on the threshold value provided, it gives a binary output.
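Putting the formula above into code, a single threshold neuron of this kind might look like the following (the weights, bias and inputs are example values):

```python
def neuron(inputs, weights, bias):
    # weighted sum of inputs plus bias: s = b + sum_i(w_i * x_i)
    s = bias + sum(w * x for w, x in zip(weights, inputs))
    # threshold activation: output 1 if s > 0, otherwise 0
    return 1 if s > 0 else 0

print(neuron([1.0, 0.0], weights=[0.7, 0.7], bias=-0.5))  # fires: 0.7 - 0.5 > 0
print(neuron([0.0, 0.0], weights=[0.7, 0.7], bias=-0.5))  # silent: -0.5 <= 0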
Describe the initial layer of a perceptron with a 28x28 pixel image
With a given input image of 28x28 pixels, a perceptron can attempt to output which number is shown in the image. This translates to 784 ‘neurons’, each with their own activation, which roughly corresponds to how ‘bright’ the corresponding pixel is, ranging from black at 0 to white at 1.
What would the outcome layer consist of?
The last column, the output layer, comprises the ten possible digits (0–9) the network could categorise the input into. Each of these ‘neurons’ also has an activation relating to how likely that digit is given the input.
What do the layers in between consist of? Describe how they work
The layers between these two are called hidden layers. This network can have e.g two layers of 16 neurons each but that is somewhat of an arbitrary choice here. The activations of one layer (e.g input layer, hidden layer) determine the activations in the next layer (e.g hidden layer, output layer). This network has already been trained to learn digits. This means that if you feed it an image with different activations in each pixel, this causes a very specific pattern of activation in the next layer, which gives some pattern to the one after it, which gives a specific pattern to the output layer. The neuron with the most activation is then selected for what the image represents.
What do we hope the hidden layers are doing?
What we hope the hidden layers are doing is something akin to how our own vision works; by piecing together various components to classify what it is we’re seeing
What might we hope the last hidden layer is doing?
9 and 8 have the same top component but a different bottom component. 4 comprises three lines. We might hope the last hidden layer comprises these components. Any loop in the top frame of the image might activate the top neuron in the hidden layer, and this would cause increased activation in output neurons which encode numbers with this feature. Going from the third layer to the last layer then just requires learning which combination of components corresponds to which digits.
What might we hope the first hidden layer is doing?
Recognising a loop can also run into some problems. One reasonable way to do this would be to first recognise the various little edges that make it up. Similarly, a line is just a long edge, or a pattern of several small edges. This could be what the first hidden layer does: the initial image activates 8-10 specific little edges, which in turn activate the upper loop and a long vertical line, which in turn activate the number 9 in the output.
How reasonable are these ideas?
Whether this is what our final network actually does is a different story. This is the goal however and a useful way to think about it. This can also be expanded into thinking about how networks might break down more complex images, or even beyond image recognition there are a lot of things you might want to do that break down into layers of abstraction. For example, parsing speech involves taking raw audio and picking out distinct sounds which combine to make certain syllables, which combine to form words which combine to make up phrases and more abstract thoughts.
How do these neurons encode certain patterns?
For these neurons to encode certain patterns, edges etc they are assigned weights. It can be helpful to think of these weights in a little grid of their own with green showing positive values, red as negative and their brightness as their strength.
The activation of each pixel is then multiplied by its corresponding weight and the products are summed, giving a weighted sum. This can be thought of as the activations being overlaid on this grid of weights in order to determine how much the two correspond with each other.
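In code, ‘overlaying’ the weight grid on the pixel activations is just an element-wise multiplication followed by a sum (the 28x28 shape follows the example above; the random values are placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((28, 28))         # pixel activations between 0 and 1 (placeholder data)
weights = rng.normal(size=(28, 28))  # positive ("green") and negative ("red") weights

# overlay the weights on the pixel activations and sum the products
weighted_sum = np.sum(weights * image)
```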
How can activation and inactivation be used?
In a rectangle of activation, for example, there might be a 3x8 grid of positive weights with negative weights on either side. This would mean that the sum is largest when the middle pixels are bright but the surrounding pixels are darker (e.g the top line of a 7).
What is the role of the function then in this process?
When you compute sums like this you can come out with any number but we want numbers between 0 and 1. We commonly therefore use a function to convert the range to 0 and 1.
Describe a function commonly used for this
A common function which does this is the sigmoid function, also known as the logistic curve. This function converts very negative inputs to values close to 0 and very positive inputs to values close to 1, and steadily increases around an input of 0. The activation of a neuron is therefore a measure of how positive the relevant weighted sum is.
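The sigmoid itself is a one-line function; the sample inputs below just show the squashing behaviour described above:

```python
import numpy as np

def sigmoid(s):
    # squashes any real number into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-s))

print(sigmoid(-10))  # ~0: very negative sums give near-zero activation
print(sigmoid(0))    # 0.5: the steadily increasing middle of the curve
print(sigmoid(10))   # ~1: very positive sums give near-one activation
```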
Why might we want to input bias in our model?
You may only want a particular neuron to become meaningfully active, or fire, when the weighted sum exceeds a certain threshold. In other words, you want a bias for inactivity. To do this we simply add a negative number to the weighted sum before plugging it into the sigmoid function (e.g f(w1a1 + … + wnan − 10)). This additional number is called the bias.
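Continuing the sketch, with the −10 bias from the example above (the weights and activations are made-up values):

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

weights = np.array([2.0, 3.0, 1.0])      # example weights
activations = np.array([1.0, 1.0, 1.0])  # example activations from the previous layer
bias = -10.0                             # bias for inactivity: raises the effective threshold

# the weighted sum (6 here) falls short of the threshold, so the neuron stays mostly inactive
output = sigmoid(np.dot(weights, activations) + bias)
```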
Therefore what do the weights and bias tell you?
The weights tell you which pixel pattern this neuron in the second layer is picking up on and the bias is how high the weighted sum needs to be before the neuron starts becoming meaningfully active. This is carried out for every single neuron in the hidden layer.
How many parameters would this model have?
That is 784 weights per individual neuron in the first hidden layer, with each neuron having its own bias. That’s 784x16 weights and 16 biases from the first layer to the second. The other layers have a bunch of weights and biases associated with them too. In total this network has 784x16 + 16x16 + 16x10 weights and 16+16+10 biases, meaning 13,002 weights and biases: 13,002 parameters which can be tweaked to make the network perform in different ways.
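The count can be checked directly (the 784/16/16/10 layer sizes are the ones used in this example network):

```python
layer_sizes = [784, 16, 16, 10]

# one weight per connection between consecutive layers, one bias per non-input neuron
weights = sum(n_in * n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))
biases = sum(layer_sizes[1:])

print(weights, biases, weights + biases)  # 12960 42 13002
```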
The actual function here is obviously quite difficult to write out. How can we make it more compact?
A more notationally compact way to present it is to organise all the activations from one layer into a column vector, and to organise all of the weights as a matrix, where each row of that matrix corresponds to the connections between one layer and a particular neuron in the next layer. Taking the weighted sum of the activations in the first layer according to these weights then corresponds to one of the terms in the matrix-vector product. Instead of adding the bias to each one of these independently, we collect all of the biases into a vector and add that vector to the matrix-vector product. As a final step we ‘wrap’ a sigmoid around the outside, which represents the fact that you apply the sigmoid to each component of the resulting vector. This means we can communicate the full transition of activations from one layer to the next in an extremely tight and neat little expression.
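In code, that tight little expression a' = σ(Wa + b) for one layer transition might look like this (the layer sizes follow the example; the random values are placeholders):

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

rng = np.random.default_rng(0)
a_prev = rng.random(784)        # column of activations from one layer (placeholder values)
W = rng.normal(size=(16, 784))  # each row: weights into one neuron of the next layer
b = rng.normal(size=16)         # one bias per neuron in the next layer

a_next = sigmoid(W @ a_prev + b)  # the full transition from one layer to the next
```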
The goal is to show the algorithm a bunch of training data with labels of the category it is supposed to belong to and it will adjust those (e.g 13,000) weights and biases to improve its performance on the training data. Hopefully this layered structure will mean that it can generalise what it learns beyond the training data. This can be tested using novel test data and assessing how well it can classify that test data (number correct/ total).
What does the ‘learning’ here boil down to?
This ‘learning’ is essentially calculus and comes down to finding the minimum of a particular function. Remember, we are thinking of each ‘neuron’ as being connected to each neuron in the previous layer, and the weights are like the strengths of those connections.
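A toy version of ‘finding the minimum of a particular function’ with calculus, using the gradient of f(x) = (x − 3)^2 (the function and step size are made up, just to make the idea concrete):

```python
x = 0.0              # arbitrary starting point
learning_rate = 0.1  # assumed step size

for _ in range(100):
    grad = 2 * (x - 3)            # derivative of f(x) = (x - 3)**2
    x = x - learning_rate * grad  # step downhill against the gradient

print(x)  # close to 3, the minimum of f
```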