Lecture 3: Neural Networks Flashcards

1
Q

Describe the concept of neural networks

A

It is based on an abstract view of the neuron: artificial neurons are connected to form large networks, and the connections determine the function of the network.

2
Q

How may neural networks not need to be ‘programmed’?

A

Connections can often be formed by learning rather than needing to be programmed.

3
Q

What did Alan Turing demonstrate about computers in 1936?

A
Computers have electronic elements that implement ‘logic gates’; with these you can build and run programs, and hence compute anything that is computable.
4
Q

Give 8 examples of logic gates

A

YES (input | output)
0 | 0
1 | 1

NOT (input | output)
0 | 1
1 | 0

AND (input a | b | output)
0 | 0 | 0
0 | 1 | 0
1 | 0 | 0
1 | 1 | 1

OR (input a | b | output)
0 | 0 | 0
0 | 1 | 1
1 | 0 | 1
1 | 1 | 1

XOR (input a | b | output)
0 | 0 | 0
0 | 1 | 1
1 | 0 | 1
1 | 1 | 0

NAND (input a | b | output)
0 | 0 | 1
0 | 1 | 1
1 | 0 | 1
1 | 1 | 0

NOR (input a | b | output)
0 | 0 | 1
0 | 1 | 0
1 | 0 | 0
1 | 1 | 0

XNOR (input a | b | output)
0 | 0 | 1
0 | 1 | 0
1 | 0 | 0
1 | 1 | 1
5
Q

What did McCulloch and Pitts conclude regarding these logic gates and neuroscience? (5)

A

They claimed that the brain is a computer in the Turing sense and created their theoretical neuron (see the sketch after this list):

1) The activity of the neuron is an “all-or-none” process
2) A certain number of synapses must be excited within the period of latent addition in order to excite a neuron at any time, and this number is independent of previous activity and position of the neuron
3) The only significant delay within the nervous system is synaptic delay
4) The activity of any inhibitory synapse absolutely prevents excitation of the neuron at that time
5) The structure of the net does not change with time
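
Not from the lecture, but as a minimal sketch of such a theoretical neuron in Python (the specific weights and thresholds are illustrative assumptions), a single all-or-none threshold unit can realise the AND, OR, and NOT gates from card 4:

def mp_unit(inputs, weights, threshold):
    # McCulloch-Pitts-style unit: all-or-none output; fires if and only if
    # the weighted sum of its inputs reaches the threshold
    return 1 if sum(w * x for w, x in zip(weights, inputs)) >= threshold else 0

# Illustrative weights/thresholds realising three gates from card 4
AND = lambda a, b: mp_unit((a, b), (1, 1), threshold=2)
OR  = lambda a, b: mp_unit((a, b), (1, 1), threshold=1)
NOT = lambda a:    mp_unit((a,),   (-1,),  threshold=0)

assert [AND(a, b) for a, b in ((0,0), (0,1), (1,0), (1,1))] == [0, 0, 0, 1]
assert [OR(a, b)  for a, b in ((0,0), (0,1), (1,0), (1,1))] == [0, 1, 1, 1]
assert [NOT(a) for a in (0, 1)] == [1, 0]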

6
Q

How does this idea differ from a real neuron? Why does it differ?

A

These assumptions differ from real neurons in several ways, which McCulloch and Pitts chose to ignore for practicality:
> Neurons tire and are less likely to fire after extensive activity.
> An output signal is either discrete (0/1) or a real-valued number (between 0 and 1), although the latter can be likened to a firing rate.
> Conduction delays are neglected, even though neurons with long axons have ms delays in signal speed.
> Absolute inhibition (‘veto synapses’) was required for logic gates, but is not used as much anymore; most inhibitory synapses only weakly prevent the neuron from firing.
> The net input is calculated as the weighted sum of the input signals and transformed into an output signal via a simple function (e.g., a threshold). In contrast, there are often complex processes going on inside a real neuron.

7
Q

How did it become clear that the computers being built were quite dissimilar from the brain?

A

Because real neurons are quite unreliable: given input, a neuron may fire, but it also may not. Firing is quite a probabilistic process.

8
Q

What revolutionary aspect of the perceptron was quite similar to human thinking?

A

Error-correcting learning: carry out a task, evaluate what went wrong, and correct those processes until the algorithm improves.

9
Q

A more refined version of error-correcting learning followed the perceptron. What was it called, and how was it more refined?

A

The delta rule (Widrow and Hoff, 1960), which used continuous activation values between 0 and 1 rather than binary ones (see card 23).
10
Q

What is meant by the generalised delta rule?

A

Error back-propagation: the delta rule generalised to networks with more than two layers and continuous, non-linear activation rules.

11
Q

How many layers and nodes were in the original perceptron? Also describe the weights

A

Two: an input layer and an output layer; image input and classification output (e.g., ‘8’, ‘dog’, etc.). The nodes were binary, taking values of 0 or 1, and the weights between input and output were continuous, initially chosen randomly.

12
Q

In a simple example, an input unit with a value of 0 and another with a value of 1 (input pattern (0 1)) feed into a single output unit, which fires only when its value is 1. The weights are 0.4 on the 0-valued input unit and -0.1 on the 1-valued unit. Explain whether or not the unit fires and what its net input is.

A

Since the first input has a value of 0, 0.4 x 0 = 0. The second input has a value of 1, so 1 x -0.1 = -0.1. This gives a net input of -0.1 and hence an output pattern of (0): because the net input is negative, the neuron does not fire.
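
A quick numeric check of this card in Python (assuming, as in the card, a unit that fires only when its net input is positive):

inputs  = (0, 1)       # input pattern (0 1)
weights = (0.4, -0.1)  # weights on the 0-valued and 1-valued inputs

net = sum(w * x for w, x in zip(weights, inputs))
print(net, 1 if net > 0 else 0)  # -0.1 0 -> the unit does not fire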

13
Q

The same example: an input unit with a value of 0 and another with a value of 1 (input pattern (0 1)) feed into a single output unit, which fires only when its value is 1. The weights are 0.4 on the 0-valued input unit and -0.1 on the 1-valued unit.

How could we adjust the weights so that this situation is remedied and the spontaneous output matches our target output pattern of (1)? Formulate it in the form of a rule

A

The weight of the 0-valued input is irrelevant, since its contribution is 0 regardless of the weight. The weight of the 1-valued input must be made positive; the exact value does not matter. As a rule: adjust only the weights from active input units, increasing them until the net input exceeds 0.

14
Q

How is changing the weights so that the net input exceeds 0.0 typically carried out?

A

This can be achieved by adding a small amount, such as 0.2, to the weights. Typically, 0.2 is added to all weights, ignoring those from inputs with activation 0, as these have no effect on the net input. (See the sketch below.)
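
Continuing the numeric sketch from card 12, adding 0.2 to the active input's weight is enough to make the unit fire:

inputs  = (0, 1)
weights = [0.4, -0.1]

weights[1] += 0.2  # only the active input's weight is adjusted
net = sum(w * x for w, x in zip(weights, inputs))
print(net, 1 if net > 0 else 0)  # net ≈ 0.1 > 0 -> the unit now fires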

15
Q

All in all, describe the perceptron algorithm in words

A

For each node in the output layer:
– Calculate the error, which can only take the values -1, 0, and 1
– If the error is 0, the goal has been achieved. Otherwise, we adjust the weights
– Do not alter weights from inactivated input nodes
– Increase the weight if the error was 1, decrease it if the error was -1

16
Q

Describe the perceptron algorithm in rules

A

weight change = some small constant x (target activation - spontaneous output activation) x input activation

If we speak of error instead of the “target activation minus the spontaneous output activation”, we have:
weight change = some small constant x error x input activation

17
Q

Why not make the constant larger?

A

Although you would immediately reach your goal for the current pattern, you would sometimes disturb other things that you’ve learned: you’re not just changing one pattern, you’re changing other patterns too. You want to move incrementally so that all patterns get a chance to contribute to the final solution.

18
Q

Describe the perceptron algorithm as an equation

A

Δwji = η(tj - aj)ai = ηdjai

where η is a small learning constant, aj is the activation of output node j, tj its target value, dj = tj - aj its error, and ai the activation of input node i.

19
Q

Describe the perceptron algorithm in pseudocode

A

Start with random initial weights (e.g., uniform random in [-0.3, 0.3])
Do
{
    For All Patterns p
    {
        For All Output Nodes j
        {
            CalculateActivation(j)
            Error_j = TargetValue_j_for_Pattern_p - Activation_j
            For All Input Nodes i To Output Node j
            {
                DeltaWeight = LearningConstant * Error_j * Activation_i
                Weight = Weight + DeltaWeight
            }
        }
    }
}
Until “Error is sufficiently small” Or “Time-out”
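
A minimal runnable Python version of this pseudocode, simplified to a single output node; the OR training set at the end is an illustrative assumption, not from the lecture:

import random

def train_perceptron(patterns, targets, learning_constant=0.2, max_epochs=1000):
    # Perceptron rule for a two-layer net with one binary threshold output node
    n_inputs = len(patterns[0])
    # Start with random initial weights, e.g., uniform in [-0.3, 0.3]
    weights = [random.uniform(-0.3, 0.3) for _ in range(n_inputs)]
    for _ in range(max_epochs):                    # "Do ... Until time-out"
        total_error = 0
        for pattern, target in zip(patterns, targets):
            net = sum(w * a for w, a in zip(weights, pattern))
            activation = 1 if net > 0 else 0       # binary threshold node
            error = target - activation            # can only be -1, 0, or 1
            total_error += abs(error)
            for i, a_i in enumerate(pattern):      # a_i = 0 leaves w_i unchanged
                weights[i] += learning_constant * error * a_i
        if total_error == 0:                       # "Error is sufficiently small"
            break
    return weights

# Example: learn OR, which a two-layer perceptron can represent,
# so by the convergence theorem (card 20) learning must succeed
print(train_perceptron([(0,0), (0,1), (1,0), (1,1)], [0, 1, 1, 1]))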

20
Q

What is meant by the perceptron convergence theorem?

A

Theorem proposed by Rosenblatt: if a pattern set can be represented by a two-layer Perceptron, the Perceptron learning rule will always be able to find some correct weights. (This does not mean that it will always be fast, or that there will always be a solution.)

21
Q

In what ways was the perceptron a big hit?

A

> Spawned the first wave of ‘connectionism’
> Great interest and optimism about the future of neural networks
> The first neural network hardware was built in the late fifties and early sixties

22
Q

What were the limitations of the perceptron? (2)

A

Only binary input-output values

Only two layers

23
Q

How was the binary input-output values limitations of the perceptron solved?

A

The binary input-output limitation was remedied in 1960 by Widrow and Hoff: the perceptron now gave values between 0 and 1 rather than just 0 or 1 (allowing for gray images, not just black-and-white ones). This was called the delta rule. It was originally applied mainly by engineers, until it was later shown to be equivalent to the Rescorla-Wagner rule (1976), which describes animal conditioning very well.

24
Q

Why was the two layers limitation of the perceptron a problem?

A

Minsky and Papert (1969) showed that a two-layer Perceptron cannot represent certain logical functions. Some of these are very fundamental, in particular the exclusive or (XOR: ‘Do you want coffee XOR tea?’). It was shown that an extra layer is needed to represent XOR, but no solid training procedure existed in 1969 to accomplish this. This caused the first AI winter. (See the sketch below.)

What was needed was an algorithm to train perceptrons with more than two layers, preferably one that also used continuous activations and non-linear activation rules.
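
As a small brute-force illustration of Minsky and Papert's point (a sketch, not their proof), no single threshold unit with weights and bias drawn from a coarse grid reproduces XOR's truth table:

from itertools import product

XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}
grid = [v / 2 for v in range(-8, 9)]  # candidate values -4.0 ... 4.0

# A single threshold unit fires iff w1*a + w2*b + bias > 0
matches_xor = any(
    all((w1 * a + w2 * b + bias > 0) == bool(out)
        for (a, b), out in XOR.items())
    for w1, w2, bias in product(grid, repeat=3)
)
print(matches_xor)  # False: XOR is not linearly separable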

25
Q

How and when was the two layers limitation of the perceptron solved?

A
Such an algorithm was developed by
– Paul Werbos in 1974
– David Parker in 1982
– Yann LeCun in 1984
– Rumelhart, Hinton, and Williams in 1986 (when it was really noticed)

Rumelhart, Hinton, and Williams introduced a hidden layer into the network trained by the algorithm.

26
Q

If these advancements were made in the late eighties, why were they not popularised until the nineties?

A

> Researchers had to wait too long (days or weeks) for the computer to solve the problem
> There were not many example patterns available (e.g., handwritten numerals)

27
Q

What problem was posed by adding an extra layer to the perceptron and how was this solved?

A

It is straightforward to adjust the weights to the output layer, using the Perceptron rule. But how can we adjust the weights to the hidden layer if we don’t know what their activations should be? This was solved by the backprop trick: to find the error value for a given node h in a hidden layer, simply take the weighted sum of the errors of all nodes connected from node h, i.e., of all nodes that have an incoming connection from node h:

dh = w1d1 + w2d2 + w3d3 + … + wndn

This is called the back-propagation of errors or error back-propagation
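
In code, this weighted error sum is a single line; the error and weight values below are purely illustrative:

# Backprop trick: the error of hidden node h is the weighted sum of the
# errors of the nodes that h connects to (illustrative values)
to_node_errors = [0.3, -0.1, 0.2]    # d1, d2, d3
weights_from_h = [0.5, 0.8, -0.4]    # w1, w2, w3

d_h = sum(w * d for w, d in zip(weights_from_h, to_node_errors))
print(d_h)  # 0.5*0.3 + 0.8*(-0.1) + (-0.4)*0.2 = -0.01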

28
Q

What is the biological correlate of this back-propagation in the neuron?

A

There is none: no known mechanism exists whereby a signal generated post-synaptically travels back through the axon and arrives at the soma of that node, or further back.

29
Q

Describe the characteristics of back-propagation (5)

A

> Any number of layers
> Only feedforward, no cycles (though a more general version does allow this)
> Use continuous nodes
– Must have differentiable activation rule
– Typically, logistic: S-shape between 0 and 1
> Initial weights are random (landing on a good solution can be dependent on this)
> Total error never increases (gradient descent in error space)

30
Q

What is meant by a differentiable activation rule and why is it necessary?

A

In the original formulation, the activation rule must be differentiable because the weight-change rule uses the derivative of the activation function.

31
Q

It was mentioned that continuous node values are S-shaped between 0 and 1. What does this mean?

A

It forms a logistic function, which is approximately linear around net input = 0. Its rate of change (derivative) for a node with a given activation is: activation x (1 - activation). This means that for very high net inputs, either positive or negative, the derivative becomes close to 0. This is important because the weight-change rule contains this derivative, so extreme input values slow learning down enormously. (See the sketch below.)
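
A short sketch of the logistic activation and its derivative, activation x (1 - activation), showing how the derivative collapses toward 0 for extreme net inputs:

import math

def logistic(net):
    return 1.0 / (1.0 + math.exp(-net))

for net in (-10, -2, 0, 2, 10):
    a = logistic(net)
    print(net, round(a, 4), round(a * (1 - a), 4))
# The derivative peaks at 0.25 when net = 0 and is ~0.0 at net = +/-10,
# so weight changes nearly vanish for extreme net inputs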

32
Q

Why keep the derivative there if it slows down learning?

A

Having that derivative in there can slow down descent, but it is what finds the ‘steepest’ slope (the quickest way of reducing the error). Leaving the derivative out may descend faster at times, but the error can then also go up and down, whereas keeping the derivative means that the error can only go down or stay stable.

33
Q

Give benefits and downsides to the gradient descent

A

The error always goes down, but it may take longer than other methods; it also does not guarantee high performance and does not prevent local minima (i.e., there is no back-propagation convergence theorem). The learning rule is more complicated, and it tends to slow down learning unnecessarily when the logistic function is used.

34
Q

Describe the back-propagation algorithm in rules

A

weight change = some small constant x error x input activation

For an output node, the error is:
error = (target activation - output activation) x output activation x (1 - output activation)
i.e., the raw error x the derivative of the output activation

For a hidden node, the error is:
error = weighted sum of to-node errors x hidden activation x (1 - hidden activation)
i.e., the back-propagated error x the derivative of the hidden activation

35
Q

Back-propagation took a long time, however. How did its creators tinker with it to make it faster?

A

The learning rule is often augmented with a ‘momentum term’: a fraction of the old weight change is added to the current one. When there are large decreases in error, this helps avoid overshooting.

36
Q

What does the new learning rule look like adapted to include the momentum term?

A

weight change = some small constant x error x input activation + momentum constant x old weight change

37
Q

If j is a node in an output layer, what is the error dj?

A

dj = (tj - aj) aj (1 - aj)

where aj is the activation of node j, tj is its target activation value, and dj its error value.

38
Q

If j is a node in a hidden layer, and if there are k nodes 1, 2, …, k, that receive a connection from j, what is the error dj?

A

dj = (w1j d1 + w2j d2 + … + wkj dk) aj (1 - aj), where the weights w1j, w2j, …, wkj belong to the connections from hidden node j to nodes 1, 2, …, k.

39
Q

The backpropagation learning rule (applied at time t) is:
Δwji(t) = η dj ai + β Δwji(t-1)

Explain what this equation means and what values are typically given to the variables

A

Δwji(t) is the change in the weight from node i to node j at time t.

The learning constant η is typically chosen rather small (e.g., 0.05).

The momentum term β is typically chosen around 0.5. It is optional, but makes learning faster: the rule remembers the weight change from the previous iteration and adds a fraction of it to the current weight change. (See the sketch below.)
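
A minimal runnable sketch of the full rule on the XOR task (the 2-2-1 layer sizes, bias units, training set, and parameter values are illustrative assumptions, not from the lecture):

import math, random

def logistic(net):
    return 1.0 / (1.0 + math.exp(-net))

def train_xor(eta=0.5, beta=0.5, epochs=20000, seed=1):
    # 2-2-1 back-propagation net with momentum; implements
    # dw_ji(t) = eta * d_j * a_i + beta * dw_ji(t-1) from this card
    rnd = random.Random(seed)
    w_hid = [[rnd.uniform(-0.3, 0.3) for _ in range(3)] for _ in range(2)]
    w_out = [rnd.uniform(-0.3, 0.3) for _ in range(3)]  # 2 hidden + bias
    dw_hid = [[0.0] * 3 for _ in range(2)]              # previous weight changes
    dw_out = [0.0] * 3
    data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
    for _ in range(epochs):
        for (x1, x2), t in data:
            a_in = (x1, x2, 1.0)                        # inputs + bias
            a_hid = [logistic(sum(w * a for w, a in zip(ws, a_in)))
                     for ws in w_hid]
            a_hb = a_hid + [1.0]                        # hidden + bias
            a_out = logistic(sum(w * a for w, a in zip(w_out, a_hb)))
            d_out = (t - a_out) * a_out * (1 - a_out)   # output error (card 37)
            d_hid = [w_out[j] * d_out * a_hid[j] * (1 - a_hid[j])
                     for j in range(2)]                 # hidden errors (card 38)
            for i in range(3):                          # this card's update rule
                dw_out[i] = eta * d_out * a_hb[i] + beta * dw_out[i]
                w_out[i] += dw_out[i]
            for j in range(2):
                for i in range(3):
                    dw_hid[j][i] = eta * d_hid[j] * a_in[i] + beta * dw_hid[j][i]
                    w_hid[j][i] += dw_hid[j][i]
    return w_hid, w_out

# Outputs should approach XOR's targets, though back-propagation has no
# convergence theorem and can land in a local minimum (card 33)
w_hid, w_out = train_xor()
for (x1, x2), t in [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]:
    a_in = (x1, x2, 1.0)
    a_hb = [logistic(sum(w * a for w, a in zip(ws, a_in))) for ws in w_hid] + [1.0]
    print((x1, x2), t, round(logistic(sum(w * a for w, a in zip(w_out, a_hb))), 2))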

40
Q

Describe an application of back-propagation which demonstrated its usefulness

A

NetTalk, back-propagation’s ‘killer app’, was a text-to-speech converter that took English text and fed it to a speech synthesiser (difficult, because English pronunciation is quite irregular). It was connectionism’s answer to DECtalk, a piece of hardware about the size of a freezer that carried out the same function and had required linguists and years of study. NetTalk was software, and it learned to pronounce text with an error score comparable to DECtalk’s within a week, just by being fed training examples. The input was a letter in context; the output, a phoneme.

The input layer had 7 groups of 29 units: each group represented one letter in a sentence, so the network took in the context of the letters surrounding a particular letter. There were 80 hidden units and 26 output units (the possible pronunciations of a letter).

41
Q

What disadvantages did back-propagation have?

A

Learning is slow, and new learning will rapidly overwrite old representations unless these are interleaved (i.e., repeated) with the new patterns. This makes it hard to keep networks up-to-date with new information (e.g., the dollar rate). It also makes back-propagation very implausible as a psychological model of human memory.

42
Q

What advantages did back-propagation have?

A
> Easy to use
– Few parameters to set
– Algorithm is easy to implement
> Can be applied to a wide range of data
> Is very popular
> Has contributed greatly to the ‘new connectionism’ (second wave)