Lecture 3: Neural Networks Flashcards

1
Q

Describe the concept of neural networks

A

It is based on an abstract view of the neuron: artificial neurons are connected to form large networks, and the connections determine the function of the network.

2
Q

How may neural networks not need to be ‘programmed’?

A

Connections can often be formed by learning rather than needing to be programmed.

3
Q

What did Alan Turing demonstrate about computers in 1936?

A
Computers have electronic elements that implement ‘logic gates’; with these you can build and run programs, and hence compute anything that is computable.
4
Q

Give 8 examples of logic gates

A

YES (input | output)
0 | 0
1 | 1

NOT (input | output)
0 | 1
1 | 0

AND (input a | b | output)
0 | 0 | 0
0 | 1 | 0
1 | 0 | 0
1 | 1 | 1

OR (input a | b | output)
0 | 0 | 0
0 | 1 | 1
1 | 0 | 1
1 | 1 | 1

XOR (input a | b | output)
0 | 0 | 0
0 | 1 | 1
1 | 0 | 1
1 | 1 | 0

NAND (input a | b | output)
0 | 0 | 1
0 | 1 | 1
1 | 0 | 1
1 | 1 | 0

NOR (input a | b | output)
0 | 0 | 1
0 | 1 | 0
1 | 0 | 0
1 | 1 | 0

XNOR (input a | b | output)
0 | 0 | 1
0 | 1 | 0
1 | 0 | 0
1 | 1 | 1
5
Q

What did McCulloch and Pitts conclude regarding these logic gates and neuroscience? (5)

A

They claimed that the brain is a computer in the Turing sense and created their theoretical neuron (see the sketch after this list):

1) The activity of the neuron is an “all-or-none” process
2) A certain number of synapses must be excited within the period of latent addition in order to excite a neuron at any time, and this number is independent of previous activity and position of the neuron
3) The only significant delay within the nervous system is synaptic delay
4) The activity of any inhibitory synapse absolutely prevents excitation of the neuron at that time
5) The structure of the net does not change with time
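
Not from the lecture, but as a minimal sketch of such a theoretical neuron in Python (the specific weights and thresholds are illustrative assumptions), a single all-or-none threshold unit can realise the AND, OR, and NOT gates from card 4:

def mp_unit(inputs, weights, threshold):
    # McCulloch-Pitts-style unit: all-or-none output; fires if and only if
    # the weighted sum of its inputs reaches the threshold
    return 1 if sum(w * x for w, x in zip(weights, inputs)) >= threshold else 0

# Illustrative weights/thresholds realising three gates from card 4
AND = lambda a, b: mp_unit((a, b), (1, 1), threshold=2)
OR  = lambda a, b: mp_unit((a, b), (1, 1), threshold=1)
NOT = lambda a:    mp_unit((a,),   (-1,),  threshold=0)

assert [AND(a, b) for a, b in ((0,0), (0,1), (1,0), (1,1))] == [0, 0, 0, 1]
assert [OR(a, b)  for a, b in ((0,0), (0,1), (1,0), (1,1))] == [0, 1, 1, 1]
assert [NOT(a) for a in (0, 1)] == [1, 0]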

6
Q

How does this idea differ from a real neuron? Why does it differ?

A

These assumptions differ from real neurons in several ways, which McCulloch and Pitts chose to ignore for practicality:
> Neurons tire and are less likely to fire after extensive activity.
> An output signal is either discrete (0/1) or a real-valued number (between 0 and 1), although the latter can be likened to a firing rate.
> Conduction delays are neglected, even though neurons with long axons have ms delays in signal speed.
> Absolute inhibition (‘veto synapses’) was required for logic gates, but is not used as much anymore; most inhibitory synapses only weakly prevent the neuron from firing.
> The net input is calculated as the weighted sum of the input signals and transformed into an output signal via a simple function (e.g., a threshold). In contrast, there are often complex processes going on inside a real neuron.

7
Q

How did it become clear that the computers being built were quite dissimilar from the brain?

A

Because real neurons are quite unreliable: given input, a neuron may fire, but it also may not. Firing is quite a probabilistic process.

8
Q

What revolutionary aspect of the perceptron was quite similar to human thinking?

A

Error-correcting learning: carry out a task, evaluate what went wrong, and correct those processes until the algorithm improves.

9
Q

A more refined version of error-correcting learning followed the perceptron. What was it called, and how was it more refined?

A

The delta rule (Widrow and Hoff, 1960), which used continuous activation values between 0 and 1 rather than binary ones (see card 23).
10
Q

What is meant by the generalised delta rule?

A

Error back-propagation: the delta rule generalised to networks with more than two layers and continuous, non-linear activation rules.

11
Q

How many layers and nodes were in the original perceptron? Also describe the weights

A

Two: an input layer and an output layer; image input and classification output (e.g., ‘8’, ‘dog’, etc.). The nodes were binary, taking values of 0 or 1, and the weights between input and output were continuous, initially chosen randomly.

12
Q

In a simple example, an input unit with a value of 0 and another with a value of 1 (input pattern (0 1)) feed into a single output unit, which fires only when its value is 1. The weights are 0.4 on the 0-valued input unit and -0.1 on the 1-valued unit. Explain whether or not the unit fires and what its net input is.

A

Since the first input has a value of 0, 0.4 x 0 = 0. The second input has a value of 1, so 1 x -0.1 = -0.1. This gives a net input of -0.1 and hence an output pattern of (0): because the net input is negative, the neuron does not fire.
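
A quick numeric check of this card in Python (assuming, as in the card, a unit that fires only when its net input is positive):

inputs  = (0, 1)       # input pattern (0 1)
weights = (0.4, -0.1)  # weights on the 0-valued and 1-valued inputs

net = sum(w * x for w, x in zip(weights, inputs))
print(net, 1 if net > 0 else 0)  # -0.1 0 -> the unit does not fire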

13
Q

The same example: an input unit with a value of 0 and another with a value of 1 (input pattern (0 1)) feed into a single output unit, which fires only when its value is 1. The weights are 0.4 on the 0-valued input unit and -0.1 on the 1-valued unit.

How could we adjust the weights so that this situation is remedied and the spontaneous output matches our target output pattern of (1)? Formulate it in the form of a rule

A

The weight of the 0-valued input is irrelevant, since its contribution is 0 regardless of the weight. The weight of the 1-valued input must be made positive; the exact value does not matter. As a rule: adjust only the weights from active input units, increasing them until the net input exceeds 0.

14
Q

How is changing the weights so that the net input exceeds 0.0 typically carried out?

A

This can be achieved by adding a small amount, such as 0.2, to the weights. Typically, 0.2 is added to all weights, ignoring those from inputs with activation 0, as these have no effect on the net input. (See the sketch below.)
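
Continuing the numeric sketch from card 12, adding 0.2 to the active input's weight is enough to make the unit fire:

inputs  = (0, 1)
weights = [0.4, -0.1]

weights[1] += 0.2  # only the active input's weight is adjusted
net = sum(w * x for w, x in zip(weights, inputs))
print(net, 1 if net > 0 else 0)  # net ≈ 0.1 > 0 -> the unit now fires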

15
Q

All in all, describe the perceptron algorithm in words

A

For each node in the output layer:
– Calculate the error, which can only take the values -1, 0, and 1
– If the error is 0, the goal has been achieved. Otherwise, we adjust the weights
– Do not alter weights from inactivated input nodes
– Increase the weight if the error was 1, decrease it if the error was -1

16
Q

Describe the perceptron algorithm in rules

A

weight change = some small constant x (target activation - spontaneous output activation) x input activation

If we speak of error instead of the “target activation minus the spontaneous output activation”, we have:
weight change = some small constant x error x input activation

17
Q

Why not make the constant larger?

A

Although you would immediately reach your goal for the current pattern, you would sometimes disturb other things that you’ve learned: you’re not just changing one pattern, you’re changing other patterns too. You want to move incrementally so that all patterns get a chance to contribute to the final solution.

18
Q

Describe the perceptron algorithm as an equation

A

Δwji = η(tj - aj)ai = ηdjai

where η is a small learning constant, aj is the activation of output node j, tj its target value, dj = tj - aj its error, and ai the activation of input node i.

19
Q

Describe the perceptron algorithm in pseudocode

A

Start with random initial weights (e.g., uniform random in [-0.3, 0.3])
Do
{
    For All Patterns p
    {
        For All Output Nodes j
        {
            CalculateActivation(j)
            Error_j = TargetValue_j_for_Pattern_p - Activation_j
            For All Input Nodes i To Output Node j
            {
                DeltaWeight = LearningConstant * Error_j * Activation_i
                Weight = Weight + DeltaWeight
            }
        }
    }
}
Until “Error is sufficiently small” Or “Time-out”
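
A minimal runnable Python version of this pseudocode, simplified to a single output node; the OR training set at the end is an illustrative assumption, not from the lecture:

import random

def train_perceptron(patterns, targets, learning_constant=0.2, max_epochs=1000):
    # Perceptron rule for a two-layer net with one binary threshold output node
    n_inputs = len(patterns[0])
    # Start with random initial weights, e.g., uniform in [-0.3, 0.3]
    weights = [random.uniform(-0.3, 0.3) for _ in range(n_inputs)]
    for _ in range(max_epochs):                    # "Do ... Until time-out"
        total_error = 0
        for pattern, target in zip(patterns, targets):
            net = sum(w * a for w, a in zip(weights, pattern))
            activation = 1 if net > 0 else 0       # binary threshold node
            error = target - activation            # can only be -1, 0, or 1
            total_error += abs(error)
            for i, a_i in enumerate(pattern):      # a_i = 0 leaves w_i unchanged
                weights[i] += learning_constant * error * a_i
        if total_error == 0:                       # "Error is sufficiently small"
            break
    return weights

# Example: learn OR, which a two-layer perceptron can represent,
# so by the convergence theorem (card 20) learning must succeed
print(train_perceptron([(0,0), (0,1), (1,0), (1,1)], [0, 1, 1, 1]))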

20
Q

What is meant by the perceptron convergence theorem?

A

Theorem proposed by Rosenblatt: if a pattern set can be represented by a two-layer Perceptron, the Perceptron learning rule will always be able to find some correct weights. (This does not mean that it will always be fast, or that there will always be a solution.)

21
Q

In what ways was the perceptron a big hit?

A

> Spawned the first wave of ‘connectionism’
> Great interest and optimism about the future of neural networks
> The first neural network hardware was built in the late fifties and early sixties

22
Q

What were the limitations of the perceptron? (2)

A

Only binary input-output values

Only two layers

23
Q

How was the binary input-output values limitations of the perceptron solved?

A

The binary input-output limitation was remedied in 1960 by Widrow and Hoff: the perceptron now gave values between 0 and 1 rather than just 0 or 1 (allowing for gray images, not just black-and-white ones). This was called the delta rule. It was originally applied mainly by engineers, until it was later shown to be equivalent to the Rescorla-Wagner rule (1976), which describes animal conditioning very well.

24
Q

Why was the two layers limitation of the perceptron a problem?

A

Minsky and Papert (1969) showed that a two-layer Perceptron cannot represent certain logical functions. Some of these are very fundamental, in particular the exclusive or (XOR: ‘Do you want coffee XOR tea?’). It was shown that an extra layer is needed to represent XOR, but no solid training procedure existed in 1969 to accomplish this. This caused the first AI winter. (See the sketch below.)

What was needed was an algorithm to train perceptrons with more than two layers, preferably one that also used continuous activations and non-linear activation rules.
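
As a small brute-force illustration of Minsky and Papert's point (a sketch, not their proof), no single threshold unit with weights and bias drawn from a coarse grid reproduces XOR's truth table:

from itertools import product

XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}
grid = [v / 2 for v in range(-8, 9)]  # candidate values -4.0 ... 4.0

# A single threshold unit fires iff w1*a + w2*b + bias > 0
matches_xor = any(
    all((w1 * a + w2 * b + bias > 0) == bool(out)
        for (a, b), out in XOR.items())
    for w1, w2, bias in product(grid, repeat=3)
)
print(matches_xor)  # False: XOR is not linearly separable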

25
Q

How and when was the two layers limitation of the perceptron solved?

A
Such an algorithm was developed by
– Paul Werbos in 1974
– David Parker in 1982
– Yann LeCun in 1984
– Rumelhart, Hinton, and Williams in 1986 (when it was really noticed)

Rumelhart, Hinton, and Williams introduced a hidden layer into the network trained by the algorithm.

26
Q

If these advancements were made in the late eighties, why were they not popularised until the nineties?

A

> Researchers had to wait too long (days or weeks) for the computer to solve the problem
> There were not many example patterns available (e.g., handwritten numerals)

27
Q

What problem was posed by adding an extra layer to the perceptron and how was this solved?

A

It is straightforward to adjust the weights to the output layer, using the Perceptron rule. But how can we adjust the weights to the hidden layer if we don’t know what their activations should be? This was solved by the backprop trick: to find the error value for a given node h in a hidden layer, simply take the weighted sum of the errors of all nodes connected from node h, i.e., of all nodes that have an incoming connection from node h:

dh = w1d1 + w2d2 + w3d3 + … + wndn

This is called the back-propagation of errors or error back-propagation
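
In code, this weighted error sum is a single line; the error and weight values below are purely illustrative:

# Backprop trick: the error of hidden node h is the weighted sum of the
# errors of the nodes that h connects to (illustrative values)
to_node_errors = [0.3, -0.1, 0.2]    # d1, d2, d3
weights_from_h = [0.5, 0.8, -0.4]    # w1, w2, w3

d_h = sum(w * d for w, d in zip(weights_from_h, to_node_errors))
print(d_h)  # 0.5*0.3 + 0.8*(-0.1) + (-0.4)*0.2 = -0.01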

28
Q

What is the biological correlate of this back-propagation in the neuron?

A

There is none: no known mechanism exists whereby a signal generated post-synaptically travels back through the axon and arrives at the soma of that node, or further back.

29
Q

Describe the characteristics of back-propagation (5)

A

> Any number of layers
> Only feedforward, no cycles (though a more general version does allow this)
> Use continuous nodes
– Must have differentiable activation rule
– Typically, logistic: S-shape between 0 and 1
> Initial weights are random (landing on a good solution can be dependent on this)
> Total error never increases (gradient descent in error space)

30
Q

What is meant by a differentiable activation rule and why is it necessary?

A

In the original formulation, the activation rule must be differentiable because the weight-change rule uses the derivative of the activation function.

31
Q

It was mentioned that continuous node values are S-shaped between 0 and 1. What does this mean?

A

It forms a logistic function, which is approximately linear around net input = 0. Its rate of change (derivative) for a node with a given activation is: activation x (1 - activation). This means that for very high net inputs, either positive or negative, the derivative becomes close to 0. This is important because the weight-change rule contains this derivative, so extreme input values slow learning down enormously. (See the sketch below.)
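
A short sketch of the logistic activation and its derivative, activation x (1 - activation), showing how the derivative collapses toward 0 for extreme net inputs:

import math

def logistic(net):
    return 1.0 / (1.0 + math.exp(-net))

for net in (-10, -2, 0, 2, 10):
    a = logistic(net)
    print(net, round(a, 4), round(a * (1 - a), 4))
# The derivative peaks at 0.25 when net = 0 and is ~0.0 at net = +/-10,
# so weight changes nearly vanish for extreme net inputs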

32
Q

Why keep the derivative there if it slows down learning?

A

Having that derivative in there can slow down descent, but it is what finds the ‘steepest’ slope (the quickest way of reducing the error). Leaving the derivative out may descend faster at times, but the error can then also go up and down, whereas keeping the derivative means that the error can only go down or stay stable.

33
Q

Give benefits and downsides to the gradient descent

A

The error always goes down, but it may take longer than other methods; it also does not guarantee high performance and does not prevent local minima (i.e., there is no back-propagation convergence theorem). The learning rule is more complicated, and it tends to slow down learning unnecessarily when the logistic function is used.

34
Q

Describe the back-propagation algorithm in rules

A

weight change = some small constant x error x input activation

For an output node, the error is:
error = (target activation - output activation) x output activation x (1 - output activation)
i.e., the raw error x the derivative of the output activation

For a hidden node, the error is:
error = weighted sum of to-node errors x hidden activation x (1 - hidden activation)
i.e., the back-propagated error x the derivative of the hidden activation

35
Q

Back-propagation took a long time, however. How did its creators tinker with it to make it faster?

A

The learning rule is often augmented with a ‘momentum term’: a fraction of the old weight change is added to the current one. When there are large decreases in error, this helps avoid overshooting.

36
Q

What does the new learning rule look like adapted to include the momentum term?

A

weight change = some small constant x error x input activation + momentum constant x old weight change

37
Q

If j is a node in an output layer, what is the error dj?

A

dj = (tj - aj) aj (1 - aj)

where aj is the activation of node j, tj is its target activation value, and dj its error value.

38
Q

If j is a node in a hidden layer, and if there are k nodes 1, 2, …, k, that receive a connection from j, what is the error dj?

A

dj = (w1j d1 + w2j d2 + … + wkj dk) aj (1 - aj), where the weights w1j, w2j, …, wkj belong to the connections from hidden node j to nodes 1, 2, …, k.

39
Q

The backpropagation learning rule (applied at time t) is:
Δwji(t) = η dj ai + β Δwji(t-1)

Explain what this equation means and what values are typically given to the variables

A

Δwji(t) is the change in the weight from node i to node j at time t.

The learning constant η is typically chosen rather small (e.g., 0.05).

The momentum term β is typically chosen around 0.5. It is optional, but makes learning faster: the rule remembers the weight change from the previous iteration and adds a fraction of it to the current weight change. (See the sketch below.)
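
A minimal runnable sketch of the full rule on the XOR task (the 2-2-1 layer sizes, bias units, training set, and parameter values are illustrative assumptions, not from the lecture):

import math, random

def logistic(net):
    return 1.0 / (1.0 + math.exp(-net))

def train_xor(eta=0.5, beta=0.5, epochs=20000, seed=1):
    # 2-2-1 back-propagation net with momentum; implements
    # dw_ji(t) = eta * d_j * a_i + beta * dw_ji(t-1) from this card
    rnd = random.Random(seed)
    w_hid = [[rnd.uniform(-0.3, 0.3) for _ in range(3)] for _ in range(2)]
    w_out = [rnd.uniform(-0.3, 0.3) for _ in range(3)]  # 2 hidden + bias
    dw_hid = [[0.0] * 3 for _ in range(2)]              # previous weight changes
    dw_out = [0.0] * 3
    data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
    for _ in range(epochs):
        for (x1, x2), t in data:
            a_in = (x1, x2, 1.0)                        # inputs + bias
            a_hid = [logistic(sum(w * a for w, a in zip(ws, a_in)))
                     for ws in w_hid]
            a_hb = a_hid + [1.0]                        # hidden + bias
            a_out = logistic(sum(w * a for w, a in zip(w_out, a_hb)))
            d_out = (t - a_out) * a_out * (1 - a_out)   # output error (card 37)
            d_hid = [w_out[j] * d_out * a_hid[j] * (1 - a_hid[j])
                     for j in range(2)]                 # hidden errors (card 38)
            for i in range(3):                          # this card's update rule
                dw_out[i] = eta * d_out * a_hb[i] + beta * dw_out[i]
                w_out[i] += dw_out[i]
            for j in range(2):
                for i in range(3):
                    dw_hid[j][i] = eta * d_hid[j] * a_in[i] + beta * dw_hid[j][i]
                    w_hid[j][i] += dw_hid[j][i]
    return w_hid, w_out

# Outputs should approach XOR's targets, though back-propagation has no
# convergence theorem and can land in a local minimum (card 33)
w_hid, w_out = train_xor()
for (x1, x2), t in [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]:
    a_in = (x1, x2, 1.0)
    a_hb = [logistic(sum(w * a for w, a in zip(ws, a_in))) for ws in w_hid] + [1.0]
    print((x1, x2), t, round(logistic(sum(w * a for w, a in zip(w_out, a_hb))), 2))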

40
Q

Describe an application of back-propagation which demonstrated its usefulness

A

NetTalk, back-propagation’s ‘killer app’, was a text-to-speech converter that took English text and fed it to a speech synthesiser (difficult, because English pronunciation is quite irregular). It was connectionism’s answer to DECtalk, a piece of hardware about the size of a freezer that carried out the same function and had required linguists and years of study. NetTalk was software, and it learned to pronounce text with an error score comparable to DECtalk’s within a week, just by being fed training examples. The input was a letter in context; the output, a phoneme.

The input layer had 7 groups of 29 units: each group represented one letter in a sentence, so the network took in the context of the letters surrounding a particular letter. There were 80 hidden units and 26 output units (the possible pronunciations of a letter).

41
Q

What disadvantages did back-propagation have?

A

Learning is slow, and new learning will rapidly overwrite old representations unless these are interleaved (i.e., repeated) with the new patterns. This makes it hard to keep networks up-to-date with new information (e.g., the dollar rate). It also makes back-propagation very implausible as a psychological model of human memory.

42
Q

What advantages did back-propagation have?

A
> Easy to use
– Few parameters to set
– Algorithm is easy to implement
> Can be applied to a wide range of data
> Is very popular
> Has contributed greatly to the ‘new connectionism’ (second wave)