Supervised learning Flashcards
What is the difference between the perceptron and a linear pattern associator?
In the pattern associator the output neurons use a continuous activation function (like the sigmoid), whereas the perceptron's output neuron uses a threshold function with bipolar (or binary) values
What does the continuous output allow us to do?
Quantify the error
What is the perceptron convergence theorem?
For any linearly separable problem, the perceptron will find the solution in a finite number of steps.
What is the perceptron’s network composed of?
N input units that encode the presented pattern with values xi, and a single output neuron that encodes the response with bipolar (or binary) values
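A minimal sketch of this architecture with the classic perceptron learning rule, assuming bipolar (+1/-1) targets; the function and variable names are illustrative, not from the flashcards:

```python
import numpy as np

def train_perceptron(X, targets, lr=0.1, max_epochs=100):
    """X: (n_patterns, N) input values xi; targets: bipolar (+1/-1) desired responses."""
    n_patterns, N = X.shape
    w = np.zeros(N)   # one weight per input
    b = 0.0           # bias (threshold)
    for epoch in range(max_epochs):
        mistakes = 0
        for x, t in zip(X, targets):
            y = 1 if np.dot(w, x) + b >= 0 else -1   # bipolar threshold output
            if y != t:                               # update only on errors
                w += lr * t * x
                b += lr * t
                mistakes += 1
        if mistakes == 0:  # convergence theorem: guaranteed for linearly separable problems
            break
    return w, b

# Example: the linearly separable AND problem with bipolar coding
X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]])
t = np.array([-1, -1, -1, 1])
w, b = train_perceptron(X, t)
```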
With what kind of output units can the Delta Rule be used?
output units that use a continuous and differentiable output function, like the sigmoid
What is the cost function of the delta rule?
the mean squared error between the desired output and the actual output
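Written out (a standard form of this cost; the symbols t and y for desired and actual outputs are not from the flashcards):

```latex
E(W) = \frac{1}{2} \sum_{\mu} \sum_{i} \left( t_i^{\mu} - y_i^{\mu} \right)^2
```

where mu indexes the training patterns and i the output units.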
How are the weights modified during learning?
in a direction opposite to that of the gradient of the cost function (gradient descent)
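In symbols (standard delta-rule notation, with learning rate eta and a differentiable activation g; the notation is illustrative):

```latex
\Delta w_{ij} = -\eta \frac{\partial E}{\partial w_{ij}} = \eta \, \delta_i x_j,
\qquad \delta_i = (t_i - y_i) \, g'(a_i)
```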
Describe the steps of learning with the delta rule in supervised learning.
- Input neurons are clamped to the input values
- Activation flows to the output neurons
- Output neurons’ activations are computed
- The output pattern is compared with the desired output
- The discrepancy between the two patterns is computed (error signal)
- Connection weights are modified (delta rule) in order to reduce the error, i.e. to minimize the cost function E, which depends only on the values of the connection weights W; weights are therefore modified in a direction opposite to that of the gradient of the cost function
- The procedure is repeated for all examples that form the training set (a learning epoch), and then for many epochs, until the error becomes 0 or stops decreasing (see the code sketch below)
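A minimal sketch of this training loop for a single-layer network with sigmoid outputs, assuming full-batch updates over epochs; the names (`sigmoid`, `delta_rule_train`) are illustrative:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def delta_rule_train(X, T, lr=0.5, epochs=1000):
    """X: (n_patterns, n_in) inputs; T: (n_patterns, n_out) desired outputs in [0, 1]."""
    n_in, n_out = X.shape[1], T.shape[1]
    W = np.random.randn(n_in, n_out) * 0.1
    for epoch in range(epochs):
        Y = sigmoid(X @ W)                             # clamp inputs, propagate to outputs
        error = T - Y                                  # error signal (desired - actual)
        E = 0.5 * np.mean(np.sum(error ** 2, axis=1))  # cost: mean squared error, depends only on W
        delta = error * Y * (1 - Y)                    # error term times sigmoid derivative
        W += lr * (X.T @ delta) / X.shape[0]           # step opposite to the gradient of E
        if E < 1e-4:                                   # stop when the error is (almost) zero
            break
    return W
```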
How can linearly inseparable problems be solved?
multi-layer networks
Why are multi-layer networks called universal approximators?
Because a network with at least one hidden layer can, at least in principle, approximate any X-Y (input-output) function (if we properly choose the weight values and the number of hidden units)
What is a multi-layer network?
One that has one or more intermediate layers of neurons (hidden layers) that use a non-linear activation function (like the sigmoid)
What is the error back-propagation algorithm?
it’s an extension of the delta rule (generalized delta rule) that allows learning in multi-layer networks
Describe the steps of error back propagation
- Input neurons are clamped to the input values
- Activation flows to the hidden neurons and then to the output neurons
- The output pattern is compared with the desired output
- The discrepancy between the two patterns is computed (error signal)
- The weights of the connections into the output units are modified according to the gradient of the error function (delta rule)
- For the hidden units, the error is computed by propagating the output errors backwards: each output unit’s error term is multiplied by the weight of its connection to the hidden unit, and these weighted error terms are summed
- Once the error for a hidden unit is known, the delta rule can be applied again to the connections feeding into it, because its inputs are known
- At the output level: delta rule; at the level of each previous layer: generalized delta rule (see the code sketch below)
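A minimal sketch of one backward pass for a network with a single hidden layer, following these steps; the weight matrices W1 (input to hidden) and W2 (hidden to output) and all names are illustrative:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def backprop_step(x, t, W1, W2, lr=0.5):
    """One training example: x (n_in,), t (n_out,). Returns updated W1, W2."""
    # Forward pass: input -> hidden -> output
    h = sigmoid(W1 @ x)                     # hidden activations
    y = sigmoid(W2 @ h)                     # output activations
    # Output error terms (delta rule at the output level)
    delta_out = (t - y) * y * (1 - y)
    # Hidden error terms: weighted output errors propagated backwards and summed
    delta_hid = (W2.T @ delta_out) * h * (1 - h)
    # Weight updates (generalized delta rule)
    W2 = W2 + lr * np.outer(delta_out, h)
    W1 = W1 + lr * np.outer(delta_hid, x)
    return W1, W2
```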
What is the difference between a small and large learning rate?
small: learning is slow and can get stuck in local minima
large: learning is fast but imprecise (the updates can overshoot the minimum)
What does the momentum do?
- adds a fraction of the previous weight update to the current one
- when successive updates go in the same direction, this increases the size of the step taken towards the minimum; when the gradient changes direction, momentum smooths the variation (see the sketch below)
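A minimal sketch of the momentum update, assuming a momentum coefficient alpha; the names are illustrative:

```python
import numpy as np

def momentum_update(W, grad, prev_update, lr=0.1, alpha=0.9):
    """Gradient-descent step that adds a fraction of the previous weight update."""
    update = -lr * grad + alpha * prev_update   # fraction of the previous step
    return W + update, update                   # keep the update for the next step
```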