Multilayer Perceptrons Flashcards
Architecture-wise inspirations from biology
- simple elements
- massively parallel systems
- low precision and robustness
- distributed representation of information
- no separation between data and program
Learning-wise inspirations from biology (inductive learning)
- data driven
- learning and self-organization in adaptive systems vs. deduction and programming
- biologically inspired learning rules
The basic features of multilayer perceptrons
- The model of each neuron in the network includes a nonlinear activation function that is differentiable.
- The network contains one or more layers that are hidden from both the input and output nodes.
- The network exhibits a high degree of connectivity, the extent of which is determined by synaptic weights of the network.
Recurrent neural network applications
- models of dynamic systems
- spatio-temporal pattern analysis
- sequence processing
- associative memory and pattern completion (Hopfield networks, Boltzmann machines, infinite impulse response (IIR) networks, long short-term memory networks (LSTM))
- The brain is a recurrent system (!)
Feed-forward neural network applications
- association between variables
- prediction of attributes (Multilayer-Perceptron (MLP), Radial Basis Function (RBF) network, Support Vector Machines (SVM))
Prediction of attributes
Regression (find the best-fitting continuous value(s) for one or several target attributes) and classification (assign novel data to one of several discrete classes)
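A minimal sketch of the two prediction tasks using scikit-learn's MLPRegressor and MLPClassifier; the toy data and hyperparameters here are illustrative assumptions, not part of the flashcards:
```python
# Toy illustration: regression vs. classification with small MLPs.
import numpy as np
from sklearn.neural_network import MLPRegressor, MLPClassifier

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))

# Regression: predict a continuous value (here, a noisy function of the inputs).
y_reg = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)
reg = MLPRegressor(hidden_layer_sizes=(10,), max_iter=2000).fit(X, y_reg)

# Classification: assign each input to one of several discrete classes.
y_cls = (X[:, 0] * X[:, 1] > 0).astype(int)
cls = MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000).fit(X, y_cls)

print(reg.predict(X[:3]))   # continuous outputs
print(cls.predict(X[:3]))   # class labels
```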
Error or cost functions
The cost function quantifies the cost of a wrong prediction and determines the “goodness” of a solution (the more wrong the prediction, the higher the cost).
Cost function types
- The quadratic error function balances punishing very wrong answers while forgiving small errors.
- The linear error function is similar, but punishes small errors more and large errors less than the quadratic one.
- The maximum-penalty function caps the cost of a wrong prediction.
- Sometimes small errors can be tolerated and not punished at all.
- The quadratic error function is the maximum-likelihood estimator under Gaussian noise.
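A small sketch of these cost function shapes; the function names and the tolerance/cap values are illustrative assumptions:
```python
# Compare the cost-function shapes described above on a few example errors.
import numpy as np

def quadratic_cost(error):
    # Punishes large errors strongly, forgives small ones; ML estimator for Gaussian noise.
    return error ** 2

def linear_cost(error):
    # Punishes small errors more and large errors less than the quadratic cost.
    return np.abs(error)

def capped_cost(error, cap=1.0):
    # Maximum penalty: the cost of a wrong prediction is capped at `cap`.
    return np.minimum(error ** 2, cap)

def tolerant_cost(error, tol=0.1):
    # Small errors (within `tol`) are tolerated and not punished at all.
    return np.maximum(np.abs(error) - tol, 0.0)

errors = np.array([-2.0, -0.5, -0.05, 0.05, 0.5, 2.0])
for cost in (quadratic_cost, linear_cost, capped_cost, tolerant_cost):
    print(cost.__name__, cost(errors))
```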
Generalization error
measures the expected performance of the model on new, unseen data as the average error cost per prediction
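A hedged sketch of estimating this quantity as the average cost over a held-out test set; the model, data, and cost function below are placeholders:
```python
# Estimate the generalization error as the average cost per prediction on unseen data.
import numpy as np

def average_cost(predict, X_test, y_test, cost=lambda e: e ** 2):
    # Mean cost of the prediction errors over held-out test samples.
    errors = y_test - predict(X_test)
    return np.mean(cost(errors))

# Example with a trivial "model" that always predicts zero:
X_test = np.linspace(0, 1, 100)
y_test = np.sin(2 * np.pi * X_test)
print(average_cost(lambda X: np.zeros_like(X), X_test, y_test))
```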
Gradient descent
is an optimization method that interprets the training error as an error landscape over the model parameters w; the model is improved by changing w in the direction opposite to the gradient, i.e., along the steepest descent of the error landscape.
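A minimal gradient descent loop for a linear model with quadratic cost; the data, step size, and iteration count are assumed for illustration:
```python
# Hand-rolled gradient descent on the mean squared error of a linear model.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)

w = np.zeros(3)        # model parameters (a point in the error landscape)
eta = 0.1              # step size / learning rate (assumed hyperparameter)

for step in range(200):
    grad = -2.0 / len(y) * X.T @ (y - X @ w)   # gradient of the mean squared error
    w -= eta * grad                            # move opposite to the gradient
print(w)  # should end up close to true_w
```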
The partial derivative of the individual cost can be split up via ______ into ___________
1. applying the chain rule,
2. one term which depends on the cost/error function and a second term which depends on the model class
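A sketch of this split in standard notation; the symbols (per-sample cost $E_n$, model output $y_n = f(x_n, w)$) are assumed here, not taken from the flashcards:
```latex
% Chain-rule split of the per-sample cost gradient (notation assumed):
% E_n depends on the model output y_n = f(x_n, w), which depends on the weights w.
\[
\frac{\partial E_n}{\partial w}
  \;=\;
  \underbrace{\frac{\partial E_n}{\partial y_n}}_{\text{cost/error function}}
  \cdot
  \underbrace{\frac{\partial y_n}{\partial w}}_{\text{model class}}
\]
```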
The main problems with the standard gradient descent
- It does not always converge, and when it does, convergence can be quite slow.
- It is unclear how to choose the correct step size (learning rate) for the gradient update.
- Easily gets stuck in local minima.
Backpropagation
Backpropagation of errors is a computationally efficient method for calculating the derivatives required to determine the parameters of Multilayer Perceptrons (MLPs) via gradient descent. It consists of the repeated application of the chain rule, together with a clever notation of recursively defined local errors, to obtain the gradient of the cost with respect to every weight.
The backpropagation phases (first)
- In the forward phase, the synaptic weights of the network are fixed and the input signal is propagated through the network, from parents to children, until it reaches the output. The function signals of the network are computed on a neuron-by-neuron basis. Thus, in this phase, changes are confined to the activation potentials and outputs of the neurons in the network. Forward propagation step: calculation of activities
The backpropagation phases (second)
- In the backward phase, an error signal is produced by comparing the output of the network with a desired response. The resulting error signal is propagated through the network, but this time the propagation is performed in the backward direction, from children to parents. In this second phase, successive adjustments are made to the synaptic weights of the network. The backward pass starts at the output layer by passing the error signals leftward through the network, layer by layer, and recursively computing the δ (i.e., the local gradient) for each neuron. This recursive process permits the synaptic weights of the network to undergo changes in accordance with the delta rule. Backpropagation step: calculation of “local errors”
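A compact NumPy sketch of both phases for a one-hidden-layer MLP with sigmoid units and quadratic cost; the layer sizes, learning rate, and toy data are illustrative assumptions:
```python
# Forward phase: compute activities. Backward phase: recursively compute local
# errors (deltas) and adjust the weights with the delta rule.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 4))                    # inputs
T = (X.sum(axis=1, keepdims=True) > 0) * 1.0    # desired responses (toy targets)

W1 = rng.normal(scale=0.5, size=(4, 8))         # input -> hidden weights
W2 = rng.normal(scale=0.5, size=(8, 1))         # hidden -> output weights
eta = 0.5                                       # assumed learning rate

for epoch in range(500):
    # ---- Forward phase: weights fixed, signals propagated neuron by neuron ----
    H = sigmoid(X @ W1)          # hidden activities
    Y = sigmoid(H @ W2)          # output activities

    # ---- Backward phase: compare output with desired response, propagate deltas ----
    delta_out = (Y - T) * Y * (1 - Y)            # local gradient at the output layer
    delta_hid = (delta_out @ W2.T) * H * (1 - H) # local gradients at the hidden layer

    # Delta rule: adjust each weight opposite to its gradient
    W2 -= eta * H.T @ delta_out / len(X)
    W1 -= eta * X.T @ delta_hid / len(X)

print("mean squared error after training:", np.mean((Y - T) ** 2))
```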