Multilayer Perceptron Flashcards
What is the shape of the decision boundary for a single layer perceptron?
Linear decision boundary
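Concretely (a standard formulation, with weight vector w and bias b), the boundary is the hyperplane where the weighted input sum crosses the threshold:

```latex
\{\, \mathbf{x} \;:\; \mathbf{w}^\top \mathbf{x} + b = 0 \,\}
```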
What is a limitation of the single-layer perceptron that is overcome by using a multi-layer perceptron?
A single-layer perceptron can only solve linearly separable problems; an MLP can also solve non-linearly-separable problems such as XOR.
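As a small illustration, here is a hand-wired sketch (not a trained network; weights and thresholds chosen by hand): two step-activation hidden units, one acting as OR and one as AND, are enough to express XOR.

```python
import numpy as np

def step(z):                     # threshold activation
    return (z > 0).astype(int)

def xor_mlp(x):
    # Hidden layer: unit 1 computes OR, unit 2 computes AND.
    h = step(x @ np.array([[1, 1], [1, 1]]) + np.array([-0.5, -1.5]))
    # Output: OR(x1,x2) AND NOT AND(x1,x2) == XOR(x1,x2).
    return step(h @ np.array([1, -1]) - 0.5)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
print(xor_mlp(X))                # -> [0 1 1 0]
```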
How is an MLP trained?
Using backpropagation
Differentiate between a single-layer and a multi-layer perceptron.
A Multi-Layer Perceptron (MLP) contains one or more hidden layers (in addition to one input and one output layer). While a single-layer perceptron can only learn linear functions, a multi-layer perceptron can also learn non-linear functions.
What is the weight update rule for the gradient descent method?
“Walk” in the direction yielding the maximum decrease of the network error E; this direction is the opposite of the gradient of E.
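In symbols, a standard formulation of the update (η is the learning rate):

```latex
w_{ij} \leftarrow w_{ij} - \eta \, \frac{\partial E}{\partial w_{ij}}
```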
What is the Delta rule?
The delta rule is a gradient descent learning rule for updating the weights of the inputs to artificial neurons in a single-layer neural network. It is a special case of the more general backpropagation algorithm.
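In its common textbook form (η the learning rate, t the target, y the neuron's output, x_i the i-th input):

```latex
\Delta w_i = \eta \, (t - y) \, x_i
```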
Briefly describe how backpropagation works.
- Computing the output of the network and the corresponding error,
- Computing the contribution of each weight to the error,
- Adjusting the weights accordingly (in proportion to their contribution to the error).
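A minimal NumPy sketch of these three steps for a one-hidden-layer sigmoid network on XOR (the learning rate, layer sizes, and epoch count are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)   # XOR inputs
t = np.array([[0], [1], [1], [0]], dtype=float)               # XOR targets

W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)   # 4 hidden units
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)
eta = 0.5

for epoch in range(5000):
    # 1. Forward pass: network output and error E.
    h = sigmoid(X @ W1 + b1)
    y = sigmoid(h @ W2 + b2)
    E = 0.5 * np.sum((y - t) ** 2)

    # 2. Backward pass: each weight's contribution to the error.
    d2 = (y - t) * y * (1 - y)            # dE/d(pre-activation), output layer
    d1 = (d2 @ W2.T) * h * (1 - h)        # dE/d(pre-activation), hidden layer

    # 3. Adjust the weights in proportion to their contribution.
    W2 -= eta * h.T @ d2;  b2 -= eta * d2.sum(axis=0)
    W1 -= eta * X.T @ d1;  b1 -= eta * d1.sum(axis=0)

print(np.round(y.ravel(), 2))   # should approach [0, 1, 1, 0] (depends on init)
```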
What are the three update strategies for backpropagation?
- Full-batch mode (all inputs at once; conceptually “correct”): weights are updated after all the inputs are processed.
- Mini-batch mode (a small, random sample of inputs; “approximate”): weights are updated after a small random sample of inputs is processed (Stochastic Gradient Descent).
- On-line mode (one input at a time): weights are updated after processing each single input.
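A sketch of the three schedules; grad(W, X, t) is a hypothetical stand-in for a function returning dE/dW on the given (subset of the) data, since only the update schedules matter here:

```python
import numpy as np

def full_batch(W, X, t, grad, eta, epochs):
    for _ in range(epochs):
        W -= eta * grad(W, X, t)                    # one update per epoch
    return W

def mini_batch(W, X, t, grad, eta, epochs, batch_size=32):
    for _ in range(epochs):
        idx = np.random.permutation(len(X))
        for s in range(0, len(X), batch_size):
            b = idx[s:s + batch_size]               # small random sample (SGD)
            W -= eta * grad(W, X[b], t[b])
    return W

def online(W, X, t, grad, eta, epochs):
    for _ in range(epochs):
        for i in np.random.permutation(len(X)):     # one input at a time
            W -= eta * grad(W, X[i:i + 1], t[i:i + 1])
    return W
```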
What is the advantage of using Stochastic Gradient Descent?
- Additional randomness helps to avoid local minima
- Huge savings in CPU time
- Easy to execute on GPU cards
- “Approximated gradient” works almost the same as “exact gradient” (almost the same convergence rate)
Give examples of a stopping criterion and an EARLY stopping criterion
- Total mean squared error change: backprop is considered to have converged when the absolute rate of change in the average squared error per epoch is sufficiently small.
- Generalization-based criterion: after each epoch the NN is tested for generalization using a different set of examples (the validation set). If the generalization performance is adequate, then stop. (Early Stopping)
Early Stopping: stop training as soon as the error on the validation set increases.
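A minimal early-stopping loop; train_one_epoch, validation_error, and model.copy() are hypothetical placeholders for the actual training and evaluation code:

```python
def train_with_early_stopping(model, patience=5, max_epochs=1000):
    best_err, best_state, bad_epochs = float("inf"), None, 0
    for epoch in range(max_epochs):
        train_one_epoch(model)            # one pass over the training set
        err = validation_error(model)     # error on the held-out validation set
        if err < best_err:
            best_err, best_state, bad_epochs = err, model.copy(), 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:    # validation error keeps increasing
                break                     # -> stop early
    return best_state                     # weights with the best validation error
```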
What happens when there are too many or too few hidden units? How do you solve this?
- Too few hidden units may prevent the network from adequately learning the data and the underlying concept (underfitting).
- Too many hidden units lead to overfitting.
- We can solve this by choosing the optimal number of hidden units via a cross-validation scheme, as sketched below.
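One possible such scheme, sketched with scikit-learn (assuming features X and labels y are already loaded; the candidate sizes are arbitrary example values):

```python
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

def pick_hidden_units(X, y, candidates=(2, 4, 8, 16, 32, 64)):
    scores = {}
    for h in candidates:
        clf = MLPClassifier(hidden_layer_sizes=(h,), max_iter=2000)
        scores[h] = cross_val_score(clf, X, y, cv=5).mean()
    return max(scores, key=scores.get)   # size with the best mean CV score
```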
What kind of activation and error functions would you use for:
- Regression problems
- Binary Classification problems
- Multiclass classification
- For regression problems, use linear outputs and the sum-squared-error function.
- For binary classification problems, use a logistic output unit and minimize the cross-entropy function.
- For multi-class classification problems, use the softmax activation function and minimize the cross-entropy function.
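A NumPy sketch of the three pairings (the pre-activations z and the targets are made-up example values):

```python
import numpy as np

z = np.array([1.0, -0.5, 2.0])             # made-up pre-activations (logits)

# Regression: linear (identity) outputs + sum-squared error
y = z
t = np.array([0.8, 0.0, 1.5])              # made-up regression targets
sse = 0.5 * np.sum((y - t) ** 2)

# Binary classification: logistic output unit + cross-entropy
p = 1.0 / (1.0 + np.exp(-z[0]))            # single logistic unit
t_bin = 1.0
bce = -(t_bin * np.log(p) + (1 - t_bin) * np.log(1 - p))

# Multi-class classification: softmax outputs + cross-entropy
probs = np.exp(z - z.max())
probs /= probs.sum()
ce = -np.log(probs[2])                     # cross-entropy for true class 2

print(sse, bce, ce)
```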