C3 Flashcards
types of decision regions
- network with a single node -> separates the input space with just one line (a half-plane decision region)
- one-hidden layer network -> realizes a convex region: each hidden node realizes one of the lines bounding the region
- two-hidden layer network -> realizes unions of convex regions (e.g. the union of three convex regions)
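The one-hidden-layer case can be sketched concretely: three threshold units each realize one bounding line of a triangle, and the output unit fires only when all three agree (the triangle, weights, and thresholds below are illustrative choices, not from the card):

```python
import numpy as np

def step(z):
    # hard threshold: 1 if z >= 0, else 0
    return (z >= 0).astype(int)

# Hidden layer: three threshold units, one per bounding line of the
# triangle with vertices (0,0), (1,0), (0,1).
W_hidden = np.array([[ 1.0,  0.0],   # fires when x >= 0
                     [ 0.0,  1.0],   # fires when y >= 0
                     [-1.0, -1.0]])  # fires when x + y <= 1
b_hidden = np.array([0.0, 0.0, 1.0])

def in_convex_region(x):
    h = step(W_hidden @ x + b_hidden)               # which side of each line?
    return int(step(np.array([h.sum() - 2.5]))[0])  # AND: all three must fire

print(in_convex_region(np.array([0.2, 0.2])))  # inside the triangle  -> 1
print(in_convex_region(np.array([1.0, 1.0])))  # outside the triangle -> 0
```

The output unit with threshold 2.5 is simply an AND gate over the three half-planes, which is why one hidden layer suffices for a convex region.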
how to train multi-layer networks?
replace the sign function by its smooth approximation and use the gradient descent algorithm to find weights that minimize the error
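For instance, tanh can serve as the smooth approximation of sign (picking tanh here is an illustrative choice): it agrees with sign for large |z| but has a nonzero derivative everywhere, so gradients can flow:

```python
import numpy as np

# sign has zero gradient almost everywhere, so it cannot be trained by
# gradient descent; tanh is a smooth surrogate that approaches sign(z)
# for large |z| while staying differentiable.
z = np.array([-5.0, -0.5, 0.5, 5.0])
print(np.sign(z))    # hard decisions: -1 or +1
print(np.tanh(z))    # close to sign for large |z|, smooth near 0

# derivative of tanh: 1 - tanh(z)^2, strictly positive everywhere
grad = 1 - np.tanh(z) ** 2
```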
weight update rule
gradient descent method: walk in the direction yielding the maximum decrease of the network error E
Δw_ji = −η · 𝜕E / 𝜕w_ji
w_ji ← w_ji + Δw_ji
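A minimal numeric sketch of the rule on a toy error E(w) = (w − 3)², whose gradient is 2(w − 3); the learning rate value is an arbitrary illustrative choice:

```python
# Gradient descent on E(w) = (w - 3)^2: the rule Δw = -η ∂E/∂w
# repeatedly steps downhill toward the minimum at w = 3.
eta = 0.1   # learning rate (illustrative value)
w = 0.0
for _ in range(100):
    grad = 2 * (w - 3)   # ∂E/∂w
    w += -eta * grad     # Δw = -η ∂E/∂w ; w ← w + Δw
print(round(w, 4))  # -> 3.0
```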
backpropagation algorithm
the algorithm searches for weight values that minimize the total error of the network
consists of the repeated application of these two phases:
- forward pass: the network is activated on one example; the activations of all hidden nodes and the error of each neuron of the output layer are computed
- backward pass: the network error is used for updating the weights. Starting at the output layer, the error is propagated backwards through the network, layer by layer, with the help of the generalized delta rule. Finally, all weights are updated.
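The two phases can be sketched on a tiny 2-2-1 sigmoid network trained with the squared error; the network size, learning rate, and the OR task below are illustrative assumptions, not from the card:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 2)), np.zeros(2)   # hidden layer
W2, b2 = rng.normal(size=(1, 2)), np.zeros(1)   # output layer
eta = 0.5                                       # learning rate (assumed)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [1]], dtype=float)  # OR targets

for epoch in range(5000):
    for x, t in zip(X, T):
        # forward pass: hidden activations and output of the network
        h = sigmoid(W1 @ x + b1)
        y = sigmoid(W2 @ h + b2)
        # backward pass: generalized delta rule, output layer first
        delta_out = (y - t) * y * (1 - y)             # output-layer deltas
        delta_hid = (W2.T @ delta_out) * h * (1 - h)  # propagated back
        # finally, all weights are updated: Δw = -η · δ · activation
        W2 -= eta * np.outer(delta_out, h); b2 -= eta * delta_out
        W1 -= eta * np.outer(delta_hid, x); b1 -= eta * delta_hid

preds = [int(sigmoid(W2 @ sigmoid(W1 @ x + b1) + b2)[0] > 0.5) for x in X]
print(preds)
```

This is the online variant (one update per example); the batch variants below only change how many examples feed each update.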
3 update strategies
- full batch mode: weights are updated after all the inputs are processed
- (mini) batch mode: weights are updated after a small random sample of inputs is processed (Stochastic Gradient Descent)
- online mode: weights are updated after each single input is processed
advantages Stochastic Gradient Descent
- additional randomness helps to avoid local minima
- huge savings of CPU time
- easy to execute on GPU cards
stopping criteria
- total mean squared error change: backpropagation is considered to have converged when the absolute rate of change in the average squared error per epoch is sufficiently small
- generalization-based criterion: after each epoch the network is tested for generalization on a separate set of examples (validation set). If the generalization performance is adequate, stop (early stopping: avoids overfitting)
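The early-stopping criterion can be sketched as a training loop that halts when the validation error stops improving; `train_one_epoch`, `validation_error`, and the `patience` parameter are hypothetical helpers, not from the card:

```python
def train_with_early_stopping(train_one_epoch, validation_error,
                              max_epochs=1000, patience=10):
    # stop once the validation error has not improved for `patience` epochs
    best_err, best_epoch = float("inf"), 0
    for epoch in range(max_epochs):
        train_one_epoch()                 # one pass over the training set
        err = validation_error()          # test generalization on held-out data
        if err < best_err:
            best_err, best_epoch = err, epoch
        elif epoch - best_epoch >= patience:
            break                         # no improvement: stop early
    return best_err
```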
3 common error functions with corresponding activation functions of the output layer
- linear => SSE (sum of squared errors) (regression)
- logistic => cross-entropy (binary)
- softmax => cross-entropy + softmax (multiclass)
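The three pairings, written out as loss computations (a minimal sketch; the function names are illustrative):

```python
import numpy as np

def sse(y, t):
    # linear output, regression: sum of squared errors
    return float(((y - t) ** 2).sum())

def binary_cross_entropy(y, t):
    # logistic (sigmoid) output, binary classification
    return float(-(t * np.log(y) + (1 - t) * np.log(1 - y)).sum())

def softmax(z):
    e = np.exp(z - z.max())   # shift by max for numerical stability
    return e / e.sum()

def softmax_cross_entropy(z, t):
    # softmax output, multiclass classification (t is one-hot)
    return float(-(t * np.log(softmax(z))).sum())
```

Each pairing is chosen so that the output-layer delta simplifies to (y − t), which keeps the backward pass cheap.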