Backprop Flashcards
What is backprop?
- message passing
- a schedule for computing weight updates (according to gradient descent) in layered networks of neurons of any depth
- any layer can be seen as an independent processor passing messages forward and backward (see the layer sketch below)
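A minimal sketch of the "independent processor" view, assuming a fully connected layer with a tanh non-linearity and plain gradient-descent updates (class and variable names here are illustrative, not from the flashcards):

```python
import numpy as np

class DenseLayer:
    """One fully connected layer acting as an independent processor:
    it passes activations forward and gradients backward."""
    def __init__(self, n_in, n_out, rng):
        self.W = rng.standard_normal((n_in, n_out)) * 0.1
        self.b = np.zeros(n_out)

    def forward(self, x):
        # Forward message: activation handed to the next layer.
        self.x = x
        self.z = x @ self.W + self.b
        return np.tanh(self.z)

    def backward(self, grad_out, lr=0.1):
        # Backward message: gradient w.r.t. this layer's input,
        # computed from the gradient w.r.t. its output (chain rule).
        grad_z = grad_out * (1.0 - np.tanh(self.z) ** 2)
        grad_W = self.x.T @ grad_z      # local weight-update signal
        grad_b = grad_z.sum(axis=0)
        grad_in = grad_z @ self.W.T     # message sent to the previous layer
        self.W -= lr * grad_W           # gradient-descent weight update
        self.b -= lr * grad_b
        return grad_in
```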
Complexity of Backprop
O(|w|), i.e. linear in the number of weights |w| (see the cost sketch below)
Large networks are feasible: 10^4 to 10^6 weights
Works on DAGs of continuous units
Loops acceptable with a time-delay
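A rough counting sketch of the O(|w|) claim, assuming dense layers in which both the forward and the backward pass spend a constant number of multiply-adds per weight (the layer widths below are made up for illustration):

```python
def backprop_cost(layer_sizes):
    """Count multiply-adds for one forward + backward pass through an MLP
    with the given layer widths, e.g. [784, 256, 10]."""
    weights = sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))
    forward_ops = weights           # x @ W
    backward_ops = 2 * weights      # grad @ W.T and x.T @ grad
    return weights, forward_ops + backward_ops

print(backprop_cost([784, 256, 10]))   # (203264, 609792): ops grow linearly with |w|
```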
Complexity of the loading problem (training a fixed network architecture to fit a given training set)
NP-complete
Expressive power of MLP
A network with a single hidden layer can approximate any continuous input-output mapping to arbitrary accuracy, provided there are enough units in the hidden layer (see the sketch below)
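A minimal sketch of the family of functions this claim is about: a single hidden layer of sigmoid units whose outputs are combined linearly. The shapes and random weights below are assumptions for illustration only; no training is shown.

```python
import numpy as np

def one_hidden_layer(x, W1, b1, w2, b2):
    """f(x) = w2 . sigmoid(W1 x + b1) + b2: with enough hidden units this
    family can approximate any continuous mapping on a bounded domain."""
    h = 1.0 / (1.0 + np.exp(-(x @ W1 + b1)))   # hidden-layer activations
    return h @ w2 + b2

rng = np.random.default_rng(0)
W1 = rng.standard_normal((1, 50))   # 1 input -> 50 hidden units
b1 = rng.standard_normal(50)
w2 = rng.standard_normal((50, 1))   # 50 hidden units -> 1 output
b2 = np.zeros(1)

x = np.linspace(-3, 3, 100).reshape(-1, 1)
y = one_hidden_layer(x, W1, b1, w2, b2)   # shape (100, 1)
```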
VC dimension of SLP
n + 1 (for a perceptron with n inputs)
VC dimension of MLP
2(n+1) * M * (1 + log M)
Backprop in 1-2 sentences
The backprop procedure to compute the gradient of an objective function wrt weights of a multilayer stack of modules is nothing more than a practical application of the chain rule for derivatives. (LeCun, Bengio, Hinton, 2015).
Key insight: the derivative (or gradient) of the objective wrt the input of a module can be computed by working backwards from the gradient with respect to the output of that module.
Can be applied repeatedly to propagate the gradients through all modules, starting from the output at the top (where the net produces its predictions) all the way to the bottom (where the external data is fed).
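A compact sketch of that repeated application. Assumptions (mine, not from the source): a squared-error objective at the top, and the DenseLayer modules sketched in the first card.

```python
import numpy as np

def forward_backward(layers, x, target):
    # Forward pass: each module transforms the message and hands it on.
    out = x
    for layer in layers:
        out = layer.forward(out)

    # Gradient of the objective w.r.t. the network output (top of the stack).
    grad = 2.0 * (out - target)

    # Backward pass: each module turns "gradient w.r.t. my output" into
    # "gradient w.r.t. my input" and updates its own weights on the way.
    for layer in reversed(layers):
        grad = layer.backward(grad)
    return grad  # gradient w.r.t. the external input data (bottom of the stack)

rng = np.random.default_rng(0)
net = [DenseLayer(2, 8, rng), DenseLayer(8, 1, rng)]   # two-module stack
x = rng.standard_normal((4, 2))
y = rng.standard_normal((4, 1))
grad_wrt_input = forward_backward(net, x, y)           # shape (4, 2)
```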
What are hidden layers?
Distort the input in a non-linear way so that the categories become linearly separable by the last layer
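A minimal, hand-crafted example of that distortion (the weights are chosen by hand, purely for illustration): XOR is not linearly separable in input space, but a two-unit hidden layer maps it to a representation that a final linear unit can separate.

```python
import numpy as np

def step(z):
    return (z > 0).astype(float)   # hard threshold unit

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
xor_labels = np.array([0.0, 1.0, 1.0, 0.0])   # not linearly separable in X

# Hidden layer: unit 1 fires when at least one input is on (OR),
# unit 2 fires only when both inputs are on (AND).
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])
b1 = np.array([-0.5, -1.5])
H = step(X @ W1 + b1)   # hidden representation: [[0,0],[1,0],[1,0],[1,1]]

# In hidden space the classes ARE linearly separable: h1 - h2 > 0.5 iff XOR = 1.
w2 = np.array([1.0, -1.0])
b2 = -0.5
pred = step(H @ w2 + b2)
print(pred)              # [0. 1. 1. 0.], matches xor_labels
```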