Training an MLP Flashcards
Simple MLP Neural Network
A feed-forward artificial MLP that learns a Boolean function from training data. A Boolean function takes a Boolean vector as input and produces a Boolean vector as output. Here the function maps 5 input bits to 3 output bits, so there are 2^5 = 32 possible input combinations. Roughly 20% of the input-target vector pairs are held out for testing.
Creating a training set
Create the Boolean function (any 5-input, 3-output Boolean function will work), e.g. ABCDE -> AB!D, where the three output bits are A, B and NOT D.
Use 26 of the 32 pairs (about 80%) for training and the remaining 6 for testing. Store the two sets in separate CSV files. The input-target pairs must be in random order.
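A minimal sketch of building the two CSVs, assuming the example function ABCDE -> AB!D; the file names train.csv and test.csv are illustrative choices, not part of the notes.

```python
import csv
import itertools
import random

rows = []
for bits in itertools.product([0, 1], repeat=5):   # all 2^5 = 32 input combinations
    a, b, c, d, e = bits
    target = (a, b, 1 - d)                         # output bits A, B, NOT D (assumed example)
    rows.append(list(bits) + list(target))

random.shuffle(rows)                               # input-target pairs in random order
train, test = rows[:26], rows[26:]                 # 26 for training, 6 for testing

for name, split in [("train.csv", train), ("test.csv", test)]:
    with open(name, "w", newline="") as f:
        csv.writer(f).writerows(split)
```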
Training the MLP
The goal is for the MLP to learn the function. The network is trained on the 26 training examples until it produces the correct output for them. Training means finding a set of weights that classifies the training examples correctly. The topology and architecture of the MLP do not change during training; all training does is find a set of weights that implements the function being learnt.
Finding a set of optimal weights
Finding an optimal set of weights is not straightforward; in general it is NP-hard. We therefore use error backpropagation to find a good set of weights, and we hope that these weights also correctly classify the testing examples.
MLP Structure
Neurons are organized in layers. The first layer is the input layer, followed by one or more hidden layers and then the output layer. Data flows from the inputs X to the outputs Y. Every neuron in every layer except the output layer is connected to every neuron in the next layer, and each connection has a weight, which is a positive or negative real number. Every neuron in every layer except the input layer takes the outputs of the neurons in the previous layer as its inputs.
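A minimal sketch of this layered structure as weight matrices, assuming a 5-4-3 network (5 inputs, one hidden layer of 4 neurons, 3 outputs); the hidden-layer size is an assumption for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
W_hidden = rng.uniform(-1, 1, size=(5, 4))   # one weight per input -> hidden connection
W_output = rng.uniform(-1, 1, size=(4, 3))   # one weight per hidden -> output connection
print(W_hidden.shape, W_output.shape)        # (5, 4) (4, 3)
```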
Feed Forward
Feed forward produces an output (at all 3 neurons in the output layer) when an input is presented to the network. It involves only vector and matrix arithmetic plus passing vector components through the sigmoid() nonlinear function. Before learning starts the weights are randomly initialized to real numbers between -1 and 1, so at first the network usually produces large errors. These errors are used by error backprop to modify the weights. An epoch is one pass of all training examples through the network; training usually requires many epochs (hundreds or thousands).
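A minimal feed-forward sketch for the assumed 5-4-3 network, with weights initialized uniformly in [-1, 1] as described; the layer sizes and example input are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W_hidden = rng.uniform(-1, 1, size=(5, 4))   # random initial weights in [-1, 1]
W_output = rng.uniform(-1, 1, size=(4, 3))

def feed_forward(x):
    h = sigmoid(x @ W_hidden)    # hidden-layer outputs
    y = sigmoid(h @ W_output)    # output-layer outputs (3 values)
    return y

print(feed_forward(np.array([1, 0, 1, 1, 0])))
```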
MLP Training Notes
For every epoch you must keep track of the good and bad facts.
Good fact: a training example is presented to the network, feed forward produces the network output, and every neuron in the output layer has an error less than or equal to the error threshold (usually 0.2).
Bad fact: a training example is presented to the network, feed forward produces the network output, and at least one neuron in the output layer has an error greater than the error threshold (usually 0.2).
The number of bad facts per epoch is plotted against the epoch number in a graph; the curve does not fall steadily, because MLP learning is non-monotonic.
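A sketch of the good/bad fact bookkeeping for one epoch, assuming a feed_forward(x) function like the one above and an error threshold of 0.2; the function and variable names are illustrative.

```python
import numpy as np

ERROR_THRESHOLD = 0.2

def count_facts(training_set, feed_forward):
    good, bad = 0, 0
    for x, target in training_set:                      # one epoch: every training example once
        output = feed_forward(np.asarray(x))
        errors = np.abs(np.asarray(target) - output)    # per-output-neuron error
        if np.all(errors <= ERROR_THRESHOLD):
            good += 1                                   # good fact: every output within threshold
        else:
            bad += 1                                    # bad fact: at least one output off
    return good, bad

# collecting the bad count each epoch gives the values to plot against epoch number
```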
Use of the Chain Rule
MLPs learn from training data by adjusting the weights of their connections based on a measurement of the error (the loss function) in their predictions. MLP learning involves finding the minimum of a loss function, which quantifies the difference between the predicted and actual outcomes. The goal is to adjust the weights so that this difference (or error) is minimized across all predictions.
Gradient descent
the method used to find this minimum. It requires calculating the gradient of the loss function (through the activation functions) with respect to each weight in the network; the gradient indicates the direction in which to adjust each weight to reduce the error.
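The update itself is simple once the gradient is known; a minimal sketch for a single weight, where the learning rate value is an illustrative assumption:

```python
learning_rate = 0.1   # assumed example value

def update_weight(w, dE_dw):
    # move the weight a small step against the gradient of the error
    return w - learning_rate * dE_dw
```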
The loss function
a composite function, so calculating its derivative with respect to each weight involves applying the chain rule. The chain rule lets us break the derivative of a complex function down into manageable parts: each layer's output is a function of its input, and the loss function is ultimately a composite of all these functions.
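A sketch of the chain rule for one output-layer weight w, assuming a squared-error loss E = 0.5*(t - y)^2 and a sigmoid activation y = sigmoid(h*w); the numeric values are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

h, w, t = 0.6, 0.4, 1.0          # hidden output, weight, target (example values)
z = h * w                        # weighted sum into the output neuron
y = sigmoid(z)                   # output of the neuron

dE_dy = -(t - y)                 # derivative of the loss w.r.t. the output
dy_dz = y * (1 - y)              # derivative of the sigmoid w.r.t. its input
dz_dw = h                        # derivative of the weighted sum w.r.t. the weight
dE_dw = dE_dy * dy_dz * dz_dw    # chain rule: multiply the manageable parts together
print(dE_dw)
```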
Matrix Arithmetic
Two matrices A and B can be multiplied only if the number of columns of A equals the number of rows of B. The product matrix has as many rows as A and as many columns as B, and each entry is the dot product of a row of A with a column of B.
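A small sketch of the dimension rule: A is 2x3 and B is 3x2, so the product is 2x2; the matrix values are arbitrary examples.

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])      # 2 rows x 3 columns
B = np.array([[7, 8],
              [9, 10],
              [11, 12]])       # 3 rows x 2 columns

product = A @ B                # each entry is a row-of-A dot column-of-B
print(product.shape)           # (2, 2)
print(product)
```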
Error Backpropagation is Gradient Descent
Suppose a network has just 2 weights, w1 and w2. If you take many possible values of these two weights and plot the output error of the network over them as a surface plot, you get an error surface. Backprop tries to find the region of this surface with the lowest error; this is gradient descent. Given starting values of w1 and w2, the algorithm works its way downhill towards the area of lowest error.
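A sketch of "walking downhill" on a toy two-weight error surface; the bowl-shaped error function, starting point and learning rate are illustrative assumptions, not the network's real error surface.

```python
def error(w1, w2):
    return (w1 - 1.0) ** 2 + (w2 + 0.5) ** 2    # toy surface with its minimum at w1=1, w2=-0.5

w1, w2, lr = 3.0, 2.0, 0.1                      # starting values and learning rate
for _ in range(100):
    grad_w1 = 2 * (w1 - 1.0)                    # partial derivatives of the error
    grad_w2 = 2 * (w2 + 0.5)
    w1 -= lr * grad_w1                          # step against the gradient (downhill)
    w2 -= lr * grad_w2

print(w1, w2, error(w1, w2))                    # ends near the lowest point of the surface
```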
MLP Hyperparameters
Number of hidden layers
Number of neurons in each layer
Activation Function
Learning Rate
Error Threshold
Number of Epochs
Training Termination Criteria
Evaluation metric
Neuron Bias Term
Dropout
Finding optimal hyperparameters for a given MLP is an ongoing research question.
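One way to keep these hyperparameters together is a single configuration object; every value below is an illustrative assumption, not a recommended setting.

```python
hyperparameters = {
    "hidden_layers": 1,
    "neurons_per_hidden_layer": 4,
    "activation": "sigmoid",
    "learning_rate": 0.1,
    "error_threshold": 0.2,
    "max_epochs": 1000,
    "termination": "zero bad facts or max_epochs reached",
    "evaluation_metric": "bad facts per epoch",
    "use_bias": True,
    "dropout_rate": 0.0,
}
```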