03 - Multi Layer Perceptron Flashcards
What are capacity, optimization and generalization?
Capacity: the range or scope of the types of functions that the model can approximate
Optimization: minimization of training error
Generalization: the model’s ability to adapt properly to new, previously unseen data, drawn from the same distribution as the one used to create the model.
Describe a fully connected NN, incl parameter sizes.
Usual form for a neural net: h = g(Wx + b)
→ h: hidden layer response, W: weight matrix (one weight vector per neuron), x: input vector, b: bias vector
Like in regression, we add a bias to be able to offset the response.
- x → n×1 vector
- h → m×1 vector
- W → m×n matrix
- Wx = a → m×1 (pre-activation response)
- b → m×1
- a + b → m×1
- h = g(a + b)
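A minimal NumPy sketch of one fully connected layer with these shapes (the sizes n = 4, m = 3 are illustrative assumptions, and ReLU stands in for a generic activation g):

```python
import numpy as np

n, m = 4, 3                      # input size n, number of neurons m (example values)
rng = np.random.default_rng(0)

x = rng.normal(size=(n, 1))      # input vector, n x 1
W = rng.normal(size=(m, n))      # weight matrix, one weight vector per neuron, m x n
b = np.zeros((m, 1))             # bias vector, m x 1

a = W @ x + b                    # pre-activation response, m x 1
h = np.maximum(a, 0)             # activation g (here ReLU), hidden response, m x 1
print(h.shape)                   # (3, 1)
```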
What are hyperparameters?
In neural nets, we train the weights and biases. Everything else that is adjustable is a hyperparameter:
- Number of layers, propagation type (fully connected, convolutional), activation function, loss function & its parameters, training iterations and batch size
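As an illustration only, such hyperparameters could be collected in a plain configuration dictionary; the names and values below are assumptions, not prescribed by the flashcards:

```python
# Hypothetical hyperparameter configuration; weights and biases are NOT listed here,
# because those are learned during training.
hyperparams = {
    "num_layers": 3,                  # number of hidden layers
    "layer_type": "fully_connected",  # propagation type: fully connected vs convolutional
    "activation": "relu",             # activation function
    "loss": "cross_entropy",          # loss function (and its parameters, if any)
    "epochs": 20,                     # training iterations over the dataset
    "batch_size": 64,                 # mini-batch size
}
```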
Explain the input, hidden and output layers in a NN
Input layer
- Vectorized version of input data
- Sometimes it is preprocessed
- Weights connect to the hidden layer
- Weights & biases are floats, not integers, so the input needs to be converted to floats (see the sketch below)
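A small sketch of this preprocessing step, assuming an 8-bit grayscale image as input (the 28×28 shape and the scaling to [0, 1] are illustrative assumptions):

```python
import numpy as np

# Hypothetical 28x28 grayscale image with integer pixel values 0..255.
image = np.random.randint(0, 256, size=(28, 28), dtype=np.uint8)

# Vectorize: flatten to a column vector, convert to float, and scale to [0, 1].
x = image.reshape(-1, 1).astype(np.float32) / 255.0
print(x.shape, x.dtype)   # (784, 1) float32
```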
Hidden layer(s)
- There is no general answer to how many layers to use; it depends on the task and should be tuned
- In the example the number of perceptrons expands; normally a compressing structure is seen
- E.g. we compress pixel values → features → class probabilities
- Whether we see an expanding or a bottleneck topology is strongly application dependent
Output Layer
- Usually no hidden-layer activation function is used for the output layer; the output unit is instead chosen to match the task (e.g. a probability for every class)
- For regression
- Linear outputs with MSE (Mean Square Error)
- For classification
- Softmax units (logistic sigmoid for two classes)
- many other options for other applications
The output before softmax: o = Wx+b (logits)
Predicted label: ŷ = softmax(o)
Loss is found via negative log-likelihood or cross entropy: NLL/CE(o, y)
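A minimal sketch of the logits → softmax → cross-entropy chain (the 3-class logits and the true label are example values; shifting by the max logit is a standard numerical-stability trick, not something stated on the flashcard):

```python
import numpy as np

def softmax(o):
    # Subtract the max logit for numerical stability; does not change the result.
    e = np.exp(o - np.max(o))
    return e / np.sum(e)

o = np.array([2.0, 0.5, -1.0])   # logits o = Wx + b (example values)
y = 0                            # true class index (example)

y_hat = softmax(o)               # predicted class probabilities
loss = -np.log(y_hat[y])         # negative log-likelihood / cross entropy
print(y_hat, loss)
```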
Normalization vs Standardization
Standardization centers data around a mean of zero and a standard deviation of one
Normalization scales data to a set range, often [0, 1], by using the minimum and maximum values.
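A short NumPy sketch of both (the data values are just an example):

```python
import numpy as np

data = np.array([2.0, 4.0, 6.0, 8.0])   # example values (assumption)

standardized = (data - data.mean()) / data.std()               # mean 0, std 1
normalized = (data - data.min()) / (data.max() - data.min())   # scaled to [0, 1]

print(standardized)   # [-1.3416, -0.4472, 0.4472, 1.3416]
print(normalized)     # [0.0, 0.3333, 0.6667, 1.0]
```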
Typical activation functions
Sigmoid:
- sigma’(a)=sigma(a)(1-sigma(a))
Tanh:
- tanh’(x)=1-tanh(x)^2
ReLU (rectified linear unit):
- relu’(x)=step(x)
- most used
The derivative can be computed from the already-computed activation output, which makes sigmoid and tanh popular; ReLU is the simplest, since its gradient is just a step function.
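A small NumPy sketch of these activations and their derivatives (setting the ReLU gradient at 0 to 0 is a common convention, not specified on the flashcard):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sigmoid_grad(a):
    s = sigmoid(a)
    return s * (1.0 - s)            # reuses the already-computed output sigma(a)

def tanh_grad(a):
    return 1.0 - np.tanh(a) ** 2    # reuses tanh(a)

def relu(a):
    return np.maximum(a, 0.0)

def relu_grad(a):
    return (a > 0).astype(a.dtype)  # step function (gradient at 0 chosen as 0 here)

a = np.array([-2.0, 0.0, 2.0])
print(sigmoid_grad(a), tanh_grad(a), relu_grad(a))
```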
Perceptron → Multilayer Perceptron
One-way computational chain, big function block with many free parameters
- Input processing: h_1 = g_1(W_1x+b_1)
- Processing of first hidden representation: h_2 = g_2(W_2h_1+b_2)
- …keep on going for each layer
Earlier, width was used: many neurons per layer but little depth. This did not work well, so now we go deep and reduce the number of neurons per layer.
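A compact sketch of this layer-by-layer chain (the layer widths, tanh as g, and the small random initialization are illustrative assumptions):

```python
import numpy as np

def forward(x, params, g=np.tanh):
    """One-way chain: h_k = g(W_k h_{k-1} + b_k) for each layer."""
    h = x
    for W, b in params:
        h = g(W @ h + b)
    return h

rng = np.random.default_rng(0)
sizes = [8, 16, 8, 4]            # example layer widths: expand, then compress
params = [(rng.normal(size=(m, n)) * 0.1, np.zeros((m, 1)))
          for n, m in zip(sizes[:-1], sizes[1:])]

x = rng.normal(size=(sizes[0], 1))
h = forward(x, params)
print(h.shape)                   # (4, 1)
```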
MLP Max Likelihood Recap
- So we have a model p, an input x, and try to predict an output y
- We use maximum likelihood (ML) to estimate the model parameters with a training dataset
W_ML = arg max_W E_{(x,y)~*p_data} log p(y|x)
→ the expectation is the mean over the m training examples
*p_data is the empirical distribution, a limited sample since we do not have access to the full population
- The model should follow this empirical distribution, such that we are able to predict future test cases
- To get high classification accuracy, we need a differentiable loss to optimize, such as cross entropy or negative log-likelihood
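A short sketch of this empirical expectation in code: the mean negative log-likelihood over m training examples (the logits and labels below are hypothetical), where maximizing the mean log-likelihood is the same as minimizing this value:

```python
import numpy as np

def softmax(o):
    e = np.exp(o - o.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Hypothetical model outputs (logits) and labels for m = 4 training examples, 3 classes.
logits = np.array([[ 2.0, 0.1, -1.0],
                   [ 0.3, 1.5,  0.2],
                   [-0.5, 0.0,  2.2],
                   [ 1.0, 1.0,  1.0]])
y = np.array([0, 1, 2, 0])

p = softmax(logits)                              # p(y|x) for every class
nll = -np.mean(np.log(p[np.arange(len(y)), y]))  # empirical expectation: mean over the m examples
print(nll)
```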