3 Multilayer Perceptrons Flashcards
What is the form of a fully connected NN layer?
h = g(Wx + b)
W: Matrix of weights, one vector per neuron
x: one input example (vector)
b: vector of biases, one scalar per neuron
h: hidden layer response
g: the activation function
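A minimal numpy sketch of this layer equation; the layer sizes (3 inputs, 4 neurons) and the choice of ReLU for g are illustrative assumptions:

```python
import numpy as np

def fully_connected(x, W, b, g):
    """One fully connected layer: h = g(Wx + b)."""
    return g(W @ x + b)

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))  # weight matrix: one 3-dim weight vector per neuron
b = np.zeros(4)              # bias vector: one scalar per neuron
x = rng.normal(size=3)       # one input example (vector)
h = fully_connected(x, W, b, g=lambda z: np.maximum(0.0, z))  # g = ReLU (assumed)
print(h.shape)               # (4,): hidden layer response, one value per neuron
```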
What are the hyperparameters of multilayer perceptrons?
Number of layers
Propagation types: fully connected, convolutional
Activation functions
Loss functions and parameters
Training iterations and batch size
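As a quick illustration, these hyperparameters could be collected in a config like the hypothetical sketch below; all field names and values are made-up assumptions:

```python
# Hypothetical MLP hyperparameter configuration; names/values are illustrative.
mlp_config = {
    "num_layers": 3,                   # number of layers
    "propagation": "fully_connected",  # vs. "convolutional"
    "activation": "relu",              # activation function
    "loss": "cross_entropy",           # loss function (and its parameters)
    "num_epochs": 50,                  # training iterations
    "batch_size": 32,                  # batch size
}
```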
What is important about the input layer?
- A vectorized version of the input data
- Sometimes preprocessed
- Weights connect it to the hidden layers
What is important about the hidden layer?
Number of hidden layers
Number of neurons
Topology
Design is application-dependent
What is topology?
Refers to the way neurons are connected
Whether the hidden layers are expanding or form a bottleneck
Often the number of neurons is reduced in the layers after the input
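A tiny sketch of what such a bottleneck topology might look like as layer widths; the sizes are made-up assumptions:

```python
# Illustrative bottleneck topology: neuron counts shrink after the input layer.
layer_widths = [784, 256, 64, 10]  # input -> hidden -> hidden -> output
```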
What are the different types of output layer?
For regression:
- Linear output with MSE
For classification:
- Softmax units
- Logistic (sigmoid) units for two classes
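A small numpy sketch of both classification output units; the example logits are made up:

```python
import numpy as np

def softmax(z):
    """Softmax output unit for multi-class classification."""
    z = z - z.max()          # shift by the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def logistic(z):
    """Logistic (sigmoid) output unit for two-class problems."""
    return 1.0 / (1.0 + np.exp(-z))

print(softmax(np.array([2.0, 1.0, 0.1])))  # class probabilities summing to 1
print(logistic(0.5))                       # probability of the positive class
```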
What are the most used activation functions?
Sigmoid: sigmoid(x) = 1/(1 + exp(-x))
tanh: tanh(x)
ReLU: relu(x) = max(0, x)
ReLU typically learns much faster and better than the others
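The three activations as a minimal numpy sketch; the sample inputs are arbitrary:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))   # squashes to (0, 1)

def tanh(x):
    return np.tanh(x)                 # squashes to (-1, 1)

def relu(x):
    return np.maximum(0.0, x)         # max(0, x): zero for negative inputs

x = np.linspace(-3, 3, 7)
for g in (sigmoid, tanh, relu):
    print(g.__name__, np.round(g(x), 2))
```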
What is the validation set used for?
Fine-tuning hyperparameters (e.g., the number of layers, neurons, or epochs)
What is a common method to increase capacity?
More neurons
What is a multilayer perceptron (MLP)?
What does it look like? (hidden layers)
A feedforward network -> one-way computational chain
Input x, first hidden representation:
h1 = g1(W1 x + b1)
Then the next layer:
h2 = g2(W2 h1 + b2)
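A minimal sketch of this one-way chain in numpy; the layer sizes (5 -> 8 -> 3) and ReLU activations are illustrative assumptions:

```python
import numpy as np

def mlp_forward(x, params, activations):
    """Feedforward chain: h_k = g_k(W_k h_{k-1} + b_k), with h_0 = x."""
    h = x
    for (W, b), g in zip(params, activations):
        h = g(W @ h + b)
    return h

rng = np.random.default_rng(1)
relu = lambda z: np.maximum(0.0, z)
params = [
    (rng.normal(size=(8, 5)), np.zeros(8)),  # layer 1: 5 inputs -> 8 neurons
    (rng.normal(size=(3, 8)), np.zeros(3)),  # layer 2: 8 -> 3 neurons
]
h2 = mlp_forward(rng.normal(size=5), params, activations=[relu, relu])
print(h2.shape)  # (3,)
```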
What is the universal approximation of MLP?
We can think of Multilayer perceptron as a big function block with many free parameters.
Even with a single hidden layer, any function can be represented (as long as we use a non-linearity). We are not guaranteed, however, that the training algorithm will be able to learn that function.
In practice, single-layer nets may not train well for a task. Instead, we go deep and reduce the number of neurons per layer.
What is an Epoch?
And what happens if we use too many or too few?
A hyperparameter that is one complete pass through the entire training dataset.
The model sees one example at a time, in order, and its parameters are updated based on the error made on that example (sketched below).
Typically, DNNs are trained for a large number of epochs.
Too few: may underfit the data (unable to capture the underlying patterns)
Too many: may overfit (fitting noise in the data rather than the underlying patterns)
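A minimal sketch of epochs with per-example (online) updates, for a linear model on made-up data; the learning rate and epoch count are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))              # 100 training examples
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

w, lr, num_epochs = np.zeros(3), 0.01, 20  # num_epochs is the hyperparameter
for epoch in range(num_epochs):            # one epoch = one full pass over the data
    for xi, yi in zip(X, y):               # one example at a time, in order
        error = xi @ w - yi                # error made on this example
        w -= lr * error * xi               # update the parameters on that error
```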
How can we determine the optimal number of epochs?
Early stopping: decides when to stop the training based on validation error (see the sketch below)
Cross-validation
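A sketch of early stopping: halt when the validation loss stops improving for `patience` epochs in a row. `train_one_epoch` and `val_loss` are hypothetical placeholders for your own training code:

```python
import numpy as np

def fit(train_one_epoch, val_loss, max_epochs=200, patience=5):
    """Train until validation loss fails to improve `patience` epochs in a row."""
    best, bad_epochs = np.inf, 0
    for epoch in range(max_epochs):
        train_one_epoch()               # hypothetical: one pass over the data
        loss = val_loss()               # hypothetical: loss on the validation set
        if loss < best:
            best, bad_epochs = loss, 0  # improvement: reset the counter
        else:
            bad_epochs += 1
            if bad_epochs >= patience:  # no recent improvement: stop training
                break
    return best
```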
Generalization error
We only have access to a limited sample (not the full population)
Empirical distribution p̂_data
We want our model to predict future test cases:
It is a measure of how well the DNN is able to generalize its knowledge from the training data to new, unseen data.
What separates machine learning from optimization is that we want the generalization error, also called the test error, to be low as well.
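A toy sketch of measuring this gap: error on the limited training sample vs. error on held-out, unseen data; the data and the fixed decision rule are made up:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(120, 2))
y = (X[:, 0] + 0.5 * rng.normal(size=120) > 0).astype(int)
X_train, y_train = X[:100], y[:100]      # the limited sample we train on
X_test, y_test = X[100:], y[100:]        # unseen data (stand-in for future cases)

predict = lambda Z: (Z[:, 0] > 0).astype(int)    # hypothetical fixed model
train_error = np.mean(predict(X_train) != y_train)
test_error = np.mean(predict(X_test) != y_test)  # estimate of generalization error
print(train_error, test_error)
```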
Recall Maximum likelihood
A neural model p tries to predict an output label y from an image x
ML estimate using a training set:
W_ML = argmax_W E_{(x,y) ~ p̂_data} [log p(y | x; W)]
The expectation is the mean over the m training images
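A tiny numpy sketch of that expectation: it is just the average log-likelihood over the m training examples; the predicted probabilities are made-up values from a hypothetical model:

```python
import numpy as np

# Hypothetical model outputs: p(y_i | x_i; W) for each of m = 4 training images.
probs = np.array([0.9, 0.7, 0.85, 0.6])
objective = np.mean(np.log(probs))  # E_{(x,y) ~ p̂_data}[log p(y | x; W)]
# W_ML maximizes this over W (equivalently, minimizes the negative log-likelihood).
print(objective)
```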