3 Multilayer Perceptrons Flashcards
What is the form for fully connected NNs?
h = g(Wx + b)
W: Matrix of weights, one vector per neuron
x: one input example (vector)
b: vector of biases, one scalar per neuron
h: hidden layer response
g: the activation function
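A minimal NumPy sketch of this formula; the sizes (4 inputs, 3 neurons) and the tanh activation are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: 4 inputs, 3 neurons (both assumptions for this sketch).
W = rng.standard_normal((3, 4))  # matrix of weights, one row vector per neuron
b = np.zeros(3)                  # vector of biases, one scalar per neuron
x = rng.standard_normal(4)       # one input example (vector)

g = np.tanh                      # the activation function (tanh chosen here)
h = g(W @ x + b)                 # hidden layer response, shape (3,)
```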
What are the hyperparameters of multilayer perceptrons?
Number of layers
Propagation types: fully connected, convolutional
Activation functions
Loss functions and parameters
Training iterations and batch size
What is important about the input layer?
- A vectorized version of the input data
- Sometimes preprocessed
- Weights connect it to the hidden layers
What is important about the hidden layer?
Number of hidden layers
Number of neurons
Topology
Design is application dependent
What is topology?
Refers to the way neurons are connected
Whether the hidden layers are expanding or form a bottleneck.
Often the number of neurons is reduced in the layers after the input, as in the sketch below.
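A small sketch of such a funnel topology (all widths are made-up examples):

```python
# Illustrative "funnel" topology: neuron counts shrink after the input layer
# toward a bottleneck. All widths here are made-up examples.
layer_widths = [784, 256, 64, 10]  # input -> hidden -> hidden -> output
for n_in, n_out in zip(layer_widths, layer_widths[1:]):
    print(f"{n_in} -> {n_out}: {n_in * n_out + n_out} parameters")
```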
What are the different types of output layer?
For regression:
- Linear output with MSE
For classification:
- Softmax units for multiple classes
- Logistic (sigmoid) unit for two classes
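A sketch of both classification output types in NumPy; the logits and class count are made-up values:

```python
import numpy as np

def softmax(z):
    z = z - z.max()           # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()        # probabilities summing to 1

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

logits = np.array([2.0, 0.5, -1.0])  # assumed raw outputs for 3 classes
print(softmax(logits))               # softmax output for multi-class
print(sigmoid(0.7))                  # logistic output for two classes
```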
What are the most used activation functions?
Sigmoid: sigmoid(x) = 1/(1+exp(-x))
tanh: tanh(x)
ReLU: ReLU(x) = max(0, x)
ReLU typically learns much faster and better than the others.
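The three formulas above translate directly to NumPy:

```python
import numpy as np

x = np.linspace(-3.0, 3.0, 7)
sigmoid = 1.0 / (1.0 + np.exp(-x))  # sigmoid(x) = 1/(1+exp(-x))
tanh = np.tanh(x)                   # tanh(x)
relu = np.maximum(0.0, x)           # ReLU(x) = max(0, x)
print(np.column_stack([x, sigmoid, tanh, relu]))
```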
What is the validation set used for?
Fine-tuning: the validation set is used to tune hyperparameters and to decide when to stop training.
What is a common method to increase capacity?
More neurons
What is a multilayer perceptron (MLP)?
What does it look like? (hidden layers)
A feedforward network -> one-way computational chain
Input x, first hidden representation
h1 = g1(W1 x + b1)
Then next layer
h2 = g2(W2 h1 + b2)
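A minimal sketch of this two-layer chain in NumPy; the layer sizes, the 0.1 weight scale, and the tanh activations are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed sizes: 4 inputs, 8 units in layer 1, 5 units in layer 2.
W1, b1 = 0.1 * rng.standard_normal((8, 4)), np.zeros(8)
W2, b2 = 0.1 * rng.standard_normal((5, 8)), np.zeros(5)

x = rng.standard_normal(4)
h1 = np.tanh(W1 @ x + b1)   # h1 = g1(W1 x + b1)
h2 = np.tanh(W2 @ h1 + b2)  # h2 = g2(W2 h1 + b2)
```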
What is the universal approximation property of MLPs?
We can think of a multilayer perceptron as a big function block with many free parameters.
Even with a single hidden layer, any function can be represented (as long as we use a non-linearity). We are not guaranteed, however, that the training algorithm will be able to learn that function.
In practice, single-layer nets may not train well for a task. Instead, we go deep and reduce the number of neurons per layer, as the parameter count below illustrates.
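A quick parameter count (with made-up widths) showing how a deep, narrow net can use far fewer parameters than one very wide hidden layer:

```python
def n_params(widths):
    # Total weights + biases for a chain of fully connected layers.
    return sum(n_in * n_out + n_out for n_in, n_out in zip(widths, widths[1:]))

print(n_params([784, 10000, 10]))         # one very wide hidden layer
print(n_params([784, 256, 128, 64, 10]))  # deeper net, fewer neurons per layer
```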
What is an Epoch?
And what happens if we use too many or too few?
A hyperparameter that is one complete pass through the entire training dataset.
The model sees one example at a time, in order, and its parameters are updated based on the error made on each example.
Typically, DNNs are trained for a large number of epochs.
Too few: may underfit the data (not able to capture underlying patterns).
Too many: may overfit (fitting noise in the data rather than underlying patterns).
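A minimal sketch of epoch-based training on a toy linear model rather than a full MLP; the learning rate and epoch count are assumed:

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.standard_normal((100, 3)), rng.standard_normal(100)
w = np.zeros(3)
lr, n_epochs = 0.01, 20  # assumed hyperparameters

for epoch in range(n_epochs):          # each epoch = one full pass over the data
    for i in rng.permutation(len(X)):  # one example at a time, shuffled order
        error = X[i] @ w - y[i]        # error made on this example
        w -= lr * error * X[i]         # update parameters from that error
```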
How can we determine the optimal number of epochs?
Early stopping
Cross-validation
These decide when to stop the training.
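A sketch of one common early-stopping rule (patience on the validation loss); the `should_stop` helper and its patience value are hypothetical:

```python
def should_stop(val_losses, patience=3):
    # Stop when the validation loss has not improved for `patience` epochs.
    if len(val_losses) <= patience:
        return False
    best_so_far = min(val_losses[:-patience])
    return all(v >= best_so_far for v in val_losses[-patience:])

print(should_stop([1.0, 0.8, 0.7, 0.71, 0.72, 0.73]))  # True: 3 epochs, no improvement
```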
Generalization error
We only have access to a limited sample (not the full population)
We describe it by the empirical distribution phat_data
We want our model to predict future test cases:
It is a measure of how well the DNN is able to generalize its knowledge from the training data to new, unseen data.
What separates machine learning from optimization is that we want the generalization error, also called the test error, to be low as well.
Recall Maximum likelihood
A neural model p tries to predict an output label y from an image x.
ML estimate using a training set:
W_ML = argmax_W E_{(x,y) ~ phat_data} [ log p(y | x) ]
The expectation is the mean over the m training images.
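A small numeric sketch of that expectation; the probabilities below are made-up values for m = 4 images:

```python
import numpy as np

# Hypothetical probabilities p(y_i | x_i) that the model assigns to the
# correct labels of m = 4 training images.
p_correct = np.array([0.9, 0.6, 0.8, 0.7])

# The expectation over phat_data is just the mean over the m examples;
# W_ML is the W that maximizes this quantity.
mean_log_likelihood = np.log(p_correct).mean()
print(mean_log_likelihood)
```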
Cross entropy
Used to determine how well the NN fits the data (the model's performance on a specific dataset).
Loss = -log(prediction for the correct class)
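A minimal sketch with made-up prediction values:

```python
import numpy as np

# Hypothetical predicted probabilities for one example with 3 classes.
prediction = np.array([0.7, 0.2, 0.1])
true_class = 0

loss = -np.log(prediction[true_class])  # -log(prediction for the correct class)
print(loss)  # small when the model assigns high probability to the truth
```

Minimizing this average over the training set is the same as maximizing the log-likelihood above.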
What is overfitting
Occurs when a model becomes too complex and begins to memorize the training data rather than generalizing to new data,
fitting noise in the data rather than underlying patterns.
What can cause Overfitting?
And how to prevent it
Causing overfitting:
- Too many layers or neurons in the network (more parameters, more complex)
- Insufficient data (a small dataset is easy to memorize)
- Training for too many epochs
- High learning rate: causes the model to converge too quickly to a suboptimal solution
Preventing overfitting:
- Early stopping: monitor the model's performance on a validation set during training and stop when it degrades
- Data augmentation: gives more diverse training examples to help the model generalize better
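Two simple augmentations sketched in NumPy; the image shape and shift range are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((28, 28))  # toy grayscale image (assumed shape)

flipped = image[:, ::-1]                         # horizontal flip
dy, dx = rng.integers(-2, 3, size=2)             # small random offsets
shifted = np.roll(image, (dy, dx), axis=(0, 1))  # random translation
```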
What is underfitting?
Occurs when a model is not complex enough.
It is not able to capture underlying patterns
What can cause Underfitting?
And how to prevent it
Causing underfitting:
- Too few layers or neurons: not enough capacity
- Using the wrong model: e.g., a linear model for a non-linear problem
- Poor choice of activation function
- Overuse of regularization techniques
- Insufficient data: the model may not be able to learn the underlying pattern
- Model complexity mismatched to the task; per the "Goldilocks principle", the model should be neither too simple nor too complex
Preventing underfitting:
- Increase model complexity
- Gather more data
- Carefully tune the model's hyperparameters, such as learning rate, batch size, and number of layers/neurons
What is Train-val-test split:
A method used to divide a dataset into three parts: training set, validation set, and test set. (Don't leak information!)
The training set is used to train the model, typically using an optimization algorithm like stochastic gradient descent; model parameters are adjusted to minimize the loss function.
The validation set is used to tune the model's hyperparameters
and to evaluate performance on unseen data during development.
The test set is used for the final evaluation of the model; it measures how well the model generalizes to new data.
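A minimal NumPy sketch of such a split; the 70/15/15 ratios and dataset shape are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.standard_normal((1000, 5)), rng.integers(0, 2, size=1000)

# Shuffle once, then slice: 70% train, 15% val, 15% test (assumed ratios).
idx = rng.permutation(len(X))
n_train, n_val = int(0.70 * len(X)), int(0.15 * len(X))
train, val, test = np.split(idx, [n_train, n_train + n_val])

X_train, y_train = X[train], y[train]  # fit parameters here
X_val, y_val = X[val], y[val]          # tune hyperparameters here
X_test, y_test = X[test], y[test]      # final evaluation only (no leakage!)
```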