Lecture #4 - Deep Learning Flashcards
What is the Adam algorithm?
Adam is a variation of gradient descent. It estimates the moments of the gradient to adapt the step size in each dimension separately.
The moments describe how much the gradient changes. If the gradient has a large variance (changes a lot), the function changes a lot -> small step size.
If the gradient changes only a little -> large step size, since the function behaves smoothly.
Depending on how the gradient changes, we choose either a small or a large step size.
A heuristic, but a very popular one: not rigorously derived mathematically, but it works well in practice.
1st moment -> mean
2nd moment -> variance.
Three parameters: the learning rate and two decay rates (beta1, beta2), which control how far into the past the gradient estimates look.
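As a rough illustration, here is a minimal NumPy sketch of one Adam update step (the parameter names lr, beta1, beta2 and eps follow common convention and are not taken from the lecture):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: adapt the step size per dimension from gradient moments."""
    m = beta1 * m + (1 - beta1) * grad            # 1st moment estimate (mean)
    v = beta2 * v + (1 - beta2) * grad**2         # 2nd moment estimate (variance-like)
    m_hat = m / (1 - beta1**t)                    # bias correction for the warm-up phase
    v_hat = v / (1 - beta2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)  # large variance -> small step
    return theta, m, v
```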
Compare the behaviour of gradient descent and its variant, the Adam algorithm, when applied to different functions.
Adam is considerably slower at the start compared to steepest gradient descent. However, after a certain point it reaches the minimum rapidly.
Explain the use of hidden layers in a Multi-Layer Perceptron and whether the number of hidden layers affects the system.
The term Multi-Layer Perceptron is used interchangeably with Feed-forward Network.
The architecture of an MLP can be represented as follows:
Input Layer -> Hidden Layer 1 -> Hidden Layer 2 -> … -> Output Layer
The hidden layers in an MLP play a crucial role in feature extraction and learning complex representations of the input data. Each neuron in the hidden layers receives inputs from the previous layer, applies certain weights to those inputs, and passes the weighted sum through an activation function, introducing non-linearity to the model. The non-linearity enables the network to learn and approximate complex functions, making it more powerful than a linear model.
The hidden layer neurons allow for the function of a neural network to be broken down into specific transformations of the data.
As a rule of thumb, around 10 layers is sufficient, but the appropriate number of layers depends on the dimensionality of the data and the problem at hand.
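A minimal sketch of an MLP forward pass in NumPy (the layer sizes and the tanh activation are illustrative choices, not taken from the lecture):

```python
import numpy as np

def mlp_forward(x, weights, biases):
    """Forward pass: each hidden layer applies weights, a bias and a non-linearity."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.tanh(W @ h + b)            # hidden layer: affine transform + non-linearity
    return weights[-1] @ h + biases[-1]   # linear output layer

# Example: 4-dimensional input -> two hidden layers of 8 neurons -> 2 outputs
rng = np.random.default_rng(0)
sizes = [4, 8, 8, 2]
weights = [rng.standard_normal((m, n)) * 0.1 for n, m in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]
y_hat = mlp_forward(rng.standard_normal(4), weights, biases)
```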
Why is tanh used in ML, and what are its advantages?
- Output range between -1 and 1, which keeps activations bounded
- Non-linear function; introducing non-linearity is crucial for deep learning models to learn and represent complex relationships in data.
- Steeper slope than the sigmoid function, which helps alleviate vanishing-gradient problems in deep NNs (see the sketch after this list)
- Zero-centred; the output is zero when the input is zero, which is good for training stability
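A small sketch comparing tanh and the sigmoid around zero, illustrating the steeper slope and the zero-centred output (a rough illustration, not lecture code):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Derivatives at the origin: tanh'(0) = 1, sigmoid'(0) = 0.25
dtanh = 1.0 - np.tanh(0.0) ** 2
dsig = sigmoid(0.0) * (1.0 - sigmoid(0.0))
print(dtanh, dsig)     # 1.0 vs 0.25 -> tanh has the steeper slope
print(np.tanh(0.0))    # 0.0 -> tanh is zero-centred
```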
State why using tanh can be a problem in ML
- Suffers from vanishing gradients for large or small input values; this becomes particularly problematic when adding more layers to the NN
Explain the vanishing gradient problem.
The vanishing gradient problem occurs when the gradient values become extremely small, typically close to zero, as they are propagated from the deeper layers (closer to the output) to the shallower layers (closer to the input) of the network. Consequently, the weights in the shallower layers receive negligible updates, slowing down the learning process for those layers. As a result, the network may fail to learn meaningful representations in the early layers, limiting the overall performance of the model.
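A rough sketch of how repeated tanh layers shrink the gradient during backpropagation (the layer count and input scale are arbitrary choices, and the same pre-activations are reused at every layer for simplicity):

```python
import numpy as np

rng = np.random.default_rng(0)
h = rng.standard_normal(16) * 3.0      # pre-activations pushed into tanh's flat regions
grad = np.ones(16)                     # gradient arriving from the layer above

for layer in range(10):                # propagate backwards through 10 tanh layers
    local = 1.0 - np.tanh(h) ** 2      # tanh derivative, close to 0 for large |h|
    grad = grad * local                # chain rule: multiply the local derivatives
    print(f"layer {layer}: mean |grad| = {np.abs(grad).mean():.2e}")
```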
What is the difference between the loss function and the cost function
The terms are often used interchangeably, but in some contexts they refer to slightly different concepts.
- Loss function; a measure that quantifies the difference between the predicted output of a machine learning model and the actual target values.
- Cost function; the overall "cost" or "penalty" incurred by the model for its predictions, typically the loss averaged (or summed) over the whole training set (see the sketch below).
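A minimal sketch of the distinction, using squared error as the per-example loss and its mean over the data as the cost (the choice of squared error is an illustrative assumption):

```python
import numpy as np

def loss(y_hat, y):
    """Per-example loss: squared error between prediction and target."""
    return (y_hat - y) ** 2

def cost(y_hat_batch, y_batch):
    """Cost: the per-example losses averaged over the whole training set."""
    return np.mean(loss(y_hat_batch, y_batch))

y_hat = np.array([0.9, 0.2, 0.4])
y = np.array([1.0, 0.0, 1.0])
print(loss(y_hat, y))   # one loss value per example
print(cost(y_hat, y))   # a single scalar used for optimisation
```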
What is an epoch?
An epoch is completed when the model has seen and processed every training sample once.
In Stochastic Gradient Descent, we do not care about arriving at a global or local minimum.
What does this mean for the topic?
- Achieving a low loss function is often good enough.
- The mini-batch size, the learning rate and the initial values play a role in where we land.
- The gradient should be as large as possible and predictable
Why is the batch size important for Stochastic Gradient Descent?
A small batch size gives a noisier (less accurate) approximation of the gradient, which in some cases may help us escape from a local minimum.
Therefore:
- Subdivide training sets into mini-batches B and perform learning with mini-batches
- If all mini-batches have been used once, an epoch has elapsed (see the sketch after this list).
- The case of |B| = 1 is called online learning.
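A minimal sketch of splitting a training set into mini-batches and iterating over them for one epoch (the batch size and data shapes are placeholder choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.standard_normal((1000, 8))   # 1000 examples with 8 features each
batch_size = 100                           # |B| = 100; |B| = 1 would be online learning

perm = rng.permutation(len(X_train))       # shuffle the examples once per epoch
for start in range(0, len(X_train), batch_size):
    batch = X_train[perm[start:start + batch_size]]
    # ... compute the gradient on `batch` and take one SGD step ...
# once every mini-batch has been used, one epoch is complete
```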
Explain the concept of backpropagation
The training of a neural network is done using backpropagation.
- In a feed-forward NN, an input x produces an output y_hat = f_hat(x, theta)
- The input x flows through the layers of the network to produce the output y_hat -> forward propagation.
- During training, after forward propagation of a batch X_train of examples x_train_i, we can calculate a loss J(theta)
The backpropagation algorithm allows the information from the loss to flow backwards through the network in order to compute the gradient (a worked sketch follows).
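A minimal sketch of forward and backward propagation for a single tanh hidden layer with a squared-error loss, worked out by hand with the chain rule (the architecture and loss are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.standard_normal(4), 1.0            # one training example and its target
W1, b1 = rng.standard_normal((8, 4)) * 0.1, np.zeros(8)
W2, b2 = rng.standard_normal(8) * 0.1, 0.0

# forward propagation: the input flows through the layers to produce y_hat
z1 = W1 @ x + b1
h1 = np.tanh(z1)
y_hat = W2 @ h1 + b2
J = 0.5 * (y_hat - y) ** 2                    # loss J(theta)

# backward propagation: information from the loss flows backwards
dy_hat = y_hat - y                            # dJ/dy_hat
dW2, db2 = dy_hat * h1, dy_hat                # gradients of the output layer
dh1 = dy_hat * W2                             # chain rule back into the hidden layer
dz1 = dh1 * (1.0 - np.tanh(z1) ** 2)          # through the tanh non-linearity
dW1, db1 = np.outer(dz1, x), dz1              # gradients of the first layer
```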
What batch size do we want in image processing and image classification?
- Usually small mini-batch sizes are used (|B| =~ 100-500)
Mostly due to complexity reasons; single examples are high-dimensional, networks are large and the number of available examples is small.
What batch size do we want in communications?
We want a larger mini-batch size in communications.
- When targeting low error rates (<10^-4), many examples in a mini-batch of size =~ 1000 will never be classified incorrectly.
- There is not much incentive to improve the classifier over many iterations to cover seldom-occurring outliers.
How do we choose a batch size in communications?
Solution A: Use large mini-batch sizes, which we generate on the fly anyhow
- Overkill in the initial phase of training
- Once we are stuck in a local minimum, difficult to recover.
Solution B: Increase mini-batch size during training
- Start with a small mini-batch size (=~ 100) to rapidly converge to an approximate solution
- Then increase the mini-batch size to lower the error rate
Rule of thumb: final mini-batch size =~ 10/SER, where SER is the targeted symbol error rate.
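As a worked example (the target value is chosen here only for illustration): for a target SER of 10^-4, this rule suggests a final mini-batch size of roughly 10 / 10^-4 = 10^5 examples.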
Give the definition of a Convolutional Network
ConvNets are simply NNs that use convolution instead of general matrix multiplication in at least one of their layers.
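A minimal 1D sketch contrasting a convolutional layer with a fully connected (matrix-multiplication) layer (the sizes and the use of np.convolve are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(16)                      # a 1D input signal

# Fully connected layer: a dense weight matrix multiplies the whole input
W = rng.standard_normal((16, 16))
dense_out = W @ x                                # 16 * 16 = 256 weights

# Convolutional layer: a small kernel is slid across the input (shared weights)
kernel = rng.standard_normal(3)
conv_out = np.convolve(x, kernel, mode="same")   # only 3 weights, reused everywhere
```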