Neural networks Flashcards
What is representation learning?
It’s the idea of learning the basis functions themselves, so that the model learns its own (typically higher-dimensional) representation of the data rather than relying on hand-crafted features.
What are the best suited NN output activation functions for regression and (binary and multi-class) classification?
- Regression: Identity
- Binary classification: Sigmoid
- Multi-class classification: Softmax
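A minimal NumPy sketch of the three output activations (function names are my own; the max-subtraction in softmax is a standard numerical-stability trick):

```python
import numpy as np

def identity(z):
    # Regression: raw real-valued output
    return z

def sigmoid(z):
    # Binary classification: squashes logits to (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Multi-class classification: normalized probabilities over classes
    e = np.exp(z - np.max(z))  # subtract max for numerical stability
    return e / e.sum()

z = np.array([2.0, -1.0, 0.5])
print(identity(z), sigmoid(z), softmax(z), sep="\n")
```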
What is the universal approximation theorem?
A neural network with a single hidden layer of non-linear units and a linear output unit can approximate any continuous function on a compact domain arbitrarily well, given enough hidden units.
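As an illustration (not a proof), random tanh hidden units plus a fitted linear output can already approximate sin(x) closely on an interval; a toy sketch, with all names and constants my own:

```python
import numpy as np

# Approximate sin(x) on [-pi, pi] with one hidden layer of tanh units.
rng = np.random.default_rng(0)
x = np.linspace(-np.pi, np.pi, 200)[:, None]
y = np.sin(x).ravel()

H = 100                                # number of hidden units
W = rng.normal(size=(1, H)) * 3.0      # random input weights
b = rng.normal(size=H) * 3.0
Phi = np.tanh(x @ W + b)               # hidden-layer activations

# Fit the linear output unit by least squares.
w_out, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print("max abs error:", np.abs(Phi @ w_out - y).max())
```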
What is computed during backpropagation?
Backpropagation computes the gradient of the loss (error) function with respect to the network’s weights, by applying the chain rule backwards through the layers.
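A minimal sketch of one forward/backward pass for a one-hidden-layer network with tanh units, identity output and squared error (all names and shapes are illustrative):

```python
import numpy as np

def forward_backward(x, y, W1, b1, W2, b2):
    # Forward pass
    z1 = W1 @ x + b1
    h = np.tanh(z1)
    y_hat = W2 @ h + b2                 # identity output (regression)
    loss = 0.5 * np.sum((y_hat - y) ** 2)

    # Backward pass: chain rule from the loss back to each parameter
    d_yhat = y_hat - y                  # dL/dy_hat
    dW2 = np.outer(d_yhat, h)
    db2 = d_yhat
    dh = W2.T @ d_yhat
    dz1 = dh * (1 - np.tanh(z1) ** 2)   # tanh'(z) = 1 - tanh(z)^2
    dW1 = np.outer(dz1, x)
    db1 = dz1
    return loss, (dW1, db1, dW2, db2)
```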
What are the different ways to fight gradient vanishing?
- (Leaky) Rectified Linear Units (ReLU) (see the sketch after this list)
- Greedy layerwise pre-training
- Skip connections
- Batch normalization
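For instance, a leaky ReLU keeps a small non-zero gradient for negative inputs, so gradients are less likely to vanish through deep stacks (a sketch; the 0.01 slope is a common default):

```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    # alpha * z instead of 0 for z <= 0
    return np.where(z > 0, z, alpha * z)

def leaky_relu_grad(z, alpha=0.01):
    # Gradient is 1 for z > 0 and alpha (not 0) for z <= 0
    return np.where(z > 0, 1.0, alpha)
```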
How to fight exploding gradient?
By using L2 regularization and/or gradient clipping.
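A sketch of gradient clipping by global norm (the threshold and names are illustrative):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=5.0):
    # Rescale all gradients if their combined L2 norm exceeds max_norm.
    total = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (total + 1e-12))
    return [g * scale for g in grads]
```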
How to improve gradient descent?
- (Mini)batch or stochastic gradient descent
- Momentum, i.e. also considering the gradient of the previous step:
  θ_{t+1} = θ_t - v_t, with v_t = γ * v_{t-1} + η * ∇J(θ_t)
  (γ is the momentum coefficient)
- Nesterov Accelerated Gradient (NAG), i.e. computing the “look-ahead” gradient after applying momentum:
  v_t = γ * v_{t-1} + η * ∇J(θ_t - γ * v_{t-1})
- Adaptive learning rate (Adagrad, RMSProp), i.e. for each parameter i:
  θ_{t+1,i} = θ_{t,i} - α / sqrt(Σ_{p=1..t} g_{p,i}² + ε) * g_{t,i}   (Adagrad)
  or
  θ_{t+1,i} = θ_{t,i} - α / sqrt(E[g²]_t + ε) * g_{t,i}, with E[g²]_t = ρ * E[g²]_{t-1} + (1 - ρ) * g_t²   (RMSProp)
  (ρ ≈ 0.9 is the decay constant)
- Adaptive moment estimation (Adam):
  θ_{t+1} = θ_t - α * m̂_t / (sqrt(v̂_t) + ε), with
  m̂_t = m_t / (1 - β1^t), where m_t = β1 * m_{t-1} + (1 - β1) * g_t
  v̂_t = v_t / (1 - β2^t), where v_t = β2 * v_{t-1} + (1 - β2) * g_t²
  β1 ≈ 0.9, β2 ≈ 0.999
  (the momentum, RMSProp and Adam updates are sketched in code after this list)
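These update rules translate directly into code; a minimal sketch for a single parameter vector (function names are my own, hyperparameter defaults follow the values above):

```python
import numpy as np

def momentum_step(theta, v, grad, lr=0.01, gamma=0.9):
    v = gamma * v + lr * grad
    return theta - v, v

def rmsprop_step(theta, Eg2, grad, lr=0.001, rho=0.9, eps=1e-8):
    Eg2 = rho * Eg2 + (1 - rho) * grad ** 2      # running average of g²
    return theta - lr * grad / np.sqrt(Eg2 + eps), Eg2

def adam_step(theta, m, v, grad, t, lr=0.001,
              beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad           # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2      # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                 # bias corrections (t >= 1)
    v_hat = v / (1 - beta2 ** t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v
```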
Using adaptive learning rate, how is the learning rate impacted by the gradient?
Sparse (infrequent) gradients –> higher effective learning rate
Frequent/large gradients –> lower effective learning rate
What are the best suited models for image and sequence data?
Image –> Convolutional NN + Transformers
Sequence –> Recurrent NN + Transformers
What are the best optimizers to use?
SGD with momentum (SGDM), RMSProp, Adam (≈1.5th order), Shampoo (2nd order)
What are good practices when training a NN?
- Shuffle the data before each epoch (see the loop sketched after this list)
- Perform model selection (e.g. on a held-out validation set)
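A sketch of the per-epoch reshuffle in a mini-batch training loop (names are illustrative; X and y are assumed to be NumPy arrays):

```python
import numpy as np

def train(X, y, update_fn, epochs=10, batch_size=32):
    n = len(X)
    for epoch in range(epochs):
        idx = np.random.permutation(n)   # reshuffle every epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            update_fn(X[batch], y[batch])
```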