Neural networks Flashcards

1
Q

What is representation learning?

A

It’s the idea of learning the basis functions themselves, so the model “learns” its own (typically higher-dimensional) representation of the data instead of relying on hand-crafted features.

2
Q

What are the best-suited NN output activation functions for regression and for (binary and multi-class) classification?

A
  • Regression: Identity
  • Binary classification: Sigmoid
  • Multi-class classification: Softmax
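
A minimal NumPy sketch of the three output activations; the function names and example logits are illustrative, not from the card:

```python
import numpy as np

def identity(z):
    # Regression: output the raw linear activation.
    return z

def sigmoid(z):
    # Binary classification: squash a single logit into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Multi-class classification: normalize a vector of logits into a
    # probability distribution (subtract the max for numerical stability).
    e = np.exp(z - np.max(z))
    return e / e.sum()

logits = np.array([2.0, -1.0, 0.5])
print(identity(logits))    # e.g. predicted real values
print(sigmoid(logits[0]))  # e.g. P(class = 1) for one logit
print(softmax(logits))     # sums to 1 across the three classes
```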
3
Q

What is the universal approximation theorem?

A

A feed-forward neural network with a single hidden layer (using a non-linear activation) and a linear output unit can approximate any continuous function on a compact domain arbitrarily well, given enough hidden units.
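
One common formal statement of the theorem (assuming a suitable non-polynomial activation σ), sketched in LaTeX:

```latex
% For any continuous f on a compact set K \subset \mathbb{R}^d and any \varepsilon > 0,
% there exist N \in \mathbb{N} and parameters w_j \in \mathbb{R}^d, \; b_j, c_j \in \mathbb{R} such that
\sup_{x \in K} \left| f(x) - \sum_{j=1}^{N} c_j \, \sigma\!\left(w_j^{\top} x + b_j\right) \right| < \varepsilon .
```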

4
Q

What is computed during backpropagation?

A

Backpropagation computes the gradient of the loss (error) function with respect to all the network’s weights, by applying the chain rule backwards through the layers.
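
A minimal NumPy sketch of backprop for a one-hidden-layer regression network with squared-error loss; the network sizes and variable names are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))        # 4 samples, 3 features
y = rng.normal(size=(4, 1))        # regression targets
W1, b1 = rng.normal(size=(3, 5)), np.zeros(5)
W2, b2 = rng.normal(size=(5, 1)), np.zeros(1)

# Forward pass
z1 = x @ W1 + b1
h = np.tanh(z1)                    # hidden activations
y_hat = h @ W2 + b2                # identity output (regression)
loss = 0.5 * np.mean((y_hat - y) ** 2)

# Backward pass: chain rule from the loss back to each weight matrix
d_yhat = (y_hat - y) / len(x)      # dL/dy_hat
dW2 = h.T @ d_yhat
db2 = d_yhat.sum(axis=0)
d_h = d_yhat @ W2.T                # propagate the error into the hidden layer
d_z1 = d_h * (1 - np.tanh(z1) ** 2)
dW1 = x.T @ d_z1
db1 = d_z1.sum(axis=0)
```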

5
Q

What are the different ways to fight gradient vanishing?

A
  • (Leaky) Rectified Linear Units (ReLU)
  • Greedy layerwise pre-training
  • Skip (residual) connections
  • Batch normalization
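
A short NumPy sketch of two of these remedies, leaky ReLU and a skip (residual) connection; the layer sizes and the 0.01 slope are illustrative assumptions:

```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    # Keeps a small slope (alpha) for negative inputs,
    # so the gradient does not saturate the way sigmoid/tanh do.
    return np.where(z > 0, z, alpha * z)

def residual_block(x, W1, W2):
    # Skip connection: the identity path "x + ..." gives the gradient
    # a direct route around the two weight layers.
    h = leaky_relu(x @ W1)
    return x + h @ W2

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 8))
W1 = rng.normal(size=(8, 8)) * 0.1
W2 = rng.normal(size=(8, 8)) * 0.1
print(residual_block(x, W1, W2).shape)  # (2, 8)
```
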
6
Q

How to fight exploding gradient?

A

By using L2 regularization and/or gradient clipping.
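
A minimal NumPy sketch of gradient clipping by global norm; the threshold and example gradients are illustrative assumptions:

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=1.0):
    # Rescale all gradients jointly so their global L2 norm does not
    # exceed max_norm; gradients already below the threshold are untouched.
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-12))
    return [g * scale for g in grads]

grads = [np.array([3.0, 4.0]), np.array([12.0])]   # global norm = 13
clipped = clip_by_global_norm(grads, max_norm=1.0)
print(clipped)  # rescaled so the global norm is 1
```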

7
Q

How to improve gradient descent?

A
  • (Mini-)batch or stochastic gradient descent
  • Momentum, i.e. adding a fraction of the previous update:
    θ_{t+1} = θ_t - v_t, with v_t = γ * v_{t-1} + η * ∇J(θ_t)
    (γ is the momentum coefficient)
  • Nesterov Accelerated Gradient (NAG), i.e. computing the “look-ahead” gradient after applying momentum:
    v_t = γ * v_{t-1} + η * ∇J(θ_t - γ * v_{t-1})
  • Adaptive learning rate, i.e. for each parameter i:
    θ_{t+1,i} = θ_{t,i} - α / sqrt(Σ_{p=1…t} g_{p,i}² + ε) * g_{t,i}    (AdaGrad)
    or
    θ_{t+1,i} = θ_{t,i} - α / sqrt(E[g²]_t + ε) * g_{t,i}, with E[g²]_t = ρ * E[g²]_{t-1} + (1-ρ) * g_t²    (RMSProp)
    (ρ ≈ 0.9 is the decay constant)
  • Adaptive moment estimation (Adam), combining momentum and an adaptive learning rate (see the sketch after this list):
    θ_{t+1,i} = θ_{t,i} - α * m̂_t / (sqrt(v̂_t) + ε), with
    m̂_t = m_t / (1 - β1^t), v̂_t = v_t / (1 - β2^t)
    m_t = β1 * m_{t-1} + (1-β1) * g_t
    v_t = β2 * v_{t-1} + (1-β2) * g_t²
    (β1 ≈ 0.9, β2 ≈ 0.999)
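
A minimal NumPy sketch of the momentum and Adam updates above, applied to a toy quadratic objective; the toy objective and hyperparameter values are illustrative assumptions:

```python
import numpy as np

def grad(theta):
    # Gradient of the toy objective J(theta) = 0.5 * ||theta||^2.
    return theta

# Momentum: v_t = gamma * v_{t-1} + eta * grad(theta_t); theta_{t+1} = theta_t - v_t
theta, v = np.array([1.0, -2.0]), np.zeros(2)
gamma, eta = 0.9, 0.1
for _ in range(100):
    v = gamma * v + eta * grad(theta)
    theta = theta - v
print("momentum:", theta)   # close to the minimum at 0

# Adam: bias-corrected first and second moment estimates
theta = np.array([1.0, -2.0])
m, s = np.zeros(2), np.zeros(2)
alpha, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
for t in range(1, 101):
    g = grad(theta)
    m = beta1 * m + (1 - beta1) * g          # m_t
    s = beta2 * s + (1 - beta2) * g ** 2     # v_t
    m_hat = m / (1 - beta1 ** t)
    s_hat = s / (1 - beta2 ** t)
    theta = theta - alpha * m_hat / (np.sqrt(s_hat) + eps)
print("adam:", theta)       # also driven toward 0
```
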
8
Q

When using an adaptive learning rate, how is the learning rate affected by the gradient?

A

Sparse (infrequent) gradients → high effective learning rate
Frequent gradients → low effective learning rate
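
A tiny numeric illustration, assuming the AdaGrad-style accumulator from the previous card: the parameter that rarely receives a gradient keeps a large effective step size, while the frequently updated one shrinks.

```python
import numpy as np

alpha, eps = 0.1, 1e-8
acc = np.zeros(2)   # per-parameter accumulator: sum of squared gradients

for t in range(100):
    # Parameter 0 gets a gradient every step, parameter 1 only every 10th step.
    g = np.array([1.0, 1.0 if t % 10 == 0 else 0.0])
    acc += g ** 2
    effective_lr = alpha / np.sqrt(acc + eps)

print(effective_lr)   # noticeably larger for the sparse parameter 1
```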

9
Q

What are the best-suited models for image and sequence data?

A

Image → Convolutional NN + Transformers
Sequence → Recurrent NN + Transformers

10
Q

What are the best optimizers to use?

A

SGD with momentum (SGDM), RMSProp, Adam (≈1.5th order), Shampoo (2nd order)

11
Q

What are good practices when training a NN?

A
  • Shuffle the training data between epochs
  • Perform model selection (e.g. choose hyperparameters on a validation set)
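
A minimal sketch of per-epoch shuffling in a mini-batch loop; the linear least-squares model and all hyperparameters are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 3)), rng.normal(size=100)
w, lr, batch_size = np.zeros(3), 0.01, 16

for epoch in range(5):
    order = rng.permutation(len(X))          # reshuffle every epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        xb, yb = X[idx], y[idx]
        grad = xb.T @ (xb @ w - yb) / len(idx)   # least-squares gradient
        w -= lr * grad
```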