Deep Learning Kickoff 10 Flashcards
In a multilayer perceptron, how are the hidden layers between the input and the output layer calculated?
Each hidden unit computes ∑(wi · xi) + b over the previous layer's outputs, and the result is passed through an activation function.
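A minimal NumPy sketch of one hidden layer (the names W, b and the choice of tanh are illustrative, not taken from the cards):

```python
import numpy as np

def hidden_layer(x, W, b, activation=np.tanh):
    """One hidden layer: weighted sum plus bias, then an activation function."""
    z = W @ x + b          # sum_i(w_i * x_i) + b for every hidden unit
    return activation(z)

# Example: 3 inputs -> 4 hidden units
rng = np.random.default_rng(0)
x = rng.normal(size=3)
W = rng.normal(size=(4, 3))
b = np.zeros(4)
h = hidden_layer(x, W, b)  # shape (4,)
```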
What activation function is used in logistic regression?
Sigmoid (the logistic function): sigma(x) = 1/(1 + exp(-z)), applied to the linear combination z = ax + b.
Detail the Step Activation function
- Historically the first activation function
- Not differentiable everywhere (jump at x = 0)
- Derivative is zero wherever it is defined
sigma(x) = 1 if x ≥ 0
sigma(x) = 0 if x < 0
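A one-line NumPy version of the step function, as a sketch:

```python
import numpy as np

def step(x):
    """Step activation: 1 if x >= 0, else 0; its derivative is 0 wherever defined."""
    return np.where(x >= 0, 1.0, 0.0)

step(np.array([-2.0, 0.0, 3.0]))  # -> array([0., 1., 1.])
```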
Detail the Sigmoid and Tanh Activation functions
• Sigmoid: sigma(x) = 1/(1 + exp(-ax)) (outputs centered at 0.5)
• Tanh: sigma(x) = tanh(x) (outputs centered at 0)
• Differentiable, but gradients are killed when |x| is large
• Also expensive to compute
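A small NumPy sketch showing the sigmoid and its vanishing gradient at large |x| (the parameter a and the test values are illustrative):

```python
import numpy as np

def sigmoid(x, a=1.0):
    """Logistic sigmoid 1 / (1 + exp(-a*x)); outputs lie in (0, 1), centered at 0.5."""
    return 1.0 / (1.0 + np.exp(-a * x))

def sigmoid_grad(x, a=1.0):
    """Derivative a * s * (1 - s); it vanishes ('is killed') for large |x|."""
    s = sigmoid(x, a)
    return a * s * (1.0 - s)

x = np.array([-10.0, 0.0, 10.0])
sigmoid_grad(x)  # ~[0.000045, 0.25, 0.000045] -- gradient dies at the extremes
np.tanh(x)       # tanh: same S-shape, but centered at 0
```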
Detail the ReLU Activation function
sigma(x) = max(0, x)
Pros:
• Gradients don't die in the positive region
• Computationally efficient
• Experimentally, convergence is faster
Cons:
• Kills gradients in the negative region (the gradient is 0 for x < 0)
• Not zero-centered
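A NumPy sketch of ReLU and its gradient, illustrating the dead negative region:

```python
import numpy as np

def relu(x):
    """ReLU: max(0, x)."""
    return np.maximum(0.0, x)

def relu_grad(x):
    """Gradient: 1 in the positive region, 0 for x < 0 (the 'dead' region)."""
    return (x > 0).astype(float)
```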
Detail the Softplus Activation function
sigma(x) = ln(1 + exp(x))
Pros:
• Differentiable everywhere
• Gradients don't die immediately in the negative region (unlike ReLU)
Cons:
• Gradients still vanish in the negative region when |x| is large
• Not zero-centered
• Computationally expensive (exp and ln)
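A NumPy sketch of softplus; np.logaddexp(0, x) computes ln(1 + exp(x)) without overflow for large x:

```python
import numpy as np

def softplus(x):
    """Softplus ln(1 + exp(x)); np.logaddexp(0, x) avoids overflow for large x."""
    return np.logaddexp(0.0, x)

def softplus_grad(x):
    """The derivative is the sigmoid, which only approaches 0 as x -> -inf."""
    return 1.0 / (1.0 + np.exp(-x))
```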
Detail Leaky ReLU and Parametric ReLU Activation functions
sigma(x) = max(0.01x, x) (Leaky ReLU)
More generally, sigma(x) = max(ax, x) (Parametric ReLU)
Pros:
• Gradients don't die in either the positive or the negative region
• Computationally efficient
• Experimentally, convergence is faster
Cons:
• Need to choose a (a hyperparameter)
• Not zero-centered
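A NumPy sketch covering both variants; the default slope a = 0.01 reproduces Leaky ReLU:

```python
import numpy as np

def parametric_relu(x, a=0.01):
    """max(a*x, x): a = 0.01 gives Leaky ReLU; other slopes give Parametric ReLU."""
    return np.maximum(a * x, x)

def parametric_relu_grad(x, a=0.01):
    """Gradient is 1 for x > 0 and a for x < 0, so it does not die on either side."""
    return np.where(x > 0, 1.0, a)
```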
Detail the Exponential Linear Units
sigma(x) = x if x > 0, and a(exp(x) − 1) if x ≤ 0
Pros:
• Gradients don't die in either the positive or the negative region
• Experimentally, convergence is faster
• Closer to zero-mean outputs
Cons:
• Expensive to compute (exp)
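A NumPy sketch of ELU (the clamp inside exp is only a guard against overflow warnings, not part of the definition):

```python
import numpy as np

def elu(x, a=1.0):
    """ELU: x for x > 0, a * (exp(x) - 1) for x <= 0."""
    # np.minimum(x, 0) avoids overflow warnings in the branch that is discarded
    return np.where(x > 0, x, a * (np.exp(np.minimum(x, 0.0)) - 1.0))
```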
Detail the Maxout Neuron
sigma(x) = max(w1·x, w2·x)
Pros:
• Generalizes Parametric ReLU
• Provides more flexibility by allowing different w1 and w2
• Gradients don't die
Cons:
• Doubles the number of parameters
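A NumPy sketch of a maxout unit over two affine maps (W1, b1, W2, b2 are illustrative placeholders):

```python
import numpy as np

def maxout(x, W1, b1, W2, b2):
    """Maxout unit: elementwise max of two affine maps (twice the parameters)."""
    return np.maximum(W1 @ x + b1, W2 @ x + b2)
```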
Can we use a perceptron to model an XOR function?
No; XOR is not linearly separable, so no single linear decision boundary can model it.
Can we use a perceptron to model an OR function?
Yes; OR is linearly separable.
A (single-layer) perceptron can only do what to data points?
A (single-layer) perceptron can only separate linearly separable data points.
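A small NumPy demo of the cards above: the classic perceptron learning rule fits OR (linearly separable) but can never fit XOR. The training loop is a sketch and the epoch count is arbitrary:

```python
import numpy as np

def train_perceptron(X, y, epochs=20):
    """Classic perceptron learning rule on inputs X with 0/1 targets y."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            pred = 1.0 if xi @ w + b >= 0 else 0.0
            w += (yi - pred) * xi   # update weights only on mistakes
            b += (yi - pred)
    return w, b

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y_or  = np.array([0, 1, 1, 1], dtype=float)  # linearly separable -> learnable
y_xor = np.array([0, 1, 1, 0], dtype=float)  # not linearly separable -> fails

w, b = train_perceptron(X, y_or)
print([int(xi @ w + b >= 0) for xi in X])    # matches OR: [0, 1, 1, 1]

w, b = train_perceptron(X, y_xor)
print([int(xi @ w + b >= 0) for xi in X])    # never matches XOR
```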
An MLP with one hidden layer is known as?
A universal approximator: with enough hidden units it can approximate any continuous function on a compact domain arbitrarily well.
Name three loss functions and how to calculate them.
Square loss (MSE) = (1/n) ∑(j) (predictedValue(j) − actualValue(j))²
Cross-entropy loss = −∑(j) actualValue(j) · log(predictedValue(j))
Hinge loss = ∑(j ≠ t) max(0, predictedValue(j) − predictedValue(t) + 1), where t is the index
of the 'hot' bit in the one-hot target (the true class).
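NumPy sketches of the three losses; the helper names and the example scores are illustrative:

```python
import numpy as np

def square_loss(pred, target):
    """Mean squared error: (1/n) * sum((pred - target)^2)."""
    return np.mean((pred - target) ** 2)

def cross_entropy_loss(probs, one_hot):
    """-sum_j target_j * log(pred_j); probs should already be softmax outputs."""
    return -np.sum(one_hot * np.log(probs))

def hinge_loss(scores, t):
    """Multiclass hinge: sum over j != t of max(0, s_j - s_t + 1)."""
    margins = np.maximum(0.0, scores - scores[t] + 1.0)
    margins[t] = 0.0              # exclude the true class itself
    return np.sum(margins)

scores = np.array([2.0, 5.0, -1.0])
hinge_loss(scores, t=1)           # 0.0: the true class wins every margin
```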
Give some examples of advanced optimizers for neural networks.
• Stochastic gradient descent (SGD) has trouble with:
• Ravines: surfaces that are much steeper in one direction than in the others, which are common around local optima
• Saddle points (i.e. points where one dimension slopes up and another slopes down), which are usually surrounded by a plateau of the same error; this makes them notoriously hard for SGD to escape, as the gradient is close to zero in all dimensions
• More advanced optimizers have been proposed:
• RMSProp, AdaGrad, AdaDelta, Adam, Nadam
• These methods usually train faster than SGD, but the solution they find is often not as good as the one found by SGD
• The performance of SGD relies heavily on a robust initialization and annealing schedule
• Possible solution: first train with Adam, then fine-tune with SGD (see the sketch after this card)
• Shuffling and Curriculum Learning
• Shuffling: avoid providing the training examples to the model in a meaningful order, as this may bias the optimization algorithm
• Curriculum learning: for some cases where we aim to solve progressively harder
problems, supplying the training examples in a meaningful order may lead to
improved performance and better convergence
• Batch normalization
• For each mini-batch, normalize the layer activations to an ideal range (e.g. zero mean and unit variance) to help keep the gradients from vanishing or exploding
• Early stopping
• Monitor the error on a validation set during training and stop (with some patience) if the validation error does not improve enough
These techniques can be used alongside each other.
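A PyTorch sketch, referenced above, that combines several of these techniques: shuffling via the DataLoader, batch normalization, training with Adam and then fine-tuning with SGD, and patience-based early stopping. The data, model, and hyperparameters are placeholders:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Toy regression data; the techniques, not the task, are the point here.
X, y = torch.randn(512, 20), torch.randn(512, 1)
train_loader = DataLoader(TensorDataset(X[:400], y[:400]),
                          batch_size=32, shuffle=True)  # shuffling
val_X, val_y = X[400:], y[400:]

model = nn.Sequential(nn.Linear(20, 64),
                      nn.BatchNorm1d(64),  # batch normalization
                      nn.ReLU(),
                      nn.Linear(64, 1))
loss_fn = nn.MSELoss()

def run_epochs(optimizer, max_epochs, patience=3):
    """Train until max_epochs, or stop early when validation loss stalls."""
    best, waited = float("inf"), 0
    for _ in range(max_epochs):
        model.train()
        for xb, yb in train_loader:
            optimizer.zero_grad()
            loss_fn(model(xb), yb).backward()
            optimizer.step()
        model.eval()
        with torch.no_grad():
            val_loss = loss_fn(model(val_X), val_y).item()
        if val_loss < best:
            best, waited = val_loss, 0
        else:
            waited += 1
            if waited >= patience:  # early stopping with patience
                break

# First train with Adam, then fine-tune with SGD.
run_epochs(torch.optim.Adam(model.parameters(), lr=1e-3), max_epochs=30)
run_epochs(torch.optim.SGD(model.parameters(), lr=1e-3), max_epochs=30)
```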