Deep Learning Kickoff 10 Flashcards
In a multilayer perceptron, how are the hidden layers (between the input and output layers) calculated?
Each hidden unit computes ∑(wi·xi) + b, which is then passed through an activation function; its output feeds the next layer.
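A minimal numpy sketch of one such forward pass; the layer sizes, random weights, and the choice of sigmoid are all illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((4, 3)), np.zeros(4)  # 3 inputs -> 4 hidden units
W2, b2 = rng.standard_normal((2, 4)), np.zeros(2)  # 4 hidden -> 2 outputs

x = np.array([0.5, -1.0, 2.0])
h = sigmoid(W1 @ x + b1)  # sum(w_i * x_i) + b, then the activation function
y = sigmoid(W2 @ h + b2)  # the hidden layer feeds the next layer the same way
```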
What activation function is used in logistic regression?
The sigmoid (logistic) function: sigma(x) = 1/(1 + exp(−x))
Detail the Step Activation function
- Historically the first activation function
- Not differentiable everywhere (jump at x = 0)
- Derivative is zero wherever it is defined
sigma(x) = 1 if x ≥ 0; sigma(x) = 0 if x < 0
Detail the Sigmoid and Tanh Activation functions
• Sigmoid: sigma(x) = 1/(1 + exp(−ax)) (centered at 0.5)
• Tanh: sigma(x) = tanh(x) (centered at 0)
• Differentiable, but gradients are killed when |x| is large
• Also expensive to compute
Detail the ReLU Activation function
sigma(x) = max(0, x)
Pros:
• Gradients don’t die in the positive region
• Computationally efficient
• Experimentally, convergence is faster
Cons:
• Kills gradients in the negative region (gradient is 0 for x < 0)
• Not zero-centered
Detail the Softplus Activation function
sigma(x) = ln(1 + exp(x))
Pros:
• Differentiable everywhere
• Gradients never become exactly zero
Cons:
• Gradients still get very small in the negative region when |x| is large
• Not zero-centered
• Computationally expensive
Detail Leaky ReLU and Parametric ReLU Activation functions
sigma(x) = max(0.01x, x) (Leaky ReLU); more generally, sigma(x) = max(ax, x) (Parametric ReLU)
Pros:
• Gradients don’t die in either the positive or the negative region
• Computationally efficient
• Experimentally, convergence is faster
Cons:
• Need to choose a (a hyper-parameter)
• Not zero-centered
Detail the Exponential Linear Units
sigma(x) = x if x > 0; a(exp(x) − 1) if x ≤ 0
Pros:
• Gradients don’t die in either the positive or the negative region
• Experimentally, convergence is faster
• Outputs are closer to zero mean
Cons:
• Expensive to compute (exp)
Detail the Maxout Neuron
sigma(x) = max(w1·x, w2·x)
Pros:
• Generalizes Parametric ReLU
• More flexibility by allowing different w1 and w2
• Gradients don’t die
Cons:
• Doubles the number of parameters
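The activation functions from the cards above can be sketched in a few lines of numpy; the parameter names a, w1, w2 follow the cards, and the scalar Maxout is a simplification:

```python
import numpy as np

# Activation functions from the cards (a is the slope/scale parameter).
def step(x):                return np.where(x >= 0, 1.0, 0.0)
def sigmoid(x, a=1.0):      return 1.0 / (1.0 + np.exp(-a * x))
def relu(x):                return np.maximum(0.0, x)
def softplus(x):            return np.log1p(np.exp(x))
def leaky_relu(x, a=0.01):  return np.maximum(a * x, x)
def elu(x, a=1.0):          return np.where(x > 0, x, a * (np.exp(x) - 1.0))
def maxout(x, w1, w2):      return np.maximum(w1 * x, w2 * x)
```

All of these are vectorized, so they apply elementwise to a whole layer's pre-activations at once.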
Can we use a perceptron to model an XOR function?
No
Can we use a perceptron to model an OR function?
Yes
A (single-layer) perceptron can only do what to data points?
It can only separate linearly separable data points.
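A quick numpy check of the XOR/OR cards, using the classic perceptron learning rule; the epoch count and learning rate are arbitrary choices:

```python
import numpy as np

def perceptron_accuracy(X, y, epochs=25, lr=0.1):
    # Perceptron learning rule, with the bias folded in as an extra input of 1.
    Xb = np.hstack([X, np.ones((len(X), 1))])
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(Xb, y):
            pred = 1 if xi @ w >= 0 else 0
            w += lr * (yi - pred) * xi
    preds = (Xb @ w >= 0).astype(int)
    return (preds == y).mean()

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
or_acc  = perceptron_accuracy(X, np.array([0, 1, 1, 1]))  # linearly separable
xor_acc = perceptron_accuracy(X, np.array([0, 1, 1, 0]))  # not linearly separable
```

OR converges to 100% accuracy; XOR can never reach it, because no single linear boundary classifies all four points correctly.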
An MLP with one hidden layer is known as?
A universal approximator.
Name three loss functions and how to calculate them.
Square loss (MSE) = (1/n) ∑(j) (predictedValue(j) − actualValue(j))²
Cross-entropy loss = − ∑(j) actualValue(j) log(predictedValue(j))
Hinge loss = ∑(j ≠ t) max(0, predictedValue(j) − predictedValue(t) + 1), where t is the index
of the ‘hot’ bit in the one-hot target.
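A minimal numpy sketch of the three losses; the variable names are illustrative, and cross-entropy assumes a one-hot target with predicted probabilities:

```python
import numpy as np

def square_loss(pred, actual):
    # Mean squared error.
    return np.mean((np.asarray(pred) - np.asarray(actual)) ** 2)

def cross_entropy(pred, actual):
    # actual is one-hot; pred holds predicted probabilities per class.
    return -np.sum(np.asarray(actual) * np.log(np.asarray(pred)))

def hinge_loss(scores, t):
    # Multiclass hinge: sum over j != t of max(0, s_j - s_t + 1),
    # where t is the index of the correct class.
    scores = np.asarray(scores, dtype=float)
    margins = np.maximum(0.0, scores - scores[t] + 1.0)
    margins[t] = 0.0
    return margins.sum()
```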
Give some examples of advanced optimizers for neural networks.
• Stochastic gradient descent (SGD) has trouble with:
• Ravines: the surface is much steeper in one dimension than in the others; ravines are common around local optima
• Saddle points (i.e. points where one dimension slopes up and another slopes down)
are usually surrounded by a plateau of the same error, which makes it notoriously
hard for SGD to escape, as the gradient is close to zero in all dimensions.
• More advanced optimizers have been proposed:
• RMSProp, AdaGrad, AdaDelta, Adam, Nadam
• These methods usually train faster than SGD, but the solution they find is
often not as good as the one found by SGD
• Performance of SGD is very much reliant on a robust initialization and annealing schedule
• Possible solution: First train with Adam, fine-tune with SGD
• Shuffling and Curriculum Learning
• Shuffling: avoid providing the training examples in a meaningful order to our model
as this may bias the optimization algorithm
• Curriculum learning: for some cases where we aim to solve progressively harder
problems, supplying the training examples in a meaningful order may lead to
improved performance and better convergence
• Batch normalization
• For each mini-batch, normalize the layer inputs (the activations, not the weights) to an ideal range (e.g. zero mean and
unit variance) to help keep the gradients from vanishing or exploding
• Early stopping
• Monitor the error on a validation set during training and stop (with some patience) if the
validation error does not improve enough
These techniques can be used alongside each other
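The ravine problem and an adaptive optimizer can be illustrated on a toy quadratic, f(w) = 0.5·(100·w0² + w1²); the learning rates and the 0.9 decay are illustrative choices, not tuned values:

```python
import numpy as np

def loss(w):
    # A "ravine": much steeper along w[0] than along w[1].
    return 0.5 * (100.0 * w[0] ** 2 + w[1] ** 2)

def grad(w):
    return np.array([100.0 * w[0], w[1]])

# Plain SGD: the steep direction forces a tiny, globally shared learning rate.
w = np.array([1.0, 1.0])
for _ in range(100):
    w -= 0.015 * grad(w)
sgd_loss = loss(w)

# RMSProp: a per-parameter learning rate from a running average of squared gradients.
w, s = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(100):
    g = grad(w)
    s = 0.9 * s + 0.1 * g ** 2
    w -= 0.03 * g / (np.sqrt(s) + 1e-8)
rms_loss = loss(w)
```

Both drive the loss far below its starting value of 50.5; RMSProp does so without having to shrink the step size to suit the steepest direction.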
What is Dropout?
By dropping a unit out, we mean
temporarily removing it from the network,
along with all its incoming and outgoing connections.
Dropout is a regularization method that
approximates training a large number of neural networks with different
architectures in parallel.
Dropout has the effect of making the
training process noisy, forcing nodes
within a layer to probabilistically take on
more or less responsibility for the inputs.
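A minimal sketch of inverted dropout in numpy, the variant most frameworks use: surviving activations are rescaled at training time so that nothing needs to change at test time.

```python
import numpy as np

def dropout(h, rate, rng, training=True):
    # Zero out each unit with probability `rate`, and scale the survivors
    # by 1/(1 - rate) so the expected activation is unchanged.
    if not training or rate == 0.0:
        return h
    mask = (rng.random(h.shape) >= rate) / (1.0 - rate)
    return h * mask

rng = np.random.default_rng(0)
h = np.ones((4, 10))
out = dropout(h, 0.5, rng)  # roughly half the units become 0, the rest 2.0
```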
How can the values of dropout be set and what architectures can it be used with?
• Dropout can be used with most types of neural architectures, such as
dense fully connected layers, CNNs and RNNs
• Dropout rate (PyTorch): the probability of dropping out a node, where 0.0 means no dropout, and 1.0 means drop all nodes. A good value for dropout
in a hidden layer is between 0.2 and 0.5.
• Caveat: in some papers/blogs, ‘dropout rate’ instead means the probability of keeping a node, so
check which convention is used
• Use larger network: a network with 100 nodes and a proposed dropout rate of 0.5 will require 200 nodes (100 / 0.5) when using dropout.
• Weight constraint: Large weight values can be a sign of an unstable network. To counter this effect a weight constraint can be imposed to force
the norm (magnitude) of all weights in a layer to be below a specified value
(e.g. 3-4)
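The weight constraint can be sketched as a max-norm clip in numpy; the cap c = 3 follows the 3-4 range mentioned above, and the one-row-per-unit weight layout is an assumption:

```python
import numpy as np

def max_norm(W, c=3.0):
    # Rescale each unit's incoming weight vector (one row here) so its L2 norm
    # does not exceed c; rows already within the limit are left untouched.
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    return W * np.minimum(1.0, c / np.maximum(norms, 1e-12))

W = np.array([[6.0, 8.0], [0.3, 0.4]])  # row norms 10.0 and 0.5
W_clipped = max_norm(W, c=3.0)
```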
How should you initialize weights?
Glorot (Xavier) initialization: a reasonable default (derived assuming linear activations; works for tanh, but needs adaptation for ReLU, e.g. He initialization, since ReLU is not zero-centered)
Why shouldn’t you use large random numbers to initialize weights?
- E.g. real numbers uniformly randomly drawn from [-100, 100]
- Activations may become very large
- But gradients may be (close to) zero (consider sigmoid and tanh)
Why shouldn’t you use small random numbers to initialize weights?
• E.g. real numbers uniformly randomly drawn from [-0.01, 0.01]
• Works OK, but only for small networks (a few layers, each with a few
activations)
• In deeper networks, activations become very close to zero in deeper
layers (i.e. layers far from the input, close to the output layer)
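A small numpy experiment contrasting tiny uniform weights with Glorot uniform in a 10-layer tanh network; the depth, width, and batch size are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

def deep_tanh_activations(init, depth=10, n=256):
    # Push a batch through `depth` tanh layers and return the last activations.
    h = rng.standard_normal((512, n))
    for _ in range(depth):
        if init == "small":
            W = rng.uniform(-0.01, 0.01, (n, n))  # tiny random weights
        else:
            limit = np.sqrt(6.0 / (n + n))        # Glorot uniform range
            W = rng.uniform(-limit, limit, (n, n))
        h = np.tanh(h @ W)
    return h

small_std  = deep_tanh_activations("small").std()   # collapses toward zero
glorot_std = deep_tanh_activations("glorot").std()  # stays at a healthy scale
```

With tiny weights the activation scale shrinks by roughly a constant factor per layer, so after ten layers it is essentially zero; Glorot keeps the pre-activation variance roughly constant across layers.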