Deep Learning Kickoff 10 Flashcards
In a multilayer perceptron, how are the hidden layers (between the input and output layers) calculated?
Each hidden unit computes ∑(wi·xi) + b, which is then passed through an activation function; its output feeds the next layer.
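A minimal numpy sketch of one such forward pass; the layer sizes, random weights, and the choice of sigmoid are all illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((4, 3)), np.zeros(4)  # 3 inputs -> 4 hidden units
W2, b2 = rng.standard_normal((2, 4)), np.zeros(2)  # 4 hidden -> 2 outputs

x = np.array([0.5, -1.0, 2.0])
h = sigmoid(W1 @ x + b1)  # sum(w_i * x_i) + b, then the activation function
y = sigmoid(W2 @ h + b2)  # the hidden layer feeds the next layer the same way
```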
What activation function is used in logistic regression?
The sigmoid (logistic) function: sigma(x) = 1/(1 + exp(−x))
Detail the Step Activation function
- Historically the first activation function
- Not differentiable everywhere (jump at x = 0)
- Derivative is zero wherever it is defined
sigma(x) = 1 if x ≥ 0; sigma(x) = 0 if x < 0
Detail the Sigmoid and Tanh Activation functions
• Sigmoid: sigma(x) = 1/(1 + exp(−ax)) (centered at 0.5)
• Tanh: sigma(x) = tanh(x) (centered at 0)
• Differentiable, but gradients are killed when |x| is large
• Also expensive to compute
Detail the ReLU Activation function
sigma(x) = max(0, x)
Pros:
• Gradients don’t die in the positive region
• Computationally efficient
• Experimentally, convergence is faster
Cons:
• Kills gradients in the negative region (gradient is 0 for x < 0)
• Not zero-centered
Detail the Softplus Activation function
sigma(x) = ln(1 + exp(x))
Pros:
• Differentiable everywhere
• Gradients never become exactly zero
Cons:
• Gradients still get very small in the negative region when |x| is large
• Not zero-centered
• Computationally expensive
Detail Leaky ReLU and Parametric ReLU Activation functions
sigma(x) = max(0.01x, x) (Leaky ReLU); more generally, sigma(x) = max(ax, x) (Parametric ReLU)
Pros:
• Gradients don’t die in either the positive or the negative region
• Computationally efficient
• Experimentally, convergence is faster
Cons:
• Need to choose a (a hyper-parameter)
• Not zero-centered
Detail the Exponential Linear Units
sigma(x) = x if x > 0; a(exp(x) − 1) if x ≤ 0
Pros:
• Gradients don’t die in either the positive or the negative region
• Experimentally, convergence is faster
• Outputs are closer to zero mean
Cons:
• Expensive to compute (exp)
Detail the Maxout Neuron
sigma(x) = max(w1·x, w2·x)
Pros:
• Generalizes Parametric ReLU
• More flexibility by allowing different w1 and w2
• Gradients don’t die
Cons:
• Doubles the number of parameters
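The activation functions from the cards above can be sketched in a few lines of numpy; the parameter names a, w1, w2 follow the cards, and the scalar Maxout is a simplification:

```python
import numpy as np

# Activation functions from the cards (a is the slope/scale parameter).
def step(x):                return np.where(x >= 0, 1.0, 0.0)
def sigmoid(x, a=1.0):      return 1.0 / (1.0 + np.exp(-a * x))
def relu(x):                return np.maximum(0.0, x)
def softplus(x):            return np.log1p(np.exp(x))
def leaky_relu(x, a=0.01):  return np.maximum(a * x, x)
def elu(x, a=1.0):          return np.where(x > 0, x, a * (np.exp(x) - 1.0))
def maxout(x, w1, w2):      return np.maximum(w1 * x, w2 * x)
```

All of these are vectorized, so they apply elementwise to a whole layer's pre-activations at once.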
Can we use a perceptron to model an XOR function?
No
Can we use a perceptron to model an OR function?
Yes
A (single-layer) perceptron can only do what to data points?
It can only separate linearly separable data points.
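A quick numpy check of the XOR/OR cards, using the classic perceptron learning rule; the epoch count and learning rate are arbitrary choices:

```python
import numpy as np

def perceptron_accuracy(X, y, epochs=25, lr=0.1):
    # Perceptron learning rule, with the bias folded in as an extra input of 1.
    Xb = np.hstack([X, np.ones((len(X), 1))])
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(Xb, y):
            pred = 1 if xi @ w >= 0 else 0
            w += lr * (yi - pred) * xi
    preds = (Xb @ w >= 0).astype(int)
    return (preds == y).mean()

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
or_acc  = perceptron_accuracy(X, np.array([0, 1, 1, 1]))  # linearly separable
xor_acc = perceptron_accuracy(X, np.array([0, 1, 1, 0]))  # not linearly separable
```

OR converges to 100% accuracy; XOR can never reach it, because no single linear boundary classifies all four points correctly.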
An MLP with one hidden layer is known as?
A universal approximator.
Name three loss functions and how to calculate them.
Square loss (MSE) = (1/n) ∑(j) (predictedValue(j) − actualValue(j))²
Cross-entropy loss = − ∑(j) actualValue(j) log(predictedValue(j))
Hinge loss = ∑(j ≠ t) max(0, predictedValue(j) − predictedValue(t) + 1), where t is the index
of the ‘hot’ bit in the one-hot target.
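A minimal numpy sketch of the three losses; the variable names are illustrative, and cross-entropy assumes a one-hot target with predicted probabilities:

```python
import numpy as np

def square_loss(pred, actual):
    # Mean squared error.
    return np.mean((np.asarray(pred) - np.asarray(actual)) ** 2)

def cross_entropy(pred, actual):
    # actual is one-hot; pred holds predicted probabilities per class.
    return -np.sum(np.asarray(actual) * np.log(np.asarray(pred)))

def hinge_loss(scores, t):
    # Multiclass hinge: sum over j != t of max(0, s_j - s_t + 1),
    # where t is the index of the correct class.
    scores = np.asarray(scores, dtype=float)
    margins = np.maximum(0.0, scores - scores[t] + 1.0)
    margins[t] = 0.0
    return margins.sum()
```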
Give some examples of advanced optimizers for neural networks.
• Stochastic gradient descent (SGD) has trouble with:
• Ravines: the surface is much steeper in one dimension than in the others; ravines are common around local optima
• Saddle points (i.e. points where one dimension slopes up and another slopes down)
are usually surrounded by a plateau of the same error, which makes it notoriously
hard for SGD to escape, as the gradient is close to zero in all dimensions.
• More advanced optimizers have been proposed:
• RMSProp, AdaGrad, AdaDelta, Adam, Nadam
• These methods usually train faster than SGD, but the solution they find is
often not as good as the one found by SGD
• Performance of SGD is very much reliant on a robust initialization and annealing schedule
• Possible solution: First train with Adam, fine-tune with SGD
• Shuffling and Curriculum Learning
• Shuffling: avoid providing the training examples in a meaningful order to our model
as this may bias the optimization algorithm
• Curriculum learning: for some cases where we aim to solve progressively harder
problems, supplying the training examples in a meaningful order may lead to
improved performance and better convergence
• Batch normalization
• For each mini-batch, normalize the layer inputs (the activations, not the weights) to an ideal range (e.g. zero mean and
unit variance) to help keep the gradients from vanishing or exploding
• Early stopping
• Monitor the error on a validation set during training and stop (with some patience) if the
validation error does not improve enough
These techniques can be used alongside each other
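The ravine problem and an adaptive optimizer can be illustrated on a toy quadratic, f(w) = 0.5·(100·w0² + w1²); the learning rates and the 0.9 decay are illustrative choices, not tuned values:

```python
import numpy as np

def loss(w):
    # A "ravine": much steeper along w[0] than along w[1].
    return 0.5 * (100.0 * w[0] ** 2 + w[1] ** 2)

def grad(w):
    return np.array([100.0 * w[0], w[1]])

# Plain SGD: the steep direction forces a tiny, globally shared learning rate.
w = np.array([1.0, 1.0])
for _ in range(100):
    w -= 0.015 * grad(w)
sgd_loss = loss(w)

# RMSProp: a per-parameter learning rate from a running average of squared gradients.
w, s = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(100):
    g = grad(w)
    s = 0.9 * s + 0.1 * g ** 2
    w -= 0.03 * g / (np.sqrt(s) + 1e-8)
rms_loss = loss(w)
```

Both drive the loss far below its starting value of 50.5; RMSProp does so without having to shrink the step size to suit the steepest direction.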
What is Dropout?
By dropping a unit out, we mean
temporarily removing it from the network,
along with all its incoming and outgoing connections.
Dropout is a regularization method that
approximates training a large number of neural networks with different
architectures in parallel.
Dropout has the effect of making the
training process noisy, forcing nodes
within a layer to probabilistically take on
more or less responsibility for the inputs.
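A minimal sketch of inverted dropout in numpy, the variant most frameworks use: surviving activations are rescaled at training time so that nothing needs to change at test time.

```python
import numpy as np

def dropout(h, rate, rng, training=True):
    # Zero out each unit with probability `rate`, and scale the survivors
    # by 1/(1 - rate) so the expected activation is unchanged.
    if not training or rate == 0.0:
        return h
    mask = (rng.random(h.shape) >= rate) / (1.0 - rate)
    return h * mask

rng = np.random.default_rng(0)
h = np.ones((4, 10))
out = dropout(h, 0.5, rng)  # roughly half the units become 0, the rest 2.0
```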
How can the values of dropout be set and what architectures can it be used with?
• Dropout can be used with most types of neural architectures, such as
dense fully connected layers, CNNs and RNNs
• Dropout rate (PyTorch): the probability of dropping out a node, where 0.0 means no dropout, and 1.0 means drop all nodes. A good value for dropout
in a hidden layer is between 0.2 and 0.5.
• Caveat: in some papers/blogs, ‘dropout rate’ instead means the probability of keeping a node, so
check which convention is used
• Use larger network: a network with 100 nodes and a proposed dropout rate of 0.5 will require 200 nodes (100 / 0.5) when using dropout.
• Weight constraint: Large weight values can be a sign of an unstable network. To counter this effect a weight constraint can be imposed to force
the norm (magnitude) of all weights in a layer to be below a specified value
(e.g. 3-4)
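The weight constraint can be sketched as a max-norm clip in numpy; the cap c = 3 follows the 3-4 range mentioned above, and the one-row-per-unit weight layout is an assumption:

```python
import numpy as np

def max_norm(W, c=3.0):
    # Rescale each unit's incoming weight vector (one row here) so its L2 norm
    # does not exceed c; rows already within the limit are left untouched.
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    return W * np.minimum(1.0, c / np.maximum(norms, 1e-12))

W = np.array([[6.0, 8.0], [0.3, 0.4]])  # row norms 10.0 and 0.5
W_clipped = max_norm(W, c=3.0)
```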
How should you initialize weights?
Glorot (Xavier) initialization: a reasonable default (derived assuming linear activations; works for tanh, but needs adaptation for ReLU, e.g. He initialization, since ReLU is not zero-centered)
Why shouldn’t you use large random numbers to initialize weights?
- E.g. real numbers uniformly randomly drawn from [-100, 100]
- Activations may become very large
- But gradients may be (close to) zero (consider sigmoid and tanh)
Why shouldn’t you use small random numbers to initialize weights?
• E.g. real numbers uniformly randomly drawn from [-0.01, 0.01]
• Works OK, but only for small networks (a few layers, each with a few
activations)
• In deeper networks, activations become very close to zero in deeper
layers (i.e. layers far from the input, close to the output layer)
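A small numpy experiment contrasting tiny uniform weights with Glorot uniform in a 10-layer tanh network; the depth, width, and batch size are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

def deep_tanh_activations(init, depth=10, n=256):
    # Push a batch through `depth` tanh layers and return the last activations.
    h = rng.standard_normal((512, n))
    for _ in range(depth):
        if init == "small":
            W = rng.uniform(-0.01, 0.01, (n, n))  # tiny random weights
        else:
            limit = np.sqrt(6.0 / (n + n))        # Glorot uniform range
            W = rng.uniform(-limit, limit, (n, n))
        h = np.tanh(h @ W)
    return h

small_std  = deep_tanh_activations("small").std()   # collapses toward zero
glorot_std = deep_tanh_activations("glorot").std()  # stays at a healthy scale
```

With tiny weights the activation scale shrinks by roughly a constant factor per layer, so after ten layers it is essentially zero; Glorot keeps the pre-activation variance roughly constant across layers.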