Deep Learning Kickoff 10 Flashcards

1
Q

In a multilayer perceptron, how are the hidden layers (after the input layer and before the output layer) computed?

A

Each hidden unit computes ∑(wi · xi) + b and passes the result through an activation function; the outputs of one layer become the inputs to the next layer.
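A minimal NumPy sketch of this computation (the layer sizes and names are illustrative assumptions, not part of the card):

```python
import numpy as np

def hidden_layer(x, W, b, activation=np.tanh):
    """One hidden layer: weighted sum plus bias, then a non-linearity."""
    z = W @ x + b          # sum_i(w_i * x_i) + b for each hidden unit
    return activation(z)   # element-wise activation function

# toy example: 3 inputs -> 4 hidden units
x = np.array([1.0, -2.0, 0.5])
W = np.random.randn(4, 3) * 0.1
b = np.zeros(4)
h = hidden_layer(x, W, b)   # these outputs feed the next layer
```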

2
Q

What activation function is used in logistic regression?

A

The sigmoid (logistic) function: sigma(x) = 1 / (1 + exp(−x))

3
Q

Detail the Step Activation function

A
• Historically the first activation function
• Not differentiable everywhere (discontinuous at x = 0)
• Where it is differentiable, the derivative is zero, so no gradient can flow through it
sigma(x) = 1 if x ≥ 0
sigma(x) = 0 if x < 0
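A tiny NumPy sketch (my own illustration) of the step function and its zero derivative:

```python
import numpy as np

def step(x):
    # sigma(x) = 1 if x >= 0, else 0
    return np.where(x >= 0, 1.0, 0.0)

def step_grad(x):
    # the derivative is 0 wherever it exists (undefined at x = 0),
    # so no gradient can flow through a step unit during backpropagation
    return np.zeros_like(x)
```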
4
Q

Detail the Sigmoid and Tanh Activation functions

A
• Sigmoid: sigma(x) = 1 / (1 + exp(−ax))  (centered at 0.5)
• Tanh: sigma(x) = tanh(x)  (centered at 0)
• Differentiable, but gradients are killed when |x| is large
• Also, expensive to compute
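A short NumPy sketch (my own illustration) showing how both gradients vanish for large |x|:

```python
import numpy as np

def sigmoid(x, a=1.0):
    return 1.0 / (1.0 + np.exp(-a * x))   # outputs in (0, 1), centered at 0.5

def sigmoid_grad(x, a=1.0):
    s = sigmoid(x, a)
    return a * s * (1.0 - s)

def tanh_grad(x):
    return 1.0 - np.tanh(x) ** 2          # tanh outputs are centered at 0

x = np.array([0.0, 10.0])
print(sigmoid_grad(x))   # ~[0.25, 4.5e-05]  -> gradient killed for large |x|
print(tanh_grad(x))      # ~[1.00, 8.2e-09]
```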
5
Q

Detail the ReLU Activation function

A

sigma(x) = max(0, x)

Pros:
• Gradients don't die in the positive region
• Computationally efficient
• Experimentally: convergence is faster
Cons:
• Kills gradients in the negative region (the gradient is 0 for x < 0)
• Not zero centered
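A minimal NumPy sketch (my own illustration) of ReLU and its gradient:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)          # sigma(x) = max(0, x)

def relu_grad(x):
    # 1 in the positive region, 0 for x < 0 (the "dead" region)
    return (x > 0).astype(float)
```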

6
Q

Detail the Softplus Activation function

A

sigma(x) = ln(1 + exp(x))

Pros:
• Differentiable everywhere
• Gradients are non-zero in both the positive and the negative region
Cons:
• Gradients still become very small in the negative region when |x| is large
• Not zero centered
• Computationally expensive
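A minimal NumPy sketch (my own illustration); the derivative is the sigmoid, so the gradient is never exactly zero but becomes very small for large negative x:

```python
import numpy as np

def softplus(x):
    return np.log1p(np.exp(x))          # sigma(x) = ln(1 + exp(x))

def softplus_grad(x):
    return 1.0 / (1.0 + np.exp(-x))     # the sigmoid: > 0 everywhere, tiny for x << 0
```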
7
Q

Detail Leaky ReLU and Parametric ReLU Activation functions

A
sigma(x) = max(0.01x, x)  (Leaky ReLU)
More generally, sigma(x) = max(ax, x)  (Parametric ReLU)
Pros:
• Gradients don't die in either the positive or the negative region
• Computationally efficient
• Experimentally: convergence is faster
Cons:
• Need to choose a (a hyper-parameter)
• Not zero centered
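A minimal NumPy sketch (my own illustration), with a as a configurable slope:

```python
import numpy as np

def leaky_relu(x, a=0.01):
    # a = 0.01 gives Leaky ReLU; treating a as a tunable value gives Parametric ReLU
    return np.maximum(a * x, x)

def leaky_relu_grad(x, a=0.01):
    # 1 for x > 0 and a for x <= 0, so the gradient never becomes exactly zero
    return np.where(x > 0, 1.0, a)
```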
8
Q

Detail the Exponential Linear Unit (ELU) activation function

A
sigma(x) =
• x if x > 0
• a(exp(x) − 1) if x ≤ 0
Pros:
• Gradients don't die in either the positive or the negative region
• Computationally efficient in the positive region
• Experimentally: convergence is faster
• Closer to zero-mean outputs
Cons:
• Expensive computation in the negative region (exp)
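A minimal NumPy sketch (my own illustration) of ELU and its gradient:

```python
import numpy as np

def elu(x, a=1.0):
    # x for x > 0, a * (exp(x) - 1) for x <= 0
    return np.where(x > 0, x, a * (np.exp(x) - 1.0))

def elu_grad(x, a=1.0):
    # 1 in the positive region, a * exp(x) in the negative region
    return np.where(x > 0, 1.0, a * np.exp(x))
```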
9
Q

Detail the Maxout Neuron

A

sigma(x) = max(w1 · x, w2 · x)

Pros:
• Generalizes Parametric ReLU
• Provides more flexibility by allowing different w1 and w2
• Gradients don’t die
Cons:
• Doubles the number of parameters
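A minimal NumPy sketch (my own illustration; the bias terms are my addition, the card's simpler form omits them):

```python
import numpy as np

def maxout(x, W1, b1, W2, b2):
    # element-wise max of two affine maps; note the doubled parameter count
    return np.maximum(W1 @ x + b1, W2 @ x + b2)

# toy usage: 3 inputs -> 4 maxout units
x = np.random.randn(3)
W1, W2 = np.random.randn(4, 3), np.random.randn(4, 3)
b1, b2 = np.zeros(4), np.zeros(4)
h = maxout(x, W1, b1, W2, b2)
```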
10
Q

Can we use a perceptron to model an XOR function?

A

No. A single perceptron can only produce a linear decision boundary, and XOR is not linearly separable.

11
Q

Can we use a perceptron to model an OR function?

A

Yes. OR is linearly separable, so a single perceptron can model it (see the sketch below).
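A small check (my own example weights) that a single perceptron with a step activation implements OR, whereas no such weights exist for XOR:

```python
import numpy as np

def perceptron(x, w, b):
    return int(w @ x + b >= 0)   # step activation

# OR is linearly separable: w = (1, 1), b = -0.5 works
w, b = np.array([1.0, 1.0]), -0.5
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, perceptron(np.array(x), w, b))   # prints 0, 1, 1, 1

# XOR (0, 1, 1, 0) is not linearly separable, so no single (w, b) can produce it;
# an MLP with at least one hidden layer is required.
```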

12
Q

A (single-layer) perceptron can only do what to data points?

A

A (single-layer) perceptron can only separate linearly separable data points.

13
Q

An MLP with one hidden layer is known as what?

A

A universal approximator: with enough hidden units, it can approximate any continuous function on a compact domain arbitrarily well.

14
Q

Name three loss functions and how to calculate them.

A

Square loss (MSE) = (1/n) ∑(j) (predictedValue(j) − actualValue(j))²
Cross-entropy loss = − ∑(j) actualValue(j) · log(predictedValue(j))
Hinge loss = ∑(j ≠ t) max(0, predictedValue(j) − predictedValue(t) + 1), where t is the index of the 'hot' bit in the one-hot target.
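A NumPy sketch of the three losses (the variable names and the small smoothing constant are my own choices):

```python
import numpy as np

def square_loss(pred, target):
    # mean squared error
    return np.mean((pred - target) ** 2)

def cross_entropy_loss(pred_probs, target_onehot):
    # -sum_j target_j * log(pred_j); 1e-12 avoids log(0)
    return -np.sum(target_onehot * np.log(pred_probs + 1e-12))

def hinge_loss(scores, t):
    # sum over j != t of max(0, s_j - s_t + 1), t = index of the correct class
    margins = np.maximum(0.0, scores - scores[t] + 1.0)
    margins[t] = 0.0
    return np.sum(margins)

print(hinge_loss(np.array([2.0, 0.5, -1.0]), t=0))   # 0.0: all margins satisfied
```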

15
Q

Give some examples of advanced optimizers for neural networks.

A

• Stochastic gradient descent (SGD) has trouble with:
  • Ravines: the surface is much steeper in one direction than in another, which is common around local optima
  • Saddle points (i.e. points where one dimension slopes up and another slopes down): these are usually surrounded by a plateau of nearly constant error, which makes them notoriously hard for SGD to escape, as the gradient is close to zero in all dimensions
• More advanced optimizers have been proposed: RMSProp, AdaGrad, AdaDelta, Adam, Nadam
  • These methods usually train faster than SGD, but the solutions they find are often not as good as those found by SGD
  • The performance of SGD relies heavily on a robust initialization and annealing schedule
  • Possible solution: first train with Adam, then fine-tune with SGD (see the sketch below)
• Shuffling and curriculum learning
  • Shuffling: avoid providing the training examples in a meaningful order, as this may bias the optimization algorithm
  • Curriculum learning: when we aim to solve progressively harder problems, supplying the training examples in a meaningful order may lead to improved performance and better convergence
• Batch normalization
  • For each mini-batch, normalize the layer activations to an ideal range (e.g. zero mean and unit variance) so that the gradients do not vanish or explode
• Early stopping
  • Monitor the error on a validation set during training and stop (with some patience) if the validation error does not improve enough
These techniques can be used alongside each other.
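A hedged PyTorch sketch of the "train with Adam, fine-tune with SGD" idea; the toy data, model, and hyper-parameters are placeholders, not values from the card:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# toy regression data and model
X, y = torch.randn(256, 10), torch.randn(256, 1)
loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)  # shuffling
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
loss_fn = nn.MSELoss()

def train(optimizer, epochs):
    for _ in range(epochs):
        for xb, yb in loader:
            optimizer.zero_grad()
            loss_fn(model(xb), yb).backward()
            optimizer.step()

train(torch.optim.Adam(model.parameters(), lr=1e-3), epochs=5)               # fast start
train(torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9), epochs=5)  # fine-tune
```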

16
Q

What is Dropout?

A

By dropping a unit out, we mean temporarily removing it from the network, along with all its incoming and outgoing connections.

Dropout is a regularization method that approximates training a large number of neural networks with different architectures in parallel.

Dropout has the effect of making the training process noisy, forcing nodes within a layer to probabilistically take on more or less responsibility for the inputs.

17
Q

How can the values of dropout be set and what architectures can it be used with?

A

• Dropout can be used with most types of neural architectures, such as dense fully connected layers, CNNs and RNNs
• Dropout rate (PyTorch): the probability of dropping out a node, where 0.0 means no dropout and 1.0 means drop all nodes. A good value for the dropout rate in a hidden layer is between 0.2 and 0.5 (see the sketch below)
  • Caveat: in some papers/blogs, 'dropout rate' instead refers to the probability of keeping a node rather than dropping one
• Use a larger network: a layer that needs 100 nodes with a proposed dropout rate of 0.5 will require 200 nodes (100 / 0.5) when dropout is used
• Weight constraint: large weight values can be a sign of an unstable network. To counter this effect, a weight constraint can be imposed to force the norm (magnitude) of all weights in a layer to stay below a specified value (e.g. 3-4)
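A short PyTorch sketch of dropout between dense layers (the layer sizes and rate are illustrative):

```python
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(100, 200),   # larger layer to compensate for dropped units
    nn.ReLU(),
    nn.Dropout(p=0.5),     # PyTorch: p is the probability of *dropping* a node
    nn.Linear(200, 10),
)

model.train()   # dropout active: units are dropped at random each forward pass
model.eval()    # dropout disabled at inference time
```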

18
Q

How should you initialize weights?

A

Glorot (Xavier) initialization: a reasonable default. It is derived assuming linear activations and works well for tanh, but needs adaptation for ReLU (e.g. He initialization) because ReLU is not zero centered.
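A minimal NumPy sketch of Glorot (uniform) initialization, assuming the usual limit sqrt(6 / (fan_in + fan_out)):

```python
import numpy as np

def glorot_uniform(fan_in, fan_out):
    # weights drawn uniformly from [-limit, limit]
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return np.random.uniform(-limit, limit, size=(fan_out, fan_in))

W = glorot_uniform(fan_in=256, fan_out=128)
```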

19
Q

Why shouldn’t you use large random numbers to initialize weights?

A
• E.g. real numbers drawn uniformly at random from [-100, 100]
• Activations may become very large
• But gradients may be (close to) zero, because sigmoid and tanh saturate for large |x|
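A tiny NumPy illustration (my own toy example) of the saturation effect with tanh:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(100)
W = rng.uniform(-100, 100, size=(100, 100))   # huge random weights

h = np.tanh(W @ x)
grad = 1.0 - h ** 2      # tanh derivative at the activations
print(h[:3])             # activations saturate at +/- 1
print(grad.max())        # gradients are essentially zero
```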
20
Q

Why shouldn’t you use small random numbers to initialize weights?

A

• E.g. real numbers drawn uniformly at random from [-0.01, 0.01]
• Works OK, but only for small networks (a few layers, each with a few activations)
• In deeper networks, activations become very close to zero in the deeper layers (i.e. layers far from the input, close to the output layer)
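A tiny NumPy illustration (my own toy example) of activations shrinking towards zero with small weights:

```python
import numpy as np

rng = np.random.default_rng(0)
h = rng.standard_normal(500)
for layer in range(10):
    W = rng.uniform(-0.01, 0.01, size=(500, 500))   # small random weights
    h = np.tanh(W @ h)
    print(f"layer {layer + 1}: std of activations = {h.std():.2e}")
```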