Lecture #4 - Deep Learning Flashcards

1
Q

What is the Adam algorithm?

A

Adam is a variation of gradient descent. It estimates the moments of the gradient to adapt the step size in each dimension separately.

The moments describe how much the gradient changes. If the gradient has a large variance (changes a lot), the function changes a lot -> small step size.

If the gradient changes only a little -> large step size, since the function is smooth.

Depending on the change of the gradient, we choose either a small or a large step size.

A heuristic, but a very popular one: not rigorously derived mathematically, but it works.

1st moment -> mean
2nd moment -> variance

Three parameters: the learning rate and two decay rates, which control how far into the past the moment estimates of the gradient look.
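
A minimal sketch of an Adam-style optimisation loop in Python (not from the lecture); grad is an assumed function returning the gradient at theta, and the hyperparameter defaults follow common conventions:

    import numpy as np

    def adam(grad, theta, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
        m = np.zeros_like(theta)  # 1st moment estimate (mean of the gradient)
        v = np.zeros_like(theta)  # 2nd moment estimate (uncentred variance)
        for t in range(1, steps + 1):
            g = grad(theta)
            m = beta1 * m + (1 - beta1) * g     # decay rate beta1: memory of the mean
            v = beta2 * v + (1 - beta2) * g**2  # decay rate beta2: memory of the variance
            m_hat = m / (1 - beta1**t)          # bias correction for the zero initialisation
            v_hat = v / (1 - beta2**t)
            theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)  # large variance -> small step
        return theta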

2
Q

Compare gradient descent with its variant, the Adam algorithm, when applied to different functions.

A

Adam is much slower at the start than steepest gradient descent. However, after a certain point it reaches the minimum rapidly.
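
A minimal sketch (an illustrative assumption, not from the lecture) contrasting the two optimisers on the ill-conditioned quadratic f(x, y) = x^2/2 + 5000*y^2; learning rates and step counts are chosen for illustration:

    import numpy as np

    def grad(p):                                 # gradient of f(x, y) = x**2/2 + 5000*y**2
        return np.array([p[0], 10000.0 * p[1]])

    p_gd = np.array([1.0, 1.0])
    p_adam = np.array([1.0, 1.0])
    m, v = np.zeros(2), np.zeros(2)
    lr_gd = 1.9e-4                               # largest stable step for the steep y-direction
    for t in range(1, 1001):
        p_gd = p_gd - lr_gd * grad(p_gd)
        g = grad(p_adam)
        m = 0.9 * m + 0.1 * g                    # 1st-moment estimate
        v = 0.999 * v + 0.001 * g**2             # 2nd-moment estimate
        p_adam = p_adam - 0.1 * (m / (1 - 0.9**t)) / (np.sqrt(v / (1 - 0.999**t)) + 1e-8)

    print("GD:  ", p_gd)    # x has barely moved: the stiff y-direction limits the step size
    print("Adam:", p_adam)  # per-dimension adaptation moves both coordinates toward 0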

3
Q

Explain the use of hidden layers in a Multi-Layer Perceptron and whether the number of hidden layers affects the system.

A

The term Multi-Layer Perceptron is used interchangeably with Feed-forward Network.

The architecture of an MLP can be represented as follows:

Input Layer -> Hidden Layer 1 -> Hidden Layer 2 -> … -> Output Layer

The hidden layers in an MLP play a crucial role in feature extraction and learning complex representations of the input data. Each neuron in the hidden layers receives inputs from the previous layer, applies certain weights to those inputs, and passes the weighted sum through an activation function, introducing non-linearity to the model. The non-linearity enables the network to learn and approximate complex functions, making it more powerful than a linear model.

The hidden-layer neurons allow the overall function of a neural network to be broken down into specific transformations of the data.

In practice, around 10 layers is often sufficient, but the required depth depends on the problem and the dimensionality of the data.
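
A minimal sketch of the forward pass through such an architecture, assuming random weights and illustrative layer widths:

    import numpy as np

    rng = np.random.default_rng(0)

    def layer(x, w, b):
        return np.tanh(w @ x + b)   # affine transform followed by a non-linearity

    x  = rng.normal(size=4)                                 # input layer (4 features)
    h1 = layer(x,  rng.normal(size=(8, 4)), np.zeros(8))    # hidden layer 1
    h2 = layer(h1, rng.normal(size=(8, 8)), np.zeros(8))    # hidden layer 2
    y  = rng.normal(size=(1, 8)) @ h2                       # linear output layer
    print(y)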

4
Q

Why is tanh used in ML? State the advantages.

A
  1. Range between -1 and 1, which is useful for certain properties.
  2. Non-linear function; introducing non-linearity is crucial for deep learning models to learn and represent complex relationships in data.
  3. Steeper slope than the sigmoid function, which alleviates vanishing gradient problems in deep NNs (see the sketch after this list).
  4. Zero-centred; the output is zero when the input is zero, which is good for training stability.
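
A minimal sketch comparing the slopes: tanh'(0) = 1, while the sigmoid's derivative peaks at 0.25:

    import numpy as np

    x = np.linspace(-4, 4, 9)
    tanh_grad = 1 - np.tanh(x)**2    # derivative of tanh, maximum 1 at x = 0
    sig = 1 / (1 + np.exp(-x))
    sig_grad = sig * (1 - sig)       # derivative of the sigmoid, maximum 0.25 at x = 0
    print(np.round(tanh_grad, 3))
    print(np.round(sig_grad, 3))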
5
Q

State why using tanh can be a problem in ML

A
  1. It suffers from vanishing gradient problems for large or small input values (where tanh saturates) -> in particular when adding more layers to the NN.
6
Q

Explain the vanishing gradient problem.

A

The vanishing gradient problem occurs when the gradient values become extremely small, typically close to zero, as they are propagated from the deeper layers (closer to the output) to the shallower layers (closer to the input) of the network. Consequently, the weights in the shallower layers receive negligible updates, slowing down the learning process for those layers. As a result, the network may fail to learn meaningful representations in the early layers, limiting the overall performance of the model.
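A minimal sketch of the effect, assuming a small tanh network with illustrative random weights: the backpropagated gradient is a product of per-layer terms and shrinks with depth:

    import numpy as np

    rng = np.random.default_rng(0)
    depth, width = 20, 10
    ws = [rng.normal(scale=0.3, size=(width, width)) for _ in range(depth)]

    # forward pass: store each layer's tanh derivative 1 - tanh(a)**2
    x = rng.normal(size=width)
    derivs = []
    for w in ws:
        x = np.tanh(w @ x)
        derivs.append(1 - x**2)

    # backward pass: propagate a gradient from the output back to the input
    g = np.ones(width)                # stand-in for the gradient at the output
    for w, d in zip(reversed(ws), reversed(derivs)):
        g = w.T @ (d * g)             # chain rule through one layer
    print(np.linalg.norm(g))          # typically far smaller than the initial norm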

7
Q

What is the difference between the loss function and the cost function?

A

The two terms are often used interchangeably, but in some contexts they refer to slightly different concepts.

  1. Loss function; a measure that quantifies the difference between the predicted output of a machine learning model and the actual target value, typically for a single example.
  2. Cost function; represents the overall "cost" or "penalty" incurred by the model for its predictions, typically the loss aggregated (averaged or summed) over the training set (see the sketch below).
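
A minimal sketch of the distinction, assuming a squared-error loss (an illustrative choice):

    import numpy as np

    def loss(y_pred, y_true):
        return (y_pred - y_true) ** 2            # loss: scores a single prediction

    def cost(y_preds, y_trues):
        return np.mean(loss(y_preds, y_trues))   # cost: loss averaged over the set

    print(loss(0.9, 1.0))                                      # one example
    print(cost(np.array([0.9, 0.2]), np.array([1.0, 0.0])))    # whole (tiny) training set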
8
Q

What is an epoch?

A

An epoch is completed when the model has seen and processed every training sample once.

9
Q

In Stochastic Gradient Descent, we do not care about arriving at a global or local minimum.

What does this mean for the topic?

A
  1. Achieving a low loss function value is often good enough.
  2. The mini-batch size, the learning rate and the initial values play a role in where we land.
  3. The gradient should be large and predictable enough to serve as a good guide for learning.
10
Q

Why is batch size important for Stochastic Gradient Descent?

A

A small batch size leads to a somewhat noisy approximation of the gradient, which in some cases may help us escape from a local minimum.

Therefore:

  1. Subdivide the training set into mini-batches B and perform learning with mini-batches (see the sketch below).
  2. Once all mini-batches have been used once, an epoch has elapsed.
  3. The case of |B| = 1 is called online learning.
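
A minimal sketch of this mini-batch loop, assuming a dataset X, y and a function grad(theta, Xb, yb) returning the mini-batch gradient (illustrative names):

    import numpy as np

    def sgd(grad, theta, X, y, batch_size=100, lr=0.01, epochs=10):
        n = len(X)
        rng = np.random.default_rng(0)
        for _ in range(epochs):                  # one epoch = every sample seen once
            idx = rng.permutation(n)             # shuffle, then split into mini-batches B
            for start in range(0, n, batch_size):
                b = idx[start:start + batch_size]
                theta = theta - lr * grad(theta, X[b], y[b])
        return theta                             # batch_size = 1 would be online learning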
11
Q

Explain the concept of backpropagation

A

The training of the neural network is done using backpropagation.

  1. In a feed-forward NN, an input x produces an output y_hat = f_hat(x, theta).
  2. The input x flows through the layers of the network to produce the output y_hat -> forward propagation.
  3. During training, after forward propagation of a batch X_train of examples x_train,i, we can calculate a loss J(theta).

The backpropagation algorithm allows information from the loss to flow backwards through the network in order to compute the gradient.
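
A minimal sketch of forward and backward propagation for a one-hidden-layer tanh network with squared-error loss; shapes and data are illustrative:

    import numpy as np

    rng = np.random.default_rng(0)
    x, y = rng.normal(size=3), np.array([1.0])
    W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(1, 4))

    # forward propagation: x flows through the layers to produce y_hat
    a = W1 @ x
    h = np.tanh(a)
    y_hat = W2 @ h
    J = 0.5 * np.sum((y_hat - y) ** 2)   # loss J(theta)

    # backward propagation: information from the loss flows backwards
    dy = y_hat - y                       # dJ/dy_hat
    dW2 = np.outer(dy, h)                # gradient for the output layer weights
    dh = W2.T @ dy
    dW1 = np.outer(dh * (1 - h**2), x)   # chain rule through tanh to the hidden layer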

12
Q

What batch size do we want in image processing and image classification?

A
  • Usually small mini-batch sizes are used (|B| =~ 100-500).

Mostly due to complexity reasons: single examples are high-dimensional, networks are large, and the number of available examples is small.

13
Q

What type of batch size do we want in communications?

A

We want a larger mini-batch in communications.

  • When targeting low error rates (< 10^-4), many examples in a mini-batch of size =~ 1000 will never be incorrectly classified.
  • There is not much incentive to improve the classifier during many iterations to cover seldom-occurring outliers.
14
Q

How do we choose a batch size in communications?

A

Solution A: Use large mini-batch sizes, which we generate on the fly anyhow.

  1. Overkill in the initial phase of training.
  2. Once we are stuck in a local minimum, it is difficult to recover.

Solution B: Increase the mini-batch size during training.

  1. Start with a small mini-batch size (=~ 100) to rapidly converge to an approximate solution.
  2. Then increase the mini-batch size to lower the error rate.
    Rule of thumb: final mini-batch size =~ 10/SER, where SER is the targeted symbol error rate.
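
For example, a targeted SER of 10^-4 gives a final mini-batch size of =~ 10 / 10^-4 = 100,000, i.e. each mini-batch then contains about ten symbol errors on average.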
15
Q

Give the definition of a Convolutional Network

A

ConvNets are simply NNs which use convolution instead of matrix multiplication in some of their layers.
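
A minimal sketch of the idea in one dimension: a small filter slides over the input instead of a full weight matrix acting on it (as in deep learning libraries, this is technically cross-correlation); signal and filter values are illustrative:

    import numpy as np

    x = np.arange(8.0)                 # input signal
    w = np.array([1.0, 0.0, -1.0])     # small filter (kernel), much shorter than x
    out = np.array([x[i:i + 3] @ w for i in range(len(x) - 2)])
    print(out)                         # each output uses only a local window of x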

16
Q

What are the complexity advantages of CNNs?

A
  1. Filters are generally much smaller than their inputs.
  2. Only a subset of the input interacts with the filter.
  3. Parallelisation is easy (as long as the computations do not overlap).
  4. Coefficients are reused by different computations.
  5. Far fewer parameters need to be learned.
  6. Memory requirements are reduced by orders of magnitude.
17
Q

What are the equivariant representations of CNNs?

A
  1. Convolution is equivariant to translation: shifting the input shifts the output by the same amount (see the sketch below).
  2. Useful for time series and images.
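
A minimal sketch of translation equivariance in one dimension, using an illustrative signal and filter: convolving a shifted input equals shifting the convolved output:

    import numpy as np

    x = np.array([0.0, 1.0, 2.0, 3.0, 0.0, 0.0, 0.0])
    w = np.array([1.0, -1.0])
    shifted = np.roll(x, 2)                           # translate the input by 2 samples

    out1 = np.convolve(shifted, w, mode="full")       # convolve the shifted input
    out2 = np.roll(np.convolve(x, w, mode="full"), 2) # shift the convolved output
    print(np.allclose(out1, out2))                    # True for this zero-padded example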