Lec4,5 - Artificial Neural Networks Flashcards
Give the equation for the output of a single neuron.
y = f(w^T · x + b), where w are the weights, x the inputs, b the bias and f the activation function
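A minimal NumPy sketch of this (the function name `neuron_output` and the choice of a sigmoid activation are illustrative, not from the lecture):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron_output(w, x, b, f=sigmoid):
    # y = f(w^T x + b): weighted sum of the inputs plus bias, passed through f
    return f(np.dot(w, x) + b)

# Example: a neuron with 3 inputs
y = neuron_output(np.array([0.2, -0.5, 0.1]), np.array([1.0, 2.0, 3.0]), b=0.5)
```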
What is a feed-forward network?
It is a network of stacked fully connected layers (each having a certain number of neurons), e.g.
y = h3(h2(h1(x)))
where h_i is a Layer
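A sketch of this composition with three fully connected layers (the layer sizes and the ReLU activation are illustrative choices):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def dense(x, W, b, f=relu):
    # One fully connected layer: h(x) = f(x W + b)
    return f(x @ W + b)

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 4))                      # a single 4-dimensional input
W1, b1 = rng.normal(size=(4, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 8)), np.zeros(8)
W3, b3 = rng.normal(size=(8, 2)), np.zeros(2)

# y = h3(h2(h1(x)))
y = dense(dense(dense(x, W1, b1), W2, b2), W3, b3)
```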
Describe two methods for weight initialisation and one for bias initialisation.
Weights:
- Normal Distribution
- Xavier/Glorot Initialisation
Bias:
- Set to zeros.
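A sketch of these initialisations in NumPy (the Glorot variant shown is the uniform form and the standard deviation of the normal initialisation is an illustrative choice; the lecture may use different constants):

```python
import numpy as np

rng = np.random.default_rng(0)
fan_in, fan_out = 256, 128

# Normal-distribution initialisation (small standard deviation assumed here)
W_normal = rng.normal(loc=0.0, scale=0.01, size=(fan_in, fan_out))

# Xavier/Glorot uniform initialisation: U(-limit, limit)
limit = np.sqrt(6.0 / (fan_in + fan_out))
W_glorot = rng.uniform(-limit, limit, size=(fan_in, fan_out))

# Biases set to zeros
b = np.zeros(fan_out)
```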
Why do we use activation functions?
Wx + b on its own does not allow us to learn non-linear functions, i.e. functions where the output varies non-linearly with respect to the inputs. Therefore, to introduce non-linearity into the network we apply non-linear activation functions.
Give the names and equations of four common activation functions.
- Sigmoid: f(z) = 1 / (1 + e^(-z))
- Tanh: f(z) = 2 / (1 + e^(-2z)) - 1
- ReLU: f(z) = z if z > 0, else 0
- Softmax: f(z_i) = e^(z_i) / Sum_k e^(z_k)
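These can be written directly in NumPy; a minimal sketch (the max-subtraction in softmax is a standard numerical-stability trick added here, not part of the equations above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return 2.0 / (1.0 + np.exp(-2.0 * z)) - 1.0   # equivalent to np.tanh(z)

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    e = np.exp(z - np.max(z))        # subtract the max for numerical stability
    return e / np.sum(e)
```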
What is a loss function?
The loss function is the function we are trying to minimise such that when we do so we learn the relationship between the given inputs and the desired outputs. It is crucial to select / design the correct loss function in order to be able to not only learn but also learn something meaningful.
Give the equation for the mean squared error loss function for regression.
L = (1 / D) * Sum_d (a_d - y_d)^2, where D is the number of outputs, a_d the prediction and y_d the target
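As a quick check, the same computation in NumPy (values are illustrative):

```python
import numpy as np

def mse(a, y):
    # Mean squared error over D outputs: (1/D) * sum_d (a_d - y_d)^2
    return np.mean((a - y) ** 2)

mse(np.array([2.5, 0.0, 2.0]), np.array([3.0, -0.5, 2.0]))  # -> 0.1666...
```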
Give the equation for Binary Cross Entropy Loss (BCE).
L(y, a) = - (y * log(a) + (1 - y) * log(1 - a))
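A matching NumPy sketch (the small epsilon clip is added here only to avoid log(0); it is not part of the equation above):

```python
import numpy as np

def bce(y, a, eps=1e-12):
    # y is the true label (0 or 1), a the predicted probability
    a = np.clip(a, eps, 1.0 - eps)   # avoid log(0)
    return -(y * np.log(a) + (1.0 - y) * np.log(1.0 - a))

bce(y=1.0, a=0.9)   # small loss: prediction close to the label
bce(y=1.0, a=0.1)   # large loss: confident but wrong
```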
Give the gradients of the loss w.r.t. the input, the weights, the bias and the activation function.
For a fully connected layer Z = XW + B with activation A = f(Z):
- dL/dX = dL/dZ · W^T
- dL/dW = X^T · dL/dZ
- dL/dB = 1^T · dL/dZ
- dL/dZ = dL/dA ⊙ f'(Z), where ⊙ is element-wise multiplication
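A sketch of these gradients for one fully connected layer, assuming a batch X of shape (N, in), W of shape (in, out), and sigmoid chosen as f purely for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dense_backward(X, W, Z, dL_dA):
    # Layer: Z = X W + B, A = f(Z), with f = sigmoid here
    A = sigmoid(Z)
    dL_dZ = dL_dA * A * (1.0 - A)        # dL/dZ = dL/dA ⊙ f'(Z)
    dL_dW = X.T @ dL_dZ                  # dL/dW = X^T · dL/dZ
    dL_dB = np.sum(dL_dZ, axis=0)        # dL/dB = 1^T · dL/dZ
    dL_dX = dL_dZ @ W.T                  # dL/dX = dL/dZ · W^T
    return dL_dX, dL_dW, dL_dB
```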
Give the derivatives of the sigmoid, tanh and ReLU activation functions.
Sigmoid:
f(x) * (1 - f(x))
Tanh:
1 - f^2(x)
ReLU:
1 if x > 0 else 0
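Expressed in terms of the activation value a = f(x), these derivatives are cheap to compute during backpropagation; a minimal sketch:

```python
import numpy as np

def sigmoid_grad(a):      # a = sigmoid(x)
    return a * (1.0 - a)

def tanh_grad(a):         # a = tanh(x)
    return 1.0 - a ** 2

def relu_grad(x):
    return (x > 0).astype(float)
```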
Describe Gradient Descent.
Gradient Descent is an optimisation technique which uses the gradient of a function to find a local minimum. We apply it to the loss function, following the negative gradient to find the weights that minimise it, updating the weights at each step:
W = W - a * dW
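A minimal gradient-descent loop (the toy quadratic loss, learning rate and iteration count are illustrative):

```python
def grad(w):
    # Gradient of the toy loss L(w) = (w - 3)^2
    return 2.0 * (w - 3.0)

w, lr = 0.0, 0.1
for _ in range(100):
    w = w - lr * grad(w)   # W = W - a * dW
# w converges towards the minimum at w = 3
```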
Describe what a learning rate is and the effects of choosing a small/large learning rate.
The learning rate is a hyper-parameter which specifies the step-size the gradient descent algorithm takes when finding the local minimum. A very small learning rate will lead to very slow changes in the weights, as it does not explore the function fast enough, while a very large one will cause large jumps and oscillations, sometimes preventing the algorithm from converging to a minimum.
Describe L1 and L2 regularization.
The L1 regulariser produces sparse weights: it pushes most of the weights to 0, so only the most useful features keep non-zero weights for the network to learn / predict the patterns in the data. This sparsity acts as a form of feature selection: training causes the layer on which L1 regularisation is applied to rely on only a few inputs in order to keep the weights small.
L2, on the other hand, encourages smoothness: it penalises large weights but does not force already-small weights towards 0, so a combination of inputs / features is used. A layer with L2 therefore tries not to rely on a few features but on a large combination of them.
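A sketch of how the two penalties are added to the loss (the lambda values are illustrative hyper-parameters):

```python
import numpy as np

def regularised_loss(data_loss, W, l1=0.0, l2=0.0):
    # L1 penalty (sum of absolute weights) encourages sparsity;
    # L2 penalty (sum of squared weights) encourages small, smooth weights.
    return data_loss + l1 * np.sum(np.abs(W)) + l2 * np.sum(W ** 2)
```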
What is Dropout?
Dropout randomly sets the outputs of neurons in a layer to 0 during training, essentially turning those neurons off, which helps to fight overfitting.
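A sketch of (inverted) dropout at training time, assuming a keep probability p; the 1/p rescaling keeps the expected activation unchanged and may differ from the lecture's exact formulation:

```python
import numpy as np

def dropout(a, p=0.5, training=True, rng=np.random.default_rng(0)):
    if not training:
        return a                       # no dropout at test time
    mask = rng.random(a.shape) < p     # keep each neuron with probability p
    return a * mask / p                # rescale so the expected value is unchanged

a = np.ones((2, 4))
dropout(a, p=0.5)   # roughly half the activations are zeroed out
```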
Describe two data pre-processing methods.
- Data Augmentation (Rotating, Flipping, Blurring etc)
- Normalisation
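A sketch of per-feature normalisation (standardisation to zero mean and unit variance; the epsilon is added here only to guard against division by zero):

```python
import numpy as np

def normalise(X, eps=1e-8):
    # Standardise each feature (column) to zero mean and unit variance
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    return (X - mean) / (std + eps)
```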