Lec4,5 - Artificial Neural Networks Flashcards

1
Q

Give the equation for the output of a single neuron.

A

y = f(w^T * x)
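A rough Python/NumPy sketch of this equation (the sigmoid choice of f and the example values are assumptions, not from the lecture):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.2, 3.0])   # input vector
w = np.array([0.1, 0.4, -0.2])   # weight vector
y = sigmoid(w @ x)               # y = f(w^T x)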

2
Q

What is a feed-forward network?

A

It is a network of stacked fully connected layers (each with a certain number of neurons), where the output of each layer is the input to the next, e.g.
y = h3(h2(h1(x)))
where each h_i is a layer.
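A minimal NumPy sketch of three stacked layers (the layer sizes, the ReLU activation and all names are illustrative):

import numpy as np

def relu(z):
    return np.maximum(0, z)

def layer(x, W, b, f=relu):
    # one fully connected layer: f(xW + b)
    return f(x @ W + b)

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 4))                      # a single input example
W1, b1 = rng.normal(size=(4, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 8)), np.zeros(8)
W3, b3 = rng.normal(size=(8, 2)), np.zeros(2)

# y = h3(h2(h1(x))): each layer's output feeds the next
y = layer(layer(layer(x, W1, b1), W2, b2), W3, b3)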

3
Q

Describe two methods for weight initialisation and one for bias initialisation.

A

Weights:

  • Normal Distribution
  • Xavier/Glorot Initialisation

Bias:

  • Set to zeros.
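A NumPy sketch of these initialisations (the 0.01 scale for the normal distribution and the uniform variant of Xavier/Glorot are assumptions, not quoted from the lecture):

import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 256, 128

# 1) Normal distribution (the 0.01 standard deviation is illustrative)
W_normal = rng.normal(loc=0.0, scale=0.01, size=(n_in, n_out))

# 2) Xavier/Glorot: scale depends on the fan-in and fan-out of the layer
limit = np.sqrt(6.0 / (n_in + n_out))
W_xavier = rng.uniform(-limit, limit, size=(n_in, n_out))

# Bias: set to zeros
b = np.zeros(n_out)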

4
Q

Why do we use activation functions?

A

π‘Šπ‘₯+𝑏 on it’s own doesn’t allow us to learn non-linear functions such that the output varies non-linearly with respect to the inputs. Therefore, to introduce non-linearity into the network we apply non-linear activation functions.

5
Q

Give the names and equations of four common activation functions.

A
  • Sigmoid: f(z) = 1 / (1 + e^(-z))
  • Tanh: f(z) = 2 / (1 + e^(-2z)) - 1
  • ReLU: f(z) = z if z > 0 else 0
  • Softmax: f(z_i) = e^(z_i) / Sum_k e^(z_k)
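The same four functions as a NumPy sketch (subtracting the max in softmax is just a numerical-stability detail, not part of the definition):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return 2.0 / (1.0 + np.exp(-2.0 * z)) - 1.0   # equivalent to np.tanh(z)

def relu(z):
    return np.where(z > 0, z, 0.0)

def softmax(z):
    e = np.exp(z - np.max(z))   # shift by the max for numerical stability
    return e / e.sum()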
6
Q

What is a loss function?

A

The loss function is the function we are trying to minimise such that when we do so we learn the relationship between the given inputs and the desired outputs. It is crucial to select / design the correct loss function in order to be able to not only learn but also learn something meaningful.

7
Q

Give the equation for the mean squared error loss function for regression.

A

L = (1 / D) * Sum_{d=1}^{D} (a_d - y_d)^2

where D is the number of terms averaged over, a_d is the prediction and y_d is the target.
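A direct NumPy translation of this formula (the example values are made up):

import numpy as np

def mse(a, y):
    # mean of the squared differences between predictions and targets
    return np.mean((a - y) ** 2)

a = np.array([0.9, 2.1, 3.2])   # predictions
y = np.array([1.0, 2.0, 3.0])   # targets
print(mse(a, y))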

8
Q

Give the equation for Binary Cross Entropy Loss (BCE).

A

L(y, a) = - (y * log(a) + (1 - y) * log(1 - a))
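A NumPy sketch of the same formula (the clipping is only an implementation detail to keep the logs finite):

import numpy as np

def bce(y, a, eps=1e-12):
    a = np.clip(a, eps, 1.0 - eps)   # keep a away from exactly 0 and 1
    return -(y * np.log(a) + (1.0 - y) * np.log(1.0 - a))

print(bce(1.0, 0.9))   # small loss: confident and correct
print(bce(1.0, 0.1))   # large loss: confident and wrong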

9
Q

Give the gradients of the loss w.r.t. the input, the weights, the bias and the activation function.

A
dL/dX = dL/dZ * W^T
dL/dW = X^T * dL/dZ
dL/dB = 1^T * dL/dZ
dL/dA = dL/dZ ∘ f'(X)
where ∘ is element-wise multiplication
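A NumPy sketch of these formulas, assuming the layer computed Z = XW + B on a batch X with one example per row (that convention is implied by the dL/dW and dL/dB equations rather than stated explicitly):

import numpy as np

def linear_backward(dL_dZ, X, W):
    dL_dX = dL_dZ @ W.T         # dL/dX = dL/dZ W^T
    dL_dW = X.T @ dL_dZ         # dL/dW = X^T dL/dZ
    dL_dB = dL_dZ.sum(axis=0)   # 1^T dL/dZ: sum over the batch
    return dL_dX, dL_dW, dL_dB

def activation_backward(dL_dZ, f_prime, X):
    # dL/dA = dL/dZ ∘ f'(X): element-wise product with the activation derivative
    return dL_dZ * f_prime(X)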
10
Q

Give the derivatives of the sigmoid, tanh and ReLU activation functions.

A

Sigmoid:
f(x) * (1 - f(x))

Tanh:
1 - f^2(x)

ReLU:
1 if x > 0 else 0
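The same derivatives as a NumPy sketch:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1.0 - s)            # f(x) * (1 - f(x))

def tanh_prime(x):
    return 1.0 - np.tanh(x) ** 2    # 1 - f(x)^2

def relu_prime(x):
    return np.where(x > 0, 1.0, 0.0)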

11
Q

Describe Gradient Descent

A

Gradient Descent is an optimisation technique which follows the negative gradient of a function to find a local minimum. We usually apply it to the loss in order to find the best possible weights, updating the weights as:

W = W - a * dW

where a is the learning rate and dW is the gradient of the loss w.r.t. W.
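A minimal sketch of the update rule on a toy loss L(W) = ||W||^2, whose gradient is 2W (the learning rate and step count are arbitrary choices):

import numpy as np

def gradient_descent_step(W, dW, lr):
    # move the weights against the gradient of the loss
    return W - lr * dW

W = np.array([3.0, -2.0])
for _ in range(100):
    W = gradient_descent_step(W, dW=2 * W, lr=0.1)
print(W)   # close to the minimum at [0, 0]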

12
Q

Describe what a learning rate is and the effects of choosing a small/large learning rate.

A

The learning rate is a hyper-parameter which specifies the step size the gradient descent algorithm takes when searching for a local minimum. A very small learning rate leads to very slow changes in the weights, since each step barely moves through the loss landscape, while a very large one causes large jumps and oscillations, sometimes preventing the algorithm from converging to a minimum.
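A tiny illustration of these effects on L(w) = w^2, starting from w = 1 (the specific learning-rate values are arbitrary):

def minimise(lr, steps=20):
    # plain gradient descent on L(w) = w^2, whose gradient is 2w
    w = 1.0
    for _ in range(steps):
        w -= lr * 2 * w
    return w

print(minimise(lr=0.01))   # too small: still far from the minimum at 0
print(minimise(lr=0.4))    # reasonable: converges close to 0
print(minimise(lr=1.1))    # too large: oscillates and diverges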

13
Q

Describe L1 and L2 regularization.

A

The L1 regulariser produces sparse weights: intuitively, it pushes most of the weights to 0, so only the most useful features keep non-zero weights for the network to learn/predict the patterns in the data. This sparsity amounts to feature selection: a layer with L1 regularisation learns to rely on only a few of its inputs in order to keep the weights small.

L2, on the other hand, encourages smoothness: weights are not forced towards 0 once they are already small, so the layer is encouraged to use a combination of inputs/features. A layer with L2 therefore avoids relying on just a few features and instead spreads the weight across many of them.
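A sketch of how the two penalties are typically added to the loss (the lambda value and the exact forms, sum of |w| for L1 and sum of w^2 for L2, are the standard ones rather than quoted from the lecture):

import numpy as np

def l1_penalty(W, lam):
    return lam * np.sum(np.abs(W))   # pushes weights towards exactly 0 (sparsity)

def l2_penalty(W, lam):
    return lam * np.sum(W ** 2)      # keeps weights small without zeroing them

def regularised_loss(data_loss, W, lam=1e-3, kind="l2"):
    penalty = l1_penalty(W, lam) if kind == "l1" else l2_penalty(W, lam)
    return data_loss + penalty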

14
Q

What is Dropout?

A

Dropout randomly sets outputs of a layer to 0 during training, essentially turning off neurons, which helps to fight overfitting.
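A sketch of (inverted) dropout in NumPy; the 1/(1 - p_drop) rescaling is a common implementation choice and not necessarily the lecture's exact formulation:

import numpy as np

def dropout(a, p_drop):
    # zero each output with probability p_drop; rescaling keeps the
    # expected activation unchanged during training
    mask = np.random.rand(*a.shape) >= p_drop
    return a * mask / (1.0 - p_drop)

a = np.ones((2, 5))
print(dropout(a, p_drop=0.5))   # roughly half the entries are zeroed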

15
Q

Describe two data pre-processing methods.

A
  • Data Augmentation (rotating, flipping, blurring, etc.)
  • Normalisation
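A short NumPy sketch of both ideas (feature-wise standardisation for normalisation, and a horizontal flip as one example of augmentation):

import numpy as np

rng = np.random.default_rng(0)

# Normalisation: zero mean and unit variance per feature
X = rng.normal(loc=5.0, scale=3.0, size=(100, 4))   # raw features
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)

# Data augmentation: e.g. flip an "image" horizontally
img = rng.random((8, 8))
img_flipped = np.flip(img, axis=1)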
