Lec4,5 - Artificial Neural Networks Flashcards
Give the equation for the output of a single neuron.
y = f(w^T · x + b), where w are the weights, x the inputs, b the bias and f the activation function
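A minimal NumPy sketch of this (the function name `neuron_output` and the choice of a sigmoid activation are illustrative, not from the lecture):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron_output(w, x, b, f=sigmoid):
    # y = f(w^T x + b): weighted sum of the inputs plus bias, passed through f
    return f(np.dot(w, x) + b)

# Example: a neuron with 3 inputs
y = neuron_output(np.array([0.2, -0.5, 0.1]), np.array([1.0, 2.0, 3.0]), b=0.5)
```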
What is a feed-forward network?
It is a network of stacked fully connected layers (each having a certain number of neurons), e.g.
y = h3(h2(h1(x)))
where h_i is a Layer
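A sketch of this composition with three fully connected layers (the layer sizes and the ReLU activation are illustrative choices):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def dense(x, W, b, f=relu):
    # One fully connected layer: h(x) = f(x W + b)
    return f(x @ W + b)

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 4))                      # a single 4-dimensional input
W1, b1 = rng.normal(size=(4, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 8)), np.zeros(8)
W3, b3 = rng.normal(size=(8, 2)), np.zeros(2)

# y = h3(h2(h1(x)))
y = dense(dense(dense(x, W1, b1), W2, b2), W3, b3)
```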
Describe two methods for weight initialisation and one for bias initialisation.
Weights:
- Normal Distribution
- Xavier/Glorot Initialisation
Bias:
- Set to zeros.
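A sketch of these initialisations in NumPy (the Glorot variant shown is the uniform form and the standard deviation of the normal initialisation is an illustrative choice; the lecture may use different constants):

```python
import numpy as np

rng = np.random.default_rng(0)
fan_in, fan_out = 256, 128

# Normal-distribution initialisation (small standard deviation assumed here)
W_normal = rng.normal(loc=0.0, scale=0.01, size=(fan_in, fan_out))

# Xavier/Glorot uniform initialisation: U(-limit, limit)
limit = np.sqrt(6.0 / (fan_in + fan_out))
W_glorot = rng.uniform(-limit, limit, size=(fan_in, fan_out))

# Biases set to zeros
b = np.zeros(fan_out)
```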
Why do we use activation functions?
Wx + b on its own does not allow us to learn non-linear functions, i.e. functions where the output varies non-linearly with respect to the inputs. Therefore, to introduce non-linearity into the network we apply non-linear activation functions.
Give the names and equations of four common activation functions.
- Sigmoid: f(z) = 1 / (1 + e^(-z))
- Tanh: f(z) = 2 / (1 + e^(-2z)) - 1
- ReLU: f(z) = z if z > 0, else 0
- Softmax: f(z_i) = e^(z_i) / Sum_k e^(z_k)
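These can be written directly in NumPy; a minimal sketch (the max-subtraction in softmax is a standard numerical-stability trick added here, not part of the equations above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return 2.0 / (1.0 + np.exp(-2.0 * z)) - 1.0   # equivalent to np.tanh(z)

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    e = np.exp(z - np.max(z))        # subtract the max for numerical stability
    return e / np.sum(e)
```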
What is a loss function?
The loss function is the function we are trying to minimise such that when we do so we learn the relationship between the given inputs and the desired outputs. It is crucial to select / design the correct loss function in order to be able to not only learn but also learn something meaningful.
Give the equation for the mean squared error loss function for regression.
L = (1 / D) * Sum_d (a_d - y_d)^2, where D is the number of outputs, a_d the prediction and y_d the target
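As a quick check, the same computation in NumPy (values are illustrative):

```python
import numpy as np

def mse(a, y):
    # Mean squared error over D outputs: (1/D) * sum_d (a_d - y_d)^2
    return np.mean((a - y) ** 2)

mse(np.array([2.5, 0.0, 2.0]), np.array([3.0, -0.5, 2.0]))  # -> 0.1666...
```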
Give the equation for Binary Cross Entropy Loss (BCE).
L(y, a) = - (y * log(a) + (1 - y) * log(1 - a))
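A matching NumPy sketch (the small epsilon clip is added here only to avoid log(0); it is not part of the equation above):

```python
import numpy as np

def bce(y, a, eps=1e-12):
    # y is the true label (0 or 1), a the predicted probability
    a = np.clip(a, eps, 1.0 - eps)   # avoid log(0)
    return -(y * np.log(a) + (1.0 - y) * np.log(1.0 - a))

bce(y=1.0, a=0.9)   # small loss: prediction close to the label
bce(y=1.0, a=0.1)   # large loss: confident but wrong
```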
Give the gradients of the loss w.r.t. the input, the weights, the bias and the activation function.
For a fully connected layer Z = XW + B with activation A = f(Z):
- dL/dX = dL/dZ · W^T
- dL/dW = X^T · dL/dZ
- dL/dB = 1^T · dL/dZ
- dL/dZ = dL/dA ⊙ f'(Z), where ⊙ is element-wise multiplication
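A sketch of these gradients for one fully connected layer, assuming a batch X of shape (N, in), W of shape (in, out), and sigmoid chosen as f purely for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dense_backward(X, W, Z, dL_dA):
    # Layer: Z = X W + B, A = f(Z), with f = sigmoid here
    A = sigmoid(Z)
    dL_dZ = dL_dA * A * (1.0 - A)        # dL/dZ = dL/dA ⊙ f'(Z)
    dL_dW = X.T @ dL_dZ                  # dL/dW = X^T · dL/dZ
    dL_dB = np.sum(dL_dZ, axis=0)        # dL/dB = 1^T · dL/dZ
    dL_dX = dL_dZ @ W.T                  # dL/dX = dL/dZ · W^T
    return dL_dX, dL_dW, dL_dB
```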
Give the derivatives of the sigmoid, tanh and ReLU activation functions.
Sigmoid:
f(x) * (1 - f(x))
Tanh:
1 - f^2(x)
ReLU:
1 if x > 0 else 0
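Expressed in terms of the activation value a = f(x), these derivatives are cheap to compute during backpropagation; a minimal sketch:

```python
import numpy as np

def sigmoid_grad(a):      # a = sigmoid(x)
    return a * (1.0 - a)

def tanh_grad(a):         # a = tanh(x)
    return 1.0 - a ** 2

def relu_grad(x):
    return (x > 0).astype(float)
```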
Describe Gradient Descent.
Gradient Descent is an optimisation technique which uses the gradient of a function to find a local minimum. We apply it to the loss function, following the negative gradient to find the weights that minimise it, updating the weights at each step:
W = W - a * dW
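A minimal gradient-descent loop (the toy quadratic loss, learning rate and iteration count are illustrative):

```python
def grad(w):
    # Gradient of the toy loss L(w) = (w - 3)^2
    return 2.0 * (w - 3.0)

w, lr = 0.0, 0.1
for _ in range(100):
    w = w - lr * grad(w)   # W = W - a * dW
# w converges towards the minimum at w = 3
```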
Describe what a learning rate is and the effects of choosing a small/large learning rate.
The learning rate is a hyper-parameter which specifies the step-size the gradient descent algorithm takes when finding the local minimum. A very small learning rate will lead to very slow changes in the weights, as it does not explore the function fast enough, while a very large one will cause large jumps and oscillations, sometimes preventing the algorithm from converging to a minimum.
Describe L1 and L2 regularization.
The L1 regulariser produces sparse weights: it pushes most of the weights to 0, so only the most useful features keep non-zero weights for the network to learn / predict the patterns in the data. This sparsity acts as a form of feature selection: training causes the layer on which L1 regularisation is applied to rely on only a few inputs in order to keep the weights small.
L2, on the other hand, encourages smoothness: it penalises large weights but does not force already-small weights towards 0, so a combination of inputs / features is used. A layer with L2 therefore tries not to rely on a few features but on a large combination of them.
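A sketch of how the two penalties are added to the loss (the lambda values are illustrative hyper-parameters):

```python
import numpy as np

def regularised_loss(data_loss, W, l1=0.0, l2=0.0):
    # L1 penalty (sum of absolute weights) encourages sparsity;
    # L2 penalty (sum of squared weights) encourages small, smooth weights.
    return data_loss + l1 * np.sum(np.abs(W)) + l2 * np.sum(W ** 2)
```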
What is Dropout?
Dropout randomly sets the outputs of neurons in a layer to 0 during training, essentially turning those neurons off, which helps to fight overfitting.
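A sketch of (inverted) dropout at training time, assuming a keep probability p; the 1/p rescaling keeps the expected activation unchanged and may differ from the lecture's exact formulation:

```python
import numpy as np

def dropout(a, p=0.5, training=True, rng=np.random.default_rng(0)):
    if not training:
        return a                       # no dropout at test time
    mask = rng.random(a.shape) < p     # keep each neuron with probability p
    return a * mask / p                # rescale so the expected value is unchanged

a = np.ones((2, 4))
dropout(a, p=0.5)   # roughly half the activations are zeroed out
```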
Describe two data pre-processing methods.
- Data Augmentation (Rotating, Flipping, Blurring etc)
- Normalisation
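A sketch of per-feature normalisation (standardisation to zero mean and unit variance; the epsilon is added here only to guard against division by zero):

```python
import numpy as np

def normalise(X, eps=1e-8):
    # Standardise each feature (column) to zero mean and unit variance
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    return (X - mean) / (std + eps)
```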