Connectionist Prep Flashcards
What is learning by gradient descent? Explain the general idea behind it, and the role the error E has in it.
- an iterative optimisation algorithm that aims to minimise the error (E) of the NN by adjusting the model’s parameters
- involves computing gradients of E with respect to these parameters
- the parameters are adjusted in the opposite direction of the gradient to minimise E
- iteratively reduces E, in pursuit of a global minimum
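A minimal sketch of the update rule described above, applied to a toy quadratic error; the target vector, learning rate, and step count are illustrative assumptions, not part of the card:

```python
import numpy as np

# Toy quadratic error E(w) = ||w - target||^2; "target" is a made-up example.
target = np.array([3.0, -2.0])

def error(w):
    return np.sum((w - target) ** 2)

def grad_error(w):
    return 2 * (w - target)                  # dE/dw

w = np.zeros(2)                              # initial parameters
learning_rate = 0.1
for step in range(100):
    w -= learning_rate * grad_error(w)       # step against the gradient to reduce E

print(error(w))                              # E has shrunk towards its minimum
```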
Briefly describe what the backpropagation algorithm is, and in which way it relates to gradient descent.
- algorithm used to train an artificial NN by minimising the error between the predicted output and the target values
- Two main phases: Forward Pass, Backward Pass (backpropagation)
- FP: Input data is fed into the NN, layer-by-layer computations yield the predicted output
- Backward Pass (backpropagation): works backwards through the layers, propagating the error and computing the gradients that gradient descent then uses to adjust the model’s parameters at each layer.
- uses the chain rule of calculus to calculate the partial derivatives of the loss function with respect to each parameter (weights and biases), as sketched below
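A hedged sketch of both passes for a one-hidden-layer network with a sigmoid hidden layer and a squared-error loss; the data shapes, random values, and learning rate are assumptions made purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))                  # 8 examples, 3 inputs (made up)
T = rng.normal(size=(8, 1))                  # target values (made up)

W1, b1 = rng.normal(size=(3, 4)), np.zeros(4)   # input -> hidden
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)   # hidden -> output
lr = 0.05

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for epoch in range(200):
    # Forward Pass: layer-by-layer computation of the predicted output
    H = sigmoid(X @ W1 + b1)
    Y = H @ W2 + b2                          # linear output unit
    E = 0.5 * np.mean((Y - T) ** 2)

    # Backward Pass: chain rule gives dE/d(parameter) for every layer
    dY = (Y - T) / len(X)
    dW2, db2 = H.T @ dY, dY.sum(axis=0)
    dH = dY @ W2.T
    dZ1 = dH * H * (1 - H)                   # derivative of the sigmoid
    dW1, db1 = X.T @ dZ1, dZ1.sum(axis=0)

    # Gradient-descent step on every parameter
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print(E)                                     # error after training
```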
What are the common problems of gradient descent, that may limit its effectiveness?
- local minima
- slow convergence
- sensitivity to learning rate
- dependence on initial weight selection
Explain the role of activation functions in NN
They play a crucial role by introducing non-linearities into the model, which are essential for enabling NNs to learn complex patterns in the data
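One way to see why the non-linearity matters: without it, stacked layers collapse into a single linear map. A small illustrative check (the random weights are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(3, 4)), rng.normal(size=(4, 2))
x = rng.normal(size=3)

two_linear_layers = (x @ W1) @ W2            # no activation between layers
one_linear_layer = x @ (W1 @ W2)             # exactly the same linear map
print(np.allclose(two_linear_layers, one_linear_layer))   # True

# A non-linearity (e.g. tanh) between the layers breaks this collapse,
# letting the network represent more complex patterns.
with_activation = np.tanh(x @ W1) @ W2
print(with_activation)
```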
What is the purpose of the cost function in a NN
Also known as the loss function, it quantifies the inconsistency between predicted values and the corresponding correct values
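An illustrative example using mean squared error as the cost (one common choice among many); the numbers are made up:

```python
import numpy as np

def mse(predicted, target):
    # Quantifies the inconsistency between predictions and correct values
    return np.mean((predicted - target) ** 2)

print(mse(np.array([0.9, 0.1]), np.array([1.0, 0.0])))   # small cost: close match
print(mse(np.array([0.1, 0.9]), np.array([1.0, 0.0])))   # large cost: poor match
```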
Explain the role of bias terms in a NN
- bias terms add a level of flexibility and adaptability to the model.
- they “shift” the activation function, providing every neuron with a trainable constant value, in addition to the inputs
What is a perceptron
an artificial neuron which takes in many input signals and produces a single binary output signal (0 or 1)
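A minimal sketch of a perceptron; the weights and bias are illustrative values, and the bias also shows the “shift” described on the previous card:

```python
import numpy as np

def perceptron(x, w, b):
    # Weighted sum of inputs plus bias, thresholded to a binary output
    return 1 if np.dot(w, x) + b > 0 else 0

w = np.array([0.5, -0.6])                    # made-up weights
b = -0.1                                     # made-up bias (shifts the threshold)
print(perceptron(np.array([1.0, 0.0]), w, b))   # 1
print(perceptron(np.array([0.0, 1.0]), w, b))   # 0
```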
Explain the differences between Batch Gradient Descent and Stochastic Gradient Descent
In BGD, the model parameters are updated in one go, based on the average gradient of the entire training dataset. In SGD, updates occur for each training example or mini-batch.
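A hedged sketch contrasting the two update schemes on a toy linear-regression error; the synthetic data, batch size, and learning rate are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([2.0, -1.0]) + 0.1 * rng.normal(size=100)
lr = 0.1

def grad(w, Xb, yb):                         # gradient of mean squared error
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

# Batch GD: one update per pass, using the whole dataset
w_bgd = np.zeros(2)
for epoch in range(50):
    w_bgd -= lr * grad(w_bgd, X, y)

# Stochastic (mini-batch) GD: many smaller, noisier updates per pass
w_sgd = np.zeros(2)
for epoch in range(50):
    for i in range(0, len(y), 10):           # mini-batches of 10 examples
        w_sgd -= lr * grad(w_sgd, X[i:i+10], y[i:i+10])

print(w_bgd, w_sgd)                          # both approach [2, -1]
```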
Which gradient descent variant is preferred for large datasets, and why?
Stochastic GD is preferred over Batch GD.
- Although BGD usually converges to a more accurate minimum, it is extremely computationally expensive
- SGD converges faster and requires less memory. However, updates can be noisy, and it may converge to a local minimum rather than the global minimum
Define generalisation
The ability of a trained model to perform well on unseen data
How can you measure the generalisation ability of an MLP
- cross validation
- hold-out strategy (train/test sets)
- consider choice of evaluation measure
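A hedged sketch of the hold-out and cross-validation strategies, assuming scikit-learn is available; the synthetic data and MLP settings are illustrative only:

```python
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)      # made-up labelling rule

# Hold-out strategy: keep a test set the model never sees during training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
mlp = MLPClassifier(hidden_layer_sizes=(10,), max_iter=1000, random_state=0)
mlp.fit(X_train, y_train)
print("hold-out accuracy:", mlp.score(X_test, y_test))

# Cross-validation: average performance over several train/test splits
scores = cross_val_score(MLPClassifier(hidden_layer_sizes=(10,), max_iter=1000, random_state=0),
                         X, y, cv=5)
print("5-fold CV accuracy:", scores.mean())
```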
How can you decide on an optimal number of hidden units?
- apply domain knowledge to estimate a range of candidate sizes
- train and evaluate the model across that range to fine-tune the selection (see the sketch below)
- this may be infeasible for complex models
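A hedged sketch of the fine-tuning step: try a range of hidden-unit counts and keep the one with the best validation score (scikit-learn assumed; the candidate range is an illustrative guess standing in for domain knowledge):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
y = (X[:, 0] * X[:, 1] > 0).astype(int)      # made-up labelling rule
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

best = None
for n_hidden in (2, 4, 8, 16, 32):           # candidate range from "domain knowledge"
    mlp = MLPClassifier(hidden_layer_sizes=(n_hidden,), max_iter=1000, random_state=0)
    mlp.fit(X_train, y_train)
    score = mlp.score(X_val, y_val)
    if best is None or score > best[1]:
        best = (n_hidden, score)

print("best number of hidden units:", best)
```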
Explain the difference between two common activation functions of your choice
Sigmoid vs tanh
1. Output Range:
- Sigmoid: (0,1): used for binary classification
- tanh: (-1, 1): suitable for zero-centred data
2. Symmetry:
- Sigmoid is not symmetric about the origin; its outputs are always positive
- tanh is symmetric around the origin (0, 0)
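A small numerical check of both points (illustrative values only):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5, 5, 11)
print(sigmoid(z).min(), sigmoid(z).max())     # stays inside (0, 1): always positive
print(np.tanh(z).min(), np.tanh(z).max())     # stays inside (-1, 1): zero-centred
print(np.allclose(np.tanh(-z), -np.tanh(z)))  # True: tanh is symmetric about the origin
print(np.allclose(sigmoid(-z), -sigmoid(z)))  # False: sigmoid is not
```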
What are the problems with squared error as the loss function? Give two alternatives.
There are tricky problems with squared error:
- if the desired output is 1 and the actual (logistic) output is very close to 0, there is almost no gradient, so learning is extremely slow
- alternatives: a softmax output layer, or a relative-entropy (cross-entropy) cost
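An illustrative comparison of the output-unit gradients in exactly this case (target 1, logistic output near 0); the pre-activation value is a made-up example:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = -10.0                                    # pre-activation giving an output near 0
y, t = sigmoid(z), 1.0                       # actual output vs desired output

# Squared error: dE/dz = (y - t) * y * (1 - y)  -> vanishes, since y*(1-y) is tiny
grad_squared_error = (y - t) * y * (1 - y)

# Cross-entropy with a logistic/softmax output: dE/dz = y - t  -> stays large
grad_cross_entropy = y - t

print(grad_squared_error)                    # ~ -4.5e-05  (almost no gradient)
print(grad_cross_entropy)                    # ~ -1.0      (a strong error signal)
```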
Define what a Deep Neural Network is
- consists of multiple layers that transform the input in a hierarchical fashion
- they are typically feed-forward NNs with multiple hidden layers, allowing modelling of complex non-linear relationships
Formal definition of overfitting in practice
During learning, the error on the training examples keeps decreasing, but the generalisation error (on held-out data) reaches a minimum and then starts growing again.
Training data contains information about the regularities in the mapping from input to output.
But it also contains noise. Explain how.
- the target values may be unreliable
- there will be accidental regularities just because of the particular training cases that were chosen
When we fit a model, it cannot tell which regularities are real and which are caused by sampling error. Which regularity does it fit, what is the worst case scenario?
- Both
- worst case: If the model is very flexible it can model the sampling error really well
What does a model having the “right capacity” entail
- enough to model the true regularities
- not enough to also model the spurious regularities
How to prevent overfitting in NN
- limiting number of weights
- weight decay
- early stopping
- combining diverse networks
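A hedged sketch of two of these techniques on a toy linear model: weight decay (an L2 penalty added to the gradient) and early stopping (stop once the validation error stops improving); the data and hyperparameters are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0, 0.0, 0.0, 0.0])
X_train, X_val = rng.normal(size=(50, 5)), rng.normal(size=(50, 5))
y_train = X_train @ true_w + 0.5 * rng.normal(size=50)
y_val = X_val @ true_w + 0.5 * rng.normal(size=50)

w = np.zeros(5)
lr, weight_decay, patience = 0.05, 0.01, 10
best_val, best_w, bad_epochs = np.inf, w.copy(), 0

for epoch in range(1000):
    grad = 2 * X_train.T @ (X_train @ w - y_train) / len(y_train)
    w -= lr * (grad + weight_decay * w)       # weight decay keeps the weights small

    val_err = np.mean((X_val @ w - y_val) ** 2)
    if val_err < best_val:
        best_val, best_w, bad_epochs = val_err, w.copy(), 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:            # early stopping
            break

print(best_val, best_w)
```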
Standard ways to limit the capacity of a neural net
- Limit the number of hidden units.
- Limit the size of the weights.
- Stop the learning before it has time to overfit.