Connectionist Prep Flashcards
What is learning by gradient descent? Explain the general idea behind it, and the role the error E has in it.
- algorithm which aims to minimise the error (E) of the NN by adjusting the model’s parameters
- involves computing gradients of E with respect to these parameters
- the parameters are adjusted in the opposite direction of the gradient to minimise E
- iteratively reduces E, in pursuit of a global minimum
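A minimal sketch of one full gradient-descent loop (the quadratic error surface and learning rate here are illustrative, not from the cards):

```python
import numpy as np

def E(w):
    """Illustrative quadratic error: minimised at w = [3, 3]."""
    return np.sum((w - 3.0) ** 2)

def grad_E(w):
    """Gradient of E with respect to the parameters."""
    return 2.0 * (w - 3.0)

w = np.zeros(2)          # initial parameters
lr = 0.1                 # learning rate
for _ in range(100):
    w -= lr * grad_E(w)  # step opposite to the gradient to reduce E

print(w, E(w))           # w approaches [3, 3], E approaches 0
```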
Briefly describe what the backpropagation algorithm is, and in which way it relates to gradient descent.
- algorithm used to train artificial NN by minimising the error between the predicted output and the target values
- Two main phases: Forward Pass, Backward Pass (backpropagation)
- FP: Input data is fed into the NN, layer-by-layer computations yield the predicted output
- Backpropagation: works backwards through the layers, computing the error gradients that gradient descent then uses to adjust the model’s parameters at each layer
- involves calculus to calculate the partial derivatives of the loss function with respect to each parameter (weights and bias)
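A minimal sketch of the two phases for a one-hidden-layer network with squared error (the layer sizes, random data, and learning rate are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))                      # 4 examples, 3 inputs
t = rng.normal(size=(4, 1))                      # target values
W1, b1 = rng.normal(size=(3, 5)), np.zeros(5)    # hidden layer (5 units)
W2, b2 = rng.normal(size=(5, 1)), np.zeros(1)    # output layer

lr = 0.1
for _ in range(1000):
    # Forward pass: layer-by-layer computation of the prediction
    h = sigmoid(x @ W1 + b1)
    y = h @ W2 + b2                              # linear output unit
    # Backward pass: chain rule from the error back through each layer
    dy = (y - t) / len(x)                        # dE/dy for mean squared error
    dW2, db2 = h.T @ dy, dy.sum(axis=0)
    dh = dy @ W2.T
    dz = dh * h * (1.0 - h)                      # sigmoid' = h * (1 - h)
    dW1, db1 = x.T @ dz, dz.sum(axis=0)
    # Gradient descent: adjust every parameter against its gradient
    for p, g in ((W1, dW1), (b1, db1), (W2, dW2), (b2, db2)):
        p -= lr * g
```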
What are the common problems of gradient descent, that may limit its effectiveness?
- local minima
- slow convergence
- sensitivity to learning rate
- dependence on initial weight selection
Explain the role of activation functions in NN
They play a crucial role by introducing non-linearities to the model, which are essential for enabling NN to learn complex patterns in the data
What is the purpose of the cost function in a NN
Also known as the loss function, it quantifies the inconsistency between predicted values and the corresponding correct values
Explain the role of bias terms in a NN
- bias terms add a level of flexibility and adaptability to the model.
- they “shift” the activation function, providing every neuron with a trainable constant value, in addition to the inputs
What is a perceptron
an artificial neuron which takes in many input signals and produces a single binary output signal (0 or 1)
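A sketch of a perceptron with hand-picked weights (the AND task and the values of w and b are illustrative); note the bias b acting as the trainable threshold shift described in the previous card:

```python
import numpy as np

def perceptron(x, w, b):
    """Weighted sum of the inputs plus bias, thresholded to 0 or 1."""
    return 1 if np.dot(w, x) + b > 0 else 0

# Hand-picked weights computing logical AND; the bias b shifts the threshold
w, b = np.array([1.0, 1.0]), -1.5
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, perceptron(np.array(x), w, b))   # fires only for (1, 1)
```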
Explain the differences between Batch Gradient Descent and Stochastic Gradient Descent
in BGD, the model parameters are updated in one go, based on the average gradient of the entire training dataset. In SGD, updates occur for each training example or mini-batch.
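A sketch contrasting the two update schedules on a toy linear model (the model, data, and learning rate are illustrative):

```python
import numpy as np

def grad(w, x, t):
    """Squared-error gradient for an illustrative linear model y = w * x."""
    return 2.0 * (w * x - t) * x

X = np.array([1.0, 2.0, 3.0])
T = np.array([2.0, 4.0, 6.0])          # toy data for y = 2x
lr, w_batch, w_sgd = 0.05, 0.0, 0.0

# BGD: a single update from the average gradient over the whole dataset
w_batch -= lr * np.mean([grad(w_batch, x, t) for x, t in zip(X, T)])

# SGD: one update per training example (a mini-batch would group a few)
for x, t in zip(X, T):
    w_sgd -= lr * grad(w_sgd, x, t)
```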
Which type of gradient descent is preferred for large datasets, and why?
Stochastic GD is preferred over Batch GD.
- Although BGD usually converges to a more accurate minimum, it is extremely computationally expensive
- SGD converges faster and requires less memory. However, updates can be noisy, and it may converge to a local minimum rather than the global minimum
Define generalisation
The ability of a trained model to perform well on unseen data
How can you measure the generalisation ability of an MLP
- cross validation
- hold-out strategy (train/test sets)
- consider choice of evaluation measure
How can you decide on an optimal number of hidden units?
- apply domain knowledge to estimate a range
- test the model across that range to fine-tune the selection
- this may be infeasible for complex models
Explain the difference between two common activation functions of your choice
Sigmoid vs tanh
1. Output Range:
- Sigmoid: (0,1): used for binary classification
- tanh: (-1, 1): suitable for zero-centred data
2. Symmetry:
- Sigmoid is asymmetric, biased towards positive values
- tanh is symmetric around the origin (0, 0)
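A quick numerical check of both properties (the sample points are arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z))    # [0.119 0.5   0.881] -> range (0, 1), biased positive
print(np.tanh(z))    # [-0.964 0.    0.964] -> range (-1, 1), symmetric at 0
```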
What are the problems with squared error as the loss function, give two alternatives
Squared error has some tricky problems:
- with a sigmoid output, if the desired output is 1 and the actual output is very close to 0, there is almost no gradient, so learning stalls
- alternatives: a softmax output with cross-entropy cost, or relative entropy
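A worked check of the saturation problem, comparing squared error against cross-entropy (standing in here for the relative-entropy alternative) at a saturated sigmoid output; the logit value is illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z, t = -6.0, 1.0        # actual output y is near 0, desired output is 1
y = sigmoid(z)          # ~0.0025

# Squared error E = (y - t)^2 / 2: dE/dz = (y - t) * y * (1 - y)
print((y - t) * y * (1.0 - y))   # ~ -0.0025 -> almost no gradient
# Cross-entropy E = -t*ln(y) - (1-t)*ln(1-y): dE/dz = y - t
print(y - t)                     # ~ -0.9975 -> a healthy gradient
```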
Define what a Deep Neural Network is
- consists of multiple layers that transform the input in a hierarchical fashion
- they typically are feed-forward NN with multiple hidden layers, allowing modelling of complex non-linear relationships
Formal definition of overfitting in practice
During learning, the error on the training examples keeps decreasing, but the generalisation error reaches a minimum and then starts growing again.
Training data contains information about the regularities in the mapping from input to output.
But it also contains noise, explain how.
- the target values may be unreliable
- there will be accidental regularities just because of the particular training cases that were chosen
When we fit a model, it cannot tell which regularities are real and which are caused by sampling error. Which regularity does it fit, what is the worst case scenario?
- Both
- worst case: If the model is very flexible it can model the sampling error really well
What does a model having the “right capacity” entail
- enough to model the true regularities
- not enough to also model the spurious regularities
How to prevent overfitting in NN
- limiting number of weights
- weight decay
- early stopping
- combining diverse networks
Standard ways to limit the capacity of a neural net
- Limit the number of hidden units.
- Limit the size of the weights.
- Stop the learning before it has time to overfit.
How to limit the size of the model by using fewer hidden units in practice
trial and error
What is weight-decay
- method for limiting the size of a model
- involves adding an extra term to the cost function that penalises the squared weights, i.e. keeps weights small unless they have large error derivatives
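A sketch of the extra term, assuming an L2 penalty with an illustrative coefficient lam:

```python
import numpy as np

lam = 0.01  # weight-decay coefficient (illustrative)

def cost(E, w):
    """Penalised cost: original error plus half the squared weights."""
    return E + 0.5 * lam * np.sum(w ** 2)

def grad_cost(grad_E, w):
    """Each weight feels an extra pull of lam * w back towards zero."""
    return grad_E + lam * w

w = np.array([0.5, -2.0])
print(grad_cost(np.zeros(2), w))  # with zero error gradient, weights just decay
```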
What does weight decay prevent, and what does it improve and how?
- It prevents the network from using weights that it does not need.
- It tends to keep the network in the linear region, where its capacity is lower.
- This helps to stop it from fitting the sampling error. It makes a smoother model in which the output changes more slowly as the input changes.
- It can often improve generalisation a lot.
What is the idea behind early stopping for preventing overfitting
- expensive to train a big model with lots of data
- cheaper to stop adjusting weights once generalisation starts getting worse
- the capacity is limited because the weights have not had time to grow big
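A sketch of the idea on a hypothetical 1-D regression problem (the data, patience, and learning rate are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x_tr, x_va = rng.normal(size=50), rng.normal(size=50)
y_tr = 2.0 * x_tr + rng.normal(scale=0.5, size=50)   # noisy training targets
y_va = 2.0 * x_va                                    # clean validation targets

w, lr = 0.0, 0.01
best_err, best_w, bad, patience = float("inf"), w, 0, 5
for epoch in range(500):
    w -= lr * np.mean(2.0 * (w * x_tr - y_tr) * x_tr)  # one training epoch
    err = np.mean((w * x_va - y_va) ** 2)              # validation error
    if err < best_err:
        best_err, best_w, bad = err, w, 0              # new best snapshot
    else:
        bad += 1
        if bad >= patience:      # generalisation stopped improving
            break                # stop before the weights grow big
w = best_w                       # keep the weights from the best epoch
```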
What hold-out strategy is recommended for model selection of an MLP
Strategies which include a validation set
What is gradient descent
- optimisation technique used to find optimal parameters of a model by iteratively updating them in the direction of the steepest descent of the loss function.
- aims to minimise the error of the model
What is the recommended method for the three-way hold-out strategy when training a NN
- training data: used for learning the parameters of the model
- validation data: used for deciding what type of model and what amount of regularisation works best (fine tuning)
- test data: used to get a final, unbiased estimate of how well the network works
Why NN ensembles?
For squared error, the error of the group’s averaged prediction is never larger than the average error of the individual predictors, and it is strictly smaller unless the predictors are identical
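A small numerical illustration for squared error (the target and predictions are made up):

```python
import numpy as np

target = 1.0
preds = np.array([0.4, 0.9, 1.6])   # three individual predictors

avg_of_errors = np.mean((preds - target) ** 2)   # ~0.243
error_of_avg = (np.mean(preds) - target) ** 2    # ~0.001
print(avg_of_errors, error_of_avg)  # the averaged predictor is never worse
```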
Briefly explain the steps of k-Fold Cross Validation
- Divide the data into k disjoint subsets - “folds”
- For each of k experiments, use k-1 folds for training and the remaining fold for testing.
- Repeat for all k folds, average the accuracy/error rates.
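A sketch of the fold bookkeeping (the data size, k, and the train/evaluate steps are placeholders):

```python
import numpy as np

def k_fold_indices(n, k, seed=0):
    """Split indices 0..n-1 into k disjoint folds."""
    idx = np.random.default_rng(seed).permutation(n)
    return np.array_split(idx, k)

folds = k_fold_indices(n=100, k=5)
for i, test_idx in enumerate(folds):
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    # train on train_idx, evaluate on test_idx, then average the k scores
```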
How to achieve Network Ensembling with just one training
Use Dropout method
- during training, at each step knock out some randomly chosen connections
- when predicting, use all connections. You will need to introduce a normalising constant for this to work
- equivalent to having a very large ensemble of networks
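A sketch of a unit-level "inverted" dropout layer, where the normalising constant mentioned above is folded into training so that prediction can use all connections unchanged (p_keep is illustrative):

```python
import numpy as np

def dropout(h, p_keep=0.5, training=True):
    """Inverted dropout on a layer's activations h."""
    if training:
        mask = np.random.rand(*h.shape) < p_keep  # knock out random units
        return h * mask / p_keep                  # rescale the survivors
    return h                                      # prediction: use everything
```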
Precautions for applying dropout in practice
- it doesn’t always work: some preconditions are required
- you must begin with an oversized network, since knocking out connections reduces capacity and can otherwise cause underfitting
What is online learning
Weight updates occur for each example during Gradient Descent
Discuss the magnitude of the gradient at different slopes of the error surface. What does gradient descent do at these values, and are we satisfied with this?
- the gradient is large where the error surface is steep and small where it is flat
- sometimes we would like to move fast where it’s flat and slow down when it gets too steep; GD does precisely the contrary
Briefly discuss some of the fixes for the issues of Gradient Descent
Use an adaptive learning rate:
- increase the rate slowly if it’s not diverging
- decrease the rate quickly when it starts diverging
Use Momentum: instead of using the gradient to change the position of the weight directly, use it to change the velocity of the weight change
Use a fixed step size: the gradient decides the direction to go, but the step length always stays the same
Normalise the gradient based on some combination of previous gradients
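A sketch of the momentum fix on an illustrative quadratic error (the learning rate and momentum coefficient are made up):

```python
def grad(w):
    return 2.0 * (w - 3.0)      # illustrative quadratic error gradient

w, v, lr, mu = 0.0, 0.0, 0.1, 0.9
for _ in range(100):
    v = mu * v - lr * grad(w)   # the gradient changes the velocity...
    w += v                      # ...and the velocity changes the weight
print(w)                        # converges towards the minimiser 3.0
```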
Explain what RL is, and what we want to learn from it
Learning from interaction with an environment to achieve some long-term goal that is related to the state of the environment
- we want to learn how to act to accomplish goals
- given an environment that contains rewards, we want to learn a policy for acting
Define a simple RL Setup And Goal
Setup: an agent interacts with an environment, which it can affect through its actions. The agent may be able to sense the environment partially or fully.
Goal: the agent tries to maximise the long-term reward, conveyed through a reward signal
Explain the differences between Supervised and Reinforcement Learning
- In SL, there’s an external “supervisor” with knowledge of the environment, which it shares with the agent to complete the task
- Both strategies use mappings between inputs and outputs, but in RL there is a reward function which acts as a feedback to the agent
- Supervised learning relies on labelled training data
Explain the differences between Unsupervised and Reinforcement Learning
- In UL, there is no feedback from the environment
- In UL, the task is to find the underlying patterns rather than a mapping from input to output
Characteristics of RL
- no supervisor, only a reward signal
- feedback is delayed, not instantaneous
- time really matters
- the agent’s actions affect the subsequent data it receives
Why is Deep Learning hard
- when networks get deep the gradient vanishes
- in an untrained network, the deeper down a hidden unit is, the more subtle its effect on the outputs
- this means changing it does little to the error, so it receives almost no gradient
Briefly discuss the two main categories for Deep Learning solutions
Pre-training:
- build the deep network by stacking one layer at a time
- make sure each layer represents the previous layer meaningfully before adding another layer
Use Artificial Targets: the real problem is that the inner layers get no useful gradient, so every now and then give them hard targets:
- generate some random targets for the layer
- evaluate them all
- use the best one
Advantages and disadvantages of pre-training by auto-association
Adv: you can use unlabelled data
Disadvantage (potentially disastrous):
- the property you want to predict is not considered at all
- you compress regardless of the property; if the compression is lossy, the loss can be in the wrong place
Advantages and disadvantages of pre-training without auto-association
Advantage:
- you compress based on the property you are trying to predict. If it’s lossy, the loss is probably in the right place
Disadvantage:
- can’t use unlabelled training data
- consequently, there is less data to train on
What is clustering
- unsupervised learning: grouping un-labelled data
- find underlying patterns in the data
- large choice of distance functions
- partitioning and hierarchical methods
Possible Clustering implementations for connectionist models
k-means can be achieved by using backpropagation in a non-linear self-associating network:
- there is one hidden layer with each node representing a cluster centre
- the hidden layer is hardmax, so only one of the neurons is activated for each input
- the winning neuron’s weights are adjusted when it is activated (recomputing the cluster centre)
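A competitive-learning sketch of the same idea, with the hardmax winner's "weights" moved toward each input (the data, number of centres, and learning rate are illustrative; the backpropagation machinery of the network formulation is abstracted away):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                          # unlabelled data
centres = X[rng.choice(len(X), 3, replace=False)].copy()  # 3 "hidden units"
lr = 0.05

for x in X:
    # hardmax: only the nearest unit is activated by this input
    winner = np.argmin(np.linalg.norm(centres - x, axis=1))
    # adjust only the winner's weights, pulling the centre toward the input
    centres[winner] += lr * (x - centres[winner])
```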
What are the strengths and weaknesses of Clustering compared to Principal Component Analysis (in connectionist models)?
PCA: linear self-associating networks
Clustering: non-linear (hardmax) self-associating networks
- PCA builds global features (strength) while clustering builds local features (weakness)
- PCA only considers linear combinations of the inputs (weakness)
- clustering builds much stronger features (strength)
State what the difference is between Feedforward and Feedback networks
Information Flow:
- In FFN, information flows in one direction
- FBNs have recurrent connections, which allow them to maintain and propagate information over time
- Consequently, FBNs can model sequences and time-dependent data
Which network architecture (FFN, FBN) do you think are easier to deal with? Justify your choice.
- FFNs are easier to train and are more stable because there are no feedback loops
- FBNs can be more challenging to train due to vanishing gradients
Describe Hopfield Networks and Boltzmann Machines
- both are types of Recurrent NN (RNN)
- HNs consist of binary threshold units with symmetric connections
- BMs use binary stochastic units and incorporate a probabilistic aspect in the update rule
Discuss the learning process in Hopfield Networks and Boltzmann machines
Hopfield Networks:
- learning involves adjusting the weights to store certain memories or patterns, essentially capturing second-order interactions
Boltzmann machines
- learn to generate configurations according to a probability distribution
- involves adjusting the weights based on the correlation differences in the training and generated data
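A minimal sketch of Hebbian storage and threshold recall in a Hopfield network (the two ±1 patterns and the corrupted probe are made up):

```python
import numpy as np

patterns = np.array([[1, -1, 1, -1, 1],
                     [1, 1, -1, -1, 1]])      # +/-1 patterns to store
n = patterns.shape[1]

# Hebbian storage: symmetric weights capture second-order interactions
W = (patterns.T @ patterns) / n
np.fill_diagonal(W, 0)

# Recall: start from pattern 0 with one flipped unit, update by thresholding
s = np.array([1, 1, 1, -1, 1])
for _ in range(5):
    for i in range(n):
        s[i] = 1 if W[i] @ s > 0 else -1      # binary threshold update
print(s)                                      # settles back to [1,-1,1,-1,1]
```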
Do Hopfield Networks and Boltzmann machines tackle similar problems?
- HNs are deterministic, while BMs are probabilistic
- Consequently, BMs can represent higher-order interactions
Discuss similarities and differences between MLPs and Boltzmann Machines
- both can have hidden layers
- feedforward vs recurrent
- the role of the hidden units is somewhat similar in both models: learning complex patterns/structures
- the manner in which the hidden units operate differs: deterministic in MLPs, probabilistic in BMs