[2] Neural Networks Flashcards
What determines the output of a neuron?
It is the weighted sum of its inputs plus a bias, passed into an activation function
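A minimal sketch in numpy (the sigmoid here is an arbitrary choice of activation function):
```python
import numpy as np

def neuron_output(x, w, b):
    """Weighted sum of inputs plus a bias, passed through a sigmoid activation."""
    z = np.dot(w, x) + b             # biased weighted sum
    return 1.0 / (1.0 + np.exp(-z))  # sigmoid activation

# hypothetical example with three inputs
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.1, 0.4, -0.3])
print(neuron_output(x, w, b=0.2))
```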
Why are activation functions important?
They allow the network to learn non-linearities
What is a perceptron?
A special type of ANN with:
- Real-valued inputs
- Binary output
- Threshold activation function
How are perceptrons trained?
Adjust the weights (and the bias, which acts as the threshold) in proportion to the error: if the target class is higher than the perceptron's output, increase the weights on the active inputs; if it is lower, decrease them
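A sketch of the standard perceptron learning rule, assuming binary targets in {0, 1} and a learning rate of 0.1 (both illustrative choices):
```python
import numpy as np

def perceptron_update(w, b, x, target, lr=0.1):
    """One online perceptron step: raise or lower the weights according to the error."""
    output = 1 if np.dot(w, x) + b > 0 else 0  # threshold activation
    error = target - output                    # +1 if output too low, -1 if too high
    w = w + lr * error * x                     # move weights towards the target class
    b = b + lr * error                         # the bias plays the role of the threshold
    return w, b
```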
What idea limits the generalisability of perceptrons?
The Perceptron Convergence Theorem states that perceptrons will converge if and only if the problem is linearly separable
Hence, they can’t learn XOR
What are the general approaches to updating weights?
Online learning updates weights after every instance; offline learning does it after every epoch.
Batch learning updates weights after every batch of instances
What algorithm is used to train neural networks?
Backpropagation:
[1] Calculate the predicted output using the current weights
[2] Calculate the error
[3] Update each weight in proportion to the gradient of the error with respect to that weight, i.e. how much changing that weight affects the error
Note: the weights are updated backwards, i.e. starting with those feeding the output layer and propagating the error back towards the input
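A minimal sketch of one backpropagation step for a single-hidden-layer network with sigmoid units and squared-error loss (shapes and learning rate are illustrative):
```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, target, W1, W2, lr=0.5):
    """One gradient step; W1 is hidden x inputs, W2 is outputs x hidden."""
    # [1] Forward pass: predicted output using the current weights
    h = sigmoid(W1 @ x)                            # hidden activations
    y = sigmoid(W2 @ h)                            # predicted output
    # [2] Error terms, starting at the output layer and moving backwards
    delta_out = (y - target) * y * (1 - y)         # output-layer error
    delta_hid = (W2.T @ delta_out) * h * (1 - h)   # error propagated back to the hidden layer
    # [3] Update each weight in proportion to its gradient
    W2 -= lr * np.outer(delta_out, h)
    W1 -= lr * np.outer(delta_hid, x)
    return W1, W2
```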
What are some potential issues when using backpropagation?
Improper learning rate leads to divergence or slow convergence
Overfitting from training too long, using too many weights, or using too few instances
Local minima
How should variables be represented in an ANN?
Use a binary representation (i.e. one hot encoding) for nominal variables
For numeric variables, consider scaling or standardisation
What is scaling and standardization? When should each be used?
Scaling - scale the numbers into [0,1] if they are on a similar range
Standardisation - assume a normal distribution and scale it to N(0,1) if the values are more varied
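A quick sketch of both transforms, applied per feature:
```python
import numpy as np

def min_max_scale(x):
    """Scaling: map the values onto [0, 1]."""
    return (x - x.min()) / (x.max() - x.min())

def standardise(x):
    """Standardisation: assume roughly normal data and map it to N(0, 1)."""
    return (x - x.mean()) / x.std()
```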
What can happen if ANN weights aren’t set appropriately?
If they are all set to 0, the network will be symmetric i.e. all the weights will change together, and so it won’t train
If the weights are too high, the activation will be in the part of the sigmoid with a shallow gradient, and so training will be slow.
How should ANN weights be set?
Using fan-in factor, i.e. using a uniform random generator between -1/sqrt(d) and 1/sqrt(d) where d is the number of inputs
This ensures the variance of the weighted sum is approximately 1/3
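A sketch of fan-in initialisation for one weight matrix:
```python
import numpy as np

def fan_in_init(n_outputs, n_inputs):
    """Uniform weights in [-1/sqrt(d), 1/sqrt(d)], where d is the number of inputs (fan-in)."""
    limit = 1.0 / np.sqrt(n_inputs)
    # each weight has variance (2*limit)^2 / 12 = 1/(3d), so the weighted sum of d
    # roughly unit-magnitude inputs has variance of about 1/3
    return np.random.uniform(-limit, limit, size=(n_outputs, n_inputs))
```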
How can back propagation be sped up?
With momentum, in which gradients from previous steps are used in addition to the current gradient
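A sketch of the momentum update, where the velocity carries information from previous gradients (the coefficient 0.9 is a typical but arbitrary choice):
```python
def momentum_step(w, grad, velocity, lr=0.1, mu=0.9):
    """Blend the current gradient with the accumulated previous steps."""
    velocity = mu * velocity - lr * grad   # decayed previous steps plus the new gradient
    return w + velocity, velocity
```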
How can weight matrices be visualised?
With Hinton diagrams, in which the size of the square is based on the magnitude; it is white if it is positive and black if it is negative
What are the key principles of CNNs?
They automatically extract features to produce a feature map
They are not fully connected - convolutions with shared weights are used instead
What are the dimensions of a feature map?
In each direction, it is (image_size - filter_size) / shift + 1
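For example (a hypothetical 28-pixel image, 5-pixel filter and shift of 1):
```python
def feature_map_size(image_size, filter_size, shift=1):
    """Size of the feature map along one direction."""
    return (image_size - filter_size) // shift + 1

print(feature_map_size(28, 5, 1))  # (28 - 5) / 1 + 1 = 24
```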
What techniques can be applied to optimise CNNs?
- Subsampling aggregates based on the maximal value; this reduces data while retaining/emphasizing the information (see the sketch after this list)
- Weight smoothing is used when domain-specific knowledge suggests adjacent inputs are related
- Centered weight initialization starts with higher weights in the center, as these are often where objects are found
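A minimal sketch of subsampling by maximal value (2x2 max pooling), as one concrete instance:
```python
import numpy as np

def max_pool(feature_map, size=2):
    """Subsample by keeping the maximal value in each size x size block."""
    h, w = feature_map.shape
    trimmed = feature_map[:h - h % size, :w - w % size]        # drop any ragged edge
    blocks = trimmed.reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))                             # max over each block
```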
What is a weight agnostic network?
It has a single weight shared by the whole network; training is done by network topology search.
What operations occur while training a weight agnostic network?
- Insert node by splitting an existing connection
- Add connection - connect two previously unconnected nodes
- Change activation - change the activation function of a node
What are HONNs?
Higher order neural networks connect each input to multiple nodes in the first hidden layer.
The order is the number of nodes that each input connects to.
CNNs are a special type of HONN
Why are HONNs useful?
Instead of just taking the weighted sum of individual inputs, they also sum weighted products over combinations of inputs.
This allows them to explore higher order relationships i.e. products; for example, they can solve XOR
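As a quick check that a product term handles XOR (the weights below are hand-picked for illustration):
```python
# XOR(x1, x2) = x1 + x2 - 2*x1*x2: a weighted sum plus one weighted product term
for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, x1 + x2 - 2 * x1 * x2)   # prints 0, 1, 1, 0
```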
What are self-organizing maps?
They represent high-dimensional data in lower dimensions by mapping inputs to neurons via their weights
The weights are trained by competitive learning: the node whose weights are closest to the input is chosen to fire, and it updates its weights to reinforce those that made it win. A neighborhood function also updates nearby nodes, which preserves topology
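A sketch of one competitive-learning step for a 1-D map with a Gaussian neighborhood (the layout and parameters are illustrative):
```python
import numpy as np

def som_step(weights, x, lr=0.5, sigma=1.0):
    """One step: the closest node fires, and it and its neighbours move towards the input."""
    winner = np.argmin(np.linalg.norm(weights - x, axis=1))   # node closest to the input
    grid_dist = np.abs(np.arange(len(weights)) - winner)      # distance along the 1-D map
    influence = np.exp(-(grid_dist ** 2) / (2 * sigma ** 2))  # neighborhood function
    return weights + lr * influence[:, None] * (x - weights)  # nearby nodes move too, preserving topology
```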
What are residual neural networks?
They have shortcut connections between layers.
This makes training more effective as it reduces the vanishing gradient effect
What is EvoCNN?
A genetic algorithm that automatically evolves network structures. It uses a two-level encoding to describe the layers and then their connections
Each mutation performs one of three actions:
- Add a new unit (convolutional, pooling or full)
- Modify an existing unit’s encoded information
- Delete an existing unit
What are auto-encoders?
Neural networks that have been trained to copy their input to their output
They use an intermediate layer called the latent representation
How is the loss of an auto-encoder calculated?
The difference between the input and output (or a domain-specific variation)
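A minimal example of the plain reconstruction loss (mean squared difference):
```python
import numpy as np

def reconstruction_loss(x, x_reconstructed):
    """Mean squared difference between the input and the decoded output."""
    return np.mean((x - x_reconstructed) ** 2)
```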
What are the main configurations of auto encoder?
Under-complete auto encoders have latent representations with smaller dimensions than the input and output.
Otherwise, they are over-complete
What do under-complete auto-encoders do?
They learn the most salient features of the data
How do over-complete auto-encoders work?
They use regularisation to avoid simply copying the data
Sparse auto-encoders try to push as many activations of the latent representation to 0 as possible
Contractive auto-encoders regularise with a derivative penalty, meaning the output of each node changes smoothly if the input changes slightly. This makes them robust to slight fluctuations
What are some particular applications of auto encoders?
De-noising auto-encoders remove noise from the image
Variational auto-encoders modify an image in a desired way. However, the latent space might not be continuous, so instead of a point in the latent space, a distribution is used.
Why is cross-entropy used?
It’s gradients are more pronounced at extreme values, leading to faster convergence
Why is ReLU often used?
It is fast to compute, minimises the impact of vanishing gradient, and encourages sparsity
What is the purpose of regularisation?
It prevents weights from getting too large, and pushes as many to zero as possible (allowing them to be ignored)
What is a particular type of regularisation?
Lasso regularisation uses L1 to remove irrelevant variables from a linear model
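A sketch of the lasso objective for a linear model (the penalty strength lam is an arbitrary example value):
```python
import numpy as np

def lasso_loss(X, y, w, lam=0.1):
    """Squared-error loss plus an L1 penalty that drives irrelevant weights to zero."""
    return np.mean((X @ w - y) ** 2) + lam * np.sum(np.abs(w))
```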
How does dropout work?
A random percentage of neurons is removed on each mini-batch
Note: for inference, the weights must be multiplied by (1 - p), where p is the dropout probability
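A sketch of classic (non-inverted) dropout, matching the (1 - p) inference scaling above:
```python
import numpy as np

def dropout_train(activations, p=0.5):
    """Training: randomly remove a fraction p of the neurons for this mini-batch."""
    mask = np.random.rand(*activations.shape) >= p   # keep each neuron with probability 1 - p
    return activations * mask

def dropout_inference(weights, p=0.5):
    """Inference: keep every neuron but scale the weights by (1 - p) to compensate."""
    return weights * (1 - p)
```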
What are the general strategies for transfer learning?
Learn shared hidden representations (e.g. DLID). This is useful if the classes are the same, but the way they are captured differs (e.g. different camera types)
Shared features - use this when the head layers do the same general task, but the tail does a particular task.