lecture 5: backpropagation Flashcards
single-layer perceptrons
- are limited to linearly separable problems
- we need to add layers to make universal function approximators
how to find the weights of a multilayer perceptron
with backpropagation
What is the main purpose of backpropagation?
Backpropagation calculates and propagates errors backward through the network to adjust weights, enabling the network to learn by minimizing error.
backpropagation steps
- forward sweep
- compare predicted output to true output
- compute the error term
- update the weights between the hidden layer and the output layer
- update the weights of deeper layers
What is forward propagation in a neural network?
- passing input data 𝑥 through the network to compute the output 𝑦 via the intermediate hidden layer activations ℎ.
How is information flow represented mathematically in forward propagation?
- the flow is x→h→y
- Hidden layer activations: h=W^{hx}x
- Output: y=W^{yh}h
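A minimal numpy sketch of this forward sweep (the layer sizes and the sigmoid activation g are assumptions; the card writes only the linear steps, but the error terms below use g′(a), so an activation is applied at each layer here):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)

# assumed sizes: 3 inputs, 4 hidden units, 2 outputs
W_hx = rng.normal(scale=0.1, size=(4, 3))   # input -> hidden weights (W^{hx})
W_yh = rng.normal(scale=0.1, size=(2, 4))   # hidden -> output weights (W^{yh})

x = rng.normal(size=3)        # input vector
h = sigmoid(W_hx @ x)         # hidden activations: h = g(W^{hx} x)
y = sigmoid(W_yh @ h)         # output:             y = g(W^{yh} h)
print(y)
```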
What is the first step in backpropagation after the forward sweep?
Compare the predicted output y to the target t to calculate the error in the output layer.
How is the error δ for the output layer calculated?
- δ_j = g′(a_j)⋅(t_j−y_j)
- output error = derivative of the activation * difference between target and predicted output
How are weights connected to the output layer updated in backpropagation?
- Δw_jk = ϵ⋅δ_j⋅h_k
- learning rate * error term * input to the weight
How does backpropagation work for the hidden layers?
- propagates the error from the output layer back to the hidden layers
- δ_i = g′(a_i)⋅Σ_j w_ji⋅δ_j (each output-layer error is weighted by its connecting weight and summed)
What are the two key steps in backpropagation?
- Compute the error term δ for each layer.
- Update the weights using the error term and the learning rate.
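Putting the pieces above together, a hedged numpy sketch of one full backpropagation update for a single-hidden-layer network (the sigmoid activation, squared-error loss, layer sizes, and learning rate value are assumptions):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sigmoid_prime(a):
    s = sigmoid(a)
    return s * (1.0 - s)

rng = np.random.default_rng(1)
eps = 0.1                                    # learning rate (assumed value)

# assumed sizes: 3 inputs, 4 hidden units, 2 outputs
W_hx = rng.normal(scale=0.1, size=(4, 3))
W_yh = rng.normal(scale=0.1, size=(2, 4))

x = rng.normal(size=3)                       # input
t = np.array([0.0, 1.0])                     # target

# forward sweep
a_h = W_hx @ x
h = sigmoid(a_h)
a_y = W_yh @ h
y = sigmoid(a_y)

# error term for the output layer: delta_j = g'(a_j) * (t_j - y_j)
delta_y = sigmoid_prime(a_y) * (t - y)

# error term for the hidden layer: delta_i = g'(a_i) * sum_j w_ji * delta_j
delta_h = sigmoid_prime(a_h) * (W_yh.T @ delta_y)

# weight updates: learning rate * error term * input to the weight
W_yh += eps * np.outer(delta_y, h)
W_hx += eps * np.outer(delta_h, x)
```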
neural network architectures
- recurrent neural network (RNN)
- convolutional neural network
RNNs
- time as a factor: memory, sequence analysis, temporal predictions, language
- feed their outputs back into themselves, time-step by time-step
RNNs: unrolling
expanding the RNN over time steps, treating each time step as a layer in the network for backpropagation through time (BPTT)
RNNs: feedback loop
lets RNNs maintain a memory of past inputs and integrate information over time
RNNs: learning
- backpropagation through time (BPTT)
- backpropagation is extended over time to compute gradients for all time steps in the sequence.
- this allows RNNs to learn how earlier time steps influence later ones.
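A minimal sketch of unrolling and BPTT for a tiny vanilla RNN (the tanh activation, the sizes, and a squared-error loss on the final hidden state only are all assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)

# assumed sizes: 2-dimensional inputs, 3 hidden units, sequence of 5 steps
W_xh = rng.normal(scale=0.1, size=(3, 2))   # input -> hidden
W_hh = rng.normal(scale=0.1, size=(3, 3))   # hidden -> hidden (the feedback loop)
xs = rng.normal(size=(5, 2))                # input sequence
target = rng.normal(size=3)                 # target for the final hidden state

# forward: unroll the network, treating each time step as a layer
hs = [np.zeros(3)]
for x in xs:
    hs.append(np.tanh(W_xh @ x + W_hh @ hs[-1]))

# backward: propagate the error back through every time step (BPTT)
dW_xh = np.zeros_like(W_xh)
dW_hh = np.zeros_like(W_hh)
delta = hs[-1] - target                     # loss gradient at the last step
for step in reversed(range(len(xs))):
    delta = delta * (1.0 - hs[step + 1] ** 2)   # tanh'(a) = 1 - tanh(a)^2
    dW_xh += np.outer(delta, xs[step])
    dW_hh += np.outer(delta, hs[step])
    delta = W_hh.T @ delta                  # pass the error to the previous step

eps = 0.1                                   # learning rate (assumed value)
W_xh -= eps * dW_xh                         # gradient-descent updates
W_hh -= eps * dW_hh
```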
RNNs: pros
- can learn all sorts of sequences, including dependencies on events far back in the sequence
- “remembering” information from earlier in the sequence allows them to learn long-term patterns
- dynamic, semantic information processing allows for speech recognition and language modeling
RNNs: cons
- Recursion Issues: Feedback loops can lead to numerical instability and training difficulties.
- Scaling Challenges: Long sequences create many time steps, leading to many layers, a large number of parameters, and vanishing gradients
How is the problem of many time steps solved in RNNs?
with LSTM (Long Short-Term Memory) RNNs
general problems with making networks bigger
- larger number of parameters
- vanishing gradients
problem with large number of parameters
- take long to train
- many local minima
- very sensitive to biases in training set
what are vanishing gradients
- the sigmoid squashes its output into (0, 1) and its derivative is at most 0.25
- Error gets diluted across many layers, becoming smaller and smaller
- Lower layers therefore get only tiny weight changes, so they learn very slowly
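A tiny numeric illustration of the dilution (the depth and the pre-activation value of 0 are assumed for the best case): the sigmoid's derivative is at most 0.25, so each extra layer shrinks the error signal by at least a factor of 4.

```python
import numpy as np

def sigmoid_prime(a):
    s = 1.0 / (1.0 + np.exp(-a))
    return s * (1.0 - s)

error = 1.0
for layer in range(20):                 # 20 layers deep (assumed)
    error *= sigmoid_prime(0.0)         # 0.25 at best, smaller away from 0
    if layer % 5 == 4:
        print(f"after {layer + 1} layers: {error:.2e}")
# the gradient reaching the lowest layers is vanishingly small
```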
convolutional neural networks
- convolution makes it possible to add more layers without the number of parameters exploding
- great for classification tasks: image/sound recognition
training CNNs
standard backpropagation
convolution
- core operation of CNNs
- mathematical operation where a kernel slides across an input image (or feature map) to produce an output feature map
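A minimal numpy sketch of this sliding-kernel operation (valid padding, stride 1, and the example kernel are assumptions; strictly, this is cross-correlation, i.e. the kernel is not flipped, which is what CNN libraries compute in practice):

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide `kernel` across `image` and collect the dot products."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1            # 'valid' output height
    ow = image.shape[1] - kw + 1            # 'valid' output width
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
edge_kernel = np.array([[1.0, 0.0, -1.0]] * 3)   # a simple vertical-edge detector
print(convolve2d(image, edge_kernel))            # 3x3 output feature map
```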
Why do CNNs have fewer weights to train compared to fully connected networks?
- Each neuron in a layer is connected only to a small local patch of the input image, rather than the entire image.
- The same kernel is reused across the entire image, reducing the number of parameters.
- i.e., only local connections, leading to fewer weights to train
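A quick back-of-the-envelope comparison (the image size, kernel size, and unit/kernel counts are made up purely for illustration):

```python
# fully connected: every pixel of a 28x28 image connects to each of 128 hidden units
fc_params = 28 * 28 * 128              # 100,352 weights (plus biases)

# convolutional: a 3x3 kernel shared across the whole image, 32 such kernels
conv_params = 3 * 3 * 32               # 288 weights (plus biases)

print(fc_params, conv_params)
```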
How do CNNs build feature representations?
- from layer to layer
- convolution is implemented with more and more ‘complex’ kernels
- e.g., edges or colors > shapes or textures > complex features, such as object parts.
How are CNNs similar to the brain’s visual system?
Both CNNs and the visual system use hierarchical, localized processing (similar to neuronal receptive fields) to build an understanding of the input.
What activation function is typically used in CNNs?
- ReLU (Rectified Linear Unit)
- has a constant, non-vanishing gradient when active and no activation for negative input, making training faster.
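A minimal sketch of ReLU and its gradient (nothing here is lecture-specific beyond the definition itself):

```python
import numpy as np

def relu(a):
    return np.maximum(0.0, a)          # passes positive input through, zeroes the rest

def relu_prime(a):
    return (a > 0).astype(float)       # gradient is 1 when active, 0 otherwise
```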
max-pooling
takes the maximum value within a small neighborhood (e.g., 2x2) in the feature map to reduce dimensions while retaining important information.
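A minimal sketch of 2x2 max-pooling with stride 2 (the usual defaults, assumed here):

```python
import numpy as np

def max_pool_2x2(feature_map):
    """Keep the maximum of each non-overlapping 2x2 block."""
    h, w = feature_map.shape
    cropped = feature_map[:h - h % 2, :w - w % 2]      # drop odd rows/columns
    blocks = cropped.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

fm = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool_2x2(fm))   # 2x2 output, each entry the max of one 2x2 block
```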
SoftMax
- Converts raw scores into probabilities
- ensuring that:
  1. all outputs are positive
  2. they sum to 1
- this makes the output interpretable as a probability distribution
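A minimal, numerically stable softmax sketch (the example scores are made up):

```python
import numpy as np

def softmax(scores):
    shifted = scores - scores.max()    # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / exps.sum()           # positive values that sum to 1

scores = np.array([2.0, 1.0, 0.1])
print(softmax(scores))                 # approx. [0.66, 0.24, 0.10]
```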
layer stacking
stacked convolutional layers with max pooling on top of one another, followed by fully (dense) connected layers
describe the inner structure of a convnet
- after each convolutional block, a max-pooling layer reduces the size of the feature map while retaining the most important features.
- after all convolutional and pooling layers, the flattened feature maps are passed into fully connected layers.
- the final dense layer has as many neurons as there are classes and outputs probabilities for each class using softmax.
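A hedged sketch of such a stack in Keras (the layer sizes, input shape, and number of classes are assumptions; the lecture does not name a framework):

```python
from tensorflow import keras
from tensorflow.keras import layers

# assumed: 28x28 grayscale inputs and 10 output classes
model = keras.Sequential([
    keras.Input(shape=(28, 28, 1)),
    layers.Conv2D(32, kernel_size=3, activation="relu"),   # convolution + ReLU
    layers.MaxPooling2D(pool_size=2),                      # shrink the feature map
    layers.Conv2D(64, kernel_size=3, activation="relu"),
    layers.MaxPooling2D(pool_size=2),
    layers.Flatten(),                                      # flatten the feature maps
    layers.Dense(128, activation="relu"),                  # fully connected layer
    layers.Dense(10, activation="softmax"),                # one neuron per class
])
model.summary()
```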
what does training these convnets require
- many parameters
- large training sets
- fast computers
result of convnets
better-than-human performance on visual recognition tasks
What are the benefits of using ReLU over sigmoid?
- avoids the vanishing gradient problem of sigmoid
- has a simple computation
- allows for sparse activation, speeding up learning.
Why can’t backpropagation alone lead to General Artificial Intelligence (AGI)?
- Backpropagation is not biologically plausible.
- Most human learning is unsupervised, while backpropagation focuses on supervised learning.
What does Yann LeCun say about supervised learning?
He describes supervised learning as “the cherry on the cake of neural networks” because most learning, including human learning, is unsupervised.
What is Geoffrey Hinton’s argument regarding the brain’s capacity?
- The human brain has about 10^14 synapses but only about 10^9 seconds in a lifetime.
- There are far more parameters than data, so most learning must be unsupervised.
Why is backpropagation considered biologically implausible?
- It requires global knowledge of all gradients and weights to compute local error contributions, which is impossible in the brain.
- Backpropagation through time (BPTT) needs error signals to “trickle down” to lower layers, which is impractical as the original input is long gone.
What is an example of backpropagation’s limitations in achieving AGI?
While backpropagation enables Convolutional Neural Networks (CNNs) to perform specific tasks like image recognition, it cannot replicate the flexible learning and reasoning of the human brain, which involves unsupervised and adaptive learning.
Geoffrey Hinton
‘The brain has about 10^14 synapses and we only live for about 10^9 seconds. So we have a lot more parameters than data. We must do a lot of unsupervised learning.’