lecture 5: backpropagation Flashcards

1
Q

single-layer perceptrons

A
  • are limited to linearly separable problems
  • we need to add layers to make them universal function approximators
2
Q

how to find the weights of a multilayer perceptron

A

with backpropagation

3
Q

What is the main purpose of backpropagation?

A

Backpropagation calculates and propagates errors backward through the network to adjust weights, enabling the network to learn by minimizing error.

4
Q

backpropagation steps

A
  1. forward sweep
  2. compare predicted output to true output
  3. compute the error term
  4. update the weights between the hidden and output layer
  5. propagate the error back and update the weights of the deeper layers
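A minimal numpy sketch of these five steps for a one-hidden-layer network (the sigmoid activation, layer sizes and variable names are illustrative assumptions; the update rules follow the formulas on the later cards):

```python
import numpy as np

def g(a):            # sigmoid activation (assumed)
    return 1.0 / (1.0 + np.exp(-a))

def g_prime(a):      # derivative of the sigmoid
    s = g(a)
    return s * (1.0 - s)

rng = np.random.default_rng(0)
x = rng.normal(size=3)                # input
t = np.array([1.0])                   # target output
W_hx = rng.normal(size=(4, 3)) * 0.1  # input -> hidden weights
W_yh = rng.normal(size=(1, 4)) * 0.1  # hidden -> output weights
eps = 0.1                             # learning rate

# 1. forward sweep
a_h = W_hx @ x
h = g(a_h)
a_y = W_yh @ h
y = g(a_y)

# 2./3. compare predicted output to the target and compute the error term
delta_y = g_prime(a_y) * (t - y)

# 4./5. backpropagate the error and update both weight matrices
delta_h = g_prime(a_h) * (W_yh.T @ delta_y)   # uses the pre-update weights
W_yh += eps * np.outer(delta_y, h)            # hidden -> output update
W_hx += eps * np.outer(delta_h, x)            # input -> hidden update
```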
5
Q

What is forward propagation in a neural network?

A
  • passing input data 𝑥 through the network to compute the output 𝑦 via the intermediate hidden layer activations ℎ.
6
Q

How is information flow represented mathematically in forward propagation?

A
  • the flow is x→h→y
  • Hidden layer activations: h=W^{hx}x
  • Output: y=W^{yh}h
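A minimal numpy illustration of the flow x → h → y (sizes and values are made up; in practice a nonlinearity g is applied to the hidden pre-activations, as the g′ terms in the later error formulas assume):

```python
import numpy as np

x = np.array([0.5, -1.0, 2.0])          # input vector
W_hx = np.ones((4, 3)) * 0.1            # input -> hidden weights (illustrative)
W_yh = np.ones((2, 4)) * 0.2            # hidden -> output weights (illustrative)

h = W_hx @ x                            # hidden layer activations: h = W^{hx} x
y = W_yh @ h                            # output:                   y = W^{yh} h
print(h, y)
```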
7
Q

What is the first step in backpropagation after the forward sweep?

A

Compare the predicted output y to the target t to calculate the error in the output layer.

8
Q

How is the error δ for the output layer calculated?

A
  • δ_j = g′(a_j)⋅(t_j − y_j)
  • output error = derivative of the activation * difference between target and predicted output
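A tiny sketch, assuming a sigmoid activation (the card only requires some differentiable g):

```python
import numpy as np

def g(a):                     # sigmoid (assumed activation)
    return 1.0 / (1.0 + np.exp(-a))

a_j = 0.8                     # pre-activation of output unit j (illustrative)
t_j = 1.0                     # target
y_j = g(a_j)                  # predicted output
delta_j = y_j * (1.0 - y_j) * (t_j - y_j)   # g'(a_j) * (t_j - y_j), since g' = g*(1-g) for the sigmoid
print(delta_j)
```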
9
Q

How are weights connected to the output layer updated in backpropagation?

A
  • Δw_jk = ϵ⋅δ_j⋅h_k
  • weight change = learning rate * error term * input to the weight (the sign is positive here because δ_j is defined with t_j − y_j)
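The same rule in plain Python, with illustrative numbers (ϵ is the learning rate):

```python
eps = 0.1            # learning rate
delta_j = 0.05       # output error term from the previous card (illustrative)
h_k = 0.7            # activation of hidden unit k feeding into weight w_jk
w_jk = 0.3           # current weight (illustrative)

w_jk += eps * delta_j * h_k   # delta_w = learning rate * error term * input
```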
10
Q

How does backpropagation work for the hidden layers?

A
  • propagates the error from the output layer back to the hidden layers
  • δ_i = g′(a_i)⋅Σ_j w_ji⋅δ_j
  • i.e., the hidden unit's activation derivative times the weighted sum of the error terms of the units it feeds into
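A sketch for one hidden unit i that feeds several output units j (illustrative numbers):

```python
import numpy as np

g_prime_a_i = 0.2                        # g'(a_i) for hidden unit i (illustrative)
w_ji = np.array([0.4, -0.1, 0.3])        # weights from hidden unit i to output units j
delta_j = np.array([0.05, 0.02, -0.01])  # output error terms

# hidden error = local derivative * weighted sum of downstream errors
delta_i = g_prime_a_i * np.sum(w_ji * delta_j)
print(delta_i)
```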
11
Q

What are the two key steps in backpropagation?

A
  1. Compute the error term δ for each layer.
  2. Update the weights using the error term and the learning rate.
12
Q

neural network architectures

A
  1. recurrent neural network (RNN)
  2. convolutional neural network (CNN)
13
Q

RNNs

A
  • time as a factor: memory, sequence analysis, temporal predictions, language
  • feed their outputs back into themselves, time-step by time-step
14
Q

RNNs: unrolling

A

expanding the RNN over time steps, treating each time step as a layer in the network for backpropagation through time (BPTT)
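A minimal sketch of the unrolled forward pass: the same weights are reapplied at every time step, so each iteration of the loop plays the role of one layer (the tanh nonlinearity and all sizes are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
T, n_in, n_hid = 5, 3, 8
xs = rng.normal(size=(T, n_in))             # input sequence x_1 ... x_T
W_xh = rng.normal(size=(n_hid, n_in)) * 0.1
W_hh = rng.normal(size=(n_hid, n_hid)) * 0.1

h = np.zeros(n_hid)                         # initial hidden state
for t in range(T):                          # one "layer" per time step
    h = np.tanh(W_xh @ xs[t] + W_hh @ h)
```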

15
Q

RNNs: feedback loop

A

lets RNNs maintain a memory of past inputs and integrate information over time

16
Q

RNNs: learning

A
  • backpropagation through time (BPTT)
  • backpropagation is extended over time to compute gradients for all time steps in the sequence.
  • this allows RNNs to learn how earlier time steps influence later ones.
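A compact BPTT sketch for a tanh RNN with a single linear readout of the last hidden state and a squared-error loss (all of these choices are assumptions made to keep the example short); the backward loop accumulates gradients over every time step:

```python
import numpy as np

rng = np.random.default_rng(1)
T, n_in, n_hid = 5, 3, 8
xs = rng.normal(size=(T, n_in))
target = 1.0
W_xh = rng.normal(size=(n_hid, n_in)) * 0.1
W_hh = rng.normal(size=(n_hid, n_hid)) * 0.1
w_out = rng.normal(size=n_hid) * 0.1

# forward sweep, storing every hidden state for the backward pass
hs = [np.zeros(n_hid)]
for t in range(T):
    hs.append(np.tanh(W_xh @ xs[t] + W_hh @ hs[-1]))
y = w_out @ hs[-1]

# backward sweep: gradients are accumulated over all time steps
dW_xh = np.zeros_like(W_xh)
dW_hh = np.zeros_like(W_hh)
dw_out = (y - target) * hs[-1]          # d(0.5*(y - target)^2)/dw_out
dh = (y - target) * w_out               # error arriving at the last hidden state
for t in reversed(range(T)):
    da = dh * (1.0 - hs[t + 1] ** 2)    # through tanh: 1 - tanh^2
    dW_xh += np.outer(da, xs[t])
    dW_hh += np.outer(da, hs[t])
    dh = W_hh.T @ da                    # pass the error one step further back in time

eps = 0.01                              # gradient-descent update
W_xh -= eps * dW_xh
W_hh -= eps * dW_hh
w_out -= eps * dw_out
```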
17
Q

RNNs: pros

A
  • can learn all sorts of sequences, even dependencies on remote events in a sequence
  • “remembering” information from earlier in the sequence allows them to learn long-term patterns
  • dynamic, semantic information processing allows for speech recognition and language modeling
18
Q

RNNs: cons

A
  1. Recurrence issues: feedback loops can lead to numerical instability and training difficulties.
  2. Scaling Challenges: Long sequences create many time steps, leading to many layers, a large number of parameters, and vanishing gradients
19
Q

How is the problem of many time points solved in RNNs?

A

LSTM (Long Short-Term Memory) RNNs

20
Q

general problems with making networks bigger

A
  1. larger number of parameters
  2. vanishing gradients
21
Q

problems with a large number of parameters

A
  1. training takes a long time
  2. many local minima
  3. very sensitive to biases in the training set
22
Q

what are vanishing gradients?

A
  • the sigmoid squashes values into [0,1], and its derivative is at most 0.25
  • the error signal is multiplied by these small derivatives at every layer, so it gets diluted, becoming smaller and smaller
  • lower layers therefore get only tiny weight changes, so they learn very slowly
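A quick numeric illustration of the dilution: the backpropagated signal is multiplied by one sigmoid derivative (at most 0.25) per layer, so it shrinks geometrically:

```python
import numpy as np

def sigmoid_prime(a):
    s = 1.0 / (1.0 + np.exp(-a))
    return s * (1.0 - s)

signal = 1.0
for layer in range(10):                  # 10 layers deep (illustrative)
    signal *= sigmoid_prime(0.0)         # 0.25 is the *best* case (a = 0)
print(signal)                            # 0.25**10 ~= 9.5e-07
```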
23
Q

convolutional neural networks

A
  • convolution makes it possible to add more layers without the number of parameters exploding
  • great for classification tasks: image/sound recognition
24
Q

training CNNs

A

standard backpropagation

25
Q

convolution

A
  • core operation of CNNs
  • mathematical operation where a kernel slides across an input image (or feature map) to produce an output feature map
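A naive numpy sketch of the sliding-kernel operation (strictly a cross-correlation, which is what CNN libraries actually compute; the 'valid' output size and example kernel are assumptions):

```python
import numpy as np

def conv2d(image, kernel):
    """Slide `kernel` over `image` and return the output feature map."""
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kH, j:j + kW] * kernel)
    return out

image = np.random.rand(6, 6)
edge_kernel = np.array([[1.0, 0.0, -1.0]] * 3)   # crude vertical-edge detector
feature_map = conv2d(image, edge_kernel)          # shape (4, 4)
```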
26
Q

Why do CNNs have fewer weights to train compared to fully connected networks?

A
  • Each neuron in a layer is connected only to a small local patch of the input image, rather than the entire image.
  • The same kernel is reused across the entire image, reducing the number of parameters.
  • i.e., only local connections, leading to fewer weights to train
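A back-of-the-envelope comparison (the 28x28 image and 3x3 kernel sizes are illustrative assumptions):

```python
n_pixels = 28 * 28                       # e.g. a 28x28 grey-scale image
fully_connected = n_pixels * n_pixels    # every pixel to every unit: 614,656 weights
convolutional = 3 * 3                    # one shared 3x3 kernel: 9 weights
print(fully_connected, convolutional)
```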
27
Q

How do CNNs build feature representations?

A
  • from layer to layer
  • convolution is applied with increasingly ‘complex’ kernels
  • e.g., edges or colors → shapes or textures → complex features, such as object parts
28
Q

How are CNNs similar to the brain’s visual system?

A

Both CNNs and the visual system use hierarchical, localized processing (similar to neuronal receptive fields) to build an understanding of the input.

29
Q

What activation function for CNNs

A
  • ReLU (Rectified Linear Unit)
  • has a constant, non-vanishing gradient when active and zero output (and gradient) for negative input, making training faster.
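The function and its gradient in a few lines of numpy:

```python
import numpy as np

def relu(a):
    return np.maximum(0.0, a)        # 0 for negative input, identity otherwise

def relu_prime(a):
    return (a > 0).astype(float)     # gradient is exactly 1 whenever the unit is active

print(relu(np.array([-2.0, 3.0])), relu_prime(np.array([-2.0, 3.0])))
```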
30
Q

max-pooling

A

takes the maximum value within a small neighborhood (e.g., 2x2) in the feature map to reduce dimensions while retaining important information.
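A minimal 2x2, stride-2 max-pooling sketch in numpy (even input dimensions assumed):

```python
import numpy as np

def max_pool_2x2(fmap):
    """Keep the maximum of every non-overlapping 2x2 block."""
    H, W = fmap.shape
    return fmap[:H - H % 2, :W - W % 2].reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

fmap = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool_2x2(fmap))    # 4x4 feature map -> 2x2
```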

31
Q

SoftMax

A
  • Converts raw scores into probabilities, ensuring that:
    1. all outputs are positive
    2. they sum to 1
  • this makes the output interpretable as a probability distribution
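A numerically stable numpy sketch:

```python
import numpy as np

def softmax(scores):
    z = scores - np.max(scores)          # subtract the max for numerical stability
    e = np.exp(z)
    return e / np.sum(e)

p = softmax(np.array([2.0, 1.0, 0.1]))
print(p, p.sum())                        # all positive, sums to 1
```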
32
Q

layer stacking

A

convolutional layers with max pooling stacked on top of one another, followed by fully connected (dense) layers

33
Q

describe the inner structure of a convnet

A
  • after each convolutional block, a max-pooling layer reduces the size of the feature map while retaining the most important features.
  • after all convolutional and pooling layers, the flattened feature maps are passed into fully connected layers.
  • the final dense layer has as many neurons as there are classes and outputs a probability for each class using softmax.
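A sketch of this stacking in Keras (the 28x28 grey-scale input, 10 classes and layer sizes are illustrative assumptions, not the lecture's exact model):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(28, 28, 1)),              # e.g. grey-scale images
    tf.keras.layers.Conv2D(32, (3, 3), activation="relu"),
    tf.keras.layers.MaxPooling2D((2, 2)),                   # shrink the feature maps
    tf.keras.layers.Conv2D(64, (3, 3), activation="relu"),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Flatten(),                              # flatten for the dense part
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),        # one output per class
])
```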
34
Q

what does training these convnets require

A
  1. many parameters
  2. large training sets
  3. fast computers
35
Q

result of convnets

A

better-than-human performance on visual recognition tasks

36
Q

What are the benefits of using ReLU over sigmoid?

A
  • avoids the vanishing gradient problem of sigmoid
  • has a simple computation
  • allows for sparse activation, speeding up learning.
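A tiny best-case comparison of the gradient signal surviving 10 layers:

```python
sigmoid_prime_max = 0.25         # largest possible sigmoid derivative (at a = 0)
relu_prime_active = 1.0          # ReLU derivative whenever the unit is active

print(sigmoid_prime_max ** 10)   # ~1e-6: the signal has all but vanished
print(relu_prime_active ** 10)   # 1.0: the signal passes through undiminished
```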
37
Q

Why can’t backpropagation alone lead to Artificial General Intelligence (AGI)?

A
  1. Backpropagation is not biologically plausible.
  2. Most human learning is unsupervised, while backpropagation focuses on supervised learning.
38
Q

What does Yann LeCun say about supervised learning?

A

He describes supervised learning as “the cherry on the cake of neural networks” because most learning, including human learning, is unsupervised.

39
Q

What is Geoffrey Hinton’s argument regarding the brain’s capacity?

A
  • The human brain has about 10^14 synapses but only about 10^9 seconds in a lifetime.
  • There are far more parameters than data, so most learning must be unsupervised.
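  • worked out: 10^14 synapses ÷ 10^9 seconds = 10^5 parameters per second of life, far more than labeled examples alone could constrain.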
40
Q

Why is backpropagation considered biologically implausible?

A
  1. It requires global knowledge of all gradients and weights to compute local error contributions, which is impossible in the brain.
  2. Backpropagation through time (BPTT) needs error signals to “trickle down” to lower layers, which is impractical as the original input is long gone.
41
Q

What is an example of backpropagation’s limitations in achieving AGI?

A

While backpropagation enables Convolutional Neural Networks (CNNs) to perform specific tasks like image recognition, it cannot replicate the flexible learning and reasoning of the human brain, which involves unsupervised and adaptive learning.

42
Q

Geoffrey Hinton

A

‘The brain has about 10^14 synapses and we only live for about 10^9 seconds. So we have a lot more parameters than data. We must do a lot of unsupervised learning.’