Neural Networks Flashcards
What activation function did the first neural network use?
Heaviside step function
What method is used for optimizing multi-layer neural networks?
Backpropagation
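Backpropagation is just the chain rule applied layer by layer. Below is a minimal NumPy sketch on the XOR problem; the layer sizes, learning rate, and iteration count are arbitrary illustrative choices.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Tiny 2-layer network trained on XOR with backpropagation.
rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)
W1, b1 = rng.normal(size=(2, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)

for _ in range(5000):
    # Forward pass
    h = sigmoid(X @ W1 + b1)            # hidden activations
    out = sigmoid(h @ W2 + b2)          # network output
    # Backward pass: chain rule, layer by layer
    d_out = out - y                     # gradient of cross-entropy w.r.t. the output pre-activation
    d_h = (d_out @ W2.T) * h * (1 - h)  # error signal propagated to the hidden layer
    # Gradient-descent updates
    W2 -= 0.5 * h.T @ d_out
    b2 -= 0.5 * d_out.sum(axis=0)
    W1 -= 0.5 * X.T @ d_h
    b1 -= 0.5 * d_h.sum(axis=0)

print(out.round(2).ravel())             # should be close to [0, 1, 1, 0] once training has converged
```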
Name some improvements to neural networks over the past 30 years
1) Better hardware
2) Deeper networks
3) Larger datasets
4) Other changes: better activation functions, different layer types…
How can we adapt gradient descent to work with very large training sets?
Stochastic gradient descent (use a random mini-batch from the training data and update the weights using only this batch).
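A minimal sketch of one epoch of mini-batch SGD on a linear model with squared error; the batch size, learning rate, and toy data are assumptions for illustration.

```python
import numpy as np

def sgd_epoch(X, y, w, lr=0.01, batch_size=32):
    """One epoch of mini-batch SGD on a linear model with squared-error loss."""
    idx = np.random.permutation(len(X))            # shuffle the training set
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]      # pick a random mini-batch
        Xb, yb = X[batch], y[batch]
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(batch)  # gradient computed on the batch only
        w = w - lr * grad                          # update weights with the batch gradient
    return w

# Toy usage: fit y = 3x with a single weight
X = np.random.randn(256, 1)
y = 3 * X[:, 0]
w = np.zeros(1)
for _ in range(50):
    w = sgd_epoch(X, y, w)
print(w)   # should end up close to [3.]
```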
How does the step size (learning rate) influence the error for:
1) A very high learning rate
2) A high learning rate
3) A low learning rate
1) The error will increase rapidly.
2) The error will decrease rapidly in the beginning and then “flatten out”, never reaching the optimum.
3) The error will decrease slowly. (See the sketch below.)
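A toy sketch on the 1-D quadratic loss f(w) = w², with arbitrary step sizes, just to show the divergent, fast, and slow regimes.

```python
def run_gd(lr, steps=20, w0=5.0):
    """Gradient descent on f(w) = w^2 (gradient 2w), returning the loss after each step."""
    w = w0
    losses = []
    for _ in range(steps):
        w -= lr * 2 * w
        losses.append(w ** 2)
    return losses

print(run_gd(1.5)[:5])    # very high lr: loss grows every step (diverges)
print(run_gd(0.4)[:5])    # moderate lr:  loss drops rapidly
print(run_gd(0.01)[:5])   # low lr:       loss decreases slowly
```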
What is the vanishing gradient problem?
Activation functions like the sigmoid saturate for large positive/negative values of x, meaning the gradient is close to 0, so little learning signal reaches the earlier layers.
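A small sketch of why this happens: the sigmoid's derivative σ'(x) = σ(x)(1 - σ(x)) shrinks toward 0 as |x| grows (the sample inputs are arbitrary).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)   # derivative of the sigmoid

for x in [0.0, 2.0, 5.0, 10.0]:
    print(f"x = {x:5.1f}  gradient = {sigmoid_grad(x):.6f}")
# 0.25 at x = 0, but ~0.000045 at x = 10: almost no signal flows back
```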
What is the exploding gradient problem?
The gradient suddenly increases a lot, so the gradient descent algorithm can “jump” far away from the optimal solution.
How can we adapt gradient descent to fix the vanishing and exploding gradient problems?
1) We can use adaptive step sizes.
2) We can clip the gradient using thresholding or the L2 norm (see the sketch below).
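A minimal sketch of both clipping variants with NumPy; the threshold values are arbitrary.

```python
import numpy as np

def clip_by_value(grad, threshold=1.0):
    """Element-wise thresholding: each component is forced into [-threshold, threshold]."""
    return np.clip(grad, -threshold, threshold)

def clip_by_norm(grad, max_norm=1.0):
    """Rescale the whole gradient so its L2 norm is at most max_norm."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

g = np.array([3.0, -4.0])     # "exploding" gradient with norm 5
print(clip_by_value(g))       # [ 1. -1.]
print(clip_by_norm(g))        # [ 0.6 -0.8]  (rescaled to norm 1)
```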
What is the main advantage of the ReLU over the sigmoid activation function?
It doesn’t saturate for large positive values of x.
How can we deal with ReLU saturation for input values below 0?
We can use leaky ReLU, PReLU, or ELU instead.
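A sketch of ReLU and the variants named above (PReLU is like leaky ReLU except that the negative slope is learned); the slope and alpha values are common defaults, not requirements.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)                    # zero output and zero gradient for x < 0

def leaky_relu(x, slope=0.01):
    return np.where(x > 0, x, slope * x)         # small non-zero slope for x < 0

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))  # smooth, saturates to -alpha

x = np.array([-3.0, -0.5, 0.0, 2.0])
print(relu(x), leaky_relu(x), elu(x), sep="\n")
```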
What types of problems is the mean squared error loss function most commonly used for?
Regression.
What output function do we usually use for the binary classification problem?
Sigmoid (or softmax with 2 outputs, one for “true” and one for “false”).
What output function do we usually use for Multi-class classification problems?
Softmax
What loss function do we usually use for classification?
Cross entropy
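A sketch of softmax followed by cross-entropy for a single multi-class example; the logits and label are made up.

```python
import numpy as np

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1 (shifted for numerical stability)."""
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(probs, true_class):
    """Negative log-probability assigned to the correct class."""
    return -np.log(probs[true_class])

logits = np.array([2.0, 0.5, -1.0])   # raw network outputs for 3 classes
probs = softmax(logits)
print(probs)                          # roughly [0.79, 0.18, 0.04]
print(cross_entropy(probs, 0))        # about 0.24: small, the network favours the right class
```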
What is data augmentation?
We increase the training set by adding distorted, squeezed, tilted… versions of the original training examples.
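A minimal NumPy sketch of the idea; real pipelines usually rely on a library's transform utilities, and the transforms below are only examples.

```python
import numpy as np

def augment(image):
    """Return a few distorted copies of a (H, W) image array."""
    flipped = image[:, ::-1]                                # horizontal flip
    shifted = np.roll(image, shift=2, axis=1)               # small horizontal shift
    noisy = image + 0.05 * np.random.randn(*image.shape)    # mild pixel noise
    return [flipped, shifted, noisy]

image = np.random.rand(28, 28)              # stand-in for a training image
augmented = augment(image)
print(len(augmented), augmented[0].shape)   # 3 extra training examples, same shape
```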
What assumption do we make when using a CNN?
Nearby features (for example pixels) are dependent on each other.
How do we make sure that the output of a convolutional layer is the same size as the input?
Zero-pad the input.
What happens to the output size if we have stride = 2 in a convolutional layer? (When the input is zero-padded.)
The output is halved along each spatial dimension, so for a 2D input it is 1/4 of the input size.
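This follows from the standard output-size formula out = floor((in + 2·padding − kernel) / stride) + 1, sketched below for the zero-padded cases above.

```python
def conv_output_size(in_size, kernel=3, stride=1, padding=0):
    """Standard formula for the spatial output size of a convolutional layer."""
    return (in_size + 2 * padding - kernel) // stride + 1

# Zero padding of 1 keeps a 3x3 kernel "same size" at stride 1:
print(conv_output_size(32, kernel=3, stride=1, padding=1))   # 32
# With stride 2 each spatial dimension is halved, so a 2D output has 1/4 the elements:
print(conv_output_size(32, kernel=3, stride=2, padding=1))   # 16
```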
How does the kernel in a convolutional layer look if we use dilation = 2 and kernel size 3?
[X, –, X, –, X
–, –, –, –, –
X, –, X, –, X
–, –, –, –, –
X, –, X, –, X]
We only “look” at input values where there is an X: the kernel taps are spaced 2 positions apart, so a 3×3 kernel covers a 5×5 region.
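A small sketch that builds this footprint: with dilation d and kernel size k the kernel spans d·(k − 1) + 1 positions per dimension and touches only every d-th one.

```python
import numpy as np

def dilated_footprint(kernel_size=3, dilation=2):
    """Boolean mask of the input positions a dilated 2D kernel actually reads."""
    span = dilation * (kernel_size - 1) + 1     # 5 for size 3, dilation 2
    mask = np.zeros((span, span), dtype=bool)
    mask[::dilation, ::dilation] = True         # kernel taps every `dilation`-th position
    return mask

print(dilated_footprint().astype(int))
# [[1 0 1 0 1]
#  [0 0 0 0 0]
#  [1 0 1 0 1]
#  [0 0 0 0 0]
#  [1 0 1 0 1]]
```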
What is pooling in a convolutional network?
It is like downsampling. We usually use max-pooling, meaning that the maximum value in each local region is selected.
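A sketch of 2×2 max-pooling with stride 2 on a toy array (frameworks provide this as a built-in layer).

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max-pooling with stride 2: keep the largest value in each 2x2 block."""
    h, w = x.shape
    return x[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

x = np.arange(16.0).reshape(4, 4)
print(max_pool_2x2(x))
# [[ 5.  7.]
#  [13. 15.]]
```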
What is a residual net?
A residual net has “skip connections” that let the gradient “skip” layers during backpropagation, which alleviates vanishing gradients.
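A structural sketch of a skip connection with plain matrix layers; real residual blocks use convolutions and batch normalization, so this only shows the “add the input back” idea.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x, W1, W2):
    """Two small layers whose output is added back to the input (the skip connection)."""
    out = relu(x @ W1)
    out = out @ W2
    return relu(out + x)    # '+ x' is the skip: gradients can flow straight through it

x = np.random.randn(4, 8)                 # a batch of 4 feature vectors of width 8
W1 = 0.1 * np.random.randn(8, 8)
W2 = 0.1 * np.random.randn(8, 8)
print(residual_block(x, W1, W2).shape)    # (4, 8): same shape as the input
```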
Name some famous neural networks
GoogLeNet, ResNet, AlexNet, VGG.