Neural Networks Flashcards
What activation function did the first neural network use?
Heaviside step function
What method is used for optimizing multi-layer neural networks?
Backpropagation
Name some improvements to neural networks over the past 30 years.
1) Better hardware
2) Deeper networks
3) Larger datasets
4) Other changes: better activation functions, different layer types, etc.
How can we adapt gradient descent to work with very large training sets?
Stochastic gradient descent: at each step, use a small random batch from the training data and update the weights using the gradient computed on that batch.
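A minimal NumPy sketch of one mini-batch SGD step for a linear model; the batch size, learning rate, and squared-error gradient are illustrative assumptions, not part of the definition.

import numpy as np

def sgd_step(W, X, y, lr=0.01, batch_size=32):
    # Sample a random mini-batch from the training data
    idx = np.random.choice(len(X), size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    # Gradient of the mean squared error for a linear model y_hat = Xb @ W
    grad = 2 * Xb.T @ (Xb @ W - yb) / batch_size
    # Update the weights using only this batch
    return W - lr * grad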
How does the learning rate (step size) influence the training error for:
1) A very high learning rate
2) A high learning rate
3) A low learning rate
1) The error will increase rapidly (training diverges).
2) The error will decrease rapidly in the beginning and then "flatten" out, never reaching the optimum.
3) The error will decrease slowly.
What is the vanishing gradient problem?
Activation functions like the sigmoid saturate for large positive or negative inputs, so their gradient there is close to 0. When these small gradients are multiplied through many layers during backpropagation, the gradients in the early layers become vanishingly small and those layers stop learning.
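A small NumPy illustration of the saturation: the sigmoid's derivative sigma(x) * (1 - sigma(x)) is at most 0.25 and practically 0 for large |x|.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

# Near 0 the gradient is 0.25; for large |x| it is practically 0,
# so deep stacks of sigmoids pass almost no gradient backwards.
print(sigmoid_grad(np.array([0.0, 5.0, -10.0])))  # ~[0.25, 0.0066, 0.000045]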
What is the exploding gradient problem?
The gradient suddenly becomes very large, so a single gradient-descent update can "jump" far away from the optimal solution.
How can we adapt gradient descent to fix the vanishing and exploding gradient problems?
1) We can use adaptive step sizes.
2) We can clip the gradient, either by element-wise thresholding or by rescaling it to a maximum L2 norm (see the sketch after this list).
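A minimal sketch of L2-norm gradient clipping; the threshold value is an arbitrary example.

import numpy as np

def clip_by_l2_norm(grad, max_norm=5.0):
    # Rescale the gradient so its L2 norm never exceeds max_norm
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad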
What is the main advantage of the ReLU over the sigmoid activation function?
It doesn't saturate for positive inputs, so the gradient does not vanish there.
How can we deal with ReLU saturation for input values below 0?
We can use Leaky ReLU, PReLU, or ELU instead.
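Sketches of two of these variants, assuming the common default slopes (alpha = 0.01 for Leaky ReLU, alpha = 1.0 for ELU); both keep a non-zero gradient for negative inputs.

import numpy as np

def leaky_relu(x, alpha=0.01):
    # Small positive slope instead of 0 for x < 0
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):
    # Smoothly saturates towards -alpha for very negative x
    return np.where(x > 0, x, alpha * (np.exp(x) - 1))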
What types of problems is the mean squared loss function most commonly used for?
Regression.
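For reference, a short sketch of the mean squared error over a set of predictions:

import numpy as np

def mse(y_true, y_pred):
    # Average of the squared differences between targets and predictions
    return np.mean((y_true - y_pred) ** 2)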
What output function do we usually use for binary classification problems?
Sigmoid (or softmax with 2 outputs, one for "true" and one for "false").
What output function do we usually use for Multi-class classification problems?
Softmax
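A numerically stable softmax sketch; subtracting the maximum logit before exponentiating avoids overflow and does not change the result.

import numpy as np

def softmax(logits):
    # Shift by the max for numerical stability; the output is unchanged
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / np.sum(e)  # class probabilities, summing to 1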
What loss function do we usually use for classification?
Cross entropy
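A sketch of the cross-entropy loss for one-hot targets and predicted class probabilities; the small epsilon only guards against log(0) and is an implementation detail.

import numpy as np

def cross_entropy(y_true, y_prob, eps=1e-12):
    # y_true: one-hot targets, y_prob: predicted class probabilities
    return -np.sum(y_true * np.log(y_prob + eps))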
What is data augmentation?
We artificially enlarge the training set by adding distorted, squeezed, tilted, flipped, etc. versions of the original examples.
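A minimal sketch of two common image augmentations (a random horizontal flip and a small shift), assuming images are stored as NumPy arrays; real pipelines typically add rotations, crops, color jitter, and more.

import numpy as np

def augment(image):
    # Randomly flip the image left-right
    if np.random.rand() < 0.5:
        image = image[:, ::-1]
    # Shift the image a few pixels horizontally and vertically
    dx, dy = np.random.randint(-2, 3, size=2)
    image = np.roll(image, shift=(dy, dx), axis=(0, 1))
    return image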