Quiz 1 - Linear Classifiers, Gradient Descent, Neural Networks Flashcards
Feedforward Neural Network
Approximates some target function f* by learning the parameters θ of a mapping y = f(x; θ). The layers of a neural network are composed in a Directed Acyclic Graph (DAG)
Default recommendation for activation function of modern neural networks
rectified linear unit (ReLU)
What is the major difference between neural networks and basic linear models?
The nonlinearity of a neural network causes its loss functions to become nonconvex. As a result, neural networks are usually trained by iterative, gradient-based optimizers that merely drive the cost to a low value (rather than solvers that reach an exact global minimum).
For stochastic gradient descent on a nonconvex loss function, there is no guarantee of convergence, and the result is sensitive to the initial parameter values.
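A minimal sketch of an SGD loop (the loss_grad function, data array, and hyperparameters here are hypothetical, purely for illustration):

```python
import numpy as np

def sgd(loss_grad, theta0, data, lr=0.01, n_steps=1000, batch_size=32, seed=0):
    """Stochastic gradient descent: repeatedly step against the gradient on a random mini-batch."""
    rng = np.random.default_rng(seed)
    theta = np.array(theta0, dtype=float)
    for _ in range(n_steps):
        batch = data[rng.choice(len(data), size=batch_size)]  # sample a mini-batch
        theta -= lr * loss_grad(theta, batch)                 # gradient step
    return theta  # for nonconvex losses this is only an approximate/local minimum
```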
What is regularization?
Modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error.
Example: weight decay, which keeps the weight parameters from growing too large (and potentially overfitting to the training data).
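For example, standard L2 weight decay (λ is the penalty strength) adds a squared-norm penalty to the unregularized cost J:

```latex
\tilde{J}(\mathbf{w}) = J(\mathbf{w}) + \lambda\, \mathbf{w}^{\top}\mathbf{w}
```

The extra gradient term 2λw shrinks the weights a little on every update.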
Neural Network cost function (most common)
Most neural networks are trained using maximum likelihood, so the cost function is the negative log-likelihood.
Negative log-likelihood cost function
The cross-entropy between the training data and the model distribution (measure of difference between two probability distributions)
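Written out in the standard way, assuming the model defines a conditional distribution p_model(y | x):

```latex
J(\boldsymbol{\theta}) = -\,\mathbb{E}_{\mathbf{x},\mathbf{y}\sim\hat{p}_{\text{data}}}\;\log p_{\text{model}}(\mathbf{y}\mid\mathbf{x})
```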
___ and ___ often lead to poor results when used with gradient-based optimization
1) mean squared error
2) mean absolute error
What is the purpose of the sigmoid output?
The sigmoid activation function converts the output to a probability in (0, 1) while ensuring that there is a strong gradient for wrong answers.
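For reference, the standard definition, which maps any real z into (0, 1):

```latex
\sigma(z) = \frac{1}{1 + e^{-z}}
```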
Many objective functions other than log-likelihood do not work as well with the softmax function. Why?
Objective functions that do not use a log to undo the exponent of the softmax fail to learn when the argument to the exponent becomes very negative, causing the gradient to vanish.
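Concretely (standard softmax definition), the log of the softmax undoes the exp, so the log-likelihood keeps a useful gradient even when z_i is very negative:

```latex
\operatorname{softmax}(\mathbf{z})_i = \frac{e^{z_i}}{\sum_j e^{z_j}},
\qquad
\log \operatorname{softmax}(\mathbf{z})_i = z_i - \log\sum_j e^{z_j}
```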
Common hidden layer unit
ReLU
- Differentiable (except at 0)
- outputs zero across half its domain
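A minimal NumPy sketch illustrating both properties (illustrative only):

```python
import numpy as np

def relu(z):
    """g(z) = max(0, z): identity for z > 0, exactly zero for z < 0."""
    return np.maximum(0.0, z)

def relu_grad(z):
    """Derivative: 1 for z > 0, 0 for z < 0; the kink at z = 0 is resolved by convention (0 here)."""
    return (z > 0).astype(float)
```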
Different types of ReLU functions
These variants give the ReLU a non-zero slope when zi < 0 (see the general form after this list):
- Absolute value rectification
  - slope is set to -1, so the unit computes g(z) = |z|
  - used for object recognition from images
- Leaky ReLU
  - slope is non-zero and small, around 0.01
- Parametric ReLU (PReLU)
  - slope is a learnable parameter
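All of these fit the same general form (standard notation, with α_i the slope applied when z_i < 0; α_i = -1 gives |z_i|, a small fixed α_i gives the leaky ReLU, and a learned α_i gives the PReLU):

```latex
h_i = g(\mathbf{z}, \boldsymbol{\alpha})_i = \max(0, z_i) + \alpha_i \min(0, z_i)
```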
What is the problem with using the sigmoid function in a hidden layer?
Sigmoidal units saturate across most of their domain.
When z is strongly positive, they saturate to a high value and when z is strongly negative, they saturate to a low value.
The sigmoidal function is only strongly sensitive near z = 0.
What is typically true of deeper neural networks?
They are often able to use far fewer units per layer and far fewer parameters
universal approximation theorem
a feedforward network with a linear output layer and at least one hidden layer with any squashing activation function can approximate any Borel measurable function from one finite-dimensional space to another with any desired nonzero amount of error, provided that the network is given enough hidden units.
Forward Propagation
The input x provides the initial information, which propagates up through the hidden units at each layer and finally produces the output ŷ
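A minimal sketch of a forward pass through one ReLU hidden layer with a linear output (parameter names and shapes are hypothetical):

```python
import numpy as np

def forward(x, W1, b1, W2, b2):
    """Propagate the input forward: pre-activation, ReLU hidden units, then the output layer."""
    z1 = W1 @ x + b1          # hidden pre-activation
    h1 = np.maximum(0.0, z1)  # ReLU hidden units
    y_hat = W2 @ h1 + b2      # output (linear output layer here)
    return y_hat
```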
back-propagation “backprop”
Allows information from the cost to flow backward through the network in order to compute the gradient
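A minimal sketch of backprop for the same one-hidden-layer network, using a squared-error cost purely for illustration:

```python
import numpy as np

def backward(x, y, W1, b1, W2, b2):
    """Gradients of L = 0.5 * ||y_hat - y||^2 w.r.t. every parameter, via the chain rule."""
    # Forward pass (intermediate values are cached for the backward pass)
    z1 = W1 @ x + b1
    h1 = np.maximum(0.0, z1)
    y_hat = W2 @ h1 + b2
    # Backward pass: the cost gradient flows backward through each layer
    dy = y_hat - y              # dL/dy_hat
    dW2 = np.outer(dy, h1)      # dL/dW2
    db2 = dy                    # dL/db2
    dh1 = W2.T @ dy             # gradient reaching the hidden activations
    dz1 = dh1 * (z1 > 0)        # ReLU gate: gradient passes only where z1 > 0
    dW1 = np.outer(dz1, x)      # dL/dW1
    db1 = dz1                   # dL/db1
    return dW1, db1, dW2, db2
```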
Scalar derivative rules
(fill out the table)
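Assuming the card refers to the standard single-variable rules (c a constant; f, g functions of x):

```latex
\frac{d}{dx}c = 0 \qquad
\frac{d}{dx}x = 1 \qquad
\frac{d}{dx}(cf) = c\frac{df}{dx} \qquad
\frac{d}{dx}x^{n} = n x^{n-1}
\\[4pt]
\frac{d}{dx}(f + g) = \frac{df}{dx} + \frac{dg}{dx} \qquad
\frac{d}{dx}(fg) = f\frac{dg}{dx} + g\frac{df}{dx} \qquad
\frac{d}{dx}f(g(x)) = \frac{df}{dg}\,\frac{dg}{dx}
```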
What is the gradient of a function f(x,y)?
The vector of its partial derivatives
[∂f(x,y)/∂x, ∂f(x,y)/∂y]
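A worked example (the function here is chosen only for illustration):

```latex
f(x, y) = 3x^{2}y
\quad\Longrightarrow\quad
\nabla f(x, y) = \left[\frac{\partial f}{\partial x},\; \frac{\partial f}{\partial y}\right] = \left[\,6xy,\; 3x^{2}\,\right]
```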