Quiz 1 - Linear Classifiers, Gradient Descent, Neural Networks Flashcards
Feedforward Neural Network
Approximates a function by learning a mapping y = f(x; θ) and the parameter values θ that give the best function approximation. The layers of a neural network form a directed acyclic graph (DAG).

Default recommendation for activation function of modern neural networks
rectified linear unit (ReLU)

What is the major difference between neural networks and basic linear models?
The nonlinearity of a neural network causes most loss functions to become nonconvex. As a result, neural networks are usually trained by iterative, gradient-based optimizers that drive the cost to a low value (rather than exactly to 0).
For stochastic gradient descent on nonconvex loss functions, there is no guarantee of convergence, and the result is sensitive to the initial parameter values.
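A minimal sketch of an iterative gradient-based update, shown on a simple (convex) squared-error loss just to illustrate the mechanics; the data, learning rate, and step count are arbitrary choices:
import numpy as np
rng = np.random.default_rng(0)
X = rng.normal(size=(32, 3))            # a small batch of inputs
y = X @ np.array([1.0, -2.0, 0.5])      # targets generated from a known linear rule
w = np.zeros(3)                          # initial parameter values
lr = 0.1                                 # learning rate (illustrative choice)
for _ in range(100):
    grad = 2 * X.T @ (X @ w - y) / len(X)   # gradient of the mean squared error
    w -= lr * grad                          # iterative gradient-based update
print(w)                                     # ends up close to [1, -2, 0.5]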
What is regularization
Modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error.
Example: weight decay to prevent weight parameters from getting too large (and potentially overfitting to train data).
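A hedged sketch of weight decay added to a squared-error loss (the function name and the value of lam are illustrative, not a standard API):
import numpy as np
def loss_with_weight_decay(w, X, y, lam=0.01):
    # squared-error data term plus an L2 penalty that discourages large weights
    data_loss = np.mean((X @ w - y) ** 2)
    penalty = lam * np.sum(w ** 2)       # weight decay term
    return data_loss + penalty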
Neural Network cost function (most common)
Most neural networks are trained using maximum likelihood, and the cost function is the negative log-likelihood.
Negative log-likelihood cost function
The cross-entropy between the training data and the model distribution (measure of difference between two probability distributions)
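As a rough NumPy illustration (the eps value is arbitrary, only there to avoid log(0)):
import numpy as np
def cross_entropy(p_true, p_model, eps=1e-12):
    # negative log-likelihood of the model distribution under the training labels
    p_model = np.clip(p_model, eps, 1.0)   # avoid log(0)
    return -np.sum(p_true * np.log(p_model))
# one-hot label vs. predicted class probabilities
print(cross_entropy(np.array([0.0, 1.0, 0.0]), np.array([0.2, 0.7, 0.1])))   # about 0.357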

___ and ___ often lead to poor results when used with gradient-based optimization
1) mean squared error
2) mean absolute error
What is the purpose of the sigmoid output
The sigmoid activation function converts the output to a probability in (0, 1) while ensuring that there is a strong gradient for wrong answers.
Many objective functions other than log-likelihood do not work as well with the softmax function. Why?
Objective functions that do not use a log to undo the exponent of the softmax fail to learn when the argument to the exponent becomes very negative, causing the gradient to vanish.
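A hedged numerical illustration of why the log matters (logit values chosen only to show the underflow):
import numpy as np
def log_softmax(z):
    z = z - np.max(z)                     # shift logits for numerical stability
    return z - np.log(np.sum(np.exp(z)))  # the log undoes the exponent of the softmax
logits = np.array([-1000.0, 0.0, 1.0])
print(np.exp(logits))       # the very negative logit underflows to 0
print(log_softmax(logits))  # still finite, so the loss stays informative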
Common hidden layer unit
ReLU
- Differentiable (except at 0)
- outputs zero across half its domain
Different types of ReLU functions
Generalizations of ReLU with a non-zero slope for z_i < 0 (sketched in code below):
- Absolute Value Rectification
  - slope is fixed at -1 (the unit computes |z|)
  - used for object recognition from images
- Leaky ReLU
  - slope is non-zero and small, around 0.01
- Parametric ReLU (PReLU)
  - slope is a learnable parameter
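A minimal sketch of these variants (the alpha values are typical illustrative choices):
import numpy as np
def relu(z):
    return np.maximum(0.0, z)
def abs_rectification(z):
    return np.abs(z)                      # slope fixed at -1 for z < 0, i.e. |z|
def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)  # small fixed non-zero slope for z < 0
def prelu(z, alpha):
    # same form as leaky ReLU, but alpha would be a learned parameter in a real network
    return np.where(z > 0, z, alpha * z)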
What is the problem with using the sigmoid function in a hidden layer?
Sigmoidal units saturate across most of their domain.
When z is strongly positive, they saturate to a high value and when z is strongly negative, they saturate to a low value.
The sigmoidal function is only strongly sensitive near z = 0.
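A quick numeric check of that saturation, using the derivative sigmoid(z)*(1 - sigmoid(z)):
import numpy as np
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))
for z in (-10.0, 0.0, 10.0):
    s = sigmoid(z)
    print(z, s * (1 - s))   # derivative: ~4.5e-5, 0.25, ~4.5e-5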
What is typically true of deeper neural networks?
They are often able to use far fewer units per layer and far fewer parameters
universal approximation theorem
a feedforward network with a linear output layer and at least one hidden layer with any squashing activation function can approximate any Borel measurable function from one finite-dimensional space to another with any desired nonzero amount of error, provided that the network is given enough hidden units.
Forward Propagation
The neural network input provides information that propagates up through the hidden units at each layer and finally produces the output ŷ
back-propagation “backprop”
Allows information from the cost to flow backward through the network in order to compute the gradient
Scalar derivative rules
d/dx(c) = 0
d/dx(cx) = c
d/dx(x^n) = n·x^(n-1)
d/dx(f + g) = df/dx + dg/dx
d/dx(f·g) = f·(dg/dx) + g·(df/dx)
d/dx(f(g(x))) = f′(g(x))·g′(x)   (chain rule)
What is the gradient of a function f(x,y)?
The vector of its partial derivatives
[∂f(x,y)/∂x, ∂f(x,y)/∂y]
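For example (f here is arbitrary, chosen only for illustration, and checked with finite differences):
import numpy as np
def f(x, y):
    return x**2 * y + y        # analytic gradient is [2xy, x^2 + 1]
def numeric_gradient(fn, x, y, h=1e-6):
    # central differences for each partial derivative
    dfdx = (fn(x + h, y) - fn(x - h, y)) / (2 * h)
    dfdy = (fn(x, y + h) - fn(x, y - h)) / (2 * h)
    return np.array([dfdx, dfdy])
print(numeric_gradient(f, 2.0, 3.0))   # close to [12, 5]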
What is the Jacobian of two functions f(x,y) and g(x,y)?
The matrix of the partial derivatives (the gradients for each function are rows)
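A small worked example (the two functions are arbitrary illustrations), stacking each function's gradient as a row:
import numpy as np
# f(x, y) = x * y   ->  gradient [y, x]
# g(x, y) = x + 2y  ->  gradient [1, 2]
def jacobian(x, y):
    return np.array([[y,   x],
                     [1.0, 2.0]])
print(jacobian(2.0, 3.0))   # [[3, 2], [1, 2]]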

for neural networks, what is the dot product of vector:
w . x
The dot product w · x is the sum of the element-wise products of w and x
(i.e. w · x = wᵀx = Σᵢ wᵢxᵢ)
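For instance (arbitrary vectors):
import numpy as np
w = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 2.0, 3.0])
print(np.sum(w * x))   # element-wise multiply, then sum
print(w @ x)           # same value, written as the dot product wᵀx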
What does a linear classifier consist of?
- an input
- a function of the input
- a loss function
* One function decomposed into building blocks
What modulates the output of the “neuron”
a non-linear function (e.g. sigmoid) modulates the neuron output to acceptable values.
The Linear Algebra View of a 2-layer neural network
The second layer (hidden) between the input and output layers corresponds to adding another weight matrix to the network
f(x, W1, W2) = sig(W2 sig(W1x))
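A minimal NumPy sketch of that form (shapes and random values are arbitrary):
import numpy as np
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))
rng = np.random.default_rng(0)
x  = rng.normal(size=4)          # input vector
W1 = rng.normal(size=(3, 4))     # hidden-layer weight matrix
W2 = rng.normal(size=(2, 3))     # output-layer weight matrix
h = sigmoid(W1 @ x)              # hidden activations
y = sigmoid(W2 @ h)              # f(x, W1, W2) = sig(W2 sig(W1 x))
print(y)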
Two layered networks can represent any _____ function
continuous
Three layered neural networks can represent any ____ functions
(leave blank)
Theoretically, a 3-layer neural network can represent any function, although in practice it may require an exponentially large number of nodes.
What is the general framework for NN Computation graphs
- Directed Acyclic graphs (DAG)
- Modules in graph must be differentiable for gradient descent
- Training algorithm processes the graph one module at a time
- compositionality is achieved by this process
Computation Graph example for NN
-log( 1 / (1 + e^(-wx)) )
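A hedged decomposition of that expression into one module per graph node (the scalar w and x are chosen arbitrarily):
import numpy as np
w, x = 0.5, 2.0        # arbitrary scalar weight and input
u = -w * x             # multiply node
v = np.exp(u)          # exp node
p = 1.0 / (1.0 + v)    # sigmoid output, 1 / (1 + e^(-wx))
L = -np.log(p)         # negative log node (the loss)
print(L)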

Overview of backpropagation
- Calculate the current model's outputs layer by layer (each layer uses the outputs of the previous layer l − 1)
- aka the forward pass
- Calculate the gradients for each module
- aka the backward pass
Backward pass algorithm “backpropagation”
- start at a loss function to calculate gradients
- calculate gradients of the loss w.r.t. module’s parameters
- progress backwards through modules
- given gradient of the output, compute gradient of the input and pass it back
- end in the input layer, where no gradient needs to be computed (a worked example follows below)
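Continuing the logistic-loss graph from the computation graph card above, a hedged worked example of the backward pass (each line applies the chain rule to one module):
import numpy as np
w, x = 0.5, 2.0
# forward pass: store each intermediate activation
u = -w * x
v = np.exp(u)
p = 1.0 / (1.0 + v)
L = -np.log(p)
# backward pass: chain rule applied one module at a time, moving backwards
dL_dp = -1.0 / p
dL_dv = dL_dp * (-1.0 / (1.0 + v) ** 2)
dL_du = dL_dv * v                 # d(exp(u))/du = exp(u)
dL_dw = dL_du * (-x)              # d(-w*x)/dw = -x
print(dL_dw, -(1 - p) * x)        # matches the closed-form gradient -(1 - sigmoid(wx)) * x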

Backpropagation is the application of ___ to a ___ via the ___
- gradient descent
- computation graph
- chain rule
How do you compute the gradients of the loss with respect to a module's input?
Apply the chain rule: multiply the gradient of the loss with respect to the module's output by the local gradient of the output with respect to the input (∂L/∂input = ∂L/∂output · ∂output/∂input).
reverse-mode automatic differentiation
- Given an ordering (i.e. a DAG), iterate from the last module backwards, applying the chain rule, and store each node's gradient outputs for efficient computation
- forward pass: store the activations
- backward pass: store the gradient outputs
Auto-Diff
A family of algorithms for implementing chain-rule on computation graphs
What computation is performed for gradients from multiple paths?
They are summed: when a value feeds multiple paths in the graph, the gradients flowing back along those paths are added together.
Patterns of Gradient Flow: Addition
Addition operation distributes gradients along all paths

Patterns of Gradient Flow: Multiplication
Multiplication operation is a gradient switcher: each input receives the upstream gradient multiplied by the value of the other term

Patterns of Gradient Flow: Max Operator
Gradient flows along the path that was selected to be the max (which must be recorded in the forward pass)
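A hedged numeric sketch of these three patterns (the values of a, b, and the upstream gradient are arbitrary):
a, b = 3.0, -2.0
upstream = 5.0                   # gradient arriving from later in the graph
# addition (a + b): distributes the upstream gradient unchanged along both paths
grad_add = (upstream, upstream)
# multiplication (a * b): each input gets the upstream gradient times the other input's value
grad_mul = (upstream * b, upstream * a)
# max(a, b): the gradient flows only to the input selected in the forward pass
grad_max = (upstream if a >= b else 0.0, upstream if b > a else 0.0)
print(grad_add, grad_mul, grad_max)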
What is one of the most important aspects in deep neural networks that can cause learning to slow or stop if not done properly?
* the flow of gradients *
What is forward-mode automatic differentiation?
start from the inputs and propagate gradients forward (no backward pass)
*not common in deep learning because the inputs are large (e.g. images) and the output (the loss) is a small scalar
Differentiable programming
- Computational graphs are not limited to mathematical functions
- can have control flows (statements, loops)
- backpropagate through algorithms
- can be done dynamically: nodes are added and gradients are computed as the program runs, repeating as the graph grows
Derivative of sigmoid?
sigmoid(x)*(1 - sigmoid(x))
derivative of cos(x)
-sin(x)
derivative of tanh(x)
sech^2(x), i.e. 1 - tanh^2(x)
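A quick finite-difference check of these derivative cards (x = 0.7 is an arbitrary test point):
import numpy as np
def numeric_derivative(fn, x, h=1e-6):
    return (fn(x + h) - fn(x - h)) / (2 * h)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
x = 0.7
print(numeric_derivative(sigmoid, x), sigmoid(x) * (1 - sigmoid(x)))
print(numeric_derivative(np.cos, x), -np.sin(x))
print(numeric_derivative(np.tanh, x), 1 - np.tanh(x) ** 2)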