Neural Networks Flashcards
sigmoid
What does the activation function look like?
What’s the equation?
- S-shaped curve; σ(x) = 1 / (1 + e^(-x))
- Output from 0 to 1
hyperbolic tangent
What does the activation function look like?
What’s the equation?
- S-shaped curve; tanh(x) = (e^x − e^(-x)) / (e^x + e^(-x))
- Output from -1 to 1
What’s the logistic function?
Same as sigmoid
What are some characteristics of tanh?
- Helps to mitigate the vanishing gradients problem (its gradients are steeper than sigmoid’s), and its output is zero-centered
Compare sigmoid and tanh
Tanh converges faster
What’s a problem with ReLU? How is it addressed?
Derivative at x < 0 is 0, so units with negative inputs get no gradient (“dying ReLU”)
One solution: Instead use an ELU (exponential linear unit)
Describe ReLU (2) and ELU (2)
- ReLU: max(0, x); cheap to compute, but its gradient is 0 for x < 0
- ELU: x for x > 0, α(e^x − 1) for x ≤ 0; smooth, with a nonzero gradient for negative inputs
In general, ELU outperforms ReLU
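A minimal NumPy sketch of the four activations above (function names and test values are my own, not from the cards):

```python
import numpy as np

def sigmoid(x):
    # Squashes inputs to (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # Squashes inputs to (-1, 1); zero-centered
    return np.tanh(x)

def relu(x):
    # max(0, x); derivative is 0 for x < 0 (the "dying ReLU" issue)
    return np.maximum(0.0, x)

def elu(x, alpha=1.0):
    # x for x > 0, alpha * (e^x - 1) for x <= 0; nonzero gradient for x < 0
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(x), tanh(x), relu(x), elu(x), sep="\n")
```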
What’s one thing you could do to convert a NN classifier to a regressor?
“Chop off” the sigmoid activation function and just keep the output of the linear layer, which ranges from -infinity to +infinity
What’s an objective function you could use for regression?
Quadratic Loss:
- the same objective as Linear Regression
- i.e. mean squared error
What’s an objective function you could use for classification?
Cross-Entropy:
- the same objective as Logistic Regression
- i.e. negative log likelihood
- This requires probabilities, so we add an additional “softmax” layer at the end of our network
- “any time you’re using classification, this is a good choice” - Matt
What does a softmax do? (1)
Takes scores and transforms them into a probability distribution
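A minimal NumPy sketch of softmax plus the two objectives above, assuming a single example and an integer class label (names are illustrative):

```python
import numpy as np

def softmax(scores):
    # Shift by the max for numerical stability, then normalize to a distribution
    z = scores - np.max(scores)
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(probs, true_label):
    # Negative log likelihood of the true class
    return -np.log(probs[true_label])

def quadratic_loss(y_hat, y_star):
    # Mean squared error, as in linear regression
    return np.mean((y_hat - y_star) ** 2)

scores = np.array([2.0, 1.0, 0.1])
p = softmax(scores)                       # ~[0.66, 0.24, 0.10], sums to 1
print(p, cross_entropy(p, true_label=0))
print(quadratic_loss(np.array([0.9]), np.array([1.0])))
```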
For which functions does there exist a one-hidden layer neural network that achieves zero error?
Any function
What is the Universal Approximation Theorem? (2)
- a neural network with 1 hidden layer can approximate any continuous function for inputs within a specific range. (might require a ridiculous amount of hidden units, though)
- If the function jumps around or has large gaps, we won’t be able to approximate it.
What do we know about the objective function for a NN? (1)
It’s nonconvex, so you might end up converging to a local minimum (or a saddle point) rather than the global minimum
Which way are you stepping in SGD?
Opposite the gradient (i.e. in the negative gradient direction), since we are minimizing the objective
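A minimal sketch of that step direction, minimizing a toy objective J(θ) = θ² (the objective and step size are made up for illustration):

```python
# Gradient descent on J(theta) = theta^2: step opposite the gradient.
theta = 5.0
learning_rate = 0.1
for _ in range(50):
    gradient = 2 * theta                         # dJ/dtheta
    theta = theta - learning_rate * gradient     # negative-gradient step
print(theta)                                     # approaches 0, the minimizer of J
```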
What’s the relationship between Backpropagation and reverse mode automatic differentiation?
Backpropagation is a special case of a more general algorithm called reverse mode automatic differentiation
What’s a benefit of reverse mode automatic differentiation?
Can compute the gradient of any differentiable function efficiently
When can we compute the gradients for an arbitrary neural network?
- When the network is expressed as a computation graph of differentiable functions, so the chain rule can be applied through the graph
When can we make the gradient computation for an arbitrary NN efficient?
- By reusing intermediate quantities from the forward computation in the backward computation (i.e. reverse mode automatic differentiation / backpropagation)
What are the ways of computing gradients? (4)
- Finite Difference Method
- Symbolic Differentiation
- Automatic Differentiation - Reverse Mode
- Automatic Differentiation - Forward Mode
Describe automatic differentiation - reverse mode (a pro, con, and requirement)
- Note: Called Backpropagation when applied to Neural Nets
- Pro: Computes partial derivatives of one output f(x)_i with respect to all inputs x_j in time proportional to computation of f(x)
- Con: Slow for high dimensional outputs (e.g. vector-valued functions)
- Required: Algorithm for computing f(x)
Describe automatic differentiation - forward mode (a pro, con, and requirement)
- Note: Easy to implement. Uses dual numbers.
- Pro: Computes partial derivatives of all outputs f(x)_i with respect to one input x_j in time proportional to computation of f(x)
- Con: Slow for high dimensional inputs (e.g. vector-valued x)
- Required: Algorithm for computing f(x)
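A minimal sketch of forward mode using dual numbers (the Dual class and test function are illustrative, not a library API):

```python
# Forward-mode AD with dual numbers a + b*eps (eps^2 = 0): the eps coefficient
# carries the derivative with respect to one chosen input.
class Dual:
    def __init__(self, value, deriv=0.0):
        self.value, self.deriv = value, deriv

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

def f(x, y):
    # f(x, y) = x*x*y + y
    return x * x * y + y

# Seed deriv=1 on x to get df/dx at (x, y) = (3, 2); expect 2*x*y = 12
out = f(Dual(3.0, 1.0), Dual(2.0, 0.0))
print(out.value, out.deriv)   # 20.0 12.0
```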
Describe the finite difference method (a pro, con, and requirement)
When is it appropriate to use?
- Pro: Great for testing implementations of backpropagation
- Con: Slow for high dimensional inputs / outputs
- Con: In practice, suffers from issues of floating point precision
- Required: Ability to call the function f(x) on any input x
- only appropriate to use on small examples with an appropriately chosen epsilon
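A minimal sketch of such a check, here using central differences on a toy function with an illustrative epsilon:

```python
import numpy as np

def finite_difference_grad(f, x, epsilon=1e-5):
    # Estimate each partial derivative by perturbing one coordinate at a time
    grad = np.zeros_like(x)
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = epsilon
        grad[j] = (f(x + e) - f(x - e)) / (2 * epsilon)
    return grad

f = lambda x: np.sum(x ** 2)            # analytic gradient is 2x
x = np.array([1.0, -2.0, 3.0])
print(finite_difference_grad(f, x))     # ~[ 2., -4.,  6.] — compare against backprop
```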
Describe symbolic differentiation (2 notes, a pro, con, and requirement)
- Note: The method you learned in high-school
- Note: Used by Mathematica / Wolfram Alpha / Maple
- Pro: Yields easily interpretable derivatives
- Con: Leads to exponential computation time if not carefully implemented
- Required: Mathematical expression that defines f(x)
Key thing about backprop
we’re storing intermediate quantities
Why is backpropagation efficient?
- Reuse in the forward computation
- Reuse in the backward computation
What gradients do we compute and store in the backward pass of backprop?
- The gradients of the objective function with respect to:
- each parameter (the weights)
- the biases
What’s one important thing to remember about back propagation?
All gradients are computed before updating any parameter values
Say the main steps of the pseudocode for SGD with backpropagation for a NN
- Initialize the parameters
- Repeat until convergence: sample a training example (x, y*)
- Forward pass: compute the intermediate quantities, the prediction y^, and the objective J
- Backward pass (backpropagation): compute the gradient of J with respect to every parameter
- Update each parameter by stepping opposite its gradient
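A minimal NumPy sketch of those steps for a one-hidden-layer sigmoid network with quadratic loss (layer sizes, names, and the single fixed training example are illustrative, not the course’s exact pseudocode):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
# Initialize parameters (hypothetical sizes: 3 inputs, 4 hidden units, 1 output)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)
learning_rate = 0.1

# A single fixed training example for illustration; real SGD samples one per step
x, y_star = np.array([0.5, -1.0, 2.0]), np.array([1.0])

for step in range(100):
    # Forward pass: compute and store intermediate quantities
    a1 = W1 @ x + b1
    z1 = sigmoid(a1)
    y_hat = W2 @ z1 + b2                       # linear output layer (regression)
    J = 0.5 * np.sum((y_hat - y_star) ** 2)    # quadratic loss

    # Backward pass: chain rule; compute ALL gradients before any update
    dy = y_hat - y_star                        # dJ/dy_hat
    dW2, db2 = np.outer(dy, z1), dy
    dz1 = W2.T @ dy
    da1 = dz1 * z1 * (1 - z1)                  # sigmoid derivative s(1 - s)
    dW1, db1 = np.outer(da1, x), da1

    # SGD update: step opposite each gradient
    W1 -= learning_rate * dW1; b1 -= learning_rate * db1
    W2 -= learning_rate * dW2; b2 -= learning_rate * db2

print(J)   # should be close to 0 after training
```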
What is y*?
The true label
What is y^?
The predicted label
- What is α(1)?
- What is its shape?
- The parameter (weight) matrix of the first layer
- Two dimensional (it’s a matrix)
When doing matrix multiplication, what’s a useful thing to remember?
Entry (i, j) of the product is the dot product of row i of the 1st matrix with column j of the 2nd, so the result has shape (rows of the 1st) × (columns of the 2nd)
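A quick NumPy check of that rule (matrices chosen arbitrarily):

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])        # shape (2, 3)
B = np.array([[1, 0],
              [0, 1],
              [2, 2]])           # shape (3, 2)
C = A @ B                        # shape (2, 2): (rows of A) x (cols of B)
# Entry (i, j) is the dot product of row i of A with column j of B
assert C[0, 1] == A[0, :] @ B[:, 1]
print(C)
```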
What’s a fully connected layer?
layers where all the inputs from one layer are connected to every activation unit of the next layer
What’s the difference between SGD and backprop?
- Back propagation is computing the gradients
- SGD is updating the parameters
What’s the derivative of sigmoid, s?
s(1-s)
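A short derivation of that identity, writing s = σ(x) = 1 / (1 + e^(-x)):

```latex
\frac{ds}{dx}
  = \frac{d}{dx}\,(1 + e^{-x})^{-1}
  = \frac{e^{-x}}{(1 + e^{-x})^{2}}
  = \frac{1}{1 + e^{-x}} \cdot \frac{e^{-x}}{1 + e^{-x}}
  = s\,(1 - s)
```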
Define topological order
of a directed graph: a linear ordering of its vertices such that for every directed edge uv from vertex u to vertex v, u comes before v in the ordering.
Say the general pseudocode for backpropagation
- Forward pass: visit the nodes of the computation graph in topological order, computing and storing each intermediate value
- Backward pass: visit the nodes in reverse topological order, applying the chain rule to accumulate the gradient of the objective with respect to each node from the gradients of the nodes that depend on it
What do we know about the computation graph for a NN?
Typically a directed acyclic graph
What are some uses of a convolution matrix?
used in image processing for tasks such as edge detection, blurring, sharpening, etc.
What is the identity convolution?
Just has a 1 in the middle and zeros for every other part of the kernel, so it doesn’t change the image at all (except make it smaller if you don’t do any padding)
What’s blurring convolution?
The kernel averages over its window (with higher values in the middle), so you get a blurred version of the image
What’s the basic idea of convolution?
Slide a small kernel (the convolution matrix) across the image; at each position, take the inner product of the kernel with the pixels it covers to produce one output value
What is a stride?
The number of pixels by which you slide the kernel at each step
What’s the difference between CNNs and the basic feed forward NNs we’ve been talking about?
CNNs add convolution layers and max pooling layers
What’s the key idea of CNNs?
Treat the convolution matrix values as parameters and learn them
What is a convolution matrix?
The matrix that has the values that are multiplied by each pixel in the inner product
What is downsampling?
A convolution whose weights are fixed to be uniform (i.e. they’re all the same), so each output is an average over its window
What is max pooling?
- A form of downsampling
- Instead of weighting the window by a convolution matrix, take the max value in the window
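A minimal NumPy sketch of convolution (as a sliding inner product) with a stride, plus 2×2 max pooling; the kernels and sizes are illustrative:

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    # Slide the kernel over the image; each output is an inner product
    kh, kw = kernel.shape
    oh = (image.shape[0] - kh) // stride + 1
    ow = (image.shape[1] - kw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            window = image[i*stride:i*stride+kh, j*stride:j*stride+kw]
            out[i, j] = np.sum(window * kernel)
    return out

def max_pool(feature_map, size=2):
    # Downsample by taking the max over non-overlapping windows
    oh, ow = feature_map.shape[0] // size, feature_map.shape[1] // size
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = feature_map[i*size:(i+1)*size, j*size:(j+1)*size].max()
    return out

image = np.arange(36, dtype=float).reshape(6, 6)
identity_kernel = np.array([[0, 0, 0], [0, 1, 0], [0, 0, 0]], dtype=float)
blur_kernel = np.ones((3, 3)) / 9.0                # uniform weights: averaging
print(conv2d(image, identity_kernel).shape)         # (4, 4): smaller, since no padding
print(max_pool(conv2d(image, blur_kernel)))
```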
What do CNNs use to train?
Back propagation
Do we minimize or maximize the objective function? What are the implications?
- The convention in this class is to minimize the objective function
- So if you need to maximize an objective function, put a minus sign in front of it and minimize that instead
If you have a subdifferentiable function e.g. relu what do you do to perform back prop?
Use a subgradient: pick any slope from the set of slopes of tangent lines at the non-differentiable point (for ReLU at x = 0, anything in [0, 1]; commonly 0 is used)
If you want a fully connected layer following a 3D tensor, what do you do?
Stretch (flatten) the 3D tensor out into a long vector, then matrix-multiply it by a weight matrix; the result is a linear (fully connected) layer
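A minimal NumPy sketch, with made-up tensor and layer sizes:

```python
import numpy as np

# Flatten a 3D tensor (e.g. channels x height x width) into a vector, then
# apply a fully connected (linear) layer via a weight matrix.
rng = np.random.default_rng(0)
tensor = rng.normal(size=(8, 4, 4))        # e.g. 8 feature maps of size 4x4
flat = tensor.reshape(-1)                  # length 8*4*4 = 128
W = rng.normal(size=(10, flat.size))       # weight matrix for 10 output units
b = np.zeros(10)
fc_out = W @ flat + b                      # fully connected layer output
print(fc_out.shape)                        # (10,)
```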