Quiz 1 Flashcards
General Concept non AI, ML
AI –> ML –> DL
general intelligence: ability to acquire and apply knowledge and skills / ability to navigate in problem space
AI: intelligence by machines / sci & engineering of making comp behave in ways that human intelligence do
ML: Study of algo that improve performance at some task with experience
Deep Learning Definition
Representation learning, NN, Deep unsupervised/reinforcement, structured learning
A) Hierarchical compositionality
B) End2End Learning (learning goal driven representations, learning to feature extraction)
Feature eng: SIFT Scale-invariant feature transform, Spin Images, HoG Histogram of Oriented Gradient , Textons
[VISION] (fixed) SIFT/HoG -> (unsup) K-means/pooling -> (sup) classifier
C) Distributed Representations ( non single neuron encodes everything, group of neurons work together
Gradient Descent
Gradient Descent is an algorithm which is used in deep learning such as in Neural Networks.
Gradient descent works by taking the current weights and subtracting the partial derivative of the current loss w.r.t. the weights which is multiplied by some learning rate \alphaα.
As the weights change, the loss changes as well. This is often somewhat smooth locally, so small changes in wrights produce small changes in loss. We can think about iterative algorithms that take current values of weights and modify them a little bit to reduce loss and then repeats this until finding the local minimum.
We can find the steepest descent by computing the derivative
f’(a) = lim f(a+h) - f(a) / h
The steepest descent direction is the negative gradient
Intuitively: Measures how the function changes as the argument changes by a small step size, as the step goes to 0
types of gradient descent & coverage
- Full batch gradient descent: we calculate loss as loss of average loss across all items in our dataset.
- Small batch gradient descent: we take a small subset of data and calculate the loss over those examples and then we computer gradient and take a step in the weight space and then we take another minibatch. We often average loss over mini - batches to prevent large changes in he learning rate.
- WHen is gradient descent guarenteed to converge?
GD is guaranteed to converge in limited circumstances, The LR has t be appropriately reduced. GD converges to local minima. SOme of the local minima it finds are pretty strong.
Gradient Descent is guaranteed to converge IF the learning rate is adequately decayed. Gradient descent does not guarantee convergence to the global minima, rather some local minima, which could be the global minima.
parametric model vs non parametric model
- Examples of non-parametric model ; KNN and decision tree
- Components of a parametric learning algorithm;
Input(and representation) Functional form of the model Performance measure to improve Loss or objective function Algorithm for finding the best parameters Optimization algorithm
- WHat are methods for finding best set of weights:
Random search, genetic, algorithms, greaident based learning
SImilarities between LInear classifier and Neural Network:
SImilarities between LInear classifier and Neural Network:
A neuron takes input(firings) from other neurons(-> input to linear classifier)
The inputs are summed up in a weighed manner(-> weighted sum).
Learning is through a modification of the weights
If it received enough input, it fires(threshold or if weighted sum plus bias is high enough
continuous function
How to proof it is continuous:
1) f(a) is defined
2) Show that the limit of f(x) as a approaches 0 exists
3) Show that the limit of f(x) is equal to the limit of f(a)
How many layers does it take to learn any continuous function:
2 layered networks can represent any continuous function
3 for any function
activation functions
examples;
sigmoid -> 1/1+e^-x
reLU -> max(0, h to the l-1)
To help with overfitting
- for overfitting
- L1 Regularization uses the regularization term |W|∣W∣ which is the L1 norm of the weight matrix. L1 Regularization will encourage the network to learn weights that are small. L1 regularization promotes the network to not rely on any particular weight while also promoting sparsity due to the weights being very small, very close to zero and only a few non zero values.
∣y −Wxj| ^2 + |W|
Local Minima vs Global Minima
Gradient Descent Properties such as the local minima can be defined as the point (locally) which produces results in which the loss is no longer reducing or has been reduced sufficiently (sufficiently being subjective), which is another way of saying that small changes in the weights would not decrease the loss.
The global minima does not mean the same thing as the local minima, but the global minima can equal the local minima.
The global minima is the lowest point of the entire space. Local minima is the lowest spot in a neighborhood (sub-region).
Backpropagation computational graph
1) Calculate the model current outputs(called the forward pass) on some mini batch
2) Calculate the gradients for each module(called the backward pass) w.r.t parameters
2a) FOr the backward pass, we start with the loss function where we know how to calculate the loss. We calc the change in loss with respect to the parameters in order to perform gradient descent. We calculate the change in loss with respect to the inputs in order to send us back to the previous layer
Now the previous layer knows how the loss changes when the output changes. It uses the chain rule to know how the loss changes if our inputs change or if our weights or parameters change. The upstream gradient will then be passed back to our first layer
2a) We pass back gradients to each layer, specifically the partial derivative of the Loss with respect to the models parameters. The partial derivative with respect to the input from the prior layer - dl with respect to d h ^l -1, dl with respect to the dh^l. These derivatives will give us a sense of how our loss will change if we change these parameters
2b) We end at the input layer
3) Use gradient descent to update parameters at the end. Since we have the partial derivatives for the weights with respect to all the weights, we can simply perform wi - wi - alpha * dl/dwi where alpha is equal to some learning rate.
Back prop is just the application of gradient descent to a computation graph via gradient descent
Chain rule to compute the gradient of the loss w.r.t the weights,
dl/dW = al/ah^l * ah^l/aW
The change in loss with respect to our inputrs is multiplied by change in our outputs with respect to our models parameters
Upstream gradient
The derivative of the loss with respect to h^l - dl/dh^l
What are the 3 ways we know a function is differentiable?
When is something considered to be differentiable?
3 ways we know a function is differentiable?
1) Cusp or corner(sharp turn)
2) Discontinuity(jump, point, or infinite)
3) Vertical Tangent(undefined slope)
A function is differentiable if the function is continuous, meaning the graph of the function must not be broken. There cannot be ‘corners’ in the function as well as cannot having any points which have undefined slopes.
Every point is differentiable
Calculating local gradients
It is the change of outputs with respect to our inputs or the change in outputs with respect to our weights. We need the first derivative so we can pass it back to previous modules. We need the second derivative for gradient descent so we can update the weights themselves.
{dh^l/dh^l-1 , dh^l/dw}