Quiz 1 Flashcards
Gradient Descent
Gradient descent is an optimization algorithm widely used in deep learning, for example to train neural networks.
Gradient descent works by taking the current weights and subtracting the partial derivative of the current loss with respect to the weights, multiplied by some learning rate α: w <- w - α * ∂L/∂w.
As the weights change, the loss changes as well. Locally this is often fairly smooth, so small changes in weights produce small changes in loss. We can therefore use an iterative algorithm that takes the current weights, modifies them a little to reduce the loss, and repeats until it finds a local minimum.
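A minimal sketch of this update rule on a made-up quadratic loss (the toy loss, target vector, learning rate, and step count are all illustrative assumptions):

```python
import numpy as np

# Toy loss: L(w) = ||w - target||^2, so dL/dw = 2 * (w - target).
# The target vector is made up purely for illustration.
target = np.array([3.0, -1.0])

def loss(w):
    return np.sum((w - target) ** 2)

def loss_grad(w):
    return 2.0 * (w - target)

alpha = 0.1          # learning rate
w = np.zeros(2)      # current weights

for step in range(100):
    w = w - alpha * loss_grad(w)   # w <- w - alpha * dL/dw

print(w, loss(w))    # w approaches the target, loss approaches 0
```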
We can find the direction of steepest descent by computing the derivative:
f'(a) = lim_{h→0} (f(a+h) − f(a)) / h
The steepest descent direction is the negative gradient
Intuitively: Measures how the function changes as the argument changes by a small step size, as the step goes to 0
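A quick numerical check of this limit, shrinking the step size h (the function f and the step sizes below are arbitrary choices):

```python
# Approximate f'(a) = lim_{h -> 0} (f(a + h) - f(a)) / h with a shrinking h.
def f(x):
    return x ** 2          # toy function; its true derivative is 2x

a = 3.0
for h in (1.0, 0.1, 0.01, 0.001):
    approx = (f(a + h) - f(a)) / h
    print(h, approx)       # approaches the true derivative 6.0 as h -> 0
```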
Examples of non-parametric models
KNN and decision trees
Components of a parametric learning algorithm
Input (and representation)
Functional form of the model
Performance measure to improve (loss or objective function)
Algorithm for finding the best parameters (optimization algorithm)
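A rough sketch of how these components map onto code for a tiny linear regression problem (the data, model form, and learning rate are all illustrative assumptions):

```python
import numpy as np

# Input (and representation): feature vectors X and targets y
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5])

# Functional form of the model: linear, f(x) = x . w
def model(X, w):
    return X @ w

# Performance measure / loss (objective) function: mean squared error
def mse(w):
    return np.mean((model(X, w) - y) ** 2)

# Algorithm for finding the best parameters: gradient descent
w = np.zeros(3)
for _ in range(500):
    grad = 2 * X.T @ (model(X, w) - y) / len(X)
    w -= 0.05 * grad

print(w, mse(w))   # w approaches [1.0, -2.0, 0.5], mse approaches 0
```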
What are methods for finding the best set of weights?
Random search, genetic algorithms, gradient-based learning.
Full batch gradient descent
We calculate the loss as the average loss across all items in our dataset, then take a gradient step.
Small batch gradient descent
We take a small subset of the data, calculate the loss over those examples, compute the gradient, and take a step in weight space; then we take another minibatch. We often average the loss over the mini-batch so we do not have to make large changes to the learning rate as the batch size changes.
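A rough sketch of the minibatch loop (the data, batch size, and squared-error loss are illustrative assumptions; full batch gradient descent is the special case where the batch is the entire dataset):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X @ rng.normal(size=5)

w = np.zeros(5)
alpha, batch_size = 0.05, 32

for epoch in range(20):
    order = rng.permutation(len(X))          # shuffle, then walk through minibatches
    for start in range(0, len(X), batch_size):
        b = order[start:start + batch_size]
        pred = X[b] @ w
        # Average the gradient over the minibatch so the step size does not
        # have to change with the batch size.
        grad = 2 * X[b].T @ (pred - y[b]) / len(b)
        w -= alpha * grad
```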
When is gradient descent guaranteed to converge?
Gradient descent is guaranteed to converge only in limited circumstances: the learning rate has to be appropriately reduced over time. It converges to a local minimum, and some of the local minima it finds are pretty strong.
Describe distributed representation as it pertains to deep learning:
No single neuron encodes everything; groups of neurons work together.
Similarities between a linear classifier and a neural network
A neuron takes input (firings) from other neurons (-> input to the linear classifier).
The inputs are summed up in a weighted manner (-> weighted sum).
Learning happens through modification of the weights.
If it receives enough input, it fires (-> threshold, i.e. the weighted sum plus bias is high enough).
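A tiny sketch of one artificial neuron showing the weighted sum plus bias and the "fires if high enough" threshold (the inputs, weights, and bias are made up):

```python
import numpy as np

x = np.array([0.5, -1.2, 2.0])   # firings from other neurons (the inputs)
w = np.array([0.8, 0.1, -0.4])   # connection weights (learning modifies these)
b = 0.2                          # bias

z = np.dot(w, x) + b             # inputs summed in a weighted manner, plus bias
fires = z > 0                    # fires if the weighted sum plus bias is high enough
print(z, fires)
```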
How many layers does it take to learn any continuous function?
Two-layer networks can represent any continuous function.
How many layers does it take to learn any function?
Three.
Examples of activation functions
sigmoid -> 1 / (1 + e^(-x))
ReLU -> max(0, h^(l-1))
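Direct translations of these two activations, applied elementwise to a vector of pre-activations from the previous layer (the example values are arbitrary):

```python
import numpy as np

def sigmoid(h):
    return 1.0 / (1.0 + np.exp(-h))

def relu(h):
    return np.maximum(0.0, h)

h = np.array([-2.0, 0.0, 3.0])   # example pre-activations h^(l-1)
print(sigmoid(h))                # squashes values into (0, 1)
print(relu(h))                   # zeros out negatives, keeps positives unchanged
```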
What is regularization for?
To help prevent overfitting.
L1 Regularization
L1 regularization adds the regularization term |W|, the L1 norm of the weight matrix. It encourages the network to learn small weights and not to rely on any particular weight, while also promoting sparsity: most weights end up very close to zero, with only a few non-zero values.
|y_j − W x_j|^2 + |W|
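A sketch of this objective with an explicit regularization strength λ added in front of the L1 term (the data shapes and the value of λ are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))     # inputs x_j (one per row)
Y = rng.normal(size=(50, 2))     # targets y_j
W = rng.normal(size=(4, 2))      # weight matrix
lam = 0.01                       # regularization strength (assumed value)

data_loss = np.sum((Y - X @ W) ** 2)    # sum_j |y_j - W x_j|^2
l1_penalty = lam * np.sum(np.abs(W))    # lambda * |W|, the L1 norm of the weights
objective = data_loss + l1_penalty
print(data_loss, l1_penalty, objective)
```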
When is gradient descent guaranteed to converge?
Gradient descent is guaranteed to converge if the learning rate is adequately decayed. It does not guarantee convergence to the global minimum, only to some local minimum, which could happen to be the global minimum.
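One common way to decay the learning rate is a simple 1/t schedule (the base rate and decay constant below are arbitrary illustrative choices):

```python
base_lr, decay = 0.1, 0.01

def lr_at_step(t):
    # The learning rate shrinks over time, which is the kind of schedule
    # the convergence guarantee depends on.
    return base_lr / (1.0 + decay * t)

for t in (0, 100, 1000, 10000):
    print(t, lr_at_step(t))
```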