Quiz 1 Flashcards

1
Q

Gradient Descent

A

Gradient Descent is an algorithm which is used in deep learning such as in Neural Networks.

Gradient descent works by taking the current weights and subtracting the partial derivative of the current loss w.r.t. the weights which is multiplied by some learning rate \alphaα.

As the weights change, the loss changes as well. This is often somewhat smooth locally, so small changes in wrights produce small changes in loss. We can think about iterative algorithms that take current values of weights and modify them a little bit to reduce loss and then repeats this until finding the local minimum.

We can find the steepest descent by computing the derivative

f’(a) = lim f(a+h) - f(a) / h

The steepest descent direction is the negative gradient
Intuitively: Measures how the function changes as the argument changes by a small step size, as the step goes to 0

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

Examples of non-parametric model

A

KNN and decision tree

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

Components of a parametric learning algorithm

A
Input(and representation)
Functional form of the model
Performance measure to improve
Loss or objective function
Algorithm for finding the best parameters
Optimization algorithm
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

WHat are methods for finding best set of weights

A

Random search, genetic, algorithms, greaident based learning

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

Full batch gradient descent

A

we calculate loss as loss of average loss across all items in our dataset.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

Small batch gradient descent

A

we take a small subset of data and calculate the loss over those examples and then we computer gradient and take a step in the weight space and then we take another minibatch. We often average loss over mini - batches to prevent large changes in he learning rate.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

WHen is gradient descent guarenteed to converge?

A

GD is guaranteed to converge in limited circumstances, The LR has t be appropriately reduced. GD converges to local minima. SOme of the local minima it finds are pretty strong.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

Describe distributed representation as it pertains to deep learning:

A

NO single neuron encodes everything and groups of neurons work together.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

SImilarities between LInear classifier and Neural Network

A

A neuron takes input(firings) from other neurons(-> input to linear classifier)
The inputs are summed up in a weighed manner(-> weighted sum).
Learning is through a modification of the weights
If it received enough input, it fires(threshold or if weighted sum plus bias is high enough

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

How many layers does it take to learn any continuous function

A

Two layered networks can represent any continuous function

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

HOw many layers does ti take to learn any function?

A

3

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

Examples of activation functions

A

sigmoid -> 1/1+e^-x

reLU -> max(0, h to the l-1)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

WHat is regularization for?

A

To help with overfitting

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

L1 Regularization

A

L1 Regularization uses the regularization term |W|∣W∣ which is the L1 norm of the weight matrix. L1 Regularization will encourage the network to learn weights that are small. L1 regularization promotes the network to not rely on any particular weight while also promoting sparsity due to the weights being very small, very close to zero and only a few non zero values.

∣y −Wxj| ^2 + |W|

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

When is gradient descent guaranteed to converge?

A

Gradient Descent is guaranteed to converge IF the learning rate is adequately decayed. Gradient descent does not guarantee convergence to the global minima, rather some local minima, which could be the global minima.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

Local Minima vs Global Minima

A

Gradient Descent Properties such as the local minima can be defined as the point (locally) which produces results in which the loss is no longer reducing or has been reduced sufficiently (sufficiently being subjective), which is another way of saying that small changes in the weights would not decrease the loss.

The global minima does not mean the same thing as the local minima, but the global minima can equal the local minima.

The global minima is the lowest point of the entire space. Local minima is the lowest spot in a neighborhood (sub-region).

17
Q

When is something considered to be differentiable?

A

A function is differentiable if the function is continuous, meaning the graph of the function must not be broken. There cannot be ‘corners’ in the function as well as cannot having any points which have undefined slopes.

Every point is differentiable

18
Q

Derivative of sin(x)

A

cos(x)

19
Q

Derivative of cos(x)

A

-sin(x)

20
Q

Derivative of tan(x)

A

sec^2(x)

21
Q

Derivative of cot x

A

-csc^2x

22
Q

Derivative of sec x

A

tan x sec x

23
Q

Derivative of csc x

A

-cot x csc x

24
Q

What are the 3 ways we know a function is differentiable?

A

1) Cusp or corner(sharp turn)
2) Discontinuity(jump, point, or infinite)
3) Vertical Tangent(undefined slope)

25
Q

Piecewise differentiable

A

You are just checking each piece for differentiability

26
Q

How to prove a function is continuous

A

1) f(a) is defined
2) Show that the limit of f(x) as a approaches 0 exists
3) Show that the limit of f(x) is equal to the limit of f(a)

27
Q

L2 regularization

A

L2 Regularization used the regularization term |W|^2 , which is the sum of the squares of the weights. L2 regularization adds “squared magnitude” of coefficient as penalty term to the loss function.

|y - Wxj|^2 + |W|^2

28
Q

Backpropagation computational graph

A

1) Calculate the model current outputs(called the forward pass) on some mini batch
2) Calculate the gradients for each module(called the backward pass) w.r.t parameters
2a) FOr the backward pass, we start with the loss function where we know how to calculate the loss. We calc the change in loss with respect to the parameters in order to perform gradient descent. We calculate the change in loss with respect to the inputs in order to send us back to the previous layer

Now the previous layer knows how the loss changes when the output changes. It uses the chain rule to know how the loss changes if our inputs change or if our weights or parameters change. The upstream gradient will then be passed back to our first layer

2a) We pass back gradients to each layer, specifically the partial derivative of the Loss with respect to the models parameters. The partial derivative with respect to the input from the prior layer - dl with respect to d h ^l -1, dl with respect to the dh^l. These derivatives will give us a sense of how our loss will change if we change these parameters
2b) We end at the input layer
3) Use gradient descent to update parameters at the end. Since we have the partial derivatives for the weights with respect to all the weights, we can simply perform wi - wi - alpha * dl/dwi where alpha is equal to some learning rate.

Back prop is just the application of gradient descent to a computation graph via gradient descent

29
Q

Upstream gradient

A

The derivative of the loss with respect to h^l - dl/dh^l

30
Q

Calculating local gradients

A

It is the change of outputs with respect to our inputs or the change in outputs with respect to our weights. We need the first derivative so we can pass it back to previous modules. We need the second derivative for gradient descent so we can update the weights themselves.

{dh^l/dh^l-1 , dh^l/dw}

31
Q

Chain rule to compute the gradient of the loss w.r.t the inputs, otherwise known as the upstream gradient

A

Chain rule: dz / dx= dz/dy * dy/dx

The gradient of the loss with respect to it’s inputs is equal to the gradient of the loss with respect to it’s outputs * by the gradient of its outputs with respect to its inputs.

In order for us to know how the loss changes with respect to the inputs, we also need to know how the loss changes with respect to the outputs.

This is considered the upstream gradient because it is what we receive from the subsequent layer.

dl/dh^l-1 = dl/dh^l-1 * dh^l/dh^l-1

32
Q

Chain rule to compute the gradient of the loss w.r.t the weights,

A

dl/dW = al/ah^l * ah^l/aW

The change in loss with respect to our inputrs is multiplied by change in our outputs with respect to our models parameters