Class 3 4 5 Flashcards
What is gradient-based optimization?
Gradient-based optimization is used to maximize/minimize an objective function, e.g. minimizing cost/error.
What is a derivative and how do we calculate it?
A derivative gives the slope of the function f(x).
We calculate it as the limit as epsilon goes to zero:
f'(x) = lim_{epsilon -> 0} (f(x + epsilon) - f(x)) / epsilon
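A minimal numerical sketch of this definition, using a small fixed epsilon in place of a true limit (the helper name is my own):

```python
def numerical_derivative(f, x, eps=1e-6):
    # Finite-difference approximation: (f(x + eps) - f(x)) / eps
    return (f(x + eps) - f(x)) / eps

# f(x) = x^2 has derivative f'(x) = 2x, so f'(3) should be close to 6
print(numerical_derivative(lambda x: x ** 2, 3.0))
```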
What are the critical points of the gradient, and why are they critical?
The critical points are local minima, local maxima, and saddle points. These points are critical because the gradient there is zero, so it gives us no information about which direction to move. The perceptron uses a step function, whose gradient is zero.
How do we find a minimum?
Given an objective function, e.g. J(z) = z^2:
we calculate its derivative (gradient): J'(z) = 2z
we start from a random value assigned to z
we update z in the opposite direction of the gradient:
z = z - alpha * J'(z)
we continue doing that until it converges
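The steps above can be sketched as follows (the step size alpha and the iteration count are arbitrary choices of mine):

```python
def gradient_descent(grad, z0, alpha=0.1, steps=100):
    # Repeatedly step against the gradient until (approximately) converged
    z = z0
    for _ in range(steps):
        z = z - alpha * grad(z)
    return z

# J(z) = z^2, so J'(z) = 2z; the minimum is at z = 0
z_min = gradient_descent(lambda z: 2 * z, z0=5.0)
print(z_min)  # very close to 0
```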
What is a gradient?
A gradient is the generalization of the derivative to a vector of input variables; the calculations are the same, but instead of a single variable we now have a vector-valued input.
What is gradient descent?
Idea: decrease the function f by moving in the opposite direction of the gradient.
x' = x - alpha * f'(x)
alpha is the learning rate, a small positive value.
What is the Jacobian matrix?
A Jacobian matrix is the matrix of all the partial derivatives of a vector-valued function.
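A small sketch: approximating the Jacobian of a vector-valued function by finite differences (numpy-based; the helper name and example function are mine):

```python
import numpy as np

def numerical_jacobian(f, x, eps=1e-6):
    # J[i, j] = d f_i / d x_j, approximated by finite differences
    x = np.asarray(x, dtype=float)
    fx = np.asarray(f(x))
    J = np.zeros((fx.size, x.size))
    for j in range(x.size):
        x_step = x.copy()
        x_step[j] += eps
        J[:, j] = (np.asarray(f(x_step)) - fx) / eps
    return J

# f(x, y) = [x*y, x + y] has Jacobian [[y, x], [1, 1]]
f = lambda v: np.array([v[0] * v[1], v[0] + v[1]])
print(numerical_jacobian(f, [2.0, 3.0]))  # approximately [[3, 2], [1, 1]]
```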
What is the difference between the first derivative and the second derivative?
The first derivative tells us how quickly the function changes; the second derivative gives curvature information, i.e. how quickly the derivative itself changes.
What does it mean for the second derivative to be = 0, <= 0, or >= 0?
If the second derivative is zero, there is no curvature: the function is locally a line, and the gradient predicts the improvement of a step exactly. If it is negative, the function curves downward, so the gradient underestimates the improvement (the step helps more than predicted). If it is positive, the function curves upward, so the gradient overestimates the improvement (the step helps less than predicted).
What is a Hessian matrix? Explain everything briefly.
A Hessian matrix is the matrix of all the second-order partial derivatives. The condition number of the Hessian is the ratio between its largest and smallest eigenvalues.
Second-order methods are really slow because we first need to compute the Hessian matrix.
We generally don't use it in NNs; in NNs we use first-order derivatives.
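A minimal sketch of the condition number, using the (constant) Hessian of f(x, y) = x^2 + 10y^2 as a toy example of my choosing:

```python
import numpy as np

# Hessian of f(x, y) = x^2 + 10*y^2 is constant: [[2, 0], [0, 20]]
H = np.array([[2.0, 0.0], [0.0, 20.0]])
eigvals = np.linalg.eigvalsh(H)  # eigenvalues of a symmetric matrix
print(eigvals.max() / eigvals.min())  # condition number: 10.0
```

A large condition number means the curvature differs a lot between directions, which makes gradient descent perform poorly.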
What is entropy? What is KL divergence? What is cross entropy?
Entropy measures the information available in a distribution. KL divergence measures the difference between two probability distributions P(x) and Q(x); it is not a true distance because it is not symmetric.
Minimizing the cross entropy of P with respect to Q leads to minimizing the KL divergence.
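A small numpy sketch of the three quantities and the asymmetry of KL (the two distributions are a toy example of mine):

```python
import numpy as np

def entropy(p):
    return -np.sum(p * np.log(p))

def kl_divergence(p, q):
    return np.sum(p * np.log(p / q))

def cross_entropy(p, q):
    # H(P, Q) = H(P) + KL(P || Q)
    return -np.sum(p * np.log(q))

p = np.array([0.5, 0.5])
q = np.array([0.9, 0.1])
print(kl_divergence(p, q), kl_divergence(q, p))  # the two differ: KL is not symmetric
print(np.isclose(cross_entropy(p, q), entropy(p) + kl_divergence(p, q)))  # True
```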
Why using MSE is a bad choice in NNs especially in classification?
When using mean squared error we assume that our data follows a normal (Gaussian) distribution. But in classification, e.g. with 2 classes, the labels follow a Bernoulli distribution, not a normal one.
MSE used in linear regression corresponds to maximum likelihood (with Gaussian noise).
What is maximum likelihood? What improvements are made?
Maximum likelihood chooses the model that makes the observed data most probable: p(x; model).
argmax Π p(x; model)
But this product gives an unstable calculation, therefore we calculate it as:
argmax Σ log p(x; model)
If we want to minimize the KL divergence, and therefore the cross entropy, we need to maximize the likelihood.
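The instability is easy to demonstrate: a long product of probabilities underflows to zero in floating point, while the sum of logs stays well-behaved (toy numbers of my choosing):

```python
import numpy as np

probs = np.full(2000, 0.5)    # 2000 likelihood terms, each 0.5
print(np.prod(probs))         # 0.0 -- the product underflows
print(np.sum(np.log(probs)))  # about -1386.29 -- the log-sum is fine
```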
What is conditional log likelihood?
Conditional log likelihood estimates the conditional probability (supervised learning):
argmax Σ log P(y | x; model)
(given x, predict y)