Lecture 8 - (Stochastic) Gradient Descent, Regularization, Artificial Neural Networks, Perceptron Flashcards
What is Gradient Descent?
Gradient Descent is an optimization algorithm. It is capable of finding optimal solutions to a wide range of problems.
What is the complexity of Gradient Descent?
O(ndt), where n = number of training examples, d = number of features, and t = number of iterations
Explain the process of Gradient Descent step-by-step.
- Take the derivative of the Loss Function for each parameter in it
- Pick random values for the parameters
- Plug the parameter values into the derivatives
- Calculate the Step Size = Slope x Learning Rate
- Calculate New Parameters = Old Parameters - Step Size
- Then, go back to step 3 and repeat until the step size is very small, or you reach the maximum number of steps (a minimal sketch of this loop follows below)
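A minimal sketch (not from the lecture) of this loop, assuming simple linear regression with a sum-of-squared-residuals loss; the names `gradient_descent`, `learning_rate`, `max_steps`, and `tol` are illustrative:

```python
import numpy as np

def gradient_descent(x, y, learning_rate=0.01, max_steps=1000, tol=1e-6):
    intercept, slope = 0.0, 0.0                      # step 2: starting parameter values
    for _ in range(max_steps):
        residuals = y - (intercept + slope * x)
        # step 3: plug the current parameters into the derivatives of the loss
        d_intercept = -2 * np.sum(residuals)
        d_slope = -2 * np.sum(residuals * x)
        # step 4: step size = slope (derivative) x learning rate
        step_intercept = learning_rate * d_intercept
        step_slope = learning_rate * d_slope
        # step 5: new parameters = old parameters - step size
        intercept -= step_intercept
        slope -= step_slope
        # step 6: stop when the step size is very small
        if max(abs(step_intercept), abs(step_slope)) < tol:
            break
    return intercept, slope

x = np.array([0.5, 2.3, 2.9])
y = np.array([1.4, 1.9, 3.2])
print(gradient_descent(x, y))
```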
How does Gradient Descent Calculate step size?
Slope x Learning Rate
How does Gradient Descent know where to stop descending on the curve to find the optimal value?
When the step size is very close to 0
True or False. The step size is proportional to the Loss Function slope (in Gradient Descent)
True. Because the step size is proportional to the slope, the steps automatically get smaller as the slope flattens out near the minimum.
What kind of parameter is the learning rate in Gradient Descent?
Hyperparameter
What are the disadvantages of Gradient Descent?
The key practical problems are:
- converging to a local minimum can be quite slow
- if there are multiple local minima, there is no guarantee that the procedure will find the global minimum (Note: with other error definitions the loss surface can have multiple local minima; with the sum-of-squares error this is not a problem.)
There is a chance it will not reach the global minimum, either due to a plateau or due to a local minimum
What is the difference between Gradient Descent and (Stochastic) Gradient Descent
Compared to regular Gradient Descent, Stochastic Gradient Descent randomly picks one sample for each step and uses just that one sample to calculate the derivatives
True or False. In practice, as in theory, Stochastic Gradient Descent only works with ONE sample taken for each step.
False.
In practice, it is common to select a small subset of data (a mini-batch) for each step → this takes the best of both worlds between using one sample and using all of the data → it is faster than using all of the data, and yields more stable parameters than using only one sample (a mini-batch sketch follows below)
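A hedged sketch of the mini-batch variant, under the same simple-linear-regression assumption as the earlier gradient-descent sketch; `batch_size` and `epochs` are illustrative names (with `batch_size=1` this reduces to plain Stochastic Gradient Descent):

```python
import numpy as np

def minibatch_sgd(x, y, learning_rate=0.01, batch_size=2, epochs=200, seed=0):
    rng = np.random.default_rng(seed)
    intercept, slope = 0.0, 0.0
    n = len(x)
    for _ in range(epochs):
        # use a random mini-batch instead of the whole dataset for each step
        idx = rng.choice(n, size=min(batch_size, n), replace=False)
        residuals = y[idx] - (intercept + slope * x[idx])
        d_intercept = -2 * np.sum(residuals)
        d_slope = -2 * np.sum(residuals * x[idx])
        intercept -= learning_rate * d_intercept
        slope -= learning_rate * d_slope
    return intercept, slope

x = np.array([0.5, 2.3, 2.9, 4.1, 5.0])
y = np.array([1.4, 1.9, 3.2, 4.0, 4.8])
print(minibatch_sgd(x, y))
```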
Is Stochastic Gradient Descent going to yield better outcomes than the Gradient Descent?
Probably not, because each step uses only a sample of the data rather than all of it. But it is good enough when the alternative (using all of the data) is very time-consuming and computationally heavy
What is the purpose of regularization?
To add a penalty to the complexity of a model in order to avoid overfitting.
What are the two ways (we studied) to perform regularization?
L2 - Ridge Regression
L1 - Lasso Regression
The Ridge Regression line (the equation it minimizes) is equal to …
the sum of the squared residuals (for linear regression) + lambda x slope^2
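A small sketch of this quantity in code, assuming simple linear regression with a single slope; `ridge_loss` is an illustrative name, not lecture code:

```python
import numpy as np

def ridge_loss(intercept, slope, x, y, lam):
    residuals = y - (intercept + slope * x)
    # sum of squared residuals + lambda * slope^2
    return np.sum(residuals ** 2) + lam * slope ** 2

x = np.array([0.5, 2.3, 2.9])
y = np.array([1.4, 1.9, 3.2])
print(ridge_loss(0.5, 0.8, x, y, lam=1.0))
```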
Essentially, what does Ridge Regression do?
Ridge Regression can improve predictions from new data by making the predictions less sensitive to the training data, especially when sample sizes are relatively small
What is lambda in Ridge Regression?
Lambda essentially says how harsh the Ridge Regression penalty should be (basically it controls the “strength” of the regularization)
It can take values between 0 and infinity.
- when LAMBDA = 0, the Ridge Regression penalty is also 0 → the Ridge Regression line only minimizes the sum of squared residuals and is the same as the Least Squares line
- when LAMBDA = 1 → the slope is smaller than the Least Squares slope
- … and the slope keeps getting smaller as LAMBDA gets larger
- So, the larger we make lambda, the less sensitive the prediction for y becomes to X (see the sketch below)
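A rough illustration of this shrinking slope, assuming scikit-learn is available; its `Ridge` estimator uses `alpha` in the role of lambda, and the data here is made up:

```python
import numpy as np
from sklearn.linear_model import Ridge

X = np.array([[0.5], [2.3], [2.9], [4.1], [5.0]])
y = np.array([1.4, 1.9, 3.2, 4.0, 4.8])

for lam in [0.01, 1, 10, 100]:
    slope = Ridge(alpha=lam).fit(X, y).coef_[0]
    print(lam, round(slope, 3))   # the slope keeps getting smaller as lambda grows
```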
How do you decide which lambda to use in a Ridge Regression?
We just try a bunch of values for LAMBDA and use Cross-validation (typically 10-fold) to determine which one results in the lowest variance
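One possible way to do this with scikit-learn's `RidgeCV` (a sketch under the assumption that scikit-learn is used; `alphas` holds the candidate lambda values and `cv=10` requests 10-fold cross-validation):

```python
import numpy as np
from sklearn.linear_model import RidgeCV

# made-up training data for illustration
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(30, 1))
y = 0.9 * X[:, 0] + rng.normal(0, 1, size=30)

model = RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0], cv=10).fit(X, y)
print(model.alpha_)   # the lambda that did best in 10-fold cross-validation
```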
Which regression models does Ridge Regression work with?
Ridge Regression also works with discrete X to predict something continuous, with Logistic Regression (where, instead of the sum of squared residuals, it minimizes the sum of the negative log-likelihoods, since Logistic Regression is solved using Maximum Likelihood), and with even more complicated models
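A rough sketch of the penalized objective for the logistic case, assuming a single feature and a negative log-likelihood term; this is my illustration, not lecture code:

```python
import numpy as np

def ridge_logistic_loss(intercept, slope, x, y, lam):
    p = 1 / (1 + np.exp(-(intercept + slope * x)))   # predicted probabilities
    neg_log_likelihood = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    return neg_log_likelihood + lam * slope ** 2     # penalized with lambda * slope^2

x = np.array([0.5, 2.3, 2.9, 4.1])
y = np.array([0, 0, 1, 1])
print(ridge_logistic_loss(0.0, 1.0, x, y, lam=1.0))
```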
In simple words, what do low bias and high variance mean?
Low bias means that the regression line fits training data well.
High variance means that the regression line fits testing data poorly.
What is the difference between Lasso (L1) and Ridge (L2) regression?
They have the same goal, but the difference lies in the equation each one minimizes.
If we take the equation that Ridge Regression minimizes and plug in the absolute value of the slope instead of the squared slope, we get the equation that Lasso Regression minimizes
The Lasso Regression line (the equation it minimizes) is equal to …
the sum of squared residuals (for linear regression) + lambda x |slope|
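A companion sketch to the earlier `ridge_loss` example: the only change for Lasso is the absolute value of the slope in place of the squared slope (again an illustrative name, not lecture code):

```python
import numpy as np

def lasso_loss(intercept, slope, x, y, lam):
    residuals = y - (intercept + slope * x)
    # sum of squared residuals + lambda * |slope|
    return np.sum(residuals ** 2) + lam * abs(slope)

x = np.array([0.5, 2.3, 2.9])
y = np.array([1.4, 1.9, 3.2])
print(lasso_loss(0.5, 0.8, x, y, lam=1.0))
```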