Learning rate Flashcards
Q: How does the choice of the learning rate α affect gradient descent?
A: It strongly affects efficiency: a well-chosen α lets gradient descent converge quickly, while a poorly chosen α can make it very slow or prevent it from working at all.
Q: What happens if the learning rate α is too small in gradient descent?
A: The algorithm will take very small steps, resulting in slow convergence and taking a long time to reach the minimum.
Q: What is the effect of a too large learning rate α in gradient descent?
A: It can cause the algorithm to overshoot the minimum, potentially even causing it to diverge and fail to converge.
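The following minimal sketch illustrates the two previous cards on an assumed toy cost J(w) = w² (so dJ/dw = 2w), which is not part of the cards themselves: a too-small α barely moves w, a moderate α converges toward the minimum at w = 0, and a too-large α overshoots and diverges.

```python
def gradient_descent(alpha, w0=10.0, steps=20):
    """Run gradient descent on the toy cost J(w) = w**2, whose derivative is 2*w."""
    w = w0
    for _ in range(steps):
        grad = 2 * w          # dJ/dw at the current w
        w = w - alpha * grad  # the gradient descent update
    return w

# Too-small alpha: after 20 steps w has barely moved from its start at 10.
print(gradient_descent(alpha=0.001))  # roughly 9.6

# Moderate alpha: w is already close to the minimum at w = 0.
print(gradient_descent(alpha=0.1))    # roughly 0.12

# Too-large alpha: every update overshoots, |w| grows, and the run diverges.
print(gradient_descent(alpha=1.1))    # magnitude in the hundreds and still growing
```

With α = 1.1 each update multiplies w by −1.2, so the iterates alternate in sign and grow in magnitude, which is the divergence described above.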
Q: What does the derivative term d/dw J(w) indicate in gradient descent?
A: It gives the slope of the cost function at the current w, i.e. the direction and rate of steepest ascent; gradient descent subtracts α times this term so that w moves downhill toward lower cost.
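For reference, the full update that uses this term, written in the same notation as the cards, is the standard gradient descent rule: w := w − α ⋅ (d/dw) J(w). A positive slope decreases w and a negative slope increases w, so the update always moves against the slope.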
Q: What is the outcome when the gradient descent reaches a local minimum?
A: The derivative becomes zero, leading to no change in the parameter w, thus maintaining the local minimum position.
Q: How does the gradient descent behave if the parameter w is already at a local minimum?
A: The parameter w remains unchanged, because the derivative is zero there and the update rule gives w = w − α⋅0 = w.
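As a concrete instance on the assumed toy cost J(w) = w² (not from the cards), the minimum sits at w = 0, where the derivative 2w equals 0; the update then gives w = 0 − α⋅0 = 0, so gradient descent stays at the minimum regardless of α.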
Q: What will happen to the cost J if the learning rate α is properly chosen?
A: The cost J will gradually decrease until it reaches a local or global minimum.
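A short sketch of this behaviour, again on the assumed toy cost J(w) = w² with a hypothetical fixed α = 0.1 (values not from the cards), printing the cost each iteration to show it decreasing monotonically toward its minimum value of 0:

```python
def cost_trace(alpha=0.1, w=10.0, steps=6):
    """Track the cost J(w) = w**2 per iteration with a well-chosen fixed alpha."""
    for i in range(steps):
        print(f"iter {i}: J(w) = {w * w:8.3f}")
        w = w - alpha * 2 * w     # gradient descent step on J(w) = w**2
    print(f"final  : J(w) = {w * w:8.3f}")

cost_trace()
```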
Q: How does gradient descent ensure convergence to a minimum with a fixed learning rate α?
A: As w approaches the minimum, the derivative decreases, resulting in smaller updates and thus gradual convergence.
Q: What does it mean if the derivative term in gradient descent is large?
A: The update step will be larger, indicating a steeper slope and a need for a larger adjustment.
Q: What is the relationship between the slope of the cost function and the size of the steps in gradient descent?
A: A steeper slope (larger derivative) results in larger steps, while a flatter slope (smaller derivative) results in smaller steps.
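The next sketch makes the last three cards concrete, again on the assumed toy cost J(w) = w²: it prints the slope 2w and the step α⋅2w at each iteration, showing that with a fixed α the steps shrink automatically as w approaches the minimum, because the slope itself shrinks.

```python
def shrinking_steps(alpha=0.1, w=10.0, steps=8):
    """With a fixed alpha, steps shrink as w nears the minimum of J(w) = w**2."""
    for i in range(steps):
        grad = 2 * w              # slope of J at the current w
        step = alpha * grad       # update size is proportional to the slope
        w = w - step
        print(f"iter {i}: slope = {grad:7.3f}, step = {step:6.3f}, w = {w:6.3f}")

shrinking_steps()
```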
Q: How does gradient descent behave if w is initialized near the minimum and α is too large?
A: It may overshoot the minimum and keep bouncing back and forth without converging.
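A worked instance of this bouncing, on the assumed toy cost J(w) = w² with the hypothetical choice α = 1.5: the update becomes w := w − 1.5⋅2w = −2w, so starting from w = 1 the iterates are 1, −2, 4, −8, …, jumping across the minimum at w = 0 on every step while growing in magnitude.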
Q: Why is it crucial to choose an appropriate learning rate α in gradient descent?
A: To ensure efficient convergence to a minimum without overshooting or taking excessively small steps.