Gradient Descent Flashcards
What is GD?
- A guided search through weight space from a random starting point
- Treats model training as an optimization problem
What features of the error surface help us with GD?
- Convex (bowl-shaped)
- A single global minimum
Steps in GD
1. Start by selecting a random point in weight space (i.e. each weight is assigned a random value within a sensible range).
2. Calculate the SSE/error associated with this point, based on the predictions made for each instance in the training set using the randomly selected weights. This defines a point on the error surface.
3. We know very little else about the relative position of this point on the error surface; the available information is localized.
4. It is possible to determine the slope of the error surface by taking the derivative of the function used to generate it.
5. The randomly selected weights are adjusted slightly, following the slope (gradient) of the error surface downhill, to move to a new position on the error surface closer to the global minimum.
6. This adjustment of the weights (each weight is considered independently) is repeated over and over until the global minimum is reached.
7. Eventually the algorithm converges to a point on the error surface where any subsequent change to the weights does not lead to a noticeably better model (within some tolerance). The global minimum corresponds to the most accurate predictive model.
GD pseudo-code
Require: D
Require: learning rate
Require: errorDelta function
Require: convergence criterion
1. w = random point in weight space
2. repeat
3.   for each w[j] in w do
4.     w[j] = w[j] + lr * errorDelta(D, w[j])
5. until convergence
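A minimal Python sketch of this loop, assuming a simple linear model (prediction = X @ w) and a sum-of-squared-errors surface; the function names, toy dataset, and learning-rate/tolerance values are illustrative assumptions, not taken from the source. This is the batch variant: errorDelta sums over every training instance.

```python
import numpy as np

def error_delta(X, t, w, j):
    """Batch gradient term for weight j: sum over all training instances of
    (target - prediction) * feature_j. Moving w[j] in this direction reduces the SSE."""
    preds = X @ w                                  # predictions of the candidate model
    return np.sum((t - preds) * X[:, j])

def gradient_descent(X, t, lr=0.01, tol=1e-6, max_iters=10_000):
    """Batch gradient descent following the pseudo-code above: start at a random
    point in weight space and repeatedly nudge each weight downhill."""
    rng = np.random.default_rng(0)
    w = rng.uniform(-0.2, 0.2, size=X.shape[1])    # random starting point in weight space
    prev_sse = np.inf
    for _ in range(max_iters):
        for j in range(len(w)):                    # each weight adjusted in turn, as in the pseudo-code
            w[j] = w[j] + lr * error_delta(X, t, w, j)
        sse = np.sum((t - X @ w) ** 2)
        if abs(prev_sse - sse) < tol:              # convergence criterion
            break
        prev_sse = sse
    return w

# toy usage: recover the weights of a noiseless linear relationship t = 3 + 2x
X = np.column_stack([np.ones(20), np.linspace(0, 1, 20)])  # bias column + one feature
t = 3.0 + 2.0 * X[:, 1]
print(gradient_descent(X, t))
```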
Batch GD
- Only one adjustment is made to each weight per iteration of the algorithm, based on the summed squared error made by the candidate model across every instance in the training dataset
- straightforward
- accurate
- reasonably efficient
- widely used
Stochastic GD
An adjustment to each weight is made based on the error in the prediction made by the candidate model for each individual training instance.
- This means many more weight adjustments are made per pass through the training data
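A sketch of the stochastic variant under the same assumptions as the batch sketch above (toy linear model, SSE); the weights are now adjusted after every single training instance rather than once per pass.

```python
import numpy as np

def stochastic_gradient_descent(X, t, lr=0.01, epochs=100):
    """Stochastic GD: one weight adjustment per training instance,
    so a single pass over the data makes len(X) updates to the weights."""
    rng = np.random.default_rng(0)
    w = rng.uniform(-0.2, 0.2, size=X.shape[1])    # random starting point in weight space
    for _ in range(epochs):
        for i in rng.permutation(len(X)):          # visit instances in random order
            error = t[i] - X[i] @ w                # error on this single instance only
            w = w + lr * error * X[i]              # adjust every weight using this instance's error
    return w

# usage, reusing the toy X and t from the batch sketch above
# print(stochastic_gradient_descent(X, t))
```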
Learning rate
Determines size of adjustments made to each weight at each step in the process
Small lr: the algorithm will eventually converge to the global minimum, but it will take a long time
Large lr: causes jumps from one side of the error surface to the other, with a strong chance that the global minimum will be missed; it may even cause the error to increase rather than decrease, leading to a process that never converges
Initial random weights guidelines
[-0.2, 0.2]
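One way to draw initial weights in the suggested range; the NumPy call and the choice of 5 weights are illustrative assumptions.

```python
import numpy as np

# each initial weight drawn uniformly from [-0.2, 0.2]; size = number of weights in the model
w = np.random.default_rng().uniform(-0.2, 0.2, size=5)
```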
Learning rate decay
Allows the learning rate to start at a large value and then decay over time according to a predefined schedule
lr[t] = lr[0] * (c / (c + t))
where t = current iteration of the GD algorithm
c = constant controlling how quickly the learning rate decays (often set to a fairly large value, e.g. 100)
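A small sketch of this decay schedule; the initial rate lr0 = 0.5 and c = 100 are illustrative values.

```python
def decayed_lr(lr0, c, t):
    """Learning rate at iteration t: starts at lr0 and decays by the factor c / (c + t)."""
    return lr0 * c / (c + t)

# example: lr0 = 0.5, c = 100
for t in (0, 10, 100, 1000):
    print(t, round(decayed_lr(0.5, 100, t), 4))
# 0 -> 0.5, 10 -> 0.4545, 100 -> 0.25, 1000 -> 0.0455
```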
Alternative to GD
An interesting alternative to gradient descent is population-based training algorithms, such as evolutionary algorithms (EAs) and particle swarm optimisation (PSO).
The basic idea behind population-based approaches is that a population of candidate solutions (NN weight vectors) is created; the candidate solutions iteratively explore the search space, exchanging information, and eventually converge on a minimum.
Because many starting points (candidate solutions) are used, the chances of converging on the global minimum are significantly increased. PSO and EAs have been shown to perform very competitively, often (albeit not always) outperforming gradient descent on complex NN training problems.
Cons: prone to overfitting, because they aggressively optimise the training criterion.
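A minimal particle swarm sketch for minimising an error function over a weight vector, to make the "population of candidate solutions" idea concrete. The swarm size, inertia and attraction constants, and the toy SSE objective are illustrative assumptions, not values from the source.

```python
import numpy as np

def pso_minimise(error_fn, dim, n_particles=30, iters=200,
                 inertia=0.7, c_personal=1.5, c_social=1.5, seed=0):
    """Basic PSO: a population of candidate weight vectors explores weight space,
    each particle pulled towards its own best position and the swarm's best position."""
    rng = np.random.default_rng(seed)
    pos = rng.uniform(-1.0, 1.0, size=(n_particles, dim))   # candidate weight vectors
    vel = np.zeros_like(pos)
    personal_best = pos.copy()
    personal_err = np.array([error_fn(p) for p in pos])
    global_best = personal_best[np.argmin(personal_err)].copy()
    for _ in range(iters):
        r1 = rng.random(size=pos.shape)
        r2 = rng.random(size=pos.shape)
        vel = (inertia * vel
               + c_personal * r1 * (personal_best - pos)    # pull towards own best position
               + c_social * r2 * (global_best - pos))       # pull towards swarm's best position
        pos = pos + vel
        err = np.array([error_fn(p) for p in pos])
        improved = err < personal_err
        personal_best[improved] = pos[improved]
        personal_err[improved] = err[improved]
        global_best = personal_best[np.argmin(personal_err)].copy()
    return global_best

# toy usage: minimise the SSE of a linear model, same data as the GD sketches
X = np.column_stack([np.ones(20), np.linspace(0, 1, 20)])
t = 3.0 + 2.0 * X[:, 1]
sse = lambda w: np.sum((t - X @ w) ** 2)
print(pso_minimise(sse, dim=2))
```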
MLP applications
NETTalk - producing phonemes from text
Protein secondary structure (SS) prediction - label each amino acid as helix/coil/strand - 64% accuracy
Handwritten digit recognition (LeCun 1990)