Gradient Descent Flashcards
What is GD?
- A guided search through weight space from a random starting point
- Treats model training as an optimization problem
What features of the error surface help us with GD?
- Convex (bowl-shaped)
- A single global minimum
Steps in GD
1. Start by selecting a random point in weight space (i.e. each weight is assigned a random value within a sensible range).
2. Calculate the SSE/error associated with this point, based on the predictions made for each instance in the training set using the randomly selected weights. This defines a point on the error surface.
3. We know very little else about the relative position of this point on the error surface; the available information is localized.
4. It is possible to determine the slope of the error surface by taking the derivative of the function used to generate it.
5. The randomly selected weights are adjusted slightly, following the slope (gradient) of the error surface downhill, to move to a new position on the error surface closer to the global minimum.
6. This adjustment of the weights (each weight is considered independently) is repeated over and over until the global minimum is reached.
7. Eventually the algorithm converges to a point on the error surface where any subsequent change to the weights does not lead to a noticeably better model (within some tolerance). The global minimum corresponds to the most accurate predictive model.
GD pseudo-code
Require: D
Require: learning rate
Require: errorDelta function
Require: convergence criterion
1. w = random point in weight space
2. repeat
3.   for each w[j] in w do
4.     w[j] = w[j] + lr * errorDelta(D, w[j])
5. until convergence
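A minimal Python sketch of this loop, assuming a simple linear model (prediction = X @ w) and a sum-of-squared-errors surface; the function names, toy dataset, and learning-rate/tolerance values are illustrative assumptions, not taken from the source. This is the batch variant: errorDelta sums over every training instance.

```python
import numpy as np

def error_delta(X, t, w, j):
    """Batch gradient term for weight j: sum over all training instances of
    (target - prediction) * feature_j. Moving w[j] in this direction reduces the SSE."""
    preds = X @ w                                  # predictions of the candidate model
    return np.sum((t - preds) * X[:, j])

def gradient_descent(X, t, lr=0.01, tol=1e-6, max_iters=10_000):
    """Batch gradient descent following the pseudo-code above: start at a random
    point in weight space and repeatedly nudge each weight downhill."""
    rng = np.random.default_rng(0)
    w = rng.uniform(-0.2, 0.2, size=X.shape[1])    # random starting point in weight space
    prev_sse = np.inf
    for _ in range(max_iters):
        for j in range(len(w)):                    # each weight adjusted in turn, as in the pseudo-code
            w[j] = w[j] + lr * error_delta(X, t, w, j)
        sse = np.sum((t - X @ w) ** 2)
        if abs(prev_sse - sse) < tol:              # convergence criterion
            break
        prev_sse = sse
    return w

# toy usage: recover the weights of a noiseless linear relationship t = 3 + 2x
X = np.column_stack([np.ones(20), np.linspace(0, 1, 20)])  # bias column + one feature
t = 3.0 + 2.0 * X[:, 1]
print(gradient_descent(X, t))
```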
Batch GD
- Only one adjustment is made to each weight per iteration of the algorithm, based on the summed squared error made by the candidate model across every instance in the training dataset
- straightforward
- accurate
- reasonably efficient
- widely used
Stochastic GD
An adjustment to each weight is made based on the error in the prediction made by the candidate model for each individual training instance.
- This means many more weight adjustments are made per pass through the training data
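A sketch of the stochastic variant under the same assumptions as the batch sketch above (toy linear model, SSE); the weights are now adjusted after every single training instance rather than once per pass.

```python
import numpy as np

def stochastic_gradient_descent(X, t, lr=0.01, epochs=100):
    """Stochastic GD: one weight adjustment per training instance,
    so a single pass over the data makes len(X) updates to the weights."""
    rng = np.random.default_rng(0)
    w = rng.uniform(-0.2, 0.2, size=X.shape[1])    # random starting point in weight space
    for _ in range(epochs):
        for i in rng.permutation(len(X)):          # visit instances in random order
            error = t[i] - X[i] @ w                # error on this single instance only
            w = w + lr * error * X[i]              # adjust every weight using this instance's error
    return w

# usage, reusing the toy X and t from the batch sketch above
# print(stochastic_gradient_descent(X, t))
```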
Learning rate
Determines size of adjustments made to each weight at each step in the process
Small lr: the algorithm will eventually converge to the global minimum, but it will take a long time
Large lr: causes jumps from one side of the error surface to the other, with a strong chance that the global minimum will be missed; it may even cause the error to increase rather than decrease, leading to a process that never converges
Initial random weights guidelines
[-0.2, 0.2]
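One way to draw initial weights in the suggested range; the NumPy call and the choice of 5 weights are illustrative assumptions.

```python
import numpy as np

# each initial weight drawn uniformly from [-0.2, 0.2]; size = number of weights in the model
w = np.random.default_rng().uniform(-0.2, 0.2, size=5)
```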
Learning rate decay
Allows the learning rate to start at a large value and then decay over time according to a predefined schedule
lr[t] = lr[0] * (c / (c + t))
where t = current iteration of the GD algorithm
c = constant controlling how quickly the learning rate decays (often set to a fairly large value, e.g. 100)
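A small sketch of this decay schedule; the initial rate lr0 = 0.5 and c = 100 are illustrative values.

```python
def decayed_lr(lr0, c, t):
    """Learning rate at iteration t: starts at lr0 and decays by the factor c / (c + t)."""
    return lr0 * c / (c + t)

# example: lr0 = 0.5, c = 100
for t in (0, 10, 100, 1000):
    print(t, round(decayed_lr(0.5, 100, t), 4))
# 0 -> 0.5, 10 -> 0.4545, 100 -> 0.25, 1000 -> 0.0455
```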
Alternative to GD
An interesting alternative to gradient descent is population-based training algorithms, such as evolutionary algorithms (EAs) and particle swarm optimisation (PSO).
The basic idea behind population-based approaches is that a population of candidate solutions (NN weight vectors) is created; the candidate solutions iteratively explore the search space, exchanging information, and eventually converge on a minimum.
Because many starting points (candidate solutions) are used, the chances of converging on the global minimum are significantly increased. PSO and EAs have been shown to perform very competitively, often (albeit not always) outperforming gradient descent on complex NN training problems.
Cons: prone to overfitting, because they aggressively optimise the training criterion.
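A minimal particle swarm sketch for minimising an error function over a weight vector, to make the "population of candidate solutions" idea concrete. The swarm size, inertia and attraction constants, and the toy SSE objective are illustrative assumptions, not values from the source.

```python
import numpy as np

def pso_minimise(error_fn, dim, n_particles=30, iters=200,
                 inertia=0.7, c_personal=1.5, c_social=1.5, seed=0):
    """Basic PSO: a population of candidate weight vectors explores weight space,
    each particle pulled towards its own best position and the swarm's best position."""
    rng = np.random.default_rng(seed)
    pos = rng.uniform(-1.0, 1.0, size=(n_particles, dim))   # candidate weight vectors
    vel = np.zeros_like(pos)
    personal_best = pos.copy()
    personal_err = np.array([error_fn(p) for p in pos])
    global_best = personal_best[np.argmin(personal_err)].copy()
    for _ in range(iters):
        r1 = rng.random(size=pos.shape)
        r2 = rng.random(size=pos.shape)
        vel = (inertia * vel
               + c_personal * r1 * (personal_best - pos)    # pull towards own best position
               + c_social * r2 * (global_best - pos))       # pull towards swarm's best position
        pos = pos + vel
        err = np.array([error_fn(p) for p in pos])
        improved = err < personal_err
        personal_best[improved] = pos[improved]
        personal_err[improved] = err[improved]
        global_best = personal_best[np.argmin(personal_err)].copy()
    return global_best

# toy usage: minimise the SSE of a linear model, same data as the GD sketches
X = np.column_stack([np.ones(20), np.linspace(0, 1, 20)])
t = 3.0 + 2.0 * X[:, 1]
sse = lambda w: np.sum((t - X @ w) ** 2)
print(pso_minimise(sse, dim=2))
```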
MLP applications
NETTalk - producing phonemes from text
Protein secondary structure (SS) prediction - label each amino acid as helix/coil/strand - 64% accuracy
Handwritten digit recognition (LeCun 1990)