Gradient Descent Flashcards

1
Q

What is GD?

A

A guided search through weight space from a random starting point.

An approach to solving an optimization problem (finding the weights that minimize the error).

2
Q

What features of the error surface help us with GD?

A
1. Convex (bowl-shaped); illustrated in the sketch below
2. A single global minimum
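
A quick way to see these properties is to evaluate the sum of squared errors (SSE) of a simple linear model over a grid of candidate weights. This is a minimal sketch in Python with made-up data; none of the names or values come from the card.

import numpy as np

# Toy data generated from y = 2x (made up for illustration)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x

# Evaluate the SSE for a range of candidate weights w in the model y_hat = w * x
for w in np.linspace(-1.0, 5.0, 13):
    sse = np.sum((y - w * x) ** 2)
    print(f"w = {w:5.2f}  SSE = {sse:8.2f}")

# The SSE falls towards w = 2.0 and rises on either side:
# a convex bowl with a single global minimum.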

3
Q

Steps in GD

A

1. Start by selecting a random point within weight space (i.e. each weight is assigned a random value within a sensible range).
2. Calculate the SSE/error associated with this point, based on the predictions made for each instance in the training set using the randomly selected weights. This defines a point on the error surface.
3. We know very little else about the relative position of this point on the error surface: the information available is localized.
4. It is, however, possible to determine the slope of the error surface by taking the derivative of the function used to generate it.
5. The randomly selected weights are adjusted slightly so as to move downhill along the error slope (i.e. in the direction opposite to the gradient), giving a new position on the error surface closer to the global minimum (see the single-step sketch below).
6. This adjustment of the weights (each weight considered independently) is repeated over and over until the global minimum is reached.
7. Eventually the algorithm converges to a point on the error surface where any subsequent change to the weights does not lead to a noticeably better model (within some tolerance). The global minimum corresponds to the most accurate predictive model.
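
As a concrete illustration of steps 2-5, here is a minimal single-step sketch for a one-weight linear model in Python; the data, seed and learning rate are made up for illustration and are not part of the card.

import numpy as np

# Toy training set (made up): targets follow y = 3x
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 3.0 * x

# Step 1: random starting weight within a sensible range
rng = np.random.default_rng(0)
w = rng.uniform(-0.2, 0.2)

# Step 2: SSE at this point in weight space
predictions = w * x
sse = np.sum((y - predictions) ** 2)

# Steps 3-4: slope of the error surface with respect to w
# d(SSE)/dw = -2 * sum((y - w*x) * x)
gradient = -2.0 * np.sum((y - predictions) * x)

# Step 5: move slightly downhill (opposite to the gradient)
learning_rate = 0.01
w = w - learning_rate * gradient

print(f"SSE before the update: {sse:.3f}, weight after one update: {w:.3f}")
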
4
Q

GD pseudo-code

A

Require: D (the training set)
Require: learning rate (lr)
Require: errorDelta function
Require: convergence criterion

  1. w = random weights in weight space
  2. repeat
  3.   for each w[j] in w do
  4.     w[j] = w[j] + lr * errorDelta(D, w[j])
  5. until convergence
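
A minimal runnable version of this pseudocode, assuming a linear model trained on the sum of squared errors; here errorDelta(D, w[j]) returns the summed (error x feature) term, so adding lr * errorDelta moves the weights downhill on the error surface. The dataset, tolerance and iteration cap are illustrative assumptions, not part of the card.

import numpy as np

def error_delta(features, targets, weights, j):
    """Summed error * feature value for weight j over the whole training set D."""
    errors = targets - features @ weights
    return np.sum(errors * features[:, j])

def gradient_descent(features, targets, lr=0.01, tolerance=1e-6, max_iters=10_000):
    rng = np.random.default_rng(0)
    weights = rng.uniform(-0.2, 0.2, size=features.shape[1])  # 1. random weights
    prev_sse = np.inf
    for _ in range(max_iters):                                # 2. repeat
        for j in range(len(weights)):                         # 3. for each w[j] in w
            weights[j] += lr * error_delta(features, targets, weights, j)  # 4.
        sse = np.sum((targets - features @ weights) ** 2)
        if abs(prev_sse - sse) < tolerance:                   # 5. until convergence
            return weights
        prev_sse = sse
    return weights

# Usage on a made-up dataset with a bias column and one descriptive feature
X = np.column_stack([np.ones(5), np.array([1.0, 2.0, 3.0, 4.0, 5.0])])
y = 1.0 + 2.0 * X[:, 1]
print(gradient_descent(X, y))  # approaches [1.0, 2.0]
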
5
Q

Batch GD

A
  • Only one adjustment is made to each weight at each iteration of the algorithm, based on summing the squared errors made by the candidate model over every instance in the training dataset, i.e. the error is computed over the whole training set (see the single-update sketch below)
  • straightforward
  • accurate
  • reasonably efficient
  • widely used
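
The defining feature is one summed update per pass over the data. Here is a minimal sketch of a single batch update for a one-weight linear model, using the same made-up toy setup as the earlier sketches.

import numpy as np

# Made-up training data: y = 2x
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x

w, lr = 0.0, 0.01
errors = y - w * x              # prediction errors for every training instance
w += lr * np.sum(errors * x)    # a single summed adjustment to w per iteration
print(w)                        # 0.6 after the first batch update
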
6
Q

Stochastic GD

A

An adjustment to each weight is made based on the error in the prediction made by the candidate model for each individual training instance.

  • This means many more weight adjustments are made per pass over the data (see the sketch below)
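
A minimal sketch of the stochastic variant, reusing the toy one-weight linear model from the earlier sketches; the weight is nudged immediately after each individual instance rather than once per pass.

import numpy as np

# Made-up training data: y = 2x
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x

rng = np.random.default_rng(0)
w = rng.uniform(-0.2, 0.2)
lr = 0.01

for epoch in range(100):
    for i in rng.permutation(len(x)):   # visit the instances in random order
        error = y[i] - w * x[i]         # error for this single instance
        w += lr * error * x[i]          # immediate adjustment to the weight

print(f"learned weight: {w:.3f}")       # close to 2.0
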
7
Q

Learning rate

A

Determines the size of the adjustments made to each weight at each step in the process.

Small lr: will eventually converge to the global minimum, but will take a long time.

Large lr: causes jumps from one side of the error surface to the other; there is a strong chance the global minimum will be missed, and the error may increase rather than decrease, so the process may never converge (see the sketch below).
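
A minimal sketch (made-up one-weight problem) showing how the learning rate changes the behaviour described above; the rates 0.001, 0.01 and 0.1 are illustrative choices, not values from the card.

import numpy as np

# Toy problem: minimise the SSE of y_hat = w * x on made-up data (true w = 2)
x = np.array([1.0, 2.0, 3.0])
y = 2.0 * x

def run_gd(lr, steps=20):
    w = 0.0
    for _ in range(steps):
        gradient = -2.0 * np.sum((y - w * x) * x)
        w -= lr * gradient
    return w, np.sum((y - w * x) ** 2)

for lr in (0.001, 0.01, 0.1):
    w, sse = run_gd(lr)
    print(f"lr = {lr}: w = {w:.3f}, SSE = {sse:.3g}")

# lr = 0.001: creeps towards w = 2 but is far from converged after 20 steps
# lr = 0.01:  converges close to w = 2
# lr = 0.1:   overshoots from side to side and the SSE grows instead of shrinking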

8
Q

Initial random weights guidelines

A

Initialize each weight to a random value in a small range around zero, e.g. [-0.2, 0.2] (see the sketch below).
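
A minimal sketch of this guideline using NumPy; the layer sizes are arbitrary placeholders.

import numpy as np

rng = np.random.default_rng()
n_inputs, n_units = 4, 3                                   # arbitrary example sizes
weights = rng.uniform(-0.2, 0.2, size=(n_inputs, n_units))
print(weights)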

9
Q

Learning rate decay

A

Allows the learning rate to start at a large value and then decay over time according to a predefined schedule:

lr[t] = lr[0] * (c / (c + t))

where t = current iteration of the GD algorithm
c = constant controlling how quickly the learning rate decays (often set to a fairly large value, e.g. 100)
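
A minimal sketch of this schedule; the initial rate of 0.1 and c = 100 are example values only.

def decayed_learning_rate(initial_lr, c, t):
    """Learning rate at iteration t: lr[t] = lr[0] * c / (c + t)."""
    return initial_lr * c / (c + t)

for t in (0, 10, 100, 1000):
    print(t, round(decayed_learning_rate(0.1, 100, t), 4))

# 0 -> 0.1, 10 -> 0.0909, 100 -> 0.05, 1000 -> 0.0091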

10
Q

Alternative to GD

A

An interesting alternative to gradient descent is population-based training algorithms such as evolutionary algorithms (EA) and particle swarm optimisation (PSO).

The basic idea behind population-based approaches is that a population of candidate solutions (NN weight vectors) is created, and the candidate solutions iteratively explore the search space, exchanging information and eventually converging on a minimum.

Because many starting points (candidate solutions) are used, the chances of converging on the global minimum are significantly increased. PSO and EA have been shown to perform very competitively, often (albeit not always) outperforming gradient descent on complex NN training problems.

Cons: prone to overfitting, because they aggressively optimize the training criterion (see the PSO sketch below).
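
To make the "population of candidate solutions exchanging information" idea concrete, here is a minimal particle swarm optimisation sketch that searches for the weight of a one-weight linear model; the toy data, swarm size and PSO constants (inertia, c1, c2) are illustrative assumptions.

import numpy as np

# Made-up training data: y = 2x; the error to minimise is the SSE
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x

def sse(w):
    return np.sum((y - w * x) ** 2)

rng = np.random.default_rng(0)
n_particles = 10
positions = rng.uniform(-1.0, 1.0, n_particles)    # candidate weights (here 1-D)
velocities = np.zeros(n_particles)
personal_best = positions.copy()
personal_best_sse = np.array([sse(w) for w in positions])
global_best = personal_best[np.argmin(personal_best_sse)]

inertia, c1, c2 = 0.7, 1.5, 1.5
for _ in range(100):
    r1, r2 = rng.random(n_particles), rng.random(n_particles)
    # particles exchange information through their personal and global bests
    velocities = (inertia * velocities
                  + c1 * r1 * (personal_best - positions)
                  + c2 * r2 * (global_best - positions))
    positions = positions + velocities
    for i, w in enumerate(positions):
        if sse(w) < personal_best_sse[i]:
            personal_best[i] = w
            personal_best_sse[i] = sse(w)
    global_best = personal_best[np.argmin(personal_best_sse)]

print(f"best weight found: {global_best:.3f}")  # close to 2.0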

11
Q

MLP applications

A

NETtalk - producing phonemes from text

Protein secondary structure prediction - label each amino acid as helix/coil/strand - 64% accuracy

Handwritten digit recognition (LeCun, 1990)
