DL Fundamentals Flashcards
Representational learning
Engineering representations is hard - requires technical and domain expertise
- Representation learning is a set of methods that allows a machine to be fed with raw data and to automatically discover the representations needed for detection or classification (LeCun et al., 2015)
List some activation functions
- Sigmoid: g(z) = 1 / (1 + e^(-z))
- Tanh: g(z) = (e^z - e^(-z)) / (e^z + e^(-z))
- ReLU: g(z) = max(0, z)
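As a sketch, the sigmoid above plus two other common choices in numpy (function names are my own):

```python
import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + e^(-z)); squashes any input into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # squashes into (-1, 1); zero-centred, unlike sigmoid
    return np.tanh(z)

def relu(z):
    # rectified linear unit: max(0, z)
    return np.maximum(0.0, z)
```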
Loss functions
- Squared error
- Log loss
Squared error loss function
1/2 * (M(d) - t)^2
Log loss function
-(t * log(M(d)) + (1-t) * log(1-M(d)))
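Both losses as a minimal numpy sketch (`y_hat` stands for the model output M(d)):

```python
import numpy as np

def squared_error(y_hat, t):
    # 1/2 * (M(d) - t)^2
    return 0.5 * (y_hat - t) ** 2

def log_loss(y_hat, t):
    # -(t*log(M(d)) + (1-t)*log(1-M(d))); assumes 0 < y_hat < 1
    return -(t * np.log(y_hat) + (1 - t) * np.log(1 - y_hat))
```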
Difference between loss and cost function
- Loss function = measure of the prediction error on a single training instance
- Cost function = measure of the average prediction error across a set of training instances
- Cost functions allow us to add in regularization
Gradient Descent Algorithm
- Choose random weights
- Until convergence
- Set all gradients to zero
- For each training instance
- Calculate the model output
- Calculate loss
- Update gradient sum for each weight and bias
- Update weights and bias using the weight update rule
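The loop above, sketched for a single sigmoid neuron with log loss on a toy AND dataset (the data and hyperparameters are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# toy dataset: 2 descriptive features, binary target (logical AND)
D = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
t = np.array([0.0, 0.0, 0.0, 1.0])

rng = np.random.default_rng(0)
w = rng.normal(scale=0.1, size=2)    # choose random weights
b = 0.0
alpha = 0.5                          # learning rate (assumed)

for epoch in range(2000):            # "until convergence" (fixed budget here)
    dw = np.zeros_like(w)            # set all gradient sums to zero
    db = 0.0
    for d_j, t_j in zip(D, t):       # for each training instance
        y = sigmoid(w @ d_j + b)     # calculate the model output
        # for sigmoid + log loss, the gradient w.r.t. z is (y - t)
        dw += (y - t_j) * d_j        # update gradient sum for each weight
        db += y - t_j
    w -= alpha * dw / len(D)         # weight update rule (averaged)
    b -= alpha * db / len(D)
```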
Backpropagation Algorithm
- Werbos (1974)
- Not widely used until 1986 (Rumelhart, Hinton & Williams)
1. Initialize the weights to small random values (scaled by fan-in)
2. Feedforward phase - feed input data through the network from the inputs to the outputs
3. Update the training error for the network (based on target values for all output nodes)
4. Error propagation phase - feed error values back through the network, adjusting the weights along the way
5. Repeat from 2 until the error values are sufficiently small or some other stopping condition is met
Forward pass algo
Require: L, network depth
Require: W[i], i is an element of {1…L}, weight matrices for each layer
Require: b[i], i is an element of {1…L}, bias terms for each layer
Require: d, input descriptive features
Require: t, target features
a[0] = d
for i = 1 to L:
    z[i] = W[i]*a[i-1] + b[i]
    a[i] = g[i](z[i])
M(d) = a[L]
Calculate L(M(d), t)
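A numpy sketch of this forward pass (sigmoid is assumed as g for every layer; the function also returns the cached activations that the backward pass needs):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_pass(W, b, d):
    """W, b: lists of weight matrices and bias vectors for layers 1..L.
    Returns activations a[0..L] and pre-activations z[1..L]."""
    a = [d]                          # a[0] = d
    zs = []
    for l in range(len(W)):          # for i = 1 to L (0-based here)
        z = W[l] @ a[-1] + b[l]      # z[i] = W[i]*a[i-1] + b[i]
        zs.append(z)
        a.append(sigmoid(z))         # a[i] = g[i](z[i]); sigmoid assumed
    return a, zs                     # M(d) = a[-1]
```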
Backward Propagation algo
Require: A forward pass of network
Require: L, network depth
Require: W[i], i is an element of {1…L}, weight matrices for each layer
Require: b[i], i is an element of {1…L}, bias terms for each layer
Require: t, target features
Calculate da[L]  # derivative of the loss function
for i = L to 1:
    dz[i] = da[i] * g[i]'(z[i])
    dW[i] = dz[i] * a[i-1]^T
    db[i] = dz[i]
    da[i-1] = W[i]^T * dz[i]
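A numpy sketch of the backward pass (it assumes sigmoid activations throughout and log loss, for which da[L]*g'(z[L]) simplifies to dz[L] = a[L] - t; variable names are mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backward_pass(W, a, zs, t):
    """a = [a[0], ..., a[L]] and zs = [z[1], ..., z[L]] come from the
    forward pass; returns lists of gradients dW, db per layer."""
    L = len(W)
    dW = [None] * L
    db = [None] * L
    dz = a[-1] - t                   # dz[L] = a[L] - t (sigmoid + log loss)
    for l in reversed(range(L)):     # for i = L to 1 (0-based here)
        dW[l] = np.outer(dz, a[l])   # dW[i] = dz[i] * a[i-1]^T
        db[l] = dz                   # db[i] = dz[i]
        if l > 0:
            da = W[l].T @ dz         # da[i-1] = W[i]^T * dz[i]
            s = sigmoid(zs[l - 1])
            dz = da * s * (1 - s)    # dz[i-1] = da[i-1] * g'(z[i-1])
    return dW, db
```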
Stochastic/Online GD
- Choose random weights
- Until convergence
- Shuffle all training instances
- For each training instance:
- Perform f/w pass
- Perform b/w pass
- Update weights and biases using update rule
Stochastic GD Update Rule
W[i] = W[i] - α*dW[i]
b[i] = b[i] - α*db[i]
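The rule as one line of numpy per parameter (toy values; in SGD each instance's gradients are applied immediately):

```python
import numpy as np

alpha = 0.1                              # learning rate (assumed)
W = [np.array([[0.5, -0.3]])]            # one layer's weights
b = [np.array([0.2])]
dW = [np.array([[0.1, 0.2]])]            # gradients from one instance's b/w pass
db = [np.array([-0.4])]

for i in range(len(W)):
    W[i] = W[i] - alpha * dW[i]          # W[i] = W[i] - α*dW[i]
    b[i] = b[i] - alpha * db[i]          # b[i] = b[i] - α*db[i]
```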
GD Batch
- Choose random weights
- Until convergence
- Set all gradient sums to 0
- For each training instance:
- Perform f/w pass
- Perform b/w pass
- Update gradient sum for each weight and bias term
- Update weights and biases using update rule
Batch GD Update rule
W[i] = W[i] - α(1/m)Σ(from j=1 to m) dW[i]^(j)
b[i] = b[i] - α(1/m)Σ(from j=1 to m) db[i]^(j)
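The same update with gradients averaged over a batch of m instances (toy numbers):

```python
import numpy as np

alpha = 0.1                              # learning rate (assumed)
m = 3                                    # number of instances in the batch
W = np.array([[0.5, -0.3]])
dW_sum = np.zeros_like(W)                # Σ dW^(j) over the batch
for dW_j in [np.array([[0.3, 0.0]]),
             np.array([[0.0, 0.3]]),
             np.array([[0.3, 0.3]])]:
    dW_sum += dW_j
W = W - alpha * (1 / m) * dW_sum         # W = W - α(1/m)Σ dW^(j)
```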
GD Mini-batch
- Choose random weights
- Until convergence
- Divide the training set into mini-batches of size s
- For each mini-batch D(mb)
- Set all gradient sums = 0
- For each training instance in D(mb)
- Perform f/w pass
- Perform b/w pass
- Update gradient sum for each weight and bias term
- Update weight and bias term using update rule
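The whole mini-batch loop, sketched for a single sigmoid neuron (toy data and made-up hyperparameters; gradients are averaged within each mini-batch):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
D = rng.normal(size=(10, 2))                 # 10 training instances, 2 features
t = (D.sum(axis=1) > 0).astype(float)        # toy binary targets
w = rng.normal(scale=0.1, size=2)            # choose random weights
b, alpha, s = 0.0, 0.1, 4                    # mini-batch size s (assumed)

for epoch in range(100):                     # "until convergence" (fixed budget)
    # divide the training set into mini-batches of size s
    for start in range(0, len(D), s):
        D_mb, t_mb = D[start:start + s], t[start:start + s]
        dw, db = np.zeros_like(w), 0.0       # set all gradient sums = 0
        for d_j, t_j in zip(D_mb, t_mb):     # for each instance in D(mb)
            y = sigmoid(w @ d_j + b)         # f/w pass
            dw += (y - t_j) * d_j            # b/w pass: log-loss gradients
            db += y - t_j
        w -= alpha * dw / len(D_mb)          # update rule (batch average)
        b -= alpha * db / len(D_mb)
```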