Networks Flashcards
What are two alternatives to the ReLU activation function?
ELU (exponential linear unit) and leaky ReLU.
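A minimal NumPy sketch of both activations; the alpha values shown are common illustrative defaults, not mandated ones.

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # small negative slope instead of zero for x < 0
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):
    # smooth exponential curve for x < 0, saturating at -alpha
    return np.where(x > 0, x, alpha * (np.exp(x) - 1))
```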
What is the difference between SGD with momentum and SGD with Nesterov momentum?
Nesterov momentum adds the velocity to the parameters before computing the gradient (a look-ahead step), whereas classical momentum evaluates the gradient at the current parameters.
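A sketch of the two update rules; the grad function below is a hypothetical stand-in (gradient of f(w) = w²/2), and lr and mu are the learning rate and momentum coefficient.

```python
import numpy as np

def grad(w):                       # hypothetical gradient of f(w) = w^2 / 2
    return w

w, v, lr, mu = np.array([1.0]), np.array([0.0]), 0.1, 0.9

# classical momentum: gradient evaluated at the current parameters
v = mu * v - lr * grad(w)
w = w + v

# Nesterov momentum: gradient evaluated at the look-ahead point w + mu * v
v = mu * v - lr * grad(w + mu * v)
w = w + v
```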
What is the idea behind Adagrad?
Adagrad adapts the learning rate per dimension: each gradient dimension is divided by the square root of its accumulated sum of squared gradients, so dimensions with large squared gradients get smaller effective learning rates.
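A minimal sketch of the Adagrad update, again assuming a hypothetical grad function and illustrative hyperparameters.

```python
import numpy as np

def grad(w):                       # hypothetical gradient of f(w) = w^2 / 2
    return w

w, cache, lr, eps = np.array([1.0]), np.array([0.0]), 0.1, 1e-8
for _ in range(10):
    g = grad(w)
    cache += g ** 2                        # accumulate squared gradients
    w -= lr * g / (np.sqrt(cache) + eps)   # large accumulated squares -> smaller steps
```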
What is the idea behind RMSProp?
Like Adagrad, but each dimension's learning rate is divided by the square root of an exponentially decaying running average of squared gradients, so the effective learning rate does not shrink monotonically.
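The same sketch with the accumulator replaced by a running average; the decay factor 0.9 is an illustrative choice.

```python
import numpy as np

def grad(w):                       # hypothetical gradient of f(w) = w^2 / 2
    return w

w, cache, lr, decay, eps = np.array([1.0]), np.array([0.0]), 0.01, 0.9, 1e-8
for _ in range(10):
    g = grad(w)
    cache = decay * cache + (1 - decay) * g ** 2   # running mean of squared gradients
    w -= lr * g / (np.sqrt(cache) + eps)
```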
What is the idea behind the Adam optimizer?
It combines SGD with momentum (a running mean of gradients) and RMSProp (a running mean of squared gradients).
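A compact sketch of the Adam update with the standard bias-correction terms; the hyperparameter values are the commonly cited defaults and the grad function is again a hypothetical stand-in.

```python
import numpy as np

def grad(w):                       # hypothetical gradient of f(w) = w^2 / 2
    return w

w = np.array([1.0])
m, v = np.zeros_like(w), np.zeros_like(w)
lr, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8
for t in range(1, 11):
    g = grad(w)
    m = beta1 * m + (1 - beta1) * g            # momentum-like first moment
    v = beta2 * v + (1 - beta2) * g ** 2       # RMSProp-like second moment
    m_hat = m / (1 - beta1 ** t)               # bias correction
    v_hat = v / (1 - beta2 ** t)
    w -= lr * m_hat / (np.sqrt(v_hat) + eps)
```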
Name some regularizers
L1, L2, early stopping, dropout, max-norm constraints, data augmentation.
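Two of these in a minimal NumPy sketch; the 0.01 weight-decay factor and the 0.8 keep probability are illustrative values only.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))                  # weights
h = rng.normal(size=(5, 4))                  # activations

l2_penalty = 0.01 * np.sum(W ** 2)           # L2: added to the training loss

keep_prob = 0.8                              # inverted dropout at train time
mask = (rng.random(h.shape) < keep_prob) / keep_prob
h_dropped = h * mask
```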
Name two commonly used initialization methods for weights in a neural network
w ~ N(0, 1/sqrt(N)) (i.e., standard deviation 1/sqrt(N)), or Xavier: w ~ U(-1/sqrt(N), 1/sqrt(N)), where N is the number of inputs to the layer.
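A sketch of both initializations for a single weight matrix; the layer sizes are arbitrary examples.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 256, 128

# scaled normal: standard deviation 1/sqrt(n_in)
W1 = rng.normal(0.0, 1.0 / np.sqrt(n_in), size=(n_in, n_out))

# Xavier-style uniform on [-1/sqrt(n_in), 1/sqrt(n_in)]
W2 = rng.uniform(-1.0 / np.sqrt(n_in), 1.0 / np.sqrt(n_in), size=(n_in, n_out))
```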
What is the formula for batch normalization?
z_hat = (z - mu) / sigma, z_new = gamma * z_hat + beta, where mu and sigma are the batch mean and standard deviation, and gamma and beta are learned parameters.
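A minimal training-time sketch of the formula above (inference uses running statistics instead of batch statistics); the eps term is a small constant for numerical stability.

```python
import numpy as np

def batch_norm(z, gamma, beta, eps=1e-5):
    mu = z.mean(axis=0)                     # per-feature batch mean
    var = z.var(axis=0)                     # per-feature batch variance
    z_hat = (z - mu) / np.sqrt(var + eps)   # normalize
    return gamma * z_hat + beta             # scale and shift with learned parameters

z = np.random.randn(32, 8)                  # batch of 32, 8 features
out = batch_norm(z, gamma=np.ones(8), beta=np.zeros(8))
```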
What is the output size after a convolutional layer?
out = floor((in + 2*pad - filter_size) / stride) + 1
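The formula as a one-line helper, checked on a common case (a 32x32 input with a 5x5 filter and "same"-style padding of 2 keeps the spatial size).

```python
def conv_output_size(in_size, filter_size, pad=0, stride=1):
    return (in_size + 2 * pad - filter_size) // stride + 1

assert conv_output_size(32, 5, pad=2, stride=1) == 32
```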
Why are deep networks harder to train and how can we solve this?
- Vanishing gradient:
  - ReLU
  - Good initialization
  - Auxiliary classifiers
- Covariate shift:
  - Batch normalization
What is the idea behind the Inception network?
Each module applies several filter sizes in parallel and the network “learns” which filter size works best.
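A schematic sketch of an Inception-style module: parallel branches with different filter sizes whose outputs are concatenated along the channel axis. The branch function here is only a placeholder for a real convolution (it ignores filter_size and returns same-sized zero maps), so the shapes illustrate the idea rather than implement it.

```python
import numpy as np

def branch(x, filter_size):
    # placeholder for a conv with the given filter size and 'same' padding;
    # all branches keep the spatial size so their outputs can be concatenated
    return np.zeros((x.shape[0], 8, x.shape[2], x.shape[3]))

def inception_module(x):
    branches = [branch(x, k) for k in (1, 3, 5)]   # parallel filter sizes
    return np.concatenate(branches, axis=1)        # stack along the channel axis

x = np.zeros((2, 3, 32, 32))
print(inception_module(x).shape)                   # (2, 24, 32, 32)
```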