Deep Learning Flashcards
What does ReLU stand for, and what does it mean?
Rectified Linear Unit
The rectified linear activation function, or ReLU for short, is a piecewise linear function that outputs the input directly if it is positive and zero otherwise.
What type of neural network architecture is used for, e.g., house price prediction or advertisement click probability?
Standard neural network architecture
What type of neural network architecture is used for image recognition?
CNN (convolutional neural network)
For sequence data, e.g., audio over time, what type of neural network architecture do we use?
Recurrent Neural Network
What is the vanishing gradient problem?
Parameter optimisation uses gradient descent to find the best parameters. The vanishing gradient problem occurs when the gradients become exponentially small as they are propagated back through the layers, so the updates to the parameters we are trying to learn become insignificant. The implication is that the model either never converges to the optimum or takes much longer to train.
Explain gradient descent
Gradient descent is a method of updating the coefficients a0 and a1 (the intercept and slope of the regression line) to minimize the cost function (MSE). A regression model starts from randomly chosen coefficient values and then iteratively updates them to reduce the cost function until it reaches the minimum.
1) Start with random coefficients
2) Calculate the predicted values
3) Calculate the partial derivatives of the cost w.r.t. a0 and a1, substituting in the predicted values
4) Multiply each derivative by the learning rate and subtract it from the corresponding coefficient
5) Stop after a set number of iterations (e.g., 100) or once the error is low (see the sketch below)
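A minimal numpy sketch of those steps for simple linear regression (the toy data, learning rate, and iteration count are illustrative, not prescriptive):

```python
import numpy as np

# Toy data: y is roughly 2*x + 1 plus noise (purely illustrative)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2 * x + 1 + rng.normal(0, 0.5, size=100)

a0, a1 = rng.normal(size=2)        # 1) start with random coefficients
lr = 0.01                          # learning rate

for i in range(5000):              # 5) stop after a fixed number of iterations
    y_hat = a0 + a1 * x            # 2) predicted values
    error = y_hat - y
    da0 = error.mean()             # 3) dMSE/da0 (up to a constant factor)
    da1 = (error * x).mean()       #    dMSE/da1
    a0 -= lr * da0                 # 4) subtract learning rate * derivative
    a1 -= lr * da1

print(a0, a1)                      # should approach the true values 1 and 2
```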
What activation should you use for the output layer of a binary classification and why
Sigmoid, because you want to limit the output value to the range 0 to 1 so it can be interpreted as a probability.
Note: for the sigmoid function, when x = 0, y = 0.5.
Name other activation functions more suitable for hidden layers
Tanh: similar to the sigmoid function but limits the output to the range -1 to 1; when x = 0, y = 0
ReLU: the default choice when you don't know what to use, because training is faster than with tanh or sigmoid due to the lack of vanishing gradients; a = max(0, z). Tanh and sigmoid have vanishing gradient problems at their tails
Leaky ReLU: a = max(0.01z, z); when z is negative, instead of the slope being zero there is a small slope. The constant 0.01 can also be made a learnable parameter (see the sketch below)
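A quick numpy sketch of the activations above (the 0.01 leak factor follows the Leaky ReLU card):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))       # output in (0, 1); sigmoid(0) = 0.5

def tanh(z):
    return np.tanh(z)                     # output in (-1, 1); tanh(0) = 0

def relu(z):
    return np.maximum(0, z)               # a = max(0, z)

def leaky_relu(z, slope=0.01):
    return np.where(z > 0, z, slope * z)  # small slope for negative z

z = np.array([-2.0, 0.0, 3.0])
print(relu(z), leaky_relu(z))             # [0. 0. 3.]  [-0.02  0.    3.  ]
```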
What type of activation function should you use for a regression problem where the output is non-negative
ReLU, since its output is non-negative
Why is there a need for non-linear activation functions
For networks with one or more hidden layers, if linear activation functions were used in every layer, the composition of linear functions is itself linear, so the whole network would compute the same thing as a single linear layer, no matter how many hidden layers it has.
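A tiny numpy check of that claim: two stacked linear layers collapse into one linear map (the shapes here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 1))                                   # 4 input features
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=(3, 1))     # "hidden" linear layer
W2, b2 = rng.normal(size=(1, 3)), rng.normal(size=(1, 1))     # output linear layer

two_layers = W2 @ (W1 @ x + b1) + b2                          # linear activation everywhere
one_layer = (W2 @ W1) @ x + (W2 @ b1 + b2)                    # a single equivalent linear layer

print(np.allclose(two_layers, one_layer))                     # True
```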
What are the two phases in training a neural network?
During forward propagation, the input is fed into the neural network, and the network calculates the output. During backward propagation, the error between the predicted output and the actual output is calculated, and the weights and biases of each neuron are adjusted to reduce the error.
What are the common steps for pre-processing a new dataset
- Figure out the dimensions and shapes of the problem (m_train, m_test, num_px)
- Reshape the dataset so that each example is a vector of size (num_px * num_px * 3, 1)
- Standardise the data (see the sketch below)
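A sketch of those steps for an image dataset of shape (m, num_px, num_px, 3) with 0-255 pixel values (the shapes and the divide-by-255 standardisation are illustrative assumptions):

```python
import numpy as np

m, num_px = 209, 64
train_x_orig = np.random.randint(0, 256, size=(m, num_px, num_px, 3))  # stand-in image data

# Reshape: each example becomes a column vector of size (num_px * num_px * 3, 1)
train_x_flat = train_x_orig.reshape(m, -1).T      # shape (num_px * num_px * 3, m)

# Standardise: for pixel data, simply dividing by 255 is a common choice
train_x = train_x_flat / 255.0

print(train_x_orig.shape, train_x.shape)          # (209, 64, 64, 3) (12288, 209)
```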
In a neural network with 1 hidden layer with 3 nodes and 4 input features what is the shape of the weights matrix in layer 1
(3,4)
The number of rows of W is the number of neurons in that layer and the number of columns is the number of inputs to the layer
How to build a 2 layer neural network?
1) Initialise parameters:
Weight matrix 1 with shape (size of hidden layer, size of input layer) | small random values
Bias vector 1 with shape (size of hidden layer, 1) | all zeros
Weight matrix 2 with shape (size of output layer, size of hidden layer) | small random values
Bias vector 2 with shape (size of output layer, 1) | all zeros
2) Forward propagation:
Z = np.dot(W, A) + b
where A is the activation from the previous layer (or the input data)
b is the bias vector
W is the weight matrix
3) Calculate the activation for the layer by applying the activation function g to Z: A = g(Z)
4) Compute the cost function
- if it’s a regression problem, cost functions are MAE, MSE, RMSE
- If it is a classification problem, cost functions are cross entropy loss
5) Backward propagation
- Compute the derivative of the cost function with respect to AL (the output probability vector), dAL
- Use dAL to calculate the derivative of the cost function with respect to Z, dZ
- Use dZ to calculate the derivatives of the cost function with respect to W and b
6) Update the parameters
- The new W and b are obtained by subtracting the learning rate multiplied by the corresponding gradient computed in backward propagation
7) Repeat steps 2 to 6 for a set number of iterations (e.g., 1,000) or until the cost reaches a satisfactory level (a sketch of the whole loop follows)
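A compact sketch of steps 1-7 for a 2-layer binary classifier (the layer sizes, toy data, learning rate, and iteration count are illustrative; ReLU in the hidden layer and sigmoid at the output follow the earlier cards):

```python
import numpy as np

def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))
def relu(z):    return np.maximum(0, z)

rng = np.random.default_rng(1)
n_x, n_h, n_y, m = 4, 3, 1, 200                       # input, hidden, output sizes; m examples
X = rng.normal(size=(n_x, m))
Y = (X.sum(axis=0, keepdims=True) > 0).astype(float)  # toy labels

# 1) initialise parameters: small random weights, zero biases
W1 = rng.normal(size=(n_h, n_x)) * 0.01; b1 = np.zeros((n_h, 1))
W2 = rng.normal(size=(n_y, n_h)) * 0.01; b2 = np.zeros((n_y, 1))
lr = 0.5

for i in range(2000):
    # 2-3) forward propagation with activations
    Z1 = W1 @ X + b1;  A1 = relu(Z1)
    Z2 = W2 @ A1 + b2; A2 = sigmoid(Z2)

    # 4) cross-entropy cost (binary classification)
    cost = -np.mean(Y * np.log(A2 + 1e-8) + (1 - Y) * np.log(1 - A2 + 1e-8))

    # 5) backward propagation
    dZ2 = A2 - Y                                      # dCost/dZ2 for sigmoid + cross-entropy
    dW2 = (dZ2 @ A1.T) / m; db2 = dZ2.mean(axis=1, keepdims=True)
    dZ1 = (W2.T @ dZ2) * (Z1 > 0)                     # ReLU derivative
    dW1 = (dZ1 @ X.T) / m;  db1 = dZ1.mean(axis=1, keepdims=True)

    # 6) update parameters
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print(cost)  # should end well below the initial ~0.69
```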
How does l2 regularisation work in neural networks
If lambda, the regularization parameter, is large, the weights W end up relatively small, because large weights are penalized in the cost function. And if the weights W are small, then because z is a function of W (z = Wa + b), z will also be relatively small. If z stays in a small range around zero, then g(z) is roughly linear (for activations like tanh), so every layer behaves roughly like a linear layer, as if the network were just linear regression. And if every layer is linear, the whole network is a linear network: even a very deep network with linear activations can, at the end of the day, only compute a linear function. So it cannot fit those very complicated, highly non-linear decision boundaries that would allow it to overfit.
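A hedged sketch of how the L2 term enters the cost and the gradients (the function and variable names are illustrative, not a specific library API):

```python
import numpy as np

def l2_cost_term(weight_matrices, lambd, m):
    """Extra cost added to the unregularised cost: (lambd / (2*m)) * sum of squared weights."""
    return (lambd / (2 * m)) * sum(np.sum(np.square(W)) for W in weight_matrices)

def l2_regularised_grad(dW, W, lambd, m):
    """In backprop, each weight gradient picks up an extra (lambd / m) * W, i.e. weight decay."""
    return dW + (lambd / m) * W
```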
How does dropout regularisation work in NN
- Each layer has a probability of keeping each node (keep_prob)
- Use a lower keep_prob for layers with bigger weight matrices (e.g., large hidden layers) to increase dropout, and a higher keep_prob for layers with fewer nodes
- Intuition: the network can't rely on any one feature, so it has to spread out the weights
- With dropout, the cost function is no longer well defined and can look erratic; turn dropout off to check that the cost is decreasing (i.e., the model is working), then turn it back on (see the sketch below)
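A sketch of inverted dropout applied to one layer's activations (keep_prob is the "probability of keeping a node" above; dividing by keep_prob keeps the expected activation scale unchanged):

```python
import numpy as np

def inverted_dropout(A, keep_prob, training=True):
    """Randomly zero out nodes during training; leave activations untouched otherwise."""
    if not training:
        return A                        # at test time (or when checking the cost), no dropout
    mask = np.random.rand(*A.shape) < keep_prob
    return (A * mask) / keep_prob       # scale up so the expected value of A stays the same
```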
What are the ways to speed up mini batch gradient descent and explain how
1) Momentum gradient descent
The problem with mini-batch gradient descent is that it may take many steps to reach the minimum because of the noise across batches, which causes the cost to oscillate on the way to the minimum.
Momentum gradient descent reduces the number of steps taken by smoothing out the updates using an exponentially weighted average of the gradients.
The smoothing constant, beta, is a hyperparameter, typically 0.9, which averages over roughly the last 10 iterations.
2) RMSprop
- Known as root mean square prop
- Picture the cost contours with b on the vertical axis and w on the horizontal axis.
- The aim is to slow down learning in the vertical (b) direction and speed it up in the horizontal (w) direction.
- S_dW = beta * S_dW + (1 - beta) * dW^2 (element-wise square)
- W = W - alpha * dW / sqrt(S_dW) (in practice a small epsilon is added to the denominator for numerical stability; see the sketch below)
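A sketch of both update rules for a single weight matrix (variable names are illustrative; the beta defaults and epsilon follow common practice):

```python
import numpy as np

def momentum_step(W, dW, v_dW, lr, beta=0.9):
    v_dW = beta * v_dW + (1 - beta) * dW              # exponentially weighted average of gradients
    W = W - lr * v_dW
    return W, v_dW

def rmsprop_step(W, dW, s_dW, lr, beta=0.999, eps=1e-8):
    s_dW = beta * s_dW + (1 - beta) * np.square(dW)   # average of element-wise squared gradients
    W = W - lr * dW / (np.sqrt(s_dW) + eps)           # damp directions with large gradients
    return W, s_dW
```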
Why the need for learning rate decay?
At the start of training you can afford to take bigger steps. But if the learning rate is still large when you are nearing the minimum, the algorithm may wander around and never settle into the minimum because the noisy steps are too big. Learning rate decay makes the learning rate, and hence the steps, smaller towards the end of training so the algorithm can home in on the minimum.
What is one epoch
One epoch means that every sample in the training dataset has had one opportunity to update the internal model parameters. So if you have 500 mini-batches of 100 samples, one epoch is 500 iterations, i.e., the parameters have been updated 500 times and all 50,000 samples have been seen by the model.
How is the learning rate alpha updated in LR decay
It is updated after each epoch:
alpha = alpha_0 / (1 + decay_rate * epoch_num), where alpha_0 is the initial learning rate
Other methods
Exponential decay: alpha = 0.95^epoch_num * alpha_0 (see below)
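The two schedules above written out as plain functions (alpha_0 is the initial learning rate; the 0.95 base is just the example from the card):

```python
def inverse_decay(alpha_0, decay_rate, epoch_num):
    return alpha_0 / (1 + decay_rate * epoch_num)

def exponential_decay(alpha_0, epoch_num, base=0.95):
    return (base ** epoch_num) * alpha_0
```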
How to sample values on a log scale
1) Determine the upper and lower limits
2) Transform them to the log scale: a = log10(lower), b = log10(upper)
3) Sample r uniformly at random between a and b (e.g., with np.random.uniform) and use 10^r
How to sample values on a log scale when sampling for learning rate
1) Determine the upper and lower limits, e.g., 0.0001 and 1
2) a = log10(lower limit), b = log10(upper limit)
3) Sample r uniformly at random in [a, b]
4) lr = 10^r (see the sketch below)
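A sketch using a 0.0001-to-1 range as an example (the range itself is illustrative):

```python
import numpy as np

lower, upper = 1e-4, 1.0
a, b = np.log10(lower), np.log10(upper)   # a = -4.0, b = 0.0
r = np.random.uniform(a, b)               # uniform on the log10 scale
lr = 10 ** r                              # equally likely to land in [1e-4, 1e-3] as in [0.1, 1]
```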
What is batch norm
We know that normalising the inputs helps speed up training by transforming the optimisation problem from elongated contours to more circular ones.
Batch norm normalises the inputs to the next layer, e.g., a or z (where a = g(z) and z = Wa + b), to speed up training.
Batch norm also has a slight regularisation effect:
Each mini-batch is scaled by the mean/variance computed on just that mini-batch.
This adds some noise to the values z[l] within that mini-batch, so, similar to dropout, it adds some noise to each hidden layer's activations, giving a slight regularisation effect.
How to implement batchnorm
mu = (1/m) * sum(z_i)
variance = (1/m) * sum((z_i - mu)^2)
z_norm_i = (z_i - mu) / sqrt(variance + epsilon)
But sometimes you don't want z_norm to have mean 0 and variance 1, so:
z_tilde_i = gamma * z_norm_i + beta
where gamma and beta are trainable parameters (see the sketch below)
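A sketch of those equations for one layer's pre-activations Z of shape (number of units, m), with one gamma and beta per unit (the layout is an assumption about how the layer is stored):

```python
import numpy as np

def batch_norm_forward(Z, gamma, beta, eps=1e-8):
    """Normalise Z over the mini-batch, then rescale/shift with learnable gamma and beta."""
    mu = Z.mean(axis=1, keepdims=True)          # mean per unit over the mini-batch
    var = Z.var(axis=1, keepdims=True)          # variance per unit over the mini-batch
    Z_norm = (Z - mu) / np.sqrt(var + eps)      # mean 0, variance 1
    return gamma * Z_norm + beta                # z_tilde with learnable mean/spread
```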
What activation function do you use for the output layer of a multi-class classification problem? Explain how it works
Softmax. It exponentiates each output unit's value and divides by the sum of the exponentials across all units, so the outputs all lie between 0 and 1 and sum to 1, i.e., they form a probability distribution over the classes.
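A minimal sketch for a single example's logits (the example values are arbitrary; subtracting the max is a standard numerical-stability trick):

```python
import numpy as np

def softmax(z):
    """Turn a vector of logits into a probability distribution."""
    t = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return t / np.sum(t)

print(softmax(np.array([2.0, 1.0, 0.1])))   # roughly [0.66, 0.24, 0.10]; sums to 1
```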
CNN
A Convolutional Neural Network, also known as CNN or ConvNet, is a class of neural networks that specializes in processing data that has a grid-like topology, such as an image. A digital image is a binary representation of visual data. It contains a series of pixels arranged in a grid-like fashion that contains pixel values to denote how bright and what color each pixel should be.
A CNN typically has three layers: a convolutional layer, a pooling layer, and a fully connected layer.
The convolution layer performs a dot product between two matrices, where one matrix is the set of learnable parameters otherwise known as a kernel, and the other matrix is the restricted portion of the receptive field.
During the forward pass, the kernel slides across the height and width of the image, producing a representation of each receptive region.
The pooling layer replaces the output of the network at certain locations by deriving a summary statistic of the nearby outputs. This helps reduce the spatial size of the representation, which decreases the required amount of computation and the number of weights. Max pooling is the usual default.
The fully connected layer helps to map the representation between the input and the output.
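To make the sliding-kernel and pooling ideas concrete, here is a minimal single-channel sketch (valid padding, stride 1, 2x2 max pooling; the edge-detector kernel is just an example):

```python
import numpy as np

def conv2d(image, kernel):
    """Slide the kernel over the image and take a dot product at each position."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool2d(x, size=2):
    """Replace each size-by-size block with its maximum (a summary statistic)."""
    H, W = x.shape
    x = x[:H - H % size, :W - W % size]
    return x.reshape(x.shape[0] // size, size, x.shape[1] // size, size).max(axis=(1, 3))

img = np.random.rand(6, 6)
vertical_edge_kernel = np.array([[1.0, 0.0, -1.0]] * 3)
print(max_pool2d(conv2d(img, vertical_edge_kernel)).shape)   # (2, 2)
```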
Other regularisation techniques
- Data augmentation: e.g., for images, flip or crop the images to increase the size of the dataset
- Early stopping: plot cost against iterations for both the CV (dev) set and the training set, and stop training when the CV cost starts to increase.
Why normalise data
It is easier for gradient descent to converge to the minimum when the features are on similar scales.
Minibatch or batch gradient descent for larger training sets? Why?
For training on a large dataset, use mini-batch gradient descent, because it runs much faster than batch gradient descent.
With batch gradient descent, a single pass through the training set lets you take only one gradient descent step. With mini-batch gradient descent, a single pass through the training set (one epoch) lets you take one step per mini-batch, e.g., 5,000 steps if the set is split into 5,000 mini-batches. Of course, you usually want to take multiple passes through the training set, which requires an outer loop over epochs.
Diff between gradient descent and stochastic gradient descent
- Because (batch) gradient descent uses the whole training set for each update, the algorithm takes nice, large steps towards the minimum. SGD uses each individual training example as its own mini-batch, so the descent is much noisier because single examples give noisy gradients; it tends not to settle exactly at the minimum but wanders around it instead.
What is the con of SGD
You lose the speedup from vectorisation, because examples are processed one at a time
Mini batch size
- If the training set has fewer than about 2,000 examples, use batch gradient descent
- Otherwise use 64, 128, 256, or 512 (powers of 2, which suit computer memory layouts)
- Also make sure a mini-batch fits in CPU/GPU memory
- Try different values and see which one is most efficient
Pros of minibatch
- You make progress (take gradient steps) before processing the whole dataset
What is RMS prop
- Known as root mean square prop
- Picture the cost contours with b on the vertical axis and w on the horizontal axis.
- The aim is to slow down learning in the vertical (b) direction and speed it up in the horizontal (w) direction.
- S_dW = beta * S_dW + (1 - beta) * dW^2 (element-wise square)
- W = W - alpha * dW / sqrt(S_dW) (in practice a small epsilon is added to the denominator for numerical stability)
What is Adam optimisation
- Adaptive moment estimation
- Combines momentum with RMSprop (a sketch follows)
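A sketch of the Adam update for one weight matrix, with bias correction and the commonly used defaults (variable names are illustrative):

```python
import numpy as np

def adam_step(W, dW, v, s, t, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    v = beta1 * v + (1 - beta1) * dW                 # momentum: average of gradients
    s = beta2 * s + (1 - beta2) * np.square(dW)      # RMSprop: average of squared gradients
    v_hat = v / (1 - beta1 ** t)                     # bias correction, t = iteration count from 1
    s_hat = s / (1 - beta2 ** t)
    W = W - lr * v_hat / (np.sqrt(s_hat) + eps)
    return W, v, s
```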
Tuning DL networks
List of things to tune
- number of layers
- number of nodes
- Learning rate / learning rate decay rate
- Size of mini batch
- Dropout
Use random search rather than grid search to cover more of the search space, then narrow the search down (coarse to fine) around the best values.
Tuning importance:
Alpha / LR
Momentum term beta (a value around 0.9 is a common default)
Hidden units
Mini batch size
Number of layers
LR decay
Rarely or never tuned:
beta1, beta2, and epsilon in Adam; the defaults (0.9, 0.999, 1e-8) almost always work fine