Neural network fundamentals flashcards
Neural network representation


Why is deep learning taking off only now?
- Large amounts of data
- Faster computation (specialised GPUs)
- Better algorithms
- ReLU activation function trains faster than sigmoid (sigmoid's gradient shrinks for large local fields)
Describe Tanh() in terms of sigmoid
Tanh is a shifted and rescaled version of sigmoid with range -1 to 1: tanh(z) = 2·sigmoid(2z) - 1
Tanh activation vs sigmoid activation
- Tanh is usually superior because the mean of the layer's activations is closer to zero, which makes learning easier for the next layer
- Only use sigmoid as the activation of the output layer, when doing binary classification (0/1 output values)
- A downside of both activation functions is that if z (the local field of the neuron) is very large, the slope/gradient is very small, making gradient descent slow
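A minimal numpy sketch (illustrative, not from the cards) comparing both activations and their saturating gradients:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-10.0, 0.0, 10.0])
s = sigmoid(z)
print(np.tanh(z))         # tanh(z) = 2*sigmoid(2z) - 1, range (-1, 1)
print(s * (1 - s))        # sigmoid'(z) = s(1-s): nearly 0 at |z| = 10
print(1 - np.tanh(z)**2)  # tanh'(z) = 1 - tanh(z)^2: saturates the same way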
Softmax function
- Each output is between 0 and 1
- The outputs sum to 1
- Can be viewed as a vector of probabilities
- Only used in the output layer, for multiclass classification
- Usually combined with the cross-entropy loss
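A minimal numpy sketch of softmax (the max-subtraction trick is a standard numerical-stability detail, not from the card):

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # subtracting the max avoids overflow; result unchanged
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))
print(p, p.sum())  # each entry is in (0, 1) and the entries sum to 1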

Softmax activation function:
Visualize a simple neural net with 1 softmax layer
- Equivalent to multiclass logistic regression
- Linear decision boundaries separating the classes

Forward propagation
- In the vectorized form we can feed the whole dataset X at once
- Each column of a layer's activation matrix corresponds to one input sample

Vectorized forward propagation
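A minimal vectorized forward-pass sketch for an assumed 2-layer network (the names W1, b1, W2, b2 and the shapes are illustrative):

import numpy as np

def forward(X, W1, b1, W2, b2):
    # X has shape (n_features, m): each column is one input sample
    Z1 = W1 @ X + b1                 # broadcasting adds b1 to every column
    A1 = np.tanh(Z1)                 # hidden layer activation
    Z2 = W2 @ A1 + b2
    A2 = 1.0 / (1.0 + np.exp(-Z2))   # sigmoid output for binary classification
    return A2                        # shape (1, m): one output per sample

X = np.random.rand(3, 5)
W1, b1 = np.random.rand(4, 3), np.zeros((4, 1))
W2, b2 = np.random.rand(1, 4), np.zeros((1, 1))
print(forward(X, W1, b1, W2, b2).shape)  # (1, 5)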


Intuition in deep learning
- Earlier layers detect simpler features
- Later layers compose these features into more complex patterns (e.g. eyes, mouth)
- The same applies to other domains, not just images
- E.g. in sound, low-level features can be tones, while later layers represent complex forms such as words or phrases

Circuit theory and deep learning
- Computing XOR with one hidden layer requires an exponentially large layer (right network)
- Computing it with several layers uses far fewer neurons (left image)

Optimization-based learning

Empirical risk
- An approximation of the true expected loss, computed on the training data.
- The training data is a finite set of samples from the true distribution p(x,y); it does not capture the distribution fully
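In symbols (notation assumed, not from the card), with m training samples:

\hat{R}(\theta) = \frac{1}{m}\sum_{i=1}^{m} L\big(f(x^{(i)};\theta),\, y^{(i)}\big) \;\approx\; \mathbb{E}_{(x,y)\sim p}\big[L(f(x;\theta),\, y)\big]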

How to make a neural network generalize better on unseen data?
- Adjust the loss by adding regularization terms
- Split the data into training, validation, and test sets
Optimization-based learning

Empirical risk vs generalization

Using training data to formulate the loss

Local optima in neural networks

Problem of plateau

Maximum likelihood estimation
- A method of estimating the parameters of a probability distribution by maximizing a likelihood function, so that under the assumed statistical model the observed data is most probable.
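In symbols (notation assumed), for i.i.d. samples x^(1), ..., x^(m):

\hat{\theta} = \arg\max_{\theta} \prod_{i=1}^{m} p\big(x^{(i)} \mid \theta\big) = \arg\max_{\theta} \sum_{i=1}^{m} \log p\big(x^{(i)} \mid \theta\big)

Maximizing the log-likelihood gives the same argmax and turns the product into a sum.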

One-hot encoding

When should we use the cross-entropy loss function?
For classification, when the output layer uses the softmax function
One-hot encoding and the cross-entropy loss
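A minimal numpy sketch (helper names are illustrative) combining one-hot labels with the cross-entropy loss; with one-hot labels only the log-probability of the true class contributes per sample:

import numpy as np

def one_hot(y, num_classes):
    Y = np.zeros((len(y), num_classes))
    Y[np.arange(len(y)), y] = 1.0   # put a 1 in the column of the true class
    return Y

def cross_entropy(Y, P, eps=1e-12):
    # Y: one-hot labels, P: softmax outputs; eps guards against log(0)
    return -np.mean(np.sum(Y * np.log(P + eps), axis=1))

Y = one_hot(np.array([0, 2]), num_classes=3)
P = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.2, 0.7]])
print(cross_entropy(Y, P))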

Broadcasting
- Describes how numpy treats arrays with different shapes during arithmetic operations: the smaller array is "broadcast" across the larger array so that they have compatible shapes
- E.g. C = A + b, where C_ij = A_ij + b_j
- In other words, the vector b is added to each row of the matrix A
- This shorthand eliminates the need to build a matrix B with b copied into each row before doing the addition
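A minimal runnable example of the C = A + b case above:

import numpy as np

A = np.arange(6).reshape(2, 3)  # shape (2, 3)
b = np.array([10, 20, 30])      # shape (3,)
C = A + b                       # b is broadcast across each row of A
print(C)                        # [[10 21 32], [13 24 35]]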
Batch vs mini-batch
Batch means you use all of your data to compute the gradient in one iteration. Mini-batch means you only use a subset of the data in each iteration.
Epoch
- A single pass through the whole training set
- E.g. dividing a dataset of 2000 examples into batches of 500 gives 4 iterations per epoch (batch size 500, 4 iterations for 1 complete epoch)
- The number of batches equals the number of iterations per epoch
What does the training progress look like for batch training and mini-batch training?

- For batch training the loss should always go down
- If it ever goes up, something is wrong (e.g. the learning rate is too high)
- Mini-batch training is slightly noisier, but the overall trend is that the loss decreases with the number of iterations

Stochastic gradient descent
- Each data sample is its own batch (batch size = 1)
- Very noisy updates
- Will never completely converge to the minimum, but jumps around it
- Loses the speed-up from vectorization
Mini-batch gradient descent
- Train on a subset of the data (a mini-batch) rather than the whole training set
- Benefits (works best in practice compared to batch or stochastic training):
- You see progress from gradient descent without processing the entire dataset
- Converges quicker because it takes many cheaper gradient steps per epoch
- Less likely to get stuck in a local minimum than batch training
- Uses less main memory
- Reshuffle the samples into new mini-batches each epoch (see the sketch after this list)
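A minimal mini-batch loop sketch (params and compute_grads are assumed placeholders for the model's parameters and gradient function):

import numpy as np

def minibatch_gd(X, y, params, compute_grads, lr=0.01, batch_size=64, epochs=10):
    m = X.shape[0]
    for epoch in range(epochs):
        perm = np.random.permutation(m)            # reshuffle every epoch
        for start in range(0, m, batch_size):
            idx = perm[start:start + batch_size]   # one mini-batch of indices
            grads = compute_grads(params, X[idx], y[idx])
            for k in params:
                params[k] -= lr * grads[k]         # one gradient step per mini-batch
    return params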
Batch training
- Use the whole dataset during each step of gradient descent
- Slower than mini-batch or stochastic training
- However, the progress is more accurate and precise
Visually explain
- Batch training
- Mini-batch training
- Stochastic gradient descent

Gradient descent with momentum. Describe the “momentum”
- Compute an exponentially weighted average of the gradients
- Each gradient descent step depends on the previous steps
On iteration t, compute dW, db on the current mini-batch:
vdW = β·vdW + (1-β)·dW
vdb = β·vdb + (1-β)·db
W = W - α·vdW , b = b - α·vdb
- The last line is an ordinary gradient descent step with vdW used in place of the gradient
- The vector vdW is an exponentially weighted average of the recent gradients dW, which reduces the oscillations in dW
Hyperparameters: learning rate α and the exponentially-weighted-average coefficient β
β = 0.9
Averages over roughly the last 10 gradients (≈ 1/(1-β))
β = 0.5
Averages over roughly the last 2 gradients
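A minimal sketch of one momentum step matching the formulas above (dW, db are the current mini-batch gradients; numpy arrays assumed):

def momentum_step(W, b, dW, db, vdW, vdb, beta=0.9, lr=0.01):
    vdW = beta * vdW + (1 - beta) * dW   # exponentially weighted average of dW
    vdb = beta * vdb + (1 - beta) * db   # ... and of db
    W = W - lr * vdW                     # step along the smoothed gradient
    b = b - lr * vdb
    return W, b, vdW, vdb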

RMSprop
- Keeps an exponentially weighted average of the squared gradients and divides each update by its square root
- Damps oscillating directions, which allows selecting a larger learning rate
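A minimal sketch of one RMSprop step, shown for W only (the beta default is an assumption):

import numpy as np

def rmsprop_step(W, dW, sdW, beta=0.9, lr=0.01, eps=1e-8):
    sdW = beta * sdW + (1 - beta) * dW**2     # average of squared gradients
    W = W - lr * dW / (np.sqrt(sdW) + eps)    # damps directions that oscillate
    return W, sdW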

Gradient descent with Adam. Describe Adam
- Adaptive moment estimation
- Combines momentum and RMSprop
- β1
- First-moment coefficient (exponentially weighted average of the gradients)
- Typically 0.9
- β2
- Second-moment coefficient (exponentially weighted average of the squared gradients)
- Typically 0.999
- ε = 10^-8, used to avoid dividing by zero
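A minimal sketch of one Adam step for a single parameter W (t is the iteration count, starting at 1, needed for bias correction):

import numpy as np

def adam_step(W, dW, vdW, sdW, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    vdW = beta1 * vdW + (1 - beta1) * dW       # first moment (momentum)
    sdW = beta2 * sdW + (1 - beta2) * dW**2    # second moment (RMSprop)
    v_hat = vdW / (1 - beta1**t)               # bias correction: the averages
    s_hat = sdW / (1 - beta2**t)               # start at 0 and need boosting
    W = W - lr * v_hat / (np.sqrt(s_hat) + eps)
    return W, vdW, sdW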

Learning rate decay
- In the initial steps of learning we can use a larger learning rate
- After a while gradient descent will jump around a minimum, so we then want a smaller learning rate
- With α0 the initial learning rate and epochNum the number of epochs run:
α = α0 / (1 + decayRate·epochNum)
α = 0.95^epochNum · α0 (exponential decay)
α = α0 halved each epoch (discrete staircase)
- The decay rate becomes another hyperparameter
- Manual decay
- Manually decrease the learning rate while gradient descent is running
- Only works when training a small number of models
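The three schedules from the card, as a small sketch (alpha0 is the initial learning rate):

def inverse_decay(alpha0, decay_rate, epoch):
    return alpha0 / (1 + decay_rate * epoch)

def exponential_decay(alpha0, epoch, base=0.95):
    return base**epoch * alpha0

def staircase_decay(alpha0, epoch):
    return alpha0 / 2**epoch   # halve the learning rate each epoch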

Train/dev/test set
- Split the data into train/dev/test parts
- The dev set is used to tune the hyperparameters (e.g. learning rate) of a model
- Hyperparameters in a neural network
- # Layers
- # Neurons
- Activation function
- Learning rate
- Momentum
- Choosing the split proportions
- If your data is small, use the classical proportions 70/20/10
- If your data is big, use modern (big-data era) proportions like 98/1/1
Bias/variance
- Bias: performance on the training data compared to optimal performance
- Variance: the difference between the loss on training and validation data
- High variance
- Overfitting
- Example (assume human error ≈ 0)
- Train error: 1%
- Dev error: 11%
- High bias
- Underfitting
- Example (assume human error ≈ 0)
- Train error: 15%
- Dev error: 16%
- Low bias/low variance
- Best case
- Train error matches dev error
- Example (assume human error ≈ 0)
- Train error: 0.5%
- Dev error: 1%
- High bias/High variance
- Worst case
- The model has many parameters and is flexible, yet still fits parts of the training data badly
- Example (assume human error ≈ 0 in cat/dog pictures)
- Train error: 15%
- Dev error: 30%

Basic recipe for NN
- High bias (underfitting)? Try a bigger network or training longer; getting more data is not going to help
- High variance (overfitting)? Try more data or regularization
- There is less of a bias/variance tradeoff in NNs: bias and variance can be reduced largely independently

Regularization
- The regularization strength is tuned as a hyperparameter
- L1 regularization
- The weight vector w becomes sparse (contains many zeros)
- L2 regularization
- Penalizes the (squared) Euclidean norm of the weights

Show how regularization “decays the weight”
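A sketch of the usual L2 derivation (λ is the regularization strength, m the number of samples):

J = J_{\text{orig}} + \frac{\lambda}{2m}\lVert W\rVert^2 \quad\Rightarrow\quad dW = dW_{\text{orig}} + \frac{\lambda}{m}W

W := W - \alpha\,dW = \left(1 - \frac{\alpha\lambda}{m}\right)W - \alpha\,dW_{\text{orig}}

Since (1 - αλ/m) < 1, every update multiplies the weights by a factor slightly below one, i.e. it "decays" them.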

Normalizing data
- Subtract the mean from each sample:
mu = 1/m · Σ_{i=1..m} x_i
x := x - mu
- Divide each sample by the standard deviation:
sigma^2 = 1/m · Σ_{i=1..m} x_i**2 (**2 means elementwise squaring)
x := x / sigma
Get mu and sigma from the training data and use these same values to also normalize the test data.
- Why normalize?
- The scales of the input features might differ drastically
- The different scaling of the features makes gradient descent steps oscillate
- More circular contours mean gradient descent needs fewer steps to converge
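A minimal numpy sketch (the random data is assumed, just for illustration):

import numpy as np

X_train = np.random.rand(100, 3) * [1, 100, 1000]  # features on very different scales
X_test = np.random.rand(20, 3) * [1, 100, 1000]

mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)

X_train = (X_train - mu) / sigma
X_test = (X_test - mu) / sigma   # reuse the training-set mu and sigma on test data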

Batch normalization
- Makes hyperparameter search easier
- Makes the network much more robust to the choice of parameters
- A much bigger range of hyperparameters works well
- Applies the same normalization process used on the inputs to the z values (the inputs to the neurons) at each layer
- Later layers become more robust to changes in earlier layers
- Allows each layer to learn somewhat independently of the other layers
- Reduces covariate shift
- If the distribution of X changes we have to re-train the learning algorithm; this is true even if the ground-truth mapping X => y remains unchanged
- Batch norm at test time
- Use a separate estimate of the mean and variance obtained during training, not from the test set
- E.g. an exponentially weighted average (or any other averaging method) over the mini-batches
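A minimal sketch of the batch-norm forward pass at training time (gamma and beta are the learned scale and shift; the layout with samples along axis 1 is an assumption):

import numpy as np

def batchnorm_forward(Z, gamma, beta, eps=1e-8):
    mu = Z.mean(axis=1, keepdims=True)      # per-neuron mean over the mini-batch
    var = Z.var(axis=1, keepdims=True)      # per-neuron variance
    Z_norm = (Z - mu) / np.sqrt(var + eps)
    return gamma * Z_norm + beta            # learned scale/shift can undo the normalization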

Explain informally how regularization reduces variance
- Penalizing large weights pushes many weights toward zero, which makes the network effectively simpler (closer to linear), so it fits the noise in the training data less

Give reasons why image classification is so hard
- Occluded/hidden parts
- Deformable objects
- Viewing objects from a weird angle
Define gradient descent, and show how to update the weights
- Compute the gradient of the loss and update the weights in the opposite direction of the gradient.
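A minimal sketch of one update step (params and grads as dicts of numpy arrays is an assumed layout):

def gradient_descent_step(params, grads, lr=0.01):
    # Move every parameter a small step against its gradient
    return {k: params[k] - lr * grads[k] for k in params}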

Gradient descent. How does each step change the cost function?

Forward mode differentiation

Backward-mode differentiation
- Also known as backprop

Exponentially weighted average

Bias correction
- Associated with the exponentially weighted average: early values of v_t are biased toward 0 because v_0 = 0, so divide by (1 - β^t)
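A minimal sketch showing the average with and without bias correction (numpy assumed):

import numpy as np

def ewa(xs, beta=0.9, bias_correct=True):
    v, out = 0.0, []
    for t, x in enumerate(xs, start=1):
        v = beta * v + (1 - beta) * x          # v_t = beta*v_{t-1} + (1-beta)*x_t
        out.append(v / (1 - beta**t) if bias_correct else v)
    return np.array(out)

print(ewa([1.0, 1.0, 1.0], bias_correct=False))  # starts near 0.1, far below the data
print(ewa([1.0, 1.0, 1.0]))                      # corrected: stays at 1.0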

Backpropagation
In backpropagation, the gradient of the loss is computed with respect to all variables in the function so that the parameters can be updated using gradient descent.

- The first is not correct: the number of paths would grow exponentially
- The second one is correct
Adam optimization

The first and fourth are correct
Does the runtime of backpropagation depend on the number of samples?
Yes: the loss is a sum over the samples, so we have to take the gradient of each term separately
