Neural network fundamentals Flashcards
Neural network representation
Why is deep learning taking off only now?
- Larger amounts of data
- Faster computation (specialised GPUs)
- Better algorithms
- The ReLU activation function trains faster than sigmoid (sigmoid has small gradients for large local fields)
Describe Tanh() in terms of sigmoid
The tanh activation function is a shifted and scaled version of the sigmoid, with range -1 to 1
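Concretely, with the sigmoid defined as $\sigma(z) = 1/(1+e^{-z})$, the standard identity is:

```latex
\tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}} = 2\,\sigma(2z) - 1
```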
Tanh activation vs sigmoid activation
- Tanh is usually superior because the mean of its activations is closer to zero, which makes learning easier for the next layer
- Only use sigmoid as the activation of the output layer when doing binary classification (0/1 output values)
- A downside of both activation functions is that if z (the neuron's local field) is very large in magnitude, the slope/gradient is very small, making gradient descent slow (see the derivatives below)
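This follows directly from the derivatives, both of which go to zero as |z| grows (saturation):

```latex
\sigma'(z) = \sigma(z)\,\bigl(1 - \sigma(z)\bigr), \qquad \tanh'(z) = 1 - \tanh^2(z)
```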
Softmax function
- Each output is between 0 and 1
- The outputs of the layer sum to 1
- Can be viewed as a vector of class probabilities
- Only used in the output layer, for multiclass classification
- Combined with the cross-entropy loss
Softmax activation function:
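A minimal sketch of the softmax formula $\hat{y}_k = e^{z_k} / \sum_j e^{z_j}$, assuming a NumPy-style implementation (the function name here is illustrative):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax for a vector of logits z."""
    # Subtracting the max does not change the result but avoids overflow in exp.
    e = np.exp(z - np.max(z))
    return e / e.sum()

# Example: three-class logits -> a probability vector that sums to 1
print(softmax(np.array([2.0, 1.0, 0.1])))  # ≈ [0.659 0.242 0.099]
```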
Visualize a simple neural net with one softmax layer
- Equivalent to multiclass logistic regression
- Linear decision boundaries separate the classes
Forward propagation
- In the vectorized form we can feed the whole dataset X at once
- Each column of a layer's activation matrix corresponds to one input example
Vectorized forward propagation
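A minimal sketch of a vectorized pass through a 2-layer network, assuming NumPy and the convention that each column of X is one example (variable names are illustrative):

```python
import numpy as np

def forward(X, W1, b1, W2, b2):
    """One forward pass over the whole dataset X (each column is one example)."""
    Z1 = W1 @ X + b1                 # (n_hidden, m): pre-activations of the hidden layer
    A1 = np.tanh(Z1)                 # hidden activations
    Z2 = W2 @ A1 + b2                # (n_out, m): pre-activations of the output layer
    A2 = 1.0 / (1.0 + np.exp(-Z2))   # sigmoid output for binary classification
    return A2                        # column j holds the prediction for example j

# Example shapes: 3 features, 4 hidden units, 1 output, m = 5 examples
X = np.random.randn(3, 5)
W1, b1 = np.random.randn(4, 3), np.zeros((4, 1))
W2, b2 = np.random.randn(1, 4), np.zeros((1, 1))
print(forward(X, W1, b1, W2, b2).shape)  # (1, 5)
```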
Intuition in deep learning
- Earlier layers detect simpler features/functions
- Later layers compose these into more complex patterns (e.g. eyes, mouth)
- The same applies to other domains, not just images
- e.g. in audio, low-level features can be tones, while later layers can represent more complex forms such as words or phrases
Circuit theory and deep learning
- Computing the XOR/parity of n inputs with only one hidden layer requires an exponentially large number of neurons (right network)
- Computing it with several layers (a deeper network) uses far fewer neurons (left image)
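Rough counts for the usual illustration, computing the parity (XOR) of n inputs:

```latex
\text{Deep network (tree of pairwise XORs): } \approx n-1 \text{ units over } O(\log n) \text{ layers}
\qquad
\text{Shallow network (one hidden layer): on the order of } 2^{\,n-1} \text{ hidden units}
```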
Optimization based learning
Empirical risk
- An approximation of the true expected loss, based on the training data
- The training data consists of a finite set of samples from the true distribution p(x,y), so it does not capture the distribution fully
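In symbols, with N training samples (x_i, y_i) drawn from p(x,y):

```latex
\underbrace{R(\theta) = \mathbb{E}_{(x,y)\sim p}\bigl[\ell(f_\theta(x), y)\bigr]}_{\text{true (expected) risk}}
\;\approx\;
\underbrace{\hat{R}_N(\theta) = \frac{1}{N}\sum_{i=1}^{N} \ell(f_\theta(x_i), y_i)}_{\text{empirical risk}}
```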
How to make a neural network generalize better on unseen data?
- Adjust the loss by adding regularization terms (e.g. an L2 penalty, see the formula after this list)
- Split the data into training, validation, and test sets
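One common form of a regularized objective (an L2 / weight-decay penalty, with λ controlling its strength):

```latex
J(\theta) = \frac{1}{N}\sum_{i=1}^{N} \ell(f_\theta(x_i), y_i) + \frac{\lambda}{2}\,\lVert \theta \rVert_2^2
```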
Empirical risk vs generalization
- The loss used for training is formulated on the training data (empirical risk); generalization is about performance on unseen data from the same distribution
Local optima in neural networks
The problem of plateaus (flat regions where the gradient stays close to zero for a long time, slowing learning)
Maximum likelihood estimation
- A method of estimating the parameters of a probability distribution by maximizing a likelihood function, so that under the assumed statistical model the observed data is most probable
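Formally, for i.i.d. observations; maximizing the log-likelihood is the same as minimizing the negative log-likelihood, which for classification is exactly the cross-entropy loss:

```latex
\hat{\theta}_{\text{MLE}}
= \arg\max_{\theta} \prod_{i=1}^{N} p(y_i \mid x_i; \theta)
= \arg\max_{\theta} \sum_{i=1}^{N} \log p(y_i \mid x_i; \theta)
= \arg\min_{\theta} \; -\sum_{i=1}^{N} \log p(y_i \mid x_i; \theta)
```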
One hot encoding
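- Represents a class label as a binary vector with a single 1, e.g. with 4 classes, label 2 → [0, 0, 1, 0]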
When should we use the cross-entropy loss function?
For classification, when we use the softmax function on the output layer
One hot encoding and the cross entropy loss
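A minimal sketch tying the two together, assuming NumPy (function names are illustrative): with a one-hot target, the cross-entropy sum collapses to minus the log-probability assigned to the true class.

```python
import numpy as np

def one_hot(label, num_classes):
    """Encode an integer class label as a one-hot vector."""
    y = np.zeros(num_classes)
    y[label] = 1.0
    return y

def cross_entropy(y_one_hot, y_prob):
    """Cross-entropy between a one-hot target and predicted probabilities."""
    # Only the true class contributes, so this equals -log(y_prob[true class]).
    return -np.sum(y_one_hot * np.log(y_prob))

y = one_hot(2, 4)                        # [0. 0. 1. 0.]
y_hat = np.array([0.1, 0.2, 0.6, 0.1])   # e.g. a softmax output
print(cross_entropy(y, y_hat))           # -log(0.6) ≈ 0.511
```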