Neural network fundamentals Flashcards
Neural network representation


Why is deep learning taking off only now?
- Much larger amounts of data
- Faster computation (specialised hardware such as GPUs)
- Better algorithms
- The ReLU activation function trains faster than sigmoid (sigmoid saturates: its gradient is very small for large local fields) - see the sketch below
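A minimal numerical sketch of the ReLU point (an illustration, not from the original cards): for large positive local fields the sigmoid gradient nearly vanishes, while the ReLU gradient stays at 1.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)          # at most 0.25, and close to 0 for large |z|

def relu_grad(z):
    return (z > 0).astype(float)  # exactly 1 for every positive z

z = np.array([0.0, 2.0, 5.0, 10.0])
print(sigmoid_grad(z))  # roughly [0.25, 0.105, 0.0066, 0.000045] -> vanishing
print(relu_grad(z))     # [0., 1., 1., 1.]                        -> constant slope
```
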
Describe tanh() in terms of sigmoid
- The tanh activation function is a scaled and shifted version of the sigmoid, with range -1 to 1
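- As a formula (standard identity, added for reference): tanh(z) = 2*sigmoid(2z) - 1, where sigmoid(z) = 1 / (1 + e^{-z})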
Tanh activation vs sigmoid activation
- Tanh is usually superior because the mean of the layer's outputs is closer to zero, which makes learning easier for the next layer
- Only use sigmoid as the activation of the output layer when doing binary classification (0-1 values for the output)
- A downside of both activation functions is that when z (the neuron's local field) is very large in magnitude, the slope/gradient is very small, which makes gradient descent slow
Softmax function
- Each output is between 0 and 1
- The outputs of the layer sum to 1
- Can be viewed as a vector of class probabilities
- Only used in the output layer, for multiclass classification
- Combined with the cross-entropy loss (see the sketch below)
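A minimal sketch of a numerically stable softmax in NumPy (an illustration, not part of the original card):

```python
import numpy as np

def softmax(z):
    """Column-wise softmax for a (classes, examples) matrix of logits."""
    z_shift = z - z.max(axis=0, keepdims=True)   # subtract max for numerical stability
    exp_z = np.exp(z_shift)
    return exp_z / exp_z.sum(axis=0, keepdims=True)

z = np.array([[2.0, 1.0],
              [1.0, 3.0],
              [0.1, 0.2]])
p = softmax(z)
print(p)               # every entry is between 0 and 1
print(p.sum(axis=0))   # each column sums to 1
```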

Softmax activation function:
Visualize a simple neural net with a single softmax output layer
- Equivalent to multiclass logistic regression
- Linear decision boundaries separating the classes

Forward propagation
- In the vectorized form we can feed the whole dataset X at once
- Each column of the resulting activation matrix corresponds to one data input

Vectorized forward propagation
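A minimal sketch of vectorized forward propagation for a 2-layer network, assuming tanh in the hidden layer and sigmoid in the output layer (layer sizes and variable names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W1, b1, W2, b2):
    """X has shape (features, examples); each column of A2 is one example's output."""
    Z1 = W1 @ X + b1        # (hidden, examples), b1 broadcast over columns
    A1 = np.tanh(Z1)
    Z2 = W2 @ A1 + b2       # (outputs, examples)
    A2 = sigmoid(Z2)
    return A2

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 5))            # 3 features, 5 examples
W1, b1 = rng.standard_normal((4, 3)), np.zeros((4, 1))
W2, b2 = rng.standard_normal((1, 4)), np.zeros((1, 1))
print(forward(X, W1, b1, W2, b2).shape)    # (1, 5): one output per example
```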


Intuition in deep learning
- Earlier layers detect simpler features
- Later layers compose these features into more complex patterns (i.e. eyes, mouth)
- The same applies to other domains, not just images
- i.e. in sound, low-level features can be individual tones, while later layers can represent more complex forms such as phrases

Circuit theory and deep learning
- Computing a function such as XOR of n inputs with one hidden layer requires an exponentially large number of hidden units
- Computing the same function with several (deeper) layers uses far fewer neurons (see the worked count below)
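A worked count for the usual n-way XOR (parity) example, added for reference:
- Deep network: a binary tree of pairwise XOR units has depth about log_2(n) and only O(n) units
- Single hidden layer: the hidden layer essentially has to enumerate input patterns, needing on the order of 2^{n-1} units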

Optimization based learning

Empirical risk
- An approximation of the true expected loss, computed on the training data (formula below)
- The training data is a finite set of samples from the true distribution p(x, y); it does not capture the distribution fully
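- As a formula (notation added for reference): R_{emp}(w) = (1/N) sum_{i=1}^{N} L(f(x_i; w), y_i), which approximates the expected loss E_{p(x,y)}[L(f(x; w), y)]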

How to make a neural network generalize better on unseen data?
- Adjust the loss by adding regularization terms (see the example below)
- Split the data into training, validation, and test sets
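- Example (L2 regularization / weight decay, illustrative notation): J(w) = (1/N) sum_{i} L(f(x_i; w), y_i) + (lambda/2) ||w||^2, where lambda controls the strength of the penalty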

Empirical risk vs generalization
- Minimizing the empirical risk only guarantees good performance on the training samples; generalization refers to performance on unseen data drawn from p(x, y)

Using training data to formulate the loss

Local optima in neural networks
- In high-dimensional parameter spaces, most points where the gradient is zero are saddle points rather than poor local optima, so getting stuck in a bad local optimum is less likely than intuition suggests

Problem of plateaus
- A plateau is a large region where the gradient is close to zero, so gradient descent makes very slow progress across it

Maximum likelihood estimation
- A method of estimating the parameters of a probability distribution by maximizing a likelihood function, so that under the assumed statistical model the observed data is most probable
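- As a formula (notation added for reference): theta_hat = argmax_{theta} prod_{i=1}^{N} p(x_i | theta) = argmax_{theta} sum_{i=1}^{N} log p(x_i | theta)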

One hot encoding
- Representing a categorical label as a vector that is 1 at the index of the true class and 0 everywhere else

When should we use the cross-entropy loss function?
For classification, when we use the softmax function on the output layer
One hot encoding and the cross entropy loss
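A minimal sketch tying the two together (an illustration, not from the original card): with a one-hot label y and softmax output p, the cross-entropy loss reduces to -log of the predicted probability of the true class.

```python
import numpy as np

def one_hot(label, num_classes):
    y = np.zeros(num_classes)
    y[label] = 1.0
    return y

def cross_entropy(p, y):
    """p: softmax probabilities, y: one-hot label vector."""
    return -np.sum(y * np.log(p))   # only the true-class term survives

p = np.array([0.7, 0.2, 0.1])       # softmax output over 3 classes
y = one_hot(0, 3)                    # true class is 0
print(cross_entropy(p, y))           # -log(0.7), roughly 0.357
```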

Broadcasting
- Describes how NumPy treats arrays with different shapes during arithmetic operations: the smaller array is “broadcast” across the larger array so that they have compatible shapes
- i.e. C = A + b where C_{ij} = A_{ij} + b_j
- In other words, the vector b is added to each row of the matrix A (see the example below)
- This shorthand eliminates the need to build a matrix B with b copied into each row before doing the addition
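A minimal NumPy example of the rule above:

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])     # shape (2, 3)
b = np.array([10.0, 20.0, 30.0])    # shape (3,)

C = A + b                            # b is broadcast across each row of A
print(C)
# [[11. 22. 33.]
#  [14. 25. 36.]]
```
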
Batch vs mini-batch
Batch gradient descent uses all of the training data to compute the gradient in one iteration. Mini-batch gradient descent uses only a subset of the data in each iteration (see the sketch below).
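A minimal sketch of the mini-batch loop, assuming X and Y store examples as columns and a compute_gradient(X, Y, w) helper (hypothetical name) that returns the gradient of the loss:

```python
import numpy as np

def minibatch_gradient_descent(X, Y, w, compute_gradient,
                               lr=0.01, batch_size=64, epochs=10):
    """Batch gradient descent is the special case batch_size = number of examples."""
    n = X.shape[1]                      # examples are stored as columns
    rng = np.random.default_rng(0)
    for _ in range(epochs):
        order = rng.permutation(n)      # shuffle the examples each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            grad = compute_gradient(X[:, idx], Y[:, idx], w)  # gradient on the mini-batch only
            w = w - lr * grad           # one parameter update per mini-batch
    return w
```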