NN FINAL Flashcards
Hidden layers are needed if
the data must be separated using a non-linear boundary
Major difference between ANN and Perceptron
the inclusion of hidden layers
Universal Approximation Theorem for Neural Networks
An FFNN with a single hidden layer containing an arbitrary number of neurons can approximate any continuous function
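A Cybenko-style statement of the theorem (a standard formulation, assuming σ is a suitable non-linear activation; not quoted from the course):

```latex
% For any continuous f on a compact set and any eps > 0, there exist
% N, v_i, w_i, b_i such that the one-hidden-layer network is eps-close to f:
\left| f(x) - \sum_{i=1}^{N} v_i \,\sigma\!\left(w_i^{\top} x + b_i\right) \right| < \varepsilon
\quad \text{for all } x
```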
Universal Approximation Theorem also proved for
an arbitrary number of hidden layers, each containing a limited number of neurons
Hidden layers can represent
arbitrarily complex decision boundaries
Deep neural network meaning
Refers to the number of hidden layers (depth); typically more than 3-4 hidden layers
FFNN
information flows forward, from the input layer to the output layer
Back prop
errors are propagated backwards to correct the weights
Downstream
towards the output layer (drawn to the right in standard network diagrams)
Upstream
towards the input layer (drawn to the left)
FFNN used for
General neural networks, classification, regression
Convolutional Neural Networks
Excel at image recognition
Recurrent Neural Networks
Excel at sequence tasks such as language modeling and predicting the next word
Long short-term memory networks
Like RNNs, but for tasks that require longer context
Generative Adversarial Networks
The generative neural network is trained to generate something
The adversarial (discriminator) network is then trained to classify whether what was generated is real or fake
Hidden nodes learn
latent representation (features useful for class boundaries)
First hidden layer captures
simpler features (since it receives the predictors as input)
Subsequent hidden layers home in on
specific patterns in the data to extract higher-level features
What does a neuron do?
Exactly the same thing we saw the perceptron do: the input part computes w^T x (a weighted sum of the inputs), and the output part applies the activation function
Activation function is important, as it provides
non-linearity to an ANN and allows it to create non-linear class boundaries
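A minimal NumPy sketch of one neuron, assuming a ReLU activation (all names and values are illustrative):

```python
import numpy as np

def neuron(x, w, b):
    """One neuron: the input part computes w^T x + b; the output part applies an activation."""
    z = np.dot(w, x) + b        # weighted sum of inputs plus bias
    return max(0.0, z)          # ReLU activation (assumed here)

x = np.array([1.0, 2.0])        # illustrative inputs
w = np.array([0.5, -0.3])       # illustrative weights
print(neuron(x, w, b=0.1))      # 0.5*1 - 0.3*2 + 0.1 = 0.0, so ReLU outputs 0.0
```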
How to choose an activation function at output layer?
Match the activation function at the output layer based on the type of prediction problem
Output activation function for regression
Linear activation function
Output activation function for binary classification
Sigmoid/logistic activation function
Output activation function for multiclass classification
Softmax activation function
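A minimal NumPy sketch of the three output activations (values are illustrative):

```python
import numpy as np

def linear(z):                       # regression: pass the raw score through
    return z

def sigmoid(z):                      # binary classification: probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):                      # multiclass: probabilities that sum to 1
    e = np.exp(z - np.max(z))        # subtract the max for numerical stability
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])
print(linear(z))     # [2.  1.  0.1]
print(sigmoid(2.0))  # ~0.88
print(softmax(z))    # ~[0.66 0.24 0.10]
```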
How to choose an activation function at hidden layers
Start with the ReLU activation function and move to others if results are sub-optimal
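A minimal NumPy sketch of ReLU and two common fallbacks (leaky ReLU and tanh are assumptions about which "others" to try, not from the course):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)              # default starting point for hidden layers

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)   # small slope for z < 0 avoids "dead" neurons

z = np.array([-2.0, 0.0, 3.0])
print(relu(z))        # [0. 0. 3.]
print(leaky_relu(z))  # [-0.02  0.    3.  ]
print(np.tanh(z))     # zero-centered alternative
```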
What does it mean that NN is learning
Updating its weights
What should we initialize the weight vector with?
Random initialization: w ~ N(0, σ²), a normal distribution with mean 0 and variance σ²
How to choose sigma squared?
Xavier initialization
He initialization
Xavier initialization
σ² = 2 / ((# of neurons in previous layer) + (# of neurons in next layer))
He initialization
σ² = 2 / (# of neurons in previous layer)
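A minimal sketch of both schemes in NumPy (layer widths are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n_prev, n_next = 128, 64          # illustrative layer widths

# Xavier (Glorot): sigma^2 = 2 / (n_prev + n_next)
w_xavier = rng.normal(0.0, np.sqrt(2.0 / (n_prev + n_next)), size=(n_next, n_prev))

# He: sigma^2 = 2 / n_prev, commonly paired with ReLU
w_he = rng.normal(0.0, np.sqrt(2.0 / n_prev), size=(n_next, n_prev))

print(w_xavier.std(), w_he.std())  # empirical std close to the chosen sigma
```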
Result of backpropagation
A gradient vector (the partial derivatives of the loss with respect to each weight), used to update the weights toward a minimum of the error function
A gradient descent algorithm does
iteratively goes through the training dataset, modifying weights during each pass (epoch) to minimize the cost
Batch gradient descent
Calculate the error for each observation and, at the end of the training data, calculate the average error and update w
Stochastic gradient descent
Calculate the error after each observation and update w immediately
Mini-batch gradient descent
Split the data into small batches; calculate the error for each observation in a batch and, at the end of the batch, calculate the average error and update w
Preferred gradient descent method
Mini-batch gradient descent (it balances the stable updates of batch gradient descent with the speed of stochastic gradient descent)
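A minimal sketch of mini-batch gradient descent in NumPy, using linear regression with squared error as an illustrative model (names and hyperparameters are assumptions, not from the course):

```python
import numpy as np

def minibatch_gd(X, y, lr=0.1, batch_size=32, epochs=100, seed=0):
    """Mini-batch gradient descent: update w after each batch's average gradient."""
    rng = np.random.default_rng(seed)
    w = rng.normal(0.0, 0.01, size=X.shape[1])   # random small initialization
    for _ in range(epochs):                      # one pass over the data = one epoch
        idx = rng.permutation(len(X))            # shuffle each epoch
        for start in range(0, len(X), batch_size):
            batch = idx[start:start + batch_size]
            Xb, yb = X[batch], y[batch]
            grad = 2 * Xb.T @ (Xb @ w - yb) / len(Xb)  # average gradient over the batch
            w -= lr * grad                       # weight update after each batch
    return w

X = np.random.default_rng(1).normal(size=(200, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w
print(minibatch_gd(X, y))                        # approaches [1.0, -2.0, 0.5]
```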
Neural networks are almost always
over-parameterized, yet they perform well
Over-parameterization leads to
better learning
How much training data do I need for my neural network?
No definitive answer; a common rule of thumb is 10x more training observations than there are parameters in the network
Gradient Descent Algorithm definition
The use of gradients to search for the minimum of the error function
What gradient descent relates to
Error-function minimization: it minimizes the error function with respect to the network's weights
Weight update formula
new weight = old weight − (learning rate × gradient of the loss function with respect to that weight)
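In symbols, with η the learning rate (a standard formulation with illustrative numbers, not quoted from the course):

```latex
w_{t+1} = w_t - \eta \, \frac{\partial L}{\partial w}
% illustrative numbers: w_t = 0.8, eta = 0.1, dL/dw = 0.5
% w_{t+1} = 0.8 - 0.1 \times 0.5 = 0.75
```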