Week 2 Flashcards
Defining characteristic of DNN
More than 1 hidden layer
Advantages of SGD
Efficient for large sample sizes
Implementable numerically
Can be ‘controlled’
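A minimal Python sketch of the idea (names here are illustrative, not from the lecture): the gradient is estimated on a random mini-batch instead of the full sample, which is what makes SGD cheap for large samples, and the learning rate is the knob that 'controls' it.

import numpy as np

def sgd(theta, X, y, grad_loss, lr=0.01, n_steps=1000, batch_size=32, seed=0):
    # grad_loss(theta, Xb, yb) should return the gradient of the loss on a mini-batch
    rng = np.random.default_rng(seed)
    for _ in range(n_steps):
        idx = rng.choice(len(X), size=batch_size, replace=False)
        g = grad_loss(theta, X[idx], y[idx])  # mini-batch gradient estimate
        theta = theta - lr * g                # step against the gradient; lr 'controls' the update
    return theta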
RNN
Recurrent Neural Networks sequentially feed their output back into the network
Connection from FNN to RNN
RNNs can be reduced to FNNs by UNFOLDING them
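A minimal Python sketch of both views, assuming a simple tanh cell (W, U, b and the sequence xs are illustrative):

import numpy as np

def rnn(xs, W, U, b, h0):
    # Recurrent view: the hidden state h is fed back into the network at every step
    h = h0
    for x in xs:                        # xs = sequence of input vectors
        h = np.tanh(W @ h + U @ x + b)
    return h

# Unfolded view: for a fixed sequence length T, the loop is just T copies of the
# same layer composed one after another, i.e. a feedforward network whose layers
# all share the weights (W, U, b).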
First NN
McCulloch and Pitts
Perceptron & problems of
Rosenblatt ‘58
What started AI winter
‘69 Minsky and Papert showed XOR couldn’t be represented by a single-layer perceptron
FULLY CONNECTED
If all entries of each layer matrix L_i in the NN are non-zero
Universal approximation property
Let g:R -> R be a measurable function such that:
a) g is not a polynomial function
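For reference, one common full statement (Leshno, Lin, Pinkus, Schocken 1993) adds the following conditions; the lecture's exact version may differ:
b) g is locally bounded
c) the closure of the set of discontinuity points of g has Lebesgue measure zero
Under a)-c), one-hidden-layer FNNs with activation g are dense in C(K) for every compact set K.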
Define FNN
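A common formulation (the lecture's notation may differ): a feedforward NN with L layers is an alternating composition of affine maps and componentwise activations,
N(x) = A_L(g_{L-1}(A_{L-1}(... g_1(A_1(x)) ...))),
where A_i(z) = W_i z + b_i with weight matrix W_i and bias vector b_i, and each g_i is applied componentwise.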
Differences between (hyper)params
Hyper:
Set by hand
Features
Params:
Chosen by machine (weights and biases)
Optimised by SGD
Architecture of network
Hyperparameters and Activation Functions (things chosen by you)
Dense layer
Entire layer is connected (all weight entries non-zero)
Number of parameters that characterise N
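A sketch of the standard count, assuming a fully connected FNN with layer widths n_0 (input), n_1, ..., n_L (output):
#params = sum_{i=1}^{L} n_i (n_{i-1} + 1)   (n_i n_{i-1} weights plus n_i biases per layer)
E.g. widths (3, 4, 2): 4*(3+1) + 2*(4+1) = 16 + 10 = 26 parameters.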
Adding units
Continuity and differentiability of NN
Bottom line:
If every activation function is continuous, then so too is the NN
The NN as a whole is only as many times continuously differentiable as its LEAST differentiable activation function
One dimensional activation functions (entire table)
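A sketch of the usual entries (assuming the table covers the activations mentioned on the other cards):
Heaviside: H(x) = 1 if x >= 0, else 0
Sigmoid: sigma(x) = 1 / (1 + e^{-x})
Tanh: tanh(x) = (e^x - e^{-x}) / (e^x + e^{-x})
ReLU: max{0, x}
Leaky / Parametric ReLU: max{alpha x, x}, 0 < alpha < 1
ELU: x if x >= 0, alpha (e^x - 1) otherwise
Softplus: ln(1 + e^x)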
Dead ReLU problem & solution
A layer of ReLU units receives only negative inputs -> producing constant (zero) output with zero gradient
This can freeze gradient-based algorithms
Therefore use leaky ReLU or Parametric ReLU (or ELU)
Usually 0 < α < 1
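A small NumPy sketch of the effect (illustrative): with all-negative pre-activations, ReLU's output and gradient are identically zero, while leaky ReLU keeps a non-zero gradient.

import numpy as np

z = np.array([-2.0, -0.5, -3.1])         # pre-activations, all negative ('dead' regime)
relu = np.maximum(0.0, z)                # [0, 0, 0]  constant output
relu_grad = (z > 0).astype(float)        # [0, 0, 0]  gradient frozen
alpha = 0.01                             # usually 0 < alpha < 1
leaky = np.maximum(alpha * z, z)         # small negative outputs, not constant
leaky_grad = np.where(z > 0, 1.0, alpha) # [0.01, 0.01, 0.01]  still learns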
Multi-dimensional activation
When is identity function useful
Output layer
Limitation of Heaviside
As it is not continuous, it can’t be used in gradient-based algorithms
Saturating activation functions
Output is bounded
Sigmoid, tanh
Continuously differentiable counterpart of ReLU
Softplus
Boltzmann dist
From stat physics
Analogous to Multinomial logistic regression
Which is the standard activation function in image recognition
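Assuming this card refers to the softmax activation: for x in R^K,
softmax(x)_i = e^{x_i} / sum_{j=1}^{K} e^{x_j},
i.e. the Boltzmann/Gibbs form from statistical physics, used as the output activation for multi-class classification (as in image recognition).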
Motivation for maxout
Several simple, convex, non-linear functions can be expressed as maxima of affine functions, e.g.
ReLU(x) = max{0, x}
|x| = max{-x, x}
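A maxout unit itself (one common definition; the lecture's notation may differ) takes the maximum over k learned affine functions:
maxout(x) = max{ w_1 . x + b_1, ..., w_k . x + b_k }
so ReLU and |x| above are special cases with fixed, non-learned affine pieces.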
Sup Norm
Lp norm
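For reference, the standard definitions for a function f on a domain D:
Sup norm: ||f||_inf = sup_{x in D} |f(x)|
Lp norm: ||f||_p = ( integral_D |f(x)|^p dx )^{1/p}, 1 <= p < infinity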
Def universal approximation property
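A common way to state it, using the two norms from the previous cards (the lecture's exact formulation may differ): for every continuous target function on a compact set K and every eps > 0 there is an FNN within sup-norm distance eps of it on K; similarly, every Lp target can be approximated to within eps in the Lp norm.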
Limitation of UAP
Result is non-constructive:
It does not tell us what the approximating NNs f and h look like, just that they exist
It is also non-quantitative:
It doesn’t tell us how many hidden units are required to build these networks