Neural Networks and Deep Learning Flashcards
Neuron, activation function, network architecture, point of view of one node, hypothesis set, matrix notation
Neuron: function x –> sigma(<v,x>)
Activation function: sigma: R -> R
examples:
- sign
- threshold
- sigmoid
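A minimal sketch of a single neuron with the three activation functions listed above. The convention sign(0) = 1 and the function names are assumptions made for illustration; the threshold is taken as the indicator of a > 0.

```python
import math

def sign(a):
    # Assumed convention: sign(0) = 1
    return 1.0 if a >= 0 else -1.0

def threshold(a):
    # Indicator function: 1 if a > 0, else 0
    return 1.0 if a > 0 else 0.0

def sigmoid(a):
    # Smooth approximation of the threshold function
    return 1.0 / (1.0 + math.exp(-a))

def neuron(v, x, sigma):
    """Compute sigma(<v, x>) for a weight vector v and an input x."""
    inner = sum(v_i * x_i for v_i, x_i in zip(v, x))
    return sigma(inner)
```

For example, `neuron([1.0, -1.0], [2.0, 1.0], sign)` applies sign to the inner product 2 - 1 = 1.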
Network architecture: (V,E, sigma)
vertices, edges, activation function
Point of view of one node
Hypothesis set: H_{V,E,sigma} = {h_{V,E,sigma,w} : w is a mapping from E to R}
w are the weights
Matrix notation
General construction of NN for a given Boolean formula
Let’s take an arbitrary function f: {-1,1}^d -> {-1,1}
Goal: build a NN that corresponds to f (if the input is x, then the prediction of such NN is f(x))
- consider x such that f(x) = 1 : for each such x there is a neuron in the only hidden layer that corresponds to x. The neuron implements:
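g_i(x') = sign(<x_i, x'> - d + 1)
where x_i is the i-th vector with f(x_i) = 1, x' is the input of the network and d is the input dimension (g_i(x') = 1 only if x' = x_i)
output node: “implements” h(x') = sign(SUM_i g_i(x') + k - 1)
where k = # of vectors x such that f(x) = 1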
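The construction can be checked by brute force for small d. The sketch below is only an illustration: `build_network` and `parity` are hypothetical names, and the hidden layer is represented as a plain list of the vectors on which f is 1.

```python
from itertools import product

def sign(a):
    # The arguments used below are never 0, so the convention at 0 is irrelevant
    return 1 if a > 0 else -1

def build_network(f, d):
    """Depth-2 network for f: {-1,1}^d -> {-1,1}, one hidden neuron per x with f(x) = 1."""
    positives = [x for x in product((-1, 1), repeat=d) if f(x) == 1]
    k = len(positives)

    def h(x_in):
        # Hidden neuron i outputs 1 only when x_in equals positives[i]
        hidden = [
            sign(sum(a * b for a, b in zip(x_i, x_in)) - d + 1)
            for x_i in positives
        ]
        # Output neuron: 1 as soon as one hidden neuron fires
        return sign(sum(hidden) + k - 1)

    return h

def parity(x):
    # Example target: product of the coordinates
    result = 1
    for x_j in x:
        result *= x_j
    return result

h = build_network(parity, 3)
agrees = all(h(x) == parity(x) for x in product((-1, 1), repeat=3))
```

The hidden layer has k neurons, which can be as large as 2^d, so the construction shows expressiveness rather than efficiency.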
Expressiveness of NNs
Every Boolean function can be implemented using a neural network of depth 2. NNs are universal approximators.
Sample complexity, runtime of learning NNs
Sample complexity: quantity of data needed to learn with NN
- VC-dim of H_{V,E,sign} = O(|E| log|E|)
- VC-dim of H_{V,E,sigmoid} = O(|V|^2 |E|^2)
Large NNs require a lot of data
Runtime of learning: applying the ERM rule with respect to H_{V,E,sign} is NP-hard
So we train NN using Stochastic Gradient Descent
Forward propagation algorithm
PSEUDOCODE
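A minimal sketch of forward propagation for a layered network, assuming sigmoid activations and no separate bias terms (a bias can be modelled by a constant input). The layout `weights[t][j][i]`, the weight of the edge from node i of layer t to node j of layer t+1, is an assumption made for this example.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def forward(weights, x, sigma=sigmoid):
    """Return the outputs of every layer, starting with the input layer."""
    outputs = [list(x)]
    for W in weights:
        prev = outputs[-1]
        # Input of node j: weighted sum of the outputs of the previous layer
        a = [sum(w_ji * o_i for w_ji, o_i in zip(row, prev)) for row in W]
        # Output of node j: activation applied to its input
        outputs.append([sigma(a_j) for a_j in a])
    return outputs

# Hypothetical 2-2-1 network
example_weights = [
    [[1.0, -1.0], [0.5, 0.5]],  # layer 0 -> layer 1
    [[1.0, 1.0]],               # layer 1 -> layer 2
]
layer_outputs = forward(example_weights, [1.0, 1.0])
```

The last entry of `layer_outputs` is the prediction of the network; the intermediate entries are what backpropagation reuses.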
SGD and Backpropagation algorithm (pseudocode: only structure)
Based on SGD
PSEUDOCODE
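A minimal sketch of the structure: SGD picks a random example, backpropagation computes the gradient, and the weights take a step against it. It assumes a layered network with sigmoid activations, the squared loss 0.5 * ||o - y||^2, a fixed step size `eta`, and the layout `weights[t][j][i]`; none of these choices comes from the card itself.

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def forward(weights, x):
    outputs = [list(x)]
    for W in weights:
        prev = outputs[-1]
        outputs.append([sigmoid(sum(w * o for w, o in zip(row, prev))) for row in W])
    return outputs

def loss(weights, x, y):
    out = forward(weights, x)[-1]
    return 0.5 * sum((o - t) ** 2 for o, t in zip(out, y))

def backprop(weights, x, y):
    """Gradient of the squared loss with respect to every weight."""
    outputs = forward(weights, x)
    # delta[j] = d loss / d o_j for the current layer (start at the output layer)
    delta = [o - t for o, t in zip(outputs[-1], y)]
    grads = [None] * len(weights)
    for t in range(len(weights) - 1, -1, -1):
        o_next, o_prev = outputs[t + 1], outputs[t]
        # d loss / d a_j, using sigmoid'(a) = o * (1 - o)
        da = [d * o * (1.0 - o) for d, o in zip(delta, o_next)]
        grads[t] = [[da_j * o_i for o_i in o_prev] for da_j in da]
        # Propagate the error back to the previous layer
        delta = [
            sum(weights[t][j][i] * da[j] for j in range(len(da)))
            for i in range(len(o_prev))
        ]
    return grads

def sgd(weights, data, eta=0.5, steps=200, seed=0):
    """Update the weights in place, one random example per step."""
    rng = random.Random(seed)
    for _ in range(steps):
        x, y = rng.choice(data)
        grads = backprop(weights, x, y)
        for W, G in zip(weights, grads):
            for row, g_row in zip(W, G):
                for i in range(len(row)):
                    row[i] -= eta * g_row[i]
    return weights
```

A finite-difference comparison against `loss` is a simple way to check a backpropagation implementation like this one.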
Regularized NNs
Instead of training a NN by minimizing L_S(h), find h that minimizes
L_S(h) + (lambda/2) SUM (w(t))^2, i.e. the empirical loss plus lambda/2 times the sum of the squared weights
where lambda is the regularization parameter
We find h by SGD or improved algorithms.
This is called the squared weight decay regularizer.
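A small sketch of how the regularizer changes the objective and a gradient step, assuming the weights are stored as a flat list. `regularized_loss` and `weight_decay_step` are hypothetical helper names.

```python
def regularized_loss(empirical_loss, weights, lam):
    """L_S(h) + (lambda / 2) * sum of squared weights."""
    return empirical_loss + 0.5 * lam * sum(w * w for w in weights)

def weight_decay_step(weights, grads, eta, lam):
    """One gradient step on the regularized objective.

    The gradient of (lambda / 2) * w^2 is lambda * w, so each weight
    is pulled toward zero in addition to following the loss gradient.
    """
    return [w - eta * (g + lam * w) for w, g in zip(weights, grads)]
```

With zero loss gradient, each step multiplies every weight by (1 - eta * lambda), which is where the name "weight decay" comes from.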