NeuralNets Flashcards
What is the equation of the sigmoid function?
the sigmoid function compresses values into [0, 1]
sigma(x) = 1 / (1 + exp(-x))
where x is a wtd sum (linear combination) w1*a1 + w2*a2 + … + wn*an
the resulting scalar “activation” measures how positive that neuron’s weighted sum is.
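A minimal sketch of this formula in NumPy; the activation and weight values below are made up for illustration:

```python
import numpy as np

def sigmoid(x):
    # squashes any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

# hypothetical previous-layer activations and weights
a = np.array([0.2, 0.9, 0.4])
w = np.array([1.5, -0.8, 2.0])
activation = sigmoid(np.dot(w, a))  # scalar "activation" of this neuron
print(activation)
```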
What does the bias term do in a neural network neuron linear equation?
Perhaps we only want a neuron to become active when its weighted sum is greater than some threshold, say a value of 10.
That means we want some BIAS for the NEURON to be INACTIVE. To achieve this additional BIAS, we add -10 to the neuron’s wtd sum (linear combination):
sigma(w1*a1 + w2*a2 + … + wn*an - 10)
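A small sketch of how that -10 bias shifts the activation threshold, assuming the same sigmoid as above; the weights and activations are illustrative only:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neuron(a, w, b):
    # activation = sigma(wtd sum + bias); b = -10 biases the neuron toward being INACTIVE
    return sigmoid(np.dot(w, a) + b)

a = np.array([0.1, 0.7, 0.3])
w = np.array([4.0, 6.0, 5.0])
print(neuron(a, w, b=0.0))    # fairly easy to activate
print(neuron(a, w, b=-10.0))  # now the wtd sum must exceed ~10 to activate
```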
What is a neuron?
A neuron is a function: it takes the activations of the previous layer as input and outputs an activation to the next layer.
In summary, a neuron is a thing that holds a NUMBER.
NN Architecture for MNIST:
Say an MNIST digit image is 28x28 pixels. Then the NN INPUT layer will have 28*28 = 784 nodes, and each node holds the activation (grayscale intensity) of one particular pixel among the 784.
There will be some user-specified number of HIDDEN LAYERS, each with a user-specified number of nodes.
There will be 10 OUTPUT layer nodes, one for each possible MNIST outcome 0, 1,…,9.
Each node holds an ACTIVATION in [0,1], computed as sigma(WEIGHTED SUM of the activations FED FORWARD FROM the PREVIOUS layer plus a BIAS term).
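A sketch of this architecture as a feed-forward pass; the hidden-layer sizes (16, 16) and the random weights are assumptions for illustration, not part of the flashcards:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

layer_sizes = [784, 16, 16, 10]  # input, two hidden layers, output

# randomly initialized weights and biases, one pair per layer transition
rng = np.random.default_rng(0)
weights = [rng.standard_normal((m, n)) for n, m in zip(layer_sizes[:-1], layer_sizes[1:])]
biases  = [np.zeros(m) for m in layer_sizes[1:]]

def feed_forward(a):
    # a: 784 pixel intensities in [0, 1]
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)
    return a  # 10 output activations, one per digit 0-9

x = rng.random(784)           # stand-in for a flattened 28x28 image
print(feed_forward(x).shape)  # (10,)
```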
Provide a high-level description of how a NN actually learns.
Say we have MNIST 28x28-pixel digit images, s.t. there is a 784x1 input layer, some arbitrarily sized hidden layers, and a 10x1 output layer.
The hidden layers compose WTD SUMS (linear combinations) of low-level COMPONENTS of the data. e.g. an upper circle is composed of a small arc in quadrant 1 (Q1), an arc in Q2, …, an arc in Q4, s.t. some inner node activates when it sees an upper circle. These COMBINATIONS of small PARTS are FED FORWARD to deeper layers to BUILD HIGHER-LEVEL COMPONENTS (small PARTS ==> larger PATTERNS ==> whole OBJECT).
The LAST HIDDEN LAYER (the network’s penultimate layer) is responsible for detecting PARTS of digits. e.g. say its nodes output activations in [0,1]:
node0: upper circle
node1: lower circle
node2: long vertical line
node3: quadrant 1 vertical line
...
node9: diag from lower left
Then if that layer is “activated” with relatively large values for node0 (upper circle) and node2 (long vertical line), the output layer will have maximal VALUE at node “9” and return “9” as the prediction.
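A toy sketch of that last step, assuming hypothetical “part detector” activations and a hypothetical weight row for the output node “9”; every number below is made up:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# hypothetical penultimate-layer activations: node0 = upper circle, node1 = lower circle,
# node2 = long vertical line, ... (10 "part detector" nodes, values made up)
parts = np.array([0.95, 0.05, 0.90, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1])

# hypothetical weight row for the output node "9": it cares about node0 and node2
w_digit9 = np.array([3.0, -1.0, 3.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
b_digit9 = -2.0

print(sigmoid(w_digit9 @ parts + b_digit9))  # close to 1 => the NN predicts "9"
```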
How do NNs learn?
Introduce a COST func, e.g. SSR (sum of squared residuals), to score the NN at each epoch against a SUPERVISED train set.
Introduce the concept of MINIMIZING the COST func, which can be done with GD (gradient descent).
GD uses the gradient: in the case of MNIST with some number of hidden layers, the negative gradient gives the DIRECTION of steepest descent in a space with 1000s of dimensions (one per weight and bias).
Thus, a NN LEARNS via GD, where GD UPDATES the network’s PARAMETERS (weights and biases) to REDUCE the TOTAL LOSS.
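A minimal GD sketch on a toy SSR cost with one weight and one bias (a linear model rather than a full NN; the data and learning rate are made up), just to show the update rule:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])  # made-up inputs
y = np.array([1.0, 3.0, 5.0, 7.0])  # made-up supervised targets (roughly y = 2x + 1)

w, b, lr = 0.0, 0.0, 0.01           # parameters and learning rate
for epoch in range(2000):
    pred = w * x + b
    # gradient of SSR = sum((pred - y)^2) w.r.t. w and b
    grad_w = 2 * np.sum((pred - y) * x)
    grad_b = 2 * np.sum(pred - y)
    w -= lr * grad_w                # step in the direction of steepest descent
    b -= lr * grad_b

print(w, b)                         # approaches 2 and 1
```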
What does back propagation do in NNs?
Given the output layer’s predictions, we want the NN to ADJUST its output ACTIVATIONS to reduce the SSR cost at each epoch.
Recall a node’s ACTIVATION VALUE results from sigma(WTD SUM of prev-layer activations FED FWD plus a bias). Then to change that activation value, we can change the incoming activations ai, the weights wi, or the bias b in the node’s activation func.
Hebbian theory: “Neurons that fire together, wire together”. By this idea, we also request changes to the PREVIOUS layer’s activations, and averaging all requested nudges over a mini-batch gives the GD update.
Given the supervised target, we KNOW what we WANT to happen at the OUTPUT layer, and therefore which activations we desire in the PENULTIMATE hidden layer. To obtain this DESIRED result, PROPAGATE the relative weight, bias, and activation adjustments BACKWARDS, layer by layer, so that the PENULTIMATE (and earlier) activations are NUDGED TOWARD values that produce the desired SUPERVISED target.
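A minimal backprop sketch for one training example through a single hidden layer; the layer sizes, learning rate, target, and input are all assumptions for illustration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)

# tiny network: 4 inputs -> 3 hidden -> 2 outputs (sizes made up)
W1, b1 = rng.standard_normal((3, 4)), np.zeros(3)
W2, b2 = rng.standard_normal((2, 3)), np.zeros(2)

x = rng.random(4)          # stand-in input
t = np.array([1.0, 0.0])   # supervised target

lr = 0.5
for step in range(1000):
    # forward pass
    a1 = sigmoid(W1 @ x + b1)
    a2 = sigmoid(W2 @ a1 + b2)

    # backward pass: propagate d(cost)/d(activation) from output back to hidden layer
    # cost = sum((a2 - t)^2); sigmoid'(z) = a * (1 - a)
    delta2 = 2 * (a2 - t) * a2 * (1 - a2)      # output-layer nudge
    delta1 = (W2.T @ delta2) * a1 * (1 - a1)   # hidden-layer nudge, propagated BACKWARDS

    # gradient-descent updates for weights and biases
    W2 -= lr * np.outer(delta2, a1); b2 -= lr * delta2
    W1 -= lr * np.outer(delta1, x);  b1 -= lr * delta1

print(a2)  # moves toward the target [1, 0]
```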