12 Learning From Examples: Neural networks Flashcards
Neural networks: Background
The human brain is a huge network of neurons. A neuron is a basic processing unit that collects, processes and disseminates electrical signals. Early AI tried to imitate the brain by building artificial neural networks (ANNs), but the work ran into theoretical limits and largely "disappeared". In the 1980s-90s interest in ANNs resurfaced, because of new theoretical developments and massive industrial interest and applications.
The basic unit of neural networks
The network consists of units (nodes, “neurons”) connected by links:
• Each link carries an activation a_i from unit i to unit j
• The link from unit i to unit j has a weight W_{i,j}
• A bias weight W_{0,j} is connected to a fixed input a_0 = 1
Activation of a unit j
• Calculate input: in_j = Sum_{i=0..n} W_{i,j} a_i
• Derive output: a_j = g(in_j), where g is the activation function
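A minimal Python sketch of this computation for a single unit (the choice of sigmoid for g and the example weights/inputs are made up for illustration):

import math

def unit_activation(weights, inputs, g):
    # weights[0] is the bias weight W_{0,j}; the fixed input a_0 = 1 is prepended
    in_j = sum(w * a for w, a in zip(weights, [1.0] + list(inputs)))
    return g(in_j)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

a_j = unit_activation([0.5, -1.0, 2.0], [0.3, 0.8], sigmoid)  # example weights and inputs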
Activation functions
Activation function should separate well:
• “Active” (near 1) for desired input
• “Inactive” (near 0) otherwise
It should be non-linear. Most used functions:
Threshold function and Sigmoid function.
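A small Python sketch of the two, using their standard definitions (not taken verbatim from the notes above):

import math

def threshold(in_j):
    # hard step: "active" (1) when the weighted input is non-negative, "inactive" (0) otherwise
    return 1.0 if in_j >= 0 else 0.0

def sigmoid(in_j):
    # smooth, differentiable approximation of the threshold; output lies in (0, 1)
    return 1.0 / (1.0 + math.exp(-in_j))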
Neural network structures
Two main structures:
Feed-forward (acyclic) networks
– Represents a function of its inputs – No internal state
Recurrent network:
– Feeds outputs back to inputs
– May be stable, oscillate or become chaotic
– Output depends on initial state
Recurrent networks are the most interesting and “brain-like”, but also most difficult to understand.
Feed-forward networks as functions
• A FF network calculates a function of its inputs
• The network may contain hidden units/layers
• By changing #layers/units and their weights, different functions can be realized
• FF networks are often used for classification
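A small sketch of one forward pass through such a network with one hidden layer (layer sizes and weight values are illustrative assumptions):

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def layer(inputs, weight_matrix):
    # weight_matrix[j] holds [W_{0,j}, W_{1,j}, ...] for unit j; a_0 = 1 is the bias input
    extended = [1.0] + list(inputs)
    return [sigmoid(sum(w * a for w, a in zip(ws, extended))) for ws in weight_matrix]

def feed_forward(inputs, hidden_weights, output_weights):
    # the whole network computes a function of its inputs: hidden layer, then output layer
    return layer(layer(inputs, hidden_weights), output_weights)

# 2 inputs -> 2 hidden units -> 1 output (weights chosen arbitrarily)
y = feed_forward([0.5, -0.2],
                 hidden_weights=[[0.1, 0.4, -0.6], [-0.3, 0.8, 0.2]],
                 output_weights=[[0.05, 1.0, -1.0]])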
Perceptrons
Single-layer feed-forward neural networks are called perceptrons, and were the earliest networks to be studied. Perceptrons can only act as linear separators, a small subset of all interesting functions. This partly explains why neural network research was discontinued for a long time.
Perceptron learning algorithm
How to train the network to do a certain function (e.g. classification) based on a training set of input/output pairs?
Basic idea:
• Adjust network link weights to minimize some measure of the error on the training set
• Adjust weights in direction that minimizes error
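A sketch of the classic perceptron update rule implementing this idea (the learning rate, epoch count and the AND-function training set are assumptions chosen for illustration):

def train_perceptron(examples, n_inputs, alpha=0.1, epochs=100):
    # examples: list of (inputs, target) pairs with targets 0 or 1
    w = [0.0] * (n_inputs + 1)                 # w[0] is the bias weight
    for _ in range(epochs):
        for x, y in examples:
            xs = [1.0] + list(x)
            out = 1.0 if sum(wi * xi for wi, xi in zip(w, xs)) >= 0 else 0.0
            err = y - out                      # error on this example
            w = [wi + alpha * err * xi for wi, xi in zip(w, xs)]  # adjust weights to reduce error
    return w

# learning the (linearly separable) AND function
w = train_perceptron([((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)], n_inputs=2)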
Performance of perceptrons vs. decision trees
Perceptrons are better at learning linearly separable problems; decision trees are better at the "restaurant problem".
Multi-layer feed-forward networks
Adds hidden layers:
• The most common is one extra layer
• The advantage is that more functions can be realized, in effect by combining several perceptron functions
It can be shown that:
• A feed-forward network with a single sufficiently large hidden layer can represent any continuous function
• With two hidden layers, even discontinuous functions can be represented
However:
• Cannot easily tell which functions a particular network is able to represent
• Not well understood how to choose structure/number of layers for a particular problem
[Feed-forward network with 10 inputs, one output and one hidden layer - suitable for “restaurant problem”.]
More complex activation functions
Multi-layer networks can combine simple (linear separation) perceptron activation functions into more complex functions.
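For example, XOR is not linearly separable and cannot be computed by a single perceptron, but it can be computed by combining two threshold units in a hidden layer. The weights below are one hand-picked solution, shown only as an illustration:

def step(x):
    return 1 if x >= 0 else 0

def xor(x1, x2):
    h1 = step(x1 + x2 - 0.5)      # roughly "x1 OR x2"
    h2 = step(x1 + x2 - 1.5)      # roughly "x1 AND x2"
    return step(h1 - h2 - 0.5)    # OR but not AND  ->  XOR

assert [xor(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]] == [0, 1, 1, 0]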
Learning in multi-layer networks
In principle the same as for perceptrons: adjust weights to minimize the error. The main difference is what "error" means at internal (hidden) nodes, since there is no target value to compare their output to. Solution: propagate the error at the output nodes back to the hidden layers, and propagate it successively further back if the network has several hidden layers. The resulting back-propagation algorithm is the standard learning method for neural networks.
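A minimal sketch of one back-propagation step for a network with a single hidden layer, sigmoid units and squared error (the helper names, learning rate and example weights are assumptions, not taken from the notes):

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def backprop_step(x, y, W_h, W_o, alpha=0.5):
    # forward pass (bias input a_0 = 1 prepended at each layer)
    xs = [1.0] + list(x)
    hidden = [sigmoid(sum(w * a for w, a in zip(ws, xs))) for ws in W_h]
    hs = [1.0] + hidden
    out = [sigmoid(sum(w * a for w, a in zip(ws, hs))) for ws in W_o]

    # output-layer error terms: delta_k = g'(in_k) * (y_k - out_k)
    delta_o = [o * (1 - o) * (t - o) for o, t in zip(out, y)]
    # propagate error back to hidden units: delta_j = g'(in_j) * sum_k W_{j,k} delta_k
    delta_h = [h * (1 - h) * sum(W_o[k][j + 1] * delta_o[k] for k in range(len(W_o)))
               for j, h in enumerate(hidden)]

    # gradient-descent weight updates for both layers
    W_o = [[w + alpha * d * a for w, a in zip(ws, hs)] for ws, d in zip(W_o, delta_o)]
    W_h = [[w + alpha * d * a for w, a in zip(ws, xs)] for ws, d in zip(W_h, delta_h)]
    return W_h, W_o

# one gradient-descent step on a single (input, target) example
W_h = [[0.1, 0.2, -0.1], [0.0, -0.3, 0.4]]     # 2 inputs -> 2 hidden units
W_o = [[0.05, 0.6, -0.6]]                      # 2 hidden units -> 1 output
W_h, W_o = backprop_step([1.0, 0.0], [1.0], W_h, W_o)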
Learning neural network structure
Need to learn the network structure as well:
• The learning algorithms above assume a fixed network structure
• However, we do not know in advance what structure will be necessary and sufficient
Solution approach:
• Try different configurations, keep the best
• Search space is very large (# layers and # nodes)
• “Optimal brain damage”: Start with full network, remove nodes selectively (optimally)
• “Tiling”: Start with minimal network that covers subset of training set, expand incrementally