Basics of Learning Flashcards
Perceptron
A perceptron is an FFNN composed of MCP neurons with a step or signum activation function
and a threshold T
What is the counterpart of the biological stimulus for a perceptron?
The input pattern
For what task can a perceptron be used?
Classification of patterns
How is the threshold managed together with the weights?
The threshold is considered as an additional weight of the neuron with a virtual constant input equal to -1
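A minimal Python sketch of this convention (numpy assumed; the numbers and names are illustrative, not from the course material):

```python
import numpy as np

def augment(x):
    """Append the virtual constant input -1 so the threshold
    becomes just another weight of the neuron."""
    return np.append(x, -1.0)

w = np.array([0.4, -0.2])   # example input weights
T = 0.1                     # example threshold
w_aug = np.append(w, T)     # threshold stored as the last weight

x = np.array([1.0, 0.5])
activation = np.dot(w_aug, augment(x))   # equals w.x - T
```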
Decision boundary: what is it and how is it oriented?
The locus of input values for which the action potential is 0. It is orthogonal to the weight vector
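A short worked derivation of the orthogonality claim (notation assumed: w weight vector, T threshold):

```latex
\text{Boundary: } \{\mathbf{x} : \mathbf{w}\cdot\mathbf{x} - T = 0\}.
\quad \mathbf{w}\cdot\mathbf{x}_1 - T = 0,\;\; \mathbf{w}\cdot\mathbf{x}_2 - T = 0
\;\Rightarrow\; \mathbf{w}\cdot(\mathbf{x}_1 - \mathbf{x}_2) = 0
```

so w is orthogonal to every direction lying inside the boundary.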
Rosenblatt perceptron learning rule
An incremental procedure: we start from an initial weight vector, then
1) the weight vector is iteratively updated using an on-line strategy
2) each pattern k in the training set contributes to the weight-increment vector by means of the error signal
3) one iteration of the iterative procedure requires the evaluation of all R patterns
dw = eta * (t_k - y_k) * x_k   (t_k target, y_k perceptron output for pattern k, eta learning rate)
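A minimal Python sketch of the rule with on-line updates (numpy assumed; patterns already augmented with the -1 virtual input, targets in {-1, +1}):

```python
import numpy as np

def perceptron_train(X, t, eta=0.1, max_epochs=100):
    """X: (R, n) augmented patterns, t: targets in {-1, +1}."""
    w = np.zeros(X.shape[1])
    for epoch in range(max_epochs):          # one iteration = all R patterns
        errors = 0
        for x_k, t_k in zip(X, t):
            y_k = np.sign(np.dot(w, x_k))    # step/signum output
            if y_k != t_k:                   # update only on misclassification
                w += eta * (t_k - y_k) * x_k
                errors += 1
        if errors == 0:                      # converged (linearly separable data)
            break
    return w
```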
When does the Rosenblatt perceptron learning rule correct the weight vector?
If and only if a misclassification occurs
What kind of problems can the perceptron solve?
Only the linearly separable ones: a hyperplane must exist that completely separates the two classes of patterns
How to extend basic perceptron?
- continuous output
- non-linear continuous activation function
- smooth transition near 0
How to pass from continuous output to categorical?
Softmax network (or manual thresholding in simple cases)
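A minimal Python sketch of the softmax-then-argmax step (numpy assumed; the example outputs are illustrative):

```python
import numpy as np

def softmax(u):
    """Map continuous outputs u to class probabilities (numerically stable form)."""
    z = u - np.max(u)            # subtract max for numerical stability
    e = np.exp(z)
    return e / np.sum(e)

u = np.array([2.0, 0.5, -1.0])   # continuous network outputs (example)
p = softmax(u)                   # probabilities summing to 1
label = int(np.argmax(p))        # categorical decision
```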
Error function with C1 activation functions
E = 1/2 * sum_k sum_i (t_i - u_i)^2, summed over the training patterns k and the output units i (t_i target, u_i network output)
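A minimal Python sketch of this sum-of-squares error (numpy assumed; array names are illustrative):

```python
import numpy as np

def sse(targets, outputs):
    """E = 1/2 * sum over patterns k and output units i of (t_i - u_i)^2."""
    return 0.5 * np.sum((targets - outputs) ** 2)
```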
Why use the square in the error function?
It makes the error positive and penalizes large errors more
Gradient descent
It is an optimization algorithm that approaches a local minimum of a function by taking steps proportional to the negative of the gradient of the function at the current point
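A minimal Python sketch of the idea (numpy assumed; the example function and step size are illustrative):

```python
import numpy as np

def gradient_descent(grad, w0, eta=0.01, n_steps=1000):
    """Take steps proportional to the negative gradient at the current point."""
    w = np.array(w0, dtype=float)
    for _ in range(n_steps):
        w -= eta * grad(w)       # step against the gradient
    return w

# example: minimize f(w) = (w - 3)^2, whose gradient is 2*(w - 3)
w_min = gradient_descent(lambda w: 2 * (w - 3), w0=[0.0])   # approaches 3
```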
what is the learning rate?
It modulates the amplitude of the gradient vector in gradient descent
Delta rule update formula
dw_ij = eta * (t_i - u_i) * f'(P_i) * x_j   (eta learning rate)
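A minimal Python sketch of one delta-rule step for a single-layer network (numpy assumed; f and f_prime are the activation and its derivative, supplied by the caller):

```python
import numpy as np

def delta_rule_update(w, x, t, eta, f, f_prime):
    """One on-line delta-rule step: dw_ij = eta * (t_i - u_i) * f'(P_i) * x_j."""
    P = w @ x                       # action potentials, one per output unit
    u = f(P)                        # continuous outputs
    dw = eta * np.outer((t - u) * f_prime(P), x)
    return w + dw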
Derivative of logsig
f(1-f)
derivative of tanh
1-f^2
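A minimal Python sketch of both derivatives, expressed through the activation value itself (numpy assumed):

```python
import numpy as np

def logsig(P):
    return 1.0 / (1.0 + np.exp(-P))

def logsig_prime(P):
    f = logsig(P)
    return f * (1.0 - f)          # derivative written in terms of f

def tanh_prime(P):
    f = np.tanh(P)
    return 1.0 - f ** 2           # derivative written in terms of f
```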
What does it mean to “reinforce the learning”?
Feeding the network several times with the same training set of patterns
Problem of local minima: how to solve it?
If we start near a local minimum we may end up there instead of at the global minimum. Since the result depends on the starting guess, starting from a range of different initial weight sets increases our chances of finding the global minimum
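A minimal Python sketch of this restart heuristic (numpy assumed; train and error_of are hypothetical placeholders for the actual training routine and error evaluation):

```python
import numpy as np

def train_with_restarts(train, error_of, n_restarts=10, n_weights=20, seed=0):
    """Run training from several random initial weight sets, keep the best run."""
    rng = np.random.default_rng(seed)
    best_w, best_err = None, np.inf
    for _ in range(n_restarts):
        w0 = rng.uniform(-0.1, 0.1, size=n_weights)   # small random init
        w = train(w0)
        err = error_of(w)
        if err < best_err:
            best_w, best_err = w, err
    return best_w
```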
What is an important problem of linear activation functions (apart from linearity)?
They are suitable for continuous output but may leave the parameters unbounded
How to initialize weights? Why?
Randomly, in a small range around zero, because the sigmoid function can easily saturate for large weight values
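A minimal Python sketch (numpy assumed; the range and layer sizes are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(42)
n_inputs, n_hidden = 4, 8
W = rng.uniform(-0.1, 0.1, size=(n_hidden, n_inputs))  # small values around zero
```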
Type of weight update?
On-line, batch, mini-batch
on-line updating
each pattern error contributes sequentially to the weight updating.
The search in weight space is more stochastic, which helps avoid local minima
batch updating
implies that all the pattern errors are accumulated before the weights are updated.
Small pattern errors can be smoothed out, so it is less sensitive to noise
mini-batch updating
Use of a subset S of the overall training dataset
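A minimal Python sketch contrasting the three modes over one epoch (numpy assumed; grad_on_batch is a hypothetical helper returning dE/dw on a subset of patterns):

```python
import numpy as np

def epoch(w, X, T, eta, grad_on_batch, mode="mini-batch", batch_size=32):
    """One pass through the training set with the chosen update mode."""
    idx = np.random.permutation(len(X))
    if mode == "batch":
        w = w - eta * grad_on_batch(w, X, T)               # one update per epoch
    elif mode == "on-line":
        for i in idx:                                      # one update per pattern
            w = w - eta * grad_on_batch(w, X[i:i+1], T[i:i+1])
    else:                                                  # mini-batch
        for s in range(0, len(X), batch_size):
            b = idx[s:s+batch_size]                        # subset S of the data
            w = w - eta * grad_on_batch(w, X[b], T[b])
    return w
```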
possible stop criteria
1)maximum number of iterations
2)euclidean norm of the gradient vector less than a predefined threshold
3)error function less than a predefined threshold
4)hybrid criterion
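A minimal Python sketch of a hybrid stop test combining the criteria above (numpy assumed; the thresholds are illustrative):

```python
import numpy as np

def should_stop(it, g, E, max_iter=10000, grad_tol=1e-5, err_tol=1e-4):
    """Stop on iteration budget, small gradient norm, or small error,
    whichever occurs first (hybrid criterion)."""
    return (it >= max_iter
            or np.linalg.norm(g) < grad_tol
            or E < err_tol)
```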
Regularization factor
norm(w)^2: keeps the weights as small as possible. It is scaled by a regularization rate and added to the error function
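A minimal Python sketch of how the term enters the error and its gradient (numpy assumed; lam is the regularization rate):

```python
import numpy as np

def regularized_error(E_data, w, lam=1e-3):
    """Total error = data error + lam * ||w||^2 (keeps weights small)."""
    return E_data + lam * np.dot(w, w)

def regularized_gradient(grad_data, w, lam=1e-3):
    """The L2 term adds 2 * lam * w to the gradient of the data error."""
    return grad_data + 2.0 * lam * w
```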
heuristic rules for training data
1)training data should be representative for the target task
2) avoid many examples of one type at the expense of another
3) if one class of pattern is easy to learn, having a large number of patterns from that class in the training set will only slow down the overall learning process
4) rescale the input values (zero-mean and unit-std normalization, see the sketch below)
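A minimal Python sketch of rule 4, zero-mean and unit-std rescaling per input feature (numpy assumed):

```python
import numpy as np

def standardize(X_train):
    """Rescale inputs to zero mean and unit standard deviation, per feature."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    sigma[sigma == 0] = 1.0          # avoid division by zero on constant features
    return (X_train - mu) / sigma, mu, sigma
```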
How to prevent under-fitting
The network must have a sufficient number of hidden units. Use a convergence threshold on the error
How to prevent over-fitting
Avoid too many layers and units
Superimpose additional noise on the training patterns
Training can be stopped before convergence
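A minimal Python sketch of stopping before convergence by watching a held-out validation error (train_one_epoch and val_error are hypothetical placeholders):

```python
import numpy as np

def train_early_stopping(w, train_one_epoch, val_error, patience=5, max_epochs=500):
    """Stop when the validation error has not improved for `patience` epochs."""
    best_w, best_err, waited = w, np.inf, 0
    for _ in range(max_epochs):
        w = train_one_epoch(w)
        err = val_error(w)
        if err < best_err:
            best_w, best_err, waited = w, err, 0
        else:
            waited += 1
            if waited >= patience:      # stopped before full convergence
                break
    return best_w
```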