Neural Networks Flashcards
1
Q
Why artificial neural networks?
A
- biological inspiration: the human brain
- one goal is to reproduce the brain and understand how it works
- reproduce phenomena and biological data
- understand the general computational principles used by the brain
- the focus of ML is to reproduce some of its functions
2
Q
Different models and learning?
A
- supervised learning
- classification, regression, time series
- unsupervised learning
- clustering, data mining, self-organizing maps
- different neural network models for different computational/learning needs
- models differ in:
- network topology
- function computed by a single neuron
- training algorithm
- how training proceeds
3
Q
When to use a Neural Network?
A
- high-dimensional input, discrete/real valued
- discrete/real valued output
- the data may be noisy
- target function unknown
- long training times are acceptable, but evaluation of the learned function must be fast
- final solution does not need to be understood by humans
4
Q
Single neuron - Perceptron
A
- weighted sum of the inputs followed by a step function (sketch below)
- any Boolean function can be implemented by a combination of perceptrons, but not by a single perceptron (e.g. XOR is not linearly separable)
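A minimal sketch of such a unit, assuming NumPy; the function names `step` and `perceptron_output` are my own:

```python
import numpy as np

def step(z):
    # hard threshold: +1 if the weighted sum is non-negative, -1 otherwise
    return 1.0 if z >= 0 else -1.0

def perceptron_output(w, x):
    # weighted sum of the inputs followed by the step function
    # (a bias can be handled by appending a constant 1 to x)
    return step(np.dot(w, x))
```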
5
Q
Perceptron learning algorithm
A
- if the samples in R^n are linearly separable, the algorithm terminates in a finite number of steps
- initialize the weights randomly
- learning rate η ≥ 0
- targets t ∈ {-1, +1}
- training samples of the form (x, t)
- repeat:
- randomly select one of the training samples
- if the output o = sign(w·x) != target t
- w = w + η(t - o)x (see the training-loop sketch below)
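A hedged sketch of this loop, assuming targets in {-1, +1} and NumPy; `train_perceptron`, `eta` and the `max_epochs` safeguard are my own additions, not part of the card:

```python
import numpy as np

def train_perceptron(X, t, eta=0.1, max_epochs=100):
    """X: (n_samples, n_features) inputs; t: targets in {-1, +1}."""
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.01, size=X.shape[1])        # random initialization
    for _ in range(max_epochs):
        errors = 0
        for i in rng.permutation(len(X)):               # pick samples in random order
            o = 1.0 if np.dot(w, X[i]) >= 0 else -1.0   # o = sign(w . x)
            if o != t[i]:                               # update only on mistakes
                w += eta * (t[i] - o) * X[i]            # w = w + eta (t - o) x
                errors += 1
        if errors == 0:                                 # all samples classified correctly
            break
    return w
```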
6
Q
Learning rate
A
- controls the size of each learning step
- a small value makes learning more stable
- prevents the weight vector from undergoing too "sharp" changes
7
Q
Is the perceptron differentiable?
A
- no, the hard threshold is not differentiable
- to make the perceptron differentiable, a sigmoid must be used instead of the step function (see the sketch below)
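A small sketch of the sigmoid (logistic) unit that replaces the hard threshold (function names are mine); its derivative σ(z)(1 - σ(z)) is what later makes gradient-based training possible:

```python
import numpy as np

def sigmoid(z):
    # smooth, differentiable alternative to the step function
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)   # used later by backpropagation
```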
8
Q
Multilayer Neural Networks
A
- composed of several connected units
- compute non-linear functions
- different types of units:
- input units: input variables
- output units: output variables
- hidden units: encode correlations among input and output variables
- weights are defined on the connections between units (forward-pass sketch below)
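A minimal forward pass with one hidden layer of sigmoid units, assuming NumPy; `forward`, `W1` and `W2` are my own names:

```python
import numpy as np

def forward(x, W1, W2):
    """x: input vector; W1: input-to-hidden weights; W2: hidden-to-output weights."""
    h = 1.0 / (1.0 + np.exp(-(W1 @ x)))   # hidden unit activations (sigmoid)
    o = 1.0 / (1.0 + np.exp(-(W2 @ h)))   # output unit activations (sigmoid)
    return h, o
```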
9
Q
What is the Delta rule?
A
- weight update rule (different from the Perceptron rule)
- allows obtaining a best-fit solution that approximates the target
- exploits gradient descent to explore the hypothesis space
- minimizes an error function
- no hard threshold (the unit output is linear)
- start from a random w and update it in the direction opposite to the gradient (formulas below)
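The usual formulas behind this, as a sketch using the standard squared-error definition; the symbols t_d, o_d and D are my notation:

```latex
% Squared error over the training set D (t_d: target, o_d = \vec{w} \cdot \vec{x}_d: linear output)
E(\vec{w}) = \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2
% Update in the direction opposite to the gradient (delta rule)
\Delta w_i = -\eta \frac{\partial E}{\partial w_i} = \eta \sum_{d \in D} (t_d - o_d)\, x_{i,d}
```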
10
Q
How does the gradient descent algorithm work?
A
- the weights are initialized with random values
- until convergence:
- for each sample in the dataset
- we compute the output by feeding the input to the neuron (o = w·x)
- we calculate and accumulate the update η(t - o)x for each weight with respect to the target
- after the pass over the dataset, we update the weights (see the sketch below)
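A sketch of this loop for a single linear unit, in the batch form where updates are accumulated and applied once per pass (NumPy; all names are mine):

```python
import numpy as np

def gradient_descent(X, t, eta=0.01, epochs=1000):
    """Single linear unit trained with the delta rule, o = w . x."""
    w = np.random.default_rng(0).normal(scale=0.01, size=X.shape[1])
    for _ in range(epochs):
        delta_w = np.zeros_like(w)
        for x_i, t_i in zip(X, t):
            o = np.dot(w, x_i)                 # compute the output
            delta_w += eta * (t_i - o) * x_i   # accumulate eta (t - o) x
        w += delta_w                           # apply the accumulated update
    return w
```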
11
Q
Differences between batch, stochastic and mini-batch gradient descent?
A
- batch
- whole dataset used for each update
- computationally efficient
- gradient more stable
- feedback on performance only after a long time
- costly memory-wise (all training samples)
- stochastic
- one sample at a time
- immediate feedback on performance
- computationally more expensive
- gradient can be noisy
- mini-batch
- a subset of samples per update
- middle ground: tries to keep the advantages of both while reducing their disadvantages (see the sketch below)
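A hedged sketch showing how the three schedules differ only in how many samples feed each update (NumPy; the `batch_size` convention is my own):

```python
import numpy as np

def train(X, t, eta=0.01, epochs=100, batch_size=None):
    """batch_size=None -> batch GD, 1 -> stochastic GD, k -> mini-batch GD."""
    n, d = X.shape
    bs = n if batch_size is None else batch_size
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.01, size=d)
    for _ in range(epochs):
        idx = rng.permutation(n)
        for start in range(0, n, bs):
            batch = idx[start:start + bs]
            o = X[batch] @ w                     # outputs for the current batch
            grad = (t[batch] - o) @ X[batch]     # accumulated (t - o) x over the batch
            w += eta * grad / len(batch)         # one (averaged) update per batch
    return w
```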
12
Q
Backprop algorithm for multilayer perceptron
A
- the weights are initialized with random values
- until convergence:
- for each sample in the dataset
- we compute the vectors of hidden and output unit activations (forward pass)
- we calculate and accumulate the update for each input-to-hidden weight with respect to the target
- we calculate and accumulate the update for each hidden-to-output weight with respect to the target
- after the pass over the dataset, we update the weights (see the sketch below)
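A compact sketch of one such pass for a single hidden layer with sigmoid units and squared error, accumulating the updates as described above (NumPy; all names are mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_epoch(X, T, W1, W2, eta=0.1):
    """X: inputs, T: targets; W1: input-to-hidden, W2: hidden-to-output weights."""
    dW1 = np.zeros_like(W1)
    dW2 = np.zeros_like(W2)
    for x, t in zip(X, T):
        h = sigmoid(W1 @ x)                       # hidden unit vector
        o = sigmoid(W2 @ h)                       # output unit vector
        delta_o = (t - o) * o * (1 - o)           # output-layer error term
        delta_h = (W2.T @ delta_o) * h * (1 - h)  # error propagated back to the hidden layer
        dW2 += eta * np.outer(delta_o, h)         # accumulate hidden-to-output updates
        dW1 += eta * np.outer(delta_h, x)         # accumulate input-to-hidden updates
    return W1 + dW1, W2 + dW2                     # apply the updates once per pass
```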
13
Q
Training problems in multi-layer networks
A
- choice of the network topology determines the hypothesis space
- number of hidden units determines the complexity of the hypothesis space
- choice of the descent step (learning rate) can be crucial for the convergence
- training is generally slow
- output computation is fast
- many local minima may be present
- reaching the global minimum is difficult
14
Q
How could one try to avoid local minima?
A
- momentum -> a term added to the weight update that imposes a form of inertia on the system (see the sketch below)
- stochastic training -> noise can help escape local minima
- multiple NN training -> same data, different initializations, the best-performing one is selected (on a validation set). Or an ensemble of NNs: the prediction is the (weighted) average of the individual predictions
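A sketch of the momentum idea from the first point: part of the previous update is carried over into the current one (NumPy; `alpha` is my name for the momentum coefficient, often around 0.9):

```python
import numpy as np

def momentum_step(w, grad, velocity, eta=0.01, alpha=0.9):
    """One weight update with momentum: velocity keeps a fraction of the previous step."""
    velocity = alpha * velocity - eta * grad   # inertia term plus current gradient step
    w = w + velocity
    return w, velocity
```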