Neural networks Flashcards
Explain what a linear sparse coding model is and write its loss function.
Sparse distributed coding: each entity is encoded by strong activation of a relatively small set of neurons.
Sparse coding: it imposes a penalty on the number of active units to encourage independence among the components. We select a small number of basis functions from a large dictionary, in such a way that the basis-function code is overcomplete. We define sparsity as having few non-zero components, or few components far from zero: given an input vector, we would like only a few coefficients to be far from zero. This method is optimized iteratively.
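For reference, a standard form of the linear sparse coding loss (notation assumed here: x is the input, D the dictionary of basis functions, h the sparse code, λ the sparsity weight):

    L(x, D, h) = ‖x − D h‖² + λ ‖h‖₁

minimized over h for inference and over D for learning; the L1 term pushes most coefficients of h towards zero.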
Derive the delta rule.
The delta rule is the first step to improve the perceptron and consists in replacing the threshold activation function with a linear activation function, which is continuous and differentiable. Then we can use gradient descent to minimize the loss function, obtaining the following formulas.
A standard reconstruction of the derivation: with linear output y = Σ_i w_i x_i and squared error E = ½ Σ_p (t_p − y_p)², the gradient is ∂E/∂w_i = −Σ_p (t_p − y_p) x_{p,i}, so gradient descent gives the update

    Δw_i = η (t − y) x_i

where η is the learning rate.
Note that the perceptron changes the inclination of the separating line abruptly, while the delta rule changes it slightly, gradually modifying the weights until the optimal ones are found.
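A minimal training-loop sketch in Python (NumPy assumed; data and hyperparameters are illustrative):

    import numpy as np

    def delta_rule(X, t, eta=0.01, epochs=100):
        """Train a linear unit y = w . x with the delta rule."""
        w = np.random.randn(X.shape[1]) * 0.01   # small random init
        for _ in range(epochs):
            for x_p, t_p in zip(X, t):
                y = w @ x_p                       # linear activation
                w += eta * (t_p - y) * x_p        # delta rule update
        return w

    # toy usage: learn t = 2*x1 - x2
    X = np.random.randn(200, 2)
    t = 2 * X[:, 0] - X[:, 1]
    w = delta_rule(X, t)   # w approaches [2, -1]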
Weight initialization.
In general, we can initialize the synaptic weights by randomly sampling from a Gaussian N(0,1) or a uniform U(−1,1). This is done to break the symmetry: if two hidden units are connected to the same inputs, their initial parameters should be different, otherwise they will converge to the same value. With this method we could use larger values to break the symmetry more strongly, but because of activation function saturation we cannot initialize with values that are too large. We can improve the weight initialization using a heuristic that improves SGD convergence.
    W ~ U(−√(6/(m+n)), +√(6/(m+n)))   (the standard Glorot/Xavier heuristic)
where m is the number of inputs (called fan-in) and n is the number of outputs (called fan-out) of each layer. However, if many units are connected to a single neuron, this heuristic leads to very small weight values. In that case we can use sparse initialization, where each unit has exactly k non-zero weights (a parameter decided a priori), drawn from a given distribution. In this way we keep the total amount of activation under control as the number of units (m and n in the previous formula) grows, without decreasing the weights' values.
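Both schemes as a Python sketch (NumPy assumed; function names are illustrative):

    import numpy as np

    def glorot_uniform(m, n):
        """Glorot/Xavier init: scale shrinks as fan-in + fan-out grows."""
        limit = np.sqrt(6.0 / (m + n))
        return np.random.uniform(-limit, limit, size=(m, n))

    def sparse_init(m, n, k=15, scale=1.0):
        """Each output unit gets exactly k non-zero Gaussian weights,
        so activation magnitude does not shrink as m grows."""
        W = np.zeros((m, n))
        for j in range(n):
            idx = np.random.choice(m, size=k, replace=False)
            W[idx, j] = np.random.randn(k) * scale
        return W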
Regularization.
The model capacity is limited by imposing a norm penalty on the parameters
    J̃(θ) = J(θ) + α Ω(θ)
where the hyperparameter α defines the contribution of the penalty term Ω(θ) to the overall loss function. Usually we regularize only the weights. Different types of regularization have different effects on the network: with the L2 norm, Ω(θ) = ½‖w‖₂², we obtain weight decay (ridge regression), while with the L1 norm, Ω(θ) = ‖w‖₁, we obtain coefficient sparsity (Lasso regression).
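As a sketch in Python (NumPy assumed; alpha and the weight list are illustrative):

    import numpy as np

    def penalized_loss(loss, weights, alpha=1e-3, norm="l2"):
        """Add an L2 (weight decay) or L1 (sparsity) penalty to a loss value."""
        if norm == "l2":
            omega = 0.5 * sum(np.sum(W ** 2) for W in weights)
        else:  # "l1"
            omega = sum(np.sum(np.abs(W)) for W in weights)
        return loss + alpha * omega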
Dropout.
Dropout consists in removing input or hidden units while processing each pattern. The goal is to regularize each unit so that it learns not just one feature, but a feature that is useful in many contexts. The simplest implementation multiplies the output of the activation function by 0 according to a binary mask, obtained by randomly sampling the network structure with a probability p, which is a hyperparameter.
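A minimal sketch of inverted dropout in Python (the 1/(1−p) rescaling, which keeps the expected activation unchanged, is an addition beyond the card):

    import numpy as np

    def dropout(a, p=0.5, training=True):
        """Zero each activation with probability p (inverted dropout)."""
        if not training:
            return a                        # no dropout at test time
        mask = (np.random.rand(*a.shape) >= p).astype(a.dtype)
        return a * mask / (1.0 - p)         # rescale to preserve expectation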
Momentum.
Momentum helps to speed up the convergence of gradient descent, especially if the Hessian matrix is poorly conditioned (i.e. the loss function has elongated ellipsoids as level curves). We introduce a velocity term v, which contains both the direction and the speed of the motion in the optimization space.
    v ← α v − η ∇_θ J(θ)
    θ ← θ + v
The velocity is an exponentially decaying moving average of past gradients, added to the current estimate. The term v is multiplied by a hyperparameter α that has to be tuned. The informal idea is to see the optimization space as a valley, where we move faster while going downhill.
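As a sketch in Python (names illustrative):

    def momentum_step(theta, v, grad, eta=0.01, alpha=0.9):
        """One SGD-with-momentum update; alpha decays the past gradients."""
        v = alpha * v - eta * grad   # accumulate velocity
        theta = theta + v            # move along the velocity
        return theta, v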
How does Adam work? Write the equations.
Adam is an adaptive learning rate method that computes individual learning rates for different parameters from estimates of the first and second moments of the gradients.
    m_t = β1 m_{t-1} + (1 − β1) g_t
    v_t = β2 v_{t-1} + (1 − β2) g_t²
    m̂_t = m_t / (1 − β1^t),   v̂_t = v_t / (1 − β2^t)
    θ_t = θ_{t-1} − η m̂_t / (√(v̂_t) + ε)
where η is the learning rate, g_t the gradient at step t, β1 and β2 the decay rates of the two moment estimates, and ε a small constant for numerical stability.
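The same update as a Python sketch (NumPy assumed):

    import numpy as np

    def adam_step(theta, grad, m, v, t, eta=1e-3,
                  beta1=0.9, beta2=0.999, eps=1e-8):
        """One Adam update (t starts at 1 for the bias correction)."""
        m = beta1 * m + (1 - beta1) * grad          # first moment
        v = beta2 * v + (1 - beta2) * grad ** 2     # second moment
        m_hat = m / (1 - beta1 ** t)                # bias corrections
        v_hat = v / (1 - beta2 ** t)
        theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
        return theta, m, v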
Why are second-order optimization methods challenging? What is the learning goal? Condition number of the Hessian matrix.
The second-order derivatives give more information about the cost function than the first-order ones. In fact, where the curvature is negative the cost function decreases faster than the gradient predicts; where the curvature is positive it decreases slower than the gradient predicts and eventually begins to increase; where there is no curvature the gradient predicts correctly. The challenge is that we have to compute the Hessian matrix, which is essentially the Jacobian of the gradient.
For the Hessian matrix we can define the condition number as the ratio κ(H) = λ_max / λ_min between its largest and smallest eigenvalues. If this ratio is very large, GD performs poorly, because the derivative increases rapidly in one direction and slowly in the others. The learning goal is then to guide SGD or GD using the information contained in the Hessian matrix.
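A quick NumPy illustration on an assumed quadratic loss with Hessian H:

    import numpy as np

    # Hessian of a quadratic loss f(x) = 0.5 * x^T H x (illustrative values)
    H = np.array([[100.0, 0.0],
                  [0.0,   1.0]])
    eigvals = np.linalg.eigvalsh(H)           # eigenvalues of a symmetric matrix
    kappa = eigvals.max() / eigvals.min()     # condition number = 100
    print(kappa)  # ill-conditioned: level curves are elongated ellipses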
Kernels in CNN.
The kernel of a CNN encodes a specific feature through a local receptive field, defined for each hidden neuron. Each kernel is convolved with the entire image, as if we had multiple copies of the same neuron with the same weights, so we analyse different portions of the image using the same parameters. The output is not a 1-dimensional element but a 2-dimensional one, because each layer produces a filtered version of its input. We can define the kernel stride as the step by which the kernel moves. To analyse a portion of the image more accurately, or to avoid decreasing the output size, we can add a padding of zeros around the input. Different kernels generate different outputs for the same input. If the input has multiple channels we can define a kernel for each one and analyse them separately.
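As a worked detail (standard, not stated in the card): with input size i, kernel size k, padding p and stride s, the output size along one dimension is o = ⌊(i + 2p − k)/s⌋ + 1. A minimal check in Python:

    def conv_output_size(i, k, p=0, s=1):
        """Spatial output size of a convolution along one dimension."""
        return (i + 2 * p - k) // s + 1

    print(conv_output_size(32, 5))            # 28: valid convolution shrinks
    print(conv_output_size(32, 5, p=2))       # 32: zero padding preserves size
    print(conv_output_size(32, 5, p=2, s=2))  # 16: stride 2 halves it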
Explain adversarial attacks and defence.
Adversarial attacks consist in injecting a small amount of noise (in some cases indistinguishable to the human eye) into the input, in the direction of the gradient that maximizes the loss function. In contrast to gradient descent, we change the input, not the weights, in order to fool the NN. During the training phase the model traces the boundaries for the classification task; adversarial samples try to move towards these boundaries and, in some cases, cross them, so that the learned boundaries no longer match the original data. To defend against this type of attack we can use adversarial training (optimization via a min-max objective under systematic attacks) or randomized smoothing (smoothing the network weights by convolving them with Gaussian noise).
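A minimal sketch of the fast gradient sign method (FGSM), one concrete instance of such an attack; obtaining grad_x (the gradient of the loss with respect to the input) from a framework's autodiff is assumed and not shown:

    import numpy as np

    def fgsm_attack(x, grad_x, eps=0.03):
        """Perturb the input by eps in the direction that increases the loss."""
        x_adv = x + eps * np.sign(grad_x)   # step along the loss gradient
        return np.clip(x_adv, 0.0, 1.0)     # keep a valid image range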
Receptive fields.
The receptive field of a neuron is the size of the input region that contributes to one output feature. It relates an output feature to the region of the input it depends on.
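A standard recursion (not in the card) for the receptive field r_l of layer l, with kernel size k_l and strides s_i of the preceding layers:

    r_l = r_{l-1} + (k_l − 1) · ∏_{i=1}^{l-1} s_i,   r_0 = 1

so stacking layers (and striding) makes the receptive field grow with depth.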
Explain backpropagation through time.
Exploiting time unfolding, a recurrent neural network can be seen as a deep network with parameter sharing. We can then break the sequence of unrolled layers into smaller pieces and compute the gradient on each piece, rather than computing it on the whole network at the end. In practice, we process the first time step and create the first hidden activations, then we process the next time step and, using the previous context values, create the second hidden activations, and so on until we reach the last state, which encodes the entire sequence. In this way we also mitigate vanishing gradients within the single pieces.
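A minimal sketch of backpropagation through time for a vanilla tanh RNN, assuming the loss is applied only to the final state (NumPy assumed; shapes and names illustrative):

    import numpy as np

    def bptt(xs, target, Wx, Wh):
        """Forward a tanh RNN over xs, then backprop through time.
        Returns gradients for the shared weights Wx and Wh."""
        h = np.zeros(Wh.shape[0])
        hs, prev = [], [h]
        for x in xs:                          # forward, unrolled in time
            h = np.tanh(Wx @ x + Wh @ h)
            hs.append(h)
            prev.append(h)
        dWx, dWh = np.zeros_like(Wx), np.zeros_like(Wh)
        dh = hs[-1] - target                  # dL/dh_T for 0.5*||h_T - t||^2
        for t in reversed(range(len(xs))):    # backward through time
            dz = dh * (1.0 - hs[t] ** 2)      # through tanh
            dWx += np.outer(dz, xs[t])        # shared-weight gradients
            dWh += np.outer(dz, prev[t])      # prev[t] is h_{t-1}
            dh = Wh.T @ dz                    # propagate to earlier steps
        return dWx, dWh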
LSTM diagram, how the gates work, their equations, and why you need both sigmoid and tanh.
(see figure)
The goal of the LSTM is to selectively block, add or remove information in the memory of the network. To do that, it implements a system of gates that filter the input information:
o The input gate decides whether the input will flow into the memory cell or be dropped: i_t = σ(W_i x_t + U_i h_{t-1} + b_i);
o The forget gate takes in input the previous hidden state and decides whether the previous cell content will be remembered or dropped: f_t = σ(W_f x_t + U_f h_{t-1} + b_f);
o The output gate decides whether the current value (input combined with the previous state) will be dropped or will flow into the next hidden state: o_t = σ(W_o x_t + U_o h_{t-1} + b_o).
The cell state equation is:
    c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t,   with candidate values c̃_t = tanh(W_c x_t + U_c h_{t-1} + b_c)
The cell state serves to propagate information over time (⊙ denotes the element-wise product).
The sigmoid outputs values in [0, 1], so it is used to decide whether information will be dropped (0) or will flow (1); the tanh outputs values in [−1, 1], so it decides whether information is added to or removed from the cell state.
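A minimal sketch of a single LSTM step combining the gates above (NumPy assumed; weights packed per gate, names illustrative):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x, h_prev, c_prev, W, U, b):
        """One LSTM time step; W, U, b are dicts keyed by gate name."""
        i = sigmoid(W["i"] @ x + U["i"] @ h_prev + b["i"])   # input gate
        f = sigmoid(W["f"] @ x + U["f"] @ h_prev + b["f"])   # forget gate
        o = sigmoid(W["o"] @ x + U["o"] @ h_prev + b["o"])   # output gate
        g = np.tanh(W["c"] @ x + U["c"] @ h_prev + b["c"])   # candidate values
        c = f * c_prev + i * g            # cell state: keep old + add new
        h = o * np.tanh(c)                # hidden state exposed to next step
        return h, c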
Transformers and attention mechanism.
The attention mechanism fits in an encoder-decoder framework: the encoder passes all its hidden states (one per time step) to the decoder, and the decoder assigns each of them a score that amplifies the most relevant parts. At each decoding step a context vector is generated: prepare the inputs, give a score to each one, apply a softmax to the scores, multiply each hidden state by its softmaxed score, and sum all the weighted states. The decoder then takes this representation as input. In this way the network can learn the sentence structure, giving importance to specific parts.
Transformers use the attention mechanism in a different way. The encoder takes the input, computes an embedding and combines it with a positional encoding (which tries to incorporate the positional information into a finite-size vector), then applies multi-head attention (attention over multiple parts in parallel), then uses a feed-forward layer to process all input elements at the same time and, at the end, passes its hidden states to the decoder. The decoder takes the previous output as input, computes an embedding and combines it with a positional encoding, applies multi-head attention to it, combines the result with the hidden states passed by the encoder in a second attention block, and processes both with a feed-forward layer. The decoder produces the output sequentially and passes this output "shifted right" to the next step, to embed the passage of time.
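For reference, a minimal sketch of the scaled dot-product attention used inside each head (the standard formulation, not spelled out in the card; NumPy assumed, Q, K, V are the query, key and value matrices):

    import numpy as np

    def attention(Q, K, V):
        """softmax(Q K^T / sqrt(d_k)) V, row by row."""
        d_k = K.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)                 # query-key similarity
        scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
        return weights @ V                              # weighted sum of values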
Universal approximation properties of ANN: specifically, cases of RNN and RBM.
The RNN universal approximation properties are:
o For any computable function there exists a finite RNN that computes it;
o Any finite time trajectory of a given n-dimensional dynamical system can be approximately realized by the state of the output units of a continuous-time RNN with n output units, some hidden units and appropriate initial conditions;
o Any open dynamical system can be approximated with an arbitrary accuracy by a RNN defined in state space model form.
The RBM universal approximation properties are:
o It can be proved that RBMs are universal approximators of probability mass functions over discrete variables;
o Temporal extensions of RBMs are universal approximators of stochastic processes with a finite time dependence.