Definitions Flashcards
Perceptron
An artificial neuron, or perceptron, takes several inputs and performs a weighted summation to produce an output. The weights of the perceptron are determined during the training process, based on the training data. The sum is then passed through a unit step function. A perceptron can only learn simple functions by learning the weights from examples.
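A minimal numpy sketch of this weighted summation followed by a unit step (the input, weight and bias values here are purely illustrative; in practice the weights are learned from data):
import numpy as np
x = np.array([1.0, 0.5, -0.2])           # inputs
w = np.array([0.4, -0.1, 0.8])           # weights, normally learned during training
b = 0.1                                  # bias
weighted_sum = np.dot(w, x) + b
output = 1 if weighted_sum > 0 else 0    # unit step function
print(output)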
Activation functions
Activation functions make neural nets nonlinear. An activation function decides whether a perceptron should fire or not. During training, activation functions play an important role in adjusting the gradients. An activation function such as sigmoid attenuates values with higher magnitudes. This nonlinear behaviour gives deep nets the ability to learn complex functions. Most activation functions are continuous and differentiable, except the rectified linear unit at 0. A continuous function has small changes in output for every small change in input. A differentiable function has a derivative existing at every point in its domain.
Sigmoid
Can be considered a smoothened step function and is hence differentiable. Sigmoid is useful for converting any value to a probability and can be used for binary classification. The sigmoid maps its input to a value in the range of 0 to 1. For inputs of large magnitude, the change in output with respect to the input is very small, and hence there will be vanishing gradients; after some learning, the updates may become very small.
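A small sketch of the sigmoid and its derivative, showing how the gradient vanishes for inputs of large magnitude (the sample inputs are illustrative):
import numpy as np
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))
def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)
print(sigmoid(0.0), sigmoid_grad(0.0))    # 0.5, 0.25 (largest gradient)
print(sigmoid(10.0), sigmoid_grad(10.0))  # ~1.0, ~4.5e-05 (vanishing gradient)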
Tanh
Tanh is a scaled version of sigmoid and reduces the problem of a vanishing gradient. The hyperbolic tangent function is also smooth and differentiable. The tanh maps its input to a value in the range of -1 to 1. Its gradients are more stable than sigmoid's and hence it suffers less from vanishing gradients. Both sigmoid and tanh fire all the time, making the ANN really heavy. The ReLU activation function avoids this pitfall by not firing at times.
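One way to see that tanh is a scaled version of sigmoid is the identity tanh(x) = 2*sigmoid(2x) - 1; a quick numerical check (the sample points are illustrative):
import numpy as np
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))
x = np.linspace(-3, 3, 7)
print(np.allclose(np.tanh(x), 2 * sigmoid(2 * x) - 1))  # True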
ReLU
The Rectified Linear Unit lets large values pass through unchanged. This makes a few neurons stale so that they don't fire, which increases sparsity. ReLU maps an input x to max(0, x): negative inputs are mapped to 0 and positive inputs are output without any change. Because ReLUs don't fire all the time, the network can be trained faster. Since the function is simple, it is computationally the least expensive.
# forward pass
z = np.maximum(0, np.dot(W, x))
# backward pass
dW = np.outer(z > 0, x)
Should a neuron get clamped to zero in the forward pass (z = 0, it doesn't fire), its weights will get a zero gradient. This can lead to the dead ReLU problem: if a ReLU neuron is unfortunately initialized in such a way that it never fires, or if its weights ever get knocked off with a large update during training, the neuron will remain permanently dead, e.g. permanent, irrecoverable brain damage.
Artificial neural network (ANN)
An ANN is a collection of perceptrons and activation functions. The perceptrons are connected to form hidden layers or units. The hidden units form the nonlinear basis that maps the input layers to output layers in a lower-dimensional space. An ANN is a map from input to output; the map is computed by weighted addition of the inputs with biases.
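A minimal sketch of a two-layer ANN forward pass (the layer sizes and random values are illustrative):
import numpy as np
np.random.seed(0)
x = np.random.randn(4)                        # input vector
W1, b1 = np.random.randn(8, 4), np.zeros(8)   # input -> hidden
W2, b2 = np.random.randn(3, 8), np.zeros(3)   # hidden -> output
hidden = np.tanh(np.dot(W1, x) + b1)          # nonlinear hidden units
output = np.dot(W2, hidden) + b2              # weighted addition with biases
print(output.shape)                           # (3,)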
One-hot encoding
A way to represent the target variables or classes in a classification problem. The target variables can be converted from string labels to one-hot encoded vectors. A one-hot vector is filled with 1 at the index of the target class and 0 everywhere else. For 1000 classes, the one-hot vectors will be of length 1000, with a single 1 and zeros everywhere else. One-hot encoding makes no assumptions about the similarity of target variables.
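A minimal numpy sketch of one-hot encoding integer class labels (the labels and class count are illustrative):
import numpy as np
labels = np.array([0, 2, 1])      # integer class indices
num_classes = 3
one_hot = np.eye(num_classes)[labels]
print(one_hot)
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]]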
Softmax
Softmax is a way of forcing the outputs of a neural network to sum to 1, so the output values of the softmax function can be considered part of a probability distribution. This is useful in multi-class classification problems. Softmax is a kind of activation function with the speciality of its outputs summing to 1. It converts the outputs to probabilities by exponentiating each output and dividing by the sum of the exponentials of all the outputs. The Euclidean distance could be computed between the softmax probabilities and the one-hot encoding for optimization, but cross-entropy is a better cost function to optimize.
The softmax function converts its inputs, known as logits or logit scores, to values between 0 and 1, and also normalizes the outputs so they sum up to 1. In short, it turns logits into probabilities.
import tensorflow as tf

logit_data = [2.0, 1.0, 0.1]
logits = tf.placeholder(tf.float32)
softmax = tf.nn.softmax(logits)
with tf.Session() as sess:
    output = sess.run(softmax, feed_dict={logits: logit_data})
    print(output)
NB: the logits themselves are a linear function WX + b:
logits = tf.add(tf.matmul(features, weights), biases)
Cross-entropy
Cross-entropy compares the distance between the outputs of softmax and the one-hot encoding. Cross-entropy is a loss function for which the error has to be minimized. Neural networks estimate a probability for the given data belonging to each class; the probability assigned to the correct target label has to be maximized. Cross-entropy is the summation of negative logarithmic probabilities. The logarithm is used for numerical stability. Maximizing a function is equivalent to minimizing the negative of the same function.
Cross-entropy is not symmetric: D(S,L) != D(L,S)
cross_entropy = -tf.reduce_sum(labels * tf.log(prediction), reduction_indices=1)
# training loss
loss = tf.reduce_mean(cross_entropy)
Dropout
An effective way of regularizing neural networks to avoid overfitting of the ANN. During training, the dropout layer cripples the neural network by removing hidden units stochastically.
Dropout is also an efficient way of combining several neural networks. For each training case, we randomly select a few hidden units so that we end up with a different architecture for each case. This is an extreme case of bagging and model averaging. The dropout layer should not be used during inference as it is not necessary.
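A minimal sketch of (inverted) dropout during training, where the surviving units are scaled by the keep probability so nothing extra is needed at inference (keep_prob is an illustrative hyperparameter):
import numpy as np
def dropout(h, keep_prob=0.5, training=True):
    if not training:
        return h                                            # no dropout at inference
    mask = (np.random.rand(*h.shape) < keep_prob) / keep_prob
    return h * mask                                         # remove hidden units stochastically
h = np.random.randn(5)
print(dropout(h, keep_prob=0.5, training=True))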
Batch normalization
Batch-norm increases the stability and performance of neural network training. It normalizes the output of a layer to have zero mean and a standard deviation of 1. This reduces overfitting and makes the network train faster.
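A minimal sketch of the core batch-norm computation over a mini-batch (the batch size, feature count, gamma and beta values are illustrative):
import numpy as np
def batch_norm(x, gamma, beta, eps=1e-5):
    mean = x.mean(axis=0)                     # per-feature mean over the batch
    var = x.var(axis=0)                       # per-feature variance over the batch
    x_hat = (x - mean) / np.sqrt(var + eps)   # zero mean, unit standard deviation
    return gamma * x_hat + beta               # learnable scale and shift
x = np.random.randn(32, 10)                   # batch of 32 examples, 10 features
out = batch_norm(x, gamma=np.ones(10), beta=np.zeros(10))
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))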
L1 and L2 regularization
L1 penalizes the absolute value of the weights and tends to make the weights exactly zero.
L2 penalizes the squared value of the weights and tends to make the weights smaller during training.
Both regularizers assume that models with smaller weights are better.
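A minimal sketch of the two penalty terms as they would be added to a training loss (the lambda values are illustrative hyperparameters):
import numpy as np
weights = np.random.randn(100)
lambda_l1, lambda_l2 = 1e-4, 1e-4
l1_penalty = lambda_l1 * np.sum(np.abs(weights))   # pushes weights toward exactly zero
l2_penalty = lambda_l2 * np.sum(weights ** 2)      # shrinks weights toward smaller values
# total_loss = data_loss + l1_penalty + l2_penalty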
Backpropagation
The backpropagation algorithm is commonly used for training ANNs. The weights are updated backwards from the output layer, based on the calculated error. After the error is calculated, gradient descent can be used to compute the weight updates.
In a nutshell, it consists of (a minimal sketch follows this list):
- Doing a feed-forward operation
- Comparing the output of the model with the desired output
- Calculating the error
- Running the feed-forward operation backwards (backpropagation) to spread the error to each of the weights
- Using this to update the weights and get a better model
- Repeating the process
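A minimal sketch of the loop above for a single linear neuron with a squared-error loss (the data, target and learning rate are illustrative):
import numpy as np
np.random.seed(0)
x, target = np.random.randn(3), 1.0
w, lr = np.zeros(3), 0.1
for _ in range(100):
    y = np.dot(w, x)          # feed-forward operation
    error = y - target        # compare with the desired output
    grad = error * x          # spread the error back to the weights
    w -= lr * grad            # update the weights (gradient descent)
print(np.dot(w, x))           # close to the target after training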
Gradient descent
The gradient descent algorithm performs multidimensional optimization. The objective is to reach the global minimum. Stochastic gradient descent (SGD) is one common implementation. Optimization involves calculating the error value and changing the weights to achieve that minimal error. The direction in which to move toward the minimum is the negative of the gradient of the loss function. The learning rate determines how big each step should be.
Note that an ANN with nonlinear activations will have local minima. SGD works better in practice for optimizing such non-convex cost functions.
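A minimal sketch of the gradient-descent update rule on a simple one-dimensional loss, loss(w) = (w - 3)^2 (the loss, starting point and learning rate are illustrative):
w = 0.0
learning_rate = 0.1
for step in range(50):
    grad = 2 * (w - 3)           # gradient of the loss
    w -= learning_rate * grad    # step in the negative gradient direction
print(w)                         # approaches the minimum at w = 3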
Convolutional neural network
If we use regular NNs for images, they will be very large in size due to the huge number of neurons, which leads to overfitting, and neurons within the same layer don't share any connections. An image can be considered a volume with dimensions of height, width and depth (channels). The neurons of a CNN are arranged in a volumetric fashion to take advantage of this volume. Each of the layers transforms the input volume to an output volume. CNN filters encode features by transformation: the learned filters detect features or patterns in images, and the deeper the layer, the more abstract the pattern. The learnable parameters in CNN layers are fewer than in a dense layer.
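A quick illustration of the parameter-count difference (the input size, unit count and filter size are illustrative):
h, w, c = 32, 32, 3                        # image height, width, channels
dense_params = (h * w * c) * 128 + 128     # dense layer with 128 units: 393,344 parameters
conv_params = (3 * 3 * c) * 128 + 128      # conv layer with 128 3x3 filters: 3,584 parameters
print(dense_params, conv_params)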