Definitions Flashcards
Perceptron
An artificial neuron, or perceptron, takes several inputs and performs a weighted summation to produce an output. The weights of the perceptron are determined during the training process and are based on the training data. The sum is then passed through a unit step function. A perceptron can only learn simple functions by learning the weights from examples.
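A minimal sketch of a single perceptron; the inputs, weights and bias below are made-up values for illustration:
import numpy as np

def perceptron(x, w, b):
    # weighted summation followed by a unit step function
    return 1 if np.dot(w, x) + b > 0 else 0

x = np.array([1.0, 0.5])    # inputs
w = np.array([0.7, -0.2])   # weights learned from the training data
b = -0.1                    # bias
print(perceptron(x, w, b))  # prints 1 because 0.7*1.0 - 0.2*0.5 - 0.1 = 0.5 > 0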
Activation functions
Activation functions make neural nets nonlinear. An activation function decides whether a perceptron should fire or not. During training, activation functions play an important role in adjusting the gradients. An activation function such as sigmoid attenuates values with higher magnitudes. This nonlinear behaviour gives deep nets the ability to learn complex functions. Most activation functions are continuous and differentiable, except the rectified linear unit at 0. A continuous function has small changes in output for every small change in input. A differentiable function has a derivative existing at every point in its domain.
Sigmoid
The sigmoid can be considered a smoothed step function and is hence differentiable. It maps any input to a value in the range of 0 to 1, which is useful for converting values to probabilities, and it can be used for binary classification. For inputs of large magnitude, the change in the output with respect to the input is very small, which leads to vanishing gradients; after some learning, the updates may become very small.
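A short sketch of the sigmoid and its saturating behaviour for large-magnitude inputs:
import numpy as np

def sigmoid(x):
    # maps any real value into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(np.array([-10.0, 0.0, 10.0])))  # ~[0.0000454, 0.5, 0.99995]: large magnitudes saturate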
Tanh
Tanh is a scaled version of the sigmoid and mitigates the vanishing gradient problem. The hyperbolic tangent function is also smooth and differentiable. Tanh maps the input to a value in the range of -1 to 1. Its gradients are more stable than the sigmoid's and hence it suffers less from vanishing gradients. Both sigmoid and tanh fire all the time, making the ANN computationally heavy. The ReLU activation function avoids this pitfall by not firing at times.
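Tanh can be written as a scaled and shifted sigmoid; the short check below illustrates the relationship:
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-3, 3, 7)
# tanh(x) == 2 * sigmoid(2x) - 1, mapping inputs into the range (-1, 1)
print(np.allclose(np.tanh(x), 2 * sigmoid(2 * x) - 1))  # True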
ReLU
The Rectified Linear Unit lets large positive values pass through unchanged. It leaves some neurons stale so that they don't fire, which increases sparsity. ReLU maps an input x to max(0, x): negative inputs are mapped to 0 and positive inputs are output without any change. Because ReLUs don't fire all the time, the network can be trained faster. Since the function is simple, it is computationally the least expensive.
# forward pass
z = np.maximum(0, np.dot(W, x))
# backward pass: rows for units that did not fire (z = 0) get zero gradient
dW = np.outer(z > 0, x)
Should a neuron get clamped to zero in the forward pass (z = 0, it doesn't fire), its weights will get a zero gradient. This can lead to the dead ReLU problem: if a ReLU neuron is unfortunately initialized in such a way that it never fires, or if a neuron's weights get knocked off with a large update during training, the neuron will remain permanently dead, like permanent, irrecoverable brain damage.
Artificial neural network (ANN)
An ANN is a collection of perceptrons and activation functions. The perceptrons are connected to form hidden layers or units. The hidden units form a nonlinear basis that maps the input layer to the output layer in a lower-dimensional space. An ANN is a map from input to output; the map is computed by weighted addition of the inputs with biases.
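A minimal sketch of a two-layer ANN forward pass; the sizes and random values are arbitrary, for illustration only:
import numpy as np

x = np.random.rand(4)                              # input layer with 4 features
W1, b1 = np.random.rand(3, 4), np.random.rand(3)   # hidden layer of 3 units
W2, b2 = np.random.rand(2, 3), np.random.rand(2)   # output layer of 2 units

h = np.tanh(np.dot(W1, x) + b1)   # weighted addition with biases, then a nonlinear activation
y = np.dot(W2, h) + b2            # map from the hidden units to the output
print(y.shape)                    # (2,)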
One-hot encoding
A way to represent the target variables or classes in a classification problem. The target variables can be converted from string labels to one-hot encoded vectors. A one-hot vector contains 1 at the index of the target class and 0 everywhere else. For 1,000 classes, each one-hot vector is of length 1,000, with all zeros except a single 1. One-hot encoding makes no assumptions about the similarity of target variables.
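A minimal sketch of one-hot encoding integer class labels; the labels and class count are made up:
import numpy as np

labels = np.array([0, 2, 1])            # integer class indices
num_classes = 3
one_hot = np.eye(num_classes)[labels]   # 1 at the target index, 0 everywhere else
print(one_hot)
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]]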
Softmax
Softmax is a way of forcing the outputs of a neural network to sum to 1, so that they can be treated as a probability distribution. This is useful in multi-class classification problems. Softmax is a kind of activation function with the speciality that its outputs sum to 1. It converts the outputs to probabilities by exponentiating each value and dividing by the sum of the exponentials of all the values. The Euclidean distance could be computed between the softmax probabilities and the one-hot encoding for optimization, but cross-entropy is a better cost function to optimize.
The softmax function converts its inputs, known as logits or logit scores, to values between 0 and 1, and also normalizes the outputs so they sum to 1. In short, it turns logits into probabilities.
logit_data = [2.0, 1.0, 0.1]
logits = tf.placeholder(tf.float32)
softmax = tf.nn.softmax(logits)
with tf.Session() as sess:
    output = sess.run(softmax, feed_dict={logits: logit_data})
    print(output)
NB: # linear function WX + b
logits = tf.add(tf.matmul(features, weights), biases)
Cross-entropy
Cross-entropy compares the distance between the output of softmax and the one-hot encoding. Cross-entropy is a loss function for which the error has to be minimized. Neural networks estimate the probability of the given data belonging to each class; this probability has to be maximized for the correct target label. Cross-entropy is the summation of negative logarithmic probabilities. The logarithm is used for numerical stability. Maximizing a function is equivalent to minimizing the negative of the same function.
Cross-entropy is not symmetric: D(S,L) != D(L,S)
cross_entropy = -tf.reduce_sum(labels * tf.log(prediction), reduction_indices=1)
# training loss
loss = tf.reduce_mean(cross_entropy)
Dropout
An effective way of regularizing neural networks to avoid overfitting. During training, the dropout layer cripples the neural network by removing hidden units stochastically.
Dropout is also an efficient way of combining several neural networks. For each training case, we randomly select a few hidden units so that we end up with a different architecture for each case. This is an extreme case of bagging and model averaging. The dropout layer should not be used during inference, as it is not necessary.
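A minimal sketch of (inverted) dropout applied to hidden activations during training; the keep probability and activations are assumed values:
import numpy as np

keep_prob = 0.5                                # assumed keep probability
h = np.random.rand(10)                         # hidden unit activations
mask = np.random.rand(*h.shape) < keep_prob    # stochastically drop hidden units
h_train = h * mask / keep_prob                 # scale so no change is needed at inference
h_test = h                                     # at inference, dropout is simply not applied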
Batch normalization
Batch-norm increases the stability and performance of neural network training. It normalizes the output of a layer to zero mean and a standard deviation of 1. This reduces overfitting and makes the network train faster.
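A minimal sketch of the batch-normalization computation for one layer's output; gamma, beta and epsilon are assumed values:
import numpy as np

x = np.random.rand(32, 64)                # a batch of 32 outputs from a 64-unit layer
gamma, beta, eps = 1.0, 0.0, 1e-5         # learnable scale/shift and a small constant
mean = x.mean(axis=0)
var = x.var(axis=0)
x_norm = (x - mean) / np.sqrt(var + eps)  # zero mean, unit standard deviation per unit
out = gamma * x_norm + beta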
L1 and L2 regularization
L1 penalizes the absolute value of the weights and tends to make the weights zero.
L2 penalizes the squared value of the weights and tends to make the weights smaller during training.
Both regularizers assume that models with smaller weights are better.
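A sketch of adding the two penalties to a training loss in the TensorFlow style used above; weights and loss refer to the variables from the earlier snippets, and the beta coefficients are assumed hyperparameters:
l1_penalty = tf.reduce_sum(tf.abs(weights))   # L1: absolute values, pushes weights to zero
l2_penalty = tf.nn.l2_loss(weights)           # L2: squared values, keeps weights small
total_loss = loss + beta_l1 * l1_penalty + beta_l2 * l2_penalty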
Backpropagation
The backpropagation algorithm is commonly used for training ANNs. The weights are updated backwards through the network based on the calculated error. After calculating the error, gradient descent can be used to update the weights (a minimal sketch follows the list below).
In a nutshell, it consists of:
- Doing a feed-forward operation
- Comparing the output of the model with the desired output
- Calculating the error
- Running the feedforward operation backwards (backprop) to spread the error to each of the weights
- Using this to update the weights and get a better model
- Repeat
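A minimal numpy sketch of these steps for a single sigmoid neuron with a squared-error loss; the data, weights and learning rate are made-up values:
import numpy as np

x = np.array([0.5, -1.0])   # input
t = 1.0                     # desired output
W = np.array([0.1, 0.2])    # weights
lr = 0.1                    # learning rate

for _ in range(100):
    y = 1.0 / (1.0 + np.exp(-np.dot(W, x)))  # feed-forward operation
    error = y - t                            # compare with the desired output
    dW = error * y * (1 - y) * x             # spread the error back to each weight
    W -= lr * dW                             # gradient descent update, then repeat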
Gradient descent
The gradient descent algorithm performs multidimensional optimization. The objective is to reach the global minimum. Stochastic gradient descent (SGD) is a common implementation. Optimization involves calculating the error value and changing the weights to achieve the minimal error. The direction in which to find the minimum is the negative of the gradient of the loss function. The learning rate determines how big each step should be.
Note that an ANN with nonlinear activations will have local minima. SGD works better in practice for optimizing non-convex cost functions.
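A minimal sketch of gradient descent on a simple one-dimensional loss; the loss function, starting point and learning rate are made up:
# minimize loss(w) = (w - 3)^2, whose gradient is 2 * (w - 3)
w = 0.0                           # initial weight
learning_rate = 0.1               # how big each step is
for _ in range(100):
    gradient = 2 * (w - 3)
    w -= learning_rate * gradient # step in the negative gradient direction
print(w)                          # close to the minimum at w = 3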
Convolutional neural network
If we use regular NNs for images, they will be very large due to the huge number of neurons, leading to overfitting; also, neurons within the same layer don't share any connections. An image can be considered a volume with dimensions of height, width and depth (channels). The neurons of a CNN are arranged in a volumetric fashion to take advantage of this volume. Each layer transforms an input volume into an output volume. CNN filters encode features by transformation: the learned filters detect features or patterns in images, and the deeper the layer, the more abstract the pattern. CNN layers have fewer learnable parameters than dense layers.
Kernel
The kernel is the parameter of the convolution layer used to convolve the image. The convolution has two parameters, the kernel size and the stride. The size can be any rectangular dimension. The stride is the number of pixels the kernel moves at every step. A stride of 1 produces an image of almost the same size, and a stride of 2 produces roughly half the size. Padding the image helps achieve the same size as the input.
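The output size follows from the image size W, kernel size F, padding P and stride S; a small sketch of the standard formula, with example numbers:
def conv_output_size(W, F, P, S):
    # output dimension = (W - F + 2P) / S + 1
    return (W - F + 2 * P) // S + 1

print(conv_output_size(28, 3, 1, 1))  # 28: stride 1 with padding keeps the same size
print(conv_output_size(28, 3, 1, 2))  # 14: stride 2 roughly halves the size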
Max Pooling
Pooling layers are placed between convolution layers. Pooling layers reduce the size of the image across layers by sampling. In max pooling, the sampling is done by selecting the maximum value in a window; average pooling averages over the window. Pooling also acts as a regularization technique to avoid overfitting. Pooling is carried out on all the channels of the features and can be performed with various strides.
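A minimal numpy sketch of 2x2 max pooling with stride 2 on a single-channel feature map; the input values are made up:
import numpy as np

x = np.random.rand(4, 4)                          # a 4x4 feature map
pooled = x.reshape(2, 2, 2, 2).max(axis=(1, 3))   # take the maximum in each 2x2 window
print(pooled.shape)                               # (2, 2): the spatial size is halved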
Recurrent neural networks (RNN)
RNNs can model sequential information. They do not assume that the data points are independent. They perform the same task on every element of a sequence, with the output depending on the previous computations; this can also be thought of as memory. Plain RNNs cannot remember over longer sequences or time spans, because during backprop the gradients can vanish over time. To overcome this problem, an LSTM can be used to remember over a longer time period.
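A minimal sketch of a vanilla RNN step applied across a sequence; the sizes and weights are made up:
import numpy as np

W_xh = np.random.rand(5, 3)   # input-to-hidden weights
W_hh = np.random.rand(5, 5)   # hidden-to-hidden weights (the 'memory' connection)
h = np.zeros(5)               # hidden state
sequence = [np.random.rand(3) for _ in range(4)]   # 4 time steps of 3 features

for x in sequence:
    # the same computation is applied at every step, using the previous hidden state
    h = np.tanh(np.dot(W_xh, x) + np.dot(W_hh, h))
print(h.shape)                # (5,)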
Long short-term memory (LSTM)
An LSTM can store information for longer periods of time and is efficient at capturing long-term dependencies. An LSTM has several gates: forget, input and output. The forget gate maintains (or discards) the information of the previous state. The input gate updates the current state using the input. The output gate decides the information to be passed to the next state. The ability to forget, and to retain only the important things, enables the LSTM to remember over a longer time period.
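A minimal sketch of one LSTM step with its forget, input and output gates; bias terms are omitted and all sizes and weights are made up:
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

n, d = 4, 3                                # hidden size and input size
Wf, Wi, Wo, Wc = (np.random.rand(n, n + d) for _ in range(4))
h, c = np.zeros(n), np.zeros(n)            # hidden state and cell state
x = np.random.rand(d)

z = np.concatenate([h, x])
f = sigmoid(np.dot(Wf, z))                 # forget gate: what to keep from the previous cell state
i = sigmoid(np.dot(Wi, z))                 # input gate: how much of the new candidate to add
o = sigmoid(np.dot(Wo, z))                 # output gate: what to pass to the next state
c = f * c + i * np.tanh(np.dot(Wc, z))     # updated cell state
h = o * np.tanh(c)                         # new hidden state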
Similarity learning
The process of learning how similar two images are. A similarity score can be computed between two images based on their semantic meaning.
Detection or localization and segmentation
Detection or localization is a task that finds an object in an image and localizes the object with a bounding box.
Segmentation is the task of pixel-wise classification. This gives a fine separation of objects.
Classification
Image classification is the task of labelling the whole image with an object or concept, along with a confidence score.
Data augmentation
Data augmentation provides ways to increase the size of the dataset. It introduces noise during training, making the model robust to varied inputs; it is especially useful when the dataset is small. Techniques include flipping, random cropping, shearing, zooming, rotation, whitening (done by PCA, preserving only the important data), normalization (standardizing the mean and variance of the pixels), and channel shifting (colour channels are shifted to make the model robust to colour changes caused by various artifacts).
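A sketch using the Keras ImageDataGenerator; the specific parameter values below are assumptions for illustration, not recommendations:
from keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    horizontal_flip=True,                 # flipping
    width_shift_range=0.1,                # random shifts (assumed value)
    shear_range=0.2,                      # shearing (assumed value)
    zoom_range=0.2,                       # zooming (assumed value)
    rotation_range=15,                    # rotation in degrees (assumed value)
    zca_whitening=True,                   # whitening
    featurewise_center=True,              # normalization: zero mean
    featurewise_std_normalization=True,   # normalization: unit variance
    channel_shift_range=0.1)              # channel shifting (assumed value)
# datagen.fit(x_train) is needed to compute the whitening/normalization statistics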
Transfer Learning
Transfer learning is the process of learning from a pre-trained model that was trained on a larger dataset. Training a model with random initialization often takes time and energy to get the result. Initializing the model with a pre-trained model gives faster convergence, saving time and energy.
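A sketch of initializing from a pre-trained model with Keras, assuming VGG16 weights trained on ImageNet and a hypothetical 10-class task:
from keras.applications import VGG16
from keras.layers import Dense, Flatten
from keras.models import Model

base = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))  # pre-trained model
for layer in base.layers:
    layer.trainable = False                       # freeze the pre-trained filters
x = Flatten()(base.output)
outputs = Dense(10, activation='softmax')(x)      # hypothetical 10-class head
model = Model(inputs=base.input, outputs=outputs)
model.compile(optimizer='adam', loss='categorical_crossentropy')
# training this model converges faster than starting from random initialization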