Definitions Flashcards
An artificial neuron, or perceptron, takes several inputs and performs a weighted summation to produce an output. The weights of the perceptron are determined during training and are based on the training data. The sum is then passed through a unit step function. Because a single perceptron learns only the weights of one weighted sum, it can only learn simple (linearly separable) functions from examples.
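A minimal sketch of the weighted sum followed by a unit step. The weights and bias are hand-picked, hypothetical values that happen to implement logical AND; in practice they would be learned from training data.

```python
import numpy as np

def perceptron(x, w, b):
    """Weighted sum of the inputs, passed through a unit step function."""
    return 1 if np.dot(w, x) + b > 0 else 0

# Hypothetical hand-picked parameters (not learned) that implement logical AND
w_and = np.array([1.0, 1.0])
b_and = -1.5

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
and_outputs = [perceptron(np.array(x), w_and, b_and) for x in inputs]
print(and_outputs)  # [0, 0, 0, 1]
```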
Activation functions
Activation functions make neural nets nonlinear. An activation function decides whether a perceptron should fire or not. During training, activation functions play an important role in adjusting the gradients. An activation function such as sigmoid attenuates values with higher magnitudes. This nonlinear behaviour gives deep nets the ability to learn complex functions. Most activation functions are continuous and differentiable, except the rectified linear unit at 0. A continuous function has small changes in output for every small change in input. A differentiable function has a derivative at every point in its domain.
The sigmoid can be considered a smoothed step function and is hence differentiable. Sigmoid is useful for converting any value to a probability and can be used for binary classification. It maps input to a value in the range of 0 to 1. For inputs of large magnitude the curve is nearly flat, so the change in Y with respect to X is small and the gradients vanish. After some learning, the weight updates may therefore become very small.
Tanh is a scaled version of sigmoid and reduces, but does not eliminate, the problem of a vanishing gradient. The hyperbolic tangent function is also smooth and differentiable. Tanh maps input to a value in the range of -1 to 1. The gradients are more stable than those of sigmoid, so there are fewer vanishing gradient problems. Both sigmoid and tanh fire all the time, making the ANN really heavy. The ReLU activation function avoids this pitfall by not firing at times.
The Rectified Linear Unit (ReLU) can let big numbers pass through. This makes a few neurons stale so that they don't fire, which increases the sparsity. The ReLU maps input x to max(0, x); that is, negative inputs are mapped to 0 and positive inputs are output without any change. Because ReLU doesn't fire all the time, it can be trained faster. Since the function is simple, it is computationally the least expensive.
# forward pass (assuming the pre-activation is np.dot(W, x))
z = np.maximum(0, np.dot(W, x))
# backward pass: local gradient for W
dW = np.outer(z > 0, x)
Should a neuron get clamped to zero in the forward pass (z=0, it doesn't fire), then its weights will get a zero gradient. This can lead to the dead ReLU problem. If a ReLU neuron is unfortunately initialized in such a way that it never fires, or if a neuron's weights ever get knocked off with a large update during training, the neuron will remain permanently dead, i.e. permanent, irrecoverable brain damage.
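A small sketch of the sigmoid and its derivative, to make the flat tails concrete. The sample inputs are arbitrary illustration values.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # Derivative of the sigmoid: s * (1 - s)
    s = sigmoid(x)
    return s * (1.0 - s)

xs = np.array([-10.0, 0.0, 10.0])  # arbitrary sample inputs
values = sigmoid(xs)
grads = sigmoid_grad(xs)

# The gradient peaks at 0.25 for x = 0 and is close to zero in both tails
print(values)
print(grads)
```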
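The "scaled version of sigmoid" relationship can be checked numerically with the identity tanh(x) = 2 * sigmoid(2x) - 1. The sketch also compares the gradients at zero, which is one way to see why tanh gradients are larger than sigmoid gradients near the origin.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

xs = np.linspace(-3.0, 3.0, 7)  # arbitrary sample inputs

# tanh written as a shifted and rescaled sigmoid
tanh_from_sigmoid = 2.0 * sigmoid(2.0 * xs) - 1.0

# Gradients at x = 0: tanh'(0) = 1, sigmoid'(0) = 0.25
tanh_grad_at_zero = 1.0 - np.tanh(0.0) ** 2
sigmoid_grad_at_zero = sigmoid(0.0) * (1.0 - sigmoid(0.0))
```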
Artificial neural network (ANN)
An ANN is a collection of perceptrons and activation functions. The perceptrons are connected to form hidden layers or units. The hidden units form the nonlinear basis that maps the input layers to output layers in a lower-dimensional space. An ANN is a map from input to output, computed by weighted addition of the inputs with biases.
One-hot encoding
A way to represent the target variables or classes in a classification problem. The target variables can be converted from string labels to one-hot encoded vectors. A one-hot vector is filled with 1 at the index of the target class and 0 everywhere else. For 1000 classes, each one-hot vector has 1000 entries: all zeros except a single 1. It makes no assumptions about the similarity of target variables.
Softmax is a way of forcing the neural network outputs to sum to 1. The output values of the softmax function can therefore be treated as a probability distribution. This is useful in multi-class classification problems. Softmax is a kind of activation function with the special property that its outputs sum to 1. It converts the outputs to probabilities by exponentiating each output and dividing by the sum of the exponentials of all the outputs. The Euclidean distance can be computed between softmax probabilities and one-hot encoding for optimization, but cross-entropy is a better cost function to optimize.
The softmax function converts its inputs, known as logits or logit scores, to values between 0 and 1, and also normalizes the outputs so they sum to 1. In short, it turns logits into probabilities.
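A minimal sketch of converting string labels to one-hot vectors with NumPy. The label names are hypothetical, and the class order (alphabetical) is just one possible convention.

```python
import numpy as np

labels = ["cat", "dog", "bird", "dog"]  # hypothetical string labels

# Assign each class an index; alphabetical order is an arbitrary choice
classes = sorted(set(labels))           # ['bird', 'cat', 'dog']
class_index = {name: i for i, name in enumerate(classes)}
indices = np.array([class_index[name] for name in labels])

# Row i of the identity matrix is the one-hot vector for class i
one_hot = np.eye(len(classes), dtype=int)[indices]
print(one_hot)
```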
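with tf.Session() as sess:
    # assumes the op being evaluated is tf.nn.softmax(logits) (TensorFlow 1.x)
    output = sess.run(tf.nn.softmax(logits), feed_dict={logits: logit_data})
NB: the logits come from the linear function WX + b:
logits = tf.add(tf.matmul(features, weights), biases)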
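For reference, softmax can also be sketched in plain NumPy. Subtracting the maximum logit before exponentiating is a common numerical-stability trick and is an addition here, not something the notes prescribe; the logit values are hypothetical.

```python
import numpy as np

def softmax(logits):
    # Shifting by the max does not change the result but avoids overflow
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=-1, keepdims=True)

logit_data = np.array([2.0, 1.0, 0.1])  # hypothetical logits
probs = softmax(logit_data)
print(probs, probs.sum())
```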
Cross-entropy measures the distance between the outputs of softmax and the one-hot encoding. It is a loss function, so the error has to be minimized. Neural networks estimate the probability of the given data for every class, and the probability of the correct target label has to be maximized. Cross-entropy is the summation of negative logarithmic probabilities. Logarithmic values are used for numerical stability. Maximizing a function is equivalent to minimizing the negative of the same function.
Cross-entropy is not symmetric: D(S,L) != D(L,S)
cross_entropy = -tf.reduce_sum(labels * tf.log(prediction), reduction_indices=1)
# training loss
loss = tf.reduce_mean(cross_entropy)
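The asymmetry D(S, L) != D(L, S) can be seen in a small NumPy sketch. The probabilities are made up, and the clipping constant `eps` is an assumption added only so that log(0) does not produce infinities.

```python
import numpy as np

def cross_entropy(S, L, eps=1e-12):
    """D(S, L) = -sum(L * log(S)), with S clipped away from zero."""
    return float(-np.sum(L * np.log(np.clip(S, eps, 1.0))))

S = np.array([0.7, 0.2, 0.1])  # hypothetical softmax output
L = np.array([1.0, 0.0, 0.0])  # one-hot label

d_sl = cross_entropy(S, L)     # equals -log(0.7)
d_ls = cross_entropy(L, S)     # arguments swapped: a very different value
print(d_sl, d_ls)
```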
Dropout is an effective way of regularizing neural networks to avoid overfitting. During training, the dropout layer cripples the neural network by removing hidden units stochastically.
Dropout is also an efficient way of combining several neural networks. For each training case, we randomly select a few hidden units so that we end up with different architectures for each case. This is an extreme case of bagging and model averaging. Dropout should not be applied during inference, where all hidden units are kept.
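A sketch of dropout on a batch of activations. It uses the "inverted dropout" convention (scaling the kept units by 1/keep_prob during training so that inference needs no rescaling), which is an assumption here; the notes do not say which variant is meant. The function name and `keep_prob` value are illustrative.

```python
import numpy as np

def dropout(activations, keep_prob, rng, training=True):
    if not training:
        # Inference: every hidden unit is kept, nothing is rescaled
        return activations
    # Training: drop each unit independently with probability 1 - keep_prob
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
h = np.ones((4, 1000))  # hypothetical hidden activations

h_train = dropout(h, keep_prob=0.8, rng=rng)
h_infer = dropout(h, keep_prob=0.8, rng=rng, training=False)
```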
Batch normalization
Batch-norm increases the stability and performance of neural network training. It normalizes the output of a layer across each batch to have zero mean and a standard deviation of 1. This can reduce overfitting and makes the network train faster.
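A training-time sketch of the normalization step. The learnable scale `gamma`, shift `beta`, and the small constant `eps` follow the usual formulation and are assumptions beyond what the notes state; running statistics for inference are left out.

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    # Statistics are computed per feature, across the batch dimension
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
layer_output = rng.normal(loc=5.0, scale=3.0, size=(64, 8))  # hypothetical batch
normalized = batch_norm(layer_output)
```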
L1 and L2 regularization
L1 penalizes the absolute value of the weights and tends to drive some of them to exactly zero.
L2 penalizes the squared value of the weights and tends to make them smaller during training.
Both regularizers assume that models with smaller weights are better.
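The two penalties and their gradients can be sketched for an arbitrary weight vector. The gradients hint at the different behaviour: the L1 gradient has constant magnitude regardless of weight size, while the L2 gradient shrinks as the weight shrinks.

```python
import numpy as np

w = np.array([0.5, -2.0, 0.0, 3.0])  # hypothetical weights

l1_penalty = np.sum(np.abs(w))  # sum of absolute values
l2_penalty = np.sum(w ** 2)     # sum of squares

l1_grad = np.sign(w)            # constant-size push towards zero
l2_grad = 2.0 * w               # push proportional to the weight
print(l1_penalty, l2_penalty)   # 5.5 13.25
```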
Backpropagation is the algorithm commonly used for training ANNs. The weights are updated backwards, from the output layer towards the input, based on the calculated error. After calculating the error, gradient descent can be used to calculate the weight updates.
In a nutshell, it consists of:
- Doing a feedforward operation
- Comparing the output of the model with the desired output
- Calculating the error
- Running the feedforward operation backwards (backprop) to spread the error to each of the weights
- Using this to update the weights and get a better model
- Repeat
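The steps above can be sketched for a tiny network with one tanh hidden layer and a mean-squared-error loss. The data, layer sizes, and learning rate are all hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 3))              # hypothetical inputs
y = np.sin(X.sum(axis=1, keepdims=True))  # hypothetical desired outputs

W1 = rng.normal(scale=0.5, size=(3, 8))
b1 = np.zeros(8)
W2 = rng.normal(scale=0.5, size=(8, 1))
b2 = np.zeros(1)
learning_rate = 0.05
losses = []

for step in range(200):
    # Feedforward operation
    h = np.tanh(X @ W1 + b1)
    pred = h @ W2 + b2
    # Compare with the desired output and calculate the error
    error = pred - y
    losses.append(float(np.mean(error ** 2)))
    # Backward pass: spread the error to each weight via the chain rule
    d_pred = 2.0 * error / len(X)
    dW2 = h.T @ d_pred
    db2 = d_pred.sum(axis=0)
    d_h = d_pred @ W2.T
    d_z = d_h * (1.0 - h ** 2)            # derivative of tanh
    dW1 = X.T @ d_z
    db1 = d_z.sum(axis=0)
    # Update the weights, then repeat
    W1 -= learning_rate * dW1
    b1 -= learning_rate * db1
    W2 -= learning_rate * dW2
    b2 -= learning_rate * db2

print(losses[0], losses[-1])
```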
Gradient descent
The gradient descent algorithm performs multidimensional optimization. The objective is to reach the global minimum. A common implementation is stochastic gradient descent (SGD). Optimization involves calculating the error value and changing the weights to achieve the minimal error. The direction of steepest descent towards the minimum is the negative of the gradient of the loss function. The learning rate determines how big each step should be.
Note that an ANN with nonlinear activations will have local minima. SGD often works well in practice for optimizing non-convex cost functions.
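A one-dimensional sketch of gradient descent and the role of the learning rate, using the toy loss (w - 3)^2, which is convex and chosen only for illustration. A small step size converges to the minimum at w = 3; a step size that is too large overshoots further on every step.

```python
def grad(w):
    # Gradient of the toy loss (w - 3) ** 2
    return 2.0 * (w - 3.0)

def gradient_descent(w, learning_rate, steps):
    for _ in range(steps):
        w -= learning_rate * grad(w)  # step against the gradient
    return w

w_good = gradient_descent(0.0, learning_rate=0.1, steps=100)
w_bad = gradient_descent(0.0, learning_rate=1.1, steps=20)
print(w_good, w_bad)
```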
Convolutional neural network
If we use regular NNs for images, they will be very large due to the huge number of neurons, which leads to overfitting, and neurons within the same layer don’t share any connections. An image can be considered a volume with dimensions of height, width and depth (channels). The neurons of a CNN are arranged in a volumetric fashion to take advantage of the volume. Each of the layers transforms the input volume to an output volume. CNN filters encode by transformation. The learned filters detect features/patterns in images. The deeper the layer, the more abstract the pattern is. A CNN layer has fewer learnable parameters than a dense layer.
The kernel is the parameter of the convolution layer that is used to convolve the image. The kernel has two parameters, the stride and the size. The size can be any dimension of a rectangle. Stride is the number of pixels the kernel moves each time. A stride of 1 produces an image of almost the same size, and a stride of 2 produces roughly half the size. Padding the image helps keep the output the same size as the input.
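The effect of kernel size, stride, and padding on the output size can be sketched with the standard formula floor((n + 2p - k) / s) + 1 for one spatial dimension. The formula is not given in the notes, and the function name and example sizes are illustrative.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Output length along one spatial dimension of a convolution."""
    return (input_size + 2 * padding - kernel_size) // stride + 1

no_padding = conv_output_size(32, kernel_size=3, stride=1)               # 30
same_size = conv_output_size(32, kernel_size=3, stride=1, padding=1)     # 32
half_size = conv_output_size(32, kernel_size=3, stride=2, padding=1)     # 16
print(no_padding, same_size, half_size)
```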
Max Pooling
Pooling layers are placed between convolution layers. Pooling layers reduce the size of the images across layers by sampling. The sampling is done by selecting the maximum value in a window. Average pooling averages over the window. Pooling also acts as a regularization technique to avoid overfitting. Pooling is carried out on all the channels of features. Pooling can also be performed with various strides.
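A minimal sketch of 2x2 max pooling with stride 2 on a single channel, using a reshape so that each window gets its own pair of axes. The feature map values are made up; odd trailing rows or columns are simply dropped in this sketch.

```python
import numpy as np

def max_pool_2x2(feature_map):
    h, w = feature_map.shape
    # Drop a trailing row/column if the size is odd
    trimmed = feature_map[: h - h % 2, : w - w % 2]
    # Axes 1 and 3 index the position inside each 2x2 window
    windows = trimmed.reshape(h // 2, 2, w // 2, 2)
    return windows.max(axis=(1, 3))

feature_map = np.array([[1, 3, 2, 0],
                        [4, 2, 1, 1],
                        [0, 0, 5, 6],
                        [1, 2, 7, 8]])
pooled = max_pool_2x2(feature_map)
print(pooled)  # [[4 2]
               #  [2 8]]
```

Replacing `max` with `mean` in the last line of the function would give average pooling.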
Recurrent neural networks (RNN)
RNNs can model sequential information. They do not assume that the data points are independent. They perform the same task for every element of a sequence, using the output from the previous element. This can also be thought of as memory. Plain RNNs struggle to remember over longer sequences or time spans: during backprop, the gradients can vanish over time. To overcome this problem, LSTM can be used to remember over a longer time period.
Long short-term memory (LSTM)
LSTM can store information for longer periods of time and is efficient at capturing long-term dependencies. LSTM has several gates: forget, input and output. The forget gate decides how much information from the previous state is kept. The input gate updates the current state using the input. The output gate decides the information to be passed to the next state. The ability to forget and retain only the important things enables LSTM to remember over a longer time period.
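A sketch of one LSTM time step showing the three gates. It follows the commonly used formulation with a single stacked weight matrix; the layout of `W`, the sizes, and the random inputs are assumptions for illustration, not details from the notes.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    """One time step; W maps [h_prev, x] to four stacked pre-activations."""
    n = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x]) + b
    f = sigmoid(z[:n])           # forget gate: how much of c_prev to keep
    i = sigmoid(z[n:2 * n])      # input gate: how much new content to write
    o = sigmoid(z[2 * n:3 * n])  # output gate: how much state to expose
    g = np.tanh(z[3 * n:])       # candidate content from the current input
    c = f * c_prev + i * g       # updated cell state
    h = o * np.tanh(c)           # hidden state passed to the next step
    return h, c

rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4   # hypothetical sizes
W = rng.normal(scale=0.1, size=(4 * hidden_size, hidden_size + input_size))
b = np.zeros(4 * hidden_size)

h = np.zeros(hidden_size)
c = np.zeros(hidden_size)
for x in rng.normal(size=(5, input_size)):  # a sequence of 5 inputs
    h, c = lstm_step(x, h, c, W, b)
```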
Similarity learning
The process of learning how two images are similar. A score can be computed between two images based on the semantic meaning.
Detection or localization and segmentation
Detection or localization is a task that finds an object in an image and localizes the object with a bounding box.
Segmentation is the task of pixel-wise classification. This gives a fine separation of objects.
Image classification is the task of labelling the whole image with an object/concept with confidence.
Data augmentation
Data augmentation gives ways to increase the size of the dataset. Data augmentation introduces noise during training, producing robustness in the model to various inputs. Useful when dataset is small. Techniques include flipping, random cropping, shearing, zooming, rotation, whitening (done by PCA that preserves only the important data), normalization (normalizes the pixels by standardizing the mean and variance), channel shifting (colour channels are shifted to make the model robust to colour changes caused by various artifacts).
Transfer Learning
Transfer learning is the process of learning from a pre-trained model that was trained on a larger dataset. Training a model with random initialization often takes time and energy to get the result. Initializing the model with a pre-trained model gives faster convergence, saving time and energy.
Training on bottleneck features
Complex models can be built from simpler models, not from scratch. Bottleneck features are extracted and the classifier is trained on them. Bottleneck features are the features produced by complex architectures trained on several million images. The images are run through a forward pass and the pre-final layer features are stored. From these, a simple logistic classifier is trained for classification. This gives a different approach to training the model and is useful when training data is scarce. This is often a faster method to train a model. Only the final activations of the pre-trained model are used to adapt to the new task. Example: VGG (exclude top layer) - run the image through it via prediction, then feed the pre-final layer features into a sequential model for training.
In fine-tuning, a pre-trained model is loaded and only a few layers are trained. Use it when the dataset is smaller. Training a deep network from scratch on a small dataset leads to overfitting, which fine-tuning can mitigate. The bigger dataset that the model was trained on should be similar to the smaller one, in the hope that the activation features carry over. Load VGG, set the initial layers to non-trainable, and replace the fully connected layers with new trainable layers.
Depending on the data size, the number of layers to fine-tune can be determined. The less data there is, the fewer layers should be fine-tuned.
Underfitting happens when the model is too small and is indicated by low training accuracy. Underfitting can be addressed by the following: (1) more data (2) trying a bigger model (3) if the data is small, trying transfer learning and/or data augmentation.
Overfitting happens when the model is too big and there is a large gap between training and testing accuracies. Solution: (1) regularizing with dropout and/or batch norm (2) data augmentation.
Class imbalance
Class imbalance can be dealt with by weighting the loss function.
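One way to weight the loss is sketched below: a cross-entropy where each example is scaled by a per-class weight. Inverse-frequency weights are used here as one common heuristic; that choice, the function name, and the toy predictions are assumptions rather than something the notes specify.

```python
import numpy as np

def weighted_cross_entropy(probs, labels, class_weights):
    # Probability the model assigned to the true class of each example
    picked = probs[np.arange(len(labels)), labels]
    weights = class_weights[labels]
    return float(np.sum(-weights * np.log(picked)) / np.sum(weights))

labels = np.array([0, 0, 0, 1])     # class 1 is the rare class
counts = np.bincount(labels)        # [3, 1]
# Inverse-frequency weights (assumed heuristic): rare classes count more
class_weights = len(labels) / (len(counts) * counts)

probs = np.array([[0.9, 0.1],       # hypothetical predicted probabilities
                  [0.8, 0.2],
                  [0.7, 0.3],
                  [0.4, 0.6]])
loss = weighted_cross_entropy(probs, labels, class_weights)
```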
Image Retrieval
Deep learning can also be called representation learning because the features or representations in the model are learned during training. The visual features generated during the training process in the hidden layers can be used for computing a distance metric. These models learn how to detect edges, patterns etc, depending on the classification task. These can be used to compute similarity between a query image and the set of targets using those features and increase the speed of the retrieval system.
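A minimal retrieval sketch: rank target images by cosine similarity between feature vectors. The 3-dimensional vectors stand in for hidden-layer features and are made up; cosine similarity is one possible metric, not necessarily the one the notes have in mind.

```python
import numpy as np

def rank_by_cosine_similarity(query, targets):
    q = query / np.linalg.norm(query)
    t = targets / np.linalg.norm(targets, axis=1, keepdims=True)
    scores = t @ q
    # Highest similarity first
    return np.argsort(-scores), scores

# Hypothetical feature vectors, as if extracted from a hidden layer
targets = np.array([[1.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0],
                    [0.9, 0.1, 0.0]])
query = np.array([1.0, 0.0, 0.0])

order, scores = rank_by_cosine_similarity(query, targets)
print(order)  # [0 2 1]
```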
Deep Learning Visualisations
Deep learning models are often called a ‘black box’: they are non-linear due to their activation functions, so they cannot be visualised easily. However, visualisation can be done using the activations and gradients of the model. The activations can be visualized using:
1. Nearest neighbor - a layer activation of an image can be taken and the nearest images of that activation can be seen together.
2. Dimensionality reduction - the dimension of the activation can be reduced by PCA and t-SNE for visualizing in two/three dimensions. PCA reduces the dimension by projecting the values in the direction of max variance. t-SNE reduces the dimension by mapping points that are close in the original space to nearby points in two or three dimensions.
3. Maximal patches - one neuron is activated and the corresponding patch with maximum activation is captured
4. Occlusion: the images are occluded (obstructed) at various positions and the activation is shown as heat maps to understand what portions of the images are important.
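The PCA projection mentioned in the list can be sketched with an SVD of the centered activations. The random activations stand in for a real layer's output, and the function name is illustrative.

```python
import numpy as np

def pca_project(activations, n_components=2):
    # Center the data, then project onto the top right singular vectors,
    # which are the directions of maximum variance
    centered = activations - activations.mean(axis=0)
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T, singular_values

rng = np.random.default_rng(0)
# Hypothetical layer activations: 100 images, 16 features of varying scale
activations = rng.normal(size=(100, 16)) * np.linspace(5.0, 0.1, 16)

projected, singular_values = pca_project(activations)
print(projected.shape)  # (100, 2)
```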
The neuron activations can be amplified at some layer in the network rather than synthesizing the image (as in guided backprop). This concept of amplifying the original image to see the effect of features is called DeepDream.
- Take an image and pick a layer from CNN
- Take the activations at a particular layer.
- Modify the gradient such that the gradient and activations are equal.
- Compute the gradients of the image and backpropagate.
- Image has to be jittered and normalized using regularization
- The pixel values should be clipped
- Multi-scale processing of the image is done to produce the fractal effect.
Model inference
Any new data can be passed to the model to get the results. This process of getting the classification results or features from an image is termed as inference.
One-shot learning
The technique of learning with just one example. In this case, the model can be shown a single image and can tell whether a new image is similar to it. Most similarity learning tasks require both positive and negative pairs for training.
Vanishing Gradient Problem
If the weight initialization of a NN is sloppy, these non-linearities can saturate and stop learning. Training loss will be flat and refuse to go down. For example, if your weight matrix W is initialized too large, the output of the matrix multiplication could have a very large range, which in turn will make all the outputs in the vector z almost binary: 1 or 0 (using sigmoid). If this is the case, then z*(1-z), which is the local gradient of the sigmoid non-linearity, will become 0 (vanish) in both cases, which will make the gradient for both x and W also zero. The rest of the backward pass will come out all zero from this point onward on account of the multiplication in the chain rule.
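The saturation effect can be reproduced numerically by comparing the local sigmoid gradient z*(1-z) under a small and a large weight initialization. The layer sizes and the two scales are arbitrary choices for illustration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
x = rng.normal(size=100)  # hypothetical input vector

mean_local_grad = {}
for name, scale in (("small_init", 0.01), ("large_init", 10.0)):
    W = rng.normal(scale=scale, size=(50, 100))
    z = sigmoid(W @ x)
    # Local gradient of the sigmoid non-linearity
    mean_local_grad[name] = float(np.mean(z * (1.0 - z)))

# Small weights keep z near 0.5 (gradient near 0.25);
# large weights push z towards 0 or 1 (gradient near 0)
print(mean_local_grad)
```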
Chain rule
used for computing the derivative of the composition of two or more functions.
Elastic-net regularization
Combines the L1 regularization with the L2 regularization: $\lambda_1 \lvert w \rvert + \lambda_2 w^2$
Max-norm constraints regularization
Another form of regularization is to enforce an absolute upper bound on the magnitude of the weight vector for every neuron and use projected gradient descent to enforce the constraint.
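The projection step can be sketched as follows: after a gradient update, any neuron whose weight vector is longer than the bound is rescaled back onto it. The sketch assumes each row of `W` holds one neuron's weights; the bound of 2.0 and the weights are made-up values.

```python
import numpy as np

def apply_max_norm(W, max_norm):
    # One row per neuron (assumed layout)
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    # Scale factor is 1 for rows already within the bound
    scale = np.minimum(1.0, max_norm / np.maximum(norms, 1e-12))
    return W * scale

W = np.array([[3.0, 4.0],    # norm 5.0: exceeds the bound
              [0.3, 0.4]])   # norm 0.5: left unchanged
W_projected = apply_max_norm(W, max_norm=2.0)
print(np.linalg.norm(W_projected, axis=1))
```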
Curse of dimensionality
The number of data points needed to fill the available space grows exponentially with the number of dimensions (or plot axes). If a classifier is not fed with data points that span the entire feature space, the classifier will not know what to do once a new data point is presented that lies far away from all the previously encountered data points.
In practice, the curse of dimensionality means that for a given sample size, there is a maximum number of features, above which the performance of our classifier will degrade rather than improve.