CNN models et al. Flashcards
AlexNet
Krizhevsky et al. proposed AlexNet, which won the ImageNet 2012 challenge. It is a relatively simple architecture with 5 convolutional layers (plus 3 fully connected layers). AlexNet used ReLU as the activation function and found that it trained several times faster than saturating activations such as tanh. The paper also used data augmentation techniques such as image translations, horizontal flips and random cropping. Dropout layers in the fully connected part prevent overfitting. The model was trained with SGD, with momentum and weight decay held at fixed values, while the learning rate was reduced a few times over the course of training. The paper also introduced Local Response Normalization (LRN): the LRN layers normalize each activation across neighbouring filters to avoid huge activations in any particular filter. LRN is rarely used anymore, as later research found little improvement from it. AlexNet has about 60 million parameters in total.
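A minimal tf.keras sketch of the AlexNet layer pattern (5 conv + 3 FC). The input size, filter counts and training settings are approximations of the paper; LRN and the original two-GPU split are omitted.

```python
import tensorflow as tf

alexnet = tf.keras.Sequential([
    tf.keras.layers.Conv2D(96, 11, strides=4, activation='relu',
                           input_shape=(227, 227, 3)),
    tf.keras.layers.MaxPooling2D(3, strides=2),
    tf.keras.layers.Conv2D(256, 5, padding='same', activation='relu'),
    tf.keras.layers.MaxPooling2D(3, strides=2),
    tf.keras.layers.Conv2D(384, 3, padding='same', activation='relu'),
    tf.keras.layers.Conv2D(384, 3, padding='same', activation='relu'),
    tf.keras.layers.Conv2D(256, 3, padding='same', activation='relu'),
    tf.keras.layers.MaxPooling2D(3, strides=2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(4096, activation='relu'),
    tf.keras.layers.Dropout(0.5),   # dropout against overfitting
    tf.keras.layers.Dense(4096, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1000, activation='softmax'),
])
# SGD with momentum, as in the paper's setup (weight decay and the LR schedule omitted here)
alexnet.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9),
                loss='categorical_crossentropy')
```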
VGG-16 model
VGG stands for the Visual Geometry Group at Oxford, which proposed the model. It has greater depth than AlexNet; the paper presented two main variants, 16 and 19 layers deep. All convolutional layers use 3x3 filters with stride 1 and padding 1, and max pooling uses 2x2 windows with stride 2. Stacking small 3x3 filters gives the receptive field of larger filters with fewer parameters. Although the spatial size shrinks with each max pooling, the number of filters grows with depth: the deeper the network gets, the smaller the feature maps and the more filters are used. The model has about 138 million parameters, but the architecture is very uniform. One of the data augmentation techniques used was scale jittering, where the shorter side of the training image is rescaled to a random size so that objects appear at varying scales.
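A hedged tf.keras sketch of the VGG-16 layout, assuming standard 224x224 RGB inputs; it shows how 3x3 same-padded convolutions stack into blocks while the filter count doubles and the spatial size halves.

```python
import tensorflow as tf

def vgg_block(x, filters, n_convs):
    # n_convs 3x3 same-padded convolutions, then 2x2 max pooling with stride 2
    for _ in range(n_convs):
        x = tf.keras.layers.Conv2D(filters, 3, padding='same', activation='relu')(x)
    return tf.keras.layers.MaxPooling2D(2, strides=2)(x)

inputs = tf.keras.Input(shape=(224, 224, 3))
x = vgg_block(inputs, 64, 2)   # 224 -> 112
x = vgg_block(x, 128, 2)       # 112 -> 56
x = vgg_block(x, 256, 3)       # 56  -> 28
x = vgg_block(x, 512, 3)       # 28  -> 14
x = vgg_block(x, 512, 3)       # 14  -> 7: smaller maps, more filters
x = tf.keras.layers.Flatten()(x)
x = tf.keras.layers.Dense(4096, activation='relu')(x)
x = tf.keras.layers.Dense(4096, activation='relu')(x)
outputs = tf.keras.layers.Dense(1000, activation='softmax')(x)
vgg16 = tf.keras.Model(inputs, outputs)   # roughly 138 million parameters
```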
Google Inception-V3 model
Szegedy et al. introduced the Inception module, aimed at better generalization and at efficiency in both speed and model size; the original Inception architecture (GoogLeNet) won the ImageNet 2014 challenge, and Inception-V3 is a later refinement. GoogLeNet has roughly 12 times fewer parameters than AlexNet. Inception is the micro-architecture on which the macro-architecture is built. Each hidden layer produces a higher-level representation of the image. Instead of committing to a single kernel size (or to pooling) at each layer, an Inception module runs several kernel sizes and a pooling branch in parallel on the same input and concatenates their outputs. The branches operate in parallel, unlike the purely sequential AlexNet or VGG. Concatenating the branches would produce a huge output volume, so 1x1 convolutions are used for dimensionality reduction, cutting the number of feature maps and the computation and memory needed at inference. GoogLeNet stacks 9 Inception modules, about 100 layers (building blocks) in total.
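A rough sketch of a single GoogLeNet-style Inception module in tf.keras (Inception-V3 itself factorizes the larger kernels further); the branch widths are placeholder arguments, not values from the paper.

```python
import tensorflow as tf
from tensorflow.keras import layers

def inception_module(x, f1, f3_reduce, f3, f5_reduce, f5, f_pool):
    # Parallel branches over the same input; 1x1 convolutions reduce the
    # number of feature maps before the larger kernels.
    b1 = layers.Conv2D(f1, 1, padding='same', activation='relu')(x)
    b2 = layers.Conv2D(f3_reduce, 1, padding='same', activation='relu')(x)
    b2 = layers.Conv2D(f3, 3, padding='same', activation='relu')(b2)
    b3 = layers.Conv2D(f5_reduce, 1, padding='same', activation='relu')(x)
    b3 = layers.Conv2D(f5, 5, padding='same', activation='relu')(b3)
    b4 = layers.MaxPooling2D(3, strides=1, padding='same')(x)
    b4 = layers.Conv2D(f_pool, 1, padding='same', activation='relu')(b4)
    # Branch outputs are concatenated along the channel axis
    return layers.Concatenate()([b1, b2, b3, b4])
```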
The Microsoft ResNet-50 model
ResNet was proposed by He et al. and won ImageNet in 2015. It showed that much deeper networks can be trained. Every couple of layers is wrapped with an identity skip connection to form a residual block; these shortcuts let backpropagation carry the error signal directly to the earlier layers.
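A minimal sketch of the basic two-layer residual block in tf.keras (ResNet-50 itself uses a three-layer bottleneck variant). It assumes the input and output channel counts match so the identity shortcut can be added directly.

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters):
    shortcut = x                                     # identity skip connection
    y = layers.Conv2D(filters, 3, padding='same')(x)
    y = layers.BatchNormalization()(y)
    y = layers.Activation('relu')(y)
    y = layers.Conv2D(filters, 3, padding='same')(y)
    y = layers.BatchNormalization()(y)
    y = layers.Add()([y, shortcut])                  # shortcut lets gradients flow straight back
    return layers.Activation('relu')(y)
```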
The SqueezeNet model
The SqueezeNet model was introduced by Iandola et al. to reduce the model size and the number of parameters. The network was made smaller by replacing most 3x3 filters with 1x1 filters and by reducing the number of input channels to the remaining 3x3 filters. Downsampling happens late in the network, so the convolutional layers operate on large activation maps.
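A rough tf.keras sketch of the Fire module that implements these ideas; the filter counts are placeholders rather than values from the paper.

```python
import tensorflow as tf
from tensorflow.keras import layers

def fire_module(x, squeeze_filters, expand_filters):
    # 1x1 "squeeze" layer limits how many channels reach the 3x3 filters,
    # then 1x1 and 3x3 "expand" branches are concatenated.
    s = layers.Conv2D(squeeze_filters, 1, activation='relu')(x)
    e1 = layers.Conv2D(expand_filters, 1, padding='same', activation='relu')(s)
    e3 = layers.Conv2D(expand_filters, 3, padding='same', activation='relu')(s)
    return layers.Concatenate()([e1, e3])
```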
Spatial transformer networks
The spatial transformer networks proposed by Jaderberg et al. transform the image before passing it to the CNN. This differs from the other networks in that the image is modified before convolution: a small localization network learns the parameters of an affine transformation, which is then applied to the image. Applying the learned affine transformation provides spatial invariance, which in the previous networks was achieved only through max-pooling layers.
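A hedged tf.keras sketch of the localization network that regresses the six affine parameters; the layer sizes are illustrative, and the grid generator plus bilinear sampler that actually warps the image is not shown (it is not part of core Keras).

```python
import tensorflow as tf
from tensorflow.keras import layers

def identity_affine_init(shape, dtype=None):
    # Start from the identity transform [1 0 0; 0 1 0], so training begins with no warping
    return tf.constant([1, 0, 0, 0, 1, 0], dtype=dtype)

def localization_network(image):
    # Predicts the 6 parameters of a 2D affine transform from the input image
    x = layers.Conv2D(16, 5, strides=2, activation='relu')(image)
    x = layers.Conv2D(32, 5, strides=2, activation='relu')(x)
    x = layers.Flatten()(x)
    x = layers.Dense(64, activation='relu')(x)
    theta = layers.Dense(6, kernel_initializer='zeros',
                         bias_initializer=identity_affine_init)(x)
    return theta
```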
The DenseNet model
DenseNet is an extension of ResNet proposed by Huang et al. In ResNet blocks, the previous layer is merged into the future layer by summation; in DenseNet, it is merged by concatenation. Within a block, DenseNet connects each layer to all previous layers and to all following layers. This provides several advantages, such as smoother gradient flow and feature reuse, and it also reduces the number of parameters.
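A minimal tf.keras sketch of a dense block; the growth rate and layer layout are simplified relative to the paper, which also uses 1x1 bottleneck convolutions and transition layers between blocks.

```python
import tensorflow as tf
from tensorflow.keras import layers

def dense_block(x, num_layers, growth_rate):
    # Each new layer's output is concatenated onto everything produced so far,
    # so every layer sees the feature maps of all previous layers.
    for _ in range(num_layers):
        y = layers.BatchNormalization()(x)
        y = layers.Activation('relu')(y)
        y = layers.Conv2D(growth_rate, 3, padding='same')(y)
        x = layers.Concatenate()([x, y])   # concatenation, instead of ResNet's summation
    return x
```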
Autoencoders
An autoencoder is an unsupervised algorithm for generating efficient encodings. The input and the target output are typically the same. The layers in between first decrease and then increase in size, with a bottleneck layer of reduced dimension in the middle. Everything to the left of the bottleneck is the encoder and everything to the right is the decoder: the encoder reduces the dimension of the data, the decoder increases it again, and together they form the autoencoder. The whole network is trained on the reconstruction error. In principle, only the bottleneck representation needs to be stored, and the original data can be reconstructed by the decoder network, so the autoencoder performs dimensionality reduction.
- Define a convolution layer
- Define a deconvolution (transposed convolution) layer (tf.layers.conv2d_transpose)
- Define a fully connected layer
An autoencoder is a lossy compression algorithm; it learns the compression pattern from the data. The bottleneck layer can also be made larger than the preceding layers; autoencoders with such diverging and converging connections are sparse autoencoders.
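A hedged sketch of a small convolutional autoencoder in tf.keras using the three building blocks listed above (tf.keras.layers.Conv2DTranspose is the modern equivalent of tf.layers.conv2d_transpose); the 28x28 input and the layer sizes are just illustrative.

```python
import tensorflow as tf
from tensorflow.keras import layers

inputs = tf.keras.Input(shape=(28, 28, 1))

# Encoder: convolutions shrink the spatial dimensions down to a small bottleneck
x = layers.Conv2D(32, 3, strides=2, padding='same', activation='relu')(inputs)   # 28 -> 14
x = layers.Conv2D(64, 3, strides=2, padding='same', activation='relu')(x)        # 14 -> 7
x = layers.Flatten()(x)
bottleneck = layers.Dense(32, activation='relu')(x)        # compact code

# Decoder: a fully connected layer, then transposed convolutions rebuild the input
x = layers.Dense(7 * 7 * 64, activation='relu')(bottleneck)
x = layers.Reshape((7, 7, 64))(x)
x = layers.Conv2DTranspose(32, 3, strides=2, padding='same', activation='relu')(x)          # 7 -> 14
outputs = layers.Conv2DTranspose(1, 3, strides=2, padding='same', activation='sigmoid')(x)  # 14 -> 28

autoencoder = tf.keras.Model(inputs, outputs)
# Trained on the reconstruction error between the input and its reconstruction
autoencoder.compile(optimizer='adam', loss='mse')
```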
Siamese networks
A Siamese network is a neural network model trained to distinguish between two inputs. It trains a CNN to produce an embedding using two encoder branches whose weights are shared; each branch is fed one of the images of a positive or negative pair. A Siamese network requires less data than many other deep learning approaches. Siamese networks were originally introduced for comparing signatures. To define a Siamese network: two (weight-sharing) encoders are defined, their latent representations are combined, and the result drives the training loss; the left and right branches are fed their data separately.
Another use is one-shot learning.
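A hedged tf.keras sketch of one common Siamese variant: a single shared encoder is applied to both images, and the absolute difference of the two embeddings feeds a sigmoid "same / different" score (contrastive or triplet losses are other common choices). The input size and layer widths are illustrative.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_encoder():
    # One CNN encoder producing an embedding; its weights are shared by both branches
    return tf.keras.Sequential([
        layers.Conv2D(32, 3, activation='relu', input_shape=(105, 105, 1)),
        layers.MaxPooling2D(2),
        layers.Conv2D(64, 3, activation='relu'),
        layers.MaxPooling2D(2),
        layers.Flatten(),
        layers.Dense(128, activation='relu'),
    ])

encoder = build_encoder()
left = tf.keras.Input(shape=(105, 105, 1))
right = tf.keras.Input(shape=(105, 105, 1))
emb_left, emb_right = encoder(left), encoder(right)   # same weights, fed separately

distance = layers.Lambda(lambda t: tf.abs(t[0] - t[1]))([emb_left, emb_right])
score = layers.Dense(1, activation='sigmoid')(distance)   # probability the pair matches

siamese = tf.keras.Model([left, right], score)
siamese.compile(optimizer='adam', loss='binary_crossentropy')
```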
LeNet
Yann LeCun et al. built LeNet-5 (published in 1998). This network takes a 32x32 image as input, which goes through convolution layers and then a subsampling layer; today, the subsampling layer would be replaced by a pooling layer. Then there is another sequence of convolution –> pooling –> 3 fully connected layers, including the output layer at the end. This network was used for zip code recognition in post offices. The CONV filters are 5x5, applied at a stride of 1.
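A rough tf.keras sketch of the LeNet-5 layout, with pooling standing in for the original subsampling layers; the tanh activations and 10-way output are approximations of the original setup.

```python
import tensorflow as tf
from tensorflow.keras import layers

lenet = tf.keras.Sequential([
    layers.Conv2D(6, 5, activation='tanh', input_shape=(32, 32, 1)),  # 5x5 conv, stride 1
    layers.AveragePooling2D(2),                                       # stands in for subsampling
    layers.Conv2D(16, 5, activation='tanh'),
    layers.AveragePooling2D(2),
    layers.Flatten(),
    layers.Dense(120, activation='tanh'),
    layers.Dense(84, activation='tanh'),
    layers.Dense(10, activation='softmax'),   # e.g. 10 digit classes for zip codes
])
```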
Recurrent models of visual attention
These models use the hard attention method. One popular variant of this family is the Recurrent Attention Model (RAM). Hard attention is non-differentiable, so the control problem has to be solved with reinforcement learning, and RAM uses RL for this optimization. A recurrent model of visual attention does not process the entire image, or even a sliding-window bounding box, at once. It mimics the human eye and works on the concept of fixation of gaze at different locations in an image; with each fixation, it incrementally combines information from the important regions to dynamically build up an internal representation of the scene. It uses an RNN to do this in a sequential manner.
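A toy sketch of the glimpse (fixation) step using TensorFlow's tf.image.extract_glimpse; the image sizes and hand-picked fixation locations are made up, and the RNN core plus the REINFORCE-style training of the location policy are not shown.

```python
import tensorflow as tf

images = tf.random.uniform((4, 64, 64, 1))        # batch of 4 toy images
locations = tf.constant([[0.0, 0.0],              # normalised (y, x) fixation offsets;
                         [0.5, -0.5],             # in RAM these come from the RNN's state
                         [-0.3, 0.2],
                         [0.1, 0.7]])

# Crop a small patch around each fixation instead of processing the whole image
glimpses = tf.image.extract_glimpse(images, size=[12, 12], offsets=locations,
                                    centered=True, normalized=True)
print(glimpses.shape)   # (4, 12, 12, 1) -- one small patch per image
```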