Image Caption / Tagging (RNN) Flashcards
Name two metrics that can be used to evaluate the performance of an image captioning system. What kind of outputs does each metric favor?
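BLEU: measures n-gram precision; precision alone favors shorter sentences (fewer words, fewer chances of a wrong one), which BLEU's brevity penalty is meant to counteract
METEOR: combines precision and recall, with recall weighted more heavily; favors longer sentences (more words, higher chance of covering the reference)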
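The length bias of precision and recall can be seen on a toy example. The sketch below computes only clipped unigram precision and recall against a single reference; it is not full BLEU (which also uses higher-order n-grams and a brevity penalty) or full METEOR (which also uses stemming, synonym matching and a fragmentation penalty). The function name and the example captions are made up for illustration.

```python
from collections import Counter


def unigram_precision_recall(candidate, reference):
    """Clipped unigram precision and recall of a candidate against one reference."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    # Clipped overlap: a reference word can be matched at most as often as it occurs.
    overlap = sum((Counter(cand) & Counter(ref)).values())
    precision = overlap / len(cand) if cand else 0.0
    recall = overlap / len(ref) if ref else 0.0
    return precision, recall


reference = "a dog runs across the green field"
short_caption = "a dog runs"
long_caption = "a big brown dog runs fast across the wide green field today"

# Short caption: every word is correct (high precision), but most of the
# reference is missing (low recall).
short_p, short_r = unigram_precision_recall(short_caption, reference)

# Long caption: covers the whole reference (high recall), but also contains
# many extra words (lower precision).
long_p, long_r = unigram_precision_recall(long_caption, reference)

print(f"short: precision={short_p:.2f} recall={short_r:.2f}")
print(f"long:  precision={long_p:.2f} recall={long_r:.2f}")
```

The short caption wins on precision and the long caption wins on recall, which is the bias each card above describes.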
What is image captioning? Explain and draw a typical approach
Image captioning is a computer vision task that aims to generate a descriptive and coherent caption for an input image.
A typical approach uses a convolutional neural network (CNN) to extract features from the image, followed by an RNN that generates the caption word by word and captures the dependencies between the words in the sentence.
Input: Image -> CNN (e.g. VGG) -> RNN (LSTM or GRU) -> Output: Caption
When does the generation of the output caption stop?
Generation stops when the network emits a special end token, such as <end> or <eos> ("end of sentence").
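A minimal sketch of this stopping rule in a greedy decoding loop. The decoder step `toy_next_token` is a hypothetical stand-in that replays a fixed script instead of running a real RNN, and the `max_length` cap is an added assumption (a common safeguard in case the end token is never produced), not something the card states.

```python
END_TOKEN = "<end>"
MAX_LENGTH = 20  # assumed safety cap, in case the end token never appears


def toy_next_token(previous_tokens):
    """Hypothetical stand-in for one decoder step: returns the next word."""
    script = ["a", "dog", "on", "grass", END_TOKEN]
    return script[min(len(previous_tokens), len(script) - 1)]


def greedy_decode(next_token_fn, max_length=MAX_LENGTH):
    """Generate tokens one at a time until the end token or the length cap."""
    tokens = []
    for _ in range(max_length):
        token = next_token_fn(tokens)
        if token == END_TOKEN:
            break  # the model signalled "end of sentence"
        tokens.append(token)
    return tokens


caption = greedy_decode(toy_next_token)
print(" ".join(caption))
```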
Name 2 different approaches for video classification
Using a 3D CNN, whose filters convolve over spatio-temporal volumes (stacks of consecutive frames) and so capture motion across frames; or using an RNN (LSTM or GRU) that processes the frames in sequence, so that the input length is flexible and videos with different numbers of frames can be handled.
How can the ImageNet dataset be used to initialize a 3D-CNN model?
ImageNet can be used to pre-train a 2D CNN for image classification. The 2D CNN is then extended to a 3D CNN by adding a temporal dimension to its filters, so that the pre-trained weights initialize a model that processes the frames of a video.
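The card does not say how the temporal dimension is added. One common option, assumed here, is kernel "inflation": each pre-trained 2D kernel is repeated along the new time axis and divided by the number of time steps, so that the 3D filter gives the same response on a static video as the 2D filter gives on a single frame. The sketch below uses plain nested lists and hypothetical helper names rather than a real deep learning library.

```python
def inflate_2d_kernel(kernel_2d, time_steps):
    """Repeat a 2D kernel along a new time axis, rescaled by 1/time_steps."""
    return [
        [[w / time_steps for w in row] for row in kernel_2d]
        for _ in range(time_steps)
    ]


def response_2d(kernel_2d, patch):
    """Response of a 2D kernel on one image patch of the same size."""
    return sum(
        w * x
        for k_row, p_row in zip(kernel_2d, patch)
        for w, x in zip(k_row, p_row)
    )


def response_3d(kernel_3d, clip):
    """Response of a 3D kernel on a clip (one patch per time step)."""
    return sum(response_2d(k, frame) for k, frame in zip(kernel_3d, clip))


kernel = [[1.0, 0.0], [0.0, -1.0]]   # stands in for a pre-trained 2D filter
patch = [[3.0, 5.0], [2.0, 1.0]]     # one image patch
static_clip = [patch] * 4            # the same patch repeated over 4 frames

kernel_3d = inflate_2d_kernel(kernel, time_steps=4)
print(response_2d(kernel, patch), response_3d(kernel_3d, static_clip))
```

Because the two responses match on a static clip, the inflated network starts from the behavior learned on ImageNet and only has to learn the temporal part from video data.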
Explain and draw a typical soft attention mechanism for image captioning approaches
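The soft attention mechanism lets the decoder focus on the most meaningful regions of the image at each step of caption generation. It scores every location of the feature map, normalizes the scores into weights, and uses the weighted sum of the location features as the context vector for the decoder.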
Input -> CNN (Encoder) -> Feature Map -> Attention Mechanism -> Context vector (seq-to-seq) -> Decoder -> Output (Generated Caption)
How does hard attention differ from soft attention?
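Soft attention → weighted combination of the features at all L locations, with weights given by a softmax over the location scores (differentiable, so it trains end-to-end with backpropagation)
Hard attention → selects a single location, e.g. the highest scoring one; the selection is not differentiable, so it is hard to train end-to-end with plain backpropagation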
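The difference can be sketched on a toy feature map with L = 3 locations. The feature vectors and scores are made-up values; in a real model the scores would come from a small network that compares each location with the decoder state. Hard attention is shown here as an argmax, following the card; some formulations sample the location instead.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_attention(features, scores):
    """Context vector = softmax-weighted sum over all locations."""
    weights = softmax(scores)
    dim = len(features[0])
    context = [
        sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)
    ]
    return context, weights


def hard_attention(features, scores):
    """Context vector = features of the single highest scoring location."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return features[best], best


features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # L = 3 locations, 2-D features
scores = [0.1, 2.0, 0.3]                          # made-up attention scores

soft_context, weights = soft_attention(features, scores)
hard_context, chosen = hard_attention(features, scores)
print("soft:", soft_context, "weights:", weights)
print("hard:", hard_context, "location:", chosen)
```

The soft context blends all three locations, dominated by the second one, while the hard context is exactly the second location's feature vector.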
Name a normalization function that can be used for the attention mechanism. Explain your answer.
The softmax function is used in the attention mechanism to normalize the attention weights, ensuring that the weights sum up to 1. This enables the model to focus on relevant image regions effectively during caption generation.
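Written out, with assumed notation (e_i for the raw attention score of location i, alpha_i for its normalized weight, a_i for its feature vector, z for the context vector, and L for the number of locations):

```latex
\alpha_i = \frac{\exp(e_i)}{\sum_{k=1}^{L} \exp(e_k)},
\qquad \alpha_i > 0,
\qquad \sum_{i=1}^{L} \alpha_i = 1,
\qquad z = \sum_{i=1}^{L} \alpha_i \, a_i
```

Because every weight is positive and the weights sum to 1, the context vector z is a weighted average of the location features.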