Image Caption / Tagging (RNN) Flashcards

1
Q

Name two metrics that can be used to evaluate the performance of an image captioning system. Which kinds of outputs are favored by each metric?

A

BLEU: measures precision; favors shorter sentences (higher chance of precision)

METEOR: measures recall; favors longer sentences (higher chance of recall)

2
Q

What is image captioning? Explain and draw a typical approach

A

Image captioning is a Computer Vision task that aims to generate a descriptive and coherent caption for an input image.

A typical approach uses a convolutional neural network (CNN) to extract features from the image, and then an RNN to capture the dependencies between the words of the sentence.

Input: Image -> CNN (e.g. VGG) -> RNN (LSTM or GRU) -> Output: Caption
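A minimal NumPy sketch of this wiring, under loose assumptions: the sizes are hypothetical, the weights are random and untrained, and a single vanilla-RNN step stands in for a real LSTM/GRU decoder fed by a pretrained CNN such as VGG.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: a VGG-style CNN would output a 4096-d feature vector;
# here we just fabricate one to show the data flow.
feat_dim, hidden_dim, embed_dim, vocab_size = 4096, 512, 256, 10000

image_feature = rng.standard_normal(feat_dim)          # stand-in for the CNN output

# Project the image feature to initialize the decoder's hidden state.
W_init = rng.standard_normal((hidden_dim, feat_dim)) * 0.01
h = np.tanh(W_init @ image_feature)                    # h0 for the RNN decoder

# One vanilla-RNN decoder step: previous word embedding -> next-word scores.
W_xh = rng.standard_normal((hidden_dim, embed_dim)) * 0.01
W_hh = rng.standard_normal((hidden_dim, hidden_dim)) * 0.01
W_hy = rng.standard_normal((vocab_size, hidden_dim)) * 0.01

x_start = rng.standard_normal(embed_dim)               # <start> token embedding
h = np.tanh(W_xh @ x_start + W_hh @ h)                 # update hidden state
logits = W_hy @ h                                      # scores over the vocabulary
```

In a trained model this step repeats, feeding each predicted word back in as the next input, until an end token is produced.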

3
Q

When does the output generation from caption image stop?

A

When the network finishes the caption, it generates an end token, written <end> or <eos> (for "end of sentence").
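A toy greedy decoding loop illustrating the stopping rule, with a made-up five-word vocabulary and random untrained weights (so the generated "caption" is meaningless; only the control flow matters): generation ends when the end token is predicted, with a length cap as a safety net.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy vocabulary; the end token sits at a known index.
vocab = ["<start>", "a", "dog", "runs", "<eos>"]
EOS = vocab.index("<eos>")

hidden_dim, vocab_size = 8, len(vocab)
W_hh = rng.standard_normal((hidden_dim, hidden_dim)) * 0.1
W_hy = rng.standard_normal((vocab_size, hidden_dim)) * 0.1
E = rng.standard_normal((vocab_size, hidden_dim))      # word embeddings

h = rng.standard_normal(hidden_dim)                    # state from the image
caption, token = [], vocab.index("<start>")
for _ in range(20):                                    # hard cap on caption length
    h = np.tanh(W_hh @ h + E[token])                   # one RNN step
    token = int(np.argmax(W_hy @ h))                   # greedy next-word pick
    if token == EOS:                                   # stop at the end token
        break
    caption.append(vocab[token])
```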

4
Q

Name 2 different approaches for video classification

A

Using a 3D CNN, which operates on spatio-temporal volumes and can classify stacks of frames; or using an RNN (LSTM or GRU), whose flexible input length lets the network process a variable number of frames per video.
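The difference in input handling can be sketched with placeholder arrays (the shapes below are illustrative conventions, not requirements of any specific model):

```python
import numpy as np

# 3D-CNN view: a clip is one fixed-size spatio-temporal volume.
clip = np.zeros((16, 112, 112, 3))                     # (T, H, W, C), T is fixed

# RNN view: a video is a sequence of per-frame feature vectors,
# and the sequence length may differ from video to video.
video_a = [np.zeros(512) for _ in range(30)]           # 30 frames
video_b = [np.zeros(512) for _ in range(75)]           # 75 frames, same model
```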

5
Q

How can the ImageNet dataset be used to initialize a 3D-CNN model?

A

ImageNet can be used to pre-train a 2D CNN for image classification; the pretrained network is then extended to a 3D CNN by adding a temporal dimension (inflating the 2D filters over time), enabling it to process the frames of a video.
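A small sketch of one common inflation scheme (as used in I3D-style models), shown for a single 3x3 filter: the pretrained 2D kernel is repeated along a new time axis and divided by T, so a temporally constant input produces the same response as the original 2D convolution.

```python
import numpy as np

# A stand-in 3x3 conv filter "pre-trained" on ImageNet (one channel for brevity).
k2d = np.arange(9, dtype=float).reshape(3, 3)

# Inflate to a 3x3x3 spatio-temporal kernel: repeat along time, divide by T.
T = 3
k3d = np.repeat(k2d[None, :, :], T, axis=0) / T

# Sanity check: on a static video (the same frame repeated T times),
# the 3D response at one position equals the 2D response on that frame.
frame = np.random.default_rng(2).standard_normal((3, 3))
video = np.stack([frame] * T)
assert np.allclose((k3d * video).sum(), (k2d * frame).sum())
```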

6
Q

Explain and draw a typical soft attention mechanism for image captioning approaches

A

The soft attention mechanism focuses processing on the meaningful regions of the image, which reduces the time and computation required.

Input -> CNN (Encoder) -> Feature Map -> Attention Mechanism -> Context vector (seq-to-seq) -> Decoder -> Output (Generated Caption)
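The "Attention Mechanism -> Context vector" step can be sketched in NumPy as follows; the bilinear scoring and all sizes are illustrative assumptions (real models often score with a small MLP), and the weights are random rather than learned.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical CNN feature map flattened to L locations, each a D-dim vector
# (e.g. a 7x7 map gives L = 49), plus the current decoder state.
L, D, H = 49, 64, 32
features = rng.standard_normal((L, D))
h = rng.standard_normal(H)

# Score each location against the decoder state (toy bilinear scoring).
W = rng.standard_normal((H, D)) * 0.1
scores = features @ W.T @ h                            # shape (L,)

# Softmax-normalize so the attention weights sum to 1.
weights = np.exp(scores - scores.max())
weights /= weights.sum()

# Context vector: weighted combination of all L locations (soft attention).
context = weights @ features                           # shape (D,)
```

The decoder then consumes this context vector at each step, so different words of the caption can attend to different image regions.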

7
Q

How does hard attention differ from soft attention?

A

Soft attention → weighted combination (softmax over L locations)
Hard attention → picks only the highest-scoring location (hard to train end-to-end)
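The contrast in one toy computation (the weights are a made-up softmax output; trained hard-attention models typically *sample* the location, which is what breaks plain backpropagation):

```python
import numpy as np

rng = np.random.default_rng(4)
features = rng.standard_normal((5, 4))                 # L = 5 locations, D = 4
weights = np.array([0.1, 0.6, 0.1, 0.1, 0.1])          # toy softmax output

# Soft attention: differentiable weighted sum over ALL locations.
soft_context = weights @ features

# Hard attention: commit to a SINGLE location (here the argmax).
hard_context = features[np.argmax(weights)]

print(np.allclose(hard_context, features[1]))          # True
```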

8
Q

Name a normalization function that can be used for the attention mechanism. Explain your answer.

A

The softmax function is used in the attention mechanism to normalize the attention weights, ensuring that the weights sum up to 1. This enables the model to focus on relevant image regions effectively during caption generation.
