Action & Activity Recognition & Self-supervised Learning Flashcards

Question 1

Q

Draw the CNN architectures for fusing temporal information across video frames and write the name of each one

Answer

A

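Single Frame: The network sees one frame at a time, so it has no temporal information.

Late Fusion: The network sees 2 frames separated by F = 15 frames. The frames are processed separately and merged only in the last layers, so only those layers have temporal information.

Early Fusion: Incorporates temporal information from the start by extending the filters of the first convolutional layer across several frames.

Slow Fusion: Temporal information is merged gradually through the network, so higher layers in the hierarchy have access to progressively more temporal information.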
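
As a rough illustration of why higher layers see more time in Slow Fusion, the sketch below tracks how many temporal positions remain after each layer. The 10-frame clip and the (kernel, stride) pairs are assumptions based on a commonly cited configuration, not values stated on this card.

```python
def temporal_length(n_frames, layers):
    """Temporal length after each (kernel, stride) layer, using valid convolution."""
    lengths = [n_frames]
    for kernel, stride in layers:
        n_frames = (n_frames - kernel) // stride + 1
        lengths.append(n_frames)
    return lengths


# Assumed Slow Fusion setup: temporal extent 4, then 2, then 2, all with stride 2.
slow_fusion = temporal_length(10, [(4, 2), (2, 2), (2, 2)])
# Assumed Early Fusion setup: the first layer spans the whole 10-frame clip at once.
early_fusion = temporal_length(10, [(10, 1)])

print(slow_fusion)   # [10, 4, 2, 1]
print(early_fusion)  # [10, 1]
```

In this sketch Slow Fusion collapses time over three layers, while Early Fusion collapses it in the first layer.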

Question 2

Q

Name 3 different paradigms for incorporating the temporal dimension within neural networks and explain each of them.

Answer

A

Two-stream CNNs: Use 2 separate streams of information, one for spatial features (an RGB frame) and one for temporal features (stacked optical-flow frames).

3D-CNNs: Extend 2D architectures with a temporal dimension, so the convolutions operate spatiotemporally on a stack of video frames.

LSTM: A recurrent network built for sequential data. Through its gating mechanisms it can handle sequences of varying length and can drop or retain information, which gives it a better long-term memory of previous inputs.
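
The gating idea can be sketched with a single scalar LSTM step. This is a minimal illustration using the standard LSTM equations; the weight layout (`w[gate] = (w_x, w_h, bias)`) and all names are hypothetical choices for this example.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, w):
    """One scalar LSTM step; w maps each gate name to (w_x, w_h, bias)."""
    def pre(gate):
        w_x, w_h, b = w[gate]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre("forget"))       # how much of the old cell state to retain
    i = sigmoid(pre("input"))        # how much of the new candidate to write
    o = sigmoid(pre("output"))       # how much of the cell state to expose
    g = math.tanh(pre("candidate"))  # candidate value for the cell state
    c = f * c_prev + i * g           # drop or retain old memory, add new
    h = o * math.tanh(c)             # hidden state passed to the next step
    return h, c


# A large positive forget bias keeps almost all of the previous memory.
weights = {
    "forget": (0.0, 0.0, 10.0),
    "input": (0.0, 0.0, -10.0),
    "output": (0.0, 0.0, 0.0),
    "candidate": (1.0, 0.0, 0.0),
}
h, c = lstm_step(x=0.5, h_prev=0.0, c_prev=1.0, w=weights)
```

With these example weights the forget gate is close to 1 and the input gate close to 0, so the cell state stays near its previous value.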

Question 3

Q

Explain the concept of inflation in the inflated 3D Network for action recognition. Name one advantage

Answer

A

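The idea of I3D is to take a 2D architecture built from Inception blocks and inflate it by adding a temporal dimension to its filters (from NxN to NxNxN).
The advantage is that the network is easy to pre-train: the inflated filters can be initialized from 2D models trained on large image datasets.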
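
A minimal sketch of inflating one filter, assuming the usual initialization in which the 2D weights are repeated along time and divided by the temporal extent (that rescaling is not stated on this card). With it, a static clip made of one repeated frame produces the same response as the 2D filter on that frame. Function names are illustrative.

```python
def inflate(kernel_2d, t):
    """Repeat an NxN kernel t times along time, rescaling each copy by 1/t."""
    return [[[w / t for w in row] for row in kernel_2d] for _ in range(t)]


def response_2d(kernel, patch):
    """Filter response of an NxN kernel on an NxN image patch."""
    return sum(k * p for k_row, p_row in zip(kernel, patch)
               for k, p in zip(k_row, p_row))


def response_3d(kernel_3d, clip):
    """Filter response of a TxNxN kernel on a clip of T patches."""
    return sum(response_2d(k, frame) for k, frame in zip(kernel_3d, clip))


kernel = [[1.0, 0.0, -1.0], [2.0, 0.0, -2.0], [1.0, 0.0, -1.0]]
patch = [[3.0, 1.0, 0.0], [4.0, 1.0, 5.0], [9.0, 2.0, 6.0]]
static_clip = [patch] * 3  # the same frame repeated three times

r2d = response_2d(kernel, patch)
r3d = response_3d(inflate(kernel, 3), static_clip)
```

Here `r2d` and `r3d` agree up to floating-point rounding, which is what lets pre-trained 2D weights serve as a starting point for the 3D network.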

Question 4

Q

What is self-supervised learning? Name and explain one approach for images and one for videos.

Answer

A

Self-supervised learning is an approach in which the data itself provides the supervision, so the dataset needs no labels.
One approach for images is colorization (predict the correct colors for a grayscale version of the image). One approach for videos is temporal order classification (Shuffle and Learn), which predicts whether the frames of a video are “in order” or “out of order”.
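
To show how the labels come from the data itself, here is a simplified sketch of building temporal-order training samples. It is not the exact sampling scheme of Shuffle and Learn; the function name, tuple size, and 50/50 split are assumptions made for illustration.

```python
import random


def make_order_sample(frames, rng, n=3):
    """Pick n frames; return (frames, label) with label 1 = in order, 0 = out of order."""
    idx = sorted(rng.sample(range(len(frames)), n))
    in_order = rng.random() < 0.5
    if not in_order:
        shuffled = idx[:]
        while shuffled == idx:  # make sure the order really changes
            rng.shuffle(shuffled)
        idx = shuffled
    return [frames[i] for i in idx], int(in_order)


rng = random.Random(0)
video = list(range(20))  # stand-in for 20 video frames
samples = [make_order_sample(video, rng) for _ in range(5)]
```

No human annotation is involved: the label is known because the code decided whether to shuffle.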

Question 5

Q

Name three data augmentation techniques that can be used for video classification

Answer

A

Random cropping and resizing; horizontal flips; random temporal sampling
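
A minimal sketch of these augmentations on a video stored as nested lists (frames, rows, pixels). Resizing is left out to keep the example short, and the same crop and flip are applied to every frame of the clip so the frames stay aligned. All names and sizes are illustrative assumptions.

```python
import random


def augment_clip(video, rng, clip_len=4, crop=2):
    """Random temporal sampling, random spatial crop, random horizontal flip."""
    start = rng.randrange(len(video) - clip_len + 1)  # random temporal window
    clip = video[start:start + clip_len]
    height, width = len(clip[0]), len(clip[0][0])
    top = rng.randrange(height - crop + 1)            # random crop position
    left = rng.randrange(width - crop + 1)
    flip = rng.random() < 0.5                         # flip half of the time
    out = []
    for frame in clip:
        rows = [row[left:left + crop] for row in frame[top:top + crop]]
        if flip:
            rows = [row[::-1] for row in rows]
        out.append(rows)
    return out


# Toy video: 8 frames of 4x4 pixels, each value encoding (frame, row, column).
video = [[[t * 100 + r * 10 + c for c in range(4)] for r in range(4)]
         for t in range(8)]
clip = augment_clip(video, random.Random(0))
```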

Question 6

Q

Name 2 self-supervised proxy tasks for representation learning that are based on context or position

Answer

A

Jigsaw puzzle: Randomly shuffle the tiles of a 3x3 grid over the image → solve the puzzle

Relative position: Sample 2 image crops from a 3x3 grid → predict the position of the 2nd crop relative to the 1st
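
Both proxy tasks can be sketched as sample generators over the 9 tiles of a 3x3 grid. In this illustration the first crop is always the center tile and the label is one of its 8 neighbours; that convention and the function names are assumptions made for the example.

```python
import random


def jigsaw_sample(tiles, rng):
    """Shuffle the 9 tiles; the permutation is the target the network must recover."""
    perm = list(range(9))
    rng.shuffle(perm)
    return [tiles[i] for i in perm], perm


def relative_position_sample(tiles, rng):
    """Return (center tile, neighbour tile, label in 0..7 for the neighbour's position)."""
    center = 4
    neighbour = rng.choice([i for i in range(9) if i != center])
    label = neighbour if neighbour < center else neighbour - 1
    return tiles[center], tiles[neighbour], label


rng = random.Random(0)
tiles = list("ABCDEFGHI")  # stand-ins for the 9 image tiles, row by row
shuffled, perm = jigsaw_sample(tiles, rng)
first, second, label = relative_position_sample(tiles, rng)
```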

Question 7

Q

Some neural architectures can be used for reconstruction-based proxy tasks. Name one such architecture as well as a color-based reconstruction proxy task.

Answer

A

Architecture: Encoder-decoder; Task: Colorizing a grayscale image
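
For the colorization task, the training pair comes directly from a color image: the grayscale version is the input and the original is the reconstruction target. The sketch below assumes the common luma weights 0.299, 0.587, 0.114 for the grayscale conversion; the function names are illustrative.

```python
def to_grayscale(rgb_image):
    """Convert an image of (r, g, b) pixels to one gray value per pixel."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row]
            for row in rgb_image]


def colorization_pair(rgb_image):
    """Return (network input, reconstruction target) for the proxy task."""
    return to_grayscale(rgb_image), rgb_image


image = [[(255, 0, 0), (0, 255, 0)],
         [(0, 0, 255), (255, 255, 255)]]
gray_input, color_target = colorization_pair(image)
```

An encoder-decoder would take `gray_input` and be trained to output `color_target`.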

(7 cards)