Action & Activity Recognition & Self-supervised Learning Flashcards
Draw the CNN architectures for fusing temporal information in video and write the name of each one
Single frame: Network sees a frame at a time (no temporal information)
Late Fusion: Two single-frame networks see 2 frames separated by F=15 frames and are merged only in the last layers, so only those layers have temporal information
Early Fusion: Incorporates temporal information right away by extending the filters of the first convolutional layer in time
Slow Fusion: Higher layers in the hierarchy have more access to temporal information
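A minimal sketch of why higher layers see more time under slow fusion. It only tracks how many time steps remain after each temporal convolution; the 10-frame clip and the (kernel, stride) pairs are assumed values chosen for illustration, not taken from the cards above.

```python
def temporal_length(t_in, kernel, stride):
    """Number of time steps left after a valid temporal convolution."""
    return (t_in - kernel) // stride + 1


def fusion_lengths(num_frames=10, layers=((4, 2), (2, 2), (2, 2))):
    """Temporal length after each layer; the (kernel, stride) pairs are assumptions."""
    lengths = [num_frames]
    for kernel, stride in layers:
        lengths.append(temporal_length(lengths[-1], kernel, stride))
    return lengths


# Slow fusion: time is merged step by step across several layers.
slow = fusion_lengths()
# Early fusion: one first-layer filter spans the whole clip at once.
early = fusion_lengths(10, layers=((10, 1),))
print(slow, early)
```

Each extra layer in the slow variant covers a wider span of input frames, while the early variant collapses time in a single step.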
Name 3 different paradigms for incorporating the temporal dimension within neural networks and explain each of them.
Two-stream CNNs: Use 2 separate streams of information, one for spatial features (RGB frames) and one for temporal features (stacked optical flow between consecutive frames)
3D-CNNs: Take 2D architectures and extend them with a temporal dimension, so that the convolutions run over space and time and can process several frames of a video at once
LSTM: Built for sequential data. Its gating mechanisms let it handle inputs of varying length and decide which information to drop or retain, giving it a better long-term memory of previous inputs.
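To make the gating idea concrete, here is a sketch of a single LSTM step on scalars using the standard gate equations. The weight dictionary and its values are hypothetical; a real layer works on vectors with learned weight matrices.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, w):
    """One LSTM step on scalars; w maps a gate name to (w_x, w_h, bias)."""
    def pre(name):
        w_x, w_h, bias = w[name]
        return w_x * x + w_h * h_prev + bias

    f = sigmoid(pre("forget"))  # how much of the old cell state to retain
    i = sigmoid(pre("input"))   # how much new information to write
    o = sigmoid(pre("output"))  # how much of the cell state to expose
    g = math.tanh(pre("cand"))  # candidate value to write
    c = f * c_prev + i * g
    h = o * math.tanh(c)
    return h, c


# Hypothetical weights: a large forget bias retains memory, a very negative
# input bias almost blocks new writes.
weights = {
    "forget": (0.0, 0.0, 5.0),
    "input": (0.0, 0.0, -5.0),
    "output": (0.0, 0.0, 0.0),
    "cand": (1.0, 0.0, 0.0),
}
h, c = 0.0, 1.0
for x in [0.3, -0.8, 0.5]:  # the loop accepts a sequence of any length
    h, c = lstm_step(x, h, c, weights)
```

With these weights the cell state stays close to its initial value, which is the "retain" behaviour; flipping the two biases would make it forget quickly instead.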
Explain the concept of inflation in the inflated 3D Network for action recognition. Name one advantage
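The concept of I3D is to take the 2D architecture built from Inception blocks and add a temporal dimension to its filters and pooling kernels (from NxN to NxNxN).
The advantage is that it is easy to pre-train, because the inflated 3D filters can be initialized from weights learned on large 2D image datasets.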
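A small sketch of inflation on plain nested lists: the 2D kernel is repeated n times along time and rescaled by 1/n, so a clip made of one repeated frame gives the same response as the 2D kernel on that frame. The kernel and patch values are made up for illustration.

```python
def inflate_kernel(kernel_2d, n):
    """Repeat an N x N kernel n times along time and rescale by 1/n."""
    return [[[w / n for w in row] for row in kernel_2d] for _ in range(n)]


def respond_2d(kernel_2d, patch):
    """Response of a 2D kernel on an image patch of the same size."""
    return sum(
        w * p
        for k_row, p_row in zip(kernel_2d, patch)
        for w, p in zip(k_row, p_row)
    )


def respond_3d(kernel_3d, clip):
    """Response of a 3D kernel on a clip with one patch per time step."""
    return sum(respond_2d(k, frame) for k, frame in zip(kernel_3d, clip))


kernel = [[1.0, 0.0, -1.0], [2.0, 0.0, -2.0], [1.0, 0.0, -1.0]]
patch = [[3.0, 1.0, 0.0], [4.0, 1.0, 1.0], [5.0, 2.0, 2.0]]
static_clip = [patch] * 3  # a "video" that repeats the same frame

response_2d = respond_2d(kernel, patch)
response_3d = respond_3d(inflate_kernel(kernel, 3), static_clip)
```

The two responses agree up to floating-point error, which is why pretrained 2D weights are a sensible starting point for the 3D filters.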
What is self-supervised learning? Name and explain one approach for images and one for videos.
Self-supervised learning is an approach in which the data itself provides the supervision, with no need for labels in the dataset.
One approach for images is colorization (predict the right colors for a grayscale version of the image). One approach for videos is temporal order classification (Shuffle and Learn), which predicts whether the frames of a video are “in order” or “out of order”.
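A simplified sketch of how training samples for temporal order classification could be generated without any labels. The function name, the tuple length of 3, and the 50/50 split are assumptions for illustration.

```python
import random


def make_order_sample(num_frames, rng, clip_len=3):
    """Return (frame_indices, label); label 1 = in order, 0 = out of order."""
    indices = sorted(rng.sample(range(num_frames), clip_len))
    if rng.random() < 0.5:
        return indices, 1
    shuffled = indices[:]
    while shuffled == indices:  # reshuffle until the order really changes
        rng.shuffle(shuffled)
    return shuffled, 0


rng = random.Random(0)
samples = [make_order_sample(30, rng) for _ in range(5)]
```

The label comes for free from the frame order itself, which is what makes the task self-supervised.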
Name three data augmentation techniques that can be used for video classification
Random cropping and resizing; horizontal flips; random temporal sampling
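A minimal sketch of the three augmentations on a video stored as nested lists (frames, rows, pixels). The function name and default sizes are hypothetical, and the resizing step after the crop is left out to keep the example short.

```python
import random


def augment_clip(video, rng, clip_len=4, crop=2):
    """video: list of frames, each frame a list of rows of pixel values."""
    # Random temporal sampling: random start, then clip_len consecutive frames.
    start = rng.randrange(len(video) - clip_len + 1)
    clip = video[start:start + clip_len]
    # Random crop: the same window for every frame so the clip stays aligned.
    height, width = len(clip[0]), len(clip[0][0])
    top = rng.randrange(height - crop + 1)
    left = rng.randrange(width - crop + 1)
    clip = [[row[left:left + crop] for row in frame[top:top + crop]] for frame in clip]
    # Horizontal flip with probability 0.5, applied to the whole clip.
    if rng.random() < 0.5:
        clip = [[row[::-1] for row in frame] for frame in clip]
    return clip


# Toy video: 8 frames of 4x4 pixels, each pixel encodes (frame, row, column).
video = [[[(t, r, c) for c in range(4)] for r in range(4)] for t in range(8)]
clip = augment_clip(video, random.Random(0))
```

The crop window and the flip decision are drawn once per clip, so all frames stay spatially consistent with each other.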
Name 2 self-supervised proxy tasks for representation learning that are based on context or position
Randomly shuffle the tiles of a 3x3 grid over the image → solve the jigsaw puzzle
Sample 2 image crops from a 3x3 grid → predict the position of the 2nd crop relative to the 1st
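Both proxy tasks can be sketched as sample generators over the 9 tiles of a 3x3 grid. The function names are hypothetical, and fixing the first crop to the center tile is an assumption made here so the label has 8 possible values.

```python
import random


def jigsaw_sample(tiles, rng):
    """Shuffle the tiles; the permutation is the target the network must recover."""
    perm = list(range(len(tiles)))
    rng.shuffle(perm)
    return [tiles[i] for i in perm], perm


def relative_position_sample(tiles, rng):
    """Return (center crop, neighbor crop, label), label in 0..7."""
    center = 4                             # middle tile of the 3x3 grid
    neighbors = [0, 1, 2, 3, 5, 6, 7, 8]   # the 8 surrounding tiles
    label = rng.randrange(len(neighbors))
    return tiles[center], tiles[neighbors[label]], label


tiles = list("ABCDEFGHI")  # stand-ins for the 9 image tiles, row by row
rng = random.Random(0)
shuffled, perm = jigsaw_sample(tiles, rng)
first, second, label = relative_position_sample(tiles, rng)
```

In both cases the target (permutation or neighbor index) is derived from the image layout alone, so no manual annotation is needed.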
Some neural architectures can be used for reconstruction-based proxy tasks. Name such an architecture as well as a color-based reconstruction proxy task.
Encoder / Decoder ; Task: Colorizing a grayscale image
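A sketch of how a training pair for the colorization task could be built: the grayscale image is the encoder input and the original color image is the reconstruction target for the decoder. Working directly in RGB with these luma weights is an assumption for illustration; the function names are hypothetical.

```python
def to_grayscale(rgb_image):
    """rgb_image: rows of (r, g, b) tuples in [0, 1]; uses common luma weights."""
    return [
        [0.299 * r + 0.587 * g + 0.114 * b for r, g, b in row]
        for row in rgb_image
    ]


def colorization_pair(rgb_image):
    """Input is the grayscale image, target is the original color image."""
    return to_grayscale(rgb_image), rgb_image


image = [
    [(1.0, 1.0, 1.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (0.0, 0.0, 1.0)],
]
gray_input, color_target = colorization_pair(image)
```

The color target is the image itself, so the supervision again comes from the data and not from labels.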