Action & Activity Recognition & Self-supervised Learning Flashcards

1
Q

Draw the CNN architectures for fusing temporal information across video frames and write the name of each one.

A

Single Frame: The network sees one frame at a time (no temporal information).

Late Fusion: Two single-frame networks see 2 frames separated by F=15 frames. Only the last layers, where the two towers merge, have temporal information.

Early Fusion: Incorporates temporal information immediately by modifying the first convolutional layer to span several consecutive frames.

Slow Fusion: Temporal information is merged gradually, so higher layers in the hierarchy have access to progressively more of it.
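
The difference between the fusion schemes can be made concrete by counting how many input frames one unit in each layer can see. The sketch below is only an illustration, not the original implementation: the 10-frame clip and the temporal extents and strides (4/2, 2/2, 2/2) are assumptions based on the commonly cited slow-fusion setup, and `temporal_fusion_trace` is a hypothetical helper name.

```python
def temporal_fusion_trace(num_frames, layers):
    """Trace the temporal axis through a stack of temporal convolutions.

    layers: list of (extent, stride) pairs, one per fusing layer.
    Returns one (outputs_in_time, input_frames_seen_per_output) pair per layer.
    """
    trace = []
    length = num_frames   # temporal length of the current feature map
    receptive = 1         # number of input frames one unit can see
    jump = 1              # distance in input frames between adjacent units
    for extent, stride in layers:
        length = (length - extent) // stride + 1   # valid convolution in time
        receptive += (extent - 1) * jump
        jump *= stride
        trace.append((length, receptive))
    return trace


# Assumed slow-fusion configuration: 10 input frames, three fusing layers.
slow_fusion = temporal_fusion_trace(10, [(4, 2), (2, 2), (2, 2)])

# Early fusion: the first layer spans the whole (assumed) 10-frame clip at once.
early_fusion = temporal_fusion_trace(10, [(10, 1)])
```

With these assumed settings, each slow-fusion layer sees more frames than the one below it (4, then 6, then all 10), whereas early fusion already sees all 10 frames in its first layer.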

2
Q

Name 3 different paradigms for incorporating the temporal dimension within neural networks and explain each of them.

A

Two-stream CNNs: Use 2 separate streams, one for spatial features (RGB frames) and one for temporal features (stacked optical flow frames).

3D-CNNs: Extend 2D architectures with a temporal dimension (3D convolutions over space and time), so they can process several frames of a video at once.

LSTM: A recurrent network built for sequential data. Through gating mechanisms it can handle sequences of different lengths and drop or retain information, giving it a better long-term memory of previous inputs.
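
To make the gating idea concrete, here is a minimal sketch of one LSTM step with scalar inputs and states. It is a toy illustration under stated assumptions: real LSTMs use weight matrices and vectors, and the names `lstm_step`, `run_lstm` and the weight dictionary layout are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, weights):
    """One step of a scalar LSTM cell.

    weights maps a gate name to (w_x, w_h, bias).
    Returns the new hidden state h and the new cell state c.
    """
    def pre_activation(name):
        w_x, w_h, bias = weights[name]
        return w_x * x + w_h * h_prev + bias

    f = sigmoid(pre_activation("forget"))       # how much old memory to keep
    i = sigmoid(pre_activation("input"))        # how much new content to write
    o = sigmoid(pre_activation("output"))       # how much memory to expose
    g = math.tanh(pre_activation("candidate"))  # proposed new content
    c = f * c_prev + i * g
    h = o * math.tanh(c)
    return h, c


def run_lstm(sequence, weights, h=0.0, c=0.0):
    """Process a sequence of any length, carrying the state forward."""
    for x in sequence:
        h, c = lstm_step(x, h, c, weights)
    return h, c


# Hypothetical weights: forget gate saturated open, input gate saturated shut,
# so the cell keeps its memory regardless of the inputs it sees.
retain_everything = {
    "forget": (0.0, 0.0, 20.0),
    "input": (0.0, 0.0, -20.0),
    "output": (0.0, 0.0, 0.0),
    "candidate": (1.0, 0.0, 0.0),
}
```

Calling `run_lstm([5.0, -3.0, 8.0], retain_everything, c=0.7)` leaves the cell state at roughly 0.7, which is the "retain" behaviour; flipping the two biases would make the cell overwrite its memory instead.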

3
Q

Explain the concept of inflation in the Inflated 3D Network (I3D) for action recognition. Name one advantage.

A

The concept of I3D is to take a 2D architecture built from Inception blocks and add a temporal dimension to its filters and pooling kernels (from NxN to NxNxN).
The advantage is that it is easy to pre-train, because the 2D weights can be learned on large 2D image datasets.
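
Inflation can be sketched in a few lines. The example assumes the commonly described bootstrapping rule, in which the 2D weights are repeated N times along the time axis and divided by N; the function names are hypothetical and plain nested lists stand in for tensors.

```python
def inflate_kernel(kernel_2d, depth):
    """Turn an NxN kernel into a depth x N x N kernel.

    The 2D weights are copied `depth` times along time and rescaled by
    1/depth so the overall response magnitude is preserved.
    """
    return [[[w / depth for w in row] for row in kernel_2d]
            for _ in range(depth)]


def response_2d(patch, kernel):
    """Filter response for one image patch of the same size as the kernel."""
    return sum(p * w
               for patch_row, kernel_row in zip(patch, kernel)
               for p, w in zip(patch_row, kernel_row))


def response_3d(clip_patch, kernel_3d):
    """Filter response for a stack of patches (one per frame)."""
    return sum(response_2d(patch, kernel_slice)
               for patch, kernel_slice in zip(clip_patch, kernel_3d))


# A pre-trained 2D filter (here a Sobel-like edge filter) and an image patch.
kernel_2d = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]
patch = [[3, 1, 4], [1, 5, 9], [2, 6, 5]]

# A "static video": the same patch repeated over 3 frames.
kernel_3d = inflate_kernel(kernel_2d, depth=3)
static_clip = [patch, patch, patch]
```

On the static clip the inflated filter gives the same response as the original 2D filter on the single image, which is why pre-trained 2D weights are a sensible starting point for the 3D network.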

4
Q

What is self-supervised learning? Name and explain one approach for images and one for videos.

A

Self-supervised learning is an approach in which the data itself provides the supervision, with no need for labels in the dataset.
One approach for images is colorization (predict the right colors for a grayscale version of the image). One approach for videos is temporal order classification (Shuffle and Learn), which predicts whether the frames of a video are “in order” or “out of order”.
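
The temporal order task needs no manual labels because the label comes from how the frames were sampled. The sketch below is a simplified illustration, not the published sampling procedure: it assumes tuples of 3 frames, treats only strictly increasing order as "in order", and `make_order_sample` is a hypothetical name.

```python
import random


def make_order_sample(num_frames, rng, tuple_len=3):
    """Sample frame indices from one video plus a free label.

    Returns (frame_indices, label), where label is 1 for "in order"
    and 0 for "out of order".
    """
    ordered = sorted(rng.sample(range(num_frames), tuple_len))
    if rng.random() < 0.5:
        return ordered, 1
    shuffled = ordered[:]
    while shuffled == ordered:      # make sure the order really changes
        rng.shuffle(shuffled)
    return shuffled, 0


rng = random.Random(0)
samples = [make_order_sample(30, rng) for _ in range(100)]
```

A classifier trained on the frames at these indices has to learn something about motion and pose to tell the two classes apart.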

5
Q

Name three data augmentation techniques that can be used for video classification

A

Random cropping and resizing, horizontal flips, and random temporal sampling.
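
A minimal sketch of these augmentations on a video stored as nested lists (frames, rows, pixels). It assumes the same crop and flip are applied to every frame of a clip so the motion stays consistent; resizing is left out for brevity, and `augment_clip` is a hypothetical name.

```python
import random


def augment_clip(video, clip_len, crop, rng):
    """Return a randomly sampled, cropped and possibly flipped clip.

    video: list of frames, each frame a list of rows of pixel values.
    """
    num_frames = len(video)
    height, width = len(video[0]), len(video[0][0])

    start = rng.randrange(num_frames - clip_len + 1)   # random temporal sampling
    top = rng.randrange(height - crop + 1)             # random crop position
    left = rng.randrange(width - crop + 1)
    flip = rng.random() < 0.5                          # horizontal flip

    clip = []
    for frame in video[start:start + clip_len]:
        rows = [row[left:left + crop] for row in frame[top:top + crop]]
        if flip:
            rows = [row[::-1] for row in rows]
        clip.append(rows)
    return clip
```

Drawing the random parameters once per clip, rather than once per frame, is the main difference from image augmentation.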

6
Q

Name 2 self-supervised proxy tasks for representation learning that are based on context or position

A

Randomly shuffle the tiles of a 3x3 grid of an image → solve the puzzle (jigsaw)

Sample 2 image crops from a 3x3 grid → predict the position of the 2nd crop relative to the first
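
Both proxy tasks get their labels for free from the way the crops are cut. The following sketch is a simplified illustration: it returns the full permutation as the jigsaw label and always uses the centre tile as the first crop, both of which are assumptions, and all function names are hypothetical.

```python
import random


def split_into_tiles(image, grid=3):
    """Cut an image (list of rows) into grid x grid tiles, row by row."""
    tile_h = len(image) // grid
    tile_w = len(image[0]) // grid
    return [[row[c * tile_w:(c + 1) * tile_w]
             for row in image[r * tile_h:(r + 1) * tile_h]]
            for r in range(grid) for c in range(grid)]


def jigsaw_sample(image, rng):
    """Return (shuffled tiles, permutation); the permutation is the label."""
    tiles = split_into_tiles(image)
    permutation = list(range(len(tiles)))
    rng.shuffle(permutation)
    return [tiles[p] for p in permutation], permutation


def relative_position_sample(image, rng):
    """Return (centre tile, neighbour tile, label in 0..7)."""
    tiles = split_into_tiles(image)
    neighbours = [i for i in range(9) if i != 4]   # 4 is the centre tile
    label = rng.randrange(8)
    return tiles[4], tiles[neighbours[label]], label
```

In both cases the network only receives the tiles and must predict the label, which forces it to learn how object parts relate spatially.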

7
Q

Some neural architectures can be used for reconstruction-based proxy tasks. Name one such architecture as well as a color-based reconstruction proxy task.

A

Architecture: Encoder-decoder; Task: Colorizing a grayscale image
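
The training pair for the colorization task can be built from any color image without labels. This is a minimal sketch under stated assumptions: images are rows of (r, g, b) tuples, the grayscale conversion uses the standard BT.601 luma weights, and the function names are hypothetical. The encoder-decoder itself is not shown.

```python
def to_grayscale(rgb_image):
    """Convert rows of (r, g, b) tuples to rows of luma values."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row]
            for row in rgb_image]


def make_colorization_pair(rgb_image):
    """Return (network input, reconstruction target) for the proxy task."""
    return to_grayscale(rgb_image), rgb_image


rgb_image = [[(255, 255, 255), (0, 0, 0)],
             [(255, 0, 0), (0, 0, 255)]]
gray_input, color_target = make_colorization_pair(rgb_image)
```

The encoder-decoder would receive `gray_input` and be trained to reconstruct `color_target`.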
