Sign Language Flashcards
Video recognition: what is the problem with a convnet + pooling of features?
No temporal info: pooling ignores the order of the frames
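A minimal sketch of why pooled per-frame features carry no temporal information. The feature values and the `mean_pool` helper are made up for illustration; the point is only that average pooling gives the same clip vector when the frames are played backwards.

```python
def mean_pool(frame_features):
    """Average per-frame feature vectors into a single clip-level vector."""
    n = len(frame_features)
    dim = len(frame_features[0])
    return [sum(f[d] for f in frame_features) / n for d in range(dim)]

# Hypothetical per-frame convnet features: 4 frames, 3 dimensions each.
frames = [
    [1.0, 0.0, 2.0],
    [2.0, 1.0, 0.0],
    [3.0, 2.0, 1.0],
    [4.0, 3.0, 5.0],
]
reversed_frames = frames[::-1]  # same clip played backwards

forward = mean_pool(frames)
backward = mean_pool(reversed_frames)

# Pooling is order-invariant, so both directions give the same vector.
same = all(abs(a - b) < 1e-9 for a, b in zip(forward, backward))
print(forward, backward, same)
```

A sign and its time-reversed version would therefore be indistinguishable to a classifier that only sees the pooled vector.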
What are the main types of networks for video recognition?
2D pre-trained convnet with temporal aggregation via pooling or LSTM
3D models
One of the best is I3D, an inflated model created from a 2D Inception model
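A rough sketch of what "inflating" means, on a toy 2x2 kernel: the 2D weights are repeated along a new time axis and divided by the number of repeats. The helper names and the toy numbers are assumptions for illustration, not code from the I3D release. With this rescaling, a video made of one frame repeated gives the same response as the original 2D filter on that frame.

```python
def inflate_kernel(kernel_2d, time_size):
    """Repeat an H x W kernel time_size times along time, rescaled by 1/time_size."""
    return [[[w / time_size for w in row] for row in kernel_2d]
            for _ in range(time_size)]

def response_2d(kernel, patch):
    """Dot product of a 2D kernel with an image patch of the same size."""
    return sum(k * p for k_row, p_row in zip(kernel, patch)
               for k, p in zip(k_row, p_row))

def response_3d(kernel_3d, clip):
    """Dot product of a 3D kernel with a clip (list of patches over time)."""
    return sum(response_2d(k_t, frame) for k_t, frame in zip(kernel_3d, clip))

kernel_2d = [[1.0, 2.0], [3.0, 4.0]]   # toy "pre-trained" 2D filter
patch = [[1.0, 0.0], [2.0, 1.0]]       # toy image patch
time_size = 4

kernel_3d = inflate_kernel(kernel_2d, time_size)
static_clip = [patch] * time_size      # the same frame repeated over time

out_2d = response_2d(kernel_2d, patch)
out_3d = response_3d(kernel_3d, static_clip)
print(out_2d, out_3d)
```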
What is an important data preprocessing step in video recognition?
Subsampling from the 25 video frames per second down to something like 2-5
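A small sketch of temporal subsampling by picking evenly spaced frame indices. The function name and default rates are assumptions for illustration; real pipelines may sample differently (random offsets, fixed clip lengths, and so on).

```python
def subsample_indices(num_frames, source_fps=25, target_fps=5):
    """Return indices of the frames to keep when going from source_fps to target_fps."""
    step = source_fps / target_fps          # source frames per kept frame
    count = int(num_frames / step)          # number of frames to keep
    return [int(i * step) for i in range(count)]

# A 2-second clip at 25 fps has 50 frames.
kept_5fps = subsample_indices(50, source_fps=25, target_fps=5)
kept_2fps = subsample_indices(50, source_fps=25, target_fps=2)
print(kept_5fps)
print(kept_2fps)
```

Going from 25 fps to 5 fps keeps every fifth frame, so the model sees five times fewer frames for the same span of time.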
What is the problem with 3d video recognition models?
They have a lot of parameters, so they are hard to train; usually they use shallow architectures.
The videos are usually subsampled in both pixel resolution and time.
Also look at temporal strides.
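A back-of-the-envelope sketch of the two points above: how many more parameters a 3D convolution layer has than its 2D counterpart, and how a temporal stride shortens the clip. The layer sizes are illustrative assumptions; the output-length formula is the standard one for strided convolutions.

```python
from math import prod

def conv_params(in_channels, out_channels, kernel_dims):
    """Weights plus biases of one convolution layer."""
    return in_channels * out_channels * prod(kernel_dims) + out_channels

def output_length(input_len, kernel, stride, padding=0):
    """Length along one axis after a strided convolution."""
    return (input_len + 2 * padding - kernel) // stride + 1

# Same channel counts, 3x3 kernel versus 3x3x3 kernel.
params_2d = conv_params(64, 64, (3, 3))
params_3d = conv_params(64, 64, (3, 3, 3))

# A 16-frame clip through a temporal kernel of 3 with stride 2 and padding 1.
frames_out = output_length(16, kernel=3, stride=2, padding=1)

print(params_2d, params_3d, frames_out)
```

Each 3x3x3 layer has roughly three times the weights of a 3x3 layer here, and a temporal stride of 2 halves the number of time steps that later layers must process.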
What features were used in the Oscar paper?
Full body
Hands
Mouth
Sign paper: what models did they use to embed each feature?
I3D for video
AV-HuBERT for non-manual signs
DeepHand for hands