Handout #10 - Attention Mechanism and Transformer Model Flashcards
1
Q
What’s the problem with RNNs?
A
- Sequential in nature -> can’t parallelise across time steps (see the sketch after this list)
- context is computed from the past only
- no explicit distinction between short- and long-range dependencies
- training is tricky (e.g. vanishing/exploding gradients)
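Not from the handout, but a minimal NumPy sketch (toy dimensions and random weights are assumptions) that makes the first point concrete: each hidden state depends on the previous one, so the time loop can’t be parallelised.

```python
import numpy as np

def rnn_forward(x_seq, W_xh, W_hh, b_h):
    """Vanilla RNN over a sequence of inputs."""
    h = np.zeros(W_hh.shape[0])
    states = []
    for x_t in x_seq:  # sequential: h_t needs h_{t-1}, so steps can't run in parallel
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
out = rnn_forward(rng.normal(size=(5, 3)),  # 5 time steps, input size 3
                  rng.normal(size=(4, 3)),  # W_xh: hidden 4 x input 3
                  rng.normal(size=(4, 4)),  # W_hh: hidden 4 x hidden 4
                  np.zeros(4))              # hidden bias
print(out.shape)  # (5, 4): one hidden state per time step
```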
2
Q
Why can’t CNNs and DNNs be used directly for text processing?
A
The dependencies between words don’t sit at fixed positions relative to each other, unlike the fixed context window a CNN/DNN assumes
-> the verb is not always the word immediately after the subject.
3
Q
What’s cross-attention?
A
CA lets you work with multiple modalities (e.g. audio, video, images, text)
-> it works because it doesn’t depend on the position of the keys/values and can deal with any synchronisation issues (see the sketch below).
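A minimal sketch of the idea (the text/audio pairing, shapes, and function name are assumptions, not from the handout): queries come from one modality and keys/values from another, and nothing in the computation requires the two sequences to have equal length or aligned positions.

```python
import numpy as np

def cross_attention(Q, K, V):
    """Scaled dot-product attention; Q: (n_q, d), K and V: (n_kv, d)."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])            # (n_q, n_kv)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)                 # softmax over the keys
    return w @ V                                       # (n_q, d)

rng = np.random.default_rng(0)
text_q   = rng.normal(size=(7, 16))    # 7 text tokens act as queries
audio_kv = rng.normal(size=(50, 16))   # 50 audio frames act as keys/values
print(cross_attention(text_q, audio_kv, audio_kv).shape)  # (7, 16)
```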
4
Q
Give the layman’s definition of the Transformer
A
It’s an encoder/decoder network that is based solely on sequences of attention-layer blocks.
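A minimal encoder-side sketch under heavy simplifications (single head, no positional encodings, layer norm, masking, or decoder; all dimensions and weights are assumptions): the network is just a stack of attention + feed-forward blocks.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention on X: (n, d)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

def encoder_block(X, Wq, Wk, Wv, W1, W2):
    """One block: self-attention then a ReLU feed-forward layer, each with a residual."""
    X = X + self_attention(X, Wq, Wk, Wv)
    return X + np.maximum(X @ W1, 0.0) @ W2

d = 16
rng = np.random.default_rng(0)
X = rng.normal(size=(10, d))              # 10 tokens, model dimension 16
for _ in range(2):                        # stack two blocks
    Wq, Wk, Wv, W1, W2 = (rng.normal(scale=0.1, size=(d, d)) for _ in range(5))
    X = encoder_block(X, Wq, Wk, Wv, W1, W2)
print(X.shape)  # (10, 16)
```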