Introduction to Transformers for NLP Flashcards
RNN
Recurrent Neural Network
In an RNN, information is passed around a loop, so the hidden state carries what was seen at earlier steps forward to later ones
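A minimal NumPy sketch of that loop, unrolled over a toy sequence (the weight names and sizes here are illustrative, not from the source):

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One step of a vanilla RNN: the new hidden state mixes the current
    input with the hidden state carried over from the previous step."""
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

# Unroll the loop over a toy sequence of 5 inputs of dimension 3,
# with a hidden state of dimension 4.
rng = np.random.default_rng(0)
W_xh, W_hh, b_h = rng.normal(size=(3, 4)), rng.normal(size=(4, 4)), np.zeros(4)
h = np.zeros(4)
for x_t in rng.normal(size=(5, 3)):
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)  # the state loops back into the next step
```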
Bag of Words
n-grams
trigram
A trigram model keeps the context of the last two words to predict the next word in the sequence
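A minimal count-based sketch of that idea, using a made-up toy corpus:

```python
from collections import Counter, defaultdict

corpus = "the cat sat on the mat because the cat was tired".split()

# Count how often each word follows each (w1, w2) context.
counts = defaultdict(Counter)
for w1, w2, w3 in zip(corpus, corpus[1:], corpus[2:]):
    counts[(w1, w2)][w3] += 1

def predict_next(w1, w2):
    """Predict the most likely next word given the last two words of context."""
    followers = counts.get((w1, w2))
    return followers.most_common(1)[0][0] if followers else None

print(predict_next("the", "cat"))  # -> 'sat' (ties broken by insertion order)
```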
LSTM
Long Short-Term Memory
GRU
Gated Recurrent Unit
Feed-forward Neural Network
BP Mechanism
Backpropagation
Gradient Descent
T5 model
seq2seq
sequence-to-sequence neural network
multi-head attention
multiple self-attention modules running in parallel, each capturing a different kind of relationship between the words in the sequence.
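A shape-level NumPy sketch of the idea; the learned projection matrices and the final output projection of a real multi-head layer are omitted to keep it short:

```python
import numpy as np

def attend(q, k, v):
    """Scaled dot-product attention for one head."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 5, 8, 2
x = rng.normal(size=(seq_len, d_model))

# Split the model dimension into n_heads independent heads, run attention
# in each head in parallel, then concatenate the per-head results.
heads = np.split(x, n_heads, axis=-1)              # 2 slices of shape (5, 4)
outputs = [attend(h, h, h) for h in heads]          # each head attends on its own slice
multi_head_out = np.concatenate(outputs, axis=-1)   # back to shape (5, 8)
```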
feed-forward
masked multi-head attention
linear
softmax
The softmax function is a mathematical function that converts a vector of real numbers into a probability distribution, where each value is between 0 and 1 and all values sum to 1.
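That definition translates directly into a short NumPy sketch (subtracting the maximum is a standard numerical-stability trick, not part of the definition):

```python
import numpy as np

def softmax(z):
    """Map a vector of real numbers to a probability distribution:
    every output lies between 0 and 1, and the outputs sum to 1."""
    exp_z = np.exp(z - np.max(z))   # shift for numerical stability
    return exp_z / exp_z.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # approx. [0.659 0.242 0.099], sums to 1
```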
input embeddings
output embeddings
tokenize
vectorize
positional encoding
self-attention
self-attention allows us to associate each word in the input with other words in the same sentence
query vector
key vector
value vector
embedding vector
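Putting the four cards above together, a minimal NumPy sketch of self-attention; the projection matrices are random placeholders standing in for learned weights:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
embeddings = rng.normal(size=(seq_len, d_model))     # one embedding vector per word

# Learned projections in a real model; random placeholders here.
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

Q = embeddings @ W_q   # query vectors
K = embeddings @ W_k   # key vectors
V = embeddings @ W_v   # value vectors

# Each word's query is scored against every word's key, so every word
# is associated with the other words in the same sentence.
scores = Q @ K.T / np.sqrt(d_model)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)       # row-wise softmax
attention_output = weights @ V                        # shape (seq_len, d_model)
```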
residual connection
The multi-headed attention output vector is added back onto the original positional input embedding
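A minimal sketch of that residual (skip) connection, followed by layer normalization as in the add-and-norm step; the variable names are illustrative:

```python
import numpy as np

def add_and_norm(x, sublayer_out, eps=1e-6):
    """Residual connection: add the sub-layer output back onto its input,
    then normalize each position's vector (layer normalization)."""
    y = x + sublayer_out                      # the residual / skip connection
    mean = y.mean(axis=-1, keepdims=True)
    std = y.std(axis=-1, keepdims=True)
    return (y - mean) / (std + eps)

rng = np.random.default_rng(0)
positional_embedding = rng.normal(size=(4, 8))   # input to the sub-layer
attention_output = rng.normal(size=(4, 8))       # output of multi-headed attention
out = add_and_norm(positional_embedding, attention_output)
```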
decoder
- Multi-headed attention layer
- Add and norm layers
- Feed-forward layer
encoder
BERT
Bidirectional Encoder Representations from Transformers
BERT-Base
BERT-Base has a total of 110 million parameters, 12 attention heads, 768 hidden nodes, and 12 layers.
BERT-Large
BERT-Large has a total of 340 million parameters, 16 attention heads, 1024 hidden nodes, and 24 layers.
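If the Hugging Face transformers library is available, these figures can be read back from the published configurations; the snippet below is a sketch assuming the standard bert-base-uncased and bert-large-uncased checkpoints:

```python
from transformers import AutoConfig

for name in ("bert-base-uncased", "bert-large-uncased"):
    cfg = AutoConfig.from_pretrained(name)
    print(name, cfg.num_hidden_layers, cfg.hidden_size, cfg.num_attention_heads)
# Expected: 12 layers / 768 hidden / 12 heads for base,
#           24 layers / 1024 hidden / 16 heads for large.
```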
Masked-LM
MLM
NSP
Next Sentence Prediction
CLS
The first token of every sequence is always a special classification token, abbreviated [CLS]
SEP
The [SEP] token serves to demarcate the break between the two sentences.
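A sketch of where the tokenizer places [CLS] and [SEP] for a sentence pair, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer("The cat sat.", "It was tired.")
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# Expected, roughly:
# ['[CLS]', 'the', 'cat', 'sat', '.', '[SEP]', 'it', 'was', 'tired', '.', '[SEP]']
```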