NLP Flashcards
Tokenisation approaches
Word-based (“It”, “‘s”, “1”; uses language-specific splitting rules), Character-based (“I”, “t”, “’”), Subword-based (“_It”, “_a”, “_gre”; splits learned for a particular dataset/language)
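A minimal sketch contrasting the three approaches on a toy sentence; the token lists shown are illustrative, not the output of a trained tokenizer.

    # Toy comparison of the three tokenisation approaches
    text = "It's a great day"

    # Word-based: split on whitespace and punctuation using language-specific rules (simplified)
    word_tokens = ["It", "'s", "a", "great", "day"]

    # Character-based: every character (including spaces) becomes a token
    char_tokens = list(text)   # ['I', 't', "'", 's', ' ', 'a', ...]

    # Subword-based: pieces learned from corpus statistics; "_" marks a word start (SentencePiece-style)
    subword_tokens = ["_It", "'s", "_a", "_gre", "at", "_day"]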
xxbos / xxeos
Marks the beginning/end of a text
xxmaj
Marks that the next word starts with a capital letter (the word itself is lowercased in the token stream)
xxunk
Placeholder for unknown token
xxrep
Marks a repeated character, followed by the repetition count and the character (e.g. “!!!!” becomes xxrep 4 !)
xxwrep
Same as xxrep, but with words instead of single characters
xxup
Placed before a token that was written in all caps (the token itself is lowercased)
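A toy sketch of how these markers could be applied during tokenisation; it mimics the rules above and is not fastai's actual implementation.

    def add_special_tokens(words):
        # Toy illustration of the xx* marker rules (not fastai's implementation)
        out = ["xxbos"]                        # mark the beginning of the text
        for w in words:
            if len(w) > 1 and w.isupper():
                out += ["xxup", w.lower()]     # all-caps word -> xxup + lowercased token
            elif w[:1].isupper():
                out += ["xxmaj", w.lower()]    # capitalised word -> xxmaj + lowercased token
            else:
                out.append(w)
        # (xxrep / xxwrep would analogously replace e.g. "!!!!" with "xxrep 4 !")
        return out

    print(add_special_tokens(["WOW", "This", "movie", "is", "great", "!"]))
    # ['xxbos', 'xxup', 'wow', 'xxmaj', 'this', 'movie', 'is', 'great', '!']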
What is numericalization?
Mapping of tokens to integers (each token's index in the vocab); see the sketch after the vocab card
What is vocab?
List of all tokens in the training set (by convention sorted by frequency)
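A minimal sketch of building a frequency-sorted vocab and numericalizing tokens with it; the xxunk fallback for out-of-vocab tokens is an illustrative assumption.

    from collections import Counter

    tokenized_texts = [["xxbos", "xxmaj", "this", "movie", "is", "great"],
                       ["xxbos", "xxmaj", "this", "film", "is", "terrible"]]

    # Vocab: all tokens of the training set, sorted by frequency (most common first)
    counts = Counter(tok for text in tokenized_texts for tok in text)
    vocab = ["xxunk"] + [tok for tok, _ in counts.most_common()]

    # Numericalization: map each token to its integer index in the vocab
    tok2idx = {tok: i for i, tok in enumerate(vocab)}
    def numericalize(tokens):
        return [tok2idx.get(tok, tok2idx["xxunk"]) for tok in tokens]

    print(numericalize(["xxbos", "xxmaj", "this", "movie", "is", "awesome"]))
    # the unseen token "awesome" falls back to the xxunk index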
One-hot encoding?
Represent each token as a binary vector of vocab length with a single 1 at the token's index, instead of a plain integer index
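A small sketch of turning token indices into one-hot vectors with PyTorch; the vocab size is an arbitrary example value.

    import torch
    import torch.nn.functional as F

    vocab_size = 10
    indices = torch.tensor([2, 5, 7])                      # numericalized tokens
    one_hot = F.one_hot(indices, num_classes=vocab_size)   # shape (3, 10)
    # each row is a binary vector with a single 1 at the token's index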
What is embedding?
A technique for representing words, phrases, or even entire sentences as dense, continuous vectors (much lower-dimensional than one-hot vectors). These embeddings capture semantic information: words with similar meanings tend to be mapped to vectors that lie close together in the vector space.
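A sketch of an embedding layer as a learned lookup table, using PyTorch's nn.Embedding; the sizes are arbitrary example values.

    import torch
    import torch.nn as nn

    vocab_size, emb_dim = 10_000, 64
    embedding = nn.Embedding(vocab_size, emb_dim)   # learnable lookup table: one dense vector per token

    token_indices = torch.tensor([2, 5, 7])         # numericalized tokens
    vectors = embedding(token_indices)              # shape (3, 64); trained together with the model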
Self-supervised learning?
Learn about the domain from the structure of the data itself, without externally provided labels
Here: given the previous tokens, predict the next token
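A sketch of how next-token prediction turns an unlabeled text into (input, target) pairs: the targets are simply the inputs shifted by one position.

    tokens = [12, 7, 43, 9, 81, 3]     # a numericalized text, no external labels needed

    inputs  = tokens[:-1]              # [12, 7, 43, 9, 81]
    targets = tokens[1:]               # [ 7, 43, 9, 81, 3]

    for x, y in zip(inputs, targets):
        print(f"after token {x}, predict token {y}")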
Training procedure for self-supervised learning?
First: fine-tune a generic pre-trained language model with self-supervised learning on the (unlabeled) texts of the dataset, so that the embeddings and the encoder adapt to the domain
Second: use the fine-tuned language model's encoder to train a task-specific classifier with supervised learning
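A hedged sketch of the two stages in fastai-style code; the dataset, hyperparameters, and exact API calls (TextDataLoaders.from_folder, language_model_learner, text_classifier_learner) are assumptions and may differ by fastai version.

    from fastai.text.all import *

    path = untar_data(URLs.IMDB)   # example corpus; any labeled text dataset works

    # Stage 1: fine-tune a pre-trained language model on the corpus itself (self-supervised)
    dls_lm = TextDataLoaders.from_folder(path, is_lm=True, valid_pct=0.1)
    lm_learn = language_model_learner(dls_lm, AWD_LSTM, metrics=Perplexity())
    lm_learn.fine_tune(1)
    lm_learn.save_encoder('finetuned_encoder')

    # Stage 2: reuse the fine-tuned encoder to train a supervised, task-specific classifier
    dls_clas = TextDataLoaders.from_folder(path, valid='test', text_vocab=dls_lm.vocab)
    clas_learn = text_classifier_learner(dls_clas, AWD_LSTM, metrics=accuracy)
    clas_learn.load_encoder('finetuned_encoder')
    clas_learn.fine_tune(4)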
Default loss metric for NLP?
“Perplexity”, i.e. exp(cross-entropy): cross-entropy is the underlying loss, perplexity the commonly reported metric
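A sketch of computing perplexity from the cross-entropy loss; the logits and targets are random placeholders standing in for model output and true next tokens.

    import torch
    import torch.nn.functional as F

    vocab_size, seq_len = 100, 8
    logits  = torch.randn(seq_len, vocab_size)           # model's next-token scores (placeholder)
    targets = torch.randint(0, vocab_size, (seq_len,))   # actual next tokens (placeholder)

    cross_entropy = F.cross_entropy(logits, targets)     # the training loss
    perplexity = torch.exp(cross_entropy)                # the reported metric: exp(cross-entropy)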
Can you generate new text with next token prediction?
Yes: a language model can generate new text by repeatedly feeding its own next-token predictions back in as input
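A sketch of such a generation loop; "model" is a hypothetical trained language model that maps a token sequence to next-token logits.

    import torch

    def generate(model, prompt_tokens, n_new_tokens=20):
        # model: hypothetical trained LM returning logits of shape (vocab_size,)
        tokens = list(prompt_tokens)
        for _ in range(n_new_tokens):
            logits = model(torch.tensor(tokens))     # scores for the next token
            next_token = int(torch.argmax(logits))   # greedy pick (sampling also works)
            tokens.append(next_token)                # feed the prediction back in as input
        return tokens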