NLP Flashcards
Tokenisation approaches
Word-based (e.g. "It", "'s", "1"; uses language-specific rules), Character-based (e.g. "I", "t", "'"), Subword-based (e.g. "_It", "_a", "_gre"; learned for a particular dataset/language)
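A toy illustration of the three approaches (naive Python only; the subword splits are hard-coded here because a real subword tokenizer has to be learned from data):

```python
text = "It's a great day"

word_tokens = text.replace("'", " '").split()   # ["It", "'s", "a", "great", "day"]
char_tokens = list(text)                        # ["I", "t", "'", "s", " ", ...]
subword_tokens = ["_It", "'s", "_a", "_gre", "at", "_day"]  # would come from a learned model

print(word_tokens, char_tokens[:5], subword_tokens)
```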
xxbos / xxeos
Marks the beginning/end of a text (stream)
xxmaj
Placed before a token whose first letter was capitalized (the token itself is lowercased)
xxunk
Placeholder for unknown token
xxrep
Marks a repeated character with its count, e.g. "!!!!" becomes "xxrep 4 !"
xxwrep
Same as xxrep, but with words instead of single characters
xxup
Placed before a token that was originally written in all caps (the token itself is lowercased)
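The special tokens above come from fastai-style preprocessing rules. A simplified sketch of how such rules could be applied (not the actual fastai implementation):

```python
import re

def apply_special_rules(tokens):
    """Toy version of fastai-style special-token rules (simplified)."""
    out = ["xxbos"]                              # mark the beginning of the text
    for tok in tokens:
        m = re.fullmatch(r"(.)\1{2,}", tok)      # e.g. "!!!!" -> xxrep 4 !
        if m:
            out += ["xxrep", str(len(tok)), m.group(1)]
        elif tok.isupper() and len(tok) > 1:     # e.g. "GREAT" -> xxup great
            out += ["xxup", tok.lower()]
        elif tok[:1].isupper():                  # e.g. "It" -> xxmaj it
            out += ["xxmaj", tok.lower()]
        else:
            out.append(tok)
    return out

print(apply_special_rules(["It", "was", "GREAT", "!!!!"]))
# ['xxbos', 'xxmaj', 'it', 'was', 'xxup', 'great', 'xxrep', '4', '!']
```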
What is numericalization?
Mapping of tokens to integers
What is vocab?
List of all unique tokens in the training set (by convention sorted by frequency)
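A minimal sketch of building a vocab and numericalizing, using plain Python on a toy token list:

```python
from collections import Counter

tokens = ["xxbos", "xxmaj", "it", "was", "great", ",", "it", "was", "fun"]

# Vocab: the unique tokens, sorted by frequency (most frequent first).
counts = Counter(tokens)
vocab = [tok for tok, _ in counts.most_common()]

# Numericalization: replace each token with its index in the vocab.
token2idx = {tok: i for i, tok in enumerate(vocab)}
ids = [token2idx[tok] for tok in tokens]

print(vocab)   # e.g. ['it', 'was', 'xxbos', 'xxmaj', 'great', ',', 'fun']
print(ids)
```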
One-hot encoding?
Represent each token as a binary vector the length of the vocab, with a single 1 at the token's index (instead of just using the index number)
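A small PyTorch illustration (the vocab size and indices are made-up values):

```python
import torch

vocab_size = 7
idx = 2                                    # the token's position in the vocab
one_hot = torch.zeros(vocab_size)
one_hot[idx] = 1.0                         # single 1 at the token's index
print(one_hot)                             # tensor([0., 0., 1., 0., 0., 0., 0.])

# Or, for a whole sequence of token ids at once:
ids = torch.tensor([2, 0, 5])
print(torch.nn.functional.one_hot(ids, num_classes=vocab_size))
```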
What is embedding?
A technique for representing words, phrases, or even entire sentences as dense, continuous vectors, typically with a few hundred dimensions (far fewer than the vocab-sized one-hot vectors). These embeddings capture semantic information: words with similar meanings tend to be mapped to vectors that lie close together in the vector space.
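A minimal PyTorch sketch; the sizes are just example values:

```python
import torch
import torch.nn as nn

vocab_size, emb_dim = 10_000, 300          # example sizes
embedding = nn.Embedding(vocab_size, emb_dim)

ids = torch.tensor([2, 0, 5])              # numericalized tokens
vectors = embedding(ids)                   # one dense vector per token
print(vectors.shape)                       # torch.Size([3, 300])

# An embedding layer is equivalent to multiplying one-hot vectors by a
# weight matrix, but implemented as a fast index lookup.
```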
Self-supervised learning?
Learning about the domain from the structure of the data itself, without externally provided labels.
Here: given the previous tokens, predict the next token.
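In code, the self-supervised targets are just the input sequence shifted by one position (toy ids):

```python
# The "labels" come from the data itself: the target at each position is
# simply the next token in the sequence.
ids = [3, 7, 1, 9, 4, 2]        # a numericalized text (toy example)

inputs  = ids[:-1]              # [3, 7, 1, 9, 4]
targets = ids[1:]               # [7, 1, 9, 4, 2]

for x, y in zip(inputs, targets):
    print(f"given ...{x}  ->  predict {y}")
```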
Training procedure for self-supervised learning?
First: Fine-tune a generic pre-trained language model with self-supervised learning (next-token prediction) on the unlabeled texts of the target dataset, so that the embeddings and the encoder adapt to the domain.
Second: Use the fine-tuned language model's encoder to train a task-specific classifier with supervised learning.
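A rough fastai-style sketch of the two stages (df, the column names, and the epoch counts are placeholders; check the fastai docs for exact arguments):

```python
from fastai.text.all import *

# Stage 1: fine-tune a pretrained language model (self-supervised) on the dataset's texts.
dls_lm = TextDataLoaders.from_df(df, text_col='text', is_lm=True)
learn_lm = language_model_learner(dls_lm, AWD_LSTM, metrics=[accuracy, Perplexity()])
learn_lm.fine_tune(3)
learn_lm.save_encoder('finetuned_encoder')

# Stage 2: train a task-specific classifier (supervised), reusing the fine-tuned encoder.
dls_clf = TextDataLoaders.from_df(df, text_col='text', label_col='label',
                                  text_vocab=dls_lm.vocab)
learn_clf = text_classifier_learner(dls_clf, AWD_LSTM, metrics=accuracy)
learn_clf.load_encoder('finetuned_encoder')
learn_clf.fine_tune(3)
```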
Default loss metric for NLP?
“Perplexity”, the exponential of the cross-entropy loss: exp(cross_entropy). Cross-entropy is the actual training loss; perplexity is the metric reported from it.
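A quick PyTorch check with dummy logits and targets:

```python
import torch
import torch.nn.functional as F

logits  = torch.randn(8, 1000)             # 8 positions, vocab of 1000 (dummy values)
targets = torch.randint(0, 1000, (8,))     # the true next-token ids

loss = F.cross_entropy(logits, targets)    # the training loss
perplexity = torch.exp(loss)               # the reported metric
print(loss.item(), perplexity.item())
```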
Can you generate new text with next token prediction?
Yes: a language model can generate new text by repeatedly feeding its own next-token predictions back in as input.
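A greedy generation loop in sketch form; the Linear layer only stands in for a real language model so that the snippet stays runnable:

```python
import torch

vocab_size = 100
model = torch.nn.Linear(vocab_size, vocab_size)   # stand-in for a real language model

def generate(ids, n_new_tokens=10):
    ids = list(ids)
    for _ in range(n_new_tokens):
        # A real model would consume the whole context; here we just one-hot
        # the last token to keep the sketch self-contained.
        x = torch.nn.functional.one_hot(torch.tensor(ids[-1]), vocab_size).float()
        logits = model(x)
        next_id = int(torch.argmax(logits))       # greedy: pick the most likely token
        ids.append(next_id)                       # feed the prediction back in
    return ids

print(generate([3, 17, 42]))
```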
Image data is naturally numeric, while the same is not true for text data.
True
Numericalization means to exchange each token with its embedding.
False (numericalization replaces each token with its integer index in the vocab; the embedding lookup happens later, inside the model)
Advantages of subword tokenization
Handles out-of-vocabulary (OOV) words, deals with morphologically rich languages, reduces vocabulary size, improves generalization
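A hedged example assuming the sentencepiece library (corpus.txt and the vocab size are placeholders): a word never seen during training still gets split into known subword pieces, so there is no OOV problem.

```python
import sentencepiece as spm

# Train a small subword model on your own corpus (corpus.txt is a placeholder path).
spm.SentencePieceTrainer.train(input="corpus.txt", model_prefix="subword", vocab_size=2000)

sp = spm.SentencePieceProcessor(model_file="subword.model")

# The exact pieces depend on the training corpus, e.g. ['_un', 'believ', 'ably'].
print(sp.encode("unbelievably", out_type=str))
```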