LLM – Language Modeling Foundations Flashcards
Lecture 08
What is a Language Model (LM)?
A model that assigns probabilities (scores) to sequences of words. For example, it can tell us that P(“I want it two letter go”) < P(“I wanted to let her go”).
What is an N-Gram
A contiguous sequence of n tokens (words, subword tokens, punctuation, …).
How to calculate the probability of bi-gram based on some corpus of data?
- Count every bigram (pair of adjacent tokens) in the corpus, e.g. as a table of counts for each token pair.
- Estimate the bigram probability by maximum likelihood: P(w2 | w1) = C(w1, w2) / C(w1), i.e. how often w2 follows w1 divided by how often w1 occurs. See the sketch below.
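A minimal sketch of this counting in Python (the toy corpus and the function name bigram_prob are made up for illustration):

```python
# Minimal sketch: estimating bigram probabilities from a toy corpus with
# maximum likelihood, P(w2 | w1) = C(w1, w2) / C(w1).
from collections import Counter

corpus = [
    ["<s>", "i", "want", "to", "go", "</s>"],
    ["<s>", "i", "want", "food", "</s>"],
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    for w1, w2 in zip(sentence, sentence[1:]):
        unigram_counts[w1] += 1
        bigram_counts[(w1, w2)] += 1

def bigram_prob(w1, w2):
    """MLE estimate of P(w2 | w1); 0.0 if w1 was never seen."""
    if unigram_counts[w1] == 0:
        return 0.0
    return bigram_counts[(w1, w2)] / unigram_counts[w1]

print(bigram_prob("i", "want"))   # 1.0 in this toy corpus
print(bigram_prob("want", "to"))  # 0.5
```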
How to calculate a probability of a sequence of words using probability of bigrams?
By splitting the sequence into bigrams and multiplying their probabilities: P(w1 … wn) ≈ P(w1 | <s>) · P(w2 | w1) · … · P(</s> | wn). Note the start-of-sentence (<s>) and end-of-sentence (</s>) tokens (see the sketch below).
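A small self-contained sketch of the chain of bigram products (the probability values below are invented purely for illustration):

```python
# Minimal sketch: scoring a sentence as a product of bigram probabilities,
# with start (<s>) and end (</s>) of sentence markers.
bigram_probs = {              # hypothetical, pre-estimated bigram probabilities
    ("<s>", "i"): 0.25,
    ("i", "want"): 0.33,
    ("want", "food"): 0.5,
    ("food", "</s>"): 0.68,
}

def sentence_prob(words):
    tokens = ["<s>"] + words + ["</s>"]
    prob = 1.0
    for w1, w2 in zip(tokens, tokens[1:]):
        prob *= bigram_probs.get((w1, w2), 0.0)  # unseen bigram -> 0 (see the OOV card)
    return prob

print(sentence_prob(["i", "want", "food"]))  # 0.25 * 0.33 * 0.5 * 0.68 ≈ 0.028
```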
What are tricky things when working with some larger N-Gram models?
- Sparsity problem: there is not enough data to capture all possible token combinations.
- With 4-gram models, we have to add three start-of-sentence tokens so that we can write P(I | <s> <s> <s>) to get the probability that “I” appears at the beginning of a sentence.
- Since large N-grams are rare, multiplying many very small probabilities runs into floating-point underflow. We can take log P, but then we sum the log-probabilities instead of multiplying the probabilities (see the sketch below).
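A tiny illustration of the underflow problem (the per-token probability of 1e-5 is made up):

```python
# Tiny illustration of the floating-point problem: multiplying many small
# probabilities underflows to 0.0, while summing log-probabilities stays finite.
import math

probs = [1e-5] * 80          # 80 tokens, each with a made-up probability of 1e-5

product = 1.0
for p in probs:
    product *= p

log_sum = sum(math.log(p) for p in probs)

print(product)   # 0.0 -- underflow, the true value 1e-400 is not representable
print(log_sum)   # ≈ -921.03, the log-probability of the whole sequence
```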
How to evaluate language models?
There are two ways:
1. Extrinsic
Use the model in the actual use case, i.e. inside the application. Usually very expensive to do, and it is not always easy to find a suitable use case.
2. Intrinsic
Use a held-out part of the corpus to evaluate the model. The data must be unseen by the model (VERY IMPORTANT). You ask the model “how likely is this text?”, and a good model should assign a high probability, since it is real text.
The data is split into train/dev/test sets: train is used for training, dev is used from time to time to tune hyperparameters and check for over-/underfitting, and test is used only at the end.
What is perplexity?
When testing the LM, we give it the test data and ask how probable that text is under the model. These raw probabilities are combined into a single metric called perplexity: the inverse probability of the test text, normalized by the number of tokens (lower is better). The normalization also avoids dealing with tiny numbers and floating-point issues (we compare ordinary-sized numbers instead of 0.0000…). The absolute value has no meaning on its own; it is only used to compare models (see the sketch below).
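A minimal sketch of computing perplexity from per-token log-probabilities (the numbers are invented; perplexity here is exp of the negative average log-probability, i.e. P(w1…wN)^(-1/N)):

```python
# Minimal sketch: perplexity as the inverse probability of the test text,
# normalized by its length N:
#   PP = P(w1 .. wN)^(-1/N) = exp(-(1/N) * sum(log P(wi | history)))
import math

def perplexity(token_log_probs):
    """token_log_probs: log-probability of each test token under the model."""
    n = len(token_log_probs)
    return math.exp(-sum(token_log_probs) / n)

# Two hypothetical models scored on the same 10-token test text:
print(perplexity([math.log(0.20)] * 10))  # 5.0  -> lower perplexity, better model
print(perplexity([math.log(0.05)] * 10))  # 20.0 -> higher perplexity, worse model
```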
How is the text generation performed with LMs?
Given some sequence of tokens, either:
- Find the most probable next token and output only that one (greedy decoding), or
- Take, for example, the top 100 most probable tokens and roll a weighted dice over them to get some randomness (this is usually how it is done).
With bigram LMs, the generated token depends only on the previous token.
Generation depends heavily on the training corpus!! (See the sketch below.)
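A minimal sketch of sampling-based generation from a bigram model (the next-token distributions and all names are made up; random.choices does the weighted dice roll):

```python
# Minimal sketch: generate text from a bigram model by repeatedly rolling a
# weighted dice over the next-token distribution, restricted to the top-k tokens.
import random

# Hypothetical next-token distributions, P(next | previous):
next_token_probs = {
    "<s>":  {"i": 0.6, "you": 0.4},
    "i":    {"want": 0.7, "go": 0.3},
    "you":  {"want": 0.5, "go": 0.5},
    "want": {"food": 0.8, "</s>": 0.2},
    "go":   {"</s>": 1.0},
    "food": {"</s>": 1.0},
}

def generate(max_len=10, top_k=100):
    tokens = ["<s>"]
    while len(tokens) < max_len and tokens[-1] != "</s>":
        dist = next_token_probs[tokens[-1]]
        # keep only the top-k candidates, then sample proportionally to probability
        candidates = sorted(dist.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
        words, weights = zip(*candidates)
        tokens.append(random.choices(words, weights=weights)[0])
    return tokens[1:-1] if tokens[-1] == "</s>" else tokens[1:]

print(generate())  # e.g. ['i', 'want', 'food'] -- depends on the random rolls
```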
Describe OOV problem in language models
If we train a model on a corpus of type A and test it on text of type B, some probabilities will be zero, meaning the model has never seen that word. When multiplying the probabilities, the whole sequence probability then becomes 0 and perplexity becomes undefined (division by zero).
In generation, if the LM has never seen a word, it cannot take it into account (zero probability).
How to deal with OOV in language models?
- Choose a predefined vocabulary and convert every word that is out of the vocabulary to <UNK>.
- Alternatively, build the vocabulary from the training data: keep the top N most frequent words (or every word that occurs at least n times) and replace the rest with the <UNK> token.
- Smoothing (add-one or Laplace smoothing): assume that every token has been seen at least once (count > 0): P(wi) = (C(wi) + 1) / (N + V), where V is the vocabulary size and N is the total number of tokens (see the sketch after this list).
- Add-k smoothing: instead of pretending we saw every token at least once, we add some k < 1: P(wi) = (C(wi) + k) / (N + kV). This reduces the amount of smoothing.
- Backoff: if we have no evidence for an N-gram, fall back to the (N-1)-gram and try that.
Example: if C(Thomas Arnold is king) = 0, estimate P(king | Thomas Arnold is) using the lower-order P(king | Arnold is); if that is also zero, go down to N-2, and so on.
- Interpolation: instead of backing off only when we hit a zero, we always combine the different N-gram orders as a weighted sum. The weights are hyperparameters and should be tuned:
P(wn | wn-2, wn-1) = λ1·P(wn) + λ2·P(wn | wn-1) + λ3·P(wn | wn-2, wn-1)
where λ1 + λ2 + λ3 = 1.
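A minimal sketch of add-one (Laplace) smoothing applied to bigrams, assuming the conditional form P(wi | wi-1) = (C(wi-1, wi) + 1) / (C(wi-1) + V); the toy corpus and names are made up:

```python
# Minimal sketch: add-one (Laplace) smoothed bigram probabilities,
# so no bigram ever gets probability zero.
from collections import Counter

corpus = [["<s>", "i", "want", "food", "</s>"],
          ["<s>", "i", "want", "to", "go", "</s>"]]

unigram_counts, bigram_counts = Counter(), Counter()
vocab = set()
for sent in corpus:
    vocab.update(sent)
    for w1, w2 in zip(sent, sent[1:]):
        unigram_counts[w1] += 1
        bigram_counts[(w1, w2)] += 1

V = len(vocab)  # vocabulary size

def laplace_bigram_prob(w1, w2):
    return (bigram_counts[(w1, w2)] + 1) / (unigram_counts[w1] + V)

print(laplace_bigram_prob("want", "food"))  # seen bigram: 2/9 ≈ 0.22
print(laplace_bigram_prob("food", "go"))    # unseen bigram, but still > 0: 1/8
```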
Why should we use subword tokenization instead of word-tokenization?
Because of the OOV problem. If we can’t find a word, we can maybe find parts of the word or even letters.
Explain Byte Pair Encoding
- Initialize the vocabulary with every character in the training data.
- Repeatedly find the most common adjacent pair of tokens and merge it into a single new token, adding the merge as a rule.
- Stop after a predefined number of merge rules has been learned; the resulting merges define the subword vocabulary.
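A minimal sketch of the BPE merge-learning loop (the word list is made up, and pairs are counted over word types only, ignoring word frequencies):

```python
# Minimal sketch: learn BPE merges by repeatedly merging the most frequent
# adjacent pair of symbols, starting from characters.
from collections import Counter

words = ["low", "lower", "lowest", "newer", "newest"]
symbol_seqs = [list(w) for w in words]   # each word starts as a list of characters

def most_frequent_pair(seqs):
    pairs = Counter()
    for seq in seqs:
        for a, b in zip(seq, seq[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(seqs, pair):
    merged = []
    for seq in seqs:
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(seq[i] + seq[i + 1])  # apply the merge rule
                i += 2
            else:
                out.append(seq[i])
                i += 1
        merged.append(out)
    return merged

num_merges = 5   # predefined number of merge rules to learn
for _ in range(num_merges):
    pair = most_frequent_pair(symbol_seqs)
    if pair is None:
        break
    symbol_seqs = merge_pair(symbol_seqs, pair)

print(symbol_seqs)  # words now segmented into the learned subword units
```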
Explain Wordpiece
Models like BERT or DistilBERT use this tokenization.
Similar to BPE:
- Initializes the vocabulary with every character in the training data
- Learns a predefined number of merge rules
The difference is that, instead of merging the most common pair, it merges the pair (x, y) for which P(x, y) / (P(x) · P(y)) is greatest, i.e. the pair whose merge increases the likelihood of the training data the most.
## is the continuation token prefix: it marks a piece that continues a word rather than starting it.
How to tokenize a sentence given a vocab? Greedy longest-prefix matching: take a word and check whether the whole word is in the vocab. If yes, emit it as a token and move to the next word. If not, take the longest prefix of the word that is in the vocab, emit it, and continue with the remaining characters. It is purely greedy: it never reconsiders whether a shorter prefix would have led to a better overall segmentation. See the sketch below.
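A minimal sketch of greedy longest-prefix matching over a WordPiece-style vocabulary (the vocabulary below is made up; "##" marks continuation pieces):

```python
# Minimal sketch: greedy longest-prefix matching with a WordPiece-style vocab.
vocab = {"un", "##aff", "##able", "##ord", "the", "aff", "##ordable"}

def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # take the LONGEST remaining prefix that is in the vocab
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches -> unknown word
        pieces.append(piece)
        start = end
    return pieces

print(wordpiece_tokenize("unaffordable", vocab))  # ['un', '##aff', '##ordable']
print(wordpiece_tokenize("the", vocab))           # ['the']
```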
Explain the SentencePiece
BPE and WordPiece assume that words are separated by whitespace, but not all languages do that. SentencePiece treats whitespace like any other character and then uses BPE or a similar algorithm to merge tokens.
For de-tokenization, no language-specific knowledge is needed (such as whether to add a whitespace between tokens or something else): just concatenate the tokens (see the sketch below).
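A tiny sketch of the whitespace handling (the segmentation shown is a hand-made example of what a trained model might output, not real SentencePiece output):

```python
# Tiny sketch: treat whitespace as a normal symbol by replacing it with a
# visible marker before tokenization; detokenization is then just concatenation.
text = "hello world"

marked = text.replace(" ", "▁")   # "hello▁world": whitespace is now an ordinary character
print(marked)

# hypothetical segmentation a trained BPE/unigram model might produce:
tokens = ["hel", "lo", "▁wor", "ld"]

# detokenization needs no language-specific rules:
print("".join(tokens).replace("▁", " "))  # "hello world"
```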