Language Models Flashcards
Challenges in LM
- Vanishing probabilities
- Unknown words and sequences
- Exactness vs. generalization
Exactness vs. generalization
- The higher n, the more exact the estimated probabilities
- Sometimes, less context (i.e., a lower n) may aid generalization
- Two techniques to deal with this problem are backoff and interpolation
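A minimal sketch of linear interpolation, assuming hypothetical count dictionaries (unigram_counts, bigram_counts, trigram_counts) and hand-picked weights; backoff would instead use the highest-order estimate whose count is non-zero.

```python
# Sketch: linear interpolation of unigram, bigram, and trigram estimates.
# The count dictionaries and weights are assumptions for illustration;
# in practice, the weights are tuned on held-out data.

def interpolated_prob(w3, w2, w1, unigram_counts, bigram_counts, trigram_counts,
                      lambdas=(0.1, 0.3, 0.6)):
    """P(w3 | w1, w2) as a weighted mix of unigram, bigram, and trigram MLEs."""
    total = sum(unigram_counts.values())
    p_uni = unigram_counts.get(w3, 0) / total
    p_bi = (bigram_counts.get((w2, w3), 0) / unigram_counts[w2]
            if unigram_counts.get(w2) else 0.0)
    p_tri = (trigram_counts.get((w1, w2, w3), 0) / bigram_counts[(w1, w2)]
             if bigram_counts.get((w1, w2)) else 0.0)
    l1, l2, l3 = lambdas  # weights must sum to 1
    return l1 * p_uni + l2 * p_bi + l3 * p_tri
```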
Unknown words and sequences
- Some tokens may never appear in a training corpus
- Even without unknown tokens, there may always be sequences s that do not appear in the training corpus but appear in other data
- A technique used to deal with these problems is called smoothing
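A minimal sketch of add-one (Laplace) smoothing for bigram probabilities, assuming hypothetical count dictionaries and a known vocabulary size; it reserves probability mass for unseen bigrams so they no longer get probability 0.

```python
# Sketch: add-one (Laplace) smoothing for bigram probabilities.
# Unseen bigrams receive a small non-zero probability instead of 0.

def laplace_bigram_prob(w2, w1, bigram_counts, unigram_counts, vocab_size):
    """P(w2 | w1) with add-one smoothing over a vocabulary of size vocab_size."""
    count_bigram = bigram_counts.get((w1, w2), 0)
    count_w1 = unigram_counts.get(w1, 0)
    return (count_bigram + 1) / (count_w1 + vocab_size)
```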
Vanishing probabilities
- In real-world data, the probability of most token sequences s is near 0, so multiplying many such probabilities may underflow to 0 (vanishing probabilities)
- A way to deal with this problem is to use log probabilities
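A minimal sketch showing why log probabilities help: multiplying many small probabilities underflows to 0.0, while summing their logs stays representable. The probability values are made up for illustration.

```python
import math

# 1000 tokens, each with probability 0.01: the product underflows,
# but the sum of log probabilities is a perfectly ordinary number.
probs = [0.01] * 1000

product = 1.0
for p in probs:
    product *= p
print(product)                      # 0.0 (underflow)

log_prob = sum(math.log(p) for p in probs)
print(log_prob)                     # approx. -4605.17
```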
When to use LMs?
- Probabilities of token sequences are essential in any task where tokens have to be inferred from ambiguous input
- Ambiguity may be due to linguistic variations or due to noise
- LMs are a key technique in generation, but are also used for analysis
Selected applications
Speech recognition: Disambiguate unclear words based on likelihood
Spelling/grammar correction: Find likely errors and suggest alternatives
Machine translation: Find likely interpretation/order in target language
How to stop generating text?
The maximum length of the output sequence may be prespecified
Also, LMs may learn to generate a special end tag, </s>
Large language model (LLM)
A neural language model trained on huge amounts of textual data
Usually based on the transformer architecture
Transformer
A neural network architecture for processing input sequences in parallel
Models each input based on its surrounding inputs, a mechanism called self-attention
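A minimal sketch of scaled dot-product self-attention using NumPy, assuming toy input embeddings and randomly initialized projection matrices; a real transformer adds multiple heads, feed-forward layers, residual connections, and positional information.

```python
import numpy as np

# Sketch: scaled dot-product self-attention for one toy sequence.
rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
x = rng.normal(size=(seq_len, d_model))      # toy input embeddings

# Randomly initialized projection matrices (learned in a real model).
W_q = rng.normal(size=(d_model, d_model))
W_k = rng.normal(size=(d_model, d_model))
W_v = rng.normal(size=(d_model, d_model))

Q, K, V = x @ W_q, x @ W_k, x @ W_v

# Each position attends to every position in parallel.
scores = Q @ K.T / np.sqrt(d_model)                  # (seq_len, seq_len)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)       # row-wise softmax
output = weights @ V                                 # (seq_len, d_model)
print(output.shape)                                  # (4, 8)
```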
What n to use? (n-gram language model)
- Bigrams are used in the examples above mainly for simplicity
- In practice, mostly trigrams, 4-grams, or 5-grams are used
- The higher n, the more training data is needed for reliable probabilities
Evaluation of LM
- Extrinsic: Measure/Compare impact of LMs within an application
- Intrinsic: Measure the quality of LMs independent of an application
Perplexity
The perplexity PPL of an LM on a test set is the inverse probability of the test set, normalized by the number of tokens
Notice:
- Perplexity values are comparable only for LMs with same vocabulary
- Better (i.e., lower) perplexity does not imply more extrinsic effectiveness
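A minimal sketch of computing perplexity from per-token conditional probabilities, assuming a hypothetical list token_probs holding P(w_i | w_1 ... w_{i-1}) for each test token (the values below are made up).

```python
import math

def perplexity(token_probs):
    """PPL = P(w_1 ... w_N) ** (-1 / N) = exp(-(1 / N) * sum of log P(w_i | ...))."""
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)

# Made-up conditional probabilities of four test tokens.
print(perplexity([0.25, 0.1, 0.5, 0.05]))   # approx. 6.32
```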
Branching factor
- The number of possible next tokens that can follow any token in a language
- Perplexity can be understood as the weighted average branching factor
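A small self-contained check of the branching-factor interpretation: a model that assigns uniform probability 1/V to every next token has perplexity V (here V = 50 is an arbitrary toy vocabulary size).

```python
import math

# Uniform model over a vocabulary of V tokens: every next token gets
# probability 1/V, so the perplexity equals the branching factor V.
V, N = 50, 1000
log_prob = N * math.log(1 / V)    # log-probability of a test set of N tokens
ppl = math.exp(-log_prob / N)
print(ppl)                        # approx. 50
```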
Sampling of sequences
- The probabilities of an LM encode knowledge from the training set
- To see this, sequences s can be sampled based on their likelihood P(s)
Unigram sampling
- Decompose the probability space [0,1] into intervals, each reflecting the probability of one unigram from the LM vocabulary
- Choose a random point in the space, and write the associated unigram.
- Repeat this process until </s> is written
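A minimal sketch of unigram sampling via the described interval decomposition, assuming a hypothetical unigram distribution unigram_probs that includes the end tag </s>.

```python
import random

# Hypothetical unigram distribution (probabilities sum to 1),
# including the end tag so that generation can terminate.
unigram_probs = {"the": 0.4, "cat": 0.25, "sat": 0.2, "</s>": 0.15}

def sample_unigram(probs):
    """Map a random point in [0, 1) to the unigram whose interval contains it."""
    point = random.random()
    cumulative = 0.0
    for token, p in probs.items():
        cumulative += p
        if point < cumulative:
            return token
    return list(probs)[-1]  # guard against floating-point rounding

tokens = []
while True:
    token = sample_unigram(unigram_probs)
    if token == "</s>":
        break
    tokens.append(token)
print(" ".join(tokens))
```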
Bigram sampling
- Same technique, starting by sampling a random w1 = w from P(w1 | <s>)
- Repeat the process for P(w2 | w), and so forth, until </s> is written
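The same interval technique extended to bigrams, assuming a hypothetical conditional table bigram_probs that maps the previous token to a next-token distribution (the interval helper from the unigram sketch is repeated here so the block is self-contained).

```python
import random

# Hypothetical conditional distributions P(next | previous), each summing to 1.
bigram_probs = {
    "<s>":  {"the": 0.7, "a": 0.3},
    "the":  {"cat": 0.5, "dog": 0.3, "mat": 0.2},
    "a":    {"cat": 0.6, "dog": 0.4},
    "cat":  {"sat": 0.6, "</s>": 0.4},
    "dog":  {"sat": 0.5, "</s>": 0.5},
    "mat":  {"</s>": 1.0},
    "sat":  {"</s>": 1.0},
}

def sample_from(probs):
    """Interval decomposition of [0, 1), as in the unigram sketch."""
    point, cumulative = random.random(), 0.0
    for token, p in probs.items():
        cumulative += p
        if point < cumulative:
            return token
    return list(probs)[-1]

previous, tokens = "<s>", []
while True:
    token = sample_from(bigram_probs[previous])
    if token == "</s>":
        break
    tokens.append(token)
    previous = token
print(" ".join(tokens))
```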
Sparsity
- n-grams frequent in a training set may get reliable probability estimates.
- But even huge training sets will not contain all possible n-grams
Why are zero probabilities problematic?
- The probability of any unknown token (sequence) is underestimated
- If any test set probability is 0, the probability of the entire test set is 0
- No next token can be predicted for any unknown token or sequence
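A tiny illustration of the second point: a single zero-probability n-gram zeroes the whole test-set probability, and its log is undefined (the probability values are made up).

```python
import math

# Per-token probabilities of a test sequence; one token was never seen
# in training, so its (unsmoothed) probability is 0.
token_probs = [0.2, 0.05, 0.0, 0.3]

product = 1.0
for p in token_probs:
    product *= p
print(product)          # 0.0 -> the whole test set gets probability 0

# math.log(0.0) raises ValueError, so perplexity cannot even be computed.
```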