Language Models Flashcards

1
Q

Challenges in LM

A

  • Vanishing probabilities
  • Unknown words and sequences
  • Exactness vs. generalization

2
Q

Exactness vs. generalization

A
  • The higher n, the more exact the estimated probabilities
  • Sometimes, less context (i.e., a lower n) may aid generalization
  • Two techniques to deal with this trade-off are backoff and interpolation
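A minimal sketch of simple interpolation, assuming hypothetical bigram and unigram probability tables (p_bigram, p_unigram) and an illustrative interpolation weight lam; none of these names or values come from the cards:

```python
def interpolated_prob(w, prev, p_bigram, p_unigram, lam=0.7):
    """Simple interpolation: mix higher-order and lower-order estimates.

    p_bigram:  dict mapping (prev, w) -> P(w | prev)
    p_unigram: dict mapping w -> P(w)
    lam:       weight of the bigram model, 0 <= lam <= 1
    """
    return lam * p_bigram.get((prev, w), 0.0) + (1 - lam) * p_unigram.get(w, 0.0)
```

Backoff differs in that the lower-order estimate is used only when the higher-order n-gram has zero count, rather than always being mixed in.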
3
Q

Unknown words and sequences

A
  • Some tokens may never appear in a training corpus
  • Even without unknown tokens, there may always be sequences s that do not appear in the training corpus but appear in other data
  • A technique used to deal with these problems is called smoothing
4
Q

Vanishing probabilities

A
  • In real-world data, the probability of most token sequences s is near 0; multiplying many such small probabilities may lead to numerical underflow (vanishing probabilities)
  • A way to deal with this problem is to use log probabilities, which are summed instead of multiplied
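A minimal sketch of the underflow problem and the log-probability workaround; the per-token probabilities are made-up illustrative values, not from the cards:

```python
import math

# Illustrative per-token probabilities for a long sequence
probs = [0.01] * 200

# Naive product underflows to 0.0 in floating point
product = 1.0
for p in probs:
    product *= p
print(product)  # 0.0

# Summing log probabilities keeps the value representable
log_prob = sum(math.log(p) for p in probs)
print(log_prob)  # approx -921.0
```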
5
Q

When to use LMs?

A
  • Probabilities of token sequences are essential in any task where tokens have to be inferred from ambiguous input
  • Ambiguity may be due to linguistic variations or due to noise
  • LMs are a key technique in generation, but are also used for analysis
6
Q

Selected applications

A

Speech recognition: Disambiguate unclear words based on likelihood
Spelling/grammar correction: Find likely errors and suggest alternatives
Machine translation: Find a likely interpretation/word order in the target language

7
Q

How to stop generating text?

A

The maximum length of the output sequence may be prespecified
Also, LMs may learn to generate a special end tag, </s>

8
Q

Large language model (LLM)

A

A neural language model trained on huge amounts of textual data
Usually based on the transformer architecture

9
Q

Transformer

A

A neural network architecture for processing input sequences in parallel
Models each input based on its surrounding inputs, a mechanism called self-attention

10
Q

What n to use? (n-gram language model)

A

  • Bigrams are used in the examples above mainly for simplicity
  • In practice, mostly trigrams, 4-grams, or 5-grams are used
  • The higher n, the more training data is needed for reliable probabilities

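A minimal sketch of maximum-likelihood bigram estimation from counts, using a toy tokenized corpus; the corpus, variable names, and example output are illustrative assumptions, not from the cards:

```python
from collections import Counter

# Toy tokenized training corpus with sentence boundary markers
corpus = ["<s>", "i", "like", "language", "models", "</s>",
          "<s>", "i", "like", "transformers", "</s>"]

# Count unigrams and bigrams (bigrams crossing sentence boundaries
# are not filtered out here, for brevity)
unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def bigram_prob(w, prev):
    """Maximum-likelihood estimate P(w | prev) = count(prev, w) / count(prev)."""
    if unigram_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, w)] / unigram_counts[prev]

print(bigram_prob("like", "i"))  # 1.0 in this toy corpus
```

The same scheme with trigrams or 4-grams needs far more data, since each additional context token multiplies the number of counts to estimate.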
11
Q

Evaluation of LM

A
  • Extrinsic: Measure/Compare impact of LMs within an application
  • Intrinsic: Measure the quality of LMs independent of an application
12
Q

Perplexity

A

The perplexity PPL of an LM on a test set is the inverse probability of the test set, normalized by the number of tokens

Notice:

  • Perplexity values are comparable only for LMs with the same vocabulary
  • Better (i.e., lower) perplexity does not imply higher extrinsic effectiveness
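For a test set W = w1 ... wN, this corresponds to PPL(W) = P(w1 ... wN)^(-1/N). A minimal sketch that computes perplexity from per-token conditional probabilities via log probabilities; the probability values are illustrative, not from the cards:

```python
import math

def perplexity(token_probs):
    """Compute PPL from the conditional probabilities P(w_i | history) of a test set."""
    n = len(token_probs)
    log_prob = sum(math.log(p) for p in token_probs)
    return math.exp(-log_prob / n)

# Illustrative per-token probabilities assigned by some LM to a test set
print(perplexity([0.2, 0.1, 0.25, 0.05]))  # approx 7.95
```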
13
Q

Branching factor

A
  • The number of possible tokens that can follow any given token in a language
  • Perplexity can be understood as the weighted average branching factor of a language
14
Q

Sampling of sequences

A
  • The probabilities of an LM encode knowledge from the training set
  • To see this, sequences s can be sampled based on their likelihood P(s)

Unigram sampling

  • Decompose the probability space [0,1] into intervals, each reflecting the probability of one unigram from the LM vocabulary
  • Choose a random point in the space, and write the associated unigram.
  • Repeat this process until </s> is written

Bigram sampling

  • Same technique, starting by sampling a random w1 = w from P(w1 | <s>)
  • Repeat the process for P(w2 | w) and so forth, until </s> is written
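A minimal sketch of bigram sampling following these steps, assuming a hypothetical nested dictionary bigram_model that maps each token to its next-token distribution; the model contents and example output are illustrative, not from the cards:

```python
import random

# Hypothetical bigram distributions P(next | current); each row sums to 1
bigram_model = {
    "<s>":      {"i": 0.6, "we": 0.4},
    "i":        {"like": 0.7, "am": 0.3},
    "we":       {"like": 1.0},
    "like":     {"language": 0.5, "models": 0.5},
    "am":       {"</s>": 1.0},
    "language": {"models": 1.0},
    "models":   {"</s>": 1.0},
}

def sample_sequence(model, max_len=20):
    """Sample tokens from the bigram model until </s> is generated."""
    tokens, current = [], "<s>"
    for _ in range(max_len):
        candidates = list(model[current].keys())
        weights = list(model[current].values())
        # random.choices picks a point in [0,1] partitioned by the weights
        current = random.choices(candidates, weights=weights, k=1)[0]
        if current == "</s>":
            break
        tokens.append(current)
    return tokens

print(sample_sequence(bigram_model))  # e.g., ['i', 'like', 'models']
```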
15
Q

Sparsity

A
  • n-grams frequent in a training set may get reliable probability estimates
  • But even huge training sets will not contain all possible n-grams

Why are zero probabilities problematic?

  • The probability of any unknown token (sequence) is underestimated
  • If any test set probability is 0, the probability of the entire test set is 0
  • No next token can be predicted for any unknown token or sequence
16
Q

Unknown tokens

A

Out-of-vocabulary (OOV) tokens

  • OOV tokens are those that appear in a test set but not in a training set.
  • They are unknown to an LM built on the training set.
  • Common examples: Slang words, misspellings, URLs, rare words,…

Solution:

  • Replace all unknown tokens in a test set by a special tag, <UNK>
  • As for any other token, estimate the probability of <UNK> on the training set
  • Two common ways to obtain <UNK> training instances exist

Alternative 1: Closed vocabulary

  1. Choose a fixed vocabulary of known tokens in advance
  2. Convert any other (OOV) token to <UNK>

Alternative 2: Frequency pruning

  1. Choose a minimum absolute or relative frequency threshold, t
  2. Convert any token with training frequency < t to <UNK>
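A minimal sketch of Alternative 2 (frequency pruning), using a toy token list and an illustrative threshold t; the helper name, data, and values are assumptions, not from the cards:

```python
from collections import Counter

def prune_rare_tokens(tokens, t=2, unk="<UNK>"):
    """Replace every token whose training frequency is below t with <UNK>."""
    counts = Counter(tokens)
    return [tok if counts[tok] >= t else unk for tok in tokens]

# Toy training tokens; "slang" and "url123" occur only once and become <UNK>
train = ["the", "model", "the", "model", "slang", "the", "url123"]
print(prune_rare_tokens(train, t=2))
# ['the', 'model', 'the', 'model', '<UNK>', 'the', '<UNK>']
```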
17
Q

Smoothing

A

Unknown sequences

  • Even if all tokens in a sequence s are known, s as a whole might never have appeared in a training set
  • Techniques that avoid P(s) = 0 in such cases are called smoothing

General idea of smoothing (aka discounting)

  • Reduce the probability mass of known sequences
  • Distribute gained mass over unknown sequences
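As a minimal illustration of this idea, the following sketch applies Laplace (add-one) smoothing, one of the techniques named on the next card, to bigram estimates; the count dictionaries and toy vocabulary are assumed inputs, not from the cards:

```python
def laplace_bigram_prob(w, prev, bigram_counts, unigram_counts, vocab_size):
    """Add-one smoothed estimate: P(w | prev) = (count(prev, w) + 1) / (count(prev) + |V|).

    Known bigrams give up a little probability mass; unseen bigrams receive
    a small, nonzero share of it.
    """
    return (bigram_counts.get((prev, w), 0) + 1) / (unigram_counts.get(prev, 0) + vocab_size)

# Toy counts with vocabulary {"i", "like", "models"}, so |V| = 3
bigram_counts = {("i", "like"): 2}
unigram_counts = {"i": 2, "like": 2, "models": 1}
print(laplace_bigram_prob("like", "i", bigram_counts, unigram_counts, 3))    # (2+1)/(2+3) = 0.6
print(laplace_bigram_prob("models", "i", bigram_counts, unigram_counts, 3))  # (0+1)/(2+3) = 0.2
```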
18
Q

Main types of smoothing

A
  • Laplace smoothing and add-k smoothing
  • Backoff, simple interpolation, and conditional interpolation
  • Absolute discounting and Kneser-Ney smoothing
  • Stupid backoff