Language Models Flashcards

1
Q

Challenges in LM

A

  • Vanishing probabilities
  • Unknown words and sequences
  • Exactness vs. generalization

2
Q

Exactness vs. generalization

A
  • The higher n, the more exact the estimated probabilities
  • Sometimes, less context (i.e., a lower n) may aid generalization
  • Two techniques to deal with this trade-off are backoff and interpolation
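A minimal sketch of simple interpolation, assuming hypothetical bigram and unigram probability tables (p_bigram, p_unigram) and an illustrative interpolation weight lam; none of these names or values come from the cards:

```python
def interpolated_prob(w, prev, p_bigram, p_unigram, lam=0.7):
    """Simple interpolation: mix higher-order and lower-order estimates.

    p_bigram:  dict mapping (prev, w) -> P(w | prev)
    p_unigram: dict mapping w -> P(w)
    lam:       weight of the bigram model, 0 <= lam <= 1
    """
    return lam * p_bigram.get((prev, w), 0.0) + (1 - lam) * p_unigram.get(w, 0.0)
```

Backoff differs in that the lower-order estimate is used only when the higher-order n-gram has zero count, rather than always being mixed in.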
3
Q

Unknown words and sequences

A
  • Some tokens may never appear in a training corpus
  • Even without unknown tokens, there may always be sequences s that do not appear in the training corpus but appear in other data
  • A technique used to deal with these problems is called smoothing
4
Q

Vanishing probabilities

A
  • In real-world data, the probability of most token sequences s is near 0; multiplying many such small probabilities may lead to numerical underflow (vanishing probabilities)
  • A way to deal with this problem is to use log probabilities, which are summed instead of multiplied
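A minimal sketch of the underflow problem and the log-probability workaround; the per-token probabilities are made-up illustrative values, not from the cards:

```python
import math

# Illustrative per-token probabilities for a long sequence
probs = [0.01] * 200

# Naive product underflows to 0.0 in floating point
product = 1.0
for p in probs:
    product *= p
print(product)  # 0.0

# Summing log probabilities keeps the value representable
log_prob = sum(math.log(p) for p in probs)
print(log_prob)  # approx -921.0
```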
5
Q

When to use LMs?

A
  • Probabilities of token sequences are essential in any task where tokens have to be inferred from ambiguous input
  • Ambiguity may be due to linguistic variations or due to noise
  • LMs are a key technique in generation, but are also used for analysis
6
Q

Selected applications

A

Speech recognition: Disambiguate unclear words based on likelihood
Spelling/grammar correction: Find likely errors and suggest alternatives
Machine translation: Find a likely interpretation/word order in the target language

7
Q

How to stop generating text?

A

The maximum length of the output sequence may be prespecified
Also, LMs may learn to generate a special end tag, </s>

8
Q

Large language model (LLM)

A

A neural language model trained on huge amounts of textual data
Usually based on the transformer architecture

9
Q

Transformer

A

A neural network architecture for processing input sequences in parallel
Models each input based on its surrounding inputs, a mechanism called self-attention

10
Q

What n to use? (n-gram language model)

A

  • Bigrams are used in the examples above mainly for simplicity
  • In practice, mostly trigrams, 4-grams, or 5-grams are used
  • The higher n, the more training data is needed for reliable probabilities

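A minimal sketch of maximum-likelihood bigram estimation from counts, using a toy tokenized corpus; the corpus, variable names, and example output are illustrative assumptions, not from the cards:

```python
from collections import Counter

# Toy tokenized training corpus with sentence boundary markers
corpus = ["<s>", "i", "like", "language", "models", "</s>",
          "<s>", "i", "like", "transformers", "</s>"]

# Count unigrams and bigrams (bigrams crossing sentence boundaries
# are not filtered out here, for brevity)
unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def bigram_prob(w, prev):
    """Maximum-likelihood estimate P(w | prev) = count(prev, w) / count(prev)."""
    if unigram_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, w)] / unigram_counts[prev]

print(bigram_prob("like", "i"))  # 1.0 in this toy corpus
```

The same scheme with trigrams or 4-grams needs far more data, since each additional context token multiplies the number of counts to estimate.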
11
Q

Evaluation of LM

A
  • Extrinsic: Measure/Compare impact of LMs within an application
  • Intrinsic: Measure the quality of LMs independent of an application
12
Q

Perplexity

A

The perplexity PPL of an LM on a test set is the inverse probability of the test set, normalized by the number of tokens

Notice:

  • Perplexity values are comparable only for LMs with the same vocabulary
  • Better (i.e., lower) perplexity does not imply higher extrinsic effectiveness
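For a test set W = w1 ... wN, this corresponds to PPL(W) = P(w1 ... wN)^(-1/N). A minimal sketch that computes perplexity from per-token conditional probabilities via log probabilities; the probability values are illustrative, not from the cards:

```python
import math

def perplexity(token_probs):
    """Compute PPL from the conditional probabilities P(w_i | history) of a test set."""
    n = len(token_probs)
    log_prob = sum(math.log(p) for p in token_probs)
    return math.exp(-log_prob / n)

# Illustrative per-token probabilities assigned by some LM to a test set
print(perplexity([0.2, 0.1, 0.25, 0.05]))  # approx 7.95
```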
13
Q

Branching factor

A
  • The number of possible tokens that can follow any given token in a language
  • Perplexity can be understood as the weighted average branching factor of a language
14
Q

Sampling of sequences

A
  • The probabilities of an LM encode knowledge from the training set
  • To see this, sequences s can be sampled based on their likelihood P(s)

Unigram sampling

  • Decompose the probability space [0,1] into intervals, each reflecting the probability of one unigram from the LM vocabulary
  • Choose a random point in the space, and write the associated unigram.
  • Repeat this process until </s> is written

Bigram sampling

  • Same technique, starting by sampling a random w1 = w from P(w1 | <s>)
  • Repeat the process for P(w2 | w) and so forth, until </s> is written
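A minimal sketch of bigram sampling following these steps, assuming a hypothetical nested dictionary bigram_model that maps each token to its next-token distribution; the model contents and example output are illustrative, not from the cards:

```python
import random

# Hypothetical bigram distributions P(next | current); each row sums to 1
bigram_model = {
    "<s>":      {"i": 0.6, "we": 0.4},
    "i":        {"like": 0.7, "am": 0.3},
    "we":       {"like": 1.0},
    "like":     {"language": 0.5, "models": 0.5},
    "am":       {"</s>": 1.0},
    "language": {"models": 1.0},
    "models":   {"</s>": 1.0},
}

def sample_sequence(model, max_len=20):
    """Sample tokens from the bigram model until </s> is generated."""
    tokens, current = [], "<s>"
    for _ in range(max_len):
        candidates = list(model[current].keys())
        weights = list(model[current].values())
        # random.choices picks a point in [0,1] partitioned by the weights
        current = random.choices(candidates, weights=weights, k=1)[0]
        if current == "</s>":
            break
        tokens.append(current)
    return tokens

print(sample_sequence(bigram_model))  # e.g., ['i', 'like', 'models']
```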
15
Q

Sparsity

A
  • n-grams frequent in a training set may get reliable probability estimates
  • But even huge training sets will not contain all possible n-grams

Why are zero probabilities problematic?

  • The probability of any unknown token (sequence) is underestimated
  • If any test set probability is 0, the probability of the entire test set is 0
  • No next token can be predicted for any unknown token or sequence
16
Q

Unknown tokens

A

Out-of-vocabulary (OOV) tokens

  • OOV tokens are those that appear in a test set but not in a training set.
  • They are unknown to an LM built on the training set.
  • Common examples: Slang words, misspellings, URLs, rare words,…

Solution:

  • Replace all unknown tokens in a test set by a special tag, <UNK>
  • As for any other token, estimate the probability of <UNK> on the training set
  • Two common ways to obtain <UNK> training instances exist

Alternative 1: Closed vocabulary

  1. Choose a fixed vocabulary of known tokens in advance
  2. Convert any other (OOV) token to <UNK>

Alternative 2: Frequency pruning

  1. Choose a minimum absolute or relative frequency threshold, t
  2. Convert any token with training frequency < t to <UNK>
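A minimal sketch of Alternative 2 (frequency pruning), using a toy token list and an illustrative threshold t; the helper name, data, and values are assumptions, not from the cards:

```python
from collections import Counter

def prune_rare_tokens(tokens, t=2, unk="<UNK>"):
    """Replace every token whose training frequency is below t with <UNK>."""
    counts = Counter(tokens)
    return [tok if counts[tok] >= t else unk for tok in tokens]

# Toy training tokens; "slang" and "url123" occur only once and become <UNK>
train = ["the", "model", "the", "model", "slang", "the", "url123"]
print(prune_rare_tokens(train, t=2))
# ['the', 'model', 'the', 'model', '<UNK>', 'the', '<UNK>']
```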
17
Q

Smoothing

A

Unknown sequences

  • Even if all tokens in a sequence s are known, s as a whole might never have appeared in a training set
  • Techniques that avoid P(s) = 0 in such cases are called smoothing

General idea of smoothing (aka discounting)

  • Reduce the probability mass of known sequences
  • Distribute gained mass over unknown sequences
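As a minimal illustration of this idea, the following sketch applies Laplace (add-one) smoothing, one of the techniques named on the next card, to bigram estimates; the count dictionaries and toy vocabulary are assumed inputs, not from the cards:

```python
def laplace_bigram_prob(w, prev, bigram_counts, unigram_counts, vocab_size):
    """Add-one smoothed estimate: P(w | prev) = (count(prev, w) + 1) / (count(prev) + |V|).

    Known bigrams give up a little probability mass; unseen bigrams receive
    a small, nonzero share of it.
    """
    return (bigram_counts.get((prev, w), 0) + 1) / (unigram_counts.get(prev, 0) + vocab_size)

# Toy counts with vocabulary {"i", "like", "models"}, so |V| = 3
bigram_counts = {("i", "like"): 2}
unigram_counts = {"i": 2, "like": 2, "models": 1}
print(laplace_bigram_prob("like", "i", bigram_counts, unigram_counts, 3))    # (2+1)/(2+3) = 0.6
print(laplace_bigram_prob("models", "i", bigram_counts, unigram_counts, 3))  # (0+1)/(2+3) = 0.2
```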
18
Q

Main types of smoothing

A
  • Laplace smoothing and add-k smoothing
  • Backoff, simple interpolation, and conditional interpolation
  • Absolute discounting and Kneser-Ney smoothing
  • Stupid backoff