ch 2 Flashcards

1
Q

What is the problem associated with modeling natural languages as opposed to formal languages?

A

Formal languages, like programming languages, can be precisely defined, while natural languages are inherently ambiguous and subject to constant change, which makes them much harder to model.

2
Q

What is language modeling (LM)?

A

Language modeling involves applying statistical and probabilistic techniques to assess the likelihood of a specific sequence of words occurring in a sentence.

3
Q

What is Statistical Language Modeling (LM)?

A

Statistical Language Modeling involves developing probabilistic models capable of predicting the next word in a sequence based on preceding words.

4
Q

______ and ______ are the two types of language models.

A
  • Statistical Language Modeling (LM)
  • Neural Language Models (NLM)
5
Q

What are the tasks associated with Language Modeling?

A
  • assigning probabilities to sentences in a language
  • evaluating the probability for each sequence of words
  • estimating the likelihood of a given word (or sequence of words) following a specific word sequence
6
Q

What are examples of Statistical Language Models?

A
  • N-gram Model,
  • Bidirectional Model,
  • Exponential Model, and
  • Continuous Space Model.
7
Q

What is a Neural Language Model (NLM)?

A

Refers to the use of neural networks for language modeling.

8
Q

________ analyzes text and creates a probability distribution for sequences of n items.

A

N-gram Model

9
Q

_______ Utilizes an equation combining n-grams and other parameters for text evaluation, offering higher accuracy than the n-gram model.

A

Exponential Model

10
Q

_______ is based on weighting each word (word embeddings) and is particularly useful for large texts or datasets.

A

Continuous Space Model

11
Q

_____ address the data sparsity issue of n-grams.

A

NLMs

12
Q

Why is Probability necessary?

A

Essential in tasks with noisy, ambiguous inputs like speech recognition.
■ P(back soonish) >> P(bassoon dish)

Crucial for writing tools (spelling and grammar correction) to detect and correct errors.
■ P(There are) >> P(Their are)
■ P(has improved) >> P(has improve)

13
Q

_____ is the simplest language model that assigns probabilities to sequences of words.

A

The n-gram

○ bigrams: “please turn”, “turn your”, “your homework”

○ trigrams: “please turn your”, “turn your homework”

14
Q

How do language models compute probabilities?

A
  • P(w|h), the probability of the next word w given some history h.
  • P(the | its water is so transparent that)
15
Q

How do we calculate P(w|h)?

A
  1. Conditional probability: P(A|B) = P(A ∩ B) / P(B)
  2. Joint probability:
    “Out of all possible sequences of 5 words,
    how many of them are ‘its water is so
    transparent’?”
  3. The Chain Rule (see the sketch below)
    • P(x1, x2, x3, …, xn) = P(x1) P(x2|x1) P(x3|x1,x2) … P(xn|x1,…,xn-1)
    • P(“its water is so transparent”) =
      P(its) × P(water|its) × P(is|its water) × P(so|its water is) × P(transparent|its water is so)
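A minimal Python sketch of the chain rule decomposition above; the conditional probability values are made up purely for illustration, not estimates from any real corpus.

```python
# Minimal sketch of the chain rule for a sentence probability.
# The conditional probabilities below are made-up illustrative values.
from functools import reduce

cond_probs = {
    ("its",): 0.05,                               # P(its)
    ("water", "its"): 0.20,                       # P(water | its)
    ("is", "its water"): 0.30,                    # P(is | its water)
    ("so", "its water is"): 0.10,                 # P(so | its water is)
    ("transparent", "its water is so"): 0.15,     # P(transparent | its water is so)
}

# P("its water is so transparent") = product of the conditionals above
p_sentence = reduce(lambda acc, p: acc * p, cond_probs.values(), 1.0)
print(f"P(sentence) = {p_sentence:.6f}")          # 0.000045
```
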
16
Q

Explain Markov models.

A
  • A Markov model assumes we can predict the probability of some future unit without looking too far into the past.
  • Bigram approximation: P(the | its water is so transparent that) ≈ P(the | that)
  • Trigram approximation: P(the | its water is so transparent that) ≈ P(the | transparent that)
17
Q

Explain the Markov model for unigrams and bigrams.

A

● Unigram Model: P(wi)
● Bigram Model: P(wi | wi-1) (see the sketch below)
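A minimal sketch of estimating these unigram and bigram probabilities by maximum likelihood from a toy corpus; the corpus and the <s>/</s> sentence markers are illustrative assumptions.

```python
# Minimal sketch: maximum likelihood unigram and bigram estimates from a toy corpus.
from collections import Counter

corpus = [
    ["<s>", "its", "water", "is", "so", "transparent", "</s>"],
    ["<s>", "its", "water", "is", "so", "clear", "</s>"],
]

unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter((w1, w2) for sent in corpus for w1, w2 in zip(sent, sent[1:]))
total = sum(unigrams.values())

def p_unigram(w):
    return unigrams[w] / total                    # P(wi) = count(wi) / total tokens

def p_bigram(w, prev):
    return bigrams[(prev, w)] / unigrams[prev]    # P(wi | wi-1) = count(wi-1 wi) / count(wi-1)

print(p_unigram("water"))        # 2 / 14 ≈ 0.14
print(p_bigram("water", "its"))  # count(its water) / count(its) = 2 / 2 = 1.0
```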

18
Q

Extend the Markov model to trigrams, 4-grams, and 5-grams.

A
  • Trigram Model:
    P(wi | wi-1, wi-2) - Probability of word wi given the two previous words wi-1 and wi-2.
  • 4-gram Model:
    P(wi | wi-1, wi-2, wi-3) - Probability of word wi given the three previous words.
  • 5-gram Model:
    P(wi | wi-1, wi-2, wi-3, wi-4) - Probability of word wi given the four previous words.
19
Q

Trigrams, 4-grams, or 5-grams are preferred over bi-grams for more context and improved accuracy.

A

t

20
Q

Log probabilities are used to avoid _____

A

numerical underflow, especially when multiplying small probabilities together.
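A minimal sketch of the underflow problem; the per-word probability and sentence length below are illustrative assumptions.

```python
# Minimal sketch of why log probabilities avoid numerical underflow.
import math

p = 1e-7        # a small per-word probability (illustrative)
n = 60          # number of probabilities multiplied together

naive = p ** n                  # 1e-420 underflows to 0.0 in floating point
log_prob = n * math.log(p)      # stays representable: about -967.1

print(naive)                    # 0.0
print(log_prob)                 # -967.08...
```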

21
Q

The evaluation process involves training a language model on a designated _____ and assessing its ability to distinguish between well-formed and poorly formed sentences using a separate, ________ .

A
  • training set
  • unseen test set
22
Q

What are the two types of evaluation?

A
  • Extrinsic evaluation
  • Intrinsic evaluation
23
Q

Explain extrinsic evaluation.

A
  • A method that assesses a language model by embedding it in an application and measuring its impact on end-to-end task performance.
  • The most effective evaluation deploys each model (A and B) in the task,
  • then measures the accuracy of each model in performing the task.
24
Q

What are the downside of and the solution to extrinsic evaluation?

A

● Executing a task for multiple models can be expensive.
● Solution: Intrinsic Evaluation

25
Q

Explain intrinsic evaluation.

A
  • Evaluates NLP system outputs against a predetermined ground truth.
  • Measures the quality of a model independently of any specific application.
26
Q

Do not include _____ in the training set to prevent bias.

A

test sentences

27
Q

Repeatedly using the same test set may lead to tuning models to its specific characteristics.

A

t

28
Q

How can we ensure efficient evaluation while maintaining the ability to detect statistically significant differences between models?

A

Select the smallest test set that provides sufficient statistical power.

29
Q

Explain perplexity.

A

● Metric: used instead of raw probability for evaluating language models.

● Definition: the inverse probability of the test set, normalized by the number of words: PP(W) = P(w1 w2 … wN)^(-1/N) (see the sketch below).
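A minimal sketch of computing perplexity in log space; the per-word probabilities under the model are made-up illustrative values.

```python
# Minimal sketch of perplexity over a test set, computed in log space.
import math

word_probs = [0.1, 0.05, 0.2, 0.1, 0.08]   # P(wi | history) for each test word (illustrative)
N = len(word_probs)

# PP(W) = P(w1 ... wN)^(-1/N)
log_prob = sum(math.log(p) for p in word_probs)
perplexity = math.exp(-log_prob / N)
print(perplexity)   # ≈ 10.5 -- lower perplexity means a better model
```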

30
Q

Higher conditional probability = _______

A

lower perplexity.

31
Q

Recognizing digits ( 0-9 ) with equal probability results in perplexity of ____

A

10, since each digit has probability 1/10 and PP(W) = ((1/10)^N)^(-1/N) = 10.

32
Q

Improvement in perplexity does not guarantee enhancement in extrinsic tasks.

A

t

33
Q

Statistical models, particularly N-grams, only work well for word prediction if _______

A

the test corpus looks like the training corpus

  • Example: a model trained on Shakespeare performs poorly on the Wall Street Journal (WSJ), and vice versa.
34
Q

______ Happens when the models are too closely tailored to the training corpus, making them less adaptable to diverse datasets.

A

Overfitting

35
Q

______ is the solution to Overfitting?

A

Generalization

36
Q

How do we achieve generalization?

A
  1. Choose a training corpus that aligns with the genre of the task.
  2. Consider the appropriate dialect or variety, especially in processing social media or spoken language.
  3. Matching genres and dialects is not enough; models can still face sparsity challenges.
    • Modify the Maximum Likelihood Estimation (MLE) method.
    • Assign some non-zero probability to any N-gram, even if it was not
      observed in training.
37
Q

______ refers to modifications addressing poor estimates in small data sets.

A

Zeros – Smoothing

38
Q

Explain Zeros – Smoothing.

A
  • makes the probability distribution less jagged by shaving a little bit of probability mass from the higher counts, and piling it instead on the zero counts
39
Q

Explain Laplace smoothing.

A

Adds one to all counts in the bigram matrix.

P(w|h) = (C(h, w) + 1) / (C(h) + V)

  • V = vocabulary size (see the sketch below)
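A minimal sketch of add-one smoothing using the formula above; the toy counts and the vocabulary size are illustrative assumptions.

```python
# Minimal sketch of add-one (Laplace) smoothing for bigram probabilities.
from collections import Counter

bigram_counts = Counter({("its", "water"): 2, ("water", "is"): 2, ("is", "so"): 1})
unigram_counts = Counter({"its": 2, "water": 2, "is": 2, "so": 1})
V = 5   # vocabulary size, e.g. {its, water, is, so, transparent}

def p_laplace(w, h):
    # P(w | h) = (C(h, w) + 1) / (C(h) + V)
    return (bigram_counts[(h, w)] + 1) / (unigram_counts[h] + V)

print(p_laplace("water", "its"))        # (2 + 1) / (2 + 5) ≈ 0.43
print(p_laplace("transparent", "its"))  # unseen bigram still gets (0 + 1) / 7 ≈ 0.14
```
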
40
Q

Add-1 estimation is considered a blunt instrument.

A

t

41
Q

Add-1 smoothing is employed in ____ and _______ .

A
  • text classification
  • domains with fewer zero counts
42
Q

Explain backoff.

A

Uses the trigram if the evidence is strong; otherwise it backs off to the bigram, and finally falls back to the unigram (see the sketch below).

  • Sometimes it helps to use less context.
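A minimal sketch of the backoff idea with toy count tables (illustrative assumptions); it omits the discounting that a proper backoff model such as Katz backoff would apply.

```python
# Minimal sketch of backoff: use the trigram estimate when its history was seen,
# otherwise back off to the bigram, then fall back to the unigram.
def p_backoff(w, w1, w2, trigrams, bigrams, unigrams, total):
    if (w1, w2) in trigrams:                                        # strong trigram evidence
        return trigrams[(w1, w2)].get(w, 0) / sum(trigrams[(w1, w2)].values())
    if w2 in bigrams:                                               # back off to the bigram
        return bigrams[w2].get(w, 0) / sum(bigrams[w2].values())
    return unigrams.get(w, 0) / total                               # fall back to the unigram

trigrams = {("its", "water"): {"is": 2}}
bigrams = {"water": {"is": 2}, "is": {"so": 1}}
unigrams = {"its": 2, "water": 2, "is": 2, "so": 1}
total = sum(unigrams.values())

print(p_backoff("is", "its", "water", trigrams, bigrams, unigrams, total))  # trigram: 1.0
print(p_backoff("so", "turn", "is", trigrams, bigrams, unigrams, total))    # bigram:  1.0
print(p_backoff("its", "x", "y", trigrams, bigrams, unigrams, total))       # unigram: 2/7 ≈ 0.29
```
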
43
Q

______ mixes unigram, bigram, and trigram.

A

Interpolation: it adds together the unigram, bigram, and trigram probabilities, each weighted by a lambda (see the sketch below).
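A minimal sketch of simple linear interpolation; the lambda weights and the component probabilities are illustrative assumptions, and in practice the lambdas are tuned on a held-out set.

```python
# Minimal sketch of simple linear interpolation of unigram, bigram, and trigram estimates.
# The lambdas must sum to 1; the values here are illustrative.
def p_interpolated(p_uni, p_bi, p_tri, lambdas=(0.1, 0.3, 0.6)):
    l1, l2, l3 = lambdas
    return l1 * p_uni + l2 * p_bi + l3 * p_tri

# Hypothetical unigram / bigram / trigram estimates for the same next word
print(p_interpolated(p_uni=0.01, p_bi=0.20, p_tri=0.50))   # 0.1*0.01 + 0.3*0.2 + 0.6*0.5 = 0.361
```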

44
Q

Out of backoff and interpolation, ______ is the preferred approach.

A

interpolation

45
Q

What are the two types of interpolation?

A
  • linear interpolation
  • conditional interpolation
46
Q

Lambdas are conditioned on the previous two words in _______ interpolation.

A

conditional

47
Q

_______ are common in open-vocabulary tasks.

A

Unknown, or out-of-vocabulary (OOV), words.

48
Q

How do we handle unknown words?

A

○ Introduce an unknown word token <UNK>.
○ Create a fixed lexicon L of size V.
○ During text normalization, replace any training word not in L with <UNK>.
○ Train <UNK> probabilities like a regular word.
○ At decoding time, use <UNK> probabilities for any word not in training (see the sketch below).
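A minimal sketch of the <UNK> replacement step during text normalization; the fixed lexicon and the example sentence are illustrative assumptions.

```python
# Minimal sketch of <UNK> handling during text normalization.
lexicon = {"its", "water", "is", "so", "transparent"}   # fixed lexicon L of size V (illustrative)

def normalize(tokens, lexicon):
    # Replace any token not in the lexicon with the <UNK> token
    return [tok if tok in lexicon else "<UNK>" for tok in tokens]

train = ["its", "water", "is", "so", "transparent", "that"]
print(normalize(train, lexicon))   # ['its', 'water', 'is', 'so', 'transparent', '<UNK>']
# <UNK> is then counted and trained like any other word, and its probability is
# reused at decoding time for any word that never appeared in training.
```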