ch 2 Flashcards

1
Q

What is the problem associated with modeling natural languages as opposed to formal languages?

A

Formal languages, like programming languages, can be precisely defined, while natural languages are inherently ambiguous and subject to constant change, which makes them much harder to model.

2
Q

What is language modeling (LM)?

A

Language modeling involves applying statistical and probabilistic techniques to assess the likelihood of a specific sequence of words occurring in a sentence.

3
Q

What is Statistical Language Modeling (LM)?

A

Statistical Language Modeling involves developing probabilistic models capable of predicting the next word in a sequence based on preceding words.

4
Q

______ and ______ are the two types of language models.

A
  • Statistical Language Modeling (LM)
  • Neural Language Models (NLM)
5
Q

What are the tasks associated with Language Modeling?

A
  • assigning probabilities to sentences in a language
  • evaluating the probability for each sequence of words
  • estimating the likelihood of a given word (or sequence of words) following a specific word sequence
6
Q

What are examples of Statistical Language Models?

A
  • N-gram Model,
  • Bidirectional Model,
  • Exponential Model, and
  • Continuous Space Model.
7
Q

What is a Neural Language Model (NLM)?

A

Refers to the use of neural networks for language modeling.

8
Q

________ analyzes text and creates a probability distribution for sequences of n items.

A

N-gram Model

9
Q

_______ Utilizes an equation combining n-grams and other parameters for text evaluation, offering higher accuracy than the n-gram model.

A

Exponential Model

10
Q

_______ is based on weighting each word (word embeddings) and is particularly useful for large texts or datasets.

A

Continuous Space Model

11
Q

_____ address the data sparsity issue of n-grams.

A

NLMs

12
Q

Why is Probability necessary?

A

Essential in tasks with noisy, ambiguous inputs like speech recognition.
■ P(back soonish) >> P(bassoon dish)

Crucial for writing tools (spelling and grammar correction) to detect and correct errors.
■ P(There are) >> P(Their are)
■ P(has improved) >> P(has improve)

13
Q

_____ is the simplest language model that assigns probabilities to sequences of words.

A

The n-gram

○ bigrams: “please turn”, “turn your”, “your homework”

○ trigrams: “please turn your”, “turn your homework”

14
Q

How do language models compute probabilities?

A
  • P(w|h), the probability of the next word w given some history h.
  • P(the | its water is so transparent that)
15
Q

How do we calculate P(w|h)?

A
  1. Conditional probability: P(A|B) = P(A ∩ B) / P(B)
  2. Joint probability:
    “Out of all possible sequences of 5 words,
    how many of them are ‘its water is so
    transparent’?”
  3. The Chain Rule (see the sketch below)
    • P(x1, x2, x3, …, xn) = P(x1) P(x2|x1) P(x3|x1,x2) … P(xn|x1,…,xn-1)
    • P(“its water is so transparent”) =
      P(its) × P(water|its) × P(is|its water) × P(so|its water is) × P(transparent|its water is so)
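A minimal Python sketch of the chain rule decomposition above; the conditional probability values are made up purely for illustration, not estimates from any real corpus.

```python
# Minimal sketch of the chain rule for a sentence probability.
# The conditional probabilities below are made-up illustrative values.
from functools import reduce

cond_probs = {
    ("its",): 0.05,                               # P(its)
    ("water", "its"): 0.20,                       # P(water | its)
    ("is", "its water"): 0.30,                    # P(is | its water)
    ("so", "its water is"): 0.10,                 # P(so | its water is)
    ("transparent", "its water is so"): 0.15,     # P(transparent | its water is so)
}

# P("its water is so transparent") = product of the conditionals above
p_sentence = reduce(lambda acc, p: acc * p, cond_probs.values(), 1.0)
print(f"P(sentence) = {p_sentence:.6f}")          # 0.000045
```
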
16
Q

Explain Markov models.

A
  • A Markov model assumes we can predict the probability of some future unit without looking too far into the past.
  • Bigram approximation: P(the | its water is so transparent that) ≈ P(the | that)
  • Trigram approximation: P(the | its water is so transparent that) ≈ P(the | transparent that)
17
Q

Explain the Markov model for unigrams and bigrams.

A

● Unigram Model: P(wi)
● Bigram Model: P(wi | wi-1) (see the sketch below)
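A minimal sketch of estimating these unigram and bigram probabilities by maximum likelihood from a toy corpus; the corpus and the <s>/</s> sentence markers are illustrative assumptions.

```python
# Minimal sketch: maximum likelihood unigram and bigram estimates from a toy corpus.
from collections import Counter

corpus = [
    ["<s>", "its", "water", "is", "so", "transparent", "</s>"],
    ["<s>", "its", "water", "is", "so", "clear", "</s>"],
]

unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter((w1, w2) for sent in corpus for w1, w2 in zip(sent, sent[1:]))
total = sum(unigrams.values())

def p_unigram(w):
    return unigrams[w] / total                    # P(wi) = count(wi) / total tokens

def p_bigram(w, prev):
    return bigrams[(prev, w)] / unigrams[prev]    # P(wi | wi-1) = count(wi-1 wi) / count(wi-1)

print(p_unigram("water"))        # 2 / 14 ≈ 0.14
print(p_bigram("water", "its"))  # count(its water) / count(its) = 2 / 2 = 1.0
```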

18
Q

Extend the Markov model to trigrams, 4-grams, and 5-grams.

A
  • Trigram Model:
    P(wi | wi-1, wi-2) - Probability of word wi given the two previous words wi-1 and wi-2.
  • 4-gram Model:
    P(wi | wi-1, wi-2, wi-3) - Probability of word wi given the three previous words.
  • 5-gram Model:
    P(wi | wi-1, wi-2, wi-3, wi-4) - Probability of word wi given the four previous words.
19
Q

Trigrams, 4-grams, or 5-grams are preferred over bi-grams for more context and improved accuracy.

A

t

20
Q

Log probabilities are used to avoid _____

A

numerical underflow, especially when multiplying small probabilities together.
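A minimal sketch of the underflow problem; the per-word probability and sentence length below are illustrative assumptions.

```python
# Minimal sketch of why log probabilities avoid numerical underflow.
import math

p = 1e-7        # a small per-word probability (illustrative)
n = 60          # number of probabilities multiplied together

naive = p ** n                  # 1e-420 underflows to 0.0 in floating point
log_prob = n * math.log(p)      # stays representable: about -967.1

print(naive)                    # 0.0
print(log_prob)                 # -967.08...
```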

21
Q

The evaluation process involves training a language model on a designated _____ and assessing its ability to distinguish between well-formed and poorly formed sentences using a separate, ________ .

A
  • training set
  • unseen test set
22
Q

What are the two types of evaluation?

A
  • Extrinsic evaluation
  • Intrinsic evaluation
23
Q

Explain extrinsic evaluation.

A
  • A method that assesses a language model by embedding it in an application and measuring its impact on end-to-end task performance.
  • The most effective evaluation deploys each model (A and B) in the task,
  • then measures the accuracy of each model in performing the task.
24
Q

What are the downside of and the solution to extrinsic evaluation?

A

● Executing a task for multiple models can be expensive.
● Solution: Intrinsic Evaluation

25
Q

Explain intrinsic evaluation.

A
  • Evaluates NLP system outputs against a predetermined ground truth.
  • Measures the quality of a model independently of any specific application.
26
Q

Do not include _____ in the training set to prevent bias.

A

test sentences

27
Q

Repeatedly using the same test set may lead to tuning models to its specific characteristics.

A

t

28
Q

How can we ensure efficient evaluation while maintaining the ability to detect statistically significant differences between models?

A

Select the smallest test set that provides sufficient statistical power.

29
Q

Explain perplexity.

A

● Metric: used instead of raw probability for evaluating language models.

● Definition: the inverse probability of the test set, normalized by the number of words: PP(W) = P(w1 w2 … wN)^(-1/N) (see the sketch below).
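A minimal sketch of computing perplexity in log space; the per-word probabilities under the model are made-up illustrative values.

```python
# Minimal sketch of perplexity over a test set, computed in log space.
import math

word_probs = [0.1, 0.05, 0.2, 0.1, 0.08]   # P(wi | history) for each test word (illustrative)
N = len(word_probs)

# PP(W) = P(w1 ... wN)^(-1/N)
log_prob = sum(math.log(p) for p in word_probs)
perplexity = math.exp(-log_prob / N)
print(perplexity)   # ≈ 10.5 -- lower perplexity means a better model
```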

30
Q

Higher conditional probability = _______

A

lower perplexity.

31
Q

Recognizing digits ( 0-9 ) with equal probability results in perplexity of ____

A

10, since each digit has probability 1/10 and PP(W) = ((1/10)^N)^(-1/N) = 10.

32
Q

Improvement in perplexity does not guarantee enhancement in extrinsic tasks.

A

t

33
Q

Statistical models, particularly N-grams, only work well for word prediction if _______

A

the test corpus looks like the training corpus

  • Example: a model trained on Shakespeare performs poorly on the Wall Street Journal (WSJ), and vice versa.
34
Q

______ Happens when the models are too closely tailored to the training corpus, making them less adaptable to diverse datasets.

A

Overfitting

35
Q

______ is the solution to Overfitting?

A

Generalization

36
Q

How do we achieve generalization?

A
  1. Choose a training corpus that aligns with the genre of the task.
  2. Consider the appropriate dialect or variety, especially in processing social media or spoken language.
  3. Matching genres and dialects is not enough; models can still face sparsity challenges.
    • Modify the Maximum Likelihood Estimation (MLE) method.
    • Assign some non-zero probability to any N-gram, even if it was not
      observed in training.
37
Q

______ refers to modifications addressing poor estimates in small data sets.

A

Zeros – Smoothing

38
Q

Explain Zeros – Smoothing.

A
  • makes the probability distribution less jagged by shaving a little bit of probability mass from the higher counts, and piling it instead on the zero counts
39
Q

Explain Laplace smoothing.

A

Adds one to all counts in the bigram matrix.

P(w|h) = (C(h, w) + 1) / (C(h) + V)

  • V = vocabulary size (see the sketch below)
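A minimal sketch of add-one smoothing using the formula above; the toy counts and the vocabulary size are illustrative assumptions.

```python
# Minimal sketch of add-one (Laplace) smoothing for bigram probabilities.
from collections import Counter

bigram_counts = Counter({("its", "water"): 2, ("water", "is"): 2, ("is", "so"): 1})
unigram_counts = Counter({"its": 2, "water": 2, "is": 2, "so": 1})
V = 5   # vocabulary size, e.g. {its, water, is, so, transparent}

def p_laplace(w, h):
    # P(w | h) = (C(h, w) + 1) / (C(h) + V)
    return (bigram_counts[(h, w)] + 1) / (unigram_counts[h] + V)

print(p_laplace("water", "its"))        # (2 + 1) / (2 + 5) ≈ 0.43
print(p_laplace("transparent", "its"))  # unseen bigram still gets (0 + 1) / 7 ≈ 0.14
```
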
40
Q

Add-1 estimation is considered a blunt instrument.

A

t

41
Q

Add-1 smoothing is employed in ____ and _______ .

A
  • text classification
  • domains with fewer zero counts
42
Q

Explain backoff.

A

Uses the trigram if the evidence is strong; otherwise it backs off to the bigram, and finally falls back to the unigram (see the sketch below).

  • Sometimes it helps to use less context.
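A minimal sketch of the backoff idea with toy count tables (illustrative assumptions); it omits the discounting that a proper backoff model such as Katz backoff would apply.

```python
# Minimal sketch of backoff: use the trigram estimate when its history was seen,
# otherwise back off to the bigram, then fall back to the unigram.
def p_backoff(w, w1, w2, trigrams, bigrams, unigrams, total):
    if (w1, w2) in trigrams:                                        # strong trigram evidence
        return trigrams[(w1, w2)].get(w, 0) / sum(trigrams[(w1, w2)].values())
    if w2 in bigrams:                                               # back off to the bigram
        return bigrams[w2].get(w, 0) / sum(bigrams[w2].values())
    return unigrams.get(w, 0) / total                               # fall back to the unigram

trigrams = {("its", "water"): {"is": 2}}
bigrams = {"water": {"is": 2}, "is": {"so": 1}}
unigrams = {"its": 2, "water": 2, "is": 2, "so": 1}
total = sum(unigrams.values())

print(p_backoff("is", "its", "water", trigrams, bigrams, unigrams, total))  # trigram: 1.0
print(p_backoff("so", "turn", "is", trigrams, bigrams, unigrams, total))    # bigram:  1.0
print(p_backoff("its", "x", "y", trigrams, bigrams, unigrams, total))       # unigram: 2/7 ≈ 0.29
```
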
43
Q

______ mixes unigram, bigram, and trigram.

A

Interpolation: it adds together the unigram, bigram, and trigram probabilities, each weighted by a lambda (see the sketch below).
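A minimal sketch of simple linear interpolation; the lambda weights and the component probabilities are illustrative assumptions, and in practice the lambdas are tuned on a held-out set.

```python
# Minimal sketch of simple linear interpolation of unigram, bigram, and trigram estimates.
# The lambdas must sum to 1; the values here are illustrative.
def p_interpolated(p_uni, p_bi, p_tri, lambdas=(0.1, 0.3, 0.6)):
    l1, l2, l3 = lambdas
    return l1 * p_uni + l2 * p_bi + l3 * p_tri

# Hypothetical unigram / bigram / trigram estimates for the same next word
print(p_interpolated(p_uni=0.01, p_bi=0.20, p_tri=0.50))   # 0.1*0.01 + 0.3*0.2 + 0.6*0.5 = 0.361
```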

44
Q

Out of backoff and interpolation, ______ is the preferred approach.

A

interpolation

45
Q

What are the two types of interpolation?

A
  • linear interpolation
  • conditional interpolation
46
Q

Lambdas are conditioned on the previous two words in _______ interpolation.

A

conditional

47
Q

_______ are common in open-vocabulary tasks.

A

Unknown, or out-of-vocabulary (OOV), words.

48
Q

How do we handle unknown words?

A

○ Introduce an unknown word token <UNK>.
○ Create a fixed lexicon L of size V.
○ During text normalization, replace any training word not in L with <UNK>.
○ Train <UNK> probabilities like a regular word.
○ At decoding time, use <UNK> probabilities for any word not in training (see the sketch below).
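A minimal sketch of the <UNK> replacement step during text normalization; the fixed lexicon and the example sentence are illustrative assumptions.

```python
# Minimal sketch of <UNK> handling during text normalization.
lexicon = {"its", "water", "is", "so", "transparent"}   # fixed lexicon L of size V (illustrative)

def normalize(tokens, lexicon):
    # Replace any token not in the lexicon with the <UNK> token
    return [tok if tok in lexicon else "<UNK>" for tok in tokens]

train = ["its", "water", "is", "so", "transparent", "that"]
print(normalize(train, lexicon))   # ['its', 'water', 'is', 'so', 'transparent', '<UNK>']
# <UNK> is then counted and trained like any other word, and its probability is
# reused at decoding time for any word that never appeared in training.
```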