Probability and Language Modelling Flashcards
How can you generally obtain better language models? (3)
More training data: This is limited by resources/money/availability
Better models: This is limited by either one’s own knowledge or one’s resources (hardware for example)
Better estimation method: This is limited by either one’s own knowledge or one’s resources (hardware for example)
What is a language model?
A model that assigns a probability to the occurrence of sentences.
How many parameters does a mixture of two Gaussians have?
5: 2 for the means, 2 for the variances, and 1 for the mixing weight (the probability that a value is generated by one of the two Gaussians).
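A minimal Python sketch of the two-component mixture density (the names mu1, sigma1, mu2, sigma2, lam are illustrative); the five parameters are visible in the signature:

```python
import math

def mixture_density(x, mu1, sigma1, mu2, sigma2, lam):
    """Two-component Gaussian mixture: 2 means, 2 spread parameters, 1 mixing weight."""
    def gaussian(v, mu, sigma):
        return math.exp(-(v - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))
    # lam is the probability that a value comes from the first Gaussian
    return lam * gaussian(x, mu1, sigma1) + (1 - lam) * gaussian(x, mu2, sigma2)
```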
What is a model’s structure (or form)? What is a model’s parameters?
Structure: the form of the equations, usually determined by knowledge about the problem. Parameters: specific values set to variables in the model’s formulas that affect its behaviour/shape.
Relative Frequency Estimation formula
P(x) = count(x)/N where N is the total number of items in the dataset.
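A short Python sketch of relative frequency estimation over a toy corpus (the example data is made up):

```python
from collections import Counter

def relative_frequencies(items):
    """P(x) = count(x) / N, where N is the total number of items."""
    counts = Counter(items)
    n = sum(counts.values())
    return {item: c / n for item, c in counts.items()}

print(relative_frequencies("the cat sat on the mat".split()))
# {'the': 0.333..., 'cat': 0.166..., 'sat': 0.166..., 'on': 0.166..., 'mat': 0.166...}
```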
What is the likelihood?
The probability of observing a given set of data x under a distribution with parameters θ: L(θ|x) = P(x|θ).
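For independent observations the likelihood factorises into a product of per-item probabilities. A toy Python sketch for a Bernoulli parameter θ (the coin-flip example is an illustrative assumption, not from the cards):

```python
def bernoulli_likelihood(theta, data):
    """L(theta | x) = product over observations x_i of P(x_i | theta)."""
    likelihood = 1.0
    for x in data:
        likelihood *= theta if x == 1 else (1 - theta)
    return likelihood

# Three heads and one tail: maximised at theta = 0.75 (the relative frequency)
print(bernoulli_likelihood(0.75, [1, 1, 1, 0]))  # ~0.105
```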
Why is trying to use maximum-likelihood estimation to produce probabilities for each item in a large dataset difficult?
Each probability will be close to 0 since the normalisation term will be so large. This creates a sparse data problem.
What is a “bag-of-words” model?
A unigram model.
What assumption is necessary for the validity of an n-gram model?
The Markov assumption: “the next word depends only on the previous n-1 words”.
What is the noisy-channel framework?
A framework that assumes some input Y passes through a noisy/errorful encoding P(X|Y) to produce an output X. Given an observed output X, we want to recover the Y most likely to be the original input: argmax_Y P(Y|X).
How does the entropy of a non-uniform distribution over N outcomes compare to that of a uniform distribution?
Any non-uniform distribution over N outcomes has lower entropy than the corresponding uniform distribution
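A quick numerical check in Python (the example distributions are made up):

```python
import math

def entropy(probs):
    """H(p) = -sum_i p_i * log2(p_i), skipping zero-probability outcomes."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0 bits: uniform over 4 outcomes (the maximum)
print(entropy([0.7, 0.1, 0.1, 0.1]))      # ~1.36 bits: non-uniform, so lower
```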
Disadvantages of Language Models (3)
1) No way to share “information” about similar words. e.g.: the probabilities for “She bought a car” and “She bought a bicycle” would have to be calculated from scratch despite their obvious similarities.
(Solved by class-based language models - model classes of words instead of the words themselves and calculate probabilities of classes instead of words)
2) No way to condition on context with intervening words. e.g. Dr. Gertrude Smith vs Dr. Jane Smith: you would have to estimate “Dr. Gertrude”, “Dr. Jane”, “Gertrude Smith” and “Jane Smith” separately, so the model never learns that “Dr.” predicts “Smith”.
(Skip-Grams are a solution to this)
3) Cannot handle long-distance dependencies, since we limit the context to n-1 words.
What’s a parameter?
A variable in the equation that specifies a distribution - it is often determined from the observed data during probability estimation.
Given that X is an output sequence and Y is an input sequence.
In language modelling,
1) how is the language model defined?
2) how is the noise model defined?
3) what do we want to obtain?
Since there is noise added to the input sequence, we know that P(X) ≠ P(Y); the noise model is defined as P(X|Y)
The language model is estimated as P(Y)
We want to know the most likely input sequence, i.e. the Y that maximises P(Y|X)
For a particular input y and output x, how do we find P(y|x)?
Using Bayes’ rule, P(y|x) = P(x|y)P(y)/P(x); since P(x) does not depend on y, we find the value of y that maximises P(x|y)P(y)
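A minimal sketch of the resulting decoding rule, assuming we already have a candidate set and the two probability functions (the spelling-correction numbers below are invented for illustration):

```python
def noisy_channel_decode(x, candidates, noise_model, language_model):
    """Return the y maximising P(x|y) * P(y); P(x) is constant over y, so it is dropped."""
    return max(candidates, key=lambda y: noise_model(x, y) * language_model(y))

# Toy spelling correction: noise_model ~ P(observed typo | intended word),
# language_model ~ P(intended word); all numbers are made up
noise = {("teh", "the"): 0.6, ("teh", "tea"): 0.1}
lm = {"the": 0.05, "tea": 0.001}
print(noisy_channel_decode("teh", ["the", "tea"],
                           lambda x, y: noise.get((x, y), 0.0),
                           lambda y: lm.get(y, 0.0)))  # "the"
```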
Does the noise-model P(x|y) depend on the task?
Does the language model P(y) depend on the task?
The noise model depends on the task given
The language model is task independent.
What is the noise-model trained on?
What is the language model trained on?
Corpus data for both
If we can train P(X|Y), why can’t we just train P(Y|X)? (2)
- We can often do so
- Training P(X|Y) and P(Y|X) requires input/output pairs, which are often limited.
e.g. a spellchecking system needs correctly spelt words paired with incorrectly spelt words,
a translation system needs translations of words or sentences;
both of these need annotations, and someone to do the annotating.
- Sometimes we have this data, sometimes we don’t!
Language models can be trained on unannotated data - raw text is easy to get, unlike annotated data
Give a con and pro of a uni-gram language model
- Gives the same probability to sentences that contain the same words regardless of the order.
- Good for finding the general “aboutness” of an entire document: since it produces the most likely words from that document, you can guess roughly what the document is about.
What tasks are bag-of-words (uni-gram) language models good for?
Tasks like lexical disambiguation, text-classification and information retrieval
What is the independence assumption in a tri-gram and bi-gram model?
tri-gram: P(w_i | w_{i-1}, w_{i-2}), the probability of a word w_i depends only on the previous two words
bi-gram: P(w_i | w_{i-1}), the probability of a word w_i depends only on the previous word
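A small Python sketch of the maximum-likelihood bigram estimate implied by this assumption (toy corpus, no smoothing):

```python
from collections import Counter

def bigram_probability(tokens, w_prev, w):
    """MLE estimate: P(w | w_prev) = count(w_prev, w) / count(w_prev)."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    unigrams = Counter(tokens)
    return bigrams[(w_prev, w)] / unigrams[w_prev]

tokens = "the cat sat on the mat".split()
print(bigram_probability(tokens, "the", "cat"))  # 0.5: "the" is followed by "cat" once out of 2
```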
How do we fix the problem of indicating whether a word makes sense at the beginning or end of a sentence?
Use start-of-sentence and end-of-sentence tags, e.g. <s> and </s>.
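A one-function sketch of the padding step, assuming the conventional <s> and </s> tags:

```python
def pad_sentence(tokens, n):
    """Add n-1 start-of-sentence tags and an end-of-sentence tag for an n-gram model."""
    return ["<s>"] * (n - 1) + tokens + ["</s>"]

print(pad_sentence("the cat sat".split(), 2))
# ['<s>', 'the', 'cat', 'sat', '</s>'] -- "the" can now be conditioned on <s>
```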