Probability and Language Modelling Flashcards

1
Q

How can you generally obtain better language models? (3)

A

More training data: This is limited by resources/money/availability

Better models: This is limited by either one’s own knowledge or one’s resources (hardware for example)

Better estimation method: This is limited by either one’s own knowledge or one’s resources (hardware for example)

2
Q

What is a language model?

A

A model that assigns a probability to the occurrence of sentences.

3
Q

How many parameters does a mixture of two Gaussians have?

A

5: 2 for the means, 2 for the variances, and 1 for the mixing weight, i.e. the probability of assigning a value to one of the groups (one of the Gaussians)
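A minimal numerical sketch of those five parameters; the values below are made up purely for illustration:

```python
import math

# The five parameters of a two-component Gaussian mixture
# (illustrative values only):
mu1, mu2 = 0.0, 5.0        # two means
var1, var2 = 1.0, 2.0      # two variances
lam = 0.3                  # mixing weight: P(component 1); P(component 2) = 1 - lam

def gaussian_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def mixture_pdf(x):
    # p(x) = lam * N(x; mu1, var1) + (1 - lam) * N(x; mu2, var2)
    return lam * gaussian_pdf(x, mu1, var1) + (1 - lam) * gaussian_pdf(x, mu2, var2)

print(mixture_pdf(1.0))
```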

4
Q

What is a model’s structure (or form)? What are a model’s parameters?

A

Structure: the form of the equations, usually determined by knowledge about the problem. Parameters: specific values set to variables in the model’s formulas that affect its behaviour/shape

5
Q

Relative Frequency Estimation formula

A

P(x) = count(x)/N where N is the total number of items in the dataset.
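A quick illustration of relative frequency estimation on a toy corpus (the example text is made up):

```python
from collections import Counter

# Toy corpus; any tokenised text would do.
tokens = "the cat sat on the mat the end".split()

counts = Counter(tokens)
N = len(tokens)  # total number of items in the dataset

# P(x) = count(x) / N
p = {word: c / N for word, c in counts.items()}
print(p["the"])  # 3/8 = 0.375
```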

6
Q

What is the likelihood?

A

The probability of observing a given set of data x from a distribution with parameters θ: L(θ|x) = P(x|θ)
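A small sketch of computing a likelihood under an assumed unigram model; the parameter values θ below are illustrative, not estimated from real data:

```python
import math

# Hypothetical unigram parameters theta: a probability for each word.
theta = {"the": 0.4, "cat": 0.3, "sat": 0.3}

# Observed data x
x = ["the", "cat", "sat", "the"]

# L(theta | x) = P(x | theta) = product of per-word probabilities
# (assuming each word is drawn independently, i.e. a unigram model)
likelihood = math.prod(theta[w] for w in x)
log_likelihood = sum(math.log(theta[w]) for w in x)
print(likelihood, log_likelihood)
```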

7
Q

Why is trying to use maximum-likelihood estimation to produce probabilities for each item in a large dataset difficult?

A

Each estimated probability will be close to 0, since most items occur only a few times while the normalisation term is very large, and items never seen in the data get probability 0. This creates a sparse data problem.

8
Q

What is a “bag-of-words” model?

A

A unigram model.

9
Q

What assumption is necessary for the validity of an n-gram model?

A

The Markov assumption of order n-1: “the next word depends only on the previous n-1 words”

10
Q

What is the noisy-channel framework?

A

A framework that assumes some input Y passes through a noisy/errorful encoding P(X|Y), producing an output X. So, given some output X, we want to find the Y that most likely produced it as the original input: argmax_Y P(Y|X)
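A toy sketch of noisy-channel decoding for spelling correction; the candidate words and all probabilities below are made-up values, not a real noise model or language model:

```python
# Hypothetical spelling-correction example: X is the observed (possibly
# misspelled) word, Y ranges over candidate intended words.
observed_x = "teh"

candidates = ["the", "ten", "teh"]

# Assumed toy models; in practice both would be estimated from corpora.
language_model = {"the": 0.05, "ten": 0.001, "teh": 0.00001}   # P(Y)
noise_model = {                                                 # P(X="teh" | Y)
    ("teh", "the"): 0.1,
    ("teh", "ten"): 0.01,
    ("teh", "teh"): 0.5,
}

# argmax_Y P(Y|X) = argmax_Y P(X|Y) P(Y)   (P(X) is constant over Y)
best_y = max(candidates,
             key=lambda y: noise_model[(observed_x, y)] * language_model[y])
print(best_y)  # "the"
```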

11
Q

How does the entropy of a non-uniform distribution over N outcomes compare to that of a uniform distribution?

A

Any non-uniform distribution over N outcomes has lower entropy than the corresponding uniform distribution
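A quick check of this claim on a toy distribution over N = 4 outcomes (the skewed probabilities are arbitrary):

```python
import math

def entropy(dist):
    # H(p) = -sum p(x) log2 p(x)
    return -sum(p * math.log2(p) for p in dist if p > 0)

N = 4
uniform = [1 / N] * N              # H = log2(4) = 2 bits
skewed = [0.7, 0.1, 0.1, 0.1]      # any non-uniform distribution over N outcomes

print(entropy(uniform))  # 2.0
print(entropy(skewed))   # about 1.36, lower than the uniform case
```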

12
Q

Disadvantages of Language Models (3)

A

1) No way to share “information” about similar words. e.g. the probabilities for “She bought a car” and “She bought a bicycle” would have to be calculated from scratch despite their obvious similarities.

(Solved by class-based language models - model classes of words instead of the words themselves and calculate probabilities of classes instead of words)

2) No way to condition on context with intervening words. e.g. Dr. Gertrude Smith vs Dr. Jane Smith: the bigrams “Dr. Gertrude”, “Dr. Jane”, “Gertrude Smith” and “Jane Smith” are all estimated separately, with no way to capture that “Dr. … Smith” is likely whatever the intervening name.

(Skip-grams are a solution to this - see the sketch after this list)

3) Cannot handle long-distance dependencies, since we limit the context to n-1 words.
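A rough sketch of how a skip-gram style count addresses problem 2, using the “Dr. … Smith” example above (toy corpus, illustrative only):

```python
from collections import Counter

# Count pairs (w_{i-2}, w_i), skipping the intervening word.
corpus = [["Dr.", "Gertrude", "Smith"], ["Dr.", "Jane", "Smith"]]

skip_bigrams = Counter()
for sent in corpus:
    for i in range(2, len(sent)):
        skip_bigrams[(sent[i - 2], sent[i])] += 1

# Both sentences now contribute to the same event ("Dr.", "Smith"),
# which an ordinary bigram/trigram model keeps separate.
print(skip_bigrams[("Dr.", "Smith")])  # 2
```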

13
Q

What’s a parameter?

A

A variable in the equation that specifies a distribution - it is often determined by the observed data in probability estimation.

14
Q

Given that X is an output sequence and Y is an input sequence.

In language modelling,

1) how is the language model defined?
2) how is the noise model defined?

3) what do we want to obtain?

A

Since noise is added to the input sequence, P(X) ≠ P(Y); the noise model is defined as P(X|Y)

The language model is estimated as P(Y)

We want to recover the most likely input sequence, so we want the Y that maximises P(Y|X)

15
Q

For a particular input y and output x, how do we find P(y|x)?

A

Using Bayes’ rule, P(y|x) = P(x|y)P(y)/P(x); since P(x) does not depend on y, we find the value of y that maximises P(x|y)P(y)

16
Q

Does the noise-model P(x|y) depend on the task?

Does the language model P(y) depend on the task?

A

The noise model depends on the given task

The language model is task independent.

17
Q

What is the noise model trained on?
What is the language model trained on?

A

Corpus data for both

18
Q

If we can train P(X|Y), why can’t we just train P(Y|X)? (2)

A
  1. We can often do so
  2. Training P(X|Y) and P(Y|X) requires input/output pairs which are often limited

e.g. a spellchecking system needs correctly spelled words paired with incorrectly spelled ones,

a translation system needs translations of words or sentences;

both of these need annotations, and someone has to produce them.

  • Sometimes we have this data, sometimes we don’t!

Language models can be trained on unannotated data - raw text is easy to get unlike annotated data

19
Q

Give a con and pro of a uni-gram language model

A
  1. Gives the same probability to sentences that contain the same words, regardless of their order.
  2. Good for capturing the general “aboutness” of an entire document: since it models the most likely words of that document, you can guess roughly what the document is about.
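A tiny illustration of point 1, using made-up unigram probabilities:

```python
# Hypothetical unigram probabilities (illustrative values only).
p = {"she": 0.1, "bought": 0.05, "a": 0.2, "car": 0.01}

def unigram_prob(sentence):
    prob = 1.0
    for w in sentence.split():
        prob *= p[w]
    return prob

# Same words, different order -> identical unigram probability.
print(unigram_prob("she bought a car"))
print(unigram_prob("car a bought she"))
```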
20
Q

What tasks are bag-of-words (uni-gram) language models good for?

A

Tasks like lexical disambiguation, text-classification and information retrieval

21
Q

What is the independence assumption in a tri-gram and bi-gram model?

A

tri-gram: P(w_i | w_{i-1}, w_{i-2}), the probability of a word w_i is only dependent on the previous two words

bi-gram: P(w_i | w_{i-1}), the probability of a word w_i is only dependent on the previous word
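A minimal sketch of estimating bigram probabilities by relative frequency from a toy corpus (the corpus and the boundary symbols are illustrative):

```python
from collections import Counter

corpus = [["<s>", "the", "cat", "sat", "</s>"],
          ["<s>", "the", "dog", "sat", "</s>"]]

bigram_counts = Counter()
context_counts = Counter()
for sent in corpus:
    for prev, cur in zip(sent, sent[1:]):
        bigram_counts[(prev, cur)] += 1
        context_counts[prev] += 1

# P(w_i | w_{i-1}) = count(w_{i-1}, w_i) / count(w_{i-1})
def bigram_prob(prev, cur):
    return bigram_counts[(prev, cur)] / context_counts[prev]

print(bigram_prob("the", "cat"))  # 0.5
```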

22
Q

How do we fix the problem of indicating whether a word makes sense at the beginning or end of a sentence?

A

Use start-of-sentence and end-of-sentence tags - <s> and </s>
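A small sketch of padding a sentence with these boundary symbols before n-gram counting (assuming the conventional <s> / </s> markers):

```python
def pad(sentence, n=2):
    # Add n-1 start-of-sentence markers and one end-of-sentence marker so the
    # first word can be conditioned on a boundary and the last word can predict one.
    return ["<s>"] * (n - 1) + sentence + ["</s>"]

print(pad(["the", "cat", "sat"]))  # ['<s>', 'the', 'cat', 'sat', '</s>']
```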