ANN Lecture 9 - Neural Language Models Flashcards

1
Q

Statistical Language Model

A
  • A statistical language model defines a probability distribution over sequences of words.
  • It is always based on a text corpus (e.g. all of Wikipedia)
2
Q

Why do we need Statistical Language Models?

A

This can be helpful for speech recognition, for example: the audio signal is always noisy, so a language model lets us prefer the most probable word sequence among acoustically similar candidates.

3
Q

Applications for Statistical Language Model

A
  • Speech recognition
  • Machine translation
  • Handwritten text recognition
4
Q

How can we express the probability of a sequence of words?

A
  • Sentences are treated as sequences of words; the probability of a sequence can then be expressed as a product of conditional probabilities (chain rule, written out below).
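Written out, this is the chain rule of probability over a sentence of T words:

```latex
P(w_1, w_2, \dots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, \dots, w_{t-1})
```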
5
Q

How can we measure the conditional probability of a word given a sequence of previous words?

A

We divide the number of times the previous sequence is followed by the word by the number of times the previous sequence occurs in the text corpus.
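As a formula, this count-based (maximum-likelihood) estimate is:

```latex
P(w_n \mid w_1, \dots, w_{n-1}) = \frac{\text{count}(w_1, \dots, w_{n-1}, w_n)}{\text{count}(w_1, \dots, w_{n-1})}
```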

6
Q

What is a n-gram model?

A

In an n-gram model we make the (Markov) assumption that a word depends only on the (n-1) previous words.
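As a formula, the n-gram approximation of the conditional probability is:

```latex
P(w_t \mid w_1, \dots, w_{t-1}) \approx P(w_t \mid w_{t-n+1}, \dots, w_{t-1})
```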

7
Q

What is the problem of a classical statistical language model?

A
  • A classical statistical language model assigns a probability of 0 to sentences which are plausible but do not occur in the original corpus.
  • -> It has no means of generalizing to unseen sentences.
  • All words are treated as equally different from one another (similar to one-hot encoding)
8
Q

Word Embeddings

A
  • Embeds words into a high-dimensional vector space that captures the semantically and syntactically relevant information.
  • ANNs are used to embed the words into the right embedding space
9
Q

Word Embeddings Examples (Semantic & Syntactic)

A

Examples:
vector(Rome) - vector(Italy) + vector(France) = vector(Paris)

vector(walked) - vector(walking) + vector(climbing) = vector(climbed)
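A toy numpy sketch of how such analogies can be checked with cosine similarity. The vectors below are made up purely for illustration; a real setup would load trained embeddings (e.g. from word2vec):

```python
import numpy as np

# Made-up toy embeddings, for illustration only; real analogies need trained vectors.
emb = {
    "rome":   np.array([0.9, 0.1, 0.0]),
    "italy":  np.array([0.8, 0.3, 0.1]),
    "france": np.array([0.7, 0.3, 0.6]),
    "paris":  np.array([0.8, 0.1, 0.5]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# vector(Rome) - vector(Italy) + vector(France) should land closest to vector(Paris).
query = emb["rome"] - emb["italy"] + emb["france"]

# As is common in analogy evaluation, the three query words themselves are excluded.
candidates = [w for w in emb if w not in ("rome", "italy", "france")]
print(max(candidates, key=lambda w: cosine(query, emb[w])))   # -> paris
```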

10
Q

Unsupervised Learning

A
  • Just data, no labels
  • Goal is to find structure in the data; represent the data in a lower-dimensional feature space
  • Gradient descent can still be used; the target values are not labels but part of the data itself
11
Q

CBOW-Model (Continuous Bag of Words)

A

Hypothesis:
Similar words occur in similar contexts.
-> Predicts a word given its context.

12
Q

CBOW-Model - Representing Words

A

The input text is transformed into one-hot vectors whose length equals the number of distinct words in the text (the vocabulary size).
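A minimal sketch of this encoding, assuming a toy vocabulary built from the training text:

```python
import numpy as np

# Toy corpus; the vocabulary size N is the number of distinct words in the text.
text = "the cat sat on the mat".split()
vocab = sorted(set(text))                    # ['cat', 'mat', 'on', 'sat', 'the']
word_to_idx = {w: i for i, w in enumerate(vocab)}
N = len(vocab)

def one_hot(word):
    """Return a one-hot vector of length N for the given word."""
    vec = np.zeros(N)
    vec[word_to_idx[word]] = 1.0
    return vec

print(one_hot("sat"))                        # [0. 0. 0. 1. 0.]
```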

13
Q

CBOW-Model - Notation

A

Context/ Input-Layer:
one-hot vector [1xN]

Embedding-/ weight-matrix:
look-up matrix for the embeddings of the words [NxP]

Embedding-/ Hidden-Layer:
Layer containing information about the input [1xP]

Scoring-/ weight-matrix:
translation matrix to translate the embedding into a score vector [PxN]

Prediction-/ Output-Layer:
Softmax Activation of the score vector [1xN]

14
Q

CBOW-Model - Forward Step

A
  1. Calculate the embedding for each context word:
    Embedding_i = Embedding-Matrix*ContextWord_i
  2. Average the embeddings:
    AveragedEmbedding = SumOver_i(Embedding_i)/NumberOfContextWords
  3. Calculate the score vector:
    ScoreVector = ScoringMatrix*AveragedEmbedding
  4. Calculate the output (predictions):
    Output = softmax(ScoreVector)
  (See the numpy sketch of these steps below.)
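A minimal numpy sketch of these four steps, using the dimensions from the notation card (N = vocabulary size, P = embedding size). The weights are random here, so the output only illustrates the shapes, not a trained model:

```python
import numpy as np

N, P = 10, 4                                   # vocabulary size, embedding size
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(N, P))     # [N x P] embedding look-up matrix
scoring_matrix   = rng.normal(size=(P, N))     # [P x N] scoring matrix

def softmax(scores):
    e = np.exp(scores - scores.max())
    return e / e.sum()

def cbow_forward(context_one_hots):
    # 1. Embed each context word ([1xN] row vector times [NxP] matrix -> [1xP]).
    embeddings = [c @ embedding_matrix for c in context_one_hots]
    # 2. Average the context embeddings.
    avg_embedding = sum(embeddings) / len(embeddings)
    # 3. Score vector ([1xP] times [PxN] -> [1xN]).
    scores = avg_embedding @ scoring_matrix
    # 4. Softmax turns the scores into one probability per vocabulary word.
    return softmax(scores)

context = [np.eye(N)[1], np.eye(N)[3]]          # two one-hot context words
print(cbow_forward(context))                    # N probabilities summing to 1
```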
15
Q

CBOW-Model - Training

A
  1. Compare the prediction to the target (the one-hot vector of the word to predict):
    Output vs. Target
  2. Calculate the cross-entropy loss (written out below)
  3. Use gradient descent to minimize the loss.
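With a one-hot target, the cross-entropy loss reduces to the negative log-probability assigned to the correct word:

```latex
L = -\sum_{j=1}^{N} \text{Target}_j \, \log(\text{Output}_j) = -\log(\text{Output}_{\text{correct word}})
```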
16
Q

Skip-Gram Model

A
  • The Skip-Gram Model is a flipped version of the CBOW-Model and works better.
  • It predicts a context word given a single word.
17
Q

Skip-Gram Model - Forward step

A
  1. Calculate the embeddings for the words:
    Embeddings = Embedding-Matrix*Words
  2. Calculate the score vector:
    ScoreVector = ScoringMatrix*Embeddings
  3. Calculate the output (predictions):
    Output = softmax(ScoreVector)
18
Q

Skip-Gram Model - Training

A
  1. Compare the prediction to the target (the one-hot vector of a context word). Per input word there are several training samples with different context words, which can be processed independently (see the sketch below):
    Output vs. Target
  2. Calculate the cross-entropy loss
  3. Use gradient descent to minimize the loss.
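A small sketch of how such (input word, context word) training pairs can be generated with a sliding window; the window size of 2 is an arbitrary illustrative choice:

```python
# Generate skip-gram training pairs with a sliding context window.
text = "the quick brown fox jumps over the lazy dog".split()
window = 2   # illustrative choice; the lecture may use a different window size

pairs = []
for i, center in enumerate(text):
    for j in range(max(0, i - window), min(len(text), i + window + 1)):
        if j != i:
            pairs.append((center, text[j]))    # (input word, context word)

print(pairs[:4])  # [('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ('quick', 'brown')]
```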
19
Q

Problem of CBOW and Skip-Gram

A

In both models we use the softmax to calculate the output. This involves summing over the scores (logits) of all words. Since the vocabulary can contain hundreds of thousands of words, this makes these models extremely slow.
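The bottleneck is the normalization term of the softmax, which has to be computed over the whole vocabulary of size N for every single prediction:

```latex
\text{Output}_i = \text{softmax}(s)_i = \frac{e^{s_i}}{\sum_{j=1}^{N} e^{s_j}}
```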

20
Q

Negative Sampling - Idea

A
  • For every (input word, candidate context word) pair, predict whether it is a “correct” training sample, i.e. whether the candidate really occurs in the context of the input word (using the logistic function)
  • Maximize the predicted probability for the correct (positive) samples
  • Minimize it for a few randomly sampled wrong (negative) samples (objective sketched below)
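For reference, the negative-sampling objective as given in the word2vec papers (Mikolov et al., 2013) has roughly this form, where σ is the logistic function, v_{w_I} is the embedding of the input word, v'_{w_O} the output vector of the true context word, and the w_k are K randomly drawn negative samples; the exact notation in the lecture may differ:

```latex
\log \sigma\!\left( {v'_{w_O}}^{\top} v_{w_I} \right) + \sum_{k=1}^{K} \mathbb{E}_{w_k \sim P_n(w)} \left[ \log \sigma\!\left( -{v'_{w_k}}^{\top} v_{w_I} \right) \right]
```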