ANN Lecture 9 - Neural Language Models Flashcards
Statistical Language Model
- A statistical language model defines a probability distribution over sequences of words.
- It is always based on a text corpus (e.g. all of Wikipedia)
Why do we need Statistical Language Models?
They are helpful e.g. for speech recognition, where the audio signal is always noisy and the language model tells us which word sequences are plausible.
Applications for Statistical Language Model
- Speech recognition
- Machine translation
- Handwritten text recognition
How can we express the probability of a sequence of words?
- Sentences are treated as sequences of words. We can express the probability of a sequence as a product of conditional probabilities (chain rule), as in the formula below.
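In symbols (notation chosen here for illustration):

```latex
P(w_1, \dots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \dots, w_{i-1})
```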
How can we measure the conditional probability of a word given a sequence of previous words?
We divide the number of times the previous sequence is followed by the word by the number of times the previous sequence occurs in the text corpus; see the formula below.
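Written out with corpus counts (notation again chosen for illustration):

```latex
P(w_n \mid w_1, \dots, w_{n-1}) = \frac{\mathrm{count}(w_1, \dots, w_{n-1}, w_n)}{\mathrm{count}(w_1, \dots, w_{n-1})}
```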
What is a n-gram model?
In an n-gram model we make the assumption that a word depends only on the (n-1) previous words; a bigram example is sketched below.
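A minimal sketch of estimating bigram (n = 2) probabilities by counting relative frequencies; the toy corpus and names are illustrative, not from the lecture:

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ate".split()

# Count bigrams and the unigram prefixes they condition on.
bigrams = Counter(zip(corpus, corpus[1:]))
prefixes = Counter(corpus[:-1])

def bigram_prob(prev, word):
    """P(word | prev) estimated by relative frequency in the toy corpus."""
    if prefixes[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / prefixes[prev]

print(bigram_prob("the", "cat"))  # 2/3: "the" occurs 3 times as a prefix, twice followed by "cat"
```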
What is the problem of a classical statistical language model?
- A classical statistical language model would assign a probability of 0 to sentences which are plausible but don’t come up in the original corpus.
- -> It has no means to generalize to unseen sentences.
- All words are equally different from one another (as with one-hot encoding)
Word Embeddings
- Embed words into a high-dimensional vector space that captures the semantically and syntactically relevant information.
- Use ANNs to embed the words into the right embedding space
Word Embeddings Examples (Semantic & Syntactic)
Examples:
vector(Rome) - vector(Italy) + vector(France) = vector(Paris)
vector(walked) - vector(walking) + vector(climbing) = vector(climbed)
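A sketch of how such an analogy could be evaluated, assuming a trained embedding matrix `W_embed` plus `word_to_id` / `id_to_word` lookups (all placeholder names); the closest word to the analogy vector is found via cosine similarity:

```python
import numpy as np

def analogy(a, b, c, W_embed, word_to_id, id_to_word):
    """Return the word whose embedding is closest to vector(a) - vector(b) + vector(c)."""
    query = W_embed[word_to_id[a]] - W_embed[word_to_id[b]] + W_embed[word_to_id[c]]
    # Cosine similarity between the query vector and every word embedding.
    sims = (W_embed @ query) / (np.linalg.norm(W_embed, axis=1) * np.linalg.norm(query) + 1e-9)
    for w in (a, b, c):                      # exclude the input words themselves
        sims[word_to_id[w]] = -np.inf
    return id_to_word[int(np.argmax(sims))]

# analogy("Rome", "Italy", "France", W_embed, word_to_id, id_to_word)  # ideally "Paris"
```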
Unsupervised Learning
- Just data, no labels
- The goal is to find structure in the data; represent the data in a lower-dimensional feature space
- We can still use gradient descent; the target values are not labels but part of the data itself
CBOW-Model (Continuous Bag of Words)
Hypothesis:
Similar words occur in similar contexts.
-> Predicts a word given its context.
CBOW-Model - Representing Words
The text input is transformed into one-hot vectors whose length equals the number of different words in the text (the vocabulary size N); see the small example below.
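A tiny example of this encoding with an assumed vocabulary of N = 4 words (toy data, chosen for illustration):

```python
import numpy as np

vocab = ["the", "cat", "sat", "mat"]                               # toy vocabulary, N = 4
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}  # one row of the identity per word
print(one_hot["cat"])                                              # [0. 1. 0. 0.]
```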
CBOW-Model - Notation
Context/ Input-Layer:
one-hot vector [1xN]
Embedding-/ weight-matrix:
look up matrix for the embeddings of the words [NxP]
Embedding-/ Hidden-Layer:
Layer containing information about the input [1xP]
Scoring-/ weight-matrix:
matrix that translates the embedding into a score vector [PxN]
Prediction-/ Output-Layer:
Softmax Activation of the score vector [1xN]
CBOW-Model - Forward Step
- Calculate the embedding for each context word:
Embedding_i = Embedding-Matrix * ContextWord_i
- Average the embeddings:
AveragedEmbedding = Sum_i(Embedding_i) / Number of Embeddings
- Calculate the score vector:
ScoreVector = ScoringMatrix * AveragedEmbedding
- Calculate the output (predictions):
Output = softmax(ScoreVector)
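A minimal NumPy sketch of this forward step, with vocabulary size N and embedding size P as in the notation above; the matrix names follow the flashcards, all concrete values are toy choices:

```python
import numpy as np

N, P = 10, 4                          # vocabulary size and embedding size (toy values)
rng = np.random.default_rng(0)
W_embed = rng.normal(size=(N, P))     # Embedding-/weight-matrix [N x P]
W_score = rng.normal(size=(P, N))     # Scoring-/weight-matrix  [P x N]

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def cbow_forward(context_ids):
    """Forward pass for one training example, given the indices of the context words."""
    # Multiplying a one-hot vector with W_embed is just a row lookup, so we index directly.
    embeddings = W_embed[context_ids]      # [num_context x P]
    averaged = embeddings.mean(axis=0)     # [1 x P] averaged embedding
    scores = averaged @ W_score            # [1 x N] score vector
    return softmax(scores)                 # [1 x N] prediction

probs = cbow_forward([2, 5, 7, 1])         # toy context word indices
print(probs.shape, round(probs.sum(), 3))  # (10,) 1.0
```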
CBOW-Model - Training
- Compare prediction to the target:
Output == Target
- Calculate the cross-entropy loss
- Use gradient descent to minimize the loss.
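Continuing the CBOW sketch above, the cross-entropy loss for one training example could look like this; with a one-hot target it reduces to the negative log-probability of the target word (`target_id` is an illustrative name):

```python
def cbow_loss(context_ids, target_id):
    """Cross-entropy between the prediction and the one-hot target word."""
    probs = cbow_forward(context_ids)
    return -np.log(probs[target_id])   # one-hot target: only the target entry contributes

print(cbow_loss([2, 5, 7, 1], 3))      # minimize this with gradient descent
```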
Skip-Gram Model
- The Skip-Gram Model is a flipped version of the CBOW-Model and works better.
- It predicts a context word given a single word.
Skip-Gram Model - Forward step
- Calculate the embedding for the input word:
Embedding = Embedding-Matrix * Word
- Calculate the score vector:
ScoreVector = ScoringMatrix * Embedding
- Calculate the output (predictions):
Output = softmax(ScoreVector)
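With the same toy matrices as in the CBOW sketch, the skip-gram forward step for a single input word could be sketched as:

```python
def skipgram_forward(word_id):
    """Forward pass for one input word: predicts a distribution over possible context words."""
    embedding = W_embed[word_id]         # [1 x P] embedding of the input word
    scores = embedding @ W_score         # [1 x N] score vector
    return softmax(scores)

print(skipgram_forward(4).argmax())      # index of the most probable context word
```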
Skip-Gram Model - Training
- Compare prediction to the target (one-hot vector of a context word):
Output == Target
(Per input word there are several training samples with different context words, which can be processed independently.)
- Calculate the cross-entropy loss
- Use gradient descent to minimize the loss.
Problem of CBOW and Skip-Gram
In both models we use the softmax to calculate the output. This involves summing over the scores (logits) of all words. Since the vocabulary can contain hundreds of thousands of words, this makes these models extremely slow.
Negative Sampling - Idea
- Predict for each (input, label) pair whether it is a “correct” training sample, i.e. whether the label word occurs in the context of the input word (using the logistic function); see the sketch after this list.
- Maximize the predicted probability for the correct samples
- Minimize the predicted probability for some sampled wrong samples
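A minimal sketch of the negative-sampling loss for one (word, context) pair with k sampled negative words; the matrices are assumed to have the same shapes as above ([N x P] embeddings, [P x N] scores), and all names are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(word_id, context_id, negative_ids, W_embed, W_score):
    """Push the true (word, context) pair towards 1 and the sampled wrong pairs towards 0."""
    v = W_embed[word_id]                             # [P] embedding of the input word
    pos = sigmoid(v @ W_score[:, context_id])        # probability for the correct pair
    neg = sigmoid(-(v @ W_score[:, negative_ids]))   # probabilities for k sampled wrong pairs
    return -np.log(pos) - np.sum(np.log(neg))        # only k + 1 scores instead of all N

# Example call with the toy matrices from the CBOW sketch above:
# negative_sampling_loss(4, 2, [7, 1, 9], W_embed, W_score)
```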