ANN Lecture 9 - Neural Language Models Flashcards
Statistical Language Model
- A statistical language model defines a probability distribution over sequences of words.
- It is always based on a text corpus (e.g. all of Wikipedia)
Why do we need Statistical Language Models?
They are helpful e.g. for speech recognition, where the audio signal is always noisy and the language model tells us which word sequences are plausible.
Applications for Statistical Language Model
- Speech recognition
- Machine translation
- Handwritten text recognition
How can we express the probability of a sequence of words?
- Sentences are treated as sequences of words. We can express the probability of a sequence as a product of conditional probabilities (chain rule), as in the formula below.
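In symbols (notation chosen here for illustration):

```latex
P(w_1, \dots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \dots, w_{i-1})
```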
How can we measure the conditional probability of a word given a sequence of previous words?
We divide the number of times the previous sequence is followed by the word by the number of times the previous sequence occurs in the text corpus; see the formula below.
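Written out with corpus counts (notation again chosen for illustration):

```latex
P(w_n \mid w_1, \dots, w_{n-1}) = \frac{\mathrm{count}(w_1, \dots, w_{n-1}, w_n)}{\mathrm{count}(w_1, \dots, w_{n-1})}
```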
What is a n-gram model?
In an n-gram model we make the assumption that a word depends only on the (n-1) previous words; a bigram example is sketched below.
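A minimal sketch of estimating bigram (n = 2) probabilities by counting relative frequencies; the toy corpus and names are illustrative, not from the lecture:

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ate".split()

# Count bigrams and the unigram prefixes they condition on.
bigrams = Counter(zip(corpus, corpus[1:]))
prefixes = Counter(corpus[:-1])

def bigram_prob(prev, word):
    """P(word | prev) estimated by relative frequency in the toy corpus."""
    if prefixes[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / prefixes[prev]

print(bigram_prob("the", "cat"))  # 2/3: "the" occurs 3 times as a prefix, twice followed by "cat"
```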
What is the problem of a classical statistical language model?
- A classical statistical language model would assign a probability of 0 to sentences which are plausible but don’t come up in the original corpus.
- -> It has no means to generalize to unseen sentences.
- All words are equally different from one another (as with one-hot encoding)
Word Embeddings
- Embed words into a high-dimensional vector space that captures the semantically and syntactically relevant information.
- Use ANNs to embed the words into the right embedding space
Word Embeddings Examples (Semantic & Syntactic)
Examples:
vector(Rome) - vector(Italy) + vector(France) = vector(Paris)
vector(walked) - vector(walking) + vector(climbing) = vector(climbed)
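A sketch of how such an analogy could be evaluated, assuming a trained embedding matrix `W_embed` plus `word_to_id` / `id_to_word` lookups (all placeholder names); the closest word to the analogy vector is found via cosine similarity:

```python
import numpy as np

def analogy(a, b, c, W_embed, word_to_id, id_to_word):
    """Return the word whose embedding is closest to vector(a) - vector(b) + vector(c)."""
    query = W_embed[word_to_id[a]] - W_embed[word_to_id[b]] + W_embed[word_to_id[c]]
    # Cosine similarity between the query vector and every word embedding.
    sims = (W_embed @ query) / (np.linalg.norm(W_embed, axis=1) * np.linalg.norm(query) + 1e-9)
    for w in (a, b, c):                      # exclude the input words themselves
        sims[word_to_id[w]] = -np.inf
    return id_to_word[int(np.argmax(sims))]

# analogy("Rome", "Italy", "France", W_embed, word_to_id, id_to_word)  # ideally "Paris"
```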
Unsupervised Learning
- Just data, no labels
- The goal is to find structure in the data; represent the data in a lower-dimensional feature space
- We can still use gradient descent; the target values are not labels but part of the data itself
CBOW-Model (Continuous Bag of Words)
Hypothesis:
Similar words occur in similar contexts.
-> Predicts a word given its context.
CBOW-Model - Representing Words
The text input is transformed into one-hot vectors whose length equals the number of different words in the text (the vocabulary size N); see the small example below.
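A tiny example of this encoding with an assumed vocabulary of N = 4 words (toy data, chosen for illustration):

```python
import numpy as np

vocab = ["the", "cat", "sat", "mat"]                               # toy vocabulary, N = 4
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}  # one row of the identity per word
print(one_hot["cat"])                                              # [0. 1. 0. 0.]
```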
CBOW-Model - Notation
Context/ Input-Layer:
one-hot vector [1xN]
Embedding-/ weight-matrix:
look up matrix for the embeddings of the words [NxP]
Embedding-/ Hidden-Layer:
Layer containing information about the input [1xP]
Scoring-/ weight-matrix:
matrix that translates the embedding into a score vector [PxN]
Prediction-/ Output-Layer:
Softmax Activation of the score vector [1xN]
CBOW-Model - Forward Step
- Calculate the embedding for each context word:
Embedding_i = Embedding-Matrix * ContextWord_i
- Average the embeddings:
AveragedEmbedding = Sum_i(Embedding_i) / Number of Embeddings
- Calculate the score vector:
ScoreVector = ScoringMatrix * AveragedEmbedding
- Calculate the output (predictions):
Output = softmax(ScoreVector)
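A minimal NumPy sketch of this forward step, with vocabulary size N and embedding size P as in the notation above; the matrix names follow the flashcards, all concrete values are toy choices:

```python
import numpy as np

N, P = 10, 4                          # vocabulary size and embedding size (toy values)
rng = np.random.default_rng(0)
W_embed = rng.normal(size=(N, P))     # Embedding-/weight-matrix [N x P]
W_score = rng.normal(size=(P, N))     # Scoring-/weight-matrix  [P x N]

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def cbow_forward(context_ids):
    """Forward pass for one training example, given the indices of the context words."""
    # Multiplying a one-hot vector with W_embed is just a row lookup, so we index directly.
    embeddings = W_embed[context_ids]      # [num_context x P]
    averaged = embeddings.mean(axis=0)     # [1 x P] averaged embedding
    scores = averaged @ W_score            # [1 x N] score vector
    return softmax(scores)                 # [1 x N] prediction

probs = cbow_forward([2, 5, 7, 1])         # toy context word indices
print(probs.shape, round(probs.sum(), 3))  # (10,) 1.0
```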
CBOW-Model - Training
- Compare prediction to the target:
Output == Target
- Calculate the cross-entropy loss
- Use gradient descent to minimize the loss.
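Continuing the CBOW sketch above, the cross-entropy loss for one training example could look like this; with a one-hot target it reduces to the negative log-probability of the target word (`target_id` is an illustrative name):

```python
def cbow_loss(context_ids, target_id):
    """Cross-entropy between the prediction and the one-hot target word."""
    probs = cbow_forward(context_ids)
    return -np.log(probs[target_id])   # one-hot target: only the target entry contributes

print(cbow_loss([2, 5, 7, 1], 3))      # minimize this with gradient descent
```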
Skip-Gram Model
- The Skip-Gram Model is a flipped version of the CBOW-Model and works better.
- It predicts a context word given a single word.
Skip-Gram Model - Forward step
- Calculate the embedding for the input word:
Embedding = Embedding-Matrix * Word
- Calculate the score vector:
ScoreVector = ScoringMatrix * Embedding
- Calculate the output (predictions):
Output = softmax(ScoreVector)
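With the same toy matrices as in the CBOW sketch, the skip-gram forward step for a single input word could be sketched as:

```python
def skipgram_forward(word_id):
    """Forward pass for one input word: predicts a distribution over possible context words."""
    embedding = W_embed[word_id]         # [1 x P] embedding of the input word
    scores = embedding @ W_score         # [1 x N] score vector
    return softmax(scores)

print(skipgram_forward(4).argmax())      # index of the most probable context word
```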
Skip-Gram Model - Training
- Compare prediction to the target (one-hot vector of a context word):
Output == Target
(Per input word there are several training samples with different context words, which can be processed independently.)
- Calculate the cross-entropy loss
- Use gradient descent to minimize the loss.
Problem of CBOW and Skip-Gram
In both models we use the softmax to calculate the output. This involves summing over the scores (logits) of all words. Since the vocabulary can contain hundreds of thousands of words, this makes these models extremely slow.
Negative Sampling - Idea
- Predict for each (input, label) pair whether it is a “correct” training sample, i.e. whether the label word occurs in the context of the input word (using the logistic function); see the sketch after this list.
- Maximize the predicted probability for the correct samples
- Minimize the predicted probability for some sampled wrong samples
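A minimal sketch of the negative-sampling loss for one (word, context) pair with k sampled negative words; the matrices are assumed to have the same shapes as above ([N x P] embeddings, [P x N] scores), and all names are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(word_id, context_id, negative_ids, W_embed, W_score):
    """Push the true (word, context) pair towards 1 and the sampled wrong pairs towards 0."""
    v = W_embed[word_id]                             # [P] embedding of the input word
    pos = sigmoid(v @ W_score[:, context_id])        # probability for the correct pair
    neg = sigmoid(-(v @ W_score[:, negative_ids]))   # probabilities for k sampled wrong pairs
    return -np.log(pos) - np.sum(np.log(neg))        # only k + 1 scores instead of all N

# Example call with the toy matrices from the CBOW sketch above:
# negative_sampling_loss(4, 2, [7, 1, 9], W_embed, W_score)
```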