ANN Lecture 9 - Neural Language Models Flashcards
Statistical Language Model
- A statistical language model defines a probability distribution over sequences of words.
- It is always based on a text corpus (e.g. all of Wikipedia)
Why do we need Statistical Language Models?
This is helpful e.g. for speech recognition: an audio signal always contains noise, and the language model helps to pick the most plausible word sequence.
Applications for Statistical Language Model
- Speech recognition
- Machine translation
- Handwritten text recognition
How can we express the probability of a sequence of words?
- Sentences are treated as sequences of words. We can express the probability of a sequence as a product of conditional probabilities (chain rule), as written out below.
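Written out (standard chain rule; the notation w_1, ..., w_n is my own, not from the lecture):
P(w_1, w_2, ..., w_n) = P(w_1) * P(w_2 | w_1) * P(w_3 | w_1, w_2) * ... * P(w_n | w_1, ..., w_(n-1))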
How can we measure the conditional probability of a word given a sequence of previous words?
We divide the number of times the previous sequence is followed by the word by the number of times the previous sequence occurs in the text corpus.
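In count notation (my own notation), this estimate is:
P(w_n | w_1, ..., w_(n-1)) = count(w_1, ..., w_(n-1), w_n) / count(w_1, ..., w_(n-1))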
What is an n-gram model?
In an n-gram model, we make the assumption that a word depends only on the (n-1) previous words.
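Written out, the n-gram assumption is:
P(w_i | w_1, ..., w_(i-1)) ≈ P(w_i | w_(i-n+1), ..., w_(i-1))
e.g. for a bigram model (n = 2): P(w_i | w_1, ..., w_(i-1)) ≈ P(w_i | w_(i-1))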
What is the problem of a classical statistical language model?
- A classical statistical language model would assign a probability of 0 to sentences which are plausible but don’t come up in the original corpus.
- -> It has no means to generalize to unseen sentences.
- All words are equally different from one another (similar to a one-hot encoding)
Word Embeddings
- Embeds words into a high-dimensional vector space that captures the semantically and syntactically relevant information.
- ANNs are used to embed the words into the right embedding space
Word Embeddings Examples (Semantic & Syntactic)
Examples:
vector(Rome) - vector(Italy) + vector(France) = vector(Paris)
vector(walked) - vector(walking) + vector(climbing) = vector(climbed)
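A minimal sketch of how such an analogy lookup can be computed by nearest-neighbour search (the 2-dimensional vectors below are made-up toy values, not real embeddings; all names are my own):

    import numpy as np

    # Hypothetical toy embeddings, chosen only to illustrate the vector arithmetic.
    embeddings = {
        "Rome":   np.array([1.0, 0.2]),
        "Italy":  np.array([0.9, 1.0]),
        "France": np.array([0.1, 1.0]),
        "Paris":  np.array([0.2, 0.2]),
        "Berlin": np.array([0.3, 0.9]),
    }

    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    # vector(Rome) - vector(Italy) + vector(France) should land closest to vector(Paris).
    query = embeddings["Rome"] - embeddings["Italy"] + embeddings["France"]
    candidates = [w for w in embeddings if w not in ("Rome", "Italy", "France")]
    print(max(candidates, key=lambda w: cosine(query, embeddings[w])))   # Paris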
Unsupervised Learning
- Just data, no labels
- The goal is to find structure in the data; represent the data in a lower-dimensional feature space
- Gradient descent can still be used; the target values are not labels but part of the data itself
CBOW-Model (Continuous Bag of Words)
Hypothesis:
Similar words occur in similar contexts.
-> Predicts a word given its context.
CBOW-Model - Representing Words
The text input is transformed into one-hot vectors whose length N equals the number of different words in the text.
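A minimal sketch of this one-hot encoding (the example vocabulary and names are my own):

    import numpy as np

    vocab = ["the", "cat", "sat", "on", "mat"]        # example vocabulary, N = 5
    word_to_index = {w: i for i, w in enumerate(vocab)}

    def one_hot(word):
        v = np.zeros(len(vocab))                      # vector of length N
        v[word_to_index[word]] = 1.0
        return v

    print(one_hot("cat"))                             # [0. 1. 0. 0. 0.]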
CBOW-Model - Notation
Context/ Input-Layer:
one-hot vector [1xN]
Embedding-/ weight-matrix:
look-up matrix for the embeddings of the words [NxP]
Embedding-/ Hidden-Layer:
Layer containing information about the input [1xP]
Scoring-/ weight-matrix:
translation matrix that translates the embedding into a score vector [PxN]
Prediction-/ Output-Layer:
Softmax Activation of the score vector [1xN]
CBOW-Model - Forward Step
- Calculate the embedding for each context word:
Embedding_i = ContextWord_i * EmbeddingMatrix
- Average the embeddings:
AveragedEmbedding = SumOver_i(Embedding_i) / NumberOfContextWords
- Calculate the score vector:
ScoreVector = AveragedEmbedding * ScoringMatrix
- Calculate the output (predictions):
Output = softmax(ScoreVector)
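A minimal NumPy sketch of this forward step (randomly initialized matrices and my own variable names; the shapes follow the notation card above):

    import numpy as np

    N, P = 5000, 100                                   # vocabulary size, embedding size
    embedding_matrix = np.random.randn(N, P) * 0.01    # [NxP]
    scoring_matrix   = np.random.randn(P, N) * 0.01    # [PxN]

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    def forward(context_one_hots):                     # list of [1xN] one-hot vectors
        embeddings = [c @ embedding_matrix for c in context_one_hots]  # each [1xP]
        averaged = sum(embeddings) / len(embeddings)                   # [1xP]
        scores = averaged @ scoring_matrix                             # [1xN]
        return softmax(scores)                         # prediction [1xN]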
CBOW-Model - Training
- Compare the prediction (Output) to the target (the one-hot vector of the actual word).
- Calculate the cross-entropy loss between Output and Target.
- Use gradient descent to minimize the loss.
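A minimal sketch of one such training step (continues the NumPy sketch above; it uses the standard gradient of softmax + cross-entropy, output - target, and plain gradient descent with a hypothetical learning rate):

    def train_step(context_one_hots, target_one_hot, lr=0.05):
        global embedding_matrix, scoring_matrix
        # Forward step (same as above)
        embeddings = [c @ embedding_matrix for c in context_one_hots]
        averaged = sum(embeddings) / len(embeddings)
        scores = averaged @ scoring_matrix
        output = softmax(scores)

        # Cross-entropy loss between prediction and target
        loss = -np.log(output[target_one_hot.argmax()])

        # Backward step: gradient of softmax + cross-entropy w.r.t. the scores
        d_scores = output - target_one_hot                   # [1xN]
        d_scoring = np.outer(averaged, d_scores)              # [PxN]
        d_averaged = scoring_matrix @ d_scores                 # [1xP]

        # Gradient descent updates
        scoring_matrix -= lr * d_scoring
        for c in context_one_hots:
            # Each context word's embedding row gets its share of the averaged gradient.
            embedding_matrix[c.argmax()] -= lr * d_averaged / len(context_one_hots)
        return loss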