Contextualized Word Embeddings Flashcards
Static embeddings vs. contextualized embeddings
Word embeddings such as word2vec or GloVe learn a single vector for each type (unique word) in the vocabulary V. These are also called static embeddings.
By contrast, contextualized embeddings (also called dynamic embeddings) represent each token (word occurrence) with a different vector, depending on the context the token appears in; the models that produce them are often called pre-trained language models or large language models.
How can we learn to represent words along with the context they occur in?
We train a neural network using ideas from language modeling (the three objectives are contrasted in the toy sketch after this list):
- predict a word from left context (GPT-3)
- predict a word from left and right context independently (ELMo)
- predict a word from left and right context jointly (BERT)
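A toy sketch (plain Python, purely illustrative) of which context each objective conditions on when predicting the word at one position:

```python
# Toy illustration of the context available to each pretraining objective
# when predicting the word at position i (here: "sat"). Purely illustrative.
tokens = ["the", "cat", "sat", "on", "the", "mat"]
i = 2  # position of the word to predict

left_only = tokens[:i]                                      # causal LM (GPT-style)
left_and_right_separately = (tokens[:i], tokens[i + 1:])    # ELMo: two independent LMs
joint_with_mask = tokens[:i] + ["[MASK]"] + tokens[i + 1:]  # BERT: both sides at once

print(left_only)                  # ['the', 'cat']
print(left_and_right_separately)  # (['the', 'cat'], ['on', 'the', 'mat'])
print(joint_with_mask)            # ['the', 'cat', '[MASK]', 'on', 'the', 'mat']
```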
ELMo (Embeddings from Language Models)
ELMo looks at the entire sentence before assigning each word its embedding.
- Character-level tokens are processed by a CNN (convolutional neural network), producing word-level embeddings.
- These embeddings are processed by a left-to-right and a right-to-left 2-layer LSTM.
- For each word, the output embeddings at each layer (including the CNN layer) are combined as a weighted sum, producing contextual embeddings (see the sketch below).
think about the structure…
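A minimal PyTorch sketch of the layer combination described above: ELMo mixes the CNN layer and the two biLSTM layers with softmax-normalized scalar weights plus a scaling factor, learned per downstream task. The dimensions and random tensors below are placeholders, not real ELMo activations.

```python
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    """ELMo-style combination: a softmax-weighted sum of the per-layer
    outputs (CNN layer + 2 biLSTM layers), scaled by a learned gamma."""
    def __init__(self, num_layers: int = 3):
        super().__init__()
        self.weights = nn.Parameter(torch.zeros(num_layers))  # one scalar per layer
        self.gamma = nn.Parameter(torch.ones(1))

    def forward(self, layer_outputs):
        # layer_outputs: list of tensors, each of shape (seq_len, dim)
        w = torch.softmax(self.weights, dim=0)
        mixed = sum(wi * h for wi, h in zip(w, layer_outputs))
        return self.gamma * mixed

# Toy usage with random "layer outputs" standing in for real ELMo activations
layers = [torch.randn(7, 1024) for _ in range(3)]   # 7 tokens, 1024-dim each
embeddings = ScalarMix()(layers)                    # (7, 1024) contextual embeddings
```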
BERT (Bidirectional Encoder Representations from Transformers)
BERT, created by researchers at Google AI Language, produces word representations by jointly conditioning on both left and right context, thanks to the self-attention mechanism that ranges over the entire input.
The model is based on the encoder component of the well-known Transformer neural network.
- Input is segmented using subword tokenization and combined with positional embeddings
- Input is passed through a series of standard transformer blocks consisting of self-attention and feedforward layers, augmented with residual connections and layer normalization.
think about the structure…
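A minimal sketch of obtaining BERT contextual embeddings with the Hugging Face transformers library (assuming it is installed; the checkpoint name bert-base-uncased is just an example):

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Subword tokenization + special tokens; positional embeddings are added inside the model.
inputs = tokenizer("The bank raised interest rates.", return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# One contextual vector per input (sub)token: shape (1, num_tokens, 768)
contextual_embeddings = outputs.last_hidden_state
print(contextual_embeddings.shape)
```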
How is BERT pretrained?
Pretraining is performed by training on two unsupervised learning tasks simultaneously:
- Masked language modeling
- Next sentence prediction
The result of these two pre-training tasks is the set of encoder parameters, which are then used to produce contextual embeddings for novel sentences or sentence pairs.
BERT 1st learning goal: Masked language modeling
The model learns to perform a fill-in-the-blank task, technically called the cloze task.
A random sample of 15% of the input tokens is selected; each selected token is then (see the sketch after this list):
- replaced with the unique vocabulary token [MASK] (80%)
- replaced with another token randomly sampled with unigram probabilities (10%)
- left unchanged (10%)
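A minimal sketch of the masking rule above (plain Python; the toy vocabulary is an assumption, and uniform sampling of replacement tokens is a simplification of the real unigram-probability sampling):

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    """Sketch of BERT's masking: select ~15% of tokens; of those,
    80% -> [MASK], 10% -> random vocabulary token, 10% -> unchanged."""
    masked = list(tokens)
    targets = {}                       # position -> original token to predict
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            targets[i] = tok
            r = random.random()
            if r < 0.8:
                masked[i] = "[MASK]"
            elif r < 0.9:
                masked[i] = random.choice(vocab)   # uniform here; BERT uses unigram probabilities
            # else: leave the token unchanged
    return masked, targets

tokens = ["the", "cat", "sat", "on", "the", "mat"]
print(mask_tokens(tokens, vocab=["dog", "ran", "tree", "blue"]))
```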
BERT 2nd learning goal: Next sentence prediction (input and output)
NSP training is used for tasks involving the relationship between pairs of sentences, such as:
- paraphrase detection: detecting if two sentences have similar meanings
- sentence entailment: detecting if the meanings of two sentences entail or contradict each other
- discourse coherence: deciding if two neighboring sentences form a coherent discourse
INPUT:
In NSP, the model is presented with pairs of sentences with:
- token [CLS] prepended to the first sentence
- token [SEP] placed between the two sentences and after the rightmost token of the second sentence
- segment embeddings, marking whether a token belongs to the first or the second sentence, are added to the word and positional embeddings (remember the image)
OUTPUT:
- The output embedding associated with the [CLS] token is used for the next-sentence prediction: it is passed to a classifier that decides whether the second sentence actually follows the first (see the sketch below).
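A minimal sketch of the NSP input layout and of the classification head on top of the [CLS] output (PyTorch; the random tensor stands in for the encoder's real [CLS] vector):

```python
import torch
import torch.nn as nn

sent_a = ["the", "cat", "sat"]
sent_b = ["it", "purred"]

tokens   = ["[CLS]"] + sent_a + ["[SEP]"] + sent_b + ["[SEP]"]
segments = [0] * (len(sent_a) + 2) + [1] * (len(sent_b) + 1)   # sentence A / B segment ids
print(tokens)    # ['[CLS]', 'the', 'cat', 'sat', '[SEP]', 'it', 'purred', '[SEP]']
print(segments)  # [0, 0, 0, 0, 0, 1, 1, 1]

# The output embedding at the [CLS] position feeds a binary classifier:
hidden_size = 768
cls_output = torch.randn(1, hidden_size)   # placeholder for the encoder's [CLS] vector
nsp_head = nn.Linear(hidden_size, 2)       # "is next sentence" vs "is not"
logits = nsp_head(cls_output)
```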
GPT-n (Generative Pre-Training for language understanding)
GPT-n is used for learning contextualized word embeddings.
It is a left-to-right language model based on the Transformer's decoder (see the generation sketch below).
GPT-n can be used for:
- token prediction / generation (LM)
- sequence labelling
- single sentence classification
- sentence-pair classification
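A minimal sketch of left-to-right generation with a GPT-style model via the Hugging Face transformers library (assuming it is installed; gpt2 is just an example checkpoint):

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Left-to-right (causal) LM: each new token is predicted from the left context only.
inputs = tokenizer("Contextualized embeddings are", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=20, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```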
Adaptation
To make practical use of contextualized embeddings, we need to interface these models with downstream applications.
This process is called adaptation, and uses labeled data for the task of interest.
Two most common forms of adaptation are:
- feature extraction: freeze the pre-trained parameters of the language model and train only the parameters of the task-specific model
- fine-tuning: make (possibly minimal) adjustments to the pre-trained parameters (both forms are sketched below)
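A minimal sketch contrasting the two forms of adaptation (Hugging Face transformers assumed; the linear classifier and the three labels are hypothetical placeholders for a downstream task):

```python
import torch.nn as nn
from transformers import AutoModel

encoder = AutoModel.from_pretrained("bert-base-uncased")
classifier = nn.Linear(encoder.config.hidden_size, 3)   # e.g. a 3-way sentence classifier

# Feature extraction: freeze the pre-trained parameters, train only the task head.
for p in encoder.parameters():
    p.requires_grad = False
trainable = list(classifier.parameters())

# Fine-tuning: leave the encoder parameters trainable (often with a small learning rate)
# and update them together with the task head:
# trainable = list(encoder.parameters()) + list(classifier.parameters())
```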
Adapters
When working with huge pre-trained models, fine-tuning may still be inefficient.
Alternatively, one could fix the pre-trained model, and train only small, very simple components called adapters.
With adapter modules transfer becomes very efficient: the largest part of the pre-trained model is shared between all downstream tasks.
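A minimal sketch of a bottleneck adapter module of the kind described above (PyTorch; hidden and bottleneck sizes are illustrative):

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """A small bottleneck adapter (down-project, nonlinearity, up-project,
    residual connection), inserted inside each transformer block while the
    pre-trained weights stay frozen."""
    def __init__(self, hidden_size: int = 768, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_size)
        self.act = nn.GELU()

    def forward(self, h):
        return h + self.up(self.act(self.down(h)))   # residual around the bottleneck

# Toy usage: adapt the hidden states of one layer
h = torch.randn(2, 10, 768)   # (batch, seq_len, hidden)
adapted = Adapter()(h)        # same shape; only the adapter's parameters are trained
```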
Contextualized embeddings ethics
- Contextual language models can generate toxic language and misinformation, and can be used for radicalization and other socially harmful activities.
- Contextual language models can leak information about their training data: it is possible for an adversary to extract individuals' data from a language model (e.g., for use in phishing).
These remain unsolved research problems in NLP.