C7 Flashcards
Recurrent Neural Networks (RNN)
- connections between the hidden layers of subsequent ‘time steps’ (words in a text)
- internal state that is updated in every time step
- hidden layer weights determine how the network should make use of past context in calculating the output for the current input (trained via backpropagation)
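A minimal sketch of the recurrence, assuming PyTorch and toy dimensions (the layer sizes and the 7-word sequence are made up for illustration):
```python
import torch

# An RNN keeps an internal hidden state that is updated at every time step
# from the previous state and the current input embedding.
rnn = torch.nn.RNN(input_size=50, hidden_size=100, batch_first=True)

x = torch.randn(1, 7, 50)        # one sequence of 7 word embeddings
output, h_n = rnn(x)             # output: the hidden state at every time step
print(output.shape, h_n.shape)   # (1, 7, 100) and (1, 1, 100)
```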
LSTM
Long Short-Term Memory: more powerful (and more complex) RNNs that take longer contexts into account by removing information that is no longer needed from the context and adding information that is likely to be needed for later decision-making
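A minimal sketch, assuming PyTorch and toy dimensions; the cell state is the "context" that the gates update:
```python
import torch

lstm = torch.nn.LSTM(input_size=50, hidden_size=100, batch_first=True)
output, (h_n, c_n) = lstm(torch.randn(1, 7, 50))
# c_n is the cell state: the memory from which gates remove information that
# is no longer needed and to which they add information needed later.
print(output.shape, c_n.shape)   # (1, 7, 100) and (1, 1, 100)
```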
bi-LSTM
bidirectional neural model for NER:
- first, word and character embeddings are computed for input word w_i and the context words
- these are passed through a bidirectional LSTM, whose outputs are concatenated to produce a single output layer at position i
Simplest approach: pass the bi-LSTM output directly to a softmax layer to choose tag t_i
But for NER the softmax approach is insufficient: strong constraints on neighboring tags are needed (e.g., the tag I-PER must follow I-PER or B-PER) => use a CRF layer on top of the bi-LSTM output: biLSTM-CRF (see the sketch below)
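A minimal sketch of the bi-LSTM part, assuming PyTorch and made-up vocabulary/tag sizes; the CRF layer that enforces the tag constraints would sit on top of these per-token scores:
```python
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    def __init__(self, vocab_size=5000, emb_dim=100, hidden=128, n_tags=9):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, n_tags)   # both directions concatenated

    def forward(self, token_ids):
        h, _ = self.lstm(self.emb(token_ids))      # (batch, seq, 2*hidden)
        return self.out(h)                         # per-token tag scores

scores = BiLSTMTagger()(torch.randint(0, 5000, (1, 12)))
print(scores.shape)                                # (1, 12, 9)
```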
transformer models
- encoder-decoder architecture
- much more efficient than Bi-LSTMs and other RNNs because input is processed in parallel instead of sequentially
- can model longer-term dependencies because the complete input is processed at once
- but it uses a lot of memory because of quadratic complexity: O(n^2) for input length of n items
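A minimal sketch, assuming PyTorch's built-in encoder-decoder Transformer and toy dimensions; the point is that the whole input sequence is passed in at once rather than token by token:
```python
import torch

model = torch.nn.Transformer(d_model=64, nhead=4, batch_first=True)
src = torch.randn(1, 10, 64)   # encoder input: all 10 tokens at once
tgt = torch.randn(1, 6, 64)    # decoder input
out = model(src, tgt)
print(out.shape)               # (1, 6, 64)
```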
the attention mechanism
When processing each item in the input, the model has access to all of the input items
Self-attention: each input token is compared to all other input tokens
=> comparison via the dot product of the two vectors (the larger the value, the more similar the compared vectors)
- Self-attention represents how words contribute to the representation of longer inputs and how strongly words are related to each other => allows us to model longer-distance relations between words
Disadvantage: attention is quadratic in the length of the input (computing dot products between each pair of tokens in the input at each layer)
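A minimal sketch of (scaled) dot-product self-attention, assuming PyTorch and toy numbers; in a real transformer Q, K and V are separate learned projections of the input:
```python
import torch
import torch.nn.functional as F

n, d = 5, 8                          # 5 tokens, embedding dimension 8
Q = K = V = torch.randn(n, d)        # simplification: same vectors for Q, K, V
scores = Q @ K.T / d ** 0.5          # (n, n) pairwise dot products -> O(n^2)
weights = F.softmax(scores, dim=-1)  # how strongly each token attends to the others
context = weights @ V                # attention-weighted token representations
print(weights.shape, context.shape)  # (5, 5) and (5, 8)
```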
BERT
Pre-training of Deep Bidirectional Transformers for Language Understanding
- Pre-training: language modelling
- Bidirectional: predicting randomly masked words in context
- Transformers: efficient neural architectures with self-attention
- Language understanding: encoding, not decoding (not generation)
=> the encoder half of the transformer
Core idea of BERT: self-supervised pre-training based on language modelling
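A minimal sketch, assuming the Hugging Face transformers library and the public bert-base-uncased checkpoint: BERT is an encoder, so it maps input tokens to contextual embeddings rather than generating text:
```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("BERT encodes language; it does not generate it.",
                   return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)   # (1, number_of_tokens, 768)
```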
masked language modelling
- Predicting randomly masked words in context to capture the meaning of words
- Next-sentence classification to capture the relationship between sentences
both objectives are trained jointly during pre-training
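A minimal sketch of the masked-word prediction, assuming the Hugging Face fill-mask pipeline and a public BERT checkpoint (the example sentence is made up):
```python
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")
# The model predicts the masked word from its bidirectional context.
for pred in fill("The doctor asked the [MASK] to describe the symptoms."):
    print(pred["token_str"], round(pred["score"], 3))
```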
WordPiece
specific type of tokenization used by BERT
A fixed-size vocabulary is defined to model huge corpora
The WordPiece vocabulary is optimized to cover as many words as possible
- frequent words are single tokens, e.g. “walking” and “talking”
- less frequent words are split into subwords, e.g. “bi” + “##king”, “bio” + “##sta” + “##tist” + “##ics”
- this is not linguistically motivated, but purely computational
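A minimal sketch, assuming the bert-base-uncased WordPiece vocabulary; the exact subword splits depend on the vocabulary, so the outputs shown are only indicative:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("walking"))        # frequent word -> single token, e.g. ['walking']
print(tokenizer.tokenize("biostatistics"))  # rarer word -> subwords, e.g. ['bio', '##sta', ...]
```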
success of BERT
- achieves state-of-the-art results on a large range of tasks, and even across a large range of domains
- pre-trained models can easily be fine-tuned
- pre-trained models are available for many languages, as well as domain-specific pre-trained BERT models: bioBERT etc.
BERT for similarity
With BERT, if we want to compute the similarity (or some other relation) between two sentences, we concatenate them in the input and then feed them to the BERT encoder
Finding the most similar pair in a collection of 10,000 sentences takes about 65 hours with BERT.
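A minimal sketch of this joint encoding, assuming the Hugging Face transformers library; the two sentences are concatenated into a single [SEP]-separated input:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = tokenizer("A man is playing guitar.", "Someone plays an instrument.")
print(tokenizer.decode(enc["input_ids"]))
# [CLS] a man is playing guitar. [SEP] someone plays an instrument. [SEP]
```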
SBERT
- independent encoding of two sentences with a BERT encoder
- then measure similarity between the two embeddings
=> reduces the effort for finding the most similar pair from 65 hours with BERT to about 5 seconds with SBERT, while maintaining the accuracy from BERT
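A minimal sketch, assuming the sentence-transformers library and the public all-MiniLM-L6-v2 checkpoint (any SBERT-style model works the same way): each sentence is encoded independently and similarity is computed between the embeddings:
```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(["A man is playing guitar.",
                           "Someone plays an instrument."])
print(util.cos_sim(embeddings[0], embeddings[1]))   # cosine similarity of the pair
```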
transfer learning with neural language models
Inductive transfer learning: transfer the knowledge from pretrained language models to any NLP task
- During pre-training, the model is trained on unlabeled data (self-supervision) over different pre-training tasks
- For fine-tuning, the BERT model is first initialized with the pre-trained parameters
- All of the parameters are fine-tuned using labeled data from the downstream tasks (supervised learning)
Each downstream task has separate fine-tuned models, even though they are initialized with the same pre-trained parameters.
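A minimal sketch of one fine-tuning step, assuming Hugging Face transformers, PyTorch, and two made-up labelled examples; the classification head is newly initialised on top of the pre-trained parameters, and all parameters are then updated:
```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)        # pre-trained encoder + new head

batch = tokenizer(["great movie", "terrible movie"],
                  return_tensors="pt", padding=True)
labels = torch.tensor([1, 0])                 # labelled downstream data

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss = model(**batch, labels=labels).loss     # supervised loss
loss.backward()
optimizer.step()                              # fine-tunes all parameters
```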
zero-shot use
using a pre-trained model without fine-tuning
We also use the term 'zero-shot' for the use of models that were fine-tuned by someone else or on a different task, e.g.:
- trained on a newspaper benchmark, applied to Twitter data
- trained on English, used for Dutch
few-shot learning
fine-tuning with a small number of samples
challenges of state-of-the-art methods
time- and memory-expensive:
- pre-training takes time (days) and computing power
- fine-tuning takes time (hours) and computing power
- use of a fine-tuned model (inference) needs computing power
Hyperparameter tuning:
- optimization on the development set takes time
- adopting hyperparameters from the pre-training task might be suboptimal