NLP Flashcards
How big are SOTA NLP models in terms of parameters?
For instance, models like BERT-Large (Devlin et al., 2019), GPT-2 (Radford et al., 2019), Megatron (Shoeybi et al., 2019) and T5 (Raffel et al., 2019) have 340M, 1.5B, 8.3B and 11B parameters respectively.
What is knowledge distillation (general intro)?
Knowledge distillation (Hinton et al., 2015; Ba and Caruana, 2014), earlier used in computer vision, is one of the techniques for compressing huge neural networks into smaller ones. Shallow models (called students) are trained to mimic the output of huge models (called teachers) on a transfer set. Similar approaches have recently been adopted for language model distillation.
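A minimal sketch of the usual distillation objective, assuming PyTorch: the student is trained to match the teacher's temperature-softened output distribution, optionally mixed with the standard hard-label loss.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: KL divergence between temperature-scaled distributions,
    # rescaled by T^2 as in Hinton et al. (2015).
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: ordinary cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```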
What are some problems with current knowledge distillation techniques?
These methods are constrained by architectural considerations such as the embedding dimension in BERT and the Transformer architecture itself. Additionally, most of the above works are geared toward distilling language models for GLUE tasks.
What are possible solutions (e.g., XtremeDistil)?
Some concurrent works (Turc et al., 2019; Zhao et al., 2019) adopt pre-training or dual training to distil students of arbitrary architecture. However, pre-training is expensive in terms of time and computational resources.
Solution: XtremeDistil, https://www.aclweb.org/anthology/2020.acl-main.202.pdf
What is the GLUE benchmark?
The General Language Understanding Evaluation (GLUE) benchmark is a collection of resources for training, evaluating, and analyzing natural language understanding systems. GLUE consists of:
A benchmark of nine sentence- or sentence-pair language understanding tasks built on established existing datasets and selected to cover a diverse range of dataset sizes, text genres, and degrees of difficulty
For example, CoLA (the Corpus of Linguistic Acceptability), among others.
https://docs.google.com/spreadsheets/d/1BrOdjJgky7FfeiwC_VDURZuRPUFUAz_jfczPPT35P00/edit#gid=0
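As a quick illustration, the GLUE tasks can be loaded in this unified format via the Hugging Face datasets library (assumed here); a CoLA training example has sentence, label and idx fields.

```python
from datasets import load_dataset

# Load one GLUE task (CoLA: linguistic acceptability) and inspect a training example.
cola = load_dataset("glue", "cola")
print(cola["train"][0])  # {'sentence': ..., 'label': 0 or 1, 'idx': ...}
```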
What is SQuAD 2.0?
Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.
SQuAD2.0 combines the 100,000 questions in SQuAD1.1 with over 50,000 unanswerable questions written adversarially by crowdworkers to look similar to answerable ones. To do well on SQuAD2.0, systems must not only answer questions when possible, but also determine when no answer is supported by the paragraph and abstain from answering.
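A schematic sketch of what SQuAD 2.0 examples look like; the field names follow the official format, while the questions, answers and offsets below are made up for illustration.

```python
# Answerable: the answer is a span of the reading passage.
answerable = {
    "question": "Where is the Eiffel Tower located?",
    "answers": [{"text": "Paris", "answer_start": 31}],
    "is_impossible": False,
}

# Unanswerable (new in SQuAD 2.0): the system should abstain.
unanswerable = {
    "question": "When was the Eiffel Tower demolished?",
    "answers": [],  # no span in the passage supports an answer
    "is_impossible": True,
}
```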
In the context of knowledge distillation, what is a teacher assistant?
Following Mirzadeh et al. (2019), we introduce a teacher assistant (i.e., an intermediate-size student model) to further improve the model performance of smaller students. Assume the teacher model is an L-layer Transformer with hidden size d_h and the student model is an M-layer Transformer with hidden size d'_h. For smaller students (M ≤ L/2, d'_h ≤ d_h/2), we first distill the teacher into a teacher assistant with L Transformer layers and hidden size d'_h (the hidden size of the student). The assistant model is then used as the teacher to guide the training of the final student.
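A structural sketch of this two-stage setup, assuming PyTorch and using toy stand-in encoders; the layer and hidden-size values are illustrative, and the distillation objective itself is the one sketched in the earlier card.

```python
import torch.nn as nn

def make_encoder(num_layers, hidden_size):
    # Toy stand-in for a Transformer with a given depth and hidden size.
    layer = nn.TransformerEncoderLayer(d_model=hidden_size, nhead=4, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=num_layers)

L, M = 12, 6                 # teacher depth, student depth (M <= L/2)
d_h, d_h_prime = 768, 384    # teacher width, student width (d'_h <= d_h/2)

teacher   = make_encoder(L, d_h)
assistant = make_encoder(L, d_h_prime)   # teacher's depth, student's width
student   = make_encoder(M, d_h_prime)

# Stage 1: distill teacher   -> assistant
# Stage 2: distill assistant -> student (the assistant now plays the teacher role)
```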
What is the size of BERT-Base?
BERT-Base (Devlin et al., 2018) is a 12-layer Transformer with 768 hidden size and 12 attention heads, and contains about 109M parameters.
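A back-of-the-envelope check of that figure, using the standard BERT-Base configuration (minor terms may be counted slightly differently elsewhere):

```python
# Rough parameter count for BERT-Base: vocab 30522, max positions 512,
# hidden 768, FFN 3072, 12 layers.
V, P, H, F, L = 30522, 512, 768, 3072, 12

embeddings = V * H + P * H + 2 * H + 2 * H      # token + position + segment + LayerNorm
per_layer = (
    4 * (H * H + H)    # Q, K, V and attention output projections (with biases)
    + (H * F + F)      # feed-forward up-projection
    + (F * H + H)      # feed-forward down-projection
    + 2 * 2 * H        # two LayerNorms
)
pooler = H * H + H

total = embeddings + L * per_layer + pooler
print(f"{total / 1e6:.1f}M parameters")  # ~109.5M
```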
What is the word tokenization algorithm used in BERT called?
WordPiece
What are other possible tokenization algorithms?
SentencePiece, which combines BPE- and WordPiece-style subword approaches.
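For illustration, BERT's WordPiece tokenizer can be inspected with the Hugging Face transformers library (assumed here):

```python
from transformers import BertTokenizer

tok = BertTokenizer.from_pretrained("bert-base-uncased")
# Rare or unseen words are split into subword pieces, marked with the '##' prefix.
print(tok.tokenize("Knowledge distillation compresses transformers"))
```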
What is the order of magnitude of the vocabulary size of BERT-Base and the maximum sequence length accepted?
The vocabulary size is 30,522. The maximum sequence length is 512.
What is the architecture of BERT-Large?
BERT-Base – 12 layers (Transformer blocks), 12 attention heads, and 110 million parameters.
BERT-Large – 24 layers, 16 attention heads, and 340 million parameters.
What are the benefits in production of having a multilingual model?
1) Fewer errors and less (privacy-)critical data used, because you can leverage transfer learning across languages.
2) Better production efficiency: smaller team size and smaller data size, since one model serves many languages.
In terms of encoder/decoder architecture, what are the 3 possible types? Give an example for each.
1) Decoder-only (GPT)
2) Encoder-decoder (T5)
3) Encoder-only (BERT)
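A minimal sketch loading one representative model of each family with the Hugging Face transformers library (assumed here):

```python
from transformers import AutoModel, AutoModelForCausalLM, AutoModelForSeq2SeqLM

encoder_only    = AutoModel.from_pretrained("bert-base-uncased")      # encoder-only (BERT)
decoder_only    = AutoModelForCausalLM.from_pretrained("gpt2")        # decoder-only (GPT)
encoder_decoder = AutoModelForSeq2SeqLM.from_pretrained("t5-small")   # encoder-decoder (T5)
```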
Is BERT encoder-only, decoder-only, or encoder-decoder?
Encoder-only
Is GPT encoder-only, decoder-only, or encoder-decoder?
Decoder-only
Is T5 encoder-only, decoder-only, or encoder-decoder?
Encoder-decoder
How are GPT and decoder-only architectures trained, and what are their main applications?
They are trained to generate the next word given the previous ones (autoregressive decoding).
Great for all generation tasks.
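An illustrative example of autoregressive generation with GPT-2 via the transformers library (assumed here); the model simply keeps predicting the next token:

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

input_ids = tok("Knowledge distillation is", return_tensors="pt").input_ids
output_ids = model.generate(input_ids, max_length=20, do_sample=False)  # greedy next-token decoding
print(tok.decode(output_ids[0]))
```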
Are decoder models like GPT SOTA for NLU tasks without fine-tuning?
No, BERT (encoder-only) is.
How are BERT and encoder-only architectures trained, and what are their main applications?
Trained as a masked language model. The Transformer encoder it builds on was originally used for machine translation; in BERT it is great for obtaining contextual embeddings, and the encoder is very versatile for transfer learning (classification, NER, etc.).
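An illustrative masked-language-model prediction with BERT, using the transformers fill-mask pipeline (assumed here):

```python
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")
# BERT was trained to recover masked tokens, so it can rank candidates for [MASK].
for pred in unmasker("The capital of France is [MASK]."):
    print(pred["token_str"], round(pred["score"], 3))
```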
Do Transformers have positional awareness of the tokens they receive as input?
No, that is why you need position embeddings. LSTMs, by contrast, do have positional awareness.
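A minimal sketch of learned position embeddings (the scheme used in BERT), assuming PyTorch; the sizes are the BERT-Base ones from the earlier cards:

```python
import torch
import torch.nn as nn

vocab_size, max_len, hidden = 30522, 512, 768
token_emb = nn.Embedding(vocab_size, hidden)
pos_emb = nn.Embedding(max_len, hidden)

input_ids = torch.randint(0, vocab_size, (1, 10))          # a sequence of 10 token ids
positions = torch.arange(input_ids.size(1)).unsqueeze(0)   # [0, 1, ..., 9]

# Without pos_emb the model would see an unordered bag of tokens;
# adding it injects information about token order.
x = token_emb(input_ids) + pos_emb(positions)
```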
Do LSTMs have positional awareness of the tokens they receive as input?
Yes, contrary to Transformers, they do, because they process tokens sequentially.
Why does distillation work so well?
The idea is that language modelling and language generation are much harder tasks than other NLU tasks like NER or sentiment analysis, so once a model has learned those patterns it is in a great place to handle the easier tasks.
What was the problem with older word embeddings like Word2Vec, GloVe and fastText?
They are not context-aware, so a word has the same representation no matter how it is used: river bank vs. financial bank.
They have one vector per word, but words have several dimensions/aspects: semantics, connotations, etc.
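An illustrative check of context-awareness with BERT (transformers and PyTorch assumed): the same surface word "bank" gets different vectors in different contexts, unlike a static embedding.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence):
    inputs = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]              # (seq_len, 768)
    idx = inputs["input_ids"][0].tolist().index(tok.convert_tokens_to_ids("bank"))
    return hidden[idx]

v_river = bank_vector("He sat on the bank of the river.")
v_money = bank_vector("She deposited money at the bank.")
# Cosine similarity well below 1.0: the contextual vectors differ.
print(torch.cosine_similarity(v_river, v_money, dim=0).item())
```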
What is multi-task learning, and is it more effective than single-task learning? When?
Pre-training + Massive Multi-tasking 💑
Multi-task learning (MTL), training a model on several tasks at once and sharing information between them, is a general method that is fundamental to training neural networks. Rich Caruana's 1997 paper is one of the best introductions to this topic and as relevant today as it was back then. For more recent overviews, you can check out my survey from 2017 or a survey from 2020 that I enjoyed.
Research in multi-task learning has long shown that models trained on many tasks learn representations that generalize better to new ones. A common problem in multi-task learning, however, is minimizing negative transfer, i.e. how to make sure that tasks that are dissimilar do not hurt each other.
In recent years, despite much work on alternative training objectives, the NLP community has gravitated to a single pre-training objective to rule them all: masked language modelling (MLM). Much recent work has focused on ways to adapt and improve it (e.g., Levine et al., 2021). Even the next-sentence-prediction objective used in BERT has been slowly phased out (Aroca-Ouellette & Rudzicz, 2020).
Recently, there has been a flurry of papers that show not only that multi-task learning helps pre-trained models, but that gains are larger when more tasks are used. Such massive multi-task learning settings cover up to around 100 tasks, going beyond earlier work that covered around 50 tasks (Aghajanyan et al., 2021).
A key reason for this convergence of papers is that multi-task learning is much easier with recent models, even across many tasks. This is due to the fact that many recent models such as T5 and GPT-3 use a text-to-text format. Gone are the days of hand-engineered task-specific loss functions for multi-task learning. Instead, each task only needs to be expressed in a suitable text-to-text format and models will be able to learn from it, without any changes to the underlying model.
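To make this concrete, here is a sketch (with hypothetical prompt strings) of how two different tasks collapse into the same text-to-text interface in the spirit of T5:

```python
examples = [
    # Sentiment classification
    {"input": "sst2 sentence: the film was a delight", "target": "positive"},
    # Natural language inference
    {"input": "mnli premise: A man is cooking. hypothesis: A person prepares food.",
     "target": "entailment"},
]
# Every task shares the same (text in, text out) interface, so one seq2seq model
# can be trained on all of them without task-specific heads or loss functions.
```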
The newly proposed approaches differ in terms of how and when multi-task learning is applied. One choice is fine-tuning an existing pre-trained model on a collection of multiple tasks, i.e. behavioural fine-tuning. This is done by T0 (Sanh et al., 2021), one of the first outcomes of the BigScience workshop, using T5 and FLAN (Wei et al., 2021) using a GPT-3-like pre-trained model. Both papers describe a unified template and instruction format into which they convert existing datasets. BigScience open-sources their collection of prompts here. Both papers report large improvements in terms of zero-shot and few-shot performance compared to state-of-the-art models like T5 and GPT-3.
Min et al. (2021) propose a different fine-tuning setting that optimizes for in-context learning: instead of fine-tuning a model on examples of a task directly, they provide the concatenation of k+1 examples to a model as input x_1, y_1, …, x_k, y_k, x_{k+1} and train the model to predict the label of the k+1-th example, y_{k+1}. They similarly report improvements in zero-shot transfer.
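A sketch of that in-context training input (the examples and labels below are made up for illustration): k labelled examples are concatenated with the (k+1)-th input, and the model is trained to produce the (k+1)-th label.

```python
k_examples = [("great movie", "positive"), ("dull and slow", "negative")]   # (x_i, y_i) pairs
x_next = "a true masterpiece"                                               # x_{k+1}

source = " ".join(f"{x} {y}" for x, y in k_examples) + f" {x_next}"
target = "positive"                                                         # y_{k+1}
print(source, "->", target)
```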
In contrast to the previous approaches, ExT5 (Anonymous et al., 2021) pre-trains a model on a large collection of tasks. They observe that using multiple tasks during pre-training is better than during fine-tuning and that multi-task pre-training combined with MLM is significantly more sample-efficient than just using MLM (see below).
Figure: SuperGLUE score of ExT5-LARGE vs T5-LARGE as a function of the number of pre-training steps.
On the whole, these papers highlight the benefit of combining self-supervised pre-training with supervised multi-task learning. While multi-task fine-tuned models were always somewhat inferior to single-task models on small task collections such as GLUE—with a few exceptions (Liu et al., 2019; Clark et al., 2019)—multi-task models may soon hold state-of-the-art results on many benchmarks. Given the availability and open-source nature of datasets in a unified format, we can imagine a virtuous cycle where newly created high-quality datasets are used to train more powerful models on increasingly diverse task collections, which could then be used in-the-loop to create more challenging datasets.
In light of the increasingly multi-task nature of such models, what then does it mean to do zero-shot learning? In current training setups, datasets from certain tasks such as NLI are excluded from training in order to ensure a fair zero-shot scenario at test time. As open-source multi-task models trained on many existing tasks become more common, it will be increasingly difficult to guarantee a setting where a model has not seen examples of a similar task. In this context, few-shot learning or the full supervised setting may become the preferred evaluation paradigms.