NLP Flashcards
How big are SOTA NLP models in terms of parameters?
For instance, models like BERT Large (Devlin et al., 2019), GPT-2 (Radford et al., 2019), Megatron (Shoeybi et al., 2019), and T5 (Raffel et al., 2019) have 340M, 1.5B, 8.3B, and 11B parameters respectively.
What is knowledge distillation? (general intro)
Knowledge distillation (Hinton et al., 2015; Ba and Caruana, 2014), earlier used in computer vision, is one technique for compressing huge neural networks into smaller ones. Shallow models (called students) are trained to mimic the output of huge models (called teachers) on a transfer set. Similar approaches have recently been adopted for language model distillation.
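As a rough illustration, here is a minimal PyTorch sketch of the classic distillation loss from Hinton et al. (2015): the student matches the teacher's temperature-softened output distribution in addition to the hard labels. The tensor shapes, temperature, and mixing weight below are illustrative assumptions, not values from any particular paper.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: KL divergence between temperature-scaled distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: standard cross-entropy on the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

# Random tensors standing in for one batch from the transfer set.
student_logits = torch.randn(8, 10)
teacher_logits = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
loss = distillation_loss(student_logits, teacher_logits, labels)
```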
What are some problems with current knowledge distillation techniques?
These methods are constrained by architectural considerations, such as BERT's embedding dimension and the Transformer architecture itself. Additionally, most of the above works are geared toward distilling language models for GLUE tasks.
What are possible solutions? (e.g., XtremeDistil)
Some concurrent works (Turc et al., 2019; Zhao et al., 2019) adopt pre-training or dual training to distil students of arbitrary architecture. However, pre-training is expensive in terms of time and computational resources.
Another solution: XtremeDistil https://www.aclweb.org/anthology/2020.acl-main.202.pdf
What is the GLUE benchmark?
The General Language Understanding Evaluation (GLUE) benchmark is a collection of resources for training, evaluating, and analyzing natural language understanding systems. GLUE consists of:
A benchmark of nine sentence- or sentence-pair language understanding tasks built on established existing datasets and selected to cover a diverse range of dataset sizes, text genres, and degrees of difficulty. Examples include CoLA, SST-2, MRPC, STS-B, QQP, MNLI, QNLI, RTE, and WNLI.
https://docs.google.com/spreadsheets/d/1BrOdjJgky7FfeiwC_VDURZuRPUFUAz_jfczPPT35P00/edit#gid=0
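As a quick illustration, a GLUE task can be loaded in a couple of lines, assuming the Hugging Face datasets library is available (the field names shown are those of the CoLA task):

```python
from datasets import load_dataset

# Load the CoLA task (linguistic acceptability) from the GLUE benchmark.
cola = load_dataset("glue", "cola")
print(cola["train"][0])   # e.g. {'sentence': ..., 'label': 0 or 1, 'idx': ...}
```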
What is SQuAD 2.0?
Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.
SQuAD2.0 combines the 100,000 questions in SQuAD1.1 with over 50,000 unanswerable questions written adversarially by crowdworkers to look similar to answerable ones. To do well on SQuAD2.0, systems must not only answer questions when possible, but also determine when no answer is supported by the paragraph and abstain from answering.
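A minimal sketch of loading SQuAD 2.0, assuming the Hugging Face datasets library is available; unanswerable questions simply come with an empty answers field.

```python
from datasets import load_dataset

# Load SQuAD 2.0 (answerable + adversarially written unanswerable questions).
squad = load_dataset("squad_v2")
example = squad["train"][0]
print(example.keys())   # id, title, context, question, answers
```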
In the context of knowledge distillation, what is a teacher assistant?
Following Mirzadeh et al. (2019), we introduce a teacher assistant (i.e., an intermediate-size student model) to further improve the performance of smaller students. Assuming the teacher model is an L-layer Transformer with hidden size d_h and the student model is an M-layer Transformer with hidden size d'_h, then for smaller students (M ≤ L/2, d'_h ≤ d_h/2) we first distill the teacher into a teacher assistant with L Transformer layers and hidden size d'_h (the hidden size of the student). The assistant model is then used as the teacher to guide the training of the final student.
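A schematic sketch of the rule above; the layer counts and hidden sizes are hypothetical examples, not values from the paper.

```python
# For smaller students (M <= L/2, d'_h <= d_h/2), the assistant keeps the
# teacher's depth L but uses the student's hidden size d'_h.
def teacher_assistant_config(teacher, student):
    needs_ta = (student["layers"] <= teacher["layers"] // 2
                and student["hidden"] <= teacher["hidden"] // 2)
    if not needs_ta:
        return None
    return {"layers": teacher["layers"], "hidden": student["hidden"]}

teacher = {"layers": 24, "hidden": 1024}   # L-layer Transformer, hidden size d_h
student = {"layers": 6,  "hidden": 384}    # M-layer Transformer, hidden size d'_h
print(teacher_assistant_config(teacher, student))   # {'layers': 24, 'hidden': 384}
```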
What is the size of BERT Base?
BERT Base (Devlin et al., 2018) is a 12-layer Transformer with 768 hidden size and 12 attention heads, containing about 109M parameters.
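As a sanity check on the ~109M figure, here is a rough back-of-the-envelope parameter count in Python, using the standard BERT Base hyperparameters; the breakdown is an approximation for illustration, not an official accounting.

```python
V, P, S = 30522, 512, 2        # vocab size, max positions, segment types
H, L, FF = 768, 12, 3072       # hidden size, layers, feed-forward size

embeddings = (V + P + S) * H + 2 * H   # token/position/segment tables + LayerNorm
per_layer = (
    4 * (H * H + H)            # Q, K, V and attention output projections (+ biases)
    + (H * FF + FF)            # feed-forward up-projection
    + (FF * H + H)             # feed-forward down-projection
    + 2 * 2 * H                # two LayerNorms
)
pooler = H * H + H

total = embeddings + L * per_layer + pooler
print(f"{total:,}")            # 109,482,240 -> about 109M parameters
```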
What is the word tokenization algorithm used in BERT called?
WordPiece
What are other possible tokenization algorithms?
Byte-Pair Encoding (BPE), WordPiece, and the unigram language model. SentencePiece is a library that implements BPE and unigram tokenization directly on raw text.
What is the order of magnitude of the vocabulary size of BERT Base, and what maximum sequence length does it accept?
The vocabulary size is 30,522 and the maximum sequence length is 512.
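These numbers can be checked directly, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint are available:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tok.vocab_size)          # 30522
print(tok.model_max_length)    # 512
# WordPiece splits rare words into subword pieces prefixed with "##".
print(tok.tokenize("Distillation works surprisingly well"))
```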
What is the architecture of BERT Large?
BERT Base: 12 layers (Transformer blocks), 12 attention heads, 768 hidden size, about 110 million parameters.
BERT Large: 24 layers, 16 attention heads, 1024 hidden size, about 340 million parameters.
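A sketch of instantiating the two configurations, assuming the Hugging Face transformers library is available (BertConfig's defaults already match BERT Base):

```python
from transformers import BertConfig, BertModel

# BERT Base: 12 layers, 768 hidden size, 12 heads (these are the defaults).
base_config = BertConfig(hidden_size=768, num_hidden_layers=12, num_attention_heads=12)

# BERT Large: 24 layers, 1024 hidden size, 16 heads, 4096 feed-forward size.
large_config = BertConfig(hidden_size=1024, num_hidden_layers=24,
                          num_attention_heads=16, intermediate_size=4096)

model = BertModel(large_config)
print(sum(p.numel() for p in model.parameters()))   # roughly 335M; commonly quoted as ~340M
```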
What are the benefits in production of having a multilingual model?
1) Fewer errors and less need for critical (privacy-sensitive) data per language, because you can leverage cross-lingual transfer learning.
2) Simpler production: one model covers many languages, which reduces the required team size and the amount of training data.
In terms of encoder/decoder architecture, what are the 3 possible types? Give an example of each.
1) Encoder only (e.g., BERT)
2) Decoder only (e.g., GPT)
3) Encoder-decoder (e.g., T5)
Is BERT encoder-only, decoder-only, or encoder-decoder?
Encoder only
Is GPT encoder-only, decoder-only, or encoder-decoder?
Decoder only
Is T5 encoder-only, decoder-only, or encoder-decoder?
Encoder-decoder
How are GPT and other decoder-only architectures trained, and what are their main applications?
They are trained to generate the next word given the previous ones (autoregressive decoding).
Great for all generation tasks.
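A minimal sketch of autoregressive generation, assuming the Hugging Face transformers library and the gpt2 checkpoint are available:

```python
from transformers import pipeline

# The model repeatedly predicts the next token given all the previous ones.
generator = pipeline("text-generation", model="gpt2")
print(generator("Knowledge distillation is", max_new_tokens=20))
```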
Are decoder models like GPT SOTA for NLU tasks without fine-tuning?
No, encoder-only models like BERT are.
How are BERT and other encoder-only architectures trained, and what are their main applications?
They are trained as masked language models (predicting randomly masked tokens from bidirectional context).
They are great for obtaining contextual embeddings, and the pretrained encoder is very versatile for transfer learning on NLU tasks; encoders are also a core component of encoder-decoder systems such as machine translation.
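A minimal sketch of the masked-language-modelling objective at inference time, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint:

```python
from transformers import pipeline

# BERT predicts the hidden token from bidirectional (left and right) context.
unmasker = pipeline("fill-mask", model="bert-base-uncased")
print(unmasker("Paris is the [MASK] of France."))
```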
Do Transformers have any notion of positional awareness for the tokens they receive as input?
No; that is why position embeddings are added. LSTMs, in contrast, do have positional awareness.
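A minimal NumPy sketch of the sinusoidal position encodings from the original Transformer paper; note that BERT instead learns its position embeddings, but the purpose is the same, namely injecting token order, which self-attention alone ignores.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]     # (max_len, 1)
    i = np.arange(d_model)[None, :]       # (1, d_model)
    angle = pos / np.power(10000, (2 * (i // 2)) / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle[:, 0::2])  # even dimensions use sine
    pe[:, 1::2] = np.cos(angle[:, 1::2])  # odd dimensions use cosine
    return pe

print(positional_encoding(max_len=512, d_model=768).shape)   # (512, 768)
```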
Do LSTMs have a notion of positional awareness for the tokens they receive as input?
Yes; contrary to Transformers, they process tokens sequentially, so order is built in.
Why does distillation work so well?
The idea is that language modelling and language generation are much harder tasks than other NLU tasks like NER or sentiment analysis, so once those patterns are learned, the model is in a great position to handle the simpler tasks.