NLP Flashcards

1
Q

How big are SOTA NLP models in terms of parameters?

A

For instance, models like BERT Large (Devlin et al., 2019), GPT-2 (Radford et al., 2019), Megatron-LM (Shoeybi et al., 2019) and T5 (Raffel et al., 2019) have 340M, 1.5B, 8.3B and 11B parameters, respectively.

2
Q

What is knowledge distillation? (general intro)

A

Knowledge distillation (Hinton et al., 2015; Ba and Caruana, 2014), earlier used in computer vision, is one technique for compressing huge neural networks into smaller ones. Shallow models (called students) are trained to mimic the output of huge models (called teachers) on a transfer set. Similar approaches have recently been adopted for language model distillation.
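
A minimal PyTorch sketch of the soft-target distillation loss in the spirit of Hinton et al. (2015): the student is trained on a weighted mix of a KL term towards the teacher's temperature-softened outputs and an ordinary cross-entropy term on the transfer-set labels. The function name, the temperature T and the weight alpha are illustrative choices, not taken from the cited papers.

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: the student mimics the teacher's temperature-softened distribution.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: ordinary cross-entropy on the ground-truth labels of the transfer set.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard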

3
Q

What are some problems with current knowledge distillation techniques?

A

These methods are constrained by architectural considerations such as the embedding dimension of BERT and the transformer architecture itself.

Additionally, most of the above works are geared towards distilling language models for GLUE tasks.

Some concurrent works (Turc et al., 2019; Zhao et al., 2019) adopt pre-training or dual training to distil students of arbitrary architecture. However, pre-training is expensive in terms of time and computational resources.

A possible solution is XtremeDistil: https://www.aclweb.org/anthology/2020.acl-main.202.pdf

4
Q

What is the GLUE benchmark?

A

The General Language Understanding Evaluation (GLUE) benchmark is a collection of resources for training, evaluating, and analyzing natural language understanding systems. GLUE consists of:

A benchmark of nine sentence- or sentence-pair language understanding tasks built on established existing datasets and selected to cover a diverse range of dataset sizes, text genres, and degrees of difficulty

For example: CoLA (Corpus of Linguistic Acceptability), SST-2, MRPC, STS-B, QQP, MNLI, QNLI, RTE, and WNLI.

https://docs.google.com/spreadsheets/d/1BrOdjJgky7FfeiwC_VDURZuRPUFUAz_jfczPPT35P00/edit#gid=0

5
Q

What is SQuAD 2.0?

A

Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.

SQuAD2.0 combines the 100,000 questions in SQuAD1.1 with over 50,000 unanswerable questions written adversarially by crowdworkers to look similar to answerable ones. To do well on SQuAD2.0, systems must not only answer questions when possible, but also determine when no answer is supported by the paragraph and abstain from answering.

6
Q

In the context of Knowledge distillation, what is a Teaching assistant?

A

Following Mirzadeh et al. (2019), we introduce a teacher assistant (i.e., an intermediate-size student model) to further improve the performance of smaller students. Assume the teacher model is an L-layer Transformer with hidden size d_h and the student model is an M-layer Transformer with hidden size d'_h. For smaller students (M < L/2 and d'_h < d_h/2), we first distill the teacher into a teacher assistant with L layers and hidden size d'_h (i.e., the teacher's depth but the student's width). The assistant model is then used as the teacher to guide the training of the final student. For example, BERT-base (12 layers, hidden size 768) can be distilled into a 6-layer, 384-hidden student via a 12-layer, 384-hidden assistant.

7
Q

What is the size of BERT Base?

A

BERT-base (Devlin et al., 2018) is a 12-layer Transformer with hidden size 768 and 12 attention heads, and contains about 109M parameters.

8
Q

What is the tokenization algorithm used in BERT called?

A

WordPiece

9
Q

What are some possible tokenization algorithms?

A

SentencePiece, which combines BPE or Unigram with whitespace-aware tokenization; other options include WordPiece (used by BERT) and plain BPE.

10
Q

What is the order of magnitude of the vocabulary size of BERT-base, and what is the maximum sequence length it accepts?

A

The vocabulary size is 30,522. The maximum sequence length is 512.

11
Q

What is the architecture of BERT Large?

A

BERT Base – 12 layers (transformer blocks), 12 attention heads, hidden size 768, and 110 million parameters.
BERT Large – 24 layers, 16 attention heads, hidden size 1024, and 340 million parameters.

12
Q

What are the benefits in production of having a multilingual model?

A

1) Fewer errors and less (privacy-)critical data needed per language, because you can leverage transfer learning across languages.
2) Easier productionization: one model instead of one per language, which reduces team size and the amount of data required.

13
Q

In terms of encoder/decoder architecture, what are the 3 possible types? Give an example of each.

A

1) Decoder only (e.g., GPT)
2) Encoder/decoder (e.g., T5)
3) Encoder only (e.g., BERT)

14
Q

Is BERT encoder-only, decoder-only, or encoder/decoder?

A

Encoder only

15
Q

Is GPT encoder-only, decoder-only, or encoder/decoder?

A

Decoder only

16
Q

Is T5 encoder-only, decoder-only, or encoder/decoder?

A

encoder/decoder

17
Q

How are GPT and other decoder-only architectures trained, and what are their main applications?

A

They are trained to generate the next word given the previous ones (autoregressive decoding).

Great for generation tasks.
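
A minimal sketch of the causal (next-token) objective: the model's logits at position t are scored against the token at position t+1. The decoder is replaced here by random logits, and all shapes and the vocabulary size are illustrative.

import torch
import torch.nn.functional as F

vocab = 30522
logits = torch.randn(2, 10, vocab)          # stand-in for decoder outputs (batch, seq_len, vocab)
tokens = torch.randint(0, vocab, (2, 10))   # input token ids

loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab),      # predictions at positions 0 .. n-2
    tokens[:, 1:].reshape(-1),              # targets are the same tokens shifted by one
)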

18
Q

Are decoder models like GPT SOTA for NLU tasks without fine-tuning?

A

No; encoder-only models like BERT are.

19
Q

How are BERT and other encoder-only architectures trained, and what are their main applications?

A

Trained as masked language models. They are great for obtaining contextual embeddings, and the pretrained encoder itself is very versatile for transfer learning; such encoders are also used a lot in machine translation systems.

20
Q

Do transformers have a notion of positional awareness for the tokens they receive as input?

A

No; that is why you need position embeddings. LSTMs, by contrast, do have positional awareness because they process tokens sequentially.
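
A small PyTorch sketch of how position information is injected, assuming BERT-style learned absolute position embeddings; all sizes are illustrative.

import torch
import torch.nn as nn

vocab_size, max_len, d = 30522, 512, 768
tok_emb = nn.Embedding(vocab_size, d)                  # token embeddings
pos_emb = nn.Embedding(max_len, d)                     # learned position embeddings

ids = torch.randint(0, vocab_size, (1, 16))            # (batch, seq_len)
positions = torch.arange(ids.size(1)).unsqueeze(0)     # 0, 1, ..., 15
x = tok_emb(ids) + pos_emb(positions)                  # input to the first transformer layer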

21
Q

Do LSTMs have a notion of positional awareness for the tokens they receive as input?

A

Yes; unlike transformers, they do, since they process the sequence token by token.

22
Q

Why does distillation work so well?

A

The idea is that language modelling and language generation are much harder tasks than other NLU tasks like NER or sentiment analysis, so once the model has learned those patterns it is in a great position for the downstream task.

23
Q

What was the problem with older word embeddings like Word2Vec, GloVe, and fastText?

A

They are not context-aware, so a word has the same representation no matter how it is used: river bank vs. financial bank.

They have one vector per word, but words have several dimensions/aspects: semantics, connotations, etc.

24
Q

What is multi-task learning, and is it more effective than single-task learning? When?

A

Pre-training + Massive Multi-tasking 💑
Multi-task learning (MTL), training a model on several tasks at once and sharing information is a general method that is fundamental to training neural networks. Rich Caruana’s 1997 paper is one of the best introductions to this topic and as relevant today as it was back then. For more recent overviews, you can check out my survey from 2017 or a survey from 2020 that I enjoyed.
Research in multi-task learning has long shown that models trained on many tasks learn representations that generalize better to new ones. A common problem in multi-task learning, however, is minimizing negative transfer, i.e. how to make sure that tasks that are dissimilar do not hurt each other.
In recent years despite much work on alternative training objectives, the NLP community has gravitated to a single pre-training objective to rule them all, masked language modelling (MLM). Much recent work has focused on ways to adapt and improve it (e.g., Levine et al., 2021). Even the next-sentence-prediction objective used in BERT has become slowly phased out (Aroca-Ouellette & Rudzicz, 2020).
Recently, there has been a flurry of papers that show not only that multi-task learning helps pre-trained models, but that gains are larger when more tasks are used. Such massive multi-task learning settings cover up to around 100 tasks, going beyond earlier work that covered around 50 tasks (Aghajanyan et al., 2021).
A key reason for this convergence of papers is that multi-task learning is much easier with recent models, even across many tasks. This is due to the fact that many recent models such as T5 and GPT-3 use a text-to-text format. Gone are the days of hand-engineered task-specific loss functions for multi-task learning. Instead, each task only needs to be expressed in a suitable text-to-text format and models will be able to learn from it, without any changes to the underlying model.
The newly proposed approaches differ in terms of how and when multi-task learning is applied. One choice is fine-tuning an existing pre-trained model on a collection of multiple tasks, i.e. behavioural fine-tuning. This is done by T0 (Sanh et al., 2021), one of the first outcomes of the BigScience workshop, using T5 and FLAN (Wei et al., 2021) using a GPT-3-like pre-trained model. Both papers describe a unified template and instruction format into which they convert existing datasets. BigScience open-sources their collection of prompts here. Both papers report large improvements in terms of zero-shot and few-shot performance compared to state-of-the-art models like T5 and GPT-3.
Min et al. (2021) propose a different fine-tuning setting that optimizes for in-context learning: instead of fine-tuning a model on examples of a task directly, they provide the concatenation of k+1 examples to a model as input x_1, y_1, …, x_k, y_k, x_{k+1} and train the model to predict the label of the k+1-th example, y_{k+1}. They similarly report improvements in zero-shot transfer.
In contrast to the previous approaches, ExT5 (Anonymous et al., 2021) pre-trains a model on a large collection of tasks. They observe that using multiple tasks during pre-training is better than during fine-tuning and that multi-task pre-training combined with MLM is significantly more sample-efficient than just using MLM (see below).
[Figure: SuperGLUE score of ExT5-LARGE vs T5-LARGE as a function of the number of pre-training steps]
On the whole, these papers highlight the benefit of combining self-supervised pre-training with supervised multi-task learning. While multi-task fine-tuned models were always somewhat inferior to single-task models on small task collections such as GLUE—with a few exceptions (Liu et al., 2019; Clark et al., 2019)—multi-task models may soon hold state-of-the-art results on many benchmarks. Given the availability and open-source nature of datasets in a unified format, we can imagine a virtuous cycle where newly created high-quality datasets are used to train more powerful models on increasingly diverse task collections, which could then be used in-the-loop to create more challenging datasets.
In light of the increasingly multi-task nature of such models, what then does it mean to do zero-shot learning? In current training setups, datasets from certain tasks such as NLI are excluded from training in order to ensure a fair zero-shot scenario at test time. As open-source multi-task models trained on many existing tasks become more common, it will be increasingly difficult to guarantee a setting where a model has not seen examples of a similar task. In this context, few-shot learning or the full supervised setting may become the preferred evaluation paradigms.
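
An illustrative (and entirely hypothetical) example of casting different tasks into one text-to-text format in the spirit of T5/T0; the prefixes and examples below are made up.

examples = [
    # (input text, target text) pairs; a single seq2seq model can be trained on all of them
    ("nli premise: A man is cooking. hypothesis: A person prepares food.", "entailment"),
    ("summarize: The committee met on Monday and approved the new budget after a short debate.",
     "The committee approved the new budget."),
    ("translate English to German: The house is wonderful.", "Das Haus ist wunderbar."),
]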

25
Q

What is TF-IDF?

A

TF-IDF, or Term Frequency-Inverse Document Frequency, is a numerical statistic that indicates the importance of a word in a document relative to a collection of documents (corpus), and it is widely used in information retrieval. For a specific document, TF-IDF helps identify that document's keywords. In NLP it is mainly used to extract useful information from documents, to classify and summarize text, and to filter out stop words.

TF is the ratio of the frequency of a term in a document to the total number of terms in that document, whereas IDF measures how rare (and therefore how informative) the term is across the whole collection of documents.

The formula for calculating TF-IDF:

TF(W) = (Frequency of W in a document)/(The total number of terms in the document)

IDF(W) = log_e(The total number of documents/The number of documents having the term W)

A high TF-IDF score means the term is frequent in the document but appears in few documents overall; a low score means the term is rare in the document or common across the whole collection (e.g., a stop word).
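
A minimal pure-Python sketch that follows the formulas above (natural log, no smoothing); real implementations such as scikit-learn's TfidfVectorizer use slightly different variants, and the toy corpus is made up.

import math

docs = [["the", "river", "bank"], ["the", "central", "bank"], ["the", "bank", "bank"]]

def tf(word, doc):
    return doc.count(word) / len(doc)

def idf(word, docs):
    return math.log(len(docs) / sum(1 for d in docs if word in d))

def tf_idf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)

print(tf_idf("bank", docs[2], docs))   # 0.0: frequent in the doc, but it appears in every doc
print(tf_idf("river", docs[0], docs))  # ~0.37: rare across the corpus, so more informative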

26
Q

What is perplexity? What is its place in NLP?

A

Perplexity expresses the degree of confusion a model has when making predictions: it is the exponential of the average per-token cross-entropy, so more entropy = more confusion. Perplexity is used to evaluate language models in NLP; a good language model assigns a higher probability to the correct prediction and therefore has lower perplexity.
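
A tiny worked example, assuming made-up probabilities that a model assigned to each correct next token; perplexity is the exponential of the average negative log-likelihood per token.

import math

token_probs = [0.5, 0.25, 0.125]   # p(correct token) at each position (hypothetical values)
nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
print(math.exp(nll))               # 4.0: on average the model is as unsure as a uniform choice over 4 tokens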

27
Q

Time complexity of LSTM

Time complexity of transformer

A

Time complexity of an LSTM layer: O(seq_length · hidden²)

Time complexity of a transformer self-attention layer: O(seq_length² · hidden)

When the hidden size is larger than the sequence length (which is normally the case), the transformer is faster than the LSTM, and it is also far more parallelizable since it has no sequential recurrence.
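
A quick sanity check of the two per-layer operation counts for typical (illustrative) sizes:

n, d = 128, 768          # sequence length, hidden size
print(n * n * d)         # transformer self-attention: 12,582,912
print(n * d * d)         # LSTM recurrence:            75,497,472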

28
Q

How is AdamW different from Adam?

A

AdamW is Adam with decoupled weight decay: the weight-decay term is applied directly to the weights instead of being added to the gradient as an L2 penalty (which is what plain Adam + L2 regularization does). Models with smaller weights tend to generalize better.

30
Q

Difference between BatchNorm and LayerNorm?

A

BatchNorm — normalizes each feature using the mean and variance computed over the minibatch (per feature/channel).

LayerNorm — normalizes each sample using the mean and variance computed over that sample's features, independently of the other samples in the batch.
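
A small PyTorch sketch of the two normalization axes (the epsilon and the learnable affine parameters of the real layers are omitted, and the shapes are illustrative):

import torch

x = torch.randn(32, 64)   # (batch, features)

# BatchNorm: statistics per feature, computed over the batch dimension.
bn = (x - x.mean(dim=0)) / x.std(dim=0)

# LayerNorm: statistics per sample, computed over that sample's features.
ln = (x - x.mean(dim=1, keepdim=True)) / x.std(dim=1, keepdim=True)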

31
Q

What are the differences between BERT and ALBERT v2?

A

Embedding matrix factorization (helps in reducing the number of parameters)
No dropout
Cross-layer parameter sharing (helps in reducing the number of parameters and acts as regularization)

32
Q

What is the difference between RoBERTa and BERT in terms of pretraining tasks?

A

RoBERTa is pretrained using only masked language modelling (MLM), while BERT is pretrained using two pretraining tasks, namely masked language modelling (MLM) and next sentence prediction (NSP).

33
Q

What are the 3 types of self-supervised learning (SSL) tasks used to pretrain language models?

A

Generative SSL
Like predicting/generating the next token or the masked one.

Contrastive SSL
Like next sentence prediction in BERT or sentence order prediction in ALBERT (are the sentences swapped or not?).

Adversarial SSL
The model learns by identifying whether the tokens in the input sentence have been replaced, shuffled, or randomly substituted, e.g., replaced token detection (RTD) in ELECTRA.

34
Q

Which famous model uses adversarial self-supervision?

A

Electra

Adversarial SSL allows the model to learn by identifying whether the tokens in the input sentence are replaced, shuffled, or randomly substituted. Replaced token detection (RTD) in ELECTRA [5], shuffled token detection (STD) [55], and random token substitution (RTS) [56] are examples of adversarial SSL.

35
Q

What are some ballpark vocabulary sizes of big language models?

A

For example, BERT uses a WordPiece vocabulary of size around 30K, RoBERTa uses a bBPE vocabulary of size around 50K, XLM [63] uses a BPE vocabulary of size 95K, mBERT [2] uses a WordPiece vocabulary of size 110K, and XLM-R [64] and mBART [65] use SentencePiece vocabularies of size 250K.

Multilingual → bigger vocabulary.

36
Q

How big (in GB) are the Wikipedia + BookCorpus, CC-100, and C4 corpora?

A

The BERT model is pretrained using text from Wikipedia and BookCorpus, which amounts to 16GB [2]. Further research showed that the performance of the model can be increased by using larger pretraining datasets [3], [4]. This triggered the development of much larger datasets, especially from Common Crawl. For example, the C4 dataset contains around 750GB of text [6], while the CC-100 corpus includes around 2.5TB of text [64]. Multilingual T-PTLMs like mBERT [2], IndT5 [87], IndoBART [88], and XLM-R [64] are pretrained using only multilingual datasets. Some models like XLM [63], XLM-E [89], InfoXLM [90], and mT6 [91] are pretrained using both multilingual and parallel datasets.

37
Q

What is a drawback of continual pretraining (CPT), i.e., starting from an already-pretrained model and continuing training on new data?

A

The lack of target domain-specific vocabulary is a drawback of CPT when the target domain contains many domain-specific words. For example, BioBERT [45] is initialized from general BERT and further pretrained on biomedical text. Though the language model is adapted to the biomedical domain, the vocabulary, which is learned over general-domain text, does not include many of the domain-specific words. As a result, domain-specific words are split into a number of sub-words, which hinders model learning and degrades performance in downstream tasks.

One solution can be to add domain-specific vocabulary and randomly initialize only the new embeddings, while keeping the other layers.

38
Q

List the possible pretraining tasks for language models

A

Causal Language Modeling (CLM)

  • CLM, or simply unidirectional LM, predicts the next word based on the context. The unidirectional LM can handle the sequence from left-to-right or right-to-left. In left-to-right LM the context includes all the words on the left side, while in right-to-left LM the context includes all the words on the right side. GPT-1 [1] is the first transformer-based PTLM to use CLM (left-to-right) as a pretraining task.

Masked Language Modeling (MLM)

The main drawback of CLM is the inability to leverage both contexts. Bidirectional contextual information is much better than unidirectional context information for encoding token representations. (A sketch of the BERT-style masking procedure follows this list.)

The next four variants are all very similar: they corrupt the input, but instead of inserting the special [MASK] token they replace words, shuffle them, etc.

Replaced Token Detection (RTD)
Shuffled Token Detection (STD)
Swapped Language Modeling (SLM)
Random Token Substitution (RTS)

Translation Language Modeling (TLM)
Used by XLM and XNLG; like MLM but with parallel data.

Alternate Language Modeling (ALM)
Often better than TLM; it uses code-switched utterances.

Next Sentence Prediction (NSP)

Sentence Order Prediction (SOP)
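
A minimal sketch of BERT-style input corruption for MLM; the 80/10/10 replacement scheme follows Devlin et al. (2019), while the function name and the token representation are illustrative.

import random

def mask_tokens(tokens, vocab, mask_token="[MASK]", mlm_prob=0.15):
    inputs, labels = [], []
    for tok in tokens:
        if random.random() < mlm_prob:
            labels.append(tok)                       # the model must predict the original token here
            r = random.random()
            if r < 0.8:
                inputs.append(mask_token)            # 80%: replace with [MASK]
            elif r < 0.9:
                inputs.append(random.choice(vocab))  # 10%: replace with a random token
            else:
                inputs.append(tok)                   # 10%: keep the original token
        else:
            inputs.append(tok)
            labels.append(None)                      # position ignored by the MLM loss
    return inputs, labels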

39
Q

What are the drawbacks of masked language modelling (MLM)?

A

However, MLM has two drawbacks: a) it provides less training signal – the model learns from only 15% of the tokens – and b) the model sees the special [MASK] token only during pretraining, which results in a discrepancy between pretraining and fine-tuning.

40
Q

How can you train a generative (encoder-decoder) model like T5 on MLM-style tasks? What is the new pretraining task you have to use?

A

Sequence-to-Sequence LM (Seq2SeqLM)- MLM is approached as a token level classification task over the masked tokens i.e., original words are predicted by feeding the masked token vectors to a softmax layer over the vocabulary. Seq2SeqLM is an extension of standard MLM to pretrain encoder-decoder-based models like T5 [6], mT5 [99] and MASS [114]. In the case of MLM, the context includes all the tokens in the input sequence whereas in Seq2SeqLM, the context includes all the words in the input masked sequence and the left side words in the predicted target sequence. With masked sequence as input to the encoder, the decoder predicts the masked words from left to right sequentially.

Another option is the Denoising Auto-Encoder (DAE) objective, as in BART.

41
Q

What are the 3 reasons to use character or subword (word-piece) embeddings in big language models?

A

a) Small vocabulary size in character and sub-word embeddings compared to word embeddings. The vocabulary of word embeddings consists of all the unique words (or all the words above the cut-off frequency) in the pretraining corpus, whereas the vocabulary in character embedding models consists of all the characters, and the vocabulary in sub-word embedding models consists of all the characters, frequently occurring sub-words, and words. The size of the vocabulary also determines the overall size of the pretrained language model [116].

b) They can represent any word and hence overcome the problem of out-of-vocabulary (OOV) words, which is a serious problem with word embeddings.
c) They can encode fine-grained information at the character or subword level in the word representation.

42
Q

What is the difference between the Unigram tokenizer and WordPiece/BPE?

A

Tokenizers like WordPiece, BPE, and bBPE generate the vocabulary by starting with a base vocabulary containing only the characters and iteratively augmenting it until a predefined size is reached.

Unigram does the exact opposite: it starts with a large vocabulary and arrives at a vocabulary of predefined size by iteratively removing tokens.
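
A short sketch of training a BPE vocabulary bottom-up (characters first, then iterative merges), assuming the Hugging Face tokenizers library is installed; the toy corpus and vocabulary size are made up.

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

corpus = ["the river bank", "the central bank", "low and lower rates"]   # toy data

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=100, special_tokens=["[UNK]"])           # grow the vocabulary up to a fixed size
tokenizer.train_from_iterator(corpus, trainer=trainer)

print(tokenizer.encode("lowest bank").tokens)   # unseen words get split into learned subwords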

43
Q

What is the main difference between SentencePiece and other tokenizers?

A

Tokenizers like WordPiece and BPE assume space is a word separator in the input text, which is not true in all cases (e.g., Chinese or Japanese). To overcome this, the SentencePiece tokenizer treats space as a regular character and then generates the vocabulary using BPE or Unigram.

44
Q

What are some encoder based models?

A

In general, an encoder-based T-PTLM consists of an embedding layer followed by a stack of encoder layers. For example, the BERT-base model consists of 12 encoder layers while the BERT-large model consists of 24 encoder layers [2]. The output from the last encoder layer is treated as the final contextual representation of the input sequence. In general, encoder-based models like BERT [2], XLNet [3], RoBERTa [4], ELECTRA [5], ALBERT [7] and XLM-E [89] are used in NLU tasks.

45
Q

What are some decoder based models?

A

A decoder-based T-PTLM consists of an embedding layer followed by a stack of decoder layers. Here the transformer decoder layer consists of only masked multi-head attention and feed-forward network layers. The multi-head attention module which performs encoder-decoder cross attention is removed. In general, decoder-based models like GPT-1 [1], GPT-2 [61] and GPT-3 [27] are used in NLG tasks.

46
Q

What are some encoder-decoder based models?

A

Encoder-decoder models are used for NLG tasks like translation, text summarization, etc. MASS [114] is the first encoder-decoder based T-PTLM. It is pretrained using Seq2SeqLM, an extension of MLM to encoder-decoder architectures. Following MASS, a number of encoder-decoder models like T5 [6], mT5 [99], mT6 [91], BART [8], mBART [65], PLBART [37], PEGASUS [9] and PALM [156] have been proposed in recent times. For example, models like MASS and BART use a bidirectional encoder over corrupted text and a left-to-right autoregressive decoder to reconstruct the original text.

47
Q

What are the most commonly used benchmarks for evaluating the natural language understanding ability of pretrained language models?

A

GLUE [260] and SuperGLUE [261] are the most commonly used benchmarks to evaluate the natural language understanding ability of pretrained language models. The GLUE benchmark consists of nine tasks, which include both single-sentence and sentence-pair tasks. With rapid progress in model development, models achieved good performance on the GLUE benchmark, leaving little room for further improvement [261].

48
Q

What is the main problem of vanilla fine-tuning? How can you limit it?

A

Overfitting, especially when the fine-tuning dataset is small. It can be limited with regularization, early stopping, or parameter-efficient approaches such as adapters (see the adapter card below).

49
Q

What are adapters in the context of parameter efficient fine tuning?

A

Adapters [234] – The adapter is a small trainable module proposed by Houlsby et al. [234] to fine-tune pretrained language models in a parameter-efficient way. The adapter module consists of two feed-forward layers with a non-linear layer in between and a skip connection. It projects the input vector into a small vector and then projects it back into the original dimension using the two feed-forward layers and the non-linear layer. Let x be the original vector dimension and y be the small vector dimension; then the total number of parameters in the adapter module is 2xy + x + y. By setting y << x, we keep the number of parameters in the adapter module small. The small vector dimension (y) provides a trade-off between performance and parameter efficiency. Adapters are added to each of the sub-layers in a transformer layer before layer normalization. During fine-tuning, only the parameters of the adapters, of layer normalization in each transformer layer, and of the task-specific layers are updated, while the rest of the parameters of the pretrained model are kept frozen.
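
A hypothetical PyTorch sketch of the adapter block (down-projection, non-linearity, up-projection, skip connection); the names x_dim/y_dim mirror x and y above, and the GELU non-linearity and default sizes are illustrative.

import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, x_dim=768, y_dim=64):        # y_dim << x_dim for parameter efficiency
        super().__init__()
        self.down = nn.Linear(x_dim, y_dim)         # x*y weights + y biases
        self.up = nn.Linear(y_dim, x_dim)           # y*x weights + x biases  ->  2xy + x + y in total
        self.act = nn.GELU()

    def forward(self, h):
        return h + self.up(self.act(self.down(h)))  # skip connection around the bottleneck

print(Adapter()(torch.randn(2, 16, 768)).shape)     # torch.Size([2, 16, 768])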

50
Q

What are the typical techniques for model compression and acceleration in NLP big language models?

A

Parameter pruning and sharing: these methods focus on removing inessential parameters from a deep neural network without any significant effect on performance.

Low-rank factorization: these methods identify redundant parameters of deep neural networks by employing matrix and tensor decomposition.

Transferred compact convolutional filters: these methods remove inessential parameters by transferring or compressing the convolutional filters.

Knowledge distillation (KD): these methods distill the knowledge from a larger deep neural network into a small network.

51
Q

What is the complexity of a transformers layer?

A

This layer has a computational complexity of O(n² · d), which scales quadratically with the length of the input (n) and linearly with the model hidden size (d).

52
Q

Apart from the standard teacher-student setup, what other knowledge distillation schemes can be used to compress models?

A

Mutual learning (Zhang et al., 2018b),
assistant teaching (Mirzadeh et al., 2020),
lifelong learning (Zhai et al., 2019), and
self-learning (Yuan et al., 2020).

53
Q

Apart from making the model smaller, what else can knowledge distillation be used for?

A

Furthermore, the knowledge transfer from one model to another in knowledge distillation can be extended to other tasks, such as adversarial attacks (Papernot et al., 2016), data augmentation (Lee et al., 2019a; Gordon and Duh, 2019), and data privacy and security (Wang et al., 2019a).

There is also dataset distillation, which transfers the knowledge from a large dataset into a small dataset to reduce the training load of deep models (Wang et al., 2018c; Bohdal et al., 2020).

54
Q

What are the pros and cons of SLU systems split into separate ASR and NLU components?

A

This pipeline approach allows the two components to be developed separately, with ASR trained on labeled audio data and NLU trained on text-only data, leading to faster iteration speed and easier maintainability. On the other hand, this approach also comes with several limitations, such as being prone to error propagation from ASR to NLU, lack of acoustic information which limits NLU’s accuracy, and lack of parameter sharing which makes it difficult to bring SLU on-device.

55
Q

What are the main limitations of E2E ASR+NLU (SLU) systems?

A

Firstly, E2E systems are often treated as a black box of audio to semantics without the ability to output transcripts [1–5]. In practice, transcript generation is a requirement for many speech-based applications; in addition, having access to the transcripts is beneficial for debugging and understanding the system’s behavior. Secondly, E2E SLU often targets domain/intent prediction [1, 2] or slot tagging [3–5], but does not appear to solve complex understanding use cases.

Thirdly, certain E2E SLU solutions have been proposed that combine text and audio data via a neural interface to produce the semantic parse and thus retain the ability to output transcripts [6,7]. However, these systems are understudied from an efficiency, robustness, and scalability perspective.