NLP Flashcards
How big are SOTA NLP models in terms of parameters?
For instance, models like BERT-Large (Devlin et al., 2019), GPT-2 (Radford et al., 2019), Megatron (Shoeybi et al., 2019) and T5 (Raffel et al., 2019) have 340M, 1.5B, 8.3B and 11B parameters respectively.
What is knowledge distillation (general intro)?
Knowledge distillation (Hinton et al., 2015; Ba and Caruana, 2014), earlier used in computer vision, is one of the techniques for compressing huge neural networks into smaller ones. Shallow models (called students) are trained to mimic the output of huge models (called teachers) on a transfer set. Similar approaches have recently been adopted for language model distillation.
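A minimal sketch of the usual distillation objective, assuming PyTorch: the student is trained to match the teacher's temperature-softened output distribution, optionally mixed with the standard hard-label loss.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: KL divergence between temperature-scaled distributions,
    # rescaled by T^2 as in Hinton et al. (2015).
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: ordinary cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```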
What are some problems with current knowledge distillation techniques?
These methods are constrained by architectural considerations such as the embedding dimension in BERT and the Transformer architecture itself. Additionally, most of the above works are geared toward distilling language models for GLUE tasks.
What are possible solutions (e.g., XtremeDistil)?
Some concurrent works (Turc et al., 2019; Zhao et al., 2019) adopt pre-training or dual training to distil students of arbitrary architecture. However, pre-training is expensive in terms of time and computational resources.
Solution: XtremeDistil, https://www.aclweb.org/anthology/2020.acl-main.202.pdf
What is the GLUE benchmark?
The General Language Understanding Evaluation (GLUE) benchmark is a collection of resources for training, evaluating, and analyzing natural language understanding systems. GLUE consists of:
A benchmark of nine sentence- or sentence-pair language understanding tasks built on established existing datasets and selected to cover a diverse range of dataset sizes, text genres, and degrees of difficulty
For example, CoLA (the Corpus of Linguistic Acceptability), among others.
https://docs.google.com/spreadsheets/d/1BrOdjJgky7FfeiwC_VDURZuRPUFUAz_jfczPPT35P00/edit#gid=0
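As a quick illustration, the GLUE tasks can be loaded in this unified format via the Hugging Face datasets library (assumed here); a CoLA training example has sentence, label and idx fields.

```python
from datasets import load_dataset

# Load one GLUE task (CoLA: linguistic acceptability) and inspect a training example.
cola = load_dataset("glue", "cola")
print(cola["train"][0])  # {'sentence': ..., 'label': 0 or 1, 'idx': ...}
```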
What is SQuAD 2.0?
Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.
SQuAD2.0 combines the 100,000 questions in SQuAD1.1 with over 50,000 unanswerable questions written adversarially by crowdworkers to look similar to answerable ones. To do well on SQuAD2.0, systems must not only answer questions when possible, but also determine when no answer is supported by the paragraph and abstain from answering.
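A schematic sketch of what SQuAD 2.0 examples look like; the field names follow the official format, while the questions, answers and offsets below are made up for illustration.

```python
# Answerable: the answer is a span of the reading passage.
answerable = {
    "question": "Where is the Eiffel Tower located?",
    "answers": [{"text": "Paris", "answer_start": 31}],
    "is_impossible": False,
}

# Unanswerable (new in SQuAD 2.0): the system should abstain.
unanswerable = {
    "question": "When was the Eiffel Tower demolished?",
    "answers": [],  # no span in the passage supports an answer
    "is_impossible": True,
}
```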
In the context of knowledge distillation, what is a teacher assistant?
Following Mirzadeh et al. (2019), we introduce a teacher assistant (i.e., an intermediate-size student model) to further improve the model performance of smaller students. Assume the teacher model is an L-layer Transformer with hidden size d_h and the student model is an M-layer Transformer with hidden size d'_h. For smaller students (M ≤ L/2, d'_h ≤ d_h/2), we first distill the teacher into a teacher assistant with L Transformer layers and hidden size d'_h (the hidden size of the student). The assistant model is then used as the teacher to guide the training of the final student.
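A structural sketch of this two-stage setup, assuming PyTorch and using toy stand-in encoders; the layer and hidden-size values are illustrative, and the distillation objective itself is the one sketched in the earlier card.

```python
import torch.nn as nn

def make_encoder(num_layers, hidden_size):
    # Toy stand-in for a Transformer with a given depth and hidden size.
    layer = nn.TransformerEncoderLayer(d_model=hidden_size, nhead=4, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=num_layers)

L, M = 12, 6                 # teacher depth, student depth (M <= L/2)
d_h, d_h_prime = 768, 384    # teacher width, student width (d'_h <= d_h/2)

teacher   = make_encoder(L, d_h)
assistant = make_encoder(L, d_h_prime)   # teacher's depth, student's width
student   = make_encoder(M, d_h_prime)

# Stage 1: distill teacher   -> assistant
# Stage 2: distill assistant -> student (the assistant now plays the teacher role)
```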
What is the size of BERT-Base?
BERT-Base (Devlin et al., 2018) is a 12-layer Transformer with 768 hidden size and 12 attention heads, and contains about 109M parameters.
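A back-of-the-envelope check of that figure, using the standard BERT-Base configuration (minor terms may be counted slightly differently elsewhere):

```python
# Rough parameter count for BERT-Base: vocab 30522, max positions 512,
# hidden 768, FFN 3072, 12 layers.
V, P, H, F, L = 30522, 512, 768, 3072, 12

embeddings = V * H + P * H + 2 * H + 2 * H      # token + position + segment + LayerNorm
per_layer = (
    4 * (H * H + H)    # Q, K, V and attention output projections (with biases)
    + (H * F + F)      # feed-forward up-projection
    + (F * H + H)      # feed-forward down-projection
    + 2 * 2 * H        # two LayerNorms
)
pooler = H * H + H

total = embeddings + L * per_layer + pooler
print(f"{total / 1e6:.1f}M parameters")  # ~109.5M
```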
What is the word tokenization algorithm used in BERT called?
WordPiece
What are other possible tokenization algorithms?
SentencePiece, which combines BPE- and WordPiece-style subword approaches.
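For illustration, BERT's WordPiece tokenizer can be inspected with the Hugging Face transformers library (assumed here):

```python
from transformers import BertTokenizer

tok = BertTokenizer.from_pretrained("bert-base-uncased")
# Rare or unseen words are split into subword pieces, marked with the '##' prefix.
print(tok.tokenize("Knowledge distillation compresses transformers"))
```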
What is the order of magnitude of the vocabulary size of BERT-Base and the maximum sequence length accepted?
The vocabulary size is 30,522. The maximum sequence length is 512.
What is the architecture of BERT-Large?
BERT-Base – 12 layers (Transformer blocks), 12 attention heads, and 110 million parameters.
BERT-Large – 24 layers, 16 attention heads, and 340 million parameters.
What are the benefits in production of having a multilingual model?
1) Fewer errors and less (privacy-)critical data used, because you can leverage transfer learning across languages.
2) Better production efficiency: smaller team size and smaller data size, since one model serves many languages.
In terms of encoder/decoder architecture, what are the 3 possible types? Give an example for each.
1) Decoder-only (GPT)
2) Encoder-decoder (T5)
3) Encoder-only (BERT)
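A minimal sketch loading one representative model of each family with the Hugging Face transformers library (assumed here):

```python
from transformers import AutoModel, AutoModelForCausalLM, AutoModelForSeq2SeqLM

encoder_only    = AutoModel.from_pretrained("bert-base-uncased")      # encoder-only (BERT)
decoder_only    = AutoModelForCausalLM.from_pretrained("gpt2")        # decoder-only (GPT)
encoder_decoder = AutoModelForSeq2SeqLM.from_pretrained("t5-small")   # encoder-decoder (T5)
```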
Is BERT encoder-only, decoder-only, or encoder-decoder?
Encoder-only
Is GPT encoder-only, decoder-only, or encoder-decoder?
Decoder-only
Is T5 encoder-only, decoder-only, or encoder-decoder?
Encoder-decoder
How are GPT and decoder-only architectures trained, and what are their main applications?
They are trained to generate the next word given the previous ones (autoregressive decoding).
Great for all generation tasks.
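An illustrative example of autoregressive generation with GPT-2 via the transformers library (assumed here); the model simply keeps predicting the next token:

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

input_ids = tok("Knowledge distillation is", return_tensors="pt").input_ids
output_ids = model.generate(input_ids, max_length=20, do_sample=False)  # greedy next-token decoding
print(tok.decode(output_ids[0]))
```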
Are decoder models like GPT SOTA for NLU tasks without fine-tuning?
No, BERT (encoder-only) is.
How are BERT and encoder-only architectures trained, and what are their main applications?
Trained as a masked language model. The Transformer encoder it builds on was originally used for machine translation; in BERT it is great for obtaining contextual embeddings, and the encoder is very versatile for transfer learning (classification, NER, etc.).
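An illustrative masked-language-model prediction with BERT, using the transformers fill-mask pipeline (assumed here):

```python
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")
# BERT was trained to recover masked tokens, so it can rank candidates for [MASK].
for pred in unmasker("The capital of France is [MASK]."):
    print(pred["token_str"], round(pred["score"], 3))
```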
Do Transformers have positional awareness of the tokens they receive as input?
No, that is why you need position embeddings. LSTMs, by contrast, do have positional awareness.
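A minimal sketch of learned position embeddings (the scheme used in BERT), assuming PyTorch; the sizes are the BERT-Base ones from the earlier cards:

```python
import torch
import torch.nn as nn

vocab_size, max_len, hidden = 30522, 512, 768
token_emb = nn.Embedding(vocab_size, hidden)
pos_emb = nn.Embedding(max_len, hidden)

input_ids = torch.randint(0, vocab_size, (1, 10))          # a sequence of 10 token ids
positions = torch.arange(input_ids.size(1)).unsqueeze(0)   # [0, 1, ..., 9]

# Without pos_emb the model would see an unordered bag of tokens;
# adding it injects information about token order.
x = token_emb(input_ids) + pos_emb(positions)
```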
Do LSTMs have positional awareness of the tokens they receive as input?
Yes, contrary to Transformers, they do, because they process tokens sequentially.
Why does distillation work so well?
The idea is that language modelling and language generation are much harder tasks than other NLU tasks like NER or sentiment analysis, so once a model has learned those patterns it is in a great place to handle the easier tasks.
What was the problem with older word embeddings like Word2Vec, GloVe and fastText?
They are not context-aware, so a word has the same representation no matter how it is used: river bank vs. financial bank.
They have one vector per word, but words have several dimensions/aspects: semantics, connotations, etc.
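An illustrative check of context-awareness with BERT (transformers and PyTorch assumed): the same surface word "bank" gets different vectors in different contexts, unlike a static embedding.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence):
    inputs = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]              # (seq_len, 768)
    idx = inputs["input_ids"][0].tolist().index(tok.convert_tokens_to_ids("bank"))
    return hidden[idx]

v_river = bank_vector("He sat on the bank of the river.")
v_money = bank_vector("She deposited money at the bank.")
# Cosine similarity well below 1.0: the contextual vectors differ.
print(torch.cosine_similarity(v_river, v_money, dim=0).item())
```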
What is multi-task learning, and is it more effective than single-task learning? When?
Pre-training + Massive Multi-tasking 💑
Multi-task learning (MTL), training a model on several tasks at once and sharing information between them, is a general method that is fundamental to training neural networks. Rich Caruana's 1997 paper is one of the best introductions to this topic and as relevant today as it was back then. For more recent overviews, you can check out my survey from 2017 or a survey from 2020 that I enjoyed.
Research in multi-task learning has long shown that models trained on many tasks learn representations that generalize better to new ones. A common problem in multi-task learning, however, is minimizing negative transfer, i.e. how to make sure that tasks that are dissimilar do not hurt each other.
In recent years, despite much work on alternative training objectives, the NLP community has gravitated to a single pre-training objective to rule them all: masked language modelling (MLM). Much recent work has focused on ways to adapt and improve it (e.g., Levine et al., 2021). Even the next-sentence-prediction objective used in BERT has been slowly phased out (Aroca-Ouellette & Rudzicz, 2020).
Recently, there has been a flurry of papers that show not only that multi-task learning helps pre-trained models, but that gains are larger when more tasks are used. Such massive multi-task learning settings cover up to around 100 tasks, going beyond earlier work that covered around 50 tasks (Aghajanyan et al., 2021).
A key reason for this convergence of papers is that multi-task learning is much easier with recent models, even across many tasks. This is due to the fact that many recent models such as T5 and GPT-3 use a text-to-text format. Gone are the days of hand-engineered task-specific loss functions for multi-task learning. Instead, each task only needs to be expressed in a suitable text-to-text format and models will be able to learn from it, without any changes to the underlying model.
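To make this concrete, here is a sketch (with hypothetical prompt strings) of how two different tasks collapse into the same text-to-text interface in the spirit of T5:

```python
examples = [
    # Sentiment classification
    {"input": "sst2 sentence: the film was a delight", "target": "positive"},
    # Natural language inference
    {"input": "mnli premise: A man is cooking. hypothesis: A person prepares food.",
     "target": "entailment"},
]
# Every task shares the same (text in, text out) interface, so one seq2seq model
# can be trained on all of them without task-specific heads or loss functions.
```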
The newly proposed approaches differ in terms of how and when multi-task learning is applied. One choice is fine-tuning an existing pre-trained model on a collection of multiple tasks, i.e. behavioural fine-tuning. This is done by T0 (Sanh et al., 2021), one of the first outcomes of the BigScience workshop, using T5 and FLAN (Wei et al., 2021) using a GPT-3-like pre-trained model. Both papers describe a unified template and instruction format into which they convert existing datasets. BigScience open-sources their collection of prompts here. Both papers report large improvements in terms of zero-shot and few-shot performance compared to state-of-the-art models like T5 and GPT-3.
Min et al. (2021) propose a different fine-tuning setting that optimizes for in-context learning: instead of fine-tuning a model on examples of a task directly, they provide the concatenation of k+1 examples to a model as input x_1, y_1, …, x_k, y_k, x_{k+1} and train the model to predict the label of the k+1-th example, y_{k+1}. They similarly report improvements in zero-shot transfer.
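A sketch of that in-context training input (the examples and labels below are made up for illustration): k labelled examples are concatenated with the (k+1)-th input, and the model is trained to produce the (k+1)-th label.

```python
k_examples = [("great movie", "positive"), ("dull and slow", "negative")]   # (x_i, y_i) pairs
x_next = "a true masterpiece"                                               # x_{k+1}

source = " ".join(f"{x} {y}" for x, y in k_examples) + f" {x_next}"
target = "positive"                                                         # y_{k+1}
print(source, "->", target)
```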
In contrast to the previous approaches, ExT5 (Anonymous et al., 2021) pre-trains a model on a large collection of tasks. They observe that using multiple tasks during pre-training is better than during fine-tuning and that multi-task pre-training combined with MLM is significantly more sample-efficient than just using MLM (see below).
Figure: SuperGLUE score of ExT5-LARGE vs T5-LARGE as a function of the number of pre-training steps.
On the whole, these papers highlight the benefit of combining self-supervised pre-training with supervised multi-task learning. While multi-task fine-tuned models were always somewhat inferior to single-task models on small task collections such as GLUE—with a few exceptions (Liu et al., 2019; Clark et al., 2019)—multi-task models may soon hold state-of-the-art results on many benchmarks. Given the availability and open-source nature of datasets in a unified format, we can imagine a virtuous cycle where newly created high-quality datasets are used to train more powerful models on increasingly diverse task collections, which could then be used in-the-loop to create more challenging datasets.
In light of the increasingly multi-task nature of such models, what then does it mean to do zero-shot learning? In current training setups, datasets from certain tasks such as NLI are excluded from training in order to ensure a fair zero-shot scenario at test time. As open-source multi-task models trained on many existing tasks become more common, it will be increasingly difficult to guarantee a setting where a model has not seen examples of a similar task. In this context, few-shot learning or the full supervised setting may become the preferred evaluation paradigms.