Ch10 NLP Deep Dive End-of-Chapter Questions Flashcards
What is self-supervised learning?
Training a model using labels that are embedded in the independent variable itself, rather than requiring external labels
What is a language model?
A language model is a model that has been trained to guess the next word in a text via self-supervised learning.
What are self-supervised models usually used for?
These models are usually used for transfer learning, i.e. as pre-trained models that are then fine-tuned for another task (such as classification)
Why do we fine-tune language models?
Fine-tuning a language model with the text data you want to use significantly boosts model performance. Your data may use different vocabulary or writing style. Fine-tuning familiarizes the language model with the specific traits of your text data.
What are the 3 steps to create a state-of-the-art text classifier?
- The language model is pre-trained on a large corpus, e.g. Wikipedia
- The language model is fine-tuned on the type of text that will be classified
- The fine-tuned language model is fine-tuned once more, as a classifier (sketched below)
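A minimal sketch of all three steps with fastai's high-level API, following the chapter's IMDb example (the epoch counts, and using the test split for validation, are illustrative):

```python
from fastai.text.all import *

path = untar_data(URLs.IMDB)

# Step 1 is already done for us: AWD_LSTM ships pre-trained on Wikipedia.

# Step 2: fine-tune the language model on the IMDb text itself
dls_lm = TextDataLoaders.from_folder(path, valid='test', is_lm=True)
learn_lm = language_model_learner(dls_lm, AWD_LSTM, metrics=accuracy)
learn_lm.fine_tune(1)
learn_lm.save_encoder('finetuned')   # keep everything but the final layer

# Step 3: fine-tune a classifier on top of the fine-tuned encoder
dls_clas = TextDataLoaders.from_folder(path, valid='test',
                                       text_vocab=dls_lm.vocab)
learn_clas = text_classifier_learner(dls_clas, AWD_LSTM, metrics=accuracy)
learn_clas = learn_clas.load_encoder('finetuned')
learn_clas.fine_tune(1)
```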
What are the 3 steps to prepare your data for a language model?
- Tokenization
- Numericalization
- Create a dependent variable that is offset from the independent variable by one token, since the model predicts the next word in the text (see the sketch below)
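A sketch of the three steps using fastai's low-level pieces (the sample sentence and min_freq value are just for illustration; in practice the DataBlock API handles all of this):

```python
from fastai.text.all import *

txt = "This movie was great!"

# 1. Tokenization: text -> list of tokens
tok = Tokenizer(WordTokenizer())
toks = tok(txt)

# 2. Numericalization: tokens -> indices into a vocab
num = Numericalize(min_freq=1)   # default min_freq is 3; lowered for this toy corpus
num.setup([toks])
nums = num(toks)

# 3. Offset: at every position, the target is the next token
xs, ys = nums[:-1], nums[1:]
```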
What is tokenization? Why do we need it?
Tokenization converts the text into a list of tokens (words, characters, or substrings, depending on the approach). Since the language model will predict the next word in the text, it is necessary to define what constitutes a “word”.
Name 3 approaches to tokenization
- Word-based: Split on spaces, plus additional rules for punctuation, etc.
- Subword-based: Split words into smaller parts based on the most commonly occurring substrings (particularly useful for languages that don’t use spaces the way English does)
- Character-based: Split a sentence into its individual characters (all three are illustrated below)
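A sketch close to the chapter’s own code (the subword example needs the sentencepiece package; the 2,000-file corpus slice is arbitrary):

```python
from fastai.text.all import *

# a small corpus to work with, as in the chapter
files = get_text_files(untar_data(URLs.IMDB), folders=['train', 'test', 'unsup'])
txts = L(o.open().read() for o in files[:2000])
txt = txts[0]

# Word-based: spaCy splits on spaces plus rules for punctuation, contractions, etc.
print(first(WordTokenizer()([txt]))[:15])

# Character-based: plain Python is enough
print(list(txt)[:15])

# Subword-based: SentencePiece learns the most common substrings from the corpus
sub = SubwordTokenizer(vocab_sz=1000)
sub.setup(txts)
print(first(sub([txt]))[:15])
```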
What is xxbos? How does it help the model?
xxbos marks the beginning of a stream (“bos” stands for “beginning of stream”); e.g., in a dataset of reviews, it marks the start of each review.
By recognizing the start token, the model will be able to learn that it needs to “forget” what was said previously and focus on upcoming words.
Why does fastai’s tokenizer replace repeated characters with a token giving the number of repetitions, followed by the repeated character?
It lets the model’s embedding matrix encode information about the general concept of repeated punctuation, and avoids having to encode every possible repetition length with a separate token.
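fastai implements this as the replace_rep preprocessing rule; a quick demonstration (the exact spacing of the output may differ slightly):

```python
from fastai.text.all import replace_rep

# characters repeated 3+ times become: xxrep <count> <character>
print(replace_rep("This was soooo cool!!!"))
# -> 'This was s xxrep 4 o  cool xxrep 3 ! '
```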
What is numericalization?
Numericalization involves assigning a number to each unique token in the vocab
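A toy demonstration with fastai’s Numericalize (min_freq lowered from its default of 3 so every token survives):

```python
from fastai.text.all import *

toks = ['xxbos', 'this', 'movie', 'was', 'great', 'great']
num = Numericalize(min_freq=1)
num.setup([toks])
print(num.vocab)   # special tokens first, then the corpus tokens
print(num(toks))   # TensorText of each token's index in the vocab
```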
Why would you want there to be words that are replaced with the “unknown word” token in NLP?
It’s useful to encode very low-frequency words with the unknown word token (xxunk): there may not be enough data to train the model to use those rare words well, and dropping them avoids an overly large embedding matrix, which would slow down training and use too much memory.
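With fastai’s Numericalize, the min_freq and max_vocab arguments control which words fall back to xxunk; a toy example:

```python
from fastai.text.all import *

toks = ['the'] * 3 + ['movie'] * 3 + ['transcendent']   # 'transcendent' is rare
num = Numericalize(min_freq=3)   # tokens seen fewer than 3 times are dropped
num.setup([toks])
print('transcendent' in num.vocab)   # False: no embedding row of its own
print(num(['transcendent']))         # mapped to the index of xxunk instead
```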
Why do we need padding for text classification? Why don’t we need it for language modeling?
All the items in a batch are put into a single tensor. Since tensors cannot be jagged, every item has to have the same length. We can achieve this by padding the length of each item to match the length of the longest item in the batch.
It is not required for language modeling since the documents are all concatenated.
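fastai applies padding automatically (grouping texts of similar length to minimize it); the underlying idea in plain PyTorch, with 0 standing in for the pad token’s index:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# three numericalized documents of different lengths
docs = [torch.tensor([4, 9, 7]), torch.tensor([5, 2]), torch.tensor([8, 3, 6, 1])]
print(pad_sequence(docs, batch_first=True, padding_value=0))
# tensor([[4, 9, 7, 0],
#         [5, 2, 0, 0],
#         [8, 3, 6, 1]])
```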
What does an embedding matrix for NLP contain? What is its shape?
An embedding matrix contains one row of weights for each token in the vocabulary.
The shape would be (vocab_size, embedding_size), where vocab_size is the length of the vocabulary, and embedding_size is an arbitrary number defining the number of latent factors of the tokens.
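A quick illustration in PyTorch; the sizes are arbitrary (400 happens to be the AWD-LSTM default embedding size):

```python
import torch
import torch.nn as nn

vocab_size, embedding_size = 60000, 400
emb = nn.Embedding(vocab_size, embedding_size)
print(emb.weight.shape)    # torch.Size([60000, 400]): one row of weights per token

token_ids = torch.tensor([2, 5, 11])   # numericalized tokens
print(emb(token_ids).shape)            # torch.Size([3, 400]): one row looked up per token
```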
Context: Putting texts into batches to fine-tune a language model
With a batch size of 64, the first row of the tensor representing the first batch contains the first 64 tokens for the dataset. What does the second row of that tensor contain? What does the first row of the second batch contain?
a. The dataset is split into 64 mini-streams (the batch size)
b. Each batch has 64 rows (the batch size) and 64 columns (the sequence length)
c. The first row of the first batch contains the beginning of the first mini-stream (tokens 1-64)
d. The second row of the first batch contains the beginning of the second mini-stream (its tokens 1-64)
e. The first row of the second batch contains the second chunk of the first mini-stream (tokens 65-128); a toy version is sketched below
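A toy reconstruction of that batching scheme in plain Python, shrunk to a batch size and sequence length of 4 so the numbers are easy to check (fastai’s LMDataLoader does this for real):

```python
stream = list(range(1, 33))   # pretend these are 32 numericalized tokens
bs, seq_len = 4, 4

# split the stream into `bs` contiguous mini-streams
n = len(stream) // bs
mini_streams = [stream[i*n:(i+1)*n] for i in range(bs)]

# batch k takes the k-th chunk of seq_len tokens from every mini-stream
batches = [[ms[k*seq_len:(k+1)*seq_len] for ms in mini_streams]
           for k in range(n // seq_len)]

print(batches[0][0])   # [1, 2, 3, 4]    first row of first batch
print(batches[0][1])   # [9, 10, 11, 12] second row: start of mini-stream 2
print(batches[1][0])   # [5, 6, 7, 8]    first row of second batch
```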