Ch10 NLP Deep Dive End-of-Chapter Questions Flashcards

1
Q

What is self-supervised learning?

A

Training a model using labels that are embedded in the independent variable (the data itself), rather than requiring externally provided labels.

2
Q

What is a language model?

A

A language model is a model that has been trained to guess the next word in a text via self-supervised learning.

3
Q

What are self-supervised models usually used for?

A

These models are usually used as pretrained models for transfer learning, i.e. they are fine-tuned on a different downstream task.

4
Q

Why do we fine-tune language models?

A

Fine-tuning a language model with the text data you want to use significantly boosts model performance. Your data may use different vocabulary or writing style. Fine-tuning familiarizes the language model with the specific traits of your text data.

5
Q

What are the 3 steps to create a state-of-the-art text classifier?

A
  1. A language model is pre-trained on a large general corpus, e.g. Wikipedia
  2. The language model is fine-tuned on the kind of text that will be classified
  3. The model is fine-tuned as a classifier
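A rough end-to-end sketch of the three steps using the fastai API from the chapter (IMDb as the example dataset; the single fine_tune epochs are a simplification of the chapter's explicit one-cycle schedules):

~~~
from fastai.text.all import *

path = untar_data(URLs.IMDB)

# Step 1 is already done for us: AWD_LSTM comes pre-trained on a Wikipedia corpus.

# Step 2: fine-tune the language model on the text we care about (IMDb reviews)
dls_lm = DataBlock(
    blocks=TextBlock.from_folder(path, is_lm=True),
    get_items=partial(get_text_files, folders=['train', 'test', 'unsup']),
    splitter=RandomSplitter(0.1)
).dataloaders(path, path=path, bs=128, seq_len=80)
learn_lm = language_model_learner(dls_lm, AWD_LSTM, drop_mult=0.3,
                                  metrics=[accuracy, Perplexity()]).to_fp16()
learn_lm.fine_tune(1)                 # more epochs in practice
learn_lm.save_encoder('finetuned')    # keep the fine-tuned encoder for step 3

# Step 3: fine-tune a classifier that reuses the fine-tuned encoder
dls_clas = DataBlock(
    blocks=(TextBlock.from_folder(path, vocab=dls_lm.vocab), CategoryBlock),
    get_y=parent_label,
    get_items=partial(get_text_files, folders=['train', 'test']),
    splitter=GrandparentSplitter(valid_name='test')
).dataloaders(path, path=path, bs=128, seq_len=72)
learn_clas = text_classifier_learner(dls_clas, AWD_LSTM, drop_mult=0.5,
                                     metrics=accuracy).to_fp16()
learn_clas.load_encoder('finetuned')
learn_clas.fine_tune(1)               # the chapter uses gradual unfreezing here
~~~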
6
Q

What are the 3 steps to prepare your data for a language model?

A
  1. Tokenization
  2. Numericalization
  3. Create a dependent variable that is offset from the independent variable by one token (since model is predicting the next word in the text)
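A minimal, framework-free sketch of the three steps (toy tokenizer and vocab; fastai's Tokenizer and Numericalize transforms do the same jobs with many more rules):

~~~
raw_text = "the movie was great . the acting was great too ."

# 1. Tokenization: text -> list of tokens (here, a naive split on spaces)
tokens = raw_text.split()

# 2. Numericalization: token -> integer index via a vocab
vocab = sorted(set(tokens))
token_to_idx = {t: i for i, t in enumerate(vocab)}
ids = [token_to_idx[t] for t in tokens]

# 3. The dependent variable is the independent variable shifted by one token:
#    given ids[i], the model must predict ids[i+1]
x, y = ids[:-1], ids[1:]
print(list(zip(x, y)))
~~~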
7
Q

What is tokenization? Why do we need it?

A

Tokenization involves converting the text into a list of words/characters/substrings. Since the language model will predict the next word in the text, it is necessary to define what constitutes a “word”.
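A quick look at the word tokenizer used in the chapter (exact output depends on your fastai/spaCy version):

~~~
from fastai.text.all import *

# Wrap a spaCy word tokenizer in fastai's Tokenizer, which also applies fastai's
# extra rules (special tokens such as xxbos, xxmaj, xxrep, ...)
txt = "The movie was GREAT. I'd watch it again!!!"
spacy = WordTokenizer()
tkn = Tokenizer(spacy)
print(coll_repr(tkn(txt), 31))
~~~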

8
Q

Name 3 approaches to tokenization

A
  1. Word-based: Split on spaces + additional rules for punctuation, etc
  2. Subword-based: Split words into smaller parts based on the most commonly occurring substrings. (Particularly useful for languages that don’t use spaces the same way as English does)
  3. Character-based: Split a sentence into its individual characters
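A toy comparison in plain Python (the subword case is only described, since it needs a tokenizer trained on a corpus, e.g. fastai's SubwordTokenizer, which wraps SentencePiece):

~~~
txt = "Tokenization matters!"

# Word-based: split on whitespace, plus extra rules for punctuation in practice
word_tokens = txt.replace("!", " !").split()   # ['Tokenization', 'matters', '!']

# Character-based: every character becomes a token
char_tokens = list(txt)                        # ['T', 'o', 'k', 'e', ...]

# Subword-based: a model trained on a corpus learns frequent substrings and might
# split the text into pieces like ['▁Token', 'ization', '▁matters', '!']
print(word_tokens)
print(char_tokens)
~~~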
9
Q

What is xxbos? How does it help the model?

A

xxbos is a special token marking the “beginning of stream”; e.g., in a dataset of reviews, it marks the beginning of each review.

By recognizing the start token, the model will be able to learn that it needs to “forget” what was said previously and focus on upcoming words.
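A toy illustration (not fastai's actual implementation) of why the token is needed: documents get concatenated into one long stream for language modeling, so something has to mark where each new document starts.

~~~
reviews = [["great", "movie"], ["terrible", "plot"]]
stream = []
for doc in reviews:
    stream += ["xxbos"] + doc   # mark the start of every document
print(stream)   # ['xxbos', 'great', 'movie', 'xxbos', 'terrible', 'plot']
~~~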

10
Q

Why does fastai’s tokenization replace repeated characters with a token showing the number of repetitions followed by the character that’s repeated?

A

It helps the model’s embedding matrix to encode info about the general concept of repeated punctuation and avoids the need to encode every repetition with a separate token.
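A toy version of the idea behind fastai's replace_rep rule (not its actual code): collapse a run of three or more identical characters into “xxrep <count> <char>”.

~~~
import re

def replace_rep_sketch(text):
    # (\S)\1{2,} matches a non-space character repeated 3 or more times
    return re.sub(r"(\S)\1{2,}",
                  lambda m: f" xxrep {len(m.group(0))} {m.group(1)} ",
                  text)

print(replace_rep_sketch("This was soooooo good!!!!"))
# -> 'This was s xxrep 6 o  good xxrep 4 ! '
~~~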

11
Q

What is numericalization?

A

Numericalization involves assigning a number to each unique token in the vocab
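A minimal sketch (fastai's Numericalize transform does this, plus handling of special tokens like xxunk and xxpad):

~~~
tokens = ["xxbos", "the", "movie", "was", "great",
          "xxbos", "the", "plot", "was", "thin"]
vocab = sorted(set(tokens))                      # the unique tokens
o2i = {tok: i for i, tok in enumerate(vocab)}    # token -> index
nums = [o2i[t] for t in tokens]                  # the numericalized text
print(vocab)
print(nums)
~~~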

12
Q

Why would you want there to be words that are replaced with the “unknown word” token in NLP?

A

It’s useful to encode very low-frequency words with the unknown word token because there may not be enough data to learn anything useful about those rare words. It also avoids creating an overly large embedding matrix, which can slow down training and use up too much memory.
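A toy sketch of the idea: only keep tokens that appear at least min_freq times and send everything else to xxunk (fastai's Numericalize has min_freq and max_vocab parameters that work along these lines):

~~~
from collections import Counter

tokens = ["the", "movie", "was", "great", "the", "cinematographically", "great"]
min_freq = 2

counts = Counter(tokens)
vocab = ["xxunk"] + sorted(t for t, c in counts.items() if c >= min_freq)
o2i = {t: i for i, t in enumerate(vocab)}
nums = [o2i.get(t, o2i["xxunk"]) for t in tokens]   # rare words fall back to xxunk
print(vocab)   # ['xxunk', 'great', 'the']
print(nums)    # [2, 0, 0, 1, 2, 0, 1]
~~~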

13
Q

Why do we need padding for text classification? Why don’t we need it for language modeling?

A

All the items in a batch are put into a single tensor. Since tensors cannot be jagged, every item has to have the same length. We can achieve this by padding the length of each item to match the length of the longest item in the batch.

It is not required for language modeling since the documents are all concatenated.
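A small PyTorch sketch of the padding step (fastai uses a special xxpad token for this; padding_value=1 below is just an assumed index for it):

~~~
import torch
from torch.nn.utils.rnn import pad_sequence

# Three numericalized "documents" of different lengths
docs = [torch.tensor([5, 8, 2]),
        torch.tensor([7, 3]),
        torch.tensor([9, 4, 6, 1, 2])]

# Pad every document to the length of the longest one so they stack into one tensor
batch = pad_sequence(docs, batch_first=True, padding_value=1)
print(batch.shape)   # torch.Size([3, 5])
print(batch)
~~~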

14
Q

What does an embedding matrix for NLP contain? What is its shape?

A

An embedding matrix contains one row of weights for each token in the vocabulary.

The shape would be (vocab_size, embedding_size), where vocab_size is the length of the vocabulary, and embedding_size is an arbitrary number defining the number of latent factors of the tokens.
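A quick check with PyTorch (the sizes below are illustrative; fastai's AWD_LSTM uses an embedding size of 400 by default):

~~~
import torch
import torch.nn as nn

vocab_size, embedding_size = 10_000, 400
emb = nn.Embedding(vocab_size, embedding_size)
print(emb.weight.shape)        # torch.Size([10000, 400]) -- one row per token

token_ids = torch.tensor([2, 5, 42])
print(emb(token_ids).shape)    # torch.Size([3, 400]) -- each id looks up its row
~~~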

15
Q

Context: Putting texts into batches to fine-tune a language model

With a batch size of 64, the first row of the tensor representing the first batch contains the first 64 tokens for the dataset. What does the second row of that tensor contain? What does the first row of the second batch contain?

A

a. The dataset is split into 64 mini-streams (batch size)
b. Each batch has 64 rows (batch size) and 64 columns (sequence length)
c. The first row of the first batch contains the beginning of the first mini-stream (tokens 1-64)
d. The second row of the first batch contains the beginning of the second mini-stream
e. The first row of the second batch contains the second chunk of the first mini-stream (tokens 65-128)
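A small tensor sketch of the same scheme, with a tiny batch size and sequence length so the slices are easy to inspect:

~~~
import torch

bs, seq_len = 4, 5
stream = torch.arange(200)              # the whole corpus as one token stream
mini_streams = stream.view(bs, -1)      # split into bs contiguous mini-streams

batches = [mini_streams[:, i:i + seq_len]   # each batch has shape (bs, seq_len)
           for i in range(0, mini_streams.shape[1], seq_len)]

print(batches[0][0])   # first row of first batch: start of mini-stream 0 (tokens 0-4)
print(batches[0][1])   # second row of first batch: start of mini-stream 1
print(batches[1][0])   # first row of second batch: next chunk of mini-stream 0 (tokens 5-9)
~~~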

16
Q

What is perplexity? In which type of tasks is this often used as a metric?

A

Perplexity is the exponential of the loss, i.e., torch.exp(cross_entropy).

This metric is often used in NLP, in particular for evaluating language models.
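A two-line check of the relationship between cross-entropy loss and perplexity:

~~~
import torch
import torch.nn.functional as F

logits = torch.randn(8, 1000)              # 8 predictions over a 1000-token vocab
targets = torch.randint(0, 1000, (8,))     # the actual next tokens
loss = F.cross_entropy(logits, targets)
perplexity = torch.exp(loss)               # perplexity = exp(cross-entropy)
print(loss.item(), perplexity.item())
~~~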

17
Q

Consider the following DataBlock for NLP classification:

~~~
dls_clas = DataBlock(
    blocks=(TextBlock.from_folder(path, vocab=dls_lm.vocab), CategoryBlock),
    get_y=parent_label,
    get_items=partial(get_text_files, folders=['train', 'test']),
    splitter=GrandparentSplitter(valid_name='test')
).dataloaders(path, path=path, bs=128, seq_len=72)
~~~

Why do we have to pass the vocabulary of the language model to the classifier data block?

A

Passing vocab=dls_lm.vocab ensures the same token-to-index correspondence, so the classifier can correctly reuse the embeddings learned during language-model fine-tuning.

Note that when we create the classification learner, we pass the base architecture (AWD_LSTM) rather than the fine-tuned language model:
~~~
learn = text_classifier_learner(dls_clas, AWD_LSTM, drop_mult=0.5,
                                metrics=accuracy).to_fp16()
~~~

The fine-tuned weights are only loaded afterwards with learn.load_encoder. If the data block built its own vocab instead of reusing the language model’s, the rows of the loaded embedding matrix would no longer line up with the classifier’s token indices.
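For reference, this is roughly how the chapter wires it up (assuming learn_lm is the language-model learner from the fine-tuning step):

~~~
# After fine-tuning the language model, save its encoder
learn_lm.save_encoder('finetuned')

# Build the classifier on the base AWD_LSTM architecture...
learn = text_classifier_learner(dls_clas, AWD_LSTM, drop_mult=0.5,
                                metrics=accuracy).to_fp16()

# ...then load the fine-tuned encoder; its embedding rows only line up with the
# classifier's inputs because dls_clas was built with vocab=dls_lm.vocab
learn.load_encoder('finetuned')
~~~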

18
Q

NLP

What is gradual unfreezing?

A

It refers to unfreezing the model a few layers at a time (starting from the final layers) when fine-tuning the classifier, rather than unfreezing everything at once. This boosts the performance of NLP classifiers.
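The schedule used in the chapter looks roughly like this (the learning rates are the chapter's; treat them as starting points rather than fixed values):

~~~
learn.fit_one_cycle(1, 2e-2)                          # train only the new classifier head
learn.freeze_to(-2)                                   # unfreeze the last two parameter groups
learn.fit_one_cycle(1, slice(1e-2/(2.6**4), 1e-2))
learn.freeze_to(-3)                                   # unfreeze one more group
learn.fit_one_cycle(1, slice(5e-3/(2.6**4), 5e-3))
learn.unfreeze()                                      # finally train the whole model
learn.fit_one_cycle(2, slice(1e-3/(2.6**4), 1e-3))
~~~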

19
Q

Why is text generation always likely to be ahead of automatic identification of machine-generated texts?

A

Because models that get better at detecting machine-generated text can themselves be used to train text-generation models that evade that detection, so generation tends to stay a step ahead of identification.