Ch10 NLP Deep Dive End-of-Chapter Questions Flashcards
What is self-supervised learning?
Training a model using labels that are embedded in the independent variable itself, rather than requiring external labels
What is a language model?
A language model is a model that has been trained to guess the next word in a text via self-supervised learning.
What are self-supervised models usually used for?
These models are usually used for transfer learning, i.e. as pre-trained models that are then fine-tuned for another task (such as classification)
Why do we fine-tune language models?
Fine-tuning a language model with the text data you want to use significantly boosts model performance. Your data may use different vocabulary or writing style. Fine-tuning familiarizes the language model with the specific traits of your text data.
What are the 3 steps to create a state-of-the-art text classifier?
- The language model is pre-trained on a large corpus, e.g. Wikipedia
- The language model is fine-tuned on the type of text that will be classified
- The fine-tuned language model is fine-tuned once more, as a classifier (sketched below)
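A minimal sketch of all three steps with fastai's high-level API, following the chapter's IMDb example (the epoch counts, and using the test split for validation, are illustrative):

```python
from fastai.text.all import *

path = untar_data(URLs.IMDB)

# Step 1 is already done for us: AWD_LSTM ships pre-trained on Wikipedia.

# Step 2: fine-tune the language model on the IMDb text itself
dls_lm = TextDataLoaders.from_folder(path, valid='test', is_lm=True)
learn_lm = language_model_learner(dls_lm, AWD_LSTM, metrics=accuracy)
learn_lm.fine_tune(1)
learn_lm.save_encoder('finetuned')   # keep everything but the final layer

# Step 3: fine-tune a classifier on top of the fine-tuned encoder
dls_clas = TextDataLoaders.from_folder(path, valid='test',
                                       text_vocab=dls_lm.vocab)
learn_clas = text_classifier_learner(dls_clas, AWD_LSTM, metrics=accuracy)
learn_clas = learn_clas.load_encoder('finetuned')
learn_clas.fine_tune(1)
```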
What are the 3 steps to prepare your data for a language model?
- Tokenization
- Numericalization
- Create a dependent variable that is offset from the independent variable by one token, since the model predicts the next word in the text (see the sketch below)
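A sketch of the three steps using fastai's low-level pieces (the sample sentence and min_freq value are just for illustration; in practice the DataBlock API handles all of this):

```python
from fastai.text.all import *

txt = "This movie was great!"

# 1. Tokenization: text -> list of tokens
tok = Tokenizer(WordTokenizer())
toks = tok(txt)

# 2. Numericalization: tokens -> indices into a vocab
num = Numericalize(min_freq=1)   # default min_freq is 3; lowered for this toy corpus
num.setup([toks])
nums = num(toks)

# 3. Offset: at every position, the target is the next token
xs, ys = nums[:-1], nums[1:]
```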
What is tokenization? Why do we need it?
Tokenization converts the text into a list of tokens (words, characters, or substrings, depending on the approach). Since the language model will predict the next word in the text, it is necessary to define what constitutes a “word”.
Name 3 approaches to tokenization
- Word-based: Split on spaces, plus additional rules for punctuation, etc.
- Subword-based: Split words into smaller parts based on the most commonly occurring substrings (particularly useful for languages that don’t use spaces the way English does)
- Character-based: Split a sentence into its individual characters (all three are illustrated below)
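A sketch close to the chapter’s own code (the subword example needs the sentencepiece package; the 2,000-file corpus slice is arbitrary):

```python
from fastai.text.all import *

# a small corpus to work with, as in the chapter
files = get_text_files(untar_data(URLs.IMDB), folders=['train', 'test', 'unsup'])
txts = L(o.open().read() for o in files[:2000])
txt = txts[0]

# Word-based: spaCy splits on spaces plus rules for punctuation, contractions, etc.
print(first(WordTokenizer()([txt]))[:15])

# Character-based: plain Python is enough
print(list(txt)[:15])

# Subword-based: SentencePiece learns the most common substrings from the corpus
sub = SubwordTokenizer(vocab_sz=1000)
sub.setup(txts)
print(first(sub([txt]))[:15])
```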
What is xxbos? How does it help the model?
xxbos marks the beginning of a stream (“bos” stands for “beginning of stream”); e.g., in a dataset of reviews, it marks the start of each review.
By recognizing the start token, the model will be able to learn that it needs to “forget” what was said previously and focus on upcoming words.
Why does fastai’s tokenizer replace repeated characters with a token giving the number of repetitions, followed by the repeated character?
It lets the model’s embedding matrix encode information about the general concept of repeated punctuation, and avoids having to encode every possible repetition length with a separate token.
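fastai implements this as the replace_rep preprocessing rule; a quick demonstration (the exact spacing of the output may differ slightly):

```python
from fastai.text.all import replace_rep

# characters repeated 3+ times become: xxrep <count> <character>
print(replace_rep("This was soooo cool!!!"))
# -> 'This was s xxrep 4 o  cool xxrep 3 ! '
```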
What is numericalization?
Numericalization involves assigning a number to each unique token in the vocab
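A toy demonstration with fastai’s Numericalize (min_freq lowered from its default of 3 so every token survives):

```python
from fastai.text.all import *

toks = ['xxbos', 'this', 'movie', 'was', 'great', 'great']
num = Numericalize(min_freq=1)
num.setup([toks])
print(num.vocab)   # special tokens first, then the corpus tokens
print(num(toks))   # TensorText of each token's index in the vocab
```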
Why would you want there to be words that are replaced with the “unknown word” token in NLP?
It’s useful to encode very low-frequency words with the unknown word token (xxunk): there may not be enough data to train the model to use those rare words well, and dropping them avoids an overly large embedding matrix, which would slow down training and use too much memory.
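With fastai’s Numericalize, the min_freq and max_vocab arguments control which words fall back to xxunk; a toy example:

```python
from fastai.text.all import *

toks = ['the'] * 3 + ['movie'] * 3 + ['transcendent']   # 'transcendent' is rare
num = Numericalize(min_freq=3)   # tokens seen fewer than 3 times are dropped
num.setup([toks])
print('transcendent' in num.vocab)   # False: no embedding row of its own
print(num(['transcendent']))         # mapped to the index of xxunk instead
```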
Why do we need padding for text classification? Why don’t we need it for language modeling?
All the items in a batch are put into a single tensor. Since tensors cannot be jagged, every item has to have the same length. We can achieve this by padding the length of each item to match the length of the longest item in the batch.
It is not required for language modeling since the documents are all concatenated.
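fastai applies padding automatically (grouping texts of similar length to minimize it); the underlying idea in plain PyTorch, with 0 standing in for the pad token’s index:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# three numericalized documents of different lengths
docs = [torch.tensor([4, 9, 7]), torch.tensor([5, 2]), torch.tensor([8, 3, 6, 1])]
print(pad_sequence(docs, batch_first=True, padding_value=0))
# tensor([[4, 9, 7, 0],
#         [5, 2, 0, 0],
#         [8, 3, 6, 1]])
```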
What does an embedding matrix for NLP contain? What is its shape?
An embedding matrix contains one row of weights for each token in the vocabulary.
The shape would be (vocab_size, embedding_size), where vocab_size is the length of the vocabulary, and embedding_size is an arbitrary number defining the number of latent factors of the tokens.
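A quick illustration in PyTorch; the sizes are arbitrary (400 happens to be the AWD-LSTM default embedding size):

```python
import torch
import torch.nn as nn

vocab_size, embedding_size = 60000, 400
emb = nn.Embedding(vocab_size, embedding_size)
print(emb.weight.shape)    # torch.Size([60000, 400]): one row of weights per token

token_ids = torch.tensor([2, 5, 11])   # numericalized tokens
print(emb(token_ids).shape)            # torch.Size([3, 400]): one row looked up per token
```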
Context: Putting texts into batches to fine-tune a language model
With a batch size of 64, the first row of the tensor representing the first batch contains the first 64 tokens for the dataset. What does the second row of that tensor contain? What does the first row of the second batch contain?
a. The dataset is split into 64 mini-streams (the batch size)
b. Each batch has 64 rows (the batch size) and 64 columns (the sequence length)
c. The first row of the first batch contains the beginning of the first mini-stream (tokens 1-64)
d. The second row of the first batch contains the beginning of the second mini-stream (its tokens 1-64)
e. The first row of the second batch contains the second chunk of the first mini-stream (tokens 65-128); a toy version is sketched below
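A toy reconstruction of that batching scheme in plain Python, shrunk to a batch size and sequence length of 4 so the numbers are easy to check (fastai’s LMDataLoader does this for real):

```python
stream = list(range(1, 33))   # pretend these are 32 numericalized tokens
bs, seq_len = 4, 4

# split the stream into `bs` contiguous mini-streams
n = len(stream) // bs
mini_streams = [stream[i*n:(i+1)*n] for i in range(bs)]

# batch k takes the k-th chunk of seq_len tokens from every mini-stream
batches = [[ms[k*seq_len:(k+1)*seq_len] for ms in mini_streams]
           for k in range(n // seq_len)]

print(batches[0][0])   # [1, 2, 3, 4]    first row of first batch
print(batches[0][1])   # [9, 10, 11, 12] second row: start of mini-stream 2
print(batches[1][0])   # [5, 6, 7, 8]    first row of second batch
```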