NLP I: Introduction Flashcards

1
Q

What’s tokenization and what are the 3 different approaches?

A

Tokenization splits raw text into smaller units called tokens. Three approaches, illustrated on “It’s a great movie!”:

Word based: [“It”, “‘s”, “a”, “great”, “movie”, “!”]

Character based: [“I”, “t”, “’”, “s”, “ ”, “a”, “ ”, “g”, “r”, “e”, “a”, “t”, “ ”, “m”, “o”, “v”, “i”, “e”, “!”]

Subword based: [“_It”, “’”, “s”, “_a”, “_gre”, “at”, “_movie”, “!”]
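A minimal sketch of the first two approaches in plain Python (illustrative only; real tokenizers such as spaCy or SentencePiece handle punctuation and subword splits properly):

```python
# Toy illustration of word-based and character-based tokenization.
text = "It's a great movie!"

# Word-based: naive whitespace split (does not separate "It" from "'s").
word_tokens = text.split()
print(word_tokens)   # ["It's", 'a', 'great', 'movie!']

# Character-based: every character, including spaces, becomes a token.
char_tokens = list(text)
print(char_tokens)   # ['I', 't', "'", 's', ' ', 'a', ...]

# Subword-based tokenization needs a learned vocabulary (e.g. BPE / SentencePiece)
# and is not reproduced here.
```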

2
Q

What are the special tokens?
Name 6 of them

A

They are tokens that encode information such as capitalization or repetition explicitly, which helps reduce the effective number of distinct tokens.
List of (most) fastai tokens:
– xxbos / xxeos: Marks the beginning / end of a text document (“begin / end of stream”)
– xxmaj: In front of capitalized word, e.g., [“Movie”] → [“xxmaj”, “movie”]
– xxunk: Placeholder for unknown token (too rare or not part of the training data)
– xxrep: If a letter / punctuation is repeated three times or more, replace with count, e.g., [“!!!!”] → [“xxrep”, “4”, “!”]
– xxwrep: Same as xxrep, but with words instead of single characters
– xxup: Before a token that is written all-caps, e.g., [“CAPS”] → [“xxup”, “caps”]
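A minimal sketch of a few of these rules in plain Python (a hypothetical helper, not the fastai API; fastai’s actual rules cover more cases):

```python
import re

def add_special_tokens(words):
    """Hypothetical re-implementation of the xxmaj / xxup / xxrep rules."""
    out = []
    for w in words:
        rep = re.fullmatch(r"(.)\1{2,}", w)       # same char repeated 3+ times
        if rep:
            out += ["xxrep", str(len(w)), rep.group(1)]
        elif len(w) > 1 and w.isupper():          # all-caps word
            out += ["xxup", w.lower()]
        elif w[:1].isupper():                     # capitalized word
            out += ["xxmaj", w.lower()]
        else:
            out.append(w)
    return out

print(add_special_tokens(["Movie", "was", "GREAT", "!!!!"]))
# ['xxmaj', 'movie', 'was', 'xxup', 'great', 'xxrep', '4', '!']
```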

3
Q

What’s numericalization?

A

Mapping of tokens to integers:
– Replace each token’s string representation with its index in the vocab
– Vocab: List of all unique tokens in the training set (by convention sorted by frequency)
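A minimal sketch of numericalization, assuming the tokenized text is already available:

```python
from collections import Counter

tokens = ["xxbos", "xxmaj", "it", "was", "a", "great", "movie", "it", "was"]

# Vocab: unique tokens, by convention sorted by frequency.
vocab = [tok for tok, _ in Counter(tokens).most_common()]
stoi = {tok: i for i, tok in enumerate(vocab)}      # string -> integer index

ids = [stoi[tok] for tok in tokens]                 # numericalized text
print(vocab)
print(ids)
```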

4
Q

What’s One-Hot Encoding?

A

Instead of index numbers, use binary vectors.
Each token is represented by a vector of 0s, with a single 1 at the position of its vocab index
– e.g., token 2: [0, 1, 0, …, 0]; token 5: [0, 0, 0, 0, 1, 0, …, 0]
– Length of each vector: size of the vocab (number of distinct tokens)
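A minimal sketch with NumPy (0-based indices are assumed here):

```python
import numpy as np

vocab_size = 6
token_ids = [1, 4]                     # tokens at vocab index 1 and 4

# One row per token, one column per vocab entry; a single 1 marks the index.
one_hot = np.zeros((len(token_ids), vocab_size), dtype=np.float32)
one_hot[np.arange(len(token_ids)), token_ids] = 1.0
print(one_hot)
# [[0. 1. 0. 0. 0. 0.]
#  [0. 0. 0. 0. 1. 0.]]
```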

5
Q

What are Embeddings for Categorical Variables?

A

Embeddings are used when one-hot encoded vectors become very long.
In the model architecture, the first layer is a fully connected layer that reduces the long, sparse input vector to a much shorter dense vector (the embedding). Multiplying a one-hot vector by this layer’s weight matrix amounts to looking up a single row, which is how embedding layers are implemented efficiently.
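A minimal sketch, assuming PyTorch: an embedding layer is equivalent to multiplying the one-hot vector by a weight matrix, i.e. looking up one row:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, emb_dim = 10_000, 50
embedding = nn.Embedding(vocab_size, emb_dim)   # learnable lookup table

token_ids = torch.tensor([2, 5, 42])            # numericalized tokens
vectors = embedding(token_ids)                  # shape: (3, 50)

# Equivalent (but wasteful) one-hot formulation:
one_hot = F.one_hot(token_ids, vocab_size).float()
print(torch.allclose(one_hot @ embedding.weight, vectors))   # True
```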

6
Q

What’s self-supervised learning?

A

The model learns about the domain from the structure of the data itself
– Here: Given the previous tokens, predict the next token
– Other methods, e.g., “masking”, exist
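A minimal sketch of how next-token prediction targets are built from the data itself (no labels needed):

```python
ids = [4, 0, 7, 2, 9, 1]           # a numericalized text

inputs = ids[:-1]                  # [4, 0, 7, 2, 9]
targets = ids[1:]                  # [0, 7, 2, 9, 1]  (the "label" is just the next token)

for x, y in zip(inputs, targets):
    print(f"given token {x}, predict token {y}")
```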

7
Q

What is self-supervised learning used for?

A

To pre-train models for later (downstream) tasks.

8
Q

What type of datasets does pretraining use?

A

Large amounts of general, unlabeled text, such as Wikipedia.

9
Q

What’s the encoder? What’s its job?

A

The encoder is everything but the classification head; it is the part published as the pre-trained model.

The encoder’s job is to convert text into a numerical representation (embedding) that captures the semantic and syntactic information of the text.

10
Q

How does the fine-tuning procedure work?

A

First: Fine-tune the generic pre-trained language model on the task-specific text using self-supervised learning
– The vocabularies of the task-specific dataset and the pre-trained model are merged!
– Use pre-trained embeddings for pre-trained tokens, initialize at random for new tokens

Second: Use the fine-tuned language model to train a task-specific classifier
– Only at this step do we introduce labeled data and train in a supervised manner!
– The training process is very similar from here on
– Default metric: “perplexity” (exp(cross-entropy))
– Model used here: AWD-LSTM, which is a type of recurrent neural network (RNN)
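A minimal sketch of the perplexity computation mentioned above:

```python
import math

# Perplexity is the exponential of the average per-token cross-entropy (in nats).
avg_cross_entropy = 3.2            # e.g. validation loss of the language model
perplexity = math.exp(avg_cross_entropy)
print(round(perplexity, 1))        # 24.5
```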

11
Q

How do we go from tokenization to embeddings?

A

Tokenization (w/ special tokens) -> Numericalization -> One-hot encoding / Embedding for categorical variables
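A minimal end-to-end sketch of this pipeline on a toy example (plain Python; the embedding weights are just placeholders, in a real model they would be learned):

```python
text = "xxbos xxmaj it was a great movie"

tokens = text.split()                           # tokenization (special tokens already applied)
vocab = sorted(set(tokens))                     # toy vocab
stoi = {tok: i for i, tok in enumerate(vocab)}
ids = [stoi[tok] for tok in tokens]             # numericalization

emb_dim = 4
emb_table = [[0.0] * emb_dim for _ in vocab]    # placeholder embedding table
vectors = [emb_table[i] for i in ids]           # embedding lookup per token
print(ids, len(vectors))
```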

12
Q

Which statements are true about text data and image data when applying deep learning? (Multiple Choice)
1. For image data, several useful data augmentation techniques are available, while data augmentation is much more challenging for text data.
2. One major challenge with image data is dealing with different image resolutions, i.e., different amounts of data per image. Text data does not have this challenge.
3. Text data can be more challenging because most of the time no digital representation of the text is available.
4. Image data is naturally numeric, while the same is not true for text data.

A

1,4

13
Q

Perform word-based tokenization for the following sentence text. Your output should be after applying the rules for the special tokens [xxmaj, xxup, xxrep, xxwrep, xxeos] (and not any other).

Write the outputs separated by exactly one comma and one space, for example: the, movie, was, great, .

text: My trip to London by train was amazing! And unbelievably the trains were ON TIME!!!

A

xxmaj, my, trip, to, xxmaj, london, by, train, was, amazing, !, xxmaj, and, unbelievably, the, trains, were, xxup, on, xxup, time, xxrep, 3, !, xxeos

14
Q

Which statements are true about numericalization?
1. Numericalization means to exchange each token with its embedding.
2. Numericalization means to exchange each token with its index in the vocab.
3. If a token is not in the vocab, it will be removed after numericalization.
4. Assuming no token is replaced by xxunk, tokenized text and numericalized text are a 1-to-1 correspondence.

A

2,4

15
Q

Which statements are true about one-hot encodings and embeddings? (Multiple Choice)
1. Input tokens are mapped to an embedding before being processed by further network layers.
2. In most cases, the dimension of the one-hot encoding and of the embedding are identical.
3. Embeddings are arbitrary and cannot be interpreted, but deep learning networks figure out how to differentiate them regardless.
4. Embeddings are much less time- and memory-efficient than one-hot encodings.

A

1

16
Q

Which statements are true about training a text model? (Multiple Choice)
1. For a text classification task with pre-training, you generally first fine-tune the language model before training a classifier.
2. Language models are usually trained with self-supervised learning, e.g., a next token prediction task.
3. For a text classification task with pre-training, you find the intersection of tokens for the task’s dataset and the tokens from the pre-trained model and train a joint model on those.
4. For self-supervised learning tasks, your text data set has to be labeled.

A

1,2

17
Q

Which statement is true about text generation using language models?
1. With a language model you can generate new texts by using its own next token predictions.
2. With a language model you can generate new texts by training on a classification task.
3. With a language model you cannot generate new texts, it will only output texts from its training dataset.
4. With a language model you cannot generate new texts, except if the training dataset was task-specific.

A

1

18
Q

Which of the following is NOT a tokenization approach in NLP?

A) Word-based
B) Sentence-based
C) Subword-based
D) Character-based

A

Answer: B) Sentence-based

19
Q

What is the main advantage of using subword tokenization over word-based tokenization?

A) Simplicity in implementation
B) Reducing the vocabulary size and handling rare words
C) Increased computational efficiency
D) Preserving semantic meaning of individual characters

A

Answer: B) Reducing the vocabulary size and handling rare words

20
Q

Given the sentence “I love NLP”, perform word-based tokenization and convert each token to its corresponding index using the vocabulary
{“I”: 1, “love”: 2, “NLP”: 3}.

A

Answer:

Tokens: [“I”, “love”, “NLP”]
Indices: [1, 2, 3]

21
Q

Explain the process of tokenization and its importance in NLP.

A

Answer:
Tokenization is the process of converting a string of text into smaller units called tokens, which can be words, subwords, or characters. This is a crucial step in NLP as it transforms raw text into a format that can be processed by neural networks. Tokenization enables the mapping of textual data to numerical representations, allowing models to learn patterns and relationships within the text.

22
Q

Describe self-supervised learning and how it is used in training language models.

A

Answer:
Self-supervised learning involves training a model on a task where the target labels are derived from the input data itself, rather than being externally provided. In NLP, this often involves tasks like next-word prediction or masked language modeling. Self-supervised learning allows models to learn rich representations from large amounts of unlabeled text data, which can then be fine-tuned on specific downstream tasks with supervised learning.

23
Q

A sentiment classification model achieves 90% accuracy on the training set but only 70% on the validation set. What could be the reason for this discrepancy and how might you address it?

A

Answer:
The discrepancy suggests overfitting, where the model performs well on the training data but fails to generalize to unseen data. To address this, techniques such as regularization (dropout, weight decay), data augmentation, or early stopping can be used. Additionally, obtaining more training data or simplifying the model might help improve generalization.
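A minimal sketch, assuming PyTorch, of two of the remedies mentioned above (dropout and weight decay):

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(100, 64),
    nn.ReLU(),
    nn.Dropout(p=0.5),        # randomly zeroes activations during training
    nn.Linear(64, 2),         # e.g. two sentiment classes
)
# Weight decay penalizes large weights and is set on the optimizer.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
```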

24
Q

Explain how a trained next-word prediction model can be used to generate text.

A

Answer:
A trained next-word prediction model generates text by predicting the most likely next word given a sequence of preceding words. Starting with an initial prompt, the model predicts the next word, appends it to the sequence, and uses the updated sequence to predict the subsequent word. This process is repeated iteratively to generate coherent and contextually relevant text.
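A minimal sketch of this loop with a stand-in model (`predict_next` is a hypothetical placeholder that returns a fake distribution over a toy vocabulary, not a real language model):

```python
import random

def predict_next(token_ids, vocab_size=5):
    # Placeholder "model": returns a (fake) probability distribution over the vocab.
    rng = random.Random(sum(token_ids))
    probs = [rng.random() for _ in range(vocab_size)]
    total = sum(probs)
    return [p / total for p in probs]

def generate(prompt_ids, n_new_tokens=5):
    ids = list(prompt_ids)
    for _ in range(n_new_tokens):
        probs = predict_next(ids)
        next_id = max(range(len(probs)), key=probs.__getitem__)  # greedy choice
        ids.append(next_id)             # feed the prediction back as input
    return ids

print(generate([2, 0, 3]))              # prompt ids followed by 5 generated ids
```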