NLP I: Introduction Flashcards
What’s tokenization and what are the 3 different approaches?
Tokenization splits raw text into smaller units (tokens). Three common approaches, shown on the sentence "It's a great movie!":
– Word-based: ["It", "'s", "a", "great", "movie", "!"]
– Character-based: ["I", "t", "'", "s", " ", "a", " ", "g", "r", "e", "a", "t", " ", "m", "o", "v", "i", "e", "!"]
– Subword-based: ["_It", "'", "s", "_a", "_gre", "at", "_movie", "!"]
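A rough sketch (not fastai's actual tokenizer) of the word- and character-based splits in plain Python; subword tokenization needs a trained model such as SentencePiece/BPE and is only described in a comment:

```python
import re

text = "It's a great movie!"

# Word-based: split on word characters vs. punctuation (simplified rule;
# real word tokenizers keep contractions like "'s" together).
word_tokens = re.findall(r"\w+|[^\w\s]", text)
# ['It', "'", 's', 'a', 'great', 'movie', '!']

# Character-based: every character (including spaces) becomes a token.
char_tokens = list(text)

# Subword-based tokenizers (e.g., SentencePiece / BPE) must be trained on a corpus;
# they split rare words into pieces like "_gre" + "at" while keeping common words whole.
print(word_tokens)
print(char_tokens)
```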
What are the special tokens?
Name 6 of them
Special tokens encode properties of the text (capitalization, repetition, document boundaries) explicitly, which reduces the effective number of tokens in the vocab, e.g., "Movie" and "movie" can share one token
List of (most) fastai tokens:
– xxbos / xxeos: Mark the beginning / end of a text document ("begin / end of stream")
– xxmaj: In front of capitalized word, e.g., [“Movie”] → [“xxmaj”, “movie”]
– xxunk: Placeholder for unknown token (too rare or not part of the training data)
– xxrep: If a letter / punctuation mark is repeated three times or more, replace the repetition with its count, e.g., ["!!!!"] → ["xxrep", "4", "!"]
– xxwrep: Same as xxrep, but with words instead of single characters
– xxup: Before a token written in all caps, e.g., ["CAPS"] → ["xxup", "caps"]
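A toy, hypothetical sketch of the xxmaj / xxup / xxrep rules (this is only an illustration, not fastai's implementation, which applies these rules as configurable preprocessing functions):

```python
import re

def add_special_tokens(words):
    """Toy version of the capitalization / repetition rules above."""
    out = []
    for w in words:
        # xxrep: one character repeated 3+ times -> ["xxrep", count, char]
        m = re.fullmatch(r"(.)\1{2,}", w)
        if m:
            out += ["xxrep", str(len(w)), m.group(1)]
        elif w.isupper() and len(w) > 1:      # xxup: all-caps word
            out += ["xxup", w.lower()]
        elif w[:1].isupper():                 # xxmaj: capitalized word
            out += ["xxmaj", w.lower()]
        else:
            out.append(w)
    return out

print(add_special_tokens(["My", "trains", "were", "ON", "TIME", "!!!!"]))
# ['xxmaj', 'my', 'trains', 'were', 'xxup', 'on', 'xxup', 'time', 'xxrep', '4', '!']
```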
What’s numericalization?
Mapping of tokens to integers:
– Replace each token's string representation with its index in the vocab
– Vocab: List of all unique tokens in the training set (by convention sorted by frequency)
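A minimal sketch of numericalization on a toy corpus (variable names like `stoi` are illustrative, not fastai's API):

```python
from collections import Counter

tokenized_corpus = [["xxbos", "xxmaj", "great", "movie", "!"],
                    ["xxbos", "xxmaj", "great", "acting", "!"]]

# Build the vocab: all tokens in the training set, sorted by frequency.
counts = Counter(tok for doc in tokenized_corpus for tok in doc)
vocab = [tok for tok, _ in counts.most_common()]
stoi = {tok: i for i, tok in enumerate(vocab)}   # string -> index

# Numericalize: replace each token by its index in the vocab
# (unknown tokens would map to xxunk if it were in the vocab).
ids = [[stoi.get(tok, stoi.get("xxunk", 0)) for tok in doc] for doc in tokenized_corpus]
print(vocab)
print(ids)
```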
What’s One-Hot Encoding?
Instead of index numbers, use binary vectors
Each token is represented by a vector of 0s, with a single 1 at the position of its vocab index
– e.g., for token 2: [0, 1, 0, …, 0], token 5: [0, 0, 0, 0, 1, 0, …, 0]
– Length of each vector: Size of the vocab (total number of unique tokens)
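A minimal NumPy sketch of one-hot encoding (note: 0-based indices, unlike the 1-based example above):

```python
import numpy as np

vocab_size = 8          # length of each one-hot vector = size of the vocab
token_ids = [1, 4]      # numericalized tokens (0-based indices)

# One row per token, all zeros except a single 1 at the token's vocab index.
one_hot = np.zeros((len(token_ids), vocab_size), dtype=np.float32)
one_hot[np.arange(len(token_ids)), token_ids] = 1.0
print(one_hot)
# [[0. 1. 0. 0. 0. 0. 0. 0.]
#  [0. 0. 0. 0. 1. 0. 0. 0.]]
```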
What are Embeddings for Categorical Variables?
Used when one-hot encoded vectors become very long (i.e., the vocab is large).
When building the model architecture, the first layer is a fully connected (embedding) layer that reduces the dimensionality of the long, sparse input vector to a short, dense one
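A minimal PyTorch sketch with illustrative sizes: `nn.Embedding` is mathematically equivalent to a fully connected layer applied to one-hot vectors, but implemented as a fast lookup.

```python
import torch
import torch.nn as nn

vocab_size, emb_dim = 10_000, 300
emb = nn.Embedding(vocab_size, emb_dim)   # first layer of the model

token_ids = torch.tensor([[2, 5, 17]])    # a numericalized sequence (batch of 1)
dense = emb(token_ids)                    # shape: (1, 3, 300) -- short, dense vectors
print(dense.shape)
```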
What’s self supervised Learning?
The model learns about the domain from the structure of the data itself, without requiring labels
– Here: Given the previous tokens, predict the next token
– Other methods, e.g., “masking”, exist
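A minimal sketch of how next-token-prediction targets are derived from the data itself (no labels needed): the target is simply the input shifted by one token.

```python
import torch

ids = torch.tensor([11, 42, 7, 99, 3, 15])   # a numericalized text

x = ids[:-1]   # input:  all tokens except the last
y = ids[1:]    # target: the same tokens shifted by one position
# The model is trained to predict y[t] (the next token) given x up to position t.
print(x.tolist(), y.tolist())
```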
What is self-supervised learning used for?
To pre-train models for later (downstream) tasks
What type of datasets does pre-training use?
Large, unlabeled text corpora, such as Wikipedia
What’s the encoder? What’s its job?
All layers except the task-specific (classification) head; this part is published as the pre-trained model.
The encoder's job is to convert text into a numerical representation (embedding) that captures the semantic and syntactic information of the text.
How does the fine-tuning procedure work?
First: Fine-tune the generic pre-trained model on the task-specific corpus using self-supervised learning
– The vocab of the task-specific dataset is merged with the vocab of the pre-trained model!
– Use the pre-trained embeddings for tokens the model already knows; initialize embeddings for new tokens at random
Second: Use the fine-tuned language model to train a task-specific classifier
– Only at this step do we introduce labeled data and train in a supervised manner!
– The training process is very similar to standard supervised training from here on
– Default metric: “perplexity”, computed as exp(cross_entropy)
– Model used here: AWD-LSTM, which is a type of recurrent neural network (RNN)
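A minimal PyTorch sketch of the perplexity metric, using made-up logits and targets: it is just the exponential of the mean cross-entropy loss.

```python
import torch
import torch.nn.functional as F

# Fake model outputs: logits over a vocab of 10 for 4 target positions.
logits = torch.randn(4, 10)
targets = torch.tensor([1, 3, 3, 7])

loss = F.cross_entropy(logits, targets)   # mean cross-entropy over positions
perplexity = torch.exp(loss)              # perplexity = exp(cross_entropy)
print(loss.item(), perplexity.item())
```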
How do we go from tokenization to embeddings?
Tokenization (w/ special tokens) → Numericalization → One-hot encoding / embedding for categorical variables
Which statements are true about text data and image data when applying deep learning? (Multiple Choice)
1. For image data, several useful data augmentation techniques are available, while data augmentation is much more challenging for text data.
2. One major challenge with image data is dealing with different image resolutions, i.e., different amounts of data per image. Text data does not have this challenge.
3. Text data can be more challenging because most of the time no digital representation of the text is available.
4. Image data is naturally numeric, while the same is not true for text data.
1,4
Perform word-based tokenization for the following sentence text. Your output should be after applying the rules for the special tokens [xxmaj, xxup, xxrep, xxwrep, xxeos] (and not any other).
Write the outputs separated by exactly one comma and one space, for example: the, movie, was, great, .
text: My trip to London by train was amazing! And unbelievably the trains were ON TIME!!!
xxmaj, my, trip, to, xxmaj, london, by, train, was, amazing, !, xxmaj, and, unbelievably, the, trains, were, xxup, on, xxup, time, xxrep, 3, !, xxeos
Which statements are true about numericalization? (Multiple Choice)
1. Numericalization means to exchange each token with its embedding.
2. Numericalization means to exchange each token with its index in the vocab.
3. If a token is not in the vocab, it will be removed after numericalization.
4. Assuming no token is replaced by xxunk, tokenized text and numericalized text are a 1-to-1 correspondence.
2,4
Which statements are true about one-hot encodings and embeddings? (Multiple Choice)
1. Input tokens are mapped to an embedding before being processed by further network layers.
2. In most cases, the dimension of the one-hot encoding and of the embedding are identical.
3. Embeddings are arbitrary and cannot be interpreted, but deep learning networks figure out how to differentiate them regardless.
4. Embeddings are much less time- and memory-efficient than one-hot encodings.
1