NLP Flashcards
Tokenisation approaches
Word-based (“It”, “‘s”, “1”; uses language-specific splitting rules), Character-based (“I”, “t”, “’”), Subword-based (“_It”, “_a”, “_gre”; splits learned for a particular dataset/language)
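A minimal sketch contrasting the three approaches on a toy sentence; the token lists shown are illustrative, not the output of a trained tokenizer.

    # Toy comparison of the three tokenisation approaches
    text = "It's a great day"

    # Word-based: split on whitespace and punctuation using language-specific rules (simplified)
    word_tokens = ["It", "'s", "a", "great", "day"]

    # Character-based: every character (including spaces) becomes a token
    char_tokens = list(text)   # ['I', 't', "'", 's', ' ', 'a', ...]

    # Subword-based: pieces learned from corpus statistics; "_" marks a word start (SentencePiece-style)
    subword_tokens = ["_It", "'s", "_a", "_gre", "at", "_day"]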
xxbos / xxeos
Marks the beginning/end of a text
xxmaj
Marks that the next word starts with a capital letter (the word itself is lowercased in the token stream)
xxunk
Placeholder for unknown token
xxrep
Marks a repeated character, followed by the repetition count and the character (e.g. “!!!!” becomes xxrep 4 !)
xxwrep
Same as xxrep, but with words instead of single characters
xxup
Placed before a token that was written in all caps (the token itself is lowercased)
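A toy sketch of how these markers could be applied during tokenisation; it mimics the rules above and is not fastai's actual implementation.

    def add_special_tokens(words):
        # Toy illustration of the xx* marker rules (not fastai's implementation)
        out = ["xxbos"]                        # mark the beginning of the text
        for w in words:
            if len(w) > 1 and w.isupper():
                out += ["xxup", w.lower()]     # all-caps word -> xxup + lowercased token
            elif w[:1].isupper():
                out += ["xxmaj", w.lower()]    # capitalised word -> xxmaj + lowercased token
            else:
                out.append(w)
        # (xxrep / xxwrep would analogously replace e.g. "!!!!" with "xxrep 4 !")
        return out

    print(add_special_tokens(["WOW", "This", "movie", "is", "great", "!"]))
    # ['xxbos', 'xxup', 'wow', 'xxmaj', 'this', 'movie', 'is', 'great', '!']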
What is numericalization?
Mapping of tokens to integers (each token's index in the vocab); see the sketch after the vocab card
What is vocab?
List of all tokens in the training set (by convention sorted by frequency)
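A minimal sketch of building a frequency-sorted vocab and numericalizing tokens with it; the xxunk fallback for out-of-vocab tokens is an illustrative assumption.

    from collections import Counter

    tokenized_texts = [["xxbos", "xxmaj", "this", "movie", "is", "great"],
                       ["xxbos", "xxmaj", "this", "film", "is", "terrible"]]

    # Vocab: all tokens of the training set, sorted by frequency (most common first)
    counts = Counter(tok for text in tokenized_texts for tok in text)
    vocab = ["xxunk"] + [tok for tok, _ in counts.most_common()]

    # Numericalization: map each token to its integer index in the vocab
    tok2idx = {tok: i for i, tok in enumerate(vocab)}
    def numericalize(tokens):
        return [tok2idx.get(tok, tok2idx["xxunk"]) for tok in tokens]

    print(numericalize(["xxbos", "xxmaj", "this", "movie", "is", "awesome"]))
    # the unseen token "awesome" falls back to the xxunk index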
One-hot encoding?
Represent each token as a binary vector of vocab length with a single 1 at the token's index, instead of a plain integer index
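A small sketch of turning token indices into one-hot vectors with PyTorch; the vocab size is an arbitrary example value.

    import torch
    import torch.nn.functional as F

    vocab_size = 10
    indices = torch.tensor([2, 5, 7])                      # numericalized tokens
    one_hot = F.one_hot(indices, num_classes=vocab_size)   # shape (3, 10)
    # each row is a binary vector with a single 1 at the token's index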
What is embedding?
A technique for representing words, phrases, or even entire sentences as dense, continuous vectors (much lower-dimensional than one-hot vectors). These embeddings capture semantic information: words with similar meanings tend to be mapped to vectors that lie close together in the vector space.
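A sketch of an embedding layer as a learned lookup table, using PyTorch's nn.Embedding; the sizes are arbitrary example values.

    import torch
    import torch.nn as nn

    vocab_size, emb_dim = 10_000, 64
    embedding = nn.Embedding(vocab_size, emb_dim)   # learnable lookup table: one dense vector per token

    token_indices = torch.tensor([2, 5, 7])         # numericalized tokens
    vectors = embedding(token_indices)              # shape (3, 64); trained together with the model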
Self-supervised learning?
Learn about the domain from the structure of the data itself, without externally provided labels
Here: given the previous tokens, predict the next token
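A sketch of how next-token prediction turns an unlabeled text into (input, target) pairs: the targets are simply the inputs shifted by one position.

    tokens = [12, 7, 43, 9, 81, 3]     # a numericalized text, no external labels needed

    inputs  = tokens[:-1]              # [12, 7, 43, 9, 81]
    targets = tokens[1:]               # [ 7, 43, 9, 81, 3]

    for x, y in zip(inputs, targets):
        print(f"after token {x}, predict token {y}")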
Training procedure for self-supervised learning?
First: fine-tune a generic pre-trained language model with self-supervised learning on the (unlabeled) texts of the dataset, so that the embeddings and the encoder adapt to the domain
Second: use the fine-tuned language model's encoder to train a task-specific classifier with supervised learning
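A hedged sketch of the two stages in fastai-style code; the dataset, hyperparameters, and exact API calls (TextDataLoaders.from_folder, language_model_learner, text_classifier_learner) are assumptions and may differ by fastai version.

    from fastai.text.all import *

    path = untar_data(URLs.IMDB)   # example corpus; any labeled text dataset works

    # Stage 1: fine-tune a pre-trained language model on the corpus itself (self-supervised)
    dls_lm = TextDataLoaders.from_folder(path, is_lm=True, valid_pct=0.1)
    lm_learn = language_model_learner(dls_lm, AWD_LSTM, metrics=Perplexity())
    lm_learn.fine_tune(1)
    lm_learn.save_encoder('finetuned_encoder')

    # Stage 2: reuse the fine-tuned encoder to train a supervised, task-specific classifier
    dls_clas = TextDataLoaders.from_folder(path, valid='test', text_vocab=dls_lm.vocab)
    clas_learn = text_classifier_learner(dls_clas, AWD_LSTM, metrics=accuracy)
    clas_learn.load_encoder('finetuned_encoder')
    clas_learn.fine_tune(4)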
Default loss metric for NLP?
“Perplexity”, i.e. exp(cross-entropy): cross-entropy is the underlying loss, perplexity the commonly reported metric
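A sketch of computing perplexity from the cross-entropy loss; the logits and targets are random placeholders standing in for model output and true next tokens.

    import torch
    import torch.nn.functional as F

    vocab_size, seq_len = 100, 8
    logits  = torch.randn(seq_len, vocab_size)           # model's next-token scores (placeholder)
    targets = torch.randint(0, vocab_size, (seq_len,))   # actual next tokens (placeholder)

    cross_entropy = F.cross_entropy(logits, targets)     # the training loss
    perplexity = torch.exp(cross_entropy)                # the reported metric: exp(cross-entropy)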
Can you generate new text with next token prediction?
Yes: a language model can generate new text by repeatedly feeding its own next-token predictions back in as input
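A sketch of such a generation loop; "model" is a hypothetical trained language model that maps a token sequence to next-token logits.

    import torch

    def generate(model, prompt_tokens, n_new_tokens=20):
        # model: hypothetical trained LM returning logits of shape (vocab_size,)
        tokens = list(prompt_tokens)
        for _ in range(n_new_tokens):
            logits = model(torch.tensor(tokens))     # scores for the next token
            next_token = int(torch.argmax(logits))   # greedy pick (sampling also works)
            tokens.append(next_token)                # feed the prediction back in as input
        return tokens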