NLP Flashcards

1
Q

Tokenisation approaches

A

Word-based: “It”, “‘s”, “1” (uses language-specific rules)
Character-based: “I”, “t”, “’”
Subword-based: “_It”, “_a”, “_gre” (learned for a particular dataset/language)
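A minimal Python sketch of the three families (illustrative only; the subword split is hard-coded here, since real subword vocabularies are learned from a corpus, e.g. with BPE/SentencePiece):

```python
import re

text = "It's a great day!"

# Word-based: a toy rule that splits off punctuation and contractions
word_tokens = re.findall(r"\w+|'\w+|[^\w\s]", text)
# -> ['It', "'s", 'a', 'great', 'day', '!']

# Character-based: every character becomes a token
char_tokens = list(text)
# -> ['I', 't', "'", 's', ' ', 'a', ...]

# Subword-based: pieces are learned from data; a trained BPE/SentencePiece
# model might produce something like this ('_' marks a word boundary)
subword_tokens = ["_It", "'s", "_a", "_gre", "at", "_day", "!"]
```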

2
Q

xxbos / xxeos

A

Special tokens that mark the beginning / end of a text

3
Q

xxmaj

A

Placed before a word that starts with a capital letter (the word itself is lowercased)

4
Q

xxunk

A

Placeholder for unknown token

5
Q

xxrep

A

Marks a repeated character; followed by the count and the character (e.g. “!!!!” becomes xxrep 4 !)

6
Q

xxwrep

A

Same as xxrep, but with words instead of single characters

7
Q

xxup

A

Placed before a token that was written in all caps (the token itself is lowercased)
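A hand-written illustration of how the xx* tokens from the previous cards combine; this is the kind of output fastai-style preprocessing rules produce, shown here without running the library:

```python
raw = "WOW!!!! This movie is great great great"

# Plausible preprocessed output (illustrative, written by hand):
processed = [
    "xxbos",                  # beginning of the text
    "xxup", "wow",            # "WOW" was written in all caps
    "xxrep", "4", "!",        # "!" repeated 4 times
    "xxmaj", "this",          # "This" started with a capital letter
    "movie", "is",
    "xxwrep", "3", "great",   # "great" repeated 3 times
]
```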

8
Q

What is numericalization?

A

Mapping of tokens to integers
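A minimal sketch of numericalization with a hypothetical vocab (see the next card for how the vocab itself is built):

```python
# Hypothetical vocab: position in the list is the token's integer id
vocab = ["xxunk", "xxbos", "the", "movie", "was", "great"]
token_to_id = {tok: i for i, tok in enumerate(vocab)}

tokens = ["xxbos", "the", "movie", "was", "great"]
ids = [token_to_id.get(tok, token_to_id["xxunk"]) for tok in tokens]
# -> [1, 2, 3, 4, 5]; tokens not in the vocab map to xxunk's index
```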

9
Q

What is vocab?

A

List of all unique tokens in the training set (by convention sorted by frequency)
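A small sketch of building a vocab sorted by frequency (a real pipeline would also cap the vocab size and prepend special tokens such as xxunk and xxbos):

```python
from collections import Counter

corpus_tokens = ["the", "movie", "was", "great", "the", "plot", "was", "the", "best"]

counts = Counter(corpus_tokens)
vocab = [tok for tok, _ in counts.most_common()]   # most frequent tokens first
# -> ['the', 'was', 'movie', 'great', 'plot', 'best']
```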

10
Q

One-hot encoding?

A

Represent each token as a binary vector of length vocab-size with a single 1 at the token’s index, instead of a plain integer index
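A minimal sketch of one-hot encoding token indices (toy vocab size):

```python
import torch

vocab_size = 6
token_id = 3                                    # e.g. 'movie' from the cards above

one_hot = torch.zeros(vocab_size)
one_hot[token_id] = 1.0                         # -> tensor([0., 0., 0., 1., 0., 0.])

# Or for a whole sequence of indices at once:
ids = torch.tensor([1, 2, 3])
one_hots = torch.nn.functional.one_hot(ids, num_classes=vocab_size)
```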

11
Q

What is embedding?

A

A technique for representing words, phrases, or even entire sentences as dense, continuous vectors. These embeddings capture semantic information: words with similar meanings tend to be mapped to vectors that lie close together in the vector space.
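A small sketch: an embedding layer is essentially a learnable lookup table from token indices to dense vectors (the sizes here are arbitrary toy values):

```python
import torch
import torch.nn as nn

vocab_size, emb_dim = 6, 4                      # arbitrary toy sizes
embedding = nn.Embedding(vocab_size, emb_dim)   # learnable (vocab_size x emb_dim) table

ids = torch.tensor([1, 2, 3])                   # numericalized tokens
vectors = embedding(ids)                        # shape (3, 4): one dense vector per token
```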

12
Q

Self-supervised learning?

A

Learning about the domain from the structure of the data itself, without requiring external labels.
Here: given the previous tokens, predict the next token.
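A minimal sketch of how next-token prediction turns unlabeled text into (input, target) pairs, so no external labels are needed:

```python
ids = [1, 2, 3, 4, 5]        # a numericalized text

# Inputs are the tokens seen so far; targets are the same sequence shifted by one
inputs  = ids[:-1]           # [1, 2, 3, 4]
targets = ids[1:]            # [2, 3, 4, 5]  -> "given the previous tokens, predict the next"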

13
Q

Training procedure for self-supervised learning?

A

First: Fine-tune a generic pre-trained language model with self-supervised learning on unlabeled texts from the target dataset, learning the embeddings and the encoder.
Second: Use the fine-tuned language model’s encoder to train a task-specific classifier with supervised learning.
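A hedged sketch of this two-stage recipe with fastai (the dataset choice and the exact arguments are assumptions; see the fastai docs for the precise API):

```python
from fastai.text.all import *

path = untar_data(URLs.IMDB)   # assumed example dataset with an IMDB-style folder layout

# Stage 1: fine-tune a pre-trained language model on the (unlabeled) texts
dls_lm = TextDataLoaders.from_folder(path, is_lm=True, valid_pct=0.1)
lm_learn = language_model_learner(dls_lm, AWD_LSTM, metrics=Perplexity())
lm_learn.fine_tune(1)
lm_learn.save_encoder('finetuned_encoder')

# Stage 2: train a task-specific classifier on top of the fine-tuned encoder
dls_clas = TextDataLoaders.from_folder(path, valid='test', text_vocab=dls_lm.vocab)
clas_learn = text_classifier_learner(dls_clas, AWD_LSTM, metrics=accuracy)
clas_learn.load_encoder('finetuned_encoder')
clas_learn.fine_tune(4)
```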

14
Q

Default loss metric for NLP?

A

“Perplexity”, i.e. exp(cross-entropy). Cross-entropy is the loss being optimized; perplexity is the conventional metric reported for language models.
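A quick sketch of the relationship on toy data:

```python
import torch
import torch.nn.functional as F

# Toy example: scores over a 100-token vocab for 8 next-token predictions
logits  = torch.randn(8, 100)
targets = torch.randint(0, 100, (8,))

loss = F.cross_entropy(logits, targets)   # cross-entropy (the loss being optimized)
perplexity = torch.exp(loss)              # perplexity = exp(cross-entropy)
```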

15
Q

Can you generate new text with next token prediction?

A

Yes: a language model can generate new text by repeatedly feeding its own next-token predictions back in as input.
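A minimal greedy-generation sketch; `model` and `vocab` are hypothetical stand-ins for a trained language model (returning logits of shape (batch, seq, vocab)) and its token list:

```python
import torch

def generate(model, vocab, prompt_ids, n_new=20):
    ids = list(prompt_ids)
    for _ in range(n_new):
        logits = model(torch.tensor([ids]))       # scores for every vocab token
        next_id = int(logits[0, -1].argmax())     # greedily take the most likely next token
        ids.append(next_id)                       # feed the prediction back as input
    return " ".join(vocab[i] for i in ids)
```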

16
Q

Image data is naturally numeric, while the same is not true for text data.

A

True

17
Q

Numericalization means to exchange each token with its embedding.

A

False. Numericalization maps each token to its integer index in the vocab; the embedding lookup happens afterwards, inside the model.

18
Q

Advantages of subword tokenization

A

Handling Out-of-Vocabulary (OOV) Words, Dealing with Morphologically Rich Languages, Reducing Vocabulary Size, Improved Generalization
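A toy greedy longest-match splitter over a hypothetical learned subword vocabulary, illustrating the OOV advantage: an unseen word still decomposes into known pieces instead of becoming xxunk.

```python
# Hypothetical subword vocabulary (in practice learned with BPE/SentencePiece)
subwords = {"un", "happi", "ness", "happy", "est", "ing", "s"}

def split(word):
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):         # try the longest piece first
            if word[i:j] in subwords:
                pieces.append(word[i:j]); i = j
                break
        else:
            pieces.append("<unk>"); i += 1        # no piece matched this character
    return pieces

print(split("unhappiness"))   # ['un', 'happi', 'ness'] -- never fully out-of-vocabulary
```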