Hugging Face ecosystem | HF NLP course | 6. The Hugging Face Tokenizers library | Priority Flashcards

1
Q

Training a new tokenizer: Review: Load an existing tokenizer

A

from transformers import AutoTokenizer
old_tokenizer = AutoTokenizer.from_pretrained("gpt2")

2
Q

[page] Training a new tokenizer from an old one [page section] Training a new tokenizer [q] Training a new tokenizer: Function to train a new tokenizer

A

tokenizer = old_tokenizer.train_new_from_iterator(training_corpus, 52000)
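
A minimal sketch of how training_corpus might be built as a generator (the toy raw_texts list and the batch size are illustrative, not the course's CodeSearchNet setup):

raw_texts = ["def add(a, b): return a + b", "def sub(a, b): return a - b"]  # toy stand-in for a real corpus

def get_training_corpus():
    # yield texts in batches so the whole corpus never has to sit in memory at once
    for start in range(0, len(raw_texts), 1000):
        yield raw_texts[start : start + 1000]

training_corpus = get_training_corpus()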

3
Q

[page] Training a new tokenizer from an old one [page section] Saving the tokenizer [q] Saving the tokenizer: Review: Function to save a tokenizer

A

tokenizer.save_pretrained("code-search-net-tokenizer")

4
Q

[page] Fast tokenizers’ special powers [page section] Batch encoding [q] The output of a tokenizer.

A

A BatchEncoding object (it behaves like a Python dict)
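
A minimal sketch (the "bert-base-cased" checkpoint is only an illustrative choice):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
encoding = tokenizer("My name is Sylvain and I work at Hugging Face in Brooklyn.")
print(type(encoding))   # BatchEncoding
print(encoding.keys())  # dict-style access: input_ids, token_type_ids, attention_mask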

5
Q

[page] Fast tokenizers’ special powers [page section] Batch encoding [q] The key functionality of fast tokenizers is they always keep track of what?

A

They always keep track of the original span of text that the final tokens come from, a feature called offset mapping.
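
A minimal sketch of retrieving the offsets (the checkpoint choice is illustrative):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
example = "My name is Sylvain."
encoding = tokenizer(example, return_offsets_mapping=True)
# each (start, end) pair is the character span in `example` that the token was produced from
print(list(zip(encoding.tokens(), encoding["offset_mapping"])))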

6
Q

[page] Fast tokenizers’ special powers [page section] Batch encoding [q] Access the tokens.

A

encoding.tokens()

7
Q

[page] Fast tokenizers’ special powers [page section] Batch encoding [q] Get the index of the word each token comes from | [CLS] and [SEP] are mapped to __.

A

encoding.word_ids()
# special tokens like [CLS] and [SEP] are mapped to None
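
A minimal usage sketch (the checkpoint choice is illustrative):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
encoding = tokenizer("My name is Sylvain.")
print(encoding.tokens())    # starts with '[CLS]' and ends with '[SEP]'
print(encoding.word_ids())  # [CLS] and [SEP] map to None; subword pieces share a word index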

8
Q

[page] Fast tokenizers’ special powers [page section] Getting the base results with the pipeline: [q] Function to group together the three stages necessary to get the predictions from a raw text. What are the three stages?

A
"from transformers import pipeline
token_classifier = pipeline(""token-classification"")
token_classifier(""My name is Sylvain and I work at Hugging Face in Brooklyn."")
# tokenization, passing the inputs through the model, and post-processing"
9
Q

[page] Fast tokenizers’ special powers [page section] Inside the token-classification pipeline: Getting the base results with the pipeline: [q] Group together the tokens that correspond to the same entity.

A
"from transformers import pipeline
token_classifier = pipeline(""token-classification"", aggregation_strategy=""simple"")
token_classifier(""My name is Sylvain and I work at Hugging Face in Brooklyn."")"
10
Q

[page] Fast tokenizers’ special powers [page section] Inside the token-classification pipeline: From inputs to predictions: [q] How to instantiate a tokenizer and a model, then use them for token classification.

A
"from transformers import AutoTokenizer, AutoModelForTokenClassification

model_checkpoint = ""dbmdz/bert-large-cased-finetuned-conll03-english""
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForTokenClassification.from_pretrained(model_checkpoint)

example = ""My name is Sylvain and I work at Hugging Face in Brooklyn.""
inputs = tokenizer(example, return_tensors=""pt"")
outputs = model(**inputs)"
11
Q

[page] Fast tokenizers’ special powers [page section] Inside the token-classification pipeline: From inputs to predictions: [q] Output shape of token classifier.

A

We have a batch with 1 sequence of 19 tokens and the model has 9 different labels, so the output of the model has a shape of 1 x 19 x 9.
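
A minimal check of those shapes, continuing from the snippet in the previous card (inputs and outputs as defined there):

print(inputs["input_ids"].shape)  # torch.Size([1, 19])
print(outputs.logits.shape)       # torch.Size([1, 19, 9])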
