Hugging Face ecosystem | HF NLP course | 6. The Hugging Face Tokenizers library | Priority Flashcards
Training a new tokenizer: Review: Load an existing tokenizer
from transformers import AutoTokenizer

old_tokenizer = AutoTokenizer.from_pretrained("gpt2")
[page] Training a new tokenizer from an old one [page section] Training a new tokenizer [q] Training a new tokenizer: Function to train a new tokenizer
tokenizer = old_tokenizer.train_new_from_iterator(training_corpus, 52000)
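`train_new_from_iterator` accepts any iterator that yields batches of texts. A minimal sketch of such an iterator (the `raw_texts` list and batch size here are placeholders, not from the course):

```python
def get_training_corpus(raw_texts, batch_size=1000):
    # Yield batches of texts; a generator keeps memory usage low
    # because the whole corpus never has to be loaded at once.
    for i in range(0, len(raw_texts), batch_size):
        yield raw_texts[i : i + batch_size]

# Placeholder data for illustration:
texts = ["def add(a, b):", "return a + b", "print(add(1, 2))"]
batches = list(get_training_corpus(texts, batch_size=2))
# batches == [["def add(a, b):", "return a + b"], ["print(add(1, 2))"]]
```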
[page] Training a new tokenizer from an old one [page section] Saving the tokenizer [q] Saving the tokenizer: Review: Function to save a tokenizer
tokenizer.save_pretrained("code-search-net-tokenizer")
[page] Fast tokenizers’ special powers [page section] Batch encoding [q] The output of a tokenizer.
a BatchEncoding object (a subclass of dict)
[page] Fast tokenizers’ special powers [page section] Batch encoding [q] The key functionality of fast tokenizers is they always keep track of what?
the original span of text the final tokens come from — a feature we call offset mapping.
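Each offset is a (start, end) pair indexing into the raw text, so slicing recovers the exact span a token came from. A small illustration with hand-written offsets (the values are assumed, such as a fast tokenizer would return with `return_offsets_mapping=True`):

```python
# Assumed token spans for illustration; a fast tokenizer returns
# pairs like these as its offset mapping.
text = "Hugging Face"
offsets = [(0, 7), (7, 12)]

# Slicing the raw text with each offset recovers the original spans.
spans = [text[start:end] for start, end in offsets]
# spans == ["Hugging", " Face"]
```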
[page] Fast tokenizers’ special powers [page section] Batch encoding [q] Access the tokens.
encoding.tokens()
[page] Fast tokenizers’ special powers [page section] Batch encoding [q] Get the index of the word each token comes from | [CLS] and [SEP] are mapped to __.
encoding.word_ids()
None
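An illustrative `word_ids()` output for a two-word input (the token strings and indices here are made up for the sketch): special tokens map to None, and subword pieces of the same word share a word index.

```python
# Hypothetical tokens and matching word_ids() output.
tokens = ["[CLS]", "Hu", "##gging", "Face", "[SEP]"]
word_ids = [None, 0, 0, 1, None]

# Recover all token pieces belonging to word 0:
word0 = [t for t, w in zip(tokens, word_ids) if w == 0]
# word0 == ["Hu", "##gging"]
```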
[page] Fast tokenizers’ special powers [page section] Getting the base results with the pipeline: [q] Function to group together the three stages necessary to get the predictions from a raw text. What are the three stages?
"from transformers import pipeline token_classifier = pipeline(""token-classification"") token_classifier(""My name is Sylvain and I work at Hugging Face in Brooklyn."") # tokenization, passing the inputs through the model, and post-processing"
[page] Fast tokenizers’ special powers [page section] Inside the token-classification pipeline: Getting the base results with the pipeline: [q] Group together the tokens that correspond to the same entity.
"from transformers import pipeline token_classifier = pipeline(""token-classification"", aggregation_strategy=""simple"") token_classifier(""My name is Sylvain and I work at Hugging Face in Brooklyn."")"
[page] Fast tokenizers’ special powers [page section] Inside the token-classification pipeline: From inputs to predictions: [q] How to instantiate a tokenizer and then use it for token classification.
"from transformers import AutoTokenizer, AutoModelForTokenClassification model_checkpoint = ""dbmdz/bert-large-cased-finetuned-conll03-english"" tokenizer = AutoTokenizer.from_pretrained(model_checkpoint) model = AutoModelForTokenClassification.from_pretrained(model_checkpoint) example = ""My name is Sylvain and I work at Hugging Face in Brooklyn."" inputs = tokenizer(example, return_tensors=""pt"") outputs = model(**inputs)"
[page] Fast tokenizers’ special powers [page section] Inside the token-classification pipeline: From inputs to predictions: [q] Output shape of token classifier.
We have a batch with 1 sequence of 19 tokens and the model has 9 different labels, so the output of the model has a shape of 1 x 19 x 9.
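The logits therefore have shape (batch, sequence length, number of labels), and post-processing takes the argmax over the last dimension to pick a label per token. A tiny stand-in with batch=1, seq_len=3, and 4 labels (the numbers are made up) shows that step without loading a model:

```python
# Stand-in logits of shape (1, 3, 4); values are arbitrary.
logits = [[[0.1, 2.0, 0.3, 0.0],
           [1.5, 0.2, 0.1, 0.0],
           [0.0, 0.1, 0.2, 3.0]]]

# Argmax over the label dimension gives one label index per token.
predictions = [[row.index(max(row)) for row in seq] for seq in logits]
# predictions == [[1, 0, 3]]
```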