Hugging Face ecosystem | HF NLP course | 6. The Hugging Face Tokenizers library | Priority Flashcards
Training a new tokenizer: Review: Load an existing tokenizer
from transformers import AutoTokenizer

old_tokenizer = AutoTokenizer.from_pretrained("gpt2")
[page] Training a new tokenizer from an old one [page section] Training a new tokenizer [q] Training a new tokenizer: Function to train a new tokenizer
tokenizer = old_tokenizer.train_new_from_iterator(training_corpus, 52000)
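`train_new_from_iterator` accepts any iterator that yields batches of texts. A minimal sketch of such an iterator (the `raw_texts` list and batch size here are placeholders, not from the course):

```python
def get_training_corpus(raw_texts, batch_size=1000):
    # Yield batches of texts; a generator keeps memory usage low
    # because the whole corpus never has to be loaded at once.
    for i in range(0, len(raw_texts), batch_size):
        yield raw_texts[i : i + batch_size]

# Placeholder data for illustration:
texts = ["def add(a, b):", "return a + b", "print(add(1, 2))"]
batches = list(get_training_corpus(texts, batch_size=2))
# batches == [["def add(a, b):", "return a + b"], ["print(add(1, 2))"]]
```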
[page] Training a new tokenizer from an old one [page section] Saving the tokenizer [q] Saving the tokenizer: Review: Function to save a tokenizer
tokenizer.save_pretrained("code-search-net-tokenizer")
[page] Fast tokenizers’ special powers [page section] Batch encoding [q] The output of a tokenizer.
a BatchEncoding object (a subclass of dict)
[page] Fast tokenizers’ special powers [page section] Batch encoding [q] The key functionality of fast tokenizers is they always keep track of what?
the original span of text the final tokens come from — a feature we call offset mapping.
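Each offset is a (start, end) pair indexing into the raw text, so slicing recovers the exact span a token came from. A small illustration with hand-written offsets (the values are assumed, such as a fast tokenizer would return with `return_offsets_mapping=True`):

```python
# Assumed token spans for illustration; a fast tokenizer returns
# pairs like these as its offset mapping.
text = "Hugging Face"
offsets = [(0, 7), (7, 12)]

# Slicing the raw text with each offset recovers the original spans.
spans = [text[start:end] for start, end in offsets]
# spans == ["Hugging", " Face"]
```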
[page] Fast tokenizers’ special powers [page section] Batch encoding [q] Access the tokens.
encoding.tokens()
[page] Fast tokenizers’ special powers [page section] Batch encoding [q] Get the index of the word each token comes from | [CLS] and [SEP] are mapped to __.
encoding.word_ids()
None
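An illustrative `word_ids()` output for a two-word input (the token strings and indices here are made up for the sketch): special tokens map to None, and subword pieces of the same word share a word index.

```python
# Hypothetical tokens and matching word_ids() output.
tokens = ["[CLS]", "Hu", "##gging", "Face", "[SEP]"]
word_ids = [None, 0, 0, 1, None]

# Recover all token pieces belonging to word 0:
word0 = [t for t, w in zip(tokens, word_ids) if w == 0]
# word0 == ["Hu", "##gging"]
```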
[page] Fast tokenizers’ special powers [page section] Getting the base results with the pipeline: [q] Function to group together the three stages necessary to get the predictions from a raw text. What are the three stages?
"from transformers import pipeline token_classifier = pipeline(""token-classification"") token_classifier(""My name is Sylvain and I work at Hugging Face in Brooklyn."") # tokenization, passing the inputs through the model, and post-processing"
[page] Fast tokenizers’ special powers [page section] Inside the token-classification pipeline: Getting the base results with the pipeline: [q] Group together the tokens that correspond to the same entity.
"from transformers import pipeline token_classifier = pipeline(""token-classification"", aggregation_strategy=""simple"") token_classifier(""My name is Sylvain and I work at Hugging Face in Brooklyn."")"
[page] Fast tokenizers’ special powers [page section] Inside the token-classification pipeline: From inputs to predictions: [q] How to instantiate a tokenizer and then use it for token classification.
"from transformers import AutoTokenizer, AutoModelForTokenClassification model_checkpoint = ""dbmdz/bert-large-cased-finetuned-conll03-english"" tokenizer = AutoTokenizer.from_pretrained(model_checkpoint) model = AutoModelForTokenClassification.from_pretrained(model_checkpoint) example = ""My name is Sylvain and I work at Hugging Face in Brooklyn."" inputs = tokenizer(example, return_tensors=""pt"") outputs = model(**inputs)"
[page] Fast tokenizers’ special powers [page section] Inside the token-classification pipeline: From inputs to predictions: [q] Output shape of token classifier.
We have a batch with 1 sequence of 19 tokens and the model has 9 different labels, so the output of the model has a shape of 1 x 19 x 9.
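The logits therefore have shape (batch, sequence length, number of labels), and post-processing takes the argmax over the last dimension to pick a label per token. A tiny stand-in with batch=1, seq_len=3, and 4 labels (the numbers are made up) shows that step without loading a model:

```python
# Stand-in logits of shape (1, 3, 4); values are arbitrary.
logits = [[[0.1, 2.0, 0.3, 0.0],
           [1.5, 0.2, 0.1, 0.0],
           [0.0, 0.1, 0.2, 3.0]]]

# Argmax over the label dimension gives one label index per token.
predictions = [[row.index(max(row)) for row in seq] for seq in logits]
# predictions == [[1, 0, 3]]
```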