Hugging Face ecosystem | HF NLP course | 2. Using Hugging Face Transformers | Priority Flashcards
Code to get a tokenizer.
hugging-face tokenizers
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
Code to send example input through a tokenizer with arguments.
hugging-face tokenizers
raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(inputs)
Output format from a tokenizer.
hugging-face tokenizers
{'input_ids': tensor([[…sentence 1 ids…],
                      […sentence 2 ids…]]),
 'attention_mask': tensor([[…sentence 1 mask…],
                           […sentence 2 mask…]])}
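Concretely, each tensor has one row per sentence; the shorter sentence is padded, and its attention_mask has 0s at the padded positions (token IDs elided here; 101 and 102 are the [CLS] and [SEP] IDs in the uncased BERT vocabulary this checkpoint uses):

{'input_ids': tensor([[ 101, 1045, …, 102],
                      [ 101, 1045, …,   0]]),
 'attention_mask': tensor([[1, 1, …, 1],
                           [1, 1, …, 0]])}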
Code to get a model (not for a specific task).
hugging-face transformers
from transformers import AutoModel

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModel.from_pretrained(checkpoint)
What are the output dimensions of a Transformer module?
hugging-face transformers
batch size, sequence length, hidden size
Example code to feed the outputs of a tokenizer into a model.
hugging-face tokenizers transformers
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # torch.Size([batch_size, sequence_length, hidden_size])
Example code to get a model that will classify text. What will the output shape be?
hugging-face transformers
from transformers import AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs)
# outputs.logits has shape (batch_size, num_labels): here (2, 2) for the two SST-2 classes
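To make the logits readable, the course follows up with a softmax and the model's label mapping; a short sketch, assuming the inputs from the tokenizer card above:

import torch

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)            # per-class probabilities, one row per input
print(model.config.id2label)  # {0: 'NEGATIVE', 1: 'POSITIVE'} for this checkpoint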
[page] Tokenizers: [page section] Loading and saving: [q] Code to use the AutoTokenizer class to grab the proper tokenizer class in the library based on the checkpoint name.
"from transformers import AutoTokenizer tokenizer = AutoTokenizer.from_pretrained(""bert-base-cased"")"
[page] Tokenizers: [page section] Loading and saving: [q] What are the 3 elements in the dict output of a tokenizer?
input_ids, token_type_ids, attention_mask
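For example, a BERT-style tokenizer given a sentence pair returns all three keys, with token_type_ids marking which segment each token belongs to (a sketch using the bert-base-cased tokenizer from the card above):

inputs = tokenizer("This is the first sentence.", "This is the second one.")
print(inputs.keys())             # dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])
print(inputs["token_type_ids"])  # 0s for the first segment, 1s for the second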
[page] Tokenizers: [page section] Loading and saving: [q] Saving a tokenizer.
tokenizer.save_pretrained("directory_on_my_computer")
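The saved directory can then be reloaded with from_pretrained just like a Hub checkpoint (the directory name is the placeholder from the card above):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("directory_on_my_computer")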
[page] Handling multiple sequences: [page section] Models expect a batch of inputs: [q] Models expect ? sentences by default.
multiple
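The course illustrates this by wrapping a single sequence's token IDs in an extra list to form a batch of one; a minimal sketch, assuming the tokenizer and sequence-classification model from the earlier cards:

import torch

sequence = "I hate this so much!"
tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)
input_ids = torch.tensor([ids])  # the extra brackets add the batch dimension: shape (1, sequence_length)
output = model(input_ids)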
[page] Putting it all together: [page section] Wrapping up: From tokenizer to model: [q] Write a code snippet that uses the tokenizer API to tokenize 2 sequences (using 3 arguments) and run them through a sequence classification model.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

tokens = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")
output = model(**tokens)
[page] Processing the data: [page section] Loading a dataset from the Hub: [q] Code to download and cache the MRPC dataset from the GLUE benchmark on the Hugging Face Hub.
from datasets import load_dataset

raw_datasets = load_dataset("glue", "mrpc")  # GLUE requires a task name; "mrpc" is the one used in the course
[page] Processing the data: [page section] Loading a dataset from the Hub: [q] How to inspect the features of a dataset?
dataset.features
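load_dataset returns a DatasetDict, so in practice you inspect one split; a minimal sketch, assuming the MRPC raw_datasets from the card above:

print(raw_datasets["train"].features)
# For MRPC: 'sentence1' and 'sentence2' string Values, a ClassLabel 'label'
# with names ['not_equivalent', 'equivalent'], and an integer 'idx'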
[page] Processing the data: [page section] Loading a dataset from the Hub: [q] How to tokenize all elements in a HF dataset?
"def tokenize_function(example): return tokenizer(…) tokenized_dataset = dataset.map(tokenize_function)"