Hugging Face ecosystem | HF NLP course | 2. Using Hugging Face Transformers | Priority Flashcards
Code to get a tokenizer.
hugging-face tokenizers
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
Code to send example input through a tokenizer with arguments.
hugging-face tokenizers
raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(inputs)
Output format from a tokenizer.
hugging-face tokenizers
A dict with two keys, each a 2D tensor with one row per sentence: {'input_ids': tensor([[…], […]]), 'attention_mask': tensor([[…], […]])}
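For illustration (abridged; exact IDs depend on the checkpoint), the padded batch from the two sentences above looks like:

{'input_ids': tensor([[  101,  1045,  1005,  ...,  2166,  1012,   102],
                      [  101,  1045,  5223,  ...,     0,     0,     0]]),
 'attention_mask': tensor([[1, 1, 1, ..., 1, 1, 1],
                           [1, 1, 1, ..., 0, 0, 0]])}

The 0s in attention_mask flag padding tokens the model should ignore.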
Code to get a model (not for a specific task).
hugging-face transformers
from transformers import AutoModel

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModel.from_pretrained(checkpoint)
What are the output dimensions of a Transformer module?
hugging-face transformers
batch size, sequence length, hidden size
Example code to feed the outputs of a tokenizer into a model.
hugging-face tokenizers transformers
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)
Example code to get a model that will classify text. What will the output shape be?
hugging-face transformers
from transformers import AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs)
# outputs.logits has shape (batch_size, num_labels)
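To turn those logits into probabilities, the course applies a softmax over the last dimension and reads label names off the model config:

import torch

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)
print(model.config.id2label)  # maps class indices to label names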
[page] Tokenizers: [page section] Loading and saving: [q] Code to use the AutoTokenizer class to grab the proper tokenizer class in the library based on the checkpoint name.
"from transformers import AutoTokenizer tokenizer = AutoTokenizer.from_pretrained(""bert-base-cased"")"
[page] Tokenizers:[page section] Loading and saving: [q] What are the 3 elements in the dict output of a tokenizer?
input_ids, token_type_ids, attention_mask
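For example, tokenizing a sentence pair with the BERT tokenizer above yields all three; token_type_ids distinguishes the two sentences (some models, like DistilBERT, omit it):

inputs = tokenizer("This is the first sentence.", "This is the second one.")
print(inputs.keys())  # dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])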
[page] Tokenizers: [page section] Loading and saving: [q] Saving a tokenizer.
tokenizer.save_pretrained("directory_on_my_computer")
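A saved tokenizer can be reloaded from that directory with the same from_pretrained API:

tokenizer = AutoTokenizer.from_pretrained("directory_on_my_computer")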
[page] Handling multiple sequences: [page section] Models expect a batch of inputs: [q] Models expect ? sentences by default.
multiple
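Hence a single sequence must still be wrapped in a batch dimension; a minimal sketch following the course (sequence is a string defined earlier):

import torch

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)
input_ids = torch.tensor([ids])  # note the extra brackets: a batch of one
output = model(input_ids)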
[page] Putting it all together: [page section]: Wrapping up: From tokenizer to model: [q] Write a code snippet that uses the tokenizer API to tokenize 2 sequences (using 3 arguments) and run them through a sequence classification model.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

tokens = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")
output = model(**tokens)
[page] Processing the data: [page section]: Loading a dataset from the Hub: [q] Code to download and cache the GLUE benchmark dataset from the Hugging Face hub.
from datasets import load_dataset

raw_datasets = load_dataset("glue", "mrpc")  # GLUE requires a task name, e.g. MRPC
[page] Processing the data: [page section] Loading a dataset from the Hub: [q] How to inspect the features of a dataset?
dataset.features
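On MRPC, for example, this prints something like (abridged):

raw_train_dataset = raw_datasets["train"]
print(raw_train_dataset.features)
# {'sentence1': Value(dtype='string'), 'sentence2': Value(dtype='string'),
#  'label': ClassLabel(names=['not_equivalent', 'equivalent']), 'idx': Value(dtype='int32')}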
[page] Processing the data: [page section] Loading a dataset from the Hub: [q] How to tokenize all elements in a HF dataset?
"def tokenize_function(example): return tokenizer(…) tokenized_dataset = dataset.map(tokenize_function)"
[page] Processing the data: [page section] Preprocessing a dataset: [q] What is the idiom to decode IDs to words?
tokenizer.convert_ids_to_tokens(inputs["input_ids"])
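The related idiom for getting a single readable string back (special tokens included) is tokenizer.decode:

print(tokenizer.decode(inputs["input_ids"]))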
[page] Processing the data: [page section] Dynamic padding: [q] What is a collate function? What is the default behavior?
The function that is responsible for putting together samples inside a batch. It’s an argument you can pass when you build a DataLoader, the default being a function that will just convert your samples to PyTorch tensors and concatenate them.
[page] Processing the data: [page section] Dynamic padding: [q] Example of how to create a collator?
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
batch = data_collator(samples)
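Here samples can come from a tokenized dataset like the one built above, with string columns dropped first; the collator pads everything to the longest sample in the batch:

samples = tokenized_datasets["train"][:8]
samples = {k: v for k, v in samples.items() if k not in ["idx", "sentence1", "sentence2"]}
batch = data_collator(samples)
print({k: v.shape for k, v in batch.items()})  # all tensors share the same padded length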
When not using the Trainer class, what does creating the DataLoader objects look like (using a collator for dynamic padding)?
"from torch.utils.data import DataLoader from transformers import DataCollatorWithPadding data_collator = DataCoIIatorWithPadding(tokenizer) train_dataloader = DataLoader( tokenized_datasets[""train""], shuffle=True, batch_size=8, collate_fn=data_collator ) eval_dataloader = DataLoader( tokenized_datasets[""validation""], batch_size=8, collate_fn=data_collator ) for step, batch in enumerate(train_dataloader): print(batch[ ""input_ids""].shape) if step > 5: break "
[page] Fine-tuning a model with the Trainer API: [page section] Training: [q] How to pass parameters to training?
from transformers import TrainingArguments

training_args = TrainingArguments("test-trainer")
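For the per-epoch evaluation the next card relies on, the course passes an evaluation strategy here:

training_args = TrainingArguments("test-trainer", eval_strategy="epoch")  # older transformers versions call this evaluation_strategy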
[page] Fine-tuning a model with the Trainer API: [page section] Training: [q] How to instantiate a Trainer and start training, with evaluation per epoch?
"from transformers import Trainer trainer = Trainer( model, training_args, train_dataset=tokenized_datasets[""train""], eval_dataset=tokenized_datasets[""validation""], data_collator=data_collator, tokenizer=tokenizer, compute_metrics=compute_metrics, ) trainer.train()"
[page] Fine-tuning a model with the Trainer API: [page section] Evaluation: [q] What is the output of the trainer.predict() method?
named tuple with three fields: predictions, label_ids, and metrics
[page] Fine-tuning a model with the Trainer API: [page section] Evaluation: [q] Idiom to get the predictions from the output of trainer.predict()?
import numpy as np

predictions = trainer.predict(tokenized_datasets["validation"])
preds = np.argmax(predictions.predictions, axis=-1)
[page] Fine-tuning a model with the Trainer API: [page section] Evaluation: [q] Example compute_metrics() function on mrpc to be passed to a Trainer? What is the eval_preds argument?
import evaluate
import numpy as np

def compute_metrics(eval_preds):
    metric = evaluate.load("glue", "mrpc")
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

# eval_preds is an EvalPrediction object: a named tuple with a predictions field
# and a label_ids field
[q] When not using the Trainer class, write a basic training loop.
"from tqdm.auto import tqdm progress_bar = tqdm(range(num_training_steps)) model.train() for epoch in range(num_epochs): for batch in train_dataloader: batch = {k: v.to(device) for k, v in batch.items()} outputs = model(**batch) loss = outputs.loss loss.backward() optimizer.step() lr_scheduler.step() optimizer.zero_grad() progress_bar.update(1)"
[q] Write a basic evaluation loop using a glue metric.
"import evaluate metric = evaluate.load(""glue"", ""mrpc"") model.eval() for batch in eval_dataloader: batch = {k: v.to(device) for k, v in batch.items()} with torch.no_grad(): outputs = model(**batch) logits = outputs.logits predictions = torch.argmax(logits, dim=-1) metric.add_batch(predictions=predictions, references=batch[""labels""]) metric.compute()"