Hugging Face ecosystem | HF NLP course | 2. Using Hugging Face Transformers | Priority Flashcards
Code to get a tokenizer.
hugging-face tokenizers
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
Code to send example input through a tokenizer with arguments.
hugging-face tokenizers
raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(inputs)
Output format from a tokenizer.
hugging-face tokenizers
{'input_ids': tensor([[…sentence 1 ids…],
                      […sentence 2 ids…]]),
 'attention_mask': tensor([[…sentence 1 mask…],
                           […sentence 2 mask…]])}
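Concretely, each tensor has one row per sentence; the shorter sentence is padded, and its attention_mask has 0s at the padded positions (token IDs elided here; 101 and 102 are the [CLS] and [SEP] IDs in the uncased BERT vocabulary this checkpoint uses):

{'input_ids': tensor([[ 101, 1045, …, 102],
                      [ 101, 1045, …,   0]]),
 'attention_mask': tensor([[1, 1, …, 1],
                           [1, 1, …, 0]])}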
Code to get a model (not for a specific task).
hugging-face transformers
from transformers import AutoModel

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModel.from_pretrained(checkpoint)
What are the output dimensions of a Transformer module?
hugging-face transformers
batch size, sequence length, hidden size
Example code to feed the outputs of a tokenizer into a model.
hugging-face tokenizers transformers
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # torch.Size([batch_size, sequence_length, hidden_size])
Example code to get a model that will classify text. What will the output shape be?
hugging-face transformers
from transformers import AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs)
# outputs.logits has shape (batch_size, num_labels): here (2, 2) for the two SST-2 classes
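To make the logits readable, the course follows up with a softmax and the model's label mapping; a short sketch, assuming the inputs from the tokenizer card above:

import torch

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)            # per-class probabilities, one row per input
print(model.config.id2label)  # {0: 'NEGATIVE', 1: 'POSITIVE'} for this checkpoint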
[page] Tokenizers: [page section] Loading and saving: [q] Code to use the AutoTokenizer class to grab the proper tokenizer class in the library based on the checkpoint name.
"from transformers import AutoTokenizer tokenizer = AutoTokenizer.from_pretrained(""bert-base-cased"")"
[page] Tokenizers: [page section] Loading and saving: [q] What are the 3 elements in the dict output of a tokenizer?
input_ids, token_type_ids, attention_mask
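For example, a BERT-style tokenizer given a sentence pair returns all three keys, with token_type_ids marking which segment each token belongs to (a sketch using the bert-base-cased tokenizer from the card above):

inputs = tokenizer("This is the first sentence.", "This is the second one.")
print(inputs.keys())             # dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])
print(inputs["token_type_ids"])  # 0s for the first segment, 1s for the second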
[page] Tokenizers: [page section] Loading and saving: [q] Saving a tokenizer.
tokenizer.save_pretrained("directory_on_my_computer")
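The saved directory can then be reloaded with from_pretrained just like a Hub checkpoint (the directory name is the placeholder from the card above):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("directory_on_my_computer")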
[page] Handling multiple sequences: [page section] Models expect a batch of inputs: [q] Models expect ? sentences by default.
multiple
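The course illustrates this by wrapping a single sequence's token IDs in an extra list to form a batch of one; a minimal sketch, assuming the tokenizer and sequence-classification model from the earlier cards:

import torch

sequence = "I hate this so much!"
tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)
input_ids = torch.tensor([ids])  # the extra brackets add the batch dimension: shape (1, sequence_length)
output = model(input_ids)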
[page] Putting it all together: [page section] Wrapping up: From tokenizer to model: [q] Write a code snippet that uses the tokenizer API to tokenize 2 sequences (using 3 arguments) and run them through a sequence classification model.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

tokens = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")
output = model(**tokens)
[page] Processing the data: [page section] Loading a dataset from the Hub: [q] Code to download and cache the MRPC dataset from the GLUE benchmark on the Hugging Face Hub.
from datasets import load_dataset

raw_datasets = load_dataset("glue", "mrpc")  # GLUE requires a task name; "mrpc" is the one used in the course
[page] Processing the data: [page section] Loading a dataset from the Hub: [q] How to inspect the features of a dataset?
dataset.features
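load_dataset returns a DatasetDict, so in practice you inspect one split; a minimal sketch, assuming the MRPC raw_datasets from the card above:

print(raw_datasets["train"].features)
# For MRPC: 'sentence1' and 'sentence2' string Values, a ClassLabel 'label'
# with names ['not_equivalent', 'equivalent'], and an integer 'idx'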
[page] Processing the data: [page section] Loading a dataset from the Hub: [q] How to tokenize all elements in a HF dataset?
"def tokenize_function(example): return tokenizer(…) tokenized_dataset = dataset.map(tokenize_function)"