Hugging Face ecosystem | HF NLP course | 7. Main NLP Tasks | Priority Flashcards
[q] How would you inspect the class names of a token classification dataset?
label_names = raw_datasets["train"].features["ner_tags"].feature.names
[q] What are the basic steps (3) for converting texts to token IDs, with aligned labels, before the model can make sense of them?
– Apply a function to tokenize and align labels for each split of the dataset with map().
– Write a function that combines tokenization and label alignment for the examples of one split.
– Write a function to align labels with tokens for one example.
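The innermost step above can be sketched as follows (a minimal version, assuming word-level labels where each B-XXX tag has an odd id and its matching I-XXX tag is id + 1, as in conll2003):

```python
def align_labels_with_tokens(labels, word_ids):
    # Map word-level NER labels onto subword tokens.
    # Special tokens (word_id is None) get -100 so the loss ignores them;
    # inner subwords of a word turn B-XXX (odd id) into I-XXX (id + 1).
    new_labels = []
    current_word = None
    for word_id in word_ids:
        if word_id is None:
            new_labels.append(-100)
        elif word_id != current_word:
            # Start of a new word: keep its label as-is
            current_word = word_id
            new_labels.append(labels[word_id])
        else:
            # Same word continued: B-XXX becomes I-XXX
            label = labels[word_id]
            if label % 2 == 1:
                label += 1
            new_labels.append(label)
    return new_labels
```

With labels [1, 0] (e.g. B-PER, O) and word_ids [None, 0, 0, 1, None], this yields [-100, 1, 2, 0, -100]: the split word's second subword gets the I-PER id.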
[q] How do you pad the labels the exact same way as the inputs so that they stay the same size?
from transformers import DataCollatorForTokenClassification
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)
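What that label padding amounts to can be illustrated with a small stand-in (pad_labels is a hypothetical helper for illustration, not part of transformers; the real collator also pads the input tensors and handles attention masks):

```python
def pad_labels(batch_labels, pad_to=None):
    # Mirror the collator's treatment of labels: pad every label list
    # to the batch's max length with -100, so the padded positions
    # are ignored by the loss, exactly like special tokens.
    length = pad_to or max(len(labels) for labels in batch_labels)
    return [labels + [-100] * (length - len(labels)) for labels in batch_labels]
```

For a batch [[0, 1], [0]], this returns [[0, 1], [0, -100]]: the shorter example is padded up to the batch length with the ignored index.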
[q] How is a metric loaded for token classification?
!pip install seqeval
import evaluate
metric = evaluate.load("seqeval")
[q] What are the basic steps in a compute_metrics() function that takes the arrays of predictions and labels, and returns a dictionary with the metric names and values?
* Take the argmax of the logits to get predictions.
* Convert integer indices to label strings, ignoring special tokens (-100).
* Call metric.compute() on the predictions and labels.

import numpy as np

def compute_metrics(eval_preds):
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)

    # Remove ignored index (special tokens) and convert to labels
    true_labels = [[label_names[l] for l in label if l != -100] for label in labels]
    true_predictions = [
        [label_names[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    all_metrics = metric.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": all_metrics["overall_precision"],
        "recall": all_metrics["overall_recall"],
        "f1": all_metrics["overall_f1"],
        "accuracy": all_metrics["overall_accuracy"],
    }
[q] How do you set up an Accelerator with a model to train?
from accelerate import Accelerator
accelerator = Accelerator()
model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader
)
[q] What does a postprocess() function need to do during a token classification model’s training?
It takes the model's predictions and labels and converts them to lists of label strings, which is the format our metric object expects.
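A minimal sketch of such a function, assuming label_names holds the dataset's class names (the three-label set below is just an illustration) and -100 marks ignored positions:

```python
label_names = ["O", "B-PER", "I-PER"]  # hypothetical label set for illustration

def postprocess(predictions, labels):
    # Turn batches of label ids into nested lists of label strings,
    # dropping the -100 positions (special tokens / padding),
    # which is the format seqeval's metric.compute() expects.
    true_labels = [
        [label_names[l] for l in label if l != -100] for label in labels
    ]
    true_predictions = [
        [label_names[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    return true_predictions, true_labels
```

For predictions [[0, 1, 2, 0]] and labels [[-100, 1, 2, -100]], both outputs are [["B-PER", "I-PER"]]: only the positions with real labels survive.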