Hugging Face ecosystem | HF NLP course | 5. The Hugging Face Datasets library Flashcards
Slicing and dicing our data: Load TSV data
from datasets import load_dataset

data_files = {"train": "drugsComTrain_raw.tsv", "test": "drugsComTest_raw.tsv"}
# \t is the tab character in Python
drug_dataset = load_dataset("csv", data_files=data_files, delimiter="\t")
Slicing and dicing our data: Find the number of unique items in each split
for split in drug_dataset.keys():
    assert len(drug_dataset[split]) == len(drug_dataset[split].unique(col_name))

# Alternatively, iterate over the splits directly:
for split, subset in drug_dataset.items():
    print(split, len(subset.unique(col_name)))
Slicing and dicing our data: Filter rows where a column is None using a lambda expression
drug_dataset = drug_dataset.filter(lambda x: x["condition"] is not None)
https://huggingface.co/course/chapter5/3?fw=pt#slicing-and-dicing-our-data
Creating new columns: Create a new column with a function (that computes the number of words in a text column) and map()
def compute_review_length(example):
    return {"review_length": len(example["review"].split())}
drug_dataset = drug_dataset.map(compute_review_length)
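The new column can then be used for filtering; as in the course, reviews shorter than 30 words (the threshold chosen there) can be dropped:
drug_dataset = drug_dataset.filter(lambda x: x["review_length"] > 30)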
Creating new columns: Sort a dataset by a column
drug_dataset["train"].sort("review_length")
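A quick way to sanity-check the result (usage sketch from the course): indexing the sorted split returns the shortest reviews first.
drug_dataset["train"].sort("review_length")[:3]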
The map() method’s superpowers: How to speed up applying a function to a column
.map(…, batched=True)
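For example, the course unescapes HTML entities in the reviews with a batched lambda; with batched=True each column value the function receives is a list of examples:
import html

new_drug_dataset = drug_dataset.map(
    lambda x: {"review": [html.unescape(o) for o in x["review"]]}, batched=True
)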
The map() method’s superpowers: How to speed up applying a function to a column using parallelization
.map(…, num_proc=n)
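A sketch combining both options, assuming a tokenizer has already been loaded (e.g. with AutoTokenizer.from_pretrained); the function and process count follow the course:
def tokenize_function(examples):
    return tokenizer(examples["review"], truncation=True)

tokenized_dataset = drug_dataset.map(tokenize_function, batched=True, num_proc=8)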
The map() method’s superpowers: Truncate while tokenizing, but return all of the resulting chunks
def tokenize_and_split(examples):
    return tokenizer(
        examples["review"],
        truncation=True,
        max_length=128,
        return_overflowing_tokens=True,
    )
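Because each review can now produce several chunks, the new columns end up longer than the old ones; as in the course, dropping the original columns during the map avoids the length-mismatch error:
tokenized_dataset = drug_dataset.map(
    tokenize_and_split, batched=True, remove_columns=drug_dataset["train"].column_names
)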
From Datasets to DataFrames and back: Change the format of a Dataset to pandas
drug_dataset.set_format("pandas")
# or, for a single split: df = drug_dataset["train"].to_pandas()
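With the format set, indexing a split returns a pandas.DataFrame; remember to switch back afterwards (usage sketch following the course):
train_df = drug_dataset["train"][:]  # a pandas.DataFrame
drug_dataset.reset_format()          # back to the default Arrow format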
Creating a validation set: Create a validation set
drug_dataset_clean = drug_dataset["train"].train_test_split(train_size=0.8, seed=42)
# Rename the default "test" split to "validation"
drug_dataset_clean["validation"] = drug_dataset_clean.pop("test")
# Add the "test" set to our `DatasetDict`
drug_dataset_clean["test"] = drug_dataset["test"]
drug_dataset_clean
Saving a dataset: Save dataset in multiple splits in JSON format
for split, dataset in drug_dataset_clean.items():
    dataset.to_json(f"drug-reviews-{split}.jsonl")
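The saved splits can then be reloaded with the "json" loading script (file names assumed to match those written above):
data_files = {
    "train": "drug-reviews-train.jsonl",
    "validation": "drug-reviews-validation.jsonl",
    "test": "drug-reviews-test.jsonl",
}
drug_dataset_reloaded = load_dataset("json", data_files=data_files)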
What if my dataset isn’t on the Hub?: Loading a local dataset: What is the basic syntax for loading a local dataset?
from datasets import load_dataset

squad_it_dataset = load_dataset("json", data_files="SQuAD_it-train.json", field="data")
Source: What if my dataset isn’t on the Hub? Loading a local dataset
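To load several local splits at once, data_files can be a dictionary mapping split names to files (as shown in the course; compressed .json.gz files also work directly):
data_files = {"train": "SQuAD_it-train.json", "test": "SQuAD_it-test.json"}
squad_it_dataset = load_dataset("json", data_files=data_files, field="data")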
Display the memory usage of the current process (e.g. after loading a huge dataset).
Process.memory_info is expressed in bytes, so convert to megabytes
import psutil
print(f"RAM used: {psutil.Process().memory_info().rss / (1024 * 1024):.2f} MB")
How to load a dataset that doesn’t fit into machine memory
pubmed_dataset_streamed = load_dataset("json", data_files=data_files, split="train", streaming=True)
How to access an element of a streamed dataset
next(iter(pubmed_dataset_streamed))
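To grab more than one element, IterableDataset also provides take() and skip(), which return new streamed datasets (used in the course to carve out validation and training streams):
first_examples = list(pubmed_dataset_streamed.take(5))   # first 5 examples as a list
validation_dataset = pubmed_dataset_streamed.take(1000)  # first 1,000 examples
train_dataset = pubmed_dataset_streamed.skip(1000)       # everything after the first 1,000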