Hugging Face ecosystem | HF NLP course | 5. The Hugging Face Datasets library Flashcards
Slicing and dicing our data: Load TSV data
from datasets import load_dataset

data_files = {"train": "drugsComTrain_raw.tsv", "test": "drugsComTest_raw.tsv"}
# \t is the tab character in Python
drug_dataset = load_dataset("csv", data_files=data_files, delimiter="\t")
Slicing and dicing our data: Find the number of unique items in each split
for split in drug_dataset.keys():
    assert len(drug_dataset[split]) == len(drug_dataset[split].unique(col_name))

# Equivalently, iterating over (split, subset) pairs:
for split, subset in drug_dataset.items():
    assert len(subset) == len(subset.unique(col_name))
# col_name is the column to check, e.g. "Unnamed: 0" (the anonymized patient ID) in the course
Slicing and dicing our data: Filter rows where a column is None using a lambda expression
drug_dataset = drug_dataset.filter(lambda x: x["condition"] is not None)
https://huggingface.co/course/chapter5/3?fw=pt#slicing-and-dicing-our-data
Creating new columns: Create a new column with a function (that computes the number of words in a text column) and map()
def compute_review_length(example):
    return {"review_length": len(example["review"].split())}
drug_dataset = drug_dataset.map(compute_review_length)
Creating new columns: Sort a dataset by a column
drug_dataset["train"].sort("review_length")
The map() method’s superpowers: How to speed up applying a function to a column
.map(…, batched=True)
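For example, tokenizing in batches with a fast tokenizer (a sketch along the lines of the course; the tokenizer object is assumed to exist already):
def tokenize_function(examples):
    # With batched=True, examples["review"] is a list of strings
    return tokenizer(examples["review"], truncation=True)

tokenized_dataset = drug_dataset.map(tokenize_function, batched=True)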
The map() method’s superpowers: How to speed up applying a function to a column using parallelization
.map(…, num_proc=n)
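For example, combining batching with multiprocessing (the number of processes is illustrative; the course notes that num_proc mainly helps with slow tokenizers, since fast tokenizers already parallelize internally):
tokenized_dataset = drug_dataset.map(tokenize_function, batched=True, num_proc=8)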
The map() method’s superpowers: Truncate while tokenizing but return all chunks
def tokenize_and_split(examples):
    return tokenizer(
        examples["review"],
        truncation=True,
        max_length=128,
        return_overflowing_tokens=True,
    )
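In the course this function is applied with batched=True, removing the old columns so the sizes line up (each review can produce several chunks, so the output has more rows than the input):
tokenized_dataset = drug_dataset.map(
    tokenize_and_split, batched=True, remove_columns=drug_dataset["train"].column_names
)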
From Datasets to DataFrames and back: Change the format of a Dataset to pandas
drug_dataset.set_format("pandas")
# or, for a single split: df = drug_dataset["train"].to_pandas()
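Once the format is set to "pandas", indexing a split returns a DataFrame; reset_format() switches back afterwards (as in the course):
train_df = drug_dataset["train"][:]  # the whole split as a pandas.DataFrame
# ... work with the DataFrame ...
drug_dataset.reset_format()  # return to the default output format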
Creating a validation set: Create a validation set
drug_dataset_clean = drug_dataset["train"].train_test_split(train_size=0.8, seed=42)
# Rename the default "test" split to "validation"
drug_dataset_clean["validation"] = drug_dataset_clean.pop("test")
# Add the "test" set to our `DatasetDict`
drug_dataset_clean["test"] = drug_dataset["test"]
drug_dataset_clean
Saving a dataset: Save dataset in multiple splits in JSON format
for split, dataset in drug_dataset_clean.items(): dataset.to_json(f"drug-reviews-{split}.jsonl")
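The saved splits can be reloaded later with the same json loading script:
data_files = {
    "train": "drug-reviews-train.jsonl",
    "validation": "drug-reviews-validation.jsonl",
    "test": "drug-reviews-test.jsonl",
}
drug_dataset_reloaded = load_dataset("json", data_files=data_files)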
What if my dataset isn’t on the Hub?: Loading a local dataset: What is the basic syntax for loading a local dataset?
Source: What if my dataset isn’t on the Hub? Loading a local dataset
from datasets import load_dataset

squad_it_dataset = load_dataset("json", data_files="SQuAD_it-train.json", field="data")
Display the memory usage of the current process (useful when working with a huge dataset).
Process.memory_info is expressed in bytes, so convert to megabytes.
import psutil
print(f"RAM used: {psutil.Process().memory_info().rss / (1024 * 1024):.2f} MB")
How to load a dataset that doesn’t fit into machine memory
pubmed_dataset_streamed = load_dataset("json", data_files=data_files, split="train", streaming=True)
How to access an element of a streamed dataset
next(iter(pubmed_dataset_streamed))
Example of how to tokenize a streamed dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
tokenized_dataset = pubmed_dataset_streamed.map(lambda x: tokenizer(x["text"]))
next(iter(tokenized_dataset))
How to select elements of a streamed dataset
dataset_head = pubmed_dataset_streamed.take(5)
list(dataset_head)
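The complementary method is skip(); the course pairs the two to split a shuffled stream into training and validation sets:
shuffled_dataset = pubmed_dataset_streamed.shuffle(buffer_size=10_000, seed=42)
# Skip the first 1,000 examples for training, take the first 1,000 for validation
train_dataset = shuffled_dataset.skip(1000)
validation_dataset = shuffled_dataset.take(1000)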
Way to combine multiple datasets that don’t fit into memory together
from itertools import islice
from datasets import interleave_datasets

combined_dataset = interleave_datasets([pubmed_dataset_streamed, law_dataset_streamed])
list(islice(combined_dataset, 2))
Function that returns an iterator over selected elements of an iterable.
itertools.islice()
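A minimal standalone sketch of how islice() selects elements:
from itertools import islice

list(islice(range(10), 3))  # [0, 1, 2] -- the first three elements
list(islice(range(10), 2, 8, 2))  # [2, 4, 6] -- start, stop, step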
Python review: How to get data via HTTP.
Source: Creating your own dataset > Getting the data
response = requests.get(url)
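A minimal sketch of fetching GitHub issues with requests (the URL and parameters here are illustrative, not the exact ones from the course):
import requests

url = "https://api.github.com/repos/huggingface/datasets/issues"
response = requests.get(url, params={"page": 1, "per_page": 10})
response.status_code  # 200 on success
batch = response.json()  # list of issue dicts parsed from the JSON body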
Pandas review: How to create a dataframe from a list of dicts.
Source: Creating your own dataset > Getting the data
df = pd.DataFrame.from_records(all_issues)
Pandas review: How to write out a dataframe as line-delimited JSON.
Source: Creating your own dataset > Getting the data
df.to_json(f"{issues_path}/{repo}-issues.jsonl", orient="records", lines=True)
Idiom to remove the complement of a set of columns from a Dataset.
Source: Semantic search with FAISS
columns = issues_dataset.column_names
columns_to_keep = ["title", "body", "html_url", "comments"]
columns_to_remove = set(columns_to_keep).symmetric_difference(columns)
issues_dataset = issues_dataset.remove_columns(columns_to_remove)
Pandas function that flattens a column of lists into one row per list element, repeating the values of the other columns.
Source: Semantic search with FAISS
comments_df = df.explode("comments", ignore_index=True)
comments_df.head(4)
Simple way to speed up the embedding process.
Source: Semantic search with FAISS
device = torch.device("cuda")
model.to(device)
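A more defensive variant (a sketch, not from the course) falls back to the CPU when no GPU is available:
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)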
Review: Idiom to get a tokenizer and a model.
Source: Semantic search with FAISS
from transformers import AutoTokenizer, AutoModel

model_ckpt = "sentence-transformers/multi-qa-mpnet-base-dot-v1"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModel.from_pretrained(model_ckpt)
Idiom to do CLS pooling.
Source: Semantic search with FAISS
def cls_pooling(model_output):
    return model_output.last_hidden_state[:, 0]
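In the course, CLS pooling is wrapped in a helper that embeds a batch of texts, assuming the tokenizer, model, and device from the cards above:
def get_embeddings(text_list):
    encoded_input = tokenizer(
        text_list, padding=True, truncation=True, return_tensors="pt"
    )
    encoded_input = {k: v.to(device) for k, v in encoded_input.items()}
    model_output = model(**encoded_input)
    return cls_pooling(model_output)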
How to index a Dataset column with FAISS.
Source: Semantic search with FAISS
embeddings_dataset.add_faiss_index(column="embeddings")
Use the FAISS index to do nearest neighbor search and return in descending order.
Source: Semantic search with FAISS
scores, samples = embeddings_dataset.get_nearest_examples(
    "embeddings", question_embedding, k=5
)

import pandas as pd

samples_df = pd.DataFrame.from_dict(samples)
samples_df["scores"] = scores
samples_df.sort_values("scores", ascending=False, inplace=True)
PyTorch review: Meaning of x.unsqueeze(-1)
Inserts a new dimension of size 1 at the last position, so a tensor of shape (N,) becomes (N, 1).
Reference: https://www.educba.com/pytorch-unsqueeze/
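A quick shape check:
import torch

x = torch.ones(3)  # shape: torch.Size([3])
x.unsqueeze(-1).shape  # torch.Size([3, 1])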
PyTorch review: Idiom to sum over dimension 1 (e.g., summing the entries of each row of a 2-D tensor).
torch.sum(a, 1)
Source: https://pytorch.org/docs/stable/generated/torch.sum.html
PyTorch review: How to normalize a tensor along a dimension by its Euclidean (L2) norm.
F.normalize(x, p=2, dim=1)  # torch.nn.functional; divides each row by its L2 norm
Source: https://pytorch.org/docs/stable/generated/torch.nn.functional.normalize.html
Idiom to do mean pooling.
import torch
import torch.nn.functional as F

def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output.last_hidden_state
    # Expand the attention mask so padding tokens are excluded from the average
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(
        input_mask_expanded.sum(1), min=1e-9
    )

sentence_embeddings = mean_pooling(model_output, encoded_input["attention_mask"])
# Normalize the embeddings.
sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)
print(f"Sentence embeddings shape: {sentence_embeddings.size()}")