Deep Learning Fundamentals - Hugging Face Flashcards
What is Hugging Face?
Hugging Face is a leading technology company in natural language processing (NLP) and machine learning, known for its open-source library, Transformers, which provides access to state-of-the-art NLP models and tools.
What is Tokenization? What are Tokens?
It’s like cutting a sentence into individual pieces, such as words or characters, to make it easier to analyze.
Tokens are the pieces you get after cutting up text during tokenization. They can be words, parts of words, or even single letters.
These tokens are converted to numerical values for models to understand.
Code Example:
~~~
from transformers import BertTokenizer

# Initialize the tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# See how many tokens are in the vocabulary
print(tokenizer.vocab_size)
# 30522

# Tokenize the sentence
tokens = tokenizer.tokenize("I heart Generative AI")

# Print the tokens
print(tokens)
# ['i', 'heart', 'genera', '##tive', 'ai']

# Show the token ids assigned to each token
print(tokenizer.convert_tokens_to_ids(tokens))
# [1045, 2540, 11416, 6024, 9932]
~~~
Why are Hugging Face models so good? What does the no_grad method mean?
Hugging Face models provide a quick way to get started using models trained by the community. With only a few lines of code, you can load a pre-trained model and start using it on tasks such as sentiment analysis.
torch.no_grad() disables gradient tracking, signaling that the model is being used only for prediction (inference), not for training.
Code Example:
~~~
import torch
from transformers import BertForSequenceClassification, BertTokenizer

# Load a pre-trained sentiment analysis model
model_name = "textattack/bert-base-uncased-imdb"
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Tokenize the input sequence
tokenizer = BertTokenizer.from_pretrained(model_name)
inputs = tokenizer("I love Generative AI", return_tensors="pt")

# Make prediction (no gradients are tracked during inference)
with torch.no_grad():
    outputs = model(**inputs).logits
    probabilities = torch.nn.functional.softmax(outputs, dim=1)
    predicted_class = torch.argmax(probabilities)

# Display sentiment result
if predicted_class == 1:
    print(f"Sentiment: Positive ({probabilities[0][1] * 100:.2f}%)")
else:
    print(f"Sentiment: Negative ({probabilities[0][0] * 100:.2f}%)")
# Sentiment: Positive (88.68%)
~~~
Why is Hugging Face Datasets so good?
The Hugging Face Datasets library is a powerful tool for managing a variety of data types, like text and images, efficiently and easily. Backed by Apache Arrow, it memory-maps datasets from disk, so it stays fast and light on RAM even for very large projects.
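For instance, a full dataset can be loaded and inspected in a couple of lines. A minimal sketch (the choice of the IMDB dataset here is just an illustration):
~~~
from datasets import load_dataset

# Download and cache the IMDB movie review dataset
dataset = load_dataset("imdb")

# Inspect the first training example (a review and its label)
print(dataset["train"][0])
~~~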
What are Hugging Face Trainers?
The Hugging Face Trainer class offers a simplified approach to training and fine-tuning transformer models: it wraps the training loop, evaluation, logging, and checkpointing behind a single API.
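A minimal sketch of how the pieces fit together (the model name, the IMDB dataset, the hyperparameters, and the output directory are illustrative assumptions, not prescribed values):
~~~
from datasets import load_dataset
from transformers import (BertForSequenceClassification, BertTokenizer,
                          Trainer, TrainingArguments)

model_name = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Tokenize the dataset, truncating long reviews and padding short ones
dataset = load_dataset("imdb")
def tokenize(batch):
    return tokenizer(batch["text"], padding="max_length", truncation=True)
dataset = dataset.map(tokenize, batched=True)

# The Trainer handles the training loop, evaluation, and checkpointing
training_args = TrainingArguments(
    output_dir="./results",          # where checkpoints go (illustrative path)
    per_device_train_batch_size=8,
    num_train_epochs=1,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
)
trainer.train()
~~~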
What are Truncating and Padding? Why do we use padding in machine learning models?
Truncating: This refers to shortening longer pieces of text to fit a certain size limit.
Padding: Adding extra data to shorter texts to reach a uniform length for processing.
We use padding to ensure that all input sequences in a batch have the same length, since models expect fixed-shape tensors as input.
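Both behaviors are options on the tokenizer call. A minimal sketch (the sentences and the max_length value are made-up examples):
~~~
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

sentences = ["A short sentence.",
             "A much longer sentence that will be cut down to the size limit."]

# padding=True pads shorter inputs; truncation=True cuts longer ones to max_length
inputs = tokenizer(sentences, padding=True, truncation=True,
                   max_length=8, return_tensors="pt")

# Both rows now have the same length, so they stack into one tensor
print(inputs["input_ids"].shape)
~~~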