Large Language Model Vocabulary Flashcards
What is a T5 model?
(Made with ChatGPT)
T5 (short for “Text-To-Text Transfer Transformer”) is a large language model developed by Google that is trained to perform a wide range of natural language processing tasks, including translation, summarization, question answering, and text generation.
Introduced in 2019 (the accompanying paper was published in 2020)
It is a variant of the Transformer neural network architecture, which was introduced in the paper “Attention Is All You Need.”
One of the key features of T5 is its ability to perform many different tasks using a single, unified model. This is achieved by training the model to take in a task description as input, along with the text to be processed, and to generate an output that is appropriate for the given task. For example, to translate text from one language to another, the model is given a task prefix such as “translate English to German:” followed by the text to be translated, and it generates the translation as output.
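For instance, here is a minimal sketch of the text-to-text interface using the Hugging Face transformers library (the library, checkpoint name, and exact prefix format are assumptions of this example, not part of the original card):

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Load a small public T5 checkpoint (assumed name: "t5-small").
tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# The task is specified as a plain-text prefix on the input.
input_ids = tokenizer(
    "translate English to German: The house is wonderful.",
    return_tensors="pt",
).input_ids

output_ids = model.generate(input_ids, max_new_tokens=40)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```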
T5 has achieved state-of-the-art results on a number of natural language processing benchmarks and has been used in a variety of applications, including machine translation, language understanding, and text generation.
Some important things to know about T5 include:
It is a large-scale language model, meaning that it is trained on a very large dataset and has a very large number of parameters. This allows it to achieve high performance on a wide range of tasks, but also requires a lot of computational resources to train and use.
It is trained using a method called self-supervised learning: spans of text are masked out of a sentence and the model is trained to reconstruct them from the surrounding context (see the example after this list). This allows it to learn about the structure of language and the relationships between words, which enables it to perform a wide range of tasks.
It is designed to be highly flexible and modular, allowing it to be fine-tuned for a wide range of specific tasks and applications. This makes it well-suited for use in a variety of different settings.
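As a concrete illustration of the self-supervised objective (this follows the span-corruption example format from the T5 paper; it is not part of the original card): given the sentence “Thank you for inviting me to your party last week.”, the model’s input might be “Thank you <X> me to your party <Y> week.” and the target it must generate is “<X> for inviting <Y> last <Z>”, where <X>, <Y>, and <Z> are sentinel tokens marking the masked spans.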
What is the Pile dataset?
The Pile is an 825 GiB open-source language modeling data set, made by EleutherAI, that consists of 22 smaller datasets combined.
Essentially, it is a massive dataset designed for training NLP models in the style of GPT-2 and GPT-3; EleutherAI’s own GPT-Neo and GPT-J were trained on it. The dataset is open source and contains over 800 GB of English-language text.
The importance of the Pile is the diversity of its data sources, which improves both general cross-domain knowledge and performance on downstream NLP tasks.
Why is the Pile a good training set?
Recent work has shown that, especially for large models, diversity in data sources improves the general cross-domain knowledge of the model as well as its downstream generalization capability. In EleutherAI’s evaluations, models trained on the Pile not only show moderate improvements on traditional language modeling benchmarks, they also show significant improvements on Pile BPB.
Why is the Pile a good benchmark?
To score well on Pile BPB (bits per byte), a model must be able to understand many disparate domains including books, github repositories, webpages, chat logs, and medical, physics, math, computer science, and philosophy papers. Pile BPB is a measure of world knowledge and reasoning ability in these domains, making it a robust benchmark of general, cross-domain text modeling ability for large language models.
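As a rough sketch of how the metric works (all numbers below are invented for illustration): a model’s average per-token cross-entropy loss is converted from nats to bits and divided by the number of bytes in the evaluation text.

```python
import math

loss_nats_per_token = 2.0   # hypothetical average negative log-likelihood, in nats
num_tokens = 1_000_000      # hypothetical token count of the evaluation text
num_bytes = 4_300_000       # hypothetical UTF-8 byte count of the same text

bits_total = loss_nats_per_token * num_tokens / math.log(2)  # nats -> bits
bpb = bits_total / num_bytes                                 # bits per byte
print(f"BPB: {bpb:.3f}")
```

A lower BPB means the model predicts (i.e., compresses) the text better.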
What does GPT stand for?
“GPT” is short for generative pre-trained transformer
GPT-J
GPT-J is an open source, autoregressive language model created by a group of researchers called EleutherAI.
The GPT-J model was released in the kingoflolz/mesh-transformer-jax repository by Ben Wang and Aran Komatsuzaki.
It is a GPT-2-like causal language model trained on the Pile dataset.
There are multiple versions, such as GPT-J-6B, which is one of the most capable open alternatives to OpenAI’s GPT-3 and performs well on a wide array of natural language tasks such as chat, summarization, and question answering, to name a few.
For a deeper dive, GPT-J is a transformer model trained using Ben Wang’s Mesh Transformer JAX. “GPT” is short for generative pre-trained transformer, “J” distinguishes this model from other GPT models, and “6B” represents the 6 billion trainable parameters.
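As a sketch, GPT-J-6B can be loaded through the Hugging Face transformers library (the checkpoint name follows EleutherAI’s public release; loading the full model takes tens of gigabytes of memory):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")

prompt = "The Pile is a dataset that"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
output_ids = model.generate(input_ids, max_new_tokens=30, do_sample=True)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```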
Transformers
A transformer is a type of neural network architecture that first appeared in 2017 (in the paper “Attention Is All You Need”). It was originally designed for natural language processing tasks but is rapidly taking on vision and other new areas as well.
GPT is an example.
In other approaches to AI, the system would first focus on local patches of input data and then build up to the whole. In a language model, for example, nearby words would first get grouped together. The transformer, by contrast, runs processes so that every element in the input data connects, or pays attention, to every other element. Researchers refer to this as “self-attention.” This means that as soon as it starts training, the transformer can see traces of the entire data set.
This is a game changer in the world of deep learning architectures because it eliminates the need for recurrent connections and convolutions.
For example in NLP, the key innovation of the transformer architecture is the use of self-attention mechanisms, which allow the model to consider the relationships between all the input words simultaneously, rather than processing them in a sequential manner as is done in traditional recurrent neural networks. This allows the model to capture long-range dependencies in the input data and to perform well on tasks that require a global understanding of the input.
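A minimal sketch of that self-attention computation (single head, no masking; the sizes are made up for illustration):

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    # x: (seq_len, d_model) embeddings; w_q/w_k/w_v: (d_model, d_k) projections.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / (k.shape[-1] ** 0.5)  # every position scores every other position
    weights = F.softmax(scores, dim=-1)      # each row sums to 1
    return weights @ v                       # weighted mix over all positions

x = torch.randn(5, 16)                       # 5 tokens, 16-dim embeddings
w_q, w_k, w_v = (torch.randn(16, 8) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)       # shape: (5, 8)
```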
Before transformers came along, progress on AI language tasks largely lagged behind developments in other areas. “In this deep learning revolution that happened in the past 10 years or so, natural language processing was sort of a latecomer,” said the computer scientist Anna Rumshisky of the University of Massachusetts, Lowell. “So NLP was, in a sense, behind computer vision. Transformers changed that.”
Transformers quickly became the front-runner for applications like word recognition that focus on analyzing and predicting text. They led to a wave of tools, like OpenAI’s Generative Pre-trained Transformer 3 (GPT-3), which trains on hundreds of billions of words and generates consistent new text to an unsettling degree.
ViT
The Vision Transformer, or ViT, was one of the first models to apply the transformer approach to computer vision tasks.
Alexey Dosovitskiy, a computer scientist then at Google Brain Berlin, was working on computer vision, the AI subfield that focuses on teaching computers how to process and classify images. Like almost everyone else in the field, he worked with convolutional neural networks (CNNs), which for years had propelled all major leaps forward in deep learning and especially in computer vision.
Dosovitskiy was working on one of the biggest challenges in the field: scaling up CNNs to train on ever-larger data sets of ever-higher-resolution images without piling on the processing time. He wondered why transformers couldn’t work on images as well.
The eventual result was a network dubbed the Vision Transformer, or ViT, which the researchers presented at a conference in May 2021. The architecture of the model was nearly identical to that of the first transformer proposed in 2017, with only minor changes allowing it to analyze images instead of words. “Language tends to be discrete,” said Rumshisky, “so a lot of adaptations have to discretize the image.”
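A minimal sketch of that adaptation (the patch size and embedding width below are illustrative, not the exact ViT configuration): the image is cut into fixed-size patches, each patch is flattened and linearly projected, and the resulting sequence of “patch tokens” is fed to a standard transformer encoder.

```python
import torch
import torch.nn as nn

image = torch.randn(1, 3, 224, 224)                 # (batch, channels, height, width)
patch_size, d_model = 16, 768

to_patches = nn.Unfold(kernel_size=patch_size, stride=patch_size)
patches = to_patches(image).transpose(1, 2)         # (1, 196, 3 * 16 * 16)
tokens = nn.Linear(3 * patch_size**2, d_model)(patches)  # (1, 196, 768)
# `tokens` now plays the role that word embeddings play in an NLP transformer.
```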
HCI
Human-computer interaction
Data poisoning
The concept of bad actors posting malicious data online that, when scraped into training sets, damages the value of the models trained on it. Relevant to any model trained on large web-scraped datasets.
DALL-E 2
A text-to-image generation model released by OpenAI in April 2022. DALL-E 2 is a great example of the power of diffusion models in deep learning.
At the highest level, DALL-E 2 works very simply (a pseudocode sketch follows these steps):
First, a text prompt is input into a text encoder that is trained to map the prompt to a representation space.
Next, a model called the prior maps the text encoding to a corresponding image encoding that captures the semantic information of the prompt contained in the text encoding.
Finally, an image decoder stochastically generates an image which is a visual manifestation of this semantic information.
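Illustrative pseudocode for that three-stage pipeline (text_encoder, prior, and decoder are hypothetical stand-ins for the components described above, not OpenAI’s actual code):

```python
def generate_image(prompt, text_encoder, prior, decoder):
    text_embedding = text_encoder(prompt)    # 1. prompt -> text encoding (CLIP text encoder)
    image_embedding = prior(text_embedding)  # 2. text encoding -> image encoding (the prior)
    return decoder(image_embedding)          # 3. image encoding -> image (stochastic diffusion decoder)
```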
How does DALL-E 2 know how a textual concept like “teddy bear” is manifested in the visual space? The link between textual semantics and their visual representations in DALL-E 2 is learned by another OpenAI model called CLIP (Contrastive Language-Image Pre-training).
After training, the CLIP model is frozen and DALL-E 2 moves onto its next task - learning to reverse the image encoding mapping that CLIP just learned.
CLIP learns a representation space in which it is easy to determine the relatedness of textual and visual encodings, but our interest is in image generation. We must therefore learn how to exploit the representation space to accomplish this task.
In particular, OpenAI employs a modified version of another one of its previous models, GLIDE, to perform this image generation. The GLIDE model learns to invert the image encoding process in order to stochastically decode CLIP image embeddings.
The goal is not to build an autoencoder and exactly reconstruct an image given its embedding, but instead to generate an image which maintains the salient features of the original image given its embedding. In order to perform this image generation, GLIDE uses a diffusion model.
CLIP
CLIP (Contrastive Language-Image Pre-training) is an OpenAI model trained with a contrastive objective to measure how strongly a given piece of text relates to an image.
It is trained on hundreds of millions of images and their associated captions, learning how much a given text snippet relates to an image. That is, rather than trying to predict a caption given an image, CLIP instead just learns how related any given caption is to an image. This contrastive rather than predictive objective allows CLIP to learn the link between textual and visual representations of the same abstract object.
The entire DALL-E 2 model hinges on CLIP’s ability to learn semantics from natural language.
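A minimal sketch of using CLIP to score text-image relatedness with the Hugging Face transformers library (the checkpoint name and the local image path are assumptions of this example):

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")                      # hypothetical local image
captions = ["a teddy bear", "a city skyline at night"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
probs = model(**inputs).logits_per_image.softmax(dim=-1)  # relatedness of each caption
print(dict(zip(captions, probs[0].tolist())))
```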
GLIDE
A diffusion model from OpenAI, which was one of the first diffusion models modified to allow for text-conditional image generation.
GLIDE extends the core concept of diffusion models by augmenting the training process with additional textual information, ultimately resulting in text-conditional image generation.
GLIDE is used by DALL-E 2 to generate images.
The GLIDE model learns to invert the image encoding process in order to stochastically decode CLIP image embeddings.
The modified GLIDE generation model maps from the representation space into the image space via reverse diffusion, generating one of many possible images that convey the semantic information within the input caption.
Diffusion model
Diffusion models are a thermodynamics-inspired, relatively recent invention that has grown significantly in popularity in recent years.
Diffusion Models learn to generate data by reversing a gradual noising process.
Popular diffusion models include OpenAI’s DALL-E 2 (the GLIDE part), Google’s Imagen, and Stability AI’s Stable Diffusion.
At a high level, diffusion models work by destroying training data with added noise and then learning to recover the data by reversing this noising process. In other words, diffusion models can generate coherent images from noise.
Once trained, the model is “cut in half”: only the de-noising process is kept, and it can be applied to random noise seeds to generate images.
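A minimal DDPM-style sketch of that training idea (the noise schedule and model interface are illustrative assumptions, not the exact code of any particular model):

```python
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)   # cumulative signal fraction

def training_step(model, x0):
    # Noise a clean image x0 to a random step t, then train the model to predict that noise.
    t = torch.randint(0, T, (x0.shape[0],))
    eps = torch.randn_like(x0)
    a = alphas_bar[t].view(-1, 1, 1, 1)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * eps   # forward (noising) process
    return F.mse_loss(model(x_t, t), eps)        # learn to reverse it
```

At generation time, only the learned de-noising direction is used: start from pure noise and repeatedly apply the model to produce an image.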
If combined with text-to-image guidance, these models can be used to create a near-infinite variety of images from text alone by conditioning the image generation process. Inputs from embeddings like CLIP can guide the seeds to provide powerful text-to-image capabilities.
What about diffusion models makes them so strikingly different from their predecessors? The most apparent answer is their ability to generate highly realistic imagery and match the distribution of real images better than GANs.
In-painting
Inpainting is an editing capability (primarily for images, though the idea extends to other media) that enables users to modify regions of an image and replace those regions with content generated by an associated diffusion model.
For images, you start with a real-world or generated image. The area to change is “masked”, which basically means labeling it, much like marking an area with a blemish. A diffusion model can then fill the masked regions in. For example, you can use a text-conditioned model like GLIDE to generate a new jacket on a person in an image by first labeling the jacket (the mask) and then having new content generated for that region.
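A rough sketch of the same mask-and-fill workflow, using the Stable Diffusion inpainting pipeline from the diffusers library rather than GLIDE (the checkpoint name and file paths are assumptions of this example):

```python
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting"   # assumed checkpoint name
)

init_image = Image.open("model_photo.png")   # original image (hypothetical path)
mask_image = Image.open("jacket_mask.png")   # white pixels mark the region to regenerate
result = pipe(
    prompt="a red leather jacket",
    image=init_image,
    mask_image=mask_image,
).images[0]
result.save("inpainted.png")
```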
Out-painting
The use of a diffusion model to extend a sample (such as an image) “beyond its borders”, for example adding more scene around a photo. The generation can be text-conditioned, as with DALL-E 2.
Outpainting requires an extra layer of prompt refinement in order to generate coherent scenes, but enables you to quickly create large images that would take significantly longer to create with traditional methods.
Multi-modal
In the context of foundation models, “multi-modal” typically refers to models that are able to process and understand multiple types of data, such as text, images, and audio. These models are able to take in multiple forms of input and use this information to make decisions or predictions.
For example, a multi-modal foundation model might be trained to understand both the text of a news article and the images that accompany it, in order to accurately classify the content of the article. This ability to process and understand multiple types of data can make multi-modal models more powerful and versatile than models that can only process a single type of data.