Generative AI Flashcards

1
Q

In the context of LLM prompting, explain what chain of thought, tree of thought, and graph of thought are.

A

In the context of prompting Large Language Models (LLMs), Chain of Thought (CoT), Tree of Thought (ToT), and Graph of Thought (GoT) are techniques used to guide the model’s reasoning process and improve the quality and coherence of the generated responses.

**1. Chain of Thought (CoT):**
- CoT prompting involves providing a series of intermediate reasoning steps, or a sequence of thoughts, as part of the prompt.
- The model is encouraged to generate a step-by-step explanation or thought process before arriving at the final answer.
- Example prompt: “Question: What is the capital of France? Let’s think step by step:
Step 1: France is a country in Europe.
Step 2: The capital of a country is usually its largest and most important city.
Step 3: Paris is the largest and most well-known city in France.
Therefore, the capital of France is Paris.”
- CoT prompting has been shown to improve the model’s ability to provide more accurate and reasoned responses, especially for tasks that require multiple reasoning steps.

**2. Tree of Thought (ToT):**
- ToT prompting extends the idea of CoT by organizing the reasoning steps into a tree-like structure.
- Each node in the tree represents a thought or a reasoning step, and the edges represent the dependencies or relationships between the thoughts.
- Rather than committing to a single chain, the model can propose several candidate thoughts at each step, evaluate them, and expand the most promising branches, backtracking if a branch turns out to be a dead end.
- The model is prompted to generate the tree of thoughts, starting from the root node and expanding to the leaf nodes.
- Example prompt: “Question: What is the capital of France? Let’s build a tree of thoughts:
Root: France
- Node 1: France is a country in Europe.
- Node 2: Capitals
- Node 2.1: The capital of a country is usually its largest and most important city.
- Node 2.2: Paris is the largest and most well-known city in France.
Leaf: Therefore, the capital of France is Paris.”
- ToT prompting allows for more structured and hierarchical reasoning, enabling the model to break down complex problems into smaller subproblems.

**3. Graph of Thought (GoT):**
- GoT prompting represents the reasoning process as a graph, where nodes represent thoughts, concepts, or entities, and edges represent the relationships or connections between them.
- Unlike a tree, a graph allows thoughts to be merged, aggregated, or revisited, so independent lines of reasoning can be combined into a single conclusion.
- The model is prompted to generate a graph of thoughts, capturing the relevant concepts and their relationships.
- Example prompt: “Question: What is the capital of France? Let’s create a graph of thoughts:
Nodes: France, Europe, Capital, Paris
Edges:
- France is a country in Europe.
- Capital is the largest and most important city of a country.
- Paris is the largest and most well-known city in France.
- Paris is the capital of France.”
- GoT prompting allows for more flexible and expressive reasoning, enabling the model to capture complex relationships and dependencies between concepts.

These prompting techniques aim to provide more structure and guidance to the model’s reasoning process, encouraging it to generate more coherent, logical, and interpretable responses. By explicitly modeling the intermediate steps or relationships, CoT, ToT, and GoT prompting can improve the model’s performance on tasks that require complex reasoning, such as multi-step question answering, logical inference, and decision making.

2
Q

In the context of LLM prompting, explain what we mean by in-context learning.

A

In-context learning is a prompt-engineering technique in which a language model learns a task from instructions and/or a few examples placed directly in the prompt, without any update to the model's weights. Few-shot prompting is the most common form: the prompt contains several input-output pairs, and the model infers the pattern and applies it to a new input.
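A minimal illustration of a few-shot prompt; the sentiment-classification task and all example strings are made up for demonstration:

```python
# A minimal sketch of a few-shot (in-context learning) prompt.
# The sentiment-classification task and all example strings are hypothetical.
examples = [
    ("The movie was a waste of time.", "negative"),
    ("Absolutely loved the soundtrack!", "positive"),
    ("The plot was predictable but the acting was great.", "positive"),
]

def build_few_shot_prompt(query: str) -> str:
    """Concatenate labelled examples followed by the new input to be completed."""
    lines = ["Classify the sentiment of each review as positive or negative.", ""]
    for text, label in examples:
        lines.append(f"Review: {text}\nSentiment: {label}\n")
    lines.append(f"Review: {query}\nSentiment:")  # the model completes this line
    return "\n".join(lines)

print(build_few_shot_prompt("The ending felt rushed and unsatisfying."))
```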

3
Q

What is RAG?

A

RAG stands for Retrieval-Augmented Generation. It is a model architecture that combines the benefits of retriever models and generator models for tasks like question answering or dialogue.

In a RAG model, when a query (like a question) is input, a retriever component first identifies relevant context documents from a large corpus of knowledge. These documents are then provided as input to a generator model which produces a response.

The key idea behind RAG is to allow the model to dynamically select relevant information at inference time. This is in contrast to a more traditional approach where a model is trained on a fixed set of documents and cannot incorporate new information after training.

RAG models can be implemented in various ways depending on the specifics of the retriever and generator components. For example, the retriever could be a simple nearest-neighbor model or a more complex transformer-based model, and the generator could be a standard seq2seq model or a large language model.

4
Q

In LLMs, what are adapters?

A

In the context of Large Language Models (LLMs), “adapters” are lightweight, trainable modules that are added to a pre-trained model to adapt it to specific tasks without modifying the original model parameters. Adapters allow for task-specific fine-tuning while preserving the knowledge encoded in the pre-trained model.

An adapter is usually a small bottleneck neural network inserted between the layers of the pre-trained model. For example, in a transformer-based LLM, an adapter can be added after the self-attention and feed-forward sublayers within each transformer block, with a residual connection so it only learns a small correction to the layer's output.
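A minimal sketch of what such a module can look like, assuming a PyTorch-style implementation; the hidden and bottleneck dimensions are illustrative:

```python
# A minimal sketch of a bottleneck adapter module, assuming PyTorch.
# Dimensions and placement are illustrative, not tied to any specific model.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, hidden_dim: int, bottleneck_dim: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)  # project down
        self.up = nn.Linear(bottleneck_dim, hidden_dim)    # project back up
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection: the adapter only learns a small correction.
        return x + self.up(self.act(self.down(x)))

# During fine-tuning, the pre-trained transformer weights stay frozen and only the
# adapter parameters (plus, typically, layer norms and the task head) are trained.
hidden = torch.randn(2, 16, 768)   # (batch, sequence, hidden_dim)
adapter = Adapter(hidden_dim=768)
out = adapter(hidden)              # same shape as the input
```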

The primary advantage of using adapters is that far fewer parameters are updated during fine-tuning, which reduces the risk of overfitting, especially when the amount of task-specific data is small. It also makes it possible to serve a single pre-trained model for multiple tasks simultaneously by swapping in a different adapter for each task, which saves computational resources.

However, adapters may not always match the performance of full fine-tuning, particularly for tasks that are very different from the original pre-training objective. The choice between adapters and full fine-tuning depends on factors such as the amount of task-specific data, the available computational resources, and the similarity between the pre-training and fine-tuning tasks.

5
Q

In the LLM context, what is LoRA?

A

LoRA stands for Low-Rank Adaptation of Large Language Models.

LoRA differs from adapters in its approach to modifying the model. Instead of adding new trainable parameters in the form of small neural networks (as adapters do), LoRA adds low-rank update matrices to pre-existing weight matrices in the transformer layers of the model. These low-rank matrices are the only parameters that are fine-tuned during adaptation.

The key idea behind LoRA is to limit the fine-tuning to a low-dimensional subspace of the parameter space, thereby making the adaptation more parameter-efficient and reducing the risk of overfitting when task-specific data is limited.

In the original LoRA paper, it is shown that LoRA can achieve comparable or even superior performance to full model fine-tuning on a variety of NLP tasks, while only fine-tuning a small fraction of the parameters. This makes LoRA particularly useful for adapting large models where full fine-tuning is computationally prohibitive.

More specifically, in the context of a Transformer model, a targeted weight matrix W (in the original paper, the attention projection matrices; in practice any dense layer, including feed-forward projections, can be targeted) is augmented with a low-rank update UV^T, where U and V are the parameters to be fine-tuned and their dimensions are determined by the chosen rank r; the update is typically scaled by a factor α/r.

During fine-tuning, only the parameters of the low-rank matrices (U and V) are updated, while the original pre-trained parameters are kept frozen. This effectively constrains the fine-tuning to a low-dimensional subspace of the parameter space, reducing the risk of overfitting when task-specific data is limited.
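A minimal numerical sketch of the idea, using NumPy with toy dimensions; the zero initialization of one factor and the α/r scaling follow the original paper:

```python
# A minimal sketch of a LoRA-style low-rank update, using NumPy with toy dimensions.
import numpy as np

d, k, r = 512, 512, 8          # weight matrix is d x k, LoRA rank r << min(d, k)
alpha = 16                     # scaling hyperparameter from the LoRA paper

W = np.random.randn(d, k) * 0.02   # frozen pre-trained weight (never updated)
A = np.random.randn(r, k) * 0.01   # trainable rank-r factor
B = np.zeros((d, r))               # trainable, initialised to zero so W' = W at the start

def lora_forward(x: np.ndarray) -> np.ndarray:
    # Effective weight W' = W + (alpha / r) * B @ A; only A and B receive gradients.
    return x @ (W + (alpha / r) * (B @ A)).T

x = np.random.randn(4, k)          # a batch of 4 input vectors
y = lora_forward(x)                # shape (4, d)
```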

Benefits of LoRA:
- Parameter efficiency: LoRA only fine-tunes a small fraction of the parameters, making it far more parameter-efficient than full fine-tuning.
- Reduced overfitting: by constraining the fine-tuning to a low-dimensional subspace, LoRA reduces the risk of overfitting, especially when the amount of task-specific data is small.
- Performance: despite its efficiency, LoRA has been shown to achieve comparable or even superior performance to full fine-tuning on a variety of tasks.
- No inference overhead: the low-rank update can be merged into the original weights after training, so the adapted model is no slower than the base model.

The purpose of LoRA is similar to that of adapters: both aim to adapt pre-trained models to specific tasks in a parameter-efficient manner, but they achieve this differently. Adapters add new trainable modules (small neural networks) between layers, whereas LoRA adds low-rank updates to existing weight matrices. The choice between LoRA and adapters depends on the specific requirements of the task and the resources available.

6
Q

What is the QLoRA approach used in LLMs?

A

QLoRA stands for Quantized LoRA (Quantized Low-Rank Adaptation). It is a variant of LoRA designed to further reduce the memory footprint of fine-tuning large language models.

In QLoRA, the pre-trained base model is quantized to 4-bit precision (using the NF4 “NormalFloat” data type) and kept frozen, while LoRA adapter matrices stored in higher precision (e.g., bfloat16) are added on top and fine-tuned. Gradients are backpropagated through the frozen, quantized base weights into the adapter parameters only. Quantization reduces the number of bits used to represent each weight, which dramatically cuts memory usage at the cost of some approximation error.

QLoRA also introduces double quantization (quantizing the quantization constants themselves) and paged optimizers (to absorb memory spikes during training). Together, these make it possible to fine-tune very large models on a single GPU.

However, quantization does introduce approximation error, so there is a trade-off between the memory savings and a potential decrease in model quality. In practice, and as reported in the original QLoRA paper, this degradation is small: QLoRA fine-tuning can match 16-bit fine-tuning quality on many benchmarks.
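A sketch of how this is commonly set up with the Hugging Face transformers / peft / bitsandbytes stack; the model name, rank, and target modules are illustrative choices, and exact argument names may vary across library versions:

```python
# Sketch of a typical QLoRA setup with Hugging Face transformers + peft + bitsandbytes.
# Model name, rank, and target modules are illustrative, not prescriptions.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize the frozen base model to 4 bits
    bnb_4bit_quant_type="nf4",              # NormalFloat4 data type from the QLoRA paper
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bfloat16
)

base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",             # hypothetical choice of base model
    quantization_config=bnb_config,
)

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],    # which weight matrices get LoRA updates
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()          # only the LoRA parameters are trainable
```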

7
Q

What is RLHF (reinforcement learning from human feedback)?

A

Reinforcement Learning from Human Feedback (RLHF) is a technique to train AI models where traditional reinforcement learning is combined with valuable feedback from human evaluators.

To break down the process technically:
1. Initial policy training: an initial policy is trained using supervised learning, where human experts demonstrate correct behavior in the task environment (for LLMs, this is the supervised fine-tuning step). This is called the “expert policy”.
2. Data collection: the model interacts with the environment (or, for LLMs, generates responses to prompts) based on the current policy, producing trajectories of states, actions, and rewards.
3. Reward modeling: human evaluators compare different outputs produced by the model and rank them according to their preferences. These comparisons are used to train a reward model. Importantly, humans are not rating the absolute value of an output, but making relative comparisons between alternatives.
4. Policy optimization: the model’s policy is then updated to maximize the expected cumulative reward assigned by the reward model, typically using Proximal Policy Optimization (PPO) or a similar algorithm.
5. Iteration: steps 2 through 4 are repeated, refining the model’s behavior over time with continuous feedback from humans.

The RLHF process can be computationally intensive and time-consuming due to the need for continuous human involvement in reward modeling. However, it’s an effective way to train models in complex environments where it’s hard to define a suitable reward function or in scenarios where exploration can have high costs.
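As a concrete illustration of step 3, the reward model is commonly trained with a pairwise (Bradley-Terry style) loss on human preference comparisons. A minimal sketch, with toy reward values standing in for the reward model's scalar outputs:

```python
# Minimal sketch of the pairwise preference loss commonly used for RLHF reward models.
# r_chosen / r_rejected would come from the reward model; the values here are toy numbers.
import numpy as np

def pairwise_preference_loss(r_chosen: np.ndarray, r_rejected: np.ndarray) -> float:
    """Bradley-Terry style loss: -log sigmoid(r_chosen - r_rejected), averaged over pairs."""
    diff = r_chosen - r_rejected
    return float(np.mean(np.log1p(np.exp(-diff))))  # numerically stable -log(sigmoid(diff))

r_chosen = np.array([1.2, 0.3, 2.0])     # reward scores for the preferred responses
r_rejected = np.array([0.4, 0.5, -1.0])  # reward scores for the rejected responses
print(pairwise_preference_loss(r_chosen, r_rejected))
```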

8
Q

What other techniques are there apart from RLHF?

A
9
Q

What is the difference between factual grounding and RAG?

A

Factual grounding and Retrieval-Augmented Generation (RAG) are both approaches to enhance language models with external knowledge.

**RAG is a specific approach to achieving factual grounding in generative models.** It is a framework that combines a pre-trained language model with a retrieval system.

Factual grounding is the broader goal and can be achieved with various mechanisms, including retrieval systems, knowledge bases, fine-tuning on factually verified datasets, or even real-time queries at inference time.

10
Q

What is FLARE (Forward-Looking Active Retrieval-Augmented Generation) and how does it differ from RAG?

A

FLARE (Forward-Looking Active REtrieval-augmented generation) is an extension of the Retrieval-Augmented Generation (RAG) approach that aims to improve the effectiveness of retrieval for long-form generation. While standard RAG retrieves relevant documents once, based on the input query, FLARE actively decides during generation when to retrieve and what to retrieve, based on what the model is about to generate next.

Key differences between FLARE and RAG:

1. Forward-looking retrieval:
- RAG: retrieves documents based on the current input query only.
- FLARE: looks ahead by generating a tentative next sentence and uses it to retrieve documents relevant to the upcoming generation step.

2. Active retrieval:
- RAG: performs retrieval once, before generation starts.
- FLARE: retrieves repeatedly throughout generation, taking into account the previously generated text and the current context.

3. Retrieval trigger:
- RAG: always conditions on the same initially retrieved documents.
- FLARE: only triggers a new retrieval when the tentatively generated sentence contains low-confidence (low-probability) tokens, i.e., when the model appears to lack the needed knowledge.

4. Efficiency:
- RAG: a single retrieval per query is cheap, but the retrieved documents may become less relevant as a long answer unfolds.
- FLARE: retrieves only when the model is uncertain, which keeps the number of retrieval calls lower than retrieving at every generation step while keeping the retrieved context aligned with the text being written.
How FLARE works (as proposed in the original paper):

1. Start with the user query and retrieve an initial set of documents, as in standard RAG.
2. At each step, ask the language model to generate a tentative next sentence given the query, the retrieved documents, and the text generated so far.
3. Inspect the token probabilities of that tentative sentence. If all tokens are above a confidence threshold, accept the sentence and continue.
4. If the sentence contains low-probability tokens, treat this as a signal that the model lacks the necessary knowledge: form a retrieval query from the tentative sentence (either the sentence with its low-confidence spans masked out, or questions generated about them), retrieve new documents, and regenerate the sentence conditioned on them.
5. Repeat until the full answer is generated.

Notably, FLARE is an inference-time procedure: it requires no additional training of the retriever or the generator, only a retrieval system (e.g., a dense retriever or a search engine) and access to the language model’s token probabilities.

FLARE has shown improved performance over single-retrieval RAG on long-form, knowledge-intensive generation tasks such as multi-hop question answering and open-domain summarization, because the retrieved evidence stays relevant to each part of the answer as it is written.
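A minimal sketch of the active-retrieval loop; the helper functions `generate_sentence`, `token_probs`, and `retrieve` are hypothetical stand-ins for a real LLM API and retrieval system, and the confidence threshold is illustrative:

```python
# Sketch of a FLARE-style active retrieval loop. generate_sentence(), token_probs(),
# and retrieve() are hypothetical stand-ins for a real LLM API and retrieval system.

CONFIDENCE_THRESHOLD = 0.4  # minimum acceptable token probability (illustrative)

def flare_generate(query, generate_sentence, token_probs, retrieve, max_sentences=10):
    docs = retrieve(query)                 # initial retrieval, as in standard RAG
    answer = ""
    for _ in range(max_sentences):
        tentative = generate_sentence(query, docs, answer)  # look-ahead next sentence
        if tentative is None:                               # model signalled completion
            break
        if min(token_probs(tentative)) < CONFIDENCE_THRESHOLD:
            # Low-confidence tokens: the model is unsure, so retrieve again using the
            # tentative sentence as the query, then regenerate the sentence.
            docs = retrieve(tentative)
            tentative = generate_sentence(query, docs, answer)
        answer += tentative + " "
    return answer.strip()
```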

11
Q

What are sparse mixture of experts (SMoE) models?

A

**Sparse Mixture of Experts (SMoE)** models are a type of neural network architecture that combines the concept of mixture of experts with sparsely activated subnetworks. The goal is to improve efficiency and scalability by selectively activating only a subset of experts for each input.

Architecture:
- SMoE consists of a set of expert networks and a gating (router) network.
- Each expert network is a specialized subnetwork capable of handling a specific subset of the input space.
- The gating network is responsible for assigning inputs (typically individual tokens) to the appropriate experts based on their characteristics.
- The experts are sparsely activated: for each input, only a small subset of experts (e.g., the top-k) are selected and computed.

Implementation:
- Expert networks: design a set of expert networks, each a neural network (e.g., an MLP or a transformer feed-forward block) with its own parameters.
- Gating network: implement a gating network that takes the input and outputs a probability distribution over the experts, typically via a softmax layer.
- Sparse activation: select the top-k experts with the highest probabilities for each input, and compute the forward pass only for those experts.
- Combination: combine the outputs of the selected experts using a weighted sum, where the weights come from the gating probabilities (usually renormalized over the selected experts).
- Training: train the SMoE model with standard optimization techniques, such as stochastic gradient descent, often with an auxiliary loss that encourages balanced expert usage.

Pros:
- Improved efficiency: only a subset of experts is activated per input, reducing computational cost relative to a dense model with the same number of parameters.
- Increased model capacity and expressiveness: different experts can specialize in different parts of the input space.
- Scalability to large-scale datasets and complex tasks by distributing the workload among multiple experts.

Cons:
- Routing ambiguity: the gating network may struggle to decide which experts to activate for certain inputs, which can hurt performance if not properly managed.
- Increased model complexity due to the multiple expert networks and the gating mechanism.
- Training challenges, such as ensuring balanced expert utilization and preventing a few experts from dominating.
- Overhead in memory and communication, since all expert parameters must be stored and coordinated (often across devices).

Mathematical details:
- Gating network: the gating network outputs a probability distribution over the experts using a softmax: p_i = exp(g_i) / Σ_j exp(g_j), where g_i is the logit for expert i.
- Sparse activation: select the top-k experts by gating probability. The SMoE output is y = Σ_{i ∈ top-k} p_i · e_i(x), where e_i(x) is the output of expert i for input x (the selected p_i are typically renormalized).
- Loss function: the loss is usually the task-specific loss (e.g., cross-entropy for classification) plus regularization terms that encourage balanced expert utilization and sparsity.

SMoE models have been successfully applied to various tasks, including language modeling, machine translation, and computer vision, demonstrating improved efficiency and performance compared to traditional dense models. A minimal sketch of the routing computation follows.
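A minimal NumPy sketch of top-k routing; the dimensions, value of k, and linear "experts" are toy choices, and real systems use learned MLP experts plus an auxiliary balancing loss:

```python
# Minimal sketch of sparse top-k mixture-of-experts routing, using NumPy with toy
# dimensions. Real systems use learned MLP experts and an auxiliary balancing loss.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 8, 4, 2

W_gate = rng.normal(size=(d_model, n_experts)) * 0.1                               # router weights
experts = [rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(n_experts)]    # toy linear experts

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def smoe_forward(x):
    logits = x @ W_gate                        # one router logit per expert
    probs = softmax(logits)
    top = np.argsort(probs)[-top_k:]           # indices of the top-k experts
    weights = probs[top] / probs[top].sum()    # renormalize over the selected experts
    # Only the selected experts are evaluated; the rest are skipped entirely.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

x = rng.normal(size=(d_model,))                # a single token representation
y = smoe_forward(x)                            # same dimensionality as x
```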

12
Q

What are some examples of sparse mixture of experts models?

A

**Mixtral 8x7B**: Developed by Mistral AI, Mixtral replaces each transformer feed-forward block with 8 experts and routes every token to the top 2 of them, so only a fraction of the total parameters is active per token.
**GShard**: Developed by Google, GShard is a model (and sharding framework) that uses a MoE architecture for efficient scaling. It uses a gating mechanism to determine which experts to use for each token in the input.
**Switch Transformer**: Also developed by Google, the Switch Transformer routes each token to a single expert (top-1 routing), achieving high efficiency and scalability by reducing the computational cost per token.
**GLaM**: Google’s GLaM is a sparsely activated MoE language model with roughly 1.2 trillion parameters, of which only a small fraction is activated for each input token.

13
Q

What are some pros and cons of large sparse mixture of experts models?

A

Sparse Mixture of Experts (MoE) models, where each input is processed by only a small subset of experts, have several advantages and disadvantages.

Pros:
1. Capacity: MoE models can have a much larger capacity than standard models, as they effectively contain many smaller models (the experts) that can each learn different things. This can lead to better performance on complex tasks.
2. Efficiency: Because each input is processed by only a few experts, MoE models can be more efficient than dense models of comparable capacity. This is particularly true when expert parallelism is used, allowing different experts to run on different devices simultaneously.
3. Adaptability: MoE models can potentially adapt better to different types of input data, as different experts can specialize in different types of data or tasks.

Cons:
* Complexity: MoE models are more complex than standard models, both in their architecture and their training process, which makes them harder to implement and debug.
* Overfitting: Because of their larger capacity, MoE models can be more prone to overfitting, especially if the number of experts is large relative to the amount of training data.
* Load balancing: Ensuring that the computational load is evenly distributed across experts is challenging, especially when some experts are used more than others; imbalance leads to inefficient use of computational resources.
* Training difficulty: Training MoE models is complicated by issues such as expert imbalance (some experts being used much more than others) and the need to jointly train the experts and the gating network that selects them; auxiliary objectives such as load-balancing losses are typically needed to keep routing stable.

When compared to similar models that do not use sparse MoE, MoE models have distinct trade-offs in terms of memory usage, speed, latency, and throughput:
* Memory: MoE models use less compute per input because each input is processed by only a subset of experts, but total memory usage remains high because the parameters of all experts must be stored (and often sharded across devices).
* Speed / latency: Latency can be higher than for a dense model if the gating operation and the dispatch of tokens to experts are not optimized. With well-implemented expert parallelism, however, latency can be kept low because different experts run simultaneously on different hardware.
* Throughput: MoE models can potentially handle larger volumes of data more efficiently thanks to expert parallelism: each expert processes its share of the tokens in parallel, leading to higher overall throughput.
* Computational cost: While MoE models are theoretically more efficient per token, the practical cost can be higher due to the overhead of managing multiple experts and the gating network, including load balancing and cross-device communication in a distributed setup.

14
Q

In mixture of experts implementations of LLMs, what is expert parallelism?

A

Expert parallelism in the context of Mixture of Experts (MoE) implementation in Large Language Models (LLMs) refers to the ability to distribute the computation load of individual experts across multiple devices or processors.

In a MoE model, each ‘expert’ is a smaller neural network that specializes in a particular type of data or pattern. Instead of running all computations on a single device, expert parallelism allows each expert to be computed on a different device. This is particularly useful when the number of experts is large, as it can significantly speed up computation.

The key benefit of expert parallelism is the potential for increased model capacity without a linear increase in computational cost. Because the experts can operate independently and simultaneously, larger models can be trained more efficiently.

However, implementing expert parallelism can be challenging due to the need for effective load balancing across devices (since some experts may be used more than others) and for synchronizing the updates from all the devices. Despite these challenges, expert parallelism is a powerful tool for scaling up the training of MoE models in LLMs.

15
Q

Explain, with equations, what changes inside the FFN block of a transformer with mixture-of-experts routing.

A

In a standard Transformer model, the Feed-Forward Network (FFN) block is a simple two-layer neural network applied independently to each position in the input sequence. It can be represented by the equation:

FFN(x) = max(0, xW1 + b1)W2 + b2

Where W1, W2, b1, and b2 are the weights and biases of the two layers respectively, and x is the input.

When we introduce a mixture of experts (MoE) into the FFN, the setup changes. Instead of having one universal FFN, we now have multiple expert FFNs and a gating network that determines which expert FFN should be used for each input.

The gating network computes a distribution over the experts, based on the input. The output of the MoE is a weighted sum of the expert outputs, with the weights given by the gating network. This can be represented by the following equation:

MoE(x) = Σ_i g_i(x) * FFN_i(x)

Here, FFN_i represents the i-th expert FFN, g_i(x) is the gating network’s weight for the i-th expert given input x, and the sum runs over the experts. The gating network is typically a simple feed-forward layer with a softmax output to ensure the weights sum to one. In a sparse MoE, only the top-k experts (by gating weight) are actually evaluated, so the sum runs over just those k experts, usually with the selected weights renormalized.

The benefit of this setup is that it allows the model to adaptively choose which expert to use for each input, potentially increasing the model’s capacity without a significant increase in computational cost, as only a subset of experts need to be used for each input.

16
Q

What is the SwiGLU activation function?

A

SwiGLU is a variant of the **Gated Linear Unit (GLU)** activation function.

**What is SwiGLU?**
- SwiGLU is a non-linear activation function that builds upon the GLU concept.
- It employs the Swish function as its core activation mechanism.
- The Swish function is defined as Swish_β(x) = x * Sigmoid(βx) (with β = 1 this is also known as SiLU).
- SwiGLU combines this Swish activation with a gating mechanism.

SwiGLU activation formula:
Given input x, weight matrices W and V, bias terms b and c, and a scaling factor β, SwiGLU is defined as:
SwiGLU(x, W, V, b, c, β) = Swish_β(xW + b) ⊗ (xV + c)
where ⊗ denotes element-wise multiplication.

Why SwiGLU?
- Efficiency: SwiGLU offers computational efficiency while maintaining expressive power.
- Non-linearity: it introduces non-linearity, which is crucial for deep neural networks.
- Performance: SwiGLU has demonstrated competitive performance in various tasks, especially language modeling, and is used in the feed-forward blocks of models such as PaLM and LLaMA.
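A minimal NumPy sketch of the formula above with toy dimensions; note that in transformer FFN blocks the biases are often omitted and the result is passed through a further down-projection:

```python
# Minimal NumPy sketch of SwiGLU with toy dimensions.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def swish(z, beta=1.0):
    return z * sigmoid(beta * z)

def swiglu(x, W, V, b, c, beta=1.0):
    # The Swish-activated branch gates the linear branch element-wise.
    return swish(x @ W + b, beta) * (x @ V + c)

d_in, d_hidden = 16, 32
rng = np.random.default_rng(0)
x = rng.normal(size=(4, d_in))                  # a batch of 4 vectors
W = rng.normal(size=(d_in, d_hidden)) * 0.1
V = rng.normal(size=(d_in, d_hidden)) * 0.1
b = np.zeros(d_hidden)
c = np.zeros(d_hidden)
y = swiglu(x, W, V, b, c)                        # shape (4, d_hidden)
```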

17
Q

What is the passkey task in LLM?

A

The passkey retrieval task is a benchmark for Large Language Models (LLMs).

What is the passkey task?
- The task involves a language model retrieving a simple passkey (typically a five-digit random number) that has been hidden inside a long, otherwise meaningless or filler text sequence.
- The model must identify and repeat this specific passkey from within the input text.

Why is it important?
- The passkey task evaluates an LLM’s ability to retain and access information from any position in a long input sequence.
- It tests whether the model can locate the relevant tokens even in lengthy and contextually diverse texts.

Challenges and implications:
- The task looks trivial, but it highlights several crucial aspects:
  - Contextual awareness: the LLM must use information from across the entire input, not just the local context.
  - Generalization: the model should succeed on unseen passkeys and positions.
  - Performance degradation: if the effective context window is too limited, LLMs fail once the passkey sits beyond it or deep inside very long inputs.
- Addressing these challenges is essential for practical applications where LLMs must process extensive documents or long conversations.

Existing approaches:
- Researchers have developed various methods to extend the context window of pretrained LLMs:
  - Fine-tuning-based approaches that continue training on long texts.
  - Approaches that aim to extend the context with little or no fine-tuning.
  - Notable techniques include Position Interpolation (PI), CLEX, and YaRN.
- Many of these methods still require some degree of fine-tuning, which can be time-consuming and resource-intensive.

Self-Extend approach:
- A recent proposal called Self-Extend leverages an LLM’s inherent capability to handle longer contexts.
- The basic idea:
  - Construct bi-level attention: group-level attention for distant tokens and neighbor-level attention for nearby tokens.
  - Both are computed using the original model’s self-attention, without additional training.
  - Only minimal code modification is required.
- Self-Extend effectively extends an existing LLM’s context window without fine-tuning.
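A minimal sketch of how such a test prompt can be constructed; the filler sentence and phrasing are illustrative, and real benchmarks systematically vary the passkey position and the context length:

```python
# Sketch of constructing a passkey-retrieval test prompt. The filler text and
# phrasing are illustrative; real benchmarks vary the passkey position and depth.
import random

def build_passkey_prompt(n_filler: int = 2000, seed: int = 0) -> tuple[str, str]:
    random.seed(seed)
    passkey = f"{random.randint(0, 99999):05d}"
    filler = "The grass is green. The sky is blue. The sun is yellow. " * n_filler
    needle = f"The pass key is {passkey}. Remember it. {passkey} is the pass key. "
    # Insert the needle at a random position inside the filler text.
    pos = random.randint(0, len(filler))
    context = filler[:pos] + needle + filler[pos:]
    prompt = context + "\nWhat is the pass key? The pass key is"
    return prompt, passkey

prompt, expected = build_passkey_prompt()
# The model's completion is compared against `expected`.
```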

18
Q

What is supervised fine-tuning (SFT) on an instruction dataset?

A

Supervised fine-tuning is a strategy used in machine learning, particularly in the training of large language models, where an already pre-trained model is further trained on a specific dataset with labelled data to tailor its predictions for specific tasks.

In the context of an instruction dataset, suppose you have a large language model, like GPT-3, that has been pre-trained on a diverse range of internet text. Now, if you want this model to perform a specific task, such as answering questions about a set of instructions, you would create an instruction dataset where each instance consists of an instruction and the correct response or action to that instruction. This dataset will serve as the labelled data.

During supervised fine-tuning, the model is further trained on this instruction dataset. The goal is to minimize the difference between the model’s predictions and the actual labels in the dataset. The model’s parameters are updated in such a way that it learns to generate responses that match the correct responses in the instruction dataset.

This way, the model can learn to perform tasks that are specific to the requirements of the instruction dataset, despite being originally trained on a much broader and diverse corpus. The power of supervised fine-tuning lies in its ability to leverage the vast general knowledge learned by the model during pre-training, and to tailor it to a specific task with a relatively small amount of task-specific data.
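A minimal sketch of the core training objective: next-token cross-entropy computed only over the response tokens, with the instruction tokens masked out. The token IDs and mask values are toy numbers; a real pipeline would use a tokenizer and the model's logits:

```python
# Sketch of the supervised fine-tuning objective on one instruction example:
# next-token cross-entropy over response tokens only, with prompt tokens masked.
# Token IDs are toy values; real pipelines use a tokenizer and an LLM's logits.
import numpy as np

def sft_loss(logits: np.ndarray, targets: np.ndarray, loss_mask: np.ndarray) -> float:
    """logits: (seq_len, vocab); targets: (seq_len,); loss_mask: 1 for response tokens."""
    logits = logits - logits.max(axis=-1, keepdims=True)      # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    token_nll = -log_probs[np.arange(len(targets)), targets]  # per-token negative log-likelihood
    return float((token_nll * loss_mask).sum() / loss_mask.sum())

vocab, seq_len = 50, 8
rng = np.random.default_rng(0)
logits = rng.normal(size=(seq_len, vocab))
targets = rng.integers(0, vocab, size=seq_len)   # the next token at each position
loss_mask = np.array([0, 0, 0, 1, 1, 1, 1, 1])   # instruction tokens masked out
print(sft_loss(logits, targets, loss_mask))
```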

19
Q

What is direct preference optimization (DPO)?

A

Direct Preference Optimization (DPO) is a method for aligning language models with human preferences that directly optimizes the policy on preference data. Unlike classic RLHF, which first trains a reward model on human comparisons and then optimizes the policy against it with a reinforcement learning algorithm such as PPO, DPO derives a simple classification-style loss from the same objective and trains the model with ordinary supervised optimization, using comparative feedback (pairs of preferred and rejected outputs) rather than scalar rewards.

The core idea behind DPO, and preference-based methods in general, is that it is often easier for humans to compare two outputs or trajectories than to assign them numerical scores. In a complex environment, a human might find it difficult to give a specific reward to a given sequence of actions, yet can readily say that one sequence is better than another. DPO leverages exactly this kind of comparative feedback.

DPO has become a popular way to fine-tune large language models to generate more desirable outputs by directly optimizing for human preferences; preference-based feedback of this kind has also been used successfully in other areas, such as robotics.

Direct Preference Optimization can be seen as serving the same purpose as Reinforcement Learning from Human Feedback (RLHF): aligning a model with human preferences, but without an explicit reward model or RL training loop.
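A minimal sketch of the DPO loss for a single preference pair, following the published formula; the log-probability values are toy numbers, and in practice they come from the trainable policy and a frozen reference model (e.g., the SFT checkpoint):

```python
# Minimal sketch of the DPO loss for one preference pair, following the published
# formula. Log-probabilities are toy numbers; in practice they come from the
# trainable policy and a frozen reference model.
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # DPO: -log sigmoid( beta * [ (logp_w - ref_logp_w) - (logp_l - ref_logp_l) ] )
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    return math.log1p(math.exp(-margin))   # numerically stable -log(sigmoid(margin))

# Toy values: the policy already prefers the chosen response slightly more than the
# reference model does, so the loss is below log(2).
print(dpo_loss(logp_chosen=-12.0, logp_rejected=-15.0,
               ref_logp_chosen=-13.0, ref_logp_rejected=-14.0))
```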

20
Q

What is the Pile dataset?

A

The Pile is an 800GB open-source text corpus assembled for training large language models (it has been used to train models such as GPT-Neo, GPT-J, and GPT-NeoX).

It includes a diverse range of data sources such as books, websites, and other texts. These sources include, but are not limited to: Common Crawl, Wikipedia, academic papers from arXiv, books from Project Gutenberg, court rulings from CourtListener, code from GitHub, and many more. The Pile was created by EleutherAI with the intention of training better and more versatile language models by providing a wider array of text for the model to learn from.

The process includes several steps:
Data Collection: Data is gathered from a variety of sources, ranging from academic papers to court rulings, to code from Github.
Language Identification: As the data comes from many different sources, language identification is done to ensure the dataset consists only of English texts.
De-duplication: To avoid over-representation of certain texts, duplicates are identified and removed.
Filtering: Certain types of content that are not suitable or irrelevant for training purposes are filtered out. This could include extremely long documents, certain types of metadata, non-textual content, etc.
Formatting: The data is then formatted to be compatible with the needs of large language models. This can involve splitting the data into chunks of a certain size, encoding it in a certain way, etc.
Quality Control: Various checks are performed to ensure the quality of the dataset. This could include checks for remaining duplicates, inappropriate content, etc.

21
Q

What is the needle-in-a-haystack test for LLMs?

A

The “Needle in the Haystack” test for Language Models, particularly Large Language Models (LLMs), is a type of evaluation that assesses the model’s capability to extract a single correct or meaningful response from a vast amount of irrelevant or incorrect data - similar to finding a “needle in a haystack”.

In the context of OpenAI’s LLMs, the test might involve providing the model with a large amount of text data or prompts, with only a single piece of information or a specific phrase that is relevant or correct. The model is then tasked to find this piece of information - the “needle” - among the vast “haystack” of irrelevant data.

This test is important because it helps gauge how well the LLM can discern relevant information from noise, which is a critical aspect of understanding and generating meaningful responses. In practice, the test is usually run systematically: the “needle” fact is inserted at varying depths within contexts of varying lengths, and retrieval accuracy is reported as a function of both.

22
Q

What are more standard techniques to adapt a large pre-trained model, apart from adapters and QLoRA?

A

There are several other ways to adapt large pre-trained models to new tasks. Below are a few methods:
Fine-Tuning: The traditional approach to adapt pre-trained models is through fine-tuning, where the pre-trained model is trained further on a new task-specific dataset. During this process, all parameters of the model are updated.
Feature Extraction: Another common approach is to use pre-trained models as feature extractors. Here, the model parameters are kept frozen, and the model is used to extract meaningful features from the input data, which are then fed into a separate classifier or regressor trained on the task-specific data.
Multi-Task Learning: In multi-task learning, a model is trained on multiple related tasks simultaneously. This allows the model to leverage shared information across tasks, which can improve the model’s performance on each individual task.
Continual Learning: Continual learning aims to adapt a pre-trained model to new tasks while minimizing catastrophic forgetting of the previous tasks. This is often achieved through techniques like regularization or architectural modifications that encourage the model to maintain its performance on the old tasks while learning the new task.
Knowledge Distillation: Knowledge distillation involves training a smaller student model to mimic the behavior of the large pre-trained model (teacher model). The student model is trained to match the output probabilities of the teacher model, which can help it learn a similar representation of the data while being much smaller and more computationally efficient.

23
Q

What is RAG? How can it be implemented, and what are some pros, cons, and alternatives?

A

Retrieval-Augmented Generation (RAG) is a technique that combines information retrieval with generative language models to enhance the model’s ability to generate informative and accurate responses by leveraging external knowledge.

Implementation:

Retrieval system: Build an efficient retrieval system to store and search a large corpus of documents, e.g., a dense retriever such as Dense Passage Retrieval (DPR) over Wikipedia, backed by a vector index such as FAISS (a minimal end-to-end sketch is shown after this list).

Encoding: Encode the input query and the documents in the corpus into dense vector representations using an encoder (e.g., BERT).
Retrieval: Use the encoded query to retrieve the most relevant documents from the corpus based on similarity scores.
Generation: Feed the retrieved documents and the input query into a generative language model (e.g., GPT) to generate the final response.
Training: Train the model using a dataset of query-document-response triples, optimizing for both retrieval accuracy and generation quality.
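A minimal, self-contained sketch of the retrieve-then-generate flow. Word-overlap scoring stands in for a dense retriever, the three-document corpus is made up, and `llm_generate` is a hypothetical placeholder for a call to a generative LLM:

```python
# Minimal sketch of the RAG flow: retrieve relevant documents, then condition
# generation on them. Retrieval here is naive word-overlap scoring; real systems use
# dense encoders (e.g., DPR) with a vector index (e.g., FAISS). The corpus and the
# llm_generate() call are illustrative placeholders.

corpus = [
    "Paris is the capital and largest city of France.",
    "The Eiffel Tower was completed in 1889.",
    "Mount Everest is the highest mountain above sea level.",
]

def retrieve(query: str, k: int = 2) -> list[str]:
    """Score documents by word overlap with the query and return the top-k."""
    q_words = set(query.lower().split())
    scored = sorted(corpus, key=lambda doc: -len(q_words & set(doc.lower().split())))
    return scored[:k]

def build_rag_prompt(query: str) -> str:
    docs = retrieve(query)
    context = "\n".join(f"- {d}" for d in docs)
    return (f"Answer the question using the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {query}\nAnswer:")

prompt = build_rag_prompt("What is the capital of France?")
# response = llm_generate(prompt)   # hypothetical call to a generative LLM
print(prompt)
```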

Pros:

Enables the model to access a vast amount of external knowledge, improving the informativeness and accuracy of generated responses.
Allows for dynamic retrieval of relevant information based on the input query, providing flexibility and adaptability.
Can handle a wide range of topics and domains by leveraging a large and diverse corpus.
Cons:

Requires building and maintaining a large-scale retrieval system, which can be computationally expensive and storage-intensive.
The quality of generated responses depends on the quality and relevance of the retrieved documents.
Retrieving irrelevant or noisy documents can negatively impact the generated outputs.
The model’s performance may be limited by the coverage and freshness of the retrieval corpus.

24
Q

What are some alternatives to RAG?

A

Knowledge distillation: Pretrain a model on a large corpus and distill the knowledge into a (smaller) model's parameters, eliminating the need for explicit retrieval during inference.
Memory networks: Incorporate a differentiable memory component that can store and retrieve relevant facts during generation.
Knowledge bases: Directly integrate structured knowledge bases (e.g., Wikidata) with language models.
Closed-book models: Fine-tune language models so the knowledge is contained entirely within their parameters, with no external retrieval (e.g., closed-book T5).
Sparse retrieval: Use traditional information retrieval techniques such as TF-IDF or BM25 to fetch relevant documents; this can be more interpretable, though often less accurate than dense retrieval.

25
Q

What is self-attention?

A
26
Q

What is a causal mask, and why do we apply it to self-attention?

A
27
Q

What is sliding window attention? How is it related to the autoregressive causal mask?

A
28
Q

Which models use sliding window attention?

A
29
Q

How is sliding window attention similar to the receptive field in convolutional networks?

A
30
Q

In a self-attention mechanism (not cross-attention, i.e., a decoder-only model), what are the K, Q, and V matrices?

A
31
Q

In the context of the transformer architecture, what is the KV cache?

A
32
Q

When you have sliding window attention and a KV cache, what is the rolling buffer cache?

A
33
Q

What are pre-filling and chunking of the KV cache in language models?

A
34
Q

What is model sharding? What can be used that works better?

A
35
Q

For large language models, what is pipeline parallelism?

A
36
Q

What are xformers?

A
37
Q

What is NVIDIA Triton?

A
38
Q

How do you balance throughput and latency for big-model inference?

A
39
Q

What is RTF vs throughput?

A
40
Q

What are xformers?

A
41
Q

How do companies optimize inferences to serve Large language models?

A
42
Q

Why might it not be a good idea to combine multiple requests (prompts) of different lengths by padding them?

A

Padding every prompt to the longest one in the batch wastes computation: many of the attention dot products involve pad tokens and their results are simply thrown away. Instead, the sequences can be concatenated (packed) and processed with block-diagonal attention kernels such as those provided by xformers, as done in the Mistral/Mixtral reference implementation.