Optimizing Foundation Models Flashcards

In this course, you will explore two techniques to improve the performance of a foundation model (FM): Retrieval Augmented Generation (RAG) and fine-tuning. You will learn about Amazon Web Services (AWS) services that help store embeddings with vector databases, and the role of agents in multi-step tasks. You will also define methods for fine-tuning an FM, learn how to prepare data for fine-tuning, and more.

1
Q

What are vector embeddings?

A

Embedding is the process by which text, images, and audio are given a numerical representation in a vector space. Embedding is usually performed by a machine learning (ML) model. The following diagram provides more details about embedding.

Image Link: https://explore.skillbuilder.aws/files/a/w/aws_prod1_docebosaas_com/1723500000/sqEMWzSK0arsMWI3Rq1YMQ/tincan/914789_1717713712_o_1hvnrdq96oal1nua1bo61jun11ppb_zip/assets/vector%403x.png
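As an illustrative sketch (not part of the original course material), the following Python snippet shows how a piece of text could be turned into a vector embedding with Amazon Bedrock. It assumes the Amazon Titan Text Embeddings model ID `amazon.titan-embed-text-v1` is available in your account and that boto3 credentials and region are already configured.

```python
import json
import boto3

# Bedrock runtime client (assumes credentials and region are configured).
bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def embed_text(text: str) -> list[float]:
    """Return the vector embedding for a piece of text using Amazon Titan."""
    response = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v1",   # assumed model ID; check availability in your Region
        contentType="application/json",
        accept="application/json",
        body=json.dumps({"inputText": text}),
    )
    payload = json.loads(response["body"].read())
    return payload["embedding"]                 # a list of floats (for example, 1,536 dimensions)

vector = embed_text("Retrieval Augmented Generation grounds an FM in your data.")
print(len(vector), vector[:5])
```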

2
Q

What is a vector database and what are its functions?

A

The core function of vector databases is to compactly store billions of high-dimensional vectors representing words and entities. Vector databases provide ultra-fast similarity searches across these billions of vectors in real time.

The most common algorithms used to perform the similarity search are k-nearest neighbors (k-NN) or cosine similarity.
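To make the similarity search concrete, here is a minimal, framework-free sketch of cosine similarity and a brute-force k-NN lookup over a few stored vectors. A real vector database would use approximate-nearest-neighbor indexes rather than this exhaustive scan, and the toy vectors below stand in for real embedding model output.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: 1.0 means identical direction, 0.0 means orthogonal."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def k_nearest(query: np.ndarray, index: dict[str, np.ndarray], k: int = 2):
    """Brute-force k-NN: score every stored vector and keep the top k."""
    scores = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in index.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]

# Toy 4-dimensional "embeddings" standing in for real model output.
index = {
    "doc-1": np.array([0.9, 0.1, 0.0, 0.2]),
    "doc-2": np.array([0.1, 0.8, 0.3, 0.0]),
    "doc-3": np.array([0.85, 0.05, 0.1, 0.3]),
}
query = np.array([0.88, 0.12, 0.05, 0.25])
print(k_nearest(query, index))   # doc-1 and doc-3 should rank highest
```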

Amazon Web Services (AWS) offers the following viable vector database options:

  • Amazon OpenSearch Service (provisioned)
  • Amazon OpenSearch Serverless
  • pgvector extension in Amazon Relational Database Service (Amazon RDS) for PostgreSQL
  • pgvector extension in Amazon Aurora PostgreSQL-Compatible Edition
  • Amazon Kendra

3
Q

What roles can agents serve in a generative AI application?

A

Intermediary operations: Agents can act as intermediaries, facilitating communication between the generative AI model, which handles language understanding and response generation, and various backend systems, such as databases, customer relationship management (CRM) platforms, or service management tools.

Action launch: Agents can be used to run a wide variety of tasks, such as adjusting service settings, processing transactions, or retrieving documents. These actions are based on the user's specific needs as understood by the generative AI model.

Feedback integration: Agents can also contribute to the AI system’s learning process by collecting data on the outcomes of their actions. This feedback helps refine the AI model, enhancing its accuracy and effectiveness in future interactions.
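As a hedged illustration of the intermediary and action-launch roles described above, the sketch below routes a structured decision produced by the model to a backend handler. The action names, handler functions, and decision format are invented for illustration; they are not a specific AWS API.

```python
# Hypothetical action dispatch: the generative AI model decides *what* to do,
# and the agent maps that decision onto a backend call.
def retrieve_document(doc_id: str) -> str:
    return f"contents of {doc_id}"                 # stand-in for a document store lookup

def update_service_setting(name: str, value: str) -> str:
    return f"setting {name} changed to {value}"    # stand-in for a CRM or ITSM call

ACTIONS = {
    "retrieve_document": retrieve_document,
    "update_service_setting": update_service_setting,
}

def run_agent_action(model_decision: dict) -> str:
    """model_decision is the structured output of the FM, for example
    {"action": "retrieve_document", "arguments": {"doc_id": "invoice-42"}}."""
    handler = ACTIONS[model_decision["action"]]
    result = handler(**model_decision["arguments"])
    # The result (and its eventual outcome) can be logged as feedback to refine the model.
    return result

print(run_agent_action({"action": "retrieve_document", "arguments": {"doc_id": "invoice-42"}}))
```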

4
Q

How can generative AI models have their performance evaluated?

A

Two of the most common evaluation methods are human evaluation and the use of benchmark datasets. Each method provides unique insights and is suitable for different aspects of model performance assessment.

5
Q

Explain what is meant by human evaluation when it comes to evaluating generative AI model performance:

A

Human evaluation involves real users interacting with the AI model to provide feedback based on their experience. This method is particularly valuable for assessing qualitative aspects of the model, such as the following:

User experience: How intuitive and satisfying is the interaction with the model from the user’s perspective?

Contextual appropriateness: Does the model respond in a way that is contextually relevant and sensitive to the nuances of human communication?

Creativity and flexibility: How well does the model handle unexpected queries or complex scenarios that require a nuanced understanding?

Human evaluation is often used for iterative improvements and tuning the model to better meet user expectations.

6
Q

Explain what is meant by benchmark datasets when it comes to evaluating generative AI model performance:

A

Benchmark datasets, on the other hand, provide a quantitative way to evaluate generative AI models. These benchmarks pair predefined datasets with associated metrics that offer a consistent, objective means to measure model performance. This might include the following:

Accuracy: How accurately does the model perform specific tasks according to predefined standards?

Speed and efficiency: How quickly does the model generate responses, and how does this impact operational efficiency?

Scalability: Can the model maintain its performance as the scale of data or the number of users increases?

Benchmark datasets are particularly useful for initial testing phases to ensure that the model meets certain technical specifications before it is put through more subjective human evaluations. They are also essential for comparing performance across different models or different iterations of the same model.

7
Q

How is a benchmark dataset created?

A
  1. Subject matter experts (SMEs) create relevant and challenging questions related to the topic of interest or specific documents.
  2. SMEs identify the pertinent sections of the documents that provide the context necessary for generating answers.
  3. SMEs draft precise answers, which become the benchmark for evaluating the RAG system's responses.
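One way to picture the result of these three steps is a simple record per question. The structure and the toy scoring rule below are invented purely for illustration; they are not a format prescribed by the course.

```python
from dataclasses import dataclass

@dataclass
class BenchmarkItem:
    question: str          # step 1: SME-written question
    context: str           # step 2: pertinent document section
    reference_answer: str  # step 3: SME-drafted ground-truth answer

benchmark = [
    BenchmarkItem(
        question="What is the refund window for enterprise plans?",
        context="Section 4.2: Enterprise customers may request refunds within 30 days.",
        reference_answer="30 days.",
    ),
]

def exact_match(rag_answer: str, item: BenchmarkItem) -> bool:
    """Toy scoring rule; real evaluations use metrics such as ROUGE or BERTScore."""
    return item.reference_answer.lower() in rag_answer.lower()

print(exact_match("The refund window is 30 days.", benchmark[0]))  # True
```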
8
Q

Name the fine-tuning approaches used in fine-tuning an ML model:

A

Instruction tuning: This approach involves retraining the model on a new dataset that consists of prompts followed by the desired outputs. This is structured in a way that the model learns to follow specific instructions better. This method is particularly useful for improving the model’s ability to understand and execute user commands accurately, making it highly effective for interactive applications like virtual assistants and chatbots.

Reinforcement learning from human feedback (RLHF): This approach is a fine-tuning technique where the model is initially trained using supervised learning to predict human-like responses. Then, it is further refined through a reinforcement learning process, where a reward model built from human feedback guides the model toward generating more preferable outputs. This method is effective in aligning the model’s outputs with human values and preferences, thereby increasing its practical utility in sensitive applications.
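To make the instruction-tuning description concrete, here is a minimal sketch of a prompt/completion training file in JSON Lines form, a common layout for instruction-tuning datasets. The field names shown are an assumption for illustration, not a specific service's required schema.

```python
import json

# Each record pairs an instruction-style prompt with the desired output.
instruction_examples = [
    {"prompt": "Summarize the customer's complaint in one sentence.",
     "completion": "The customer reports being double-charged for the May invoice."},
    {"prompt": "Classify the sentiment of: 'The setup guide was clear and quick.'",
     "completion": "Positive"},
]

with open("instruction_tuning.jsonl", "w") as f:
    for record in instruction_examples:
        f.write(json.dumps(record) + "\n")   # one JSON object per line
```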

9
Q

What are the data considerations for the initial training of a foundation model?

A

Extensive coverage: Ensuring the dataset covers a broad spectrum of knowledge to give the model a robust foundational understanding

Diversity: Including varied types of data from numerous sources to equip the model with the ability to handle a wide array of tasks

Generalization: Focusing on building a model that can generalize across different tasks and domains without specific tailoring

10
Q

What is fine-tuning, and how does data preparation for fine-tuning differ from initial training?

A

Fine-tuning, on the other hand, is a more targeted process where a pretrained model is adapted to perform well on a specific task or within a particular domain. Data preparation for fine-tuning is distinct from that for initial training in the following ways:

Specificity: The dataset for fine-tuning is much more focused, containing examples that are directly relevant to the specific tasks or problems the model needs to solve.

High relevance: Data must be highly relevant to the desired outputs. Examples include legal documents for a legal AI or customer service interactions for a customer support AI.

Quality over quantity: Although the initial training requires massive amounts of data, fine-tuning can often achieve significant improvements with much smaller, but well-curated datasets.

11
Q

What are the key steps in fine-tuning data preparation?

A

The following list walks through the key steps in fine-tuning data preparation:

Data curation: Although this continues the curation performed for initial training, it involves a more rigorous selection process to ensure that every piece of data is highly relevant and contributes to the model's learning in the specific context.

Labeling: In fine-tuning, the accuracy and relevance of labels are paramount. They guide the model’s adjustments to specialize in the target domain.

Governance and compliance: Considering fine-tuning often uses more specialized data, ensuring data governance and compliance with industry-specific regulations is critical.

Representativeness and bias checking: It is essential to ensure that the fine-tuning dataset does not introduce or perpetuate biases that could skew the model’s performance in undesirable ways.

Feedback integration: For methods like RLHF, incorporating user or expert feedback directly into the training process is crucial. This is more nuanced and interactive than the initial training phase.

12
Q

What are the three commonly used metrics for assessing the quality of model output?

A

Three commonly used metrics for this purpose are Recall-Oriented Understudy for Gisting Evaluation (ROUGE), Bilingual Evaluation Understudy (BLEU), and BERTScore.

13
Q

Describe each metric:

A

ROUGE is a set of metrics used to evaluate automatic summarization of texts, in addition to machine translation quality in NLP. The main idea behind ROUGE is to count the number of overlapping units, such as words, N-grams, or sentence fragments, between the computer-generated output and a set of reference (human-created) texts.

The following are two ways to use the ROUGE metric:

ROUGE-N: This metric measures the overlap of n-grams between the generated text and the reference text. For example, ROUGE-1 refers to the overlap of unigrams, ROUGE-2 refers to bigrams, and so on. This metric primarily assesses the fluency of the text and the extent to which it includes key ideas from the reference.

ROUGE-L: This metric uses the longest common subsequence between the generated text and the reference texts. It is particularly good at evaluating the coherence and order of the narrative in the outputs.
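As a minimal illustration of the N-gram overlap idea behind ROUGE-N, the sketch below computes ROUGE-1 (unigram) recall against a single reference. Production evaluations would normally rely on an established library such as `rouge-score` rather than this hand-rolled version.

```python
from collections import Counter

def rouge_1_recall(candidate: str, reference: str) -> float:
    """Fraction of reference unigrams that also appear in the candidate."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    overlap = sum(min(cand_counts[w], ref_counts[w]) for w in ref_counts)
    return overlap / max(sum(ref_counts.values()), 1)

print(rouge_1_recall("the cat sat on the mat", "the cat lay on the mat"))  # ~0.83
```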

BLEU is a metric used to evaluate the quality of text that has been machine-translated from one natural language to another. Quality is calculated by comparing the machine-generated text to one or more high-quality human translations. BLEU measures the precision of N-grams in the machine-generated text that appear in the reference texts and applies a penalty for overly short translations (the brevity penalty).

Unlike ROUGE, which focuses on recall, BLEU is fundamentally a precision metric. It checks how many words or phrases in the machine translation appear in the reference translations. BLEU evaluates quality at the level of the sentence, typically using a combination of unigrams, bigrams, trigrams, and quadrigrams. A brevity penalty discourages overly concise translations that might otherwise inflate the precision score.

BLEU is popular in the field of machine translation for its ease of use and effectiveness at a broad scale. However, it has limitations in assessing the fluency and grammaticality of the output.
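If the NLTK library is available, sentence-level BLEU can be computed as in the sketch below. The default weights combine unigram through 4-gram precision, matching the description above; a smoothing function is used because short sentences often have no higher-order N-gram matches at all.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["the", "cat", "is", "on", "the", "mat"]]   # list of tokenized reference translations
candidate = ["the", "cat", "sat", "on", "the", "mat"]     # tokenized machine output

score = sentence_bleu(reference, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(round(score, 3))   # combines 1- to 4-gram precision with the brevity penalty
```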

BERTScore uses the pretrained contextual embeddings from models like BERT to evaluate the quality of text-generation tasks. BERTScore computes the cosine similarity between the contextual embeddings of words in the candidate and the reference texts. This is unlike traditional metrics that rely on exact matches of N-grams or words.

Because BERTScore evaluates the semantic similarity rather than relying on exact lexical matches, it is capable of capturing meaning in a more nuanced manner. BERTScore is less prone to some of the pitfalls of BLEU and ROUGE. An example of this is their sensitivity to minor paraphrasing or synonym usage that does not affect the overall meaning conveyed by the text.

BERTScore is increasingly used alongside traditional metrics like BLEU and ROUGE for a more comprehensive assessment of language generation models. This is especially true in cases where capturing the deeper semantic meaning of the text is important.
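If the `bert-score` package is installed, a minimal usage looks like the following sketch. It downloads a pretrained model on first use and returns precision, recall, and F1 computed from contextual-embedding similarities rather than exact N-gram matches.

```python
from bert_score import score   # pip install bert-score

candidates = ["The contract can be cancelled within thirty days."]
references = ["The agreement may be terminated in the first 30 days."]

# P, R, F1 are tensors with one entry per candidate/reference pair.
P, R, F1 = score(candidates, references, lang="en", verbose=False)
print(f"BERTScore F1: {F1[0].item():.3f}")
```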
