Developing Generative AI Solutions Flashcards
Key Metrics in Defining a Use Case
Cost savings
Time Savings
Quality Improvements
Customer Satisfaction
Productivity Gains
Generative AI App Lifecycle
latency is the most crucial criterion for
a real-time application on resource-constrained mobile devices.
Prompt engineering refers to
the process of carefully crafting the input prompts or instructions given to the model to generate the desired outputs or behaviors. It aims to optimize prompts to steer the model’s generation in the desired direction, using the model’s capabilities while mitigating potential biases or undesirable outputs.
PE: Augmentation:
Incorporating additional information or constraints into the prompts, such as examples, demonstrations, or task-specific instructions, to guide the model’s generation process
PE: Tuning:
Iteratively refining and adjusting the prompts based on the model’s outputs and performance, often through human evaluation or automated metrics
PE: Ensembling:
Combining multiple prompts or generation strategies to improve the overall quality and robustness of the outputs
PE: Mining:
Exploring and identifying effective prompts through techniques like prompt searching, prompt generation, or prompt retrieval from large prompt libraries
PE: Design:
Crafting clear, unambiguous, and context-rich prompts that effectively communicate the desired task or output to the model
Fine-tuning
Fine-tuning refers to the process of taking a pre-trained language model and further training it on a specific task or domain-specific dataset. Fine-tuning allows the model to adapt its knowledge and capabilities to better suit the requirements of the business use case.
There are two ways to fine-tune a model:
1. Instruction fine-tuning uses examples of how the model should respond to a specific instruction. Prompt tuning is a type of instruction fine-tuning.
2. Reinforcement learning from human feedback (RLHF) provides human feedback data, resulting in a model that is better aligned with human preferences.
Fine-tuning is particularly useful when
the target task has a limited amount of training data. This is because the pre-trained model can provide a strong foundation of general knowledge, which is then specialized during fine-tuning.
Pursuing a more customized approach, such as training a model from scratch or heavily fine-tuning a pre-trained model, can potentially yield higher accuracy and better performance tailored to the specific use case. However,
this customization comes at a higher cost in terms of computational resources, data acquisition, and specialized expertise required for training and optimization.
By using agents for multi-step tasks,
organizations can achieve higher levels of automation, consistency, and efficiency in their cloud operations, while also improving visibility, control, and auditability of the processes involved.
Fine-tuning a pre-trained language model on domain-specific data is generally
the most cost-effective approach for customizing the model to a specific domain while maintaining high performance.
Benchmark Data Sets
The General Language Understanding Evaluation (GLUE)
benchmark is a collection of datasets for evaluating language understanding tasks like text classification, question answering, and natural language inference.
SuperGLUE is an extension of GLUE with
more challenging tasks and a focus on compositional language understanding.
Stanford Question Answering Dataset (SQuAD) is a dataset for
evaluating question-answering capabilities.
Workshop on Machine Translation (WMT) is a series of datasets and tasks for
evaluating machine translation systems.
Automated Metrics
Automated metrics can provide a quick and scalable way to evaluate foundation model performance. These metrics typically measure specific aspects of the model’s outputs:
Perplexity (a measure of how well the model predicts the next token)
BLEU score (for evaluating machine translation)
F1 score (for evaluating classification or entity recognition tasks)
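Two of the metrics above reduce to short formulas. A minimal plain-Python sketch (the function names and toy numbers are illustrative, not from any particular library):

```python
import math

def perplexity(token_probs):
    """Perplexity = exp of the average negative log-probability the
    model assigned to each actual next token; lower is better."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

def f1_score(tp, fp, fn):
    """F1 = harmonic mean of precision and recall, as used for
    classification or entity-recognition evaluation."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# A model that gives every token probability 0.25 is exactly as
# uncertain as a uniform choice among 4 tokens:
print(perplexity([0.25, 0.25, 0.25, 0.25]))  # ≈ 4.0

# 8 true positives, 2 false positives, 2 false negatives:
print(f1_score(8, 2, 2))  # precision = recall = 0.8, so F1 = 0.8
```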
Recall-Oriented Understudy for Gisting Evaluation (ROUGE) is a set of metrics used for
evaluating automatic summarization and machine translation systems. It measures the quality of a generated summary or translation by comparing it to one or more reference summaries or translations.
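As a rough illustration, ROUGE-1 recall can be computed by counting overlapping unigrams against a reference. This is a toy sketch; real ROUGE implementations add stemming, multiple references, and variants like ROUGE-2 and ROUGE-L:

```python
from collections import Counter

def rouge_1_recall(candidate, reference):
    """ROUGE-1 recall: fraction of reference unigrams that also
    appear in the candidate summary (with clipped counts)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(cand[word], ref[word]) for word in ref)
    return overlap / sum(ref.values())

reference = "the cat sat on the mat"
candidate = "the cat sat on a mat"
# 5 of the 6 reference unigrams are matched by the candidate:
print(rouge_1_recall(candidate, reference))  # ≈ 0.833
```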
Bilingual Evaluation Understudy (BLEU) is a metric used to evaluate the quality of
machine-generated text, particularly in the context of machine translation. It measures the similarity between a generated text and one or more reference translations, considering both precision and brevity.
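A simplified sentence-level BLEU can be sketched as the geometric mean of clipped n-gram precisions times a brevity penalty. Real BLEU is corpus-level, uses n-grams up to 4, and applies smoothing for zero counts; this toy version assumes at least one n-gram match at each order:

```python
import math
from collections import Counter

def simple_bleu(candidate, reference, max_n=2):
    """Simplified BLEU: geometric mean of clipped n-gram precisions
    (n = 1..max_n) multiplied by a brevity penalty that punishes
    candidates shorter than the reference."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        clipped = sum(min(count, ref_ngrams[g]) for g, count in cand_ngrams.items())
        log_precisions.append(math.log(clipped / sum(cand_ngrams.values())))
    brevity = min(1.0, math.exp(1 - len(ref) / len(cand)))
    return brevity * math.exp(sum(log_precisions) / max_n)

# A candidate identical to its reference scores a perfect 1.0:
print(simple_bleu("the cat sat on the mat", "the cat sat on the mat"))  # 1.0
```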
BERTScore is a metric that evaluates
the semantic similarity between a generated text and one or more reference texts. It uses pre-trained Bidirectional Encoder Representations from Transformers (BERT) models to compute contextualized embeddings for the input texts, and then calculates the cosine similarity between them.
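The final cosine-similarity step can be sketched as follows. Real BERTScore computes this over contextualized BERT embeddings and aggregates token-level matches into precision, recall, and F1; the 3-dimensional vectors here are toy stand-ins:

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors:
    dot product divided by the product of their magnitudes."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Identical "embeddings" have maximal similarity:
print(cosine_similarity([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # ≈ 1.0
```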
Human evaluators can provide qualitative feedback on factors like
coherence, relevance, factuality, and overall quality of the model’s outputs.
Metrics like ROUGE, BLEU, and BERTScore provide an initial assessment of
the foundation model’s capabilities.