Amazon Bedrock - Foundation Model Evaluation Flashcards
What is Automatic Evaluation in the context of model evaluation?
A method to evaluate a model for quality control using built-in task types such as text summarization, question and answer, text classification, or open-ended text generation.
What are benchmark questions?
Questions that are used to evaluate a model against ideal answers.
What is the purpose of benchmark datasets?
To evaluate the performance of a language model by measuring accuracy, speed, efficiency, and scalability.
True or False: Benchmark datasets can help detect bias in models.
True
What does ROUGE stand for?
Recall-Oriented Understudy for Gisting Evaluation.
What is ROUGE used for?
To evaluate automatic summarization and machine translation systems.
What does ROUGE-N measure?
The number of matching n-grams between reference and generated text.
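A minimal sketch of the idea, using a recall-only ROUGE-N over word n-grams (full implementations such as the rouge-score package also report precision and F1); the example sentences are illustrative:

```python
from collections import Counter

def rouge_n_recall(reference: str, generated: str, n: int = 2) -> float:
    """Fraction of reference n-grams that also appear in the generated text."""
    def ngram_counts(text: str) -> Counter:
        tokens = text.lower().split()
        return Counter(zip(*[tokens[i:] for i in range(n)]))

    ref, gen = ngram_counts(reference), ngram_counts(generated)
    matches = sum((ref & gen).values())  # clipped n-gram overlap
    return matches / sum(ref.values()) if ref else 0.0

# Bigram example: 3 of the 5 reference bigrams appear in the generated text.
print(rouge_n_recall("the cat sat on the mat", "the cat lay on the mat"))  # 0.6
```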
Fill in the blank: ROUGE-L computes the longest common _______ between reference and generated text.
subsequence
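The longest-common-subsequence computation can be sketched with a small dynamic program; this recall-only version against the reference is an illustration, not a full ROUGE-L implementation:

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Dynamic-programming length of the longest common subsequence."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

reference = "the cat sat on the mat".split()
generated = "the cat is on the mat".split()
print(lcs_length(reference, generated) / len(reference))  # 0.833... (LCS-based recall)
```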
What is BLEU used for?
To evaluate the quality of generated text, especially for translation.
What does BLEU penalize?
It applies a brevity penalty to generated translations that are too short relative to the reference.
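A minimal sketch using NLTK's sentence-level BLEU (pip install nltk); the sentences are illustrative, and the brevity penalty is applied automatically when the candidate is shorter than the reference:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["the", "cat", "is", "on", "the", "mat"]]  # list of tokenized references
candidate = ["the", "cat", "sat", "on", "the", "mat"]   # tokenized generated translation

# Smoothing avoids a zero score when a higher-order n-gram has no match.
score = sentence_bleu(reference, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(round(score, 3))
```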
What is the focus of BERTScore?
To evaluate the semantic similarity between generated text and the reference text.
How does BERTScore determine semantic similarity?
By comparing the embeddings of both texts and computing the cosine similarity.
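A minimal sketch using the bert-score package (pip install bert-score), which downloads a pretrained model on first use; the package choice and example sentences are assumptions, since the flashcards do not name a specific implementation:

```python
from bert_score import score

candidates = ["The weather was sunny and warm."]
references = ["It was a warm, sunny day."]

# Token embeddings from both texts are matched by cosine similarity,
# then aggregated into precision, recall, and F1.
P, R, F1 = score(candidates, references, lang="en")
print(f"BERTScore F1: {F1.item():.3f}")
```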
What does perplexity measure in model evaluation?
How well the model predicts the next token, with lower values indicating better performance.
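A worked sketch: perplexity is the exponential of the negative mean log-probability the model assigns to each predicted token; the log-probabilities below are hypothetical values for illustration:

```python
import math

# Hypothetical per-token log-probabilities assigned by the model.
token_log_probs = [-0.9, -1.4, -0.3, -2.1, -0.7]

# Perplexity = exp(negative mean log-probability); lower is better.
perplexity = math.exp(-sum(token_log_probs) / len(token_log_probs))
print(round(perplexity, 2))  # ~2.94
```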
What can business metrics evaluate in a model?
User satisfaction, average revenue per user, cross-domain performance, conversion rates, and efficiency.
What is the advantage of human evaluations in model assessment?
Humans can provide subjective insights and evaluations of generated answers against benchmark answers.
What are some evaluation metrics for model output?
- ROUGE
- BLEU
- BERTScore
- Perplexity
True or False: ROUGE and BLEU only look at individual words.
False
What kind of data can a generative AI model be trained on?
Clickstream data, cart data, purchased items, and customer feedback.
What is a judge model in automatic evaluation?
A GenAI model that compares benchmark answers to generated answers and provides grading scores.
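A hedged sketch of an LLM-as-judge call through the Amazon Bedrock Converse API (boto3); the model ID, rubric wording, and example answers are illustrative assumptions:

```python
import boto3

client = boto3.client("bedrock-runtime")

prompt = (
    "You are grading a generated answer against a benchmark answer.\n"
    "Benchmark answer: Paris is the capital of France.\n"
    "Generated answer: The capital of France is Paris.\n"
    "Return a score from 1 (poor) to 5 (matches the benchmark) and one sentence of reasoning."
)

response = client.converse(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",  # assumed judge model
    messages=[{"role": "user", "content": [{"text": prompt}]}],
)
print(response["output"]["message"]["content"][0]["text"])
```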
What is the purpose of using a feedback loop in model evaluation?
To retrain the model based on the quality of the scores from evaluation metrics.