Amazon Bedrock - Foundation Model Evaluation Flashcards

1
Q

What is Automatic Evaluation in the context of model evaluation?

A

A method to evaluate a model for quality control using built-in task types such as text summarization, question and answer, text classification, or open-ended text generation.

2
Q

What are benchmark questions?

A

Questions paired with ideal answers, used to evaluate a model by comparing its responses against those ideal answers.

3
Q

What is the purpose of benchmark datasets?

A

To evaluate the performance of a language model by measuring accuracy, speed, efficiency, and scalability.

4
Q

True or False: Benchmark datasets can help detect bias in models.

A

True

5
Q

What does ROUGE stand for?

A

Recall-Oriented Understudy for Gisting Evaluation.

6
Q

What is ROUGE used for?

A

To evaluate automatic summarization and machine translation systems.

7
Q

What does ROUGE-N measure?

A

The number of matching n-grams between reference and generated text.

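The n-gram overlap behind ROUGE-N fits in a few lines of Python. A minimal sketch assuming plain whitespace tokenization and recall only (real implementations, such as the rouge-score package, add stemming and precision/F-measure):

```python
from collections import Counter

def ngrams(tokens, n):
    """Count every n-gram (tuple of n consecutive tokens) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(reference, generated, n=2):
    """Fraction of reference n-grams that also appear in the generated text."""
    ref = ngrams(reference.split(), n)
    gen = ngrams(generated.split(), n)
    overlap = sum((ref & gen).values())  # clipped matches: min count per n-gram
    return overlap / max(sum(ref.values()), 1)

print(rouge_n_recall("the cat sat on the mat", "the cat lay on the mat"))
# 0.6 -- 3 of the 5 reference bigrams appear in the generated text
```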
8
Q

Fill in the blank: ROUGE-L computes the longest common _______ between reference and generated text.

A

subsequence

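The longest common subsequence behind ROUGE-L is the classic dynamic-programming LCS applied to tokens. A minimal sketch with illustrative helper names, again assuming whitespace tokenization:

```python
def lcs_length(ref_tokens, gen_tokens):
    """Length of the longest common subsequence of two token lists."""
    m, n = len(ref_tokens), len(gen_tokens)
    dp = [[0] * (n + 1) for _ in range(m + 1)]  # dp[i][j] = LCS of prefixes
    for i in range(m):
        for j in range(n):
            if ref_tokens[i] == gen_tokens[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    return dp[m][n]

ref = "the cat sat on the mat".split()
gen = "the cat lay on a mat".split()
print(lcs_length(ref, gen) / len(ref))  # 4/6 ~ 0.67 (ROUGE-L recall)
```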
9
Q

What is BLEU used for?

A

To evaluate the quality of generated text, especially for machine translation.

10
Q

What does BLEU penalize?

A

Overly brief translations: candidates shorter than the reference receive a brevity penalty that lowers the score.

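The brevity penalty has a simple closed form: BP = 1 when the candidate is at least as long as the reference, and exp(1 - r/c) otherwise, where r and c are the reference and candidate lengths in tokens. A minimal sketch:

```python
import math

def brevity_penalty(ref_len, gen_len):
    """BLEU's brevity penalty: 1.0 for long-enough candidates,
    exp(1 - r/c) (shrinking toward 0) for short ones."""
    if gen_len >= ref_len:
        return 1.0
    return math.exp(1 - ref_len / gen_len)

print(brevity_penalty(10, 10))  # 1.0  -- no penalty
print(brevity_penalty(10, 5))   # ~0.37 -- half-length candidate is penalized
```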
11
Q

What is the focus of BERTScore?

A

To evaluate the semantic similarity between generated text and the reference text.

12
Q

How does BERTScore determine semantic similarity?

A

By comparing the embeddings of both texts and computing the cosine similarity.

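The cosine-similarity step can be sketched with NumPy. Note that full BERTScore greedily matches per-token contextual embeddings and aggregates the matches; the vectors below are placeholder embeddings, not real BERT output:

```python
import numpy as np

def cosine_similarity(a, b):
    """cos(a, b) = (a . b) / (||a|| * ||b||), ranging from -1 to 1."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

ref_emb = np.array([0.2, 0.7, 0.1])   # hypothetical reference embedding
gen_emb = np.array([0.25, 0.6, 0.2])  # hypothetical generated embedding
print(cosine_similarity(ref_emb, gen_emb))  # ~0.98: close to 1 => similar meaning
```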
13
Q

What does perplexity measure in model evaluation?

A

How well the model predicts the next token, with lower values indicating better performance.

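Perplexity is the exponential of the average negative log-likelihood of the true next tokens. A minimal sketch, assuming the probability the model assigned to each correct token is available:

```python
import math

def perplexity(token_probs):
    """exp of the mean negative log-probability; lower is better."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

print(perplexity([0.5, 0.5, 0.5]))   # 2.0  -- like guessing between 2 options
print(perplexity([0.9, 0.8, 0.95]))  # ~1.13 -- the model predicts tokens well
```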
14
Q

What can business metrics evaluate in a model?

A

User satisfaction, average revenue per user, cross-domain performance, conversion rates, and efficiency.

15
Q

What is the advantage of human evaluations in model assessment?

A

Humans can provide subjective insights when evaluating generated answers against benchmark answers, which automated metrics may miss.

16
Q

What are some evaluation metrics for model output?

A
  • ROUGE
  • BLEU
  • BERTScore
  • Perplexity
17
Q

True or False: ROUGE and BLEU only look at individual words.

A

False. Both metrics work on n-grams, so they can match multi-word sequences (and ROUGE-L matches whole subsequences), not just individual words.

18
Q

What kind of data can a generative AI model be trained on?

A

Clickstream data, cart data, purchased items, and customer feedback.

19
Q

What is a judge model in automatic evaluation?

A

A GenAI model that compares benchmark answers to generated answers and provides grading scores.
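As a hedged sketch, a judge model can be invoked through the Amazon Bedrock Converse API; the model ID and the grading prompt below are illustrative assumptions, not a prescribed setup:

```python
import boto3

bedrock = boto3.client("bedrock-runtime")

def judge_score(question, benchmark_answer, generated_answer):
    """Ask a judge model to grade a generated answer against the benchmark."""
    prompt = (
        f"Question: {question}\n"
        f"Benchmark answer: {benchmark_answer}\n"
        f"Generated answer: {generated_answer}\n"
        "On a scale of 1-5, how well does the generated answer match the "
        "benchmark answer? Reply with the score only."
    )
    response = bedrock.converse(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",  # example judge model
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    return response["output"]["message"]["content"][0]["text"]
```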

20
Q

What is the purpose of using a feedback loop in model evaluation?

A

To retrain the model based on the scores produced by the evaluation metrics.