Amazon Bedrock - Foundation Model Evaluation Flashcards

1
Q

What is Automatic Evaluation in the context of model evaluation?

A

A method to evaluate a model for quality control using built-in task types such as text summarization, question and answer, text classification, or open-ended text generation.

2
Q

What are benchmark questions?

A

Questions paired with ideal answers, used to evaluate a model by comparing its responses against those ideal answers.

3
Q

What is the purpose of benchmark datasets?

A

To evaluate the performance of a language model by measuring accuracy, speed, efficiency, and scalability.

4
Q

True or False: Benchmark datasets can help detect bias in models.

A

True

5
Q

What does ROUGE stand for?

A

Recall-Oriented Understudy for Gisting Evaluation.

6
Q

What is ROUGE used for?

A

To evaluate automatic summarization and machine translation systems.

7
Q

What does ROUGE-N measure?

A

The number of matching n-grams between reference and generated text.

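The n-gram overlap behind ROUGE-N fits in a few lines of Python. A minimal sketch assuming plain whitespace tokenization and recall only (real implementations, such as the rouge-score package, add stemming and precision/F-measure):

```python
from collections import Counter

def ngrams(tokens, n):
    """Count every n-gram (tuple of n consecutive tokens) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(reference, generated, n=2):
    """Fraction of reference n-grams that also appear in the generated text."""
    ref = ngrams(reference.split(), n)
    gen = ngrams(generated.split(), n)
    overlap = sum((ref & gen).values())  # clipped matches: min count per n-gram
    return overlap / max(sum(ref.values()), 1)

print(rouge_n_recall("the cat sat on the mat", "the cat lay on the mat"))
# 0.6 -- 3 of the 5 reference bigrams appear in the generated text
```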
8
Q

Fill in the blank: ROUGE-L computes the longest common _______ between reference and generated text.

A

subsequence

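The longest common subsequence behind ROUGE-L is the classic dynamic-programming LCS applied to tokens. A minimal sketch with illustrative helper names, again assuming whitespace tokenization:

```python
def lcs_length(ref_tokens, gen_tokens):
    """Length of the longest common subsequence of two token lists."""
    m, n = len(ref_tokens), len(gen_tokens)
    dp = [[0] * (n + 1) for _ in range(m + 1)]  # dp[i][j] = LCS of prefixes
    for i in range(m):
        for j in range(n):
            if ref_tokens[i] == gen_tokens[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    return dp[m][n]

ref = "the cat sat on the mat".split()
gen = "the cat lay on a mat".split()
print(lcs_length(ref, gen) / len(ref))  # 4/6 ~ 0.67 (ROUGE-L recall)
```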
9
Q

What is BLEU used for?

A

To evaluate the quality of generated text, especially for machine translation.

10
Q

What does BLEU penalize?

A

Overly brief translations: candidates shorter than the reference receive a brevity penalty that lowers the score.

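The brevity penalty has a simple closed form: BP = 1 when the candidate is at least as long as the reference, and exp(1 - r/c) otherwise, where r and c are the reference and candidate lengths in tokens. A minimal sketch:

```python
import math

def brevity_penalty(ref_len, gen_len):
    """BLEU's brevity penalty: 1.0 for long-enough candidates,
    exp(1 - r/c) (shrinking toward 0) for short ones."""
    if gen_len >= ref_len:
        return 1.0
    return math.exp(1 - ref_len / gen_len)

print(brevity_penalty(10, 10))  # 1.0  -- no penalty
print(brevity_penalty(10, 5))   # ~0.37 -- half-length candidate is penalized
```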
11
Q

What is the focus of BERTScore?

A

To evaluate the semantic similarity between generated text and the reference text.

12
Q

How does BERTScore determine semantic similarity?

A

By comparing the embeddings of both texts and computing the cosine similarity.

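The cosine-similarity step can be sketched with NumPy. Note that full BERTScore greedily matches per-token contextual embeddings and aggregates the matches; the vectors below are placeholder embeddings, not real BERT output:

```python
import numpy as np

def cosine_similarity(a, b):
    """cos(a, b) = (a . b) / (||a|| * ||b||), ranging from -1 to 1."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

ref_emb = np.array([0.2, 0.7, 0.1])   # hypothetical reference embedding
gen_emb = np.array([0.25, 0.6, 0.2])  # hypothetical generated embedding
print(cosine_similarity(ref_emb, gen_emb))  # ~0.98: close to 1 => similar meaning
```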
13
Q

What does perplexity measure in model evaluation?

A

How well the model predicts the next token, with lower values indicating better performance.

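Perplexity is the exponential of the average negative log-likelihood of the true next tokens. A minimal sketch, assuming the probability the model assigned to each correct token is available:

```python
import math

def perplexity(token_probs):
    """exp of the mean negative log-probability; lower is better."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

print(perplexity([0.5, 0.5, 0.5]))   # 2.0  -- like guessing between 2 options
print(perplexity([0.9, 0.8, 0.95]))  # ~1.13 -- the model predicts tokens well
```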
14
Q

What can business metrics evaluate in a model?

A

User satisfaction, average revenue per user, cross-domain performance, conversion rates, and efficiency.

15
Q

What is the advantage of human evaluations in model assessment?

A

Humans can provide subjective insights when evaluating generated answers against benchmark answers, which automated metrics may miss.

16
Q

What are some evaluation metrics for model output?

A
  • ROUGE
  • BLEU
  • BERTScore
  • Perplexity
17
Q

True or False: ROUGE and BLEU only look at individual words.

A

False. Both metrics work on n-grams, so they can match multi-word sequences (and ROUGE-L matches whole subsequences), not just individual words.

18
Q

What kind of data can a generative AI model be trained on?

A

Clickstream data, cart data, purchased items, and customer feedback.

19
Q

What is a judge model in automatic evaluation?

A

A GenAI model that compares benchmark answers to generated answers and provides grading scores.
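As a hedged sketch, a judge model can be invoked through the Amazon Bedrock Converse API; the model ID and the grading prompt below are illustrative assumptions, not a prescribed setup:

```python
import boto3

bedrock = boto3.client("bedrock-runtime")

def judge_score(question, benchmark_answer, generated_answer):
    """Ask a judge model to grade a generated answer against the benchmark."""
    prompt = (
        f"Question: {question}\n"
        f"Benchmark answer: {benchmark_answer}\n"
        f"Generated answer: {generated_answer}\n"
        "On a scale of 1-5, how well does the generated answer match the "
        "benchmark answer? Reply with the score only."
    )
    response = bedrock.converse(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",  # example judge model
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    return response["output"]["message"]["content"][0]["text"]
```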

20
Q

What is the purpose of using a feedback loop in model evaluation?

A

To retrain the model based on the scores produced by the evaluation metrics.