Key Terms Flashcards

1
Q

Deliberative alignment

A

Deliberative alignment is a training method that teaches large language models (LLMs) to reason through safety specifications before responding to user prompts.

The model is trained to explicitly recall the relevant safety specifications and reason over them at inference time, so that its responses are safer and better calibrated to the context of each prompt.
