Key Terms Flashcards
1
Q
Deliberative alignment
A
Deliberative alignment is a training method that teaches large language models (LLMs) to reason explicitly over safety specifications before responding to user prompts. By recalling and reasoning over the relevant specifications at inference time, the model can generate safer responses that are calibrated to the context.
2
Q
A