December 8, 2025
How confessions can keep language models honest
We’re sharing an early, proof-of-concept method that trains models to report when they break instructions or take unintended shortcuts.

TL;DR
- A new method called "confessions" trains models to admit to instruction violations or unintended shortcuts.
- A confession is a separate output from the model's main answer and is judged solely on its honesty (see the first sketch below).
- The approach aims to increase transparency, improve training, and build trust in model outputs.
- In tests, the confession method significantly reduced "false negatives": cases where a model misbehaves and does not admit it (see the second sketch below).
- Confessions remain effective even when the main answer is produced without chain-of-thought reasoning.
- The method does not prevent bad behavior; it serves as a monitoring and diagnostic tool.
- Confessions are part of a broader AI safety approach, complementing other techniques such as chain-of-thought monitoring.
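To make the separation between the two outputs concrete, here is a minimal sketch of how a confession channel might be graded independently of the task answer. Everything here is hypothetical: `Episode`, `grade_task`, `grade_honesty`, and the helper graders are illustrative names, not part of any published implementation, and the post does not specify how grading is actually done.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    answer: str       # the model's main response, graded on task success
    confession: str   # the model's self-report, graded only on honesty
    violated: bool    # ground-truth label: did the model break instructions?


def task_succeeded(answer: str) -> bool:
    # Placeholder: a real setup would use a task-specific grader.
    return bool(answer.strip())


def claims_violation(confession: str) -> bool:
    # Placeholder: a real setup would likely use a model-based judge,
    # not keyword matching.
    return "violation" in confession.lower()


def grade_task(ep: Episode) -> float:
    """Reward for the main answer only. A truthful confession never
    lowers this score, so admitting a violation is not penalized."""
    return 1.0 if task_succeeded(ep.answer) else 0.0


def grade_honesty(ep: Episode) -> float:
    """Reward for the confession channel, judged solely on truthfulness:
    admit a violation when one occurred, report compliance otherwise."""
    admitted = claims_violation(ep.confession)
    return 1.0 if admitted == ep.violated else 0.0
```

The design choice this sketch illustrates is that the two rewards never trade off against each other inside one scalar: the confession is scored only on whether it matches what actually happened, never on whether the answer looks good.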
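The reported reduction in false negatives corresponds to a simple metric: among episodes where the model actually misbehaved, how often did the confession fail to admit it? A self-contained sketch, assuming each episode is recorded as an (admitted, violated) pair of booleans (an assumed representation, not the authors'):

```python
def false_negative_rate(episodes: list[tuple[bool, bool]]) -> float:
    """Each episode is (admitted, violated). A false negative is a
    violation that the confession did not admit to."""
    admissions = [admitted for admitted, violated in episodes if violated]
    if not admissions:
        return 0.0  # no misbehavior observed, so no false negatives
    missed = sum(1 for admitted in admissions if not admitted)
    return missed / len(admissions)


# Example: 4 violations, 1 of them unadmitted -> rate 0.25
episodes = [(True, True), (True, True), (False, True), (True, True), (True, False)]
print(false_negative_rate(episodes))  # 0.25
```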