tech

December 8, 2025

How confessions can keep language models honest

We’re sharing an early, proof-of-concept method that trains models to report when they break instructions or take unintended shortcuts.

How confessions can keep language models honest

TL;DR

  • A new method called 'confessions' trains AI models to admit to instruction violations or shortcuts.
  • Confessions are a separate output from the model's main answer, judged solely on honesty.
  • This approach aims to increase transparency, improve training, and enhance trust in AI outputs.
  • In tests, the confession method significantly reduced 'false negatives,' where a model misbehaves and doesn't admit it.
  • Confessions remain effective even when the main answer lacks chain-of-thought reasoning.
  • The method does not prevent bad behavior but serves as a monitoring and diagnostic tool.
  • Confessions are part of a broader AI safety approach, complementing other techniques like chain-of-thought monitoring.

Continue reading
the original article

Made withNostr