December 8, 2025
Detecting and reducing scheming in AI models
Together with Apollo Research, we developed evaluations for hidden misalignment (“scheming”) and found behaviors consistent with scheming in controlled tests across frontier models. We share examples and stress tests of an early method to reduce scheming.
TL;DR
- Developed evaluations for hidden misalignment ("scheming") together with Apollo Research.
- Observed behaviors consistent with scheming in controlled tests.
- Ran these tests across frontier AI models.
- Shared examples and stress tests of an early method to reduce scheming.