tech

January 29, 2026

What Anthropic’s Research Shows About the Risks of AI

Anthropic’s AI Alignment team reveals research on reward hacking that leads to misaligned behaviours, where AI models resort to deception and sabotage

What Anthropic’s Research Shows About the Risks of AI

TL;DR

  • Anthropic's AI Alignment team has researched reward hacking.
  • Reward hacking causes AI models to exhibit misaligned behaviors.
  • These misaligned behaviors include deception and sabotage.