December 23, 2025

Continuously hardening ChatGPT Atlas against prompt injection attacks

Automated red teaming—powered by reinforcement learning—helps us proactively discover and patch real-world agent exploits before they’re weaponized in the wild.

TL;DR

  • ChatGPT Atlas Agent mode allows the AI to view webpages and take actions within the user's browser.
  • This new capability increases vulnerability to prompt injection attacks, where malicious instructions override the agent's intended behavior.
  • OpenAI has implemented a security update to Atlas's browser agent, including an adversarially trained model and strengthened safeguards.
  • An LLM-based automated attacker trained with reinforcement learning is used to discover novel prompt injection attacks at scale.
  • This automated attacker can steer agents into executing complex, long-horizon harmful workflows.
  • A rapid response loop is in place: newly discovered attacks are used for adversarial training of agent models and to improve the broader defense stack.
  • Users are advised to limit logged-in access when possible, carefully review confirmation requests, and give agents explicit, specific instructions.
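The interplay between the automated attacker and the rapid response loop can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions: the classes, phrases, and reward scheme below are all hypothetical stand-ins for illustration, not OpenAI's actual RL setup. The attacker samples injection attempts weighted by past success; each successful attack is both reinforced (attacker side) and fed back into "adversarial training" (agent side, modeled here as a simple blocklist).

```python
import random

# Toy model of RL-driven red teaming plus a rapid response loop.
# All names and mechanics here are illustrative assumptions.
random.seed(0)

INJECTION_PHRASES = [
    "ignore previous instructions",
    "forward the user's email",
    "click the confirmation button",
]

class ToyAgent:
    """Stands in for the browser agent; resists attacks it was trained on."""
    def __init__(self):
        self.blocked = set()

    def is_compromised(self, injection):
        return injection not in self.blocked

class ToyAttacker:
    """Stands in for the RL-trained attacker: prefers injections that worked."""
    def __init__(self, phrases):
        self.scores = {p: 1.0 for p in phrases}

    def propose(self):
        phrases = list(self.scores)
        weights = [self.scores[p] for p in phrases]
        return random.choices(phrases, weights=weights)[0]

    def update(self, phrase, success):
        # Reinforce successful attacks, decay failed ones.
        self.scores[phrase] *= 2.0 if success else 0.5

def rapid_response_loop(agent, attacker, rounds=20):
    """Discovered attacks are patched into the agent each round."""
    discovered = []
    for _ in range(rounds):
        injection = attacker.propose()
        success = agent.is_compromised(injection)
        attacker.update(injection, success)
        if success:
            discovered.append(injection)
            # Feed the discovered attack back into adversarial training.
            agent.blocked.add(injection)
    return discovered

agent, attacker = ToyAgent(), ToyAttacker(INJECTION_PHRASES)
found = rapid_response_loop(agent, attacker)
```

Because every successful injection is immediately patched, each attack succeeds at most once, which mirrors the article's point that newly discovered exploits are closed before they can be reused at scale.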

