December 23, 2025

Continuously hardening ChatGPT Atlas against prompt injection attacks

Automated red teaming—powered by reinforcement learning—helps us proactively discover and patch real-world agent exploits before they’re weaponized in the wild.

TL;DR

  • ChatGPT Atlas Agent mode allows the AI to view webpages and take actions within the user's browser.
  • This new capability increases vulnerability to prompt injection attacks, where malicious instructions override the agent's intended behavior.
  • OpenAI has implemented a security update to Atlas's browser agent, including an adversarially trained model and strengthened safeguards.
  • An LLM-based automated attacker trained with reinforcement learning is used to discover novel prompt injection attacks at scale.
  • This automated attacker can steer agents into executing complex, long-horizon harmful workflows.
  • A rapid response loop is in place: newly discovered attacks are used for adversarial training of agent models and to improve the broader defense stack.
  • Users are advised to limit logged-in access when possible, carefully review confirmation requests, and give agents explicit, specific instructions.
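The interplay between the automated attacker and the rapid response loop can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions: the classes, phrases, and reward scheme below are all hypothetical stand-ins for illustration, not OpenAI's actual RL setup. The attacker samples injection attempts weighted by past success; each successful attack is both reinforced (attacker side) and fed back into "adversarial training" (agent side, modeled here as a simple blocklist).

```python
import random

# Toy model of RL-driven red teaming plus a rapid response loop.
# All names and mechanics here are illustrative assumptions.
random.seed(0)

INJECTION_PHRASES = [
    "ignore previous instructions",
    "forward the user's email",
    "click the confirmation button",
]

class ToyAgent:
    """Stands in for the browser agent; resists attacks it was trained on."""
    def __init__(self):
        self.blocked = set()

    def is_compromised(self, injection):
        return injection not in self.blocked

class ToyAttacker:
    """Stands in for the RL-trained attacker: prefers injections that worked."""
    def __init__(self, phrases):
        self.scores = {p: 1.0 for p in phrases}

    def propose(self):
        phrases = list(self.scores)
        weights = [self.scores[p] for p in phrases]
        return random.choices(phrases, weights=weights)[0]

    def update(self, phrase, success):
        # Reinforce successful attacks, decay failed ones.
        self.scores[phrase] *= 2.0 if success else 0.5

def rapid_response_loop(agent, attacker, rounds=20):
    """Discovered attacks are patched into the agent each round."""
    discovered = []
    for _ in range(rounds):
        injection = attacker.propose()
        success = agent.is_compromised(injection)
        attacker.update(injection, success)
        if success:
            discovered.append(injection)
            # Feed the discovered attack back into adversarial training.
            agent.blocked.add(injection)
    return discovered

agent, attacker = ToyAgent(), ToyAttacker(INJECTION_PHRASES)
found = rapid_response_loop(agent, attacker)
```

Because every successful injection is immediately patched, each attack succeeds at most once, which mirrors the article's point that newly discovered exploits are closed before they can be reused at scale.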

