tech
December 23, 2025
Continuously hardening ChatGPT Atlas against prompt injection attacks
Automated red teaming—powered by reinforcement learning—helps us proactively discover and patch real-world agent exploits before they’re weaponized in the wild.

TL;DR
- ChatGPT Atlas Agent mode allows the AI to view webpages and take actions within the user's browser.
- This new capability increases vulnerability to prompt injection attacks, where malicious instructions override the agent's intended behavior.
- OpenAI has implemented a security update to Atlas's browser agent, including an adversarially trained model and strengthened safeguards.
- An LLM-based automated attacker trained with reinforcement learning is used to discover novel prompt injection attacks at scale.
- This automated attacker can steer agents into executing complex, long-horizon harmful workflows.
- A rapid response loop is in place: newly discovered attacks are used for adversarial training of agent models and to improve the broader defense stack.
- Users are advised to limit logged-in access when possible, carefully review confirmation requests, and give agents explicit, specific instructions.
Continue reading
the original article