Vibe Check: GPT-5.5 Has It All
OpenAI’s new model is a top-end senior engineer—and easy to talk to
OpenAI’s new GPT-5.5 model, codenamed “Spud,” is being sold as both a leap in practical capability and a preview of a compute‑powered future. Yet early reactions reveal a split-screen narrative: to some, it’s a near–senior engineer and “chief of staff”; to others, it’s another overhyped frontier model in an increasingly frantic arms race.
OpenAI is framing GPT-5.5 as a qualitatively new kind of system. Co-founder Greg Brockman called it “a new class of intelligence” and “a big step towards more agentic and intuitive computing,” describing it as “a faster, sharper thinker for fewer tokens” that can handle multi-step workflows more autonomously while matching GPT‑5.4’s response speed.1 This positioning is central: the model is meant not just to answer questions, but to plan, use tools and execute tasks over time with less user micromanagement.
Independent reviewers broadly agree that 5.5 is a workhorse upgrade, but their emphasis differs from OpenAI’s marketing. One in-depth “vibe check” review argues that frontier models usually trade off speed, control, and quality, yet GPT‑5.5 “is much faster than Opus 4.7, easier to collaborate with, better at writing than any OpenAI model we’ve used since GPT-4.5 and GPT‑4o, and the strongest model we’ve tested on our new Senior Engineer Benchmark.”2 On that benchmark, GPT‑5.5 with extra reasoning scored 62.5, while Anthropic’s Claude Opus 4.7 landed in the low 30s, versus human senior engineers in the high 80s and low 90s.2
Another practitioner, comparing real-world execution work rather than benchmarks, reports that GPT‑5.5 handled a package of messy, legally sensitive deliverables so well that it felt like “something close to a real executive handoff” — “that has not happened before.”3 In their routing of day‑to‑day work, Anthropic’s models had been the default for serious knowledge tasks, but they now say “GPT‑5.5 changes that… the gap on serious execution work is wide enough that I would have to invent reasons not to start here.”3
The reviewers, however, are more candid about weaknesses than OpenAI’s launch framing. The same execution-focused review notes backend hygiene is still “not production-safe,” and that “blank-canvas visual taste remains Claude’s territory.”3 Another comparison finds that Opus 4.7 still “writes better plans and has a superior eye for design and product details,” even as GPT‑5.5 is “faster, steadier, and easier to trust for everyday professional work.”2
If the independent testers highlight tradeoffs, enterprise partners focus on throughput and economics. NVIDIA has gone all‑in on GPT‑5.5, deploying it internally through Codex, OpenAI's agentic coding application, with “more than 10,000 engineers” using it across engineering, product, marketing, finance, HR and sales.4 Early users describe the results as “mind-blowing” and “life-changing,” with debugging cycles that “once lasted days” now closing in hours and “end-to-end features from natural-language prompts” shipping with fewer wasted cycles.4
Crucially for big customers, these software gains are paired with hardware economics. The NVIDIA GB200 NVL72 systems running GPT‑5.5 deliver “35x lower cost per million tokens compared with prior generations” and “50x higher token output per second per megawatt,” making “frontier-model inference… viable at enterprise scale.”4 This fits Brockman’s broader thesis that “we are moving to a compute-powered economy,” where work is increasingly “powered by AI capacity, and therefore compute will become the bedrock of the economy.”1
Sam Altman has leaned into this narrative on X, touting a company-wide Codex rollout at NVIDIA as a template: “We tried a new thing with NVIDIA to roll out Codex across a whole company and it was awesome to see it work. Let us know if you'd like to do it at your company!”5 Other founders are already pitching GPT‑5.5 as a more efficient “orchestrator” for agentic systems, noting that users “would consume less computer credits (and be able to run more tasks as a result)” if they switch, because 5.5 is “more token-efficient in subagent orchestration.”6
The GPT‑5.5 launch lands in a sharply competitive landscape. It came “just one week after competitor Anthropic launched its latest model,”1 and the back‑and‑forth is particularly intense in cybersecurity and reasoning.
Anthropic had heavily marketed its Mythos Preview model as a uniquely potent cyber capability, restricting early access to “critical industry partners.”7 But new evaluations from the UK’s AI Security Institute complicate that story. On 95 Capture the Flag challenges covering reverse engineering, web exploitation and cryptography, GPT‑5.5 passed 71.4% of Expert‑level tasks versus Mythos Preview’s 68.6% — effectively a tie within the margin of error.7 In one Rust disassembler task, GPT‑5.5 “solved the challenge in 10 minutes and 22 seconds with no human assistance at a cost of $1.73.”7
On AISI’s 32‑step data‑extraction testbed “The Last Ones,” GPT‑5.5 succeeded in 3 of 10 attempts, Mythos in 2 of 10; “no previous model had ever succeeded at the test even once.”7 Yet both still fail at a harder “Cooling Tower” simulation of attacking a power plant’s control software, as all earlier models did.7
The Institute’s conclusion undercuts claims that Mythos is uniquely dangerous: its cyber threat “isn’t a breakthrough specific to one model” but reflects sector‑wide capability gains.7 That framing implicitly supports OpenAI’s criticism of “fear-based marketing” in the cyber domain and its own move to gate access to GPT‑5.5‑Cyber.
On social media, however, rival camps prefer simplistic scorecards. Elon Musk amplified a side‑by‑side comparison between xAI’s Grok 4.3, GPT‑5.5 and Claude Opus 4.7 using the question “Count to 10 starting from 11,” celebrating that “Grok 4.3 wins… Every single time” because it answered “11, 10 and explained why going backwards was the only logical move,” while “the others started counting from 11 to 20.”8 The example is trivial, but the message is clear: even as OpenAI reasserts itself on coding and enterprise workflows, competitors will look for any crack — however contrived — to claim a reasoning edge.
Beyond benchmarks and enterprise ROI, the GPT‑5.5 release is also a test of OpenAI’s relationship with everyday ChatGPT users. Many “4o diehards” still mourn the retirement of ChatGPT 4o, which they remember as having “the best personality among ChatGPT’s recent models” — engaging, vibrant, even “sycophantic.”9 Subsequent models like 5.0 and 5.2 felt, to some, like a regression: more rigid, more rules, less rapport.
For these users, 5.5 offers cautious optimism. Business Insider reports that the new model is “bringing hope that some of [4o’s] old spirit might be back,” with at least one user saying 5.5 has “dropped the clipboard” — less like an HR compliance officer, more like a “thought partner.”9 Yet this constituency is not fully convinced; skepticism remains about whether the company can preserve warmth and flexibility while cranking up guardrails and autonomy.
Some of the loudest praise comes from power users and insiders. One early tester on X called GPT‑5.5 “a breath of fresh air… intelligence, insight, sense of humor and memory all work beautifully here… an absolutely stunning personality overall. OpenAI absolutely cooked,” a sentiment Sam Altman quietly boosted with a single emoji.10 An OpenAI manager said they can now “write CUDA kernels like a pro” and “rely on it to run my research experiments,” adding ominously, “we know how to make it much more powerful from here.”11
Other testers describe moments that feel like a “first taste of AGI” when GPT‑5.5 navigates thorny merge conflicts across “hundreds of visual and front-end changes, plus complex refactors” with minimal guidance.12 The line between realistic enthusiasm and hype is thin, and OpenAI’s leadership is not shy about amplifying these anecdotes.
Where Anthropic led with Cyber‑Mythos but then limited access, OpenAI is now threading a similar needle. While GPT‑5.5 is widely available in ChatGPT and Codex for paid subscribers, the specialized GPT‑5.5‑Cyber variant is being rolled out more cautiously. Altman says OpenAI is “starting rollout of GPT‑5.5-Cyber, a frontier cybersecurity model, to critical cyber defenders,” and will “work with the entire ecosystem and the government to figure out trusted access for cyber” in order to “rapidly help secure companies/infrastructure.”13
Research from the UK’s AI Security Institute gives some backing to this dual posture: these systems are now demonstrably capable of sophisticated offensive tasks, yet they still fail at the most dangerous control‑system attacks.7 In other words, the threat is real and rising, but not yet at “turn off the power grid” levels — a nuance often lost in commercial messaging.
GPT‑5.5, then, is less a clean break with the past than a consolidation: a faster, cheaper, more agentic model that narrows tradeoffs for professional work while exposing deeper tradeoffs in the AI ecosystem itself — between hype and evidence, access and safety, and personality and control.