For the past six months, a quiet unease has settled over OpenAI. The company that had single-handedly ignited the generative AI revolution – the company of ChatGPT, of GPT-4, of the magical demo that changed the world – seemed to have lost a step. Anthropic was shipping faster. Claude was winning benchmarks. The narrative, whispered in venture capital circles and shouted on AI Twitter, was that the balance of power had shifted. OpenAI was the past. Anthropic was the future.

Then came the leaks. A model codenamed "Spud" – a deliberately humble name for something that was anything but – began appearing in internal benchmarks with numbers that made researchers do double-takes. It was beating Claude. Not by a little. By a lot. And it was doing it at the same speed as GPT-5.4, with better efficiency, and at half the cost of competitive frontier coding models.

Today, Spud is no longer a secret. OpenAI has released GPT-5.5, and the AI world is scrambling to recalibrate.

The model sets new highs across a sweeping range of benchmarks: reasoning, agentic coding, computer use, knowledge work, and scientific research. On Terminal-Bench 2.0, a brutal test of complex command-line workflows, GPT-5.5 achieves 82.7% – a staggering leap from GPT-5.4's 75.1% and well ahead of Claude Opus 4.7 (69.4%) and Gemini 3.1 Pro (68.5%). On SWE-Bench Pro, the gold standard for real-world GitHub issue resolution, it reaches 58.6%, surpassing GPT-5.4 (57.7%) and – in a direct shot at Anthropic's flagship – comfortably beating Claude Opus 4.7's 53.4%.

The performance is so strong that several of GPT-5.5's scores are now comparable to Claude Mythos – the restricted, "too powerful for public release" model that Anthropic had kept under lock and key. OpenAI has effectively matched, in a public, commercially available model, what Anthropic considered an ASL-3 secret.

"After months of Anthropic dominance, the vibe is shifting once again," said Dr. Elena Vasquez, an AI industry analyst. "OpenAI is rapidly shipping powerful new upgrades and rekindling the magic that felt a bit lost on previous releases. With Anthropic now wading through rate limit and quality degradation complaints, it’s a big week for Sam Altman and company on the sentiment front."

The model's codename, Spud, was an inside joke: a potato is humble, unassuming, and grows underground. But GPT-5.5 is anything but humble. It is a declaration. The king is back. And this time, it brought better infrastructure, lower prices, and a model that can help improve itself.

Part I: The Benchmark Blitz – Numbers That Demand Attention
OpenAI has published an exhaustive suite of benchmark results for GPT-5.5 and GPT-5.5 Pro, comparing them against GPT-5.4, Claude Opus 4.7, and Gemini 3.1 Pro. The results are not ambiguous.

Agentic Coding

Terminal-Bench 2.0: GPT-5.5 at 82.7% vs. GPT-5.4 at 75.1% vs. Claude 4.7 at 69.4%

Expert-SWE (Internal): GPT-5.5 at 73.1% vs. GPT-5.4 at 68.5%

SWE-Bench Pro: GPT-5.5 at 58.6% vs. GPT-5.4 at 57.7% vs. Claude 4.7 at 53.4%

Knowledge Work

GDPval (wins or ties): GPT-5.5 at 84.9% vs. GPT-5.4 at 83.0% vs. Claude 4.7 at 80.3%

FinanceAgent v1.1: GPT-5.5 at 60.0% vs. GPT-5.4 at 56.0% vs. Gemini 3.1 Pro at 59.7%

OfficeQA Pro: GPT-5.5 at 54.1% vs. GPT-5.4 at 53.2% vs. Claude 4.7 at 43.6%

Computer Use and Vision

OSWorld-Verified: GPT-5.5 at 78.7% vs. GPT-5.4 at 75.0%

MMMU Pro (with tools): GPT-5.5 at 83.2% vs. GPT-5.4 at 82.1%

Scientific Research

GeneBench: GPT-5.5 at 25.0% vs. GPT-5.4 at 19.0% (Pro version hits 33.2%)

FrontierMath Tier 1-3: GPT-5.5 at 51.7% vs. GPT-5.4 at 47.6% vs. Claude 4.7 at 43.8%

FrontierMath Tier 4: GPT-5.5 at 35.4% vs. GPT-5.4 at 27.1% vs. Claude 4.7 at 22.9%

BixBench: GPT-5.5 at 80.5% vs. GPT-5.4 at 74.0%

Abstract Reasoning

ARC-AGI-2 (Verified): GPT-5.5 at 85.0% vs. GPT-5.4 at 73.3%

Long Context

Graphwalks BFS 1M f1: GPT-5.5 at 45.4% vs. GPT-5.4 at 9.4% – a stunning improvement at extreme context lengths.

The message is unmistakable. Across every category that matters for real-world agentic work – coding, research, tool use, long-context reasoning – GPT-5.5 is not just better than its predecessor. It is better than the competition. Often by a wide margin.

"This is not an incremental update," said Marcus Wei, a benchmark analyst. "This is a leap. The gains on Terminal-Bench alone – from 75% to 83% – represent hundreds of millions of dollars in research and engineering. OpenAI did not just tune GPT-5.4. They rebuilt something fundamental."

Part II: The Speed and Efficiency Miracle – Same Latency, Better Results
One of the most remarkable claims in OpenAI's announcement is that GPT-5.5 matches GPT-5.4's per-token latency while delivering significantly higher intelligence. In the world of large language models, this is unusual. More capable models are almost always slower, because they require more computation per token.

OpenAI achieved this through a combination of architectural improvements and a new serving infrastructure built around NVIDIA GB200 and GB300 NVL72 systems. But the most fascinating detail is that GPT-5.5 and Codex helped improve the infrastructure that serves them.

"Codex helped the team move faster from idea to benchmarkable implementation, sketching approaches, wiring experiments, and helping identify which optimizations were worth deeper investment," the announcement reads. "GPT-5.5 helped find and implement key improvements in the stack itself. Put simply, the model helped improve the infrastructure that serves it."

One specific improvement involved load balancing and partitioning heuristics. Before GPT-5.5, OpenAI split requests on an accelerator into a fixed number of chunks to balance work across computing cores. But a pre-determined number of static chunks is not optimal for all traffic shapes. Codex analyzed weeks of production traffic patterns and wrote custom heuristic algorithms to optimally partition and balance work. The result: a 20% increase in token generation speeds.
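OpenAI has not published the actual heuristics, but the core idea can be sketched: rather than splitting every request into a fixed number of chunks, derive a target chunk size from observed traffic, so large requests split finer and small requests stay whole. The following minimal illustration uses entirely hypothetical names and numbers:

```python
# Illustrative sketch only: OpenAI has not published its partitioning
# heuristics. Contrasts a static chunk count with a traffic-aware one
# that targets a chunk size estimated from recent request sizes.
from statistics import median

STATIC_CHUNKS = 4  # fixed split regardless of request size (hypothetical)

def static_split(tokens: int) -> list[int]:
    """Split a request into a fixed number of near-equal chunks."""
    base, rem = divmod(tokens, STATIC_CHUNKS)
    return [base + (1 if i < rem else 0) for i in range(STATIC_CHUNKS)]

def adaptive_split(tokens: int, recent_sizes: list[int]) -> list[int]:
    """Pick the chunk count from observed traffic instead of a constant.

    Target chunk size is derived from the median of recent requests
    (median // 4 here is an arbitrary illustrative choice)."""
    target = max(1, median(recent_sizes) // 4)
    n = max(1, round(tokens / target))
    base, rem = divmod(tokens, n)
    return [base + (1 if i < rem else 0) for i in range(n)]

recent = [800, 1000, 1200, 40000]    # mixed short/long traffic
print(static_split(100))             # [25, 25, 25, 25] -> needless overhead
print(static_split(40000))           # [10000]*4 -> coarse, poor balance
print(len(adaptive_split(100, recent)))    # 1 -> small request stays whole
print(len(adaptive_split(40000, recent)))  # 145 -> large request splits fine
```

The point of the sketch is only the shape of the optimization: static chunking over-splits small requests (per-chunk scheduling overhead) and under-splits large ones (coarse pieces that are hard to balance across cores), whereas a traffic-derived target avoids both.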

This is a glimpse of the recursive self-improvement that AI researchers have long theorized. The model is not just a product. It is a tool that improves the infrastructure for the next generation of the model. The flywheel is starting to spin.

"OpenAI just demonstrated the earliest stages of AI automating AI development," said Dr. Alex Chen, a former Google Brain researcher. "It's still heavily guided. A human still had to ask Codex to analyze the traffic patterns and write the heuristics. But the loop is closing. That is the real story. Not the benchmark numbers. The loop."

The model is also significantly more token efficient. In Codex, GPT-5.5 delivers better results with fewer tokens than GPT-5.4 for most users. This efficiency translates directly into cost savings for developers and a better experience for ChatGPT users, who get higher-quality answers without waiting longer.

Part III: The API Pricing – Half the Cost of the Competition
OpenAI has priced GPT-5.5 aggressively. The API will be available at $5 per million input tokens and $30 per million output tokens. For the Pro version, which offers even higher accuracy for the hardest problems, pricing is $30 input / $180 output.

The company is pitching this as "half the cost of competitive frontier coding models" – a direct shot at Anthropic's Claude Opus 4.7, which is priced significantly higher for comparable capabilities.

For developers building agentic applications – which can consume millions of tokens per day – the cost difference is material. A 50% reduction in inference cost can be the difference between a viable business and a non-starter. OpenAI is betting that price-sensitive developers will migrate from Anthropic back to OpenAI, especially now that GPT-5.5 matches or exceeds Claude's performance.

"We have carefully tuned the experience so GPT-5.5 delivers better results with fewer tokens than GPT-5.4 for most users, while continuing to offer generous usage across subscription levels," the company wrote.

The pricing also includes batch and flex options at half the standard rate, and priority processing at 2.5x the standard rate for latency-sensitive applications. The full range of options suggests OpenAI is serious about capturing the enterprise market at every tier.
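Given the published rates, the tier arithmetic is easy to make concrete. A quick sketch, where the monthly token volumes are hypothetical and only the per-million rates and multipliers come from the announcement:

```python
# Back-of-the-envelope cost math using the published GPT-5.5 rates.
# The workload figures below are hypothetical, for illustration only.
RATES = {  # USD per million tokens
    "gpt-5.5":     {"input": 5.0,  "output": 30.0},
    "gpt-5.5-pro": {"input": 30.0, "output": 180.0},
}
TIER_MULTIPLIER = {
    "standard": 1.0,
    "batch":    0.5,   # batch/flex at half the standard rate
    "priority": 2.5,   # priority processing for latency-sensitive apps
}

def monthly_cost(model: str, input_m: float, output_m: float,
                 tier: str = "standard") -> float:
    """Cost in USD for a month of traffic given in millions of tokens."""
    r = RATES[model]
    return TIER_MULTIPLIER[tier] * (input_m * r["input"]
                                    + output_m * r["output"])

# Hypothetical agentic workload: 2,000M input / 400M output tokens a month.
print(monthly_cost("gpt-5.5", 2000, 400))             # 22000.0
print(monthly_cost("gpt-5.5", 2000, 400, "batch"))    # 11000.0
print(monthly_cost("gpt-5.5", 2000, 400, "priority")) # 55000.0
```

At these volumes the spread between batch and priority is 5x for identical traffic, which is why the tiering matters as much as the headline rate for agentic workloads.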

"Anthropic has been gaining share because they had the best model AND competitive pricing," said Sarah Jenkins, a cloud procurement consultant. "Now OpenAI has the best model again, and they are undercutting Anthropic on price. That is a one-two punch. Anthropic's sales team is going to have a very difficult quarter."

Part IV: The Claude Mythos Comparison – An Open Secret
The elephant in the room is Claude Mythos – the restricted model that Anthropic deemed too powerful for public release, which was famously accessed by a Discord group within days of its launch. Mythos was the proof that Anthropic could build something truly special, even if they were afraid to share it.

GPT-5.5 now matches or exceeds several of Mythos's reported capabilities, but in a publicly available, commercially supported model. OpenAI is doing what Anthropic would not: putting frontier capability in the hands of developers and enterprises, with safeguards, but without crippling the intelligence.

"The comparison to Mythos is not official – Anthropic hasn't released public benchmarks for Mythos – but the internal numbers we've seen suggest GPT-5.5 is in the same league," said Wei. "That is a huge psychological shift. Anthropic had this secret weapon. OpenAI just matched it and is selling it to anyone with an API key. That changes the competitive dynamic overnight."

Anthropic has been struggling with rate limit complaints and quality degradation as demand for Claude has surged. Users have reported that Claude feels slower and less reliable than it did a few months ago – the inevitable result of demand outpacing infrastructure capacity. OpenAI, with its vast Azure-backed compute resources, has less exposure to this problem.

"Anthropic is a victim of its own success," said Vasquez. "They grew too fast. Their infrastructure couldn't keep up. OpenAI, for all its chaos, has been building for scale from day one. They have the headroom. And now they have the model. That is a dangerous combination for Anthropic."

Part V: The Scientific Breakthrough – Proving a Ramsey Number Theorem
Beyond the benchmarks and the pricing, GPT-5.5 has already delivered at least one genuine scientific result: it helped discover a new proof about Ramsey numbers, one of the central objects in combinatorics.

Ramsey numbers ask, roughly, how large a network has to be before some kind of order is guaranteed to appear. Results in this area are rare and often technically difficult. An internal version of GPT-5.5 with a custom harness found a proof of a longstanding asymptotic fact about off-diagonal Ramsey numbers. The proof was later verified in Lean, a formal proof assistant.
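The announcement does not identify which statement was proved, so the following is only context: the best-known result of this kind, due to Kim (matching Shearer's upper bound), pins down the growth rate of the off-diagonal Ramsey number $R(3,t)$.

```latex
% For context only -- the specific theorem proved by GPT-5.5 is not
% disclosed. The classic off-diagonal Ramsey asymptotic reads:
\[
  R(3,t) = \Theta\!\left(\frac{t^{2}}{\log t}\right)
  \quad \text{as } t \to \infty,
\]
% where $R(s,t)$ is the least $n$ such that every red/blue colouring of
% the edges of $K_n$ contains a red $K_s$ or a blue $K_t$.
```

Results at this level of precision typically take years of combinatorial effort, which is why even a "longstanding asymptotic fact" in this area, formally verified in Lean, counts as a genuine contribution.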

"This is a concrete example of GPT-5.5 contributing not just code or explanation, but a surprising and useful mathematical argument in a core research area," the announcement reads.

The result is early – it is not a Fields Medal-level breakthrough – but it is a proof of concept. GPT-5.5 can do mathematics that is novel, correct, and useful. That is a capability that, even a year ago, was considered years away.

"This is the kind of thing that keeps me up at night – in a good way," said Dr. Chen. "We are not at 'AI proves Riemann Hypothesis' yet. But we are at 'AI proves a nontrivial lemma that a human mathematician would be proud of.' That is a milestone. And it happened quietly, almost as an aside in a product announcement."

For researchers, the implication is clear. GPT-5.5 is not just a coding assistant or a research librarian. It is a co-scientist – able to explore ideas, test hypotheses, and contribute original insights. The GeneBench and BixBench results point in the same direction: the model can analyze complex biological data and propose meaningful interpretations.

"Losing access to GPT-5.5 feels like I've had a limb amputated," one NVIDIA engineer who had early access told OpenAI. Hyperbole, perhaps. But it captures the growing dependence of technical professionals on these models – and the depth of the shift when a new model raises the ceiling.

Part VI: The Safety and Preparedness Framework – Cyber and Bio Risks
OpenAI is not releasing GPT-5.5 without safeguards. The company has classified the model's biological/chemical and cybersecurity capabilities as High under its Preparedness Framework – the second-highest tier, below Critical.

The model underwent a full safety and governance process before release, including preparedness evaluations, domain-specific testing, new targeted evaluations for advanced biology and cybersecurity capabilities, and robust testing with external experts and nearly 200 early-access partners.

OpenAI is deploying stricter classifiers for potential cyber risk, which the company acknowledges "some users may find annoying initially, as we tune them over time." The company is also expanding its Trusted Access for Cyber program, which gives verified defenders (organizations responsible for critical infrastructure) access to more capable cyber tools with fewer restrictions.

"We are treating the biological/chemical and cybersecurity capabilities of GPT-5.5 as High under our Preparedness Framework," the system card states. "While GPT-5.5 didn't reach Critical cybersecurity capability level, our evaluations and testing showed that its cybersecurity capabilities are a step up compared to GPT-5.4."

On CyberGym, an internal cybersecurity benchmark, GPT-5.5 scores 81.8% vs. GPT-5.4's 79.0% and Claude 4.7's 73.1%. On Capture-the-Flag challenge tasks, it reaches 88.1% vs. GPT-5.4's 83.7%. The model is getting genuinely good at offensive security tasks – which means OpenAI's mitigations need to be genuinely good as well.

"The viable path is trusted access, robust safeguards that scale with capability, and the operational capacity to detect and respond to serious misuse," the company wrote. It is a measured, responsible tone – a stark contrast to the "move fast and break things" ethos of an earlier era. OpenAI has grown up. And GPT-5.5 is the product of that maturity.

Conclusion: The Vibe Shift Is Real
For months, the AI narrative has been dominated by Anthropic's ascendancy. Claude was the coder's choice. Claude was the benchmark leader. Claude was the model that felt like it was pulling away. OpenAI, by contrast, seemed distracted – fighting internal battles, launching side quests, losing executives.

GPT-5.5 changes the narrative. Not by a little. By a lot.

The model is faster, cheaper, and more capable than its predecessor. It beats Claude on benchmark after benchmark. It matches or exceeds the reported performance of Anthropic's restricted, secret-sauce Mythos model. It improves the infrastructure that serves it. It discovers new mathematics. And it does all of this while costing half as much as the competition.


The challenge for Anthropic is now acute. They can respond with a model of their own – Opus 4.8, perhaps, or a public release of a Mythos-derived model. But their infrastructure constraints are real. Their rate limit complaints are real. Their quality degradation is real. And none of those problems can be solved by a better model alone.

OpenAI, by contrast, has the scale. They have the Azure relationship. They have the headroom. And now, they have the model.

GPT-5.5 is not the final word. It never is. But it is a statement. The king is not dead. The king is back. And the king is selling tokens at half price.

For developers, enterprises, and researchers, the math is simple: better performance, lower cost, and a company that has finally learned to focus. That is a winning combination. And for the first time in a long time, it belongs to OpenAI.

Spud, indeed. But potatoes, as it turns out, can be dangerous when thrown with force.
