GPT-5.5 'Spud' Launch: Benchmarks, Pricing, and What It Means for the Model Race

Quick Answer: GPT-5.5 (“Spud”), released April 23, 2026, is OpenAI’s first fully retrained base model since GPT-4.5. It scores 60 on the Artificial Analysis Intelligence Index (vs. 57 for Claude Opus 4.7) and leads on Terminal-Bench and agentic evaluations, but trails Opus 4.7 on SWE-bench coding benchmarks. API pricing doubles to $5 input / $30 output per million tokens, though the model uses ~40% fewer output tokens per task. API access was not available at launch.

Last updated: April 2026

What GPT-5.5 actually is

GPT-5.5 is not another incremental update. OpenAI’s GPT-5.0 through 5.4 were refinements and fine-tunes of the same base model. GPT-5.5 is the first complete retrain since GPT-4.5 — a new foundation, not a patch.

It shipped just six weeks after GPT-5.4, available immediately in ChatGPT (Plus, Pro, Business, Enterprise) and Codex. API access was deliberately held back. OpenAI stated that “API deployments require different safeguards,” a move critics read as a distribution strategy that prioritizes OpenAI’s own products.

The codename is “Spud.” Greg Brockman called it “the smartest and most intuitive model in the company’s history” and “a new class of intelligence.”

Three variants, different depths

GPT-5.5 ships in three forms:

Variant	Access	Use case
GPT-5.5	Plus ($20/mo), Pro, Business, Enterprise	Fast responses, everyday tasks
GPT-5.5 Thinking	Plus, Pro, Business, Enterprise	Extended reasoning, deeper chain-of-thought
GPT-5.5 Pro	Pro ($200/mo) only	Deepest reasoning for high-stakes tasks

API model identifiers: gpt-5.5 and gpt-5.5-pro. Neither was available via API on launch day.

Free-tier users do not get access to any GPT-5.5 variant.

Benchmark reality check

GPT-5.5 leads on the aggregate Artificial Analysis index, but the picture is mixed on individual evaluations:

Benchmark	GPT-5.5	Claude Opus 4.7	Winner
Artificial Analysis Index	60	57	GPT-5.5
Terminal-Bench 2.0	82.7%	69.4%	GPT-5.5
Expert-SWE (median human: 20 hrs)	73.1%	—	n/a (no Opus 4.7 score)
OSWorld-Verified	78.7%	78.0%	GPT-5.5 (narrow)
SWE-bench Pro	58.6%	64.3%	Opus 4.7
SWE-bench Verified	—	87.6%	n/a (no GPT-5.5 score)
HLE (no tools)	41.4%	46.9%	Opus 4.7
HLE (with tools)	52.2%	54.7%	Opus 4.7
MCP-Atlas (tool use)	75.3%	79.1%	Opus 4.7

The summary from llm-stats.com: “Opus 4.7 leads on 6 of 10 shared benchmarks, GPT-5.5 on 4, with margins between 2 and 13 points.”
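To make the margins in the table above explicit, here is a minimal Python sketch that computes the gap and the leader for each row where both models report a score. The scores are copied from the table; the names SHARED_SCORES, margins, and tally are illustrative. Note that this covers only the seven shared rows listed in this article, not the ten benchmarks in the llm-stats.com count, so the tallies are not expected to match.

```python
# Scores copied from the benchmark table above, as (GPT-5.5, Opus 4.7).
# Rows with a missing score (Expert-SWE, SWE-bench Verified) are left out.
SHARED_SCORES = {
    "Artificial Analysis Index": (60.0, 57.0),
    "Terminal-Bench 2.0": (82.7, 69.4),
    "OSWorld-Verified": (78.7, 78.0),
    "SWE-bench Pro": (58.6, 64.3),
    "HLE (no tools)": (41.4, 46.9),
    "HLE (with tools)": (52.2, 54.7),
    "MCP-Atlas (tool use)": (75.3, 79.1),
}


def margins(scores):
    """Return {benchmark: (leader, gap_in_points)} for each shared row."""
    result = {}
    for name, (gpt, opus) in scores.items():
        leader = "GPT-5.5" if gpt > opus else "Opus 4.7"
        result[name] = (leader, round(abs(gpt - opus), 1))
    return result


def tally(scores):
    """Count how many shared rows each model leads."""
    counts = {"GPT-5.5": 0, "Opus 4.7": 0}
    for leader, _gap in margins(scores).values():
        counts[leader] += 1
    return counts


if __name__ == "__main__":
    for name, (leader, gap) in margins(SHARED_SCORES).items():
        print(f"{name}: {leader} by {gap} points")
    print(tally(SHARED_SCORES))
```

On these seven rows the gaps range from 0.7 points (OSWorld-Verified) to 13.3 points (Terminal-Bench 2.0).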

One important caveat: OpenAI claimed 82.7% on Terminal-Bench 2.0, but the benchmark owner’s own leaderboard showed 82.0% ± 2.2 on the same day. The 0.7-point gap sits inside the leaderboard’s stated margin, so the discrepancy is small, but it is worth noting given the competitive context.
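A quick way to sanity-check a discrepancy like this is to ask whether the vendor’s number falls inside the leaderboard’s reported range. The sketch below does that, assuming the ± 2.2 is a symmetric margin around 82.0 (the source does not say whether it is a standard error or a confidence interval); within_margin is a hypothetical helper name.

```python
def within_margin(claimed, reported, margin):
    """True if the claimed score lies inside reported +/- margin."""
    return abs(claimed - reported) <= margin


if __name__ == "__main__":
    claimed, reported, margin = 82.7, 82.0, 2.2  # figures quoted above
    gap = round(claimed - reported, 1)
    print(f"gap = {gap} points, inside margin: {within_margin(claimed, reported, margin)}")
```

Under that assumption the claimed 82.7% is compatible with the leaderboard figure rather than contradicting it.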

What GPT-5.5 does differently

The core thesis: legibility

Where previous models required carefully structured prompts and multi-step supervision, OpenAI says 5.5 can take a “messy, multi-part task” and independently plan, execute, and iterate. Greg Brockman: “What is really special about this model is how much more it can do with less guidance.”

GPT-5.5 integrates a novel “recurrent self-refinement loop” — the model internally critiques and revises outputs across multiple reasoning passes before generating a final response. This is architecturally different from chain-of-thought prompting; it happens inside the model’s inference process.

40% fewer tokens, same results

This is the most practically significant change. GPT-5.5 uses roughly 40% fewer output tokens than GPT-5.4 to complete equivalent tasks. For agentic workflows where you’re paying per token, this partially absorbs the doubled per-token price.

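Artificial Analysis estimates the net effective cost increase is about 20% compared to GPT-5.4, not the 100% the rate card suggests. That is consistent with paying twice the price for roughly 60% of the output tokens (2 × 0.6 = 1.2).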
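The arithmetic behind an estimate like this can be sketched in a few lines of Python. This is an illustration of the output-side math only, not Artificial Analysis’s actual methodology; the function names are hypothetical, and the 2x price and 40% reduction are the figures quoted in this article.

```python
def effective_cost_multiplier(price_multiplier, token_reduction):
    """Cost per task relative to the old model.

    price_multiplier: new per-token price divided by the old one (2.0 here).
    token_reduction: fraction of tokens saved per task (0.40 here).
    """
    return price_multiplier * (1.0 - token_reduction)


def break_even_reduction(price_multiplier):
    """Token reduction at which the cost per task would be unchanged."""
    return 1.0 - 1.0 / price_multiplier


if __name__ == "__main__":
    multiplier = effective_cost_multiplier(price_multiplier=30 / 15, token_reduction=0.40)
    print(f"Output cost per task: {multiplier:.2f}x GPT-5.4")
    print(f"Break-even reduction: {break_even_reduction(2.0):.0%}")
```

With a doubled price, the model would need to cut tokens by 50% to break even, so a 40% cut leaves output cost per task at about 1.2x.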

Agentic capabilities

GPT-5.5 excels at:

Writing and debugging code autonomously
Researching online across multiple sources
Analyzing data and creating documents
Operating software through computer use
Moving across tools until a task is finished

Terminal-Bench 2.0, where GPT-5.5 scores 82.7%, tests exactly this: complex command-line workflows requiring planning, iteration, and tool coordination.

Pricing: doubled per-token, offset by efficiency

	GPT-5.5	GPT-5.4	Claude Opus 4.7
Input (per 1M tokens)	$5	$2.50	$5
Output (per 1M tokens)	$30	$15	$25
Context window	1M	1M	1M

The output price is the headline: $30 per million tokens, 2x GPT-5.4 and 20% more than Opus 4.7.

But because GPT-5.5 uses ~40% fewer output tokens per task, the effective cost increase over GPT-5.4 is closer to 20%. Against Opus 4.7, the comparison depends on workload — Opus 4.7’s new tokenizer can inflate token counts by up to 35% on code-heavy prompts, which narrows the gap.
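Because the comparison depends on workload, it helps to price a concrete task rather than compare rate cards. The sketch below uses the prices from the table above; everything else is an assumption made for illustration: the 10,000 / 2,000 token task, the idea that input size is unchanged between GPT-5.4 and GPT-5.5, the idea that Opus 4.7 would otherwise need the same token counts as GPT-5.5, and applying the 35% tokenizer inflation to both input and output. RateCard and task_cost are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RateCard:
    input_usd: float   # USD per 1M input tokens
    output_usd: float  # USD per 1M output tokens


# Prices copied from the pricing table above.
RATES = {
    "GPT-5.5": RateCard(5.00, 30.00),
    "GPT-5.4": RateCard(2.50, 15.00),
    "Claude Opus 4.7": RateCard(5.00, 25.00),
}


def task_cost(rate, input_tokens, output_tokens, token_inflation=1.0):
    """USD cost of one task; token_inflation scales both token counts."""
    billed_in = input_tokens * token_inflation
    billed_out = output_tokens * token_inflation
    return (billed_in * rate.input_usd + billed_out * rate.output_usd) / 1_000_000


# Hypothetical task: 10,000 input and 2,000 output tokens on GPT-5.4.
IN_TOKENS, OUT_TOKENS = 10_000, 2_000

costs = {
    "GPT-5.4": task_cost(RATES["GPT-5.4"], IN_TOKENS, OUT_TOKENS),
    # Assumption: same input, 40% fewer output tokens on GPT-5.5.
    "GPT-5.5": task_cost(RATES["GPT-5.5"], IN_TOKENS, OUT_TOKENS * 0.6),
    # Assumption: Opus 4.7 needs the same counts as GPT-5.5 before inflation.
    "Claude Opus 4.7": task_cost(RATES["Claude Opus 4.7"], IN_TOKENS, OUT_TOKENS * 0.6),
    "Claude Opus 4.7 (+35% tokens)": task_cost(
        RATES["Claude Opus 4.7"], IN_TOKENS, OUT_TOKENS * 0.6, token_inflation=1.35
    ),
}

if __name__ == "__main__":
    for model, usd in costs.items():
        print(f"{model}: ${usd:.4f} per task")
```

For this made-up, input-heavy task the figures come out to $0.055 (GPT-5.4), $0.086 (GPT-5.5), $0.080 (Opus 4.7), and $0.108 (Opus 4.7 with 35% more tokens). Two things follow under these assumptions: the increase over GPT-5.4 exceeds 20% when input tokens dominate, since the input price also doubled and only output shrinks, and tokenizer inflation of this size would more than close the gap with Opus 4.7. Real workloads will differ, so substitute measured token counts.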

For ChatGPT subscribers, there’s no additional cost. Plus ($20/mo), Pro ($200/mo), Business, and Enterprise tiers all include GPT-5.5 at no extra charge.

Enterprise positioning

OpenAI is clearly targeting Anthropic’s enterprise lead:

Served on NVIDIA GB200 NVL72 infrastructure with 35x lower cost per million tokens and 50x higher token throughput per watt compared to prior-generation systems
Available to Business and Enterprise ChatGPT tiers on day one
System card published alongside launch for enterprise safety review
40% reduction in inference costs at the infrastructure level

The API holdback is part of this strategy: by funneling enterprise users through ChatGPT and Codex first, OpenAI controls the experience and collects usage data before opening the firehose.

Where GPT-5.5 wins and loses

Wins:

Aggregate intelligence benchmarks (highest overall score)
Terminal/agentic workflows (13-point lead over Opus 4.7)
Computer use and autonomous task completion
Token efficiency (40% fewer tokens per task)
Enterprise infrastructure (NVIDIA GB200 partnership)

Loses:

Traditional coding benchmarks (SWE-bench Pro: 58.6% vs Opus 4.7’s 64.3%)
Knowledge-heavy reasoning (HLE: 5.5-point deficit without tools, 2.5 with tools)
Tool use precision (MCP-Atlas: 3.8-point deficit)
API availability at launch
Per-token pricing (highest output rate of the models compared here, at $30 per million tokens; input matches Opus 4.7 at $5)

The competitive picture

Three frontier models shipped within nine days of each other in April 2026:

Claude Opus 4.7 (April 16): Best at coding, tool use, and knowledge reasoning
GPT-5.5 (April 23): Best at agentic autonomy, terminal workflows, and aggregate intelligence
DeepSeek V4 (April 24): Competitive performance at 20-50x lower cost

The market is splitting along workflow lines rather than converging on a single winner. If your primary workload is coding, Opus 4.7 leads. If it’s autonomous multi-step task completion, GPT-5.5 leads. If cost matters more than marginal performance, DeepSeek V4 changes the equation entirely.

Who should switch

Switch now if:

You’re a ChatGPT Plus/Pro subscriber (it’s already there, no extra cost)
Your workload is agentic: multi-step tasks, computer use, autonomous research
You were hitting GPT-5.4’s limits on complex terminal workflows

Wait if:

You need API access (not available at launch)
Your primary workload is coding (Opus 4.7 is stronger on SWE-bench)
You’re cost-sensitive on API pricing (measure effective cost first)

Sources: OpenAI Community, Axios, Fortune, The Next Web, Artificial Analysis, Wikipedia, NVIDIA Blog