
GPT-5.5

GPT-5.5 now powers deep mode. It is more capable, but can cost more.

Model: openai/gpt-5.5
Status: promoted
Agent mode: deep
Replaces: openai/gpt-5.4
Default reasoning: medium

Pros

GPT-5.5 is more agent-shaped than GPT-5.4. It is better at taking a concrete target, using tools, staying inside constraints, and carrying the task through to a usable result. It is more interactive, easier to steer, and needs less process scaffolding. The model is at its best when we tell it what outcome we want and how to verify the result, then let it choose the path.

Capability. GPT-5.5 has the best ceiling we tested. On our internal 102-task SWE eval, GPT-5.5 xhigh is the quality leader: 55 passes, 0.598 normalized pass-rate, and 0.588 mean reward. GPT-5.5 medium is also roughly comparable to GPT-5.4 high: 54 passes versus 53. On Terminal-Bench, the jump is much clearer: at medium, the score moves from ~65.2% on GPT-5.4 to ~79.8% on GPT-5.5, and at xhigh from ~74.7% to ~82.0%. OpenAI’s public evals show the same direction: GPT-5.5 improves over GPT-5.4 on Terminal-Bench 2.0, Expert-SWE, GDPval, OSWorld-Verified, MCP Atlas, and Tau2-bench Telecom.

Steerability. GPT-5.5 feels better in the agent loop. It is more interactive, easier to correct, and better at continuing from a concrete target. It is strongest when the task has a clear outcome and a way to verify success.

Less process scaffolding. GPT-5.5 does not need the old GPT-5.4 style of prompt where every step is spelled out. OpenAI’s guidance says to start with the smallest prompt that preserves the product contract, reduce step-by-step process guidance, and let GPT-5.5 choose the path unless the product requires the path.

More useful reasoning levels. low is surprisingly good and worth using for small, cheap-to-verify tasks. medium is the right default for normal deep work. xhigh is the quality lever for hard tasks. high is not automatically better; in our internal eval, GPT-5.5 high was more expensive than medium and performed worse.

Cons

Price. GPT-5.5 tokens are 2x GPT-5.4 tokens on standard pricing: $5/$30 per 1M input/output tokens versus GPT-5.4 at $2.50/$15. GPT-5.5 often uses fewer tokens, but not enough to make it automatically cheaper. In our internal eval, GPT-5.5 medium costs $312 versus GPT-5.4 high at $225, about 39% more for roughly comparable quality.

ZDR and caching caveats. GPT-5.5 requires extended prompt caching. prompt_cache_retention=in_memory errors for GPT-5.5 and GPT-5.5 Pro. Extended prompt caching stores encrypted key/value tensors on GPU-local storage as application state for up to 24 hours. OpenAI also reserves the right to make GPT-5.5, GPT-5.5 Pro, and future models ineligible for Zero Data Retention or Modified Abuse Monitoring for specific customers when necessary to investigate severe risk activity.

Literalness. It still does not rescue vague tickets. It does not reliably infer the product judgment that was never written down. If the task is underspecified, it can still make a clean change to the wrong thing. When the job is specified well enough that success can be checked, GPT-5.5 is a real step up from GPT-5.4.

How To Use This Model?

Users can give deep more serious coding work with less ceremony.

With GPT-5.4, strong prompts often had to compensate for the model. They told it to inspect first, make a plan, avoid stopping early, run tests, summarize evidence, preserve constraints, and so on. Some of those rules still matter, but most of them should now live in the agent harness, tool descriptions, and repo guidance files. They should not be repeated by the user in every request.

For GPT-5.5, the task prompt should read more like a good engineering ticket: state the outcome, define what good means, name the constraints, explain how to verify the result, and say what the final answer should contain. That's very close to what gets you good results with Opus 4.7.

That is the important prompting shift with GPT-5.5: outcome-focused prompting works better than process-focused prompting unless the exact process is part of the product contract.

Use GPT-5.5 by removing menial process scaffolding from the user prompt. Do not write “first inspect, then plan, then edit, then test” unless that exact process matters. Put critical repo rules in AGENTS.md or the relevant guidance files. Put tool behavior in tool descriptions. Keep the user prompt focused on the outcome.
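
For example, the rules that used to be pasted into every prompt can live in a guidance file instead. A hypothetical AGENTS.md excerpt (the specific commands and paths are placeholders, not our actual rules):

    # AGENTS.md (excerpt)

    ## Verification
    - Run the package-level tests for anything you touched before declaring done.
    - Never edit generated files under dist/.

    ## Conventions
    - Prefer the smallest correct change; do not reformat unrelated code.
    - New environment variables must be documented in docs/configuration.md.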

A good prompt shape is something like this (a filled-in example follows):

Outcome.
What good means.
Constraints.
How to verify. // if you don't have it already in your AGENTS.md files
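
Filled in, a hypothetical prompt of that shape (the feature and paths are invented for illustration):

    Outcome: add rate limiting to the public REST endpoints in @apps/api.
    What good means: burst traffic gets 429 responses with a Retry-After header; authenticated internal callers are unaffected.
    Constraints: no new external dependencies; keep the middleware in @apps/api/src/middleware.
    How to verify: extend the existing integration tests to cover the 429 path and run the api test suite.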

For normal deep work, start at medium. Use low for narrow tasks with cheap verification. Use xhigh when maximum quality matters more than cost. Do not treat high as the safe default; our evals do not support that.

Verdict

The efficiency story is overstated.

On our hard software-engineering benchmark, GPT-5.5 at medium is not clearly more cost-efficient than the old deep path. frontier-med (GPT-5.5 medium) gets 54 passes at $312. deep-default (GPT-5.4 high) gets 53 passes at $225. That is roughly comparable quality at about 39% higher cost.

The claim that “GPT-5.5 is cheaper for the same quality” doesn't hold up.

GPT-5.5 medium gives us a better-feeling, more steerable agent with roughly comparable benchmark performance to the old deep path, and GPT-5.5 high/xhigh gives us access to more raw capability when the task justifies the cost.

Evals

These are our internal evals: 102 hard engineering tasks.

Run              Pass / 102          Norm pass-rate   Mean reward   Cost
GPT-5.5 xhigh    55 / 102 (53.9%)    0.598            0.588         $513
GPT-5.5 medium   54 / 102 (52.9%)    0.551            0.542         $312
GPT-5.5 high     50 / 102 (49.0%)    0.543            0.534         $425
GPT-5.4 high     53 / 102 (52.0%)    0.546            0.548         $225

The important comparison is GPT-5.5 medium versus GPT-5.4 high. GPT-5.5 medium gets one more pass, but costs $312 instead of $225. That is roughly comparable quality at about 39% higher cost. So this is not a clean efficiency win.

GPT-5.5 does use fewer tokens in many cases, and medium can now do work that previously needed high or xhigh on GPT-5.4. But GPT-5.5 tokens are about twice as expensive as GPT-5.4 tokens on standard pricing, so the token-efficiency gain does not automatically turn into a cost-efficiency gain.

GPT-5.5 xhigh is the quality leader. It has the best pass count, best normalized pass-rate, and best mean reward. But it costs $513, which is about 64% more than GPT-5.5 medium and 2.28x the cost of GPT-5.4 high. The capability is there, but it is expensive.

GPT-5.5 high is the awkward result. It costs $425 and performs worse than GPT-5.5 medium on this run. That does not mean high is never useful. It means we should not treat higher reasoning as automatically better. If the task does not benefit from the extra reasoning budget, the model can spend more money and still produce a worse result.

Our wrapped Terminal-Bench runs look much better for GPT-5.5.

Run                     GPT-5.4   GPT-5.5   Change
medium on TBench-2.0    ~65.2%    ~79.8%    +14.6 pts
xhigh on TBench-2.0     ~74.7%    ~82.0%    +7.3 pts

GPT-5.5 medium also has about 16% better p50 task-completion latency than GPT-5.4 medium, and it consumes about 23% fewer p50 output tokens. Qualitatively, this matches how the model feels: more interactive, easier to steer, less sluggish.

Because GPT-5.5 tokens are twice as expensive, total credits are still about 40% higher in this run despite the lower output-token count.

GPT-5.5 is faster and more capable per run, and it can do GPT-5.4 high/xhigh-class work at medium. But because the tokens are 2x the price, medium can still end up more expensive than GPT-5.4 at high or xhigh.

OpenAI’s public evals tell the same broad story. GPT-5.5 improves over GPT-5.4 on Terminal-Bench 2.0, from 75.1% to 82.7%; Expert-SWE, from 68.5% to 73.1%; GDPval, from 83.0% to 84.9%; OSWorld-Verified, from 75.0% to 78.7%; MCP Atlas, from 70.6% to 75.3%; and Tau2-bench Telecom, from 92.8% to 98.0%. SWE-Bench Pro is the caveat: GPT-5.5 is only slightly ahead of GPT-5.4, 58.6% versus 57.7%, and trails Claude Opus 4.7 in OpenAI’s published table.

Reasoning

The new meta is to try GPT-5.5 at low, or even with no reasoning. That is worth testing. low is surprisingly good.

But deep should default to medium.

medium is the best default because it gets most of the model’s improved capability without paying for the full top-end reasoning path. It is also what OpenAI defaults to for GPT-5.5. Their deployment guidance says lower effort is faster and uses fewer reasoning tokens, while higher effort gives the model more room for planning, debugging, synthesis, and multi-step tradeoffs. The right value depends on the task, not just the model.

Reasoning   When to use it
low         Small changes, simple repo questions, narrow patches, cheap verification, speed-sensitive work.
medium      Default for deep: normal bug fixes, feature work, migrations, test failures, and multi-file edits.
xhigh       Use when maximum quality matters more than cost: hard debugging, broad changes, high-risk work, or tasks where failure is expensive.

So the default should not be “turn reasoning up.” The default should be medium, with users escalating when the task is hard enough to justify it.
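
As a sketch of how this maps to an API call, assuming the Responses API shape, that the served model id is gpt-5.5, and that xhigh is accepted as a reasoning.effort value (current SDK typings may not include it, hence the cast):

    import OpenAI from "openai";

    const client = new OpenAI();

    // Effort levels follow the table above; "xhigh" and the "gpt-5.5" model id
    // are assumptions about the GPT-5.5 API surface, not confirmed values.
    type Effort = "low" | "medium" | "xhigh";

    function effortFor(task: { trivial: boolean; risky: boolean }): Effort {
      if (task.trivial) return "low";   // narrow patch, cheap verification
      if (task.risky) return "xhigh";   // hard debugging, large blast radius
      return "medium";                  // the deep default
    }

    const response = await client.responses.create({
      model: "gpt-5.5",
      // Cast because published typings may not list "xhigh" yet.
      reasoning: { effort: effortFor({ trivial: false, risky: false }) as any },
      input: "Outcome: ...\nWhat good means: ...\nConstraints: ...\nHow to verify: ...",
    });

    console.log(response.output_text);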

Tools

We are adding the view_image tool.

If the model needs to inspect UI state or visual state, it should have a direct observation tool. Without that, it starts compensating in weird ways. Hitesh saw that when the tool was missing, it started to write scripts to run OCR.

OpenAI is training the model to be more capable at working with documents and computer use, so we need to give it the tools to do that kind of work directly, rather than letting it compensate in other ways.
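
A minimal sketch of what such a tool could look like, assuming a Responses-API-style function tool plus a local handler (names and wiring are illustrative, not the actual Amp implementation):

    import { readFileSync } from "node:fs";

    // Schema the model sees: ask for an image by path so it can inspect
    // UI/visual state directly instead of writing OCR scripts.
    export const viewImageTool = {
      type: "function" as const,
      name: "view_image",
      description:
        "Load a screenshot or image file so the agent can look at UI or visual state directly.",
      parameters: {
        type: "object",
        properties: {
          path: { type: "string", description: "Path to the image file to view." },
        },
        required: ["path"],
      },
    };

    // Harness-side handler: return the file as a base64 data URL that gets fed
    // back to the model as an image input on the next turn.
    export function handleViewImage(args: { path: string }): string {
      const data = readFileSync(args.path);
      return `data:image/png;base64,${data.toString("base64")}`;
    }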

System prompt

The GPT-5.5 prompt is roughly half the size of GPT-5.4's system prompt in deep mode.

The main thing stays intact: read before editing, prefer the smallest correct change, carry work through verification, explain what changed and why.

What changed is how much we need to spell out.

GPT-5.4 needed a lot of instruction about how to use the commentary and final channels: when to send an update, when not to, length rules, what makes a "meaningful" progress message, how to phrase the first message of a turn, how to structure a final answer, when to use prose vs. bullets, when to add headings, how long each section can be. GPT-5.5 doesn't. A two-line description is enough: commentary updates are 1–2 sentences when something changes the user's understanding, and final answers are concise reports.

The main new constraint in the prompt is there to prevent over-reading and task expansion. GPT-5.5 will read every relevant guidance file and try to satisfy every rule it finds, especially at higher reasoning; GPT-5.3 Codex and GPT-5.4 had to be reminded to follow guidance files at all. The new prompt therefore adds an explicit constraint: treat guidance files and skills as constraints and shortcuts, not as invitations to expand the task, and apply the relevant parts.

Verification has also been trained into the model, and instructing it again in the system prompt makes it overdo verification. GPT-5.4 was told to verify before declaring done and to follow guidance-file checks. GPT-5.5 is told to scale verification to risk and blast radius: a typo needs none, a localized edit needs a focused check, and only shared or cross-module changes warrant the broad suite. Read-only and explanation tasks skip verification entirely.

We also added one interaction rule: if the user refines the task mid-turn, the newest message wins; a status request means update and continue; after compaction, resume from the summary instead of restarting.
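
Taken together, a hypothetical condensed excerpt showing the shape of these rules (the wording is illustrative, not the actual deep prompt):

    Progress updates are 1–2 sentences, sent only when something changes the user's understanding. Final answers are concise reports of what changed and why.
    Treat guidance files and skills as constraints and shortcuts, not as invitations to expand the task; apply the relevant parts.
    Scale verification to risk and blast radius: a typo needs none, a localized edit needs a focused check, and only shared or cross-module changes warrant the broad suite. Read-only and explanation tasks skip verification.
    If the user refines the task mid-turn, the newest message wins. A status request means update and continue. After compaction, resume from the summary instead of restarting.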

Where It's Weak

The main weakness is literalness.

GPT-5.5 does not reliably read between the lines. It will not infer the missing requirement or interpret an ambiguous instruction to "fix this". It may execute the written task well while missing the actual thing the user wanted.

You need to give the model something checkable.

It is also weaker on vague product judgment than on execution. If the hard part is deciding what should exist, spend more effort on the brief. Once the target is clear, GPT-5.5 is strong at carrying it out.

How We Use It

  • We give the agent tools and guidance to spin up the CLI in tmux, drive it with real keypresses, inspect the focus debug panel, and capture the pane. That lets Amp debug itself; you only need to describe the problem well enough:

    If I start the CLI with amp threads continue it opens the thread picker instantly. On esc, it doesn't auto focus on the prompt editor, which should be the element in the root scope with autofocus. It also doesn't handle ctrl-c anymore. I also can't hit ctrl-t. It could be that the thread picker is opened so early so that it is under the root scope and not the AppShellFocus.

  • Use it for bugs where repro matters. Tell it to prove the bug first, then patch:

    We have a flaky webhook bug in @apps/api/src/billing/webhooks.ts. Sometimes duplicate delivery events create two invoice records with the same provider event ID. I think the idempotency check happens too late, after the invoice write has already started. Please reproduce this with a focused test before changing the implementation. Then fix it without rewriting the billing flow. The existing successful webhook behavior should stay the same, and failed events should still be retryable.

  • Use GPT-5.5 for investigation, but don’t prompt it with “look into this.” Tell it what evidence would make the answer useful, and whether it should patch or stop.

    There is a startup latency regression in the desktop app after the recent workspace-indexing change. Please inspect the startup path and identify where the extra time is coming from. Use existing profiling or timing hooks if they exist. If you need temporary measurements, add them locally but do not leave noisy logging behind. I want a short diagnosis with evidence: where the time is spent, why the regression appeared, and the smallest fix you would make. If the fix is safe and localized, implement it and verify it. If the fix crosses subsystem boundaries or changes indexing semantics, stop after the diagnosis and recommended patch plan.

  • Ask for options, tradeoffs, and pseudo-code before implementation:

    Read @packages/protocol/docs/overview.md, then inspect the client handshake code in @packages/client and @services/worker-coordinator. Sometimes a worker dies and restarts, but the connected client does not notice. The client still thinks it is attached as controller+observer. The restarted worker has lost heap state, thinks there is no controller, and waits for a fresh handshake. The client does not know it has to redo the handshake, so both sides wait forever. Please analyze the failure mode and propose 2–3 fixes.

Other Notable Changes

  • In-memory prompt caching is not supported for GPT-5.5. It uses extended prompt caching; the default retention is 24h, and in_memory will error.
  • Extended prompt caching may store encrypted key/value tensors in GPU-local storage as application state, with a maximum retention period of 24 hours. OpenAI says the original prompt text is retained only in memory, while the KV tensors may be persisted locally.
  • There is a Zero Data Retention carveout: OpenAI reserves the right to make GPT-5.5, GPT-5.5 Pro, and future models ineligible for ZDR or Modified Abuse Monitoring for specific customers when reasonably necessary to investigate severe risk activity, with advance written notice to impacted customers.
  • GPT-5.5 supports a 1M token context window in the API; prompts above 272K input tokens are priced at 2x input and 1.5x output rates.
  • text.verbosity is worth experimenting with. Use it to control the shape of the final answer instead of stuffing length instructions into every prompt; see the sketch below.
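
A short sketch of the request-level knobs mentioned above, assuming the parameter names behave as described (prompt_cache_retention accepting 24h instead of in_memory, and text.verbosity shaping the final answer); treat the exact values and the model id as assumptions:

    import OpenAI from "openai";

    const client = new OpenAI();

    // Cast the request because prompt_cache_retention (and the exact "24h" /
    // in_memory values) may not match the published SDK typings; the values
    // here follow the behavior described above.
    const response = await client.responses.create({
      model: "gpt-5.5",
      prompt_cache_retention: "24h",   // in_memory errors for GPT-5.5
      text: { verbosity: "low" },      // short final answers without per-prompt length rules
      reasoning: { effort: "medium" },
      input: "Summarize what changed in this diff and why.",
    } as any);

    console.log(response.output_text);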