The 200-Token Trap: Why Your AI Agents Pay for "Thinking" But Return Nothing
OpenAI recommends 25,000 tokens for reasoning models. Most production code uses 200. Here's the 125× gap that's breaking agents and draining budgets—with cryptographic proof.
TL;DR
- OpenAI recommends 25k tokens for reasoning models. Most legacy code uses 200.
- The 125× gap causes 74% empty responses—but you still pay for input + reasoning.
- Fix: Use 4,096+ tokens for GPT-5/o-series. Better: automatic detection with runtime safeguards.
status: "incomplete" → output_text: "" → http_status: 200
You pay for input + reasoning tokens.
You receive: nothing.
Your code thinks it worked.
🔍 We Thought It Was a Bug
We built an LLM observatory to track model behavior changes over time. It publishes aggregated daily metrics and a canonical run root hash for verification.
1. Open https://www.onceonly.tech/blackbox/
2. Pick a date and open the linked daily JSON.
3. Compare the JSON's run_root with the value shown under "Verification".
If the run root matches, you’re looking at the same daily snapshot the dashboard is built from.
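The comparison step above can be scripted. A minimal sketch of just the check (the run_root field name comes from the steps above; fetching the daily JSON over HTTP is left out):

```python
def verify_snapshot(daily_json: dict, dashboard_root: str) -> bool:
    """Return True when the daily JSON's run_root matches the value
    shown under "Verification" on the dashboard."""
    # `run_root` is the field name from the verification steps above;
    # adjust if the published schema differs.
    return daily_json.get("run_root") == dashboard_root
```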
Starting January 14, 2026, we noticed something bizarre: GPT-5-mini began returning empty responses on 74% of requests. Same prompts that worked yesterday. Status code: 200 OK. Output: blank.
For 30 days, we thought it was a bug on OpenAI's side.
Empty Response Rate by Provider (30-day average)
Same prompts. Same parameters. Wildly different results.
📝 The Prompts That Failed
These weren't complex requests. Simple instruction-following tasks:
| Prompt | Claude Sonnet | Gemini Flash | GPT-5-mini |
|---|---|---|---|
| "Explain what teamwork means in exactly 50 words." | 0% empty | 0% empty | 84% empty |
| "Describe what happens when it rains. Use only verbs." | 0% empty | 0% empty | 100% empty |
💸 The Hidden Cost: Paying for Nothing
Here's what makes this especially painful. When GPT-5-mini hits the token limit before producing output, you still get charged. From OpenAI's own documentation:
"If the generated tokens reach... the max_output_tokens value you've set, you'll receive a response with a status of incomplete... This might occur before any visible output tokens are produced, meaning you could incur costs for input and reasoning tokens without receiving a visible output."
Translation: You pay for the model to "think," but you get nothing back.
After checking the OpenAI API responses, we found the pattern:
```json
{
  "status": "incomplete",
  "output_text": "",
  "incomplete_details": {
    "reason": "max_output_tokens"
  },
  "usage": {
    "input_tokens": 15,
    "output_tokens": 0,            // ← when the limit hit before any output
    "output_tokens_details": {     // ← structure varies by when the limit hit
      "reasoning_tokens": 0
    }
  }
}
```
Note: In our observations, when the model hits max_output_tokens during reasoning, both reasoning_tokens and visible output can be 0. The key point: you're charged for input_tokens regardless.
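To make the billing concrete, here is a small sketch that prices a request from its usage object. The per-million-token prices are placeholders, not OpenAI's actual rates; the assumption that output_tokens already includes reasoning tokens follows the usage structure shown above:

```python
def billed_cost(usage: dict, input_price: float, output_price: float) -> float:
    """Rough cost of one request from its usage object, with prices in
    dollars per million tokens (placeholder prices, not real rates).
    output_tokens is assumed to include reasoning tokens, so a response
    with empty output_text but nonzero reasoning still bills output."""
    return (usage["input_tokens"] * input_price
            + usage["output_tokens"] * output_price) / 1_000_000
```

With the usage object above (15 input tokens, 0 output tokens), the entire charge comes from input tokens even though nothing came back.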
The model hit the token limit before producing any visible output. But we set max_output_tokens=200. How could a 50-word response exceed 200 tokens?
Because reasoning tokens count toward max_output_tokens.
⚙️ How Reasoning Tokens Work
Reasoning models (o1, o3, o4-mini, GPT-5 series) "think before they answer." They generate internal reasoning steps that aren't shown to you but still consume tokens from your budget.
Token Budget Breakdown
When the model spends 180 tokens "thinking," it has only 20 tokens left for the actual response. For complex prompts, that's not enough. The model returns empty.
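The budget arithmetic above can be written down directly; a tiny sketch:

```python
def visible_budget(max_output_tokens: int, reasoning_tokens: int) -> int:
    """Tokens left for the visible answer after internal reasoning,
    since reasoning tokens count toward max_output_tokens."""
    return max(0, max_output_tokens - reasoning_tokens)

# visible_budget(200, 180) -> 20 tokens left for the actual answer
# visible_budget(200, 200) -> 0, i.e. an empty response
```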
📖 Then We Found The Documentation
After checking the API responses, we dug into OpenAI's documentation. Buried in their reasoning guide (not the API reference, not the error messages), we found this:
"OpenAI recommends reserving at least 25,000 tokens for reasoning and outputs when you start experimenting with these models."
This is framed as a "buffer recommendation" for experimentation, not a hard requirement. But our production data tells a different story:
- 200 tokens: 74% failures
- 4,096 tokens: 0% failures
The "experiment buffer" isn't optional for production. It's the difference between working and broken.
The 125× Gap
This explains everything. Most production code still uses token limits from the GPT-3 era (200-500 tokens). Reasoning models need 20-100× more headroom. The gap is massive, and nobody updated their configs.
🏗️ Why Claude & Gemini Don't Fail the Same Way
In our 30-day test with identical parameters (max_tokens=200 or equivalent), we observed dramatically different behavior:
| Provider | Observed Behavior (200 token limit) | Result |
|---|---|---|
| Claude Sonnet | No empty responses observed | ✓ 0% failures |
| Gemini Flash | No empty responses observed | ✓ 0% failures |
| GPT-5-mini | 74% empty responses (status: incomplete) | ✗ 74% failures |
We don't have visibility into Claude or Gemini's internal architecture, so we can't definitively explain why they handle low token limits differently. But the practical difference is undeniable: under identical test conditions, only GPT-5-mini exhibited widespread failures.
What we do know about GPT-5-mini: its reasoning token budget is exposed to developers via the usage object and counts against max_output_tokens. This creates fragility that competitors don't exhibit in our testing.
🕵️ Why Nobody Knew
If this is documented, why did we (and likely you) miss it? Because it's hidden in plain sight.
The 25,000 token recommendation is:
- ✗ Not in the API reference
- ✗ Not in error messages
- ✗ Not shown in the Playground
- ✗ Not mentioned in migration guides
- ✓ Only in the reasoning guide (buried in prose)
Even worse, the failure mode gives you no indication of the root cause:
| What You See | What It Means |
|---|---|
| status: "incomplete" | Model didn't finish (but why?) |
| incomplete_details.reason: "max_output_tokens" | Ran out of tokens (200 should be enough for 50 words, right?) |
| output_text: "" | You get nothing (but you're still charged) |
| No exception raised | Your code continues as if it worked |
And the kicker: Claude and Gemini work fine with 200 tokens. Developers assume "if it worked before on other models, it should work now." Wrong assumption. Different architecture.
🔧 The Fix
Once we understood the problem, the solution was straightforward:
```python
def get_max_output_tokens(model_name, baseline=200):
    """
    Reasoning models need 20x higher token budgets.
    OpenAI recommends 25k. We found 4k works for most tasks.
    """
    if "gpt-5" in model_name or "o1" in model_name or "o3" in model_name:
        return max(baseline, 4096)  # ← 20× increase
    return baseline
```
Result:
- Empty responses: 74% → 0%
- All prompts now work reliably
- Token cost increased ~3× per request
Better to spend $3 and get output than spend $1 and get nothing.
💡 The Economic Trade-off
Before fix (200 tokens):
- 74% of requests fail
- You're charged for all failed requests
- Users see errors, agents break
- You retry → waste more money → fail again
After fix (4096 tokens):
- 0% failures
- Cost per successful request: 3× higher
- But 100% success rate vs 26% before
- Net result: spend less, get more
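The "spend less, get more" claim follows from simple expected-cost arithmetic. A sketch using the success rates reported above with illustrative relative per-request costs:

```python
def cost_per_success(cost_per_request: float, success_rate: float) -> float:
    """Effective cost of one *successful* response when failed
    requests are billed but return nothing."""
    if success_rate <= 0:
        raise ValueError("no successful requests")
    return cost_per_request / success_rate

# 200-token config:  relative cost 1.0/request, 26% success -> ~3.85 per success
# 4096-token config: relative cost 3.0/request, 100% success -> 3.0 per success
```

Even at 3× the per-request cost, the higher limit is cheaper per successful response, before counting the retries that failures trigger.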
⚠️ Why This Breaks Production Agents
The failure mode is silent and expensive:
- No error raised — your code sees status: 200
- Empty output — parsers fail, agents break
- You still get charged — input + reasoning tokens billed
- Retries make it worse — each retry burns more tokens, returns empty again
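One way to break that chain is to fail loudly at the response boundary. A minimal sketch, treating the response as a dict shaped like the JSON shown earlier (a real SDK response object would use attribute access instead):

```python
class ModelIncompleteError(RuntimeError):
    """Raised when the model stopped before producing visible output."""

def check_response(response: dict) -> str:
    """Turn the silent failure into a loud one: refuse to hand an
    empty output downstream, so retry logic can make an informed choice."""
    if response.get("status") != "completed":
        raise ModelIncompleteError(str(response.get("incomplete_details")))
    if not response.get("output_text"):
        raise ModelIncompleteError("completed but output_text is empty")
    return response["output_text"]
```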
Production Scenario
Your AI agent needs to extract structured data from a document. You call GPT-5-mini with max_output_tokens=500. The model spends the entire budget on reasoning and returns empty; your retry logic fires twice more with the same result.
Cost: 3× the tokens. Result: Agent fails, user sees error.
🚨 What OpenAI Should Fix
This is a UX failure, not a technical bug. Here's what would help:
1. Return a proper error on insufficient tokens — current: status: "incomplete", no exception, code continues. Better: a clear error plus guidance ("increase max_output_tokens to 4096+ for reasoning models").
2. Set sane defaults — if the user doesn't specify max_output_tokens, auto-set it to 4096+ for reasoning models.
3. Show a warning in the Playground — when a user selects o1/o3/GPT-5 with a low token limit.
4. Document prominently — move the 25k recommendation into the API reference, not just the reasoning guide.
⚡ What to Do Today
Don't wait for this to break production. Three immediate steps:
Action Items
```python
# 1. Size token limits by model family
def get_max_output_tokens(model: str) -> int:
    if model in ["gpt-5", "gpt-5-mini", "o1", "o3", "o4-mini"]:
        return 4096  # Minimum for reasoning models
    return 1024  # Legacy models

# 2. Check response status before parsing
if response.status != "completed":
    raise ModelIncompleteError(
        f"Model: {response.incomplete_details}"
    )

# 3. Track empty responses as a metric
if len(response.output_text) == 0:
    metrics.increment("llm.empty_response",
                      tags={"model": model})
    # Alert if the rate exceeds 5%
```
These three changes prevent the "stupid tax" and catch failures before they reach users.
📊 The Data
All of this data is publicly available with cryptographic proof of integrity. We can't fake it, and neither can anyone else.
Key Statistics (30 days)
💡 Lessons Learned
Three takeaways for production AI systems:
1. Read the docs (even the buried parts) — the 25k recommendation is in the reasoning guide, not the API reference. Easy to miss.
2. Don't assume models are equivalent — Claude/Gemini working doesn't mean GPT-5 will. Architecture matters.
3. Monitor empty response rates — we only caught this because we log everything. Without observability, this would silently drain budgets for months.
🛡️ How to Prevent This
Set appropriate token limits for reasoning models
OpenAI recommends 25,000 tokens. We found 4,096 works for most tasks.
```python
max_output_tokens = 4096  # For GPT-5/o-series
```
Check response.status before parsing
Don't trust http_status: 200. Check the actual response status:
```python
if response.status == "incomplete":
    raise InsufficientTokensError(response.incomplete_details)
```
Monitor empty response rates in production
Track len(response.output_text) == 0 as a production reliability metric. Alert when it spikes, and investigate your model config and retry loops.
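A minimal in-process version of that metric (a sketch; in production you would feed this into your real metrics and alerting pipeline):

```python
from collections import deque

class EmptyRateMonitor:
    """Rolling empty-response rate over the last `window` requests."""

    def __init__(self, window: int = 100, threshold: float = 0.05):
        self.samples = deque(maxlen=window)
        self.threshold = threshold

    def record(self, output_text: str) -> None:
        self.samples.append(len(output_text) == 0)

    def rate(self) -> float:
        return sum(self.samples) / len(self.samples) if self.samples else 0.0

    def should_alert(self) -> bool:
        return self.rate() > self.threshold
```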
Use fallback models for critical paths
If GPT-5-mini returns empty, automatically retry with Claude or GPT-4.1 (non-reasoning model).
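A sketch of that routing, with the two model calls abstracted as callables returning response dicts (the names and dict shape here are hypothetical, mirroring the JSON shown earlier):

```python
def call_with_fallback(prompt: str, primary, fallback) -> dict:
    """Call the primary (reasoning) model; if it comes back incomplete
    or empty, retry once on the fallback model instead of looping."""
    response = primary(prompt)
    if response.get("status") == "completed" and response.get("output_text"):
        return response
    return fallback(prompt)
```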
🎯 The Takeaway
GPT-5-mini's reasoning token architecture creates a configuration trap that 99% of production code falls into. It's not a bug—it's documented. But the documentation is scattered, the error messages are misleading, and the failure mode is silent.
Three Critical Lessons
1. Legacy code is a liability — GPT-3 era defaults (200-500 tokens) don't work with reasoning models. The 125× gap between "what worked" and "what's recommended" breaks silently.
2. Models aren't equivalent — Claude and Gemini work fine with 200 tokens because their reasoning is internal. GPT-5's exposed reasoning tokens create fragility. Don't assume portability.
3. You're paying for nothing — when max_output_tokens is too low, you're charged for input + reasoning but get zero output. This "stupid tax" is easily preventable with proper config—or automatic protection.
The fix is simple: use 4,096+ tokens for reasoning models. But the lesson is bigger: production AI systems need observability and automatic safeguards, not assumptions.
How OnceOnly helps (what it actually does)
OnceOnly doesn’t change your model token limits. It helps you keep production agents safe when failures trigger retries:
dedupe repeated actions (check-lock), enforce spend limits, and audit what happened.
Execution safety for production agents: idempotency, budgets, policies, audit.