The 200-Token Trap: Why Your AI Agents Pay for "Thinking" But Return Nothing
OpenAI recommends 25,000 tokens for reasoning models. Most production code uses 200. Here's the 125× gap that's breaking agents and draining budgets—with cryptographic proof.
TL;DR
- OpenAI recommends 25k tokens for reasoning models. Most legacy code uses 200.
- The 125× gap causes 74% empty responses—but you still pay for input + reasoning.
- Fix: Use 4,096+ tokens for GPT-5/o-series. Better: automatic detection with runtime safeguards.
status: "incomplete" → output_text: "" → http_status: 200
You pay for input + reasoning tokens.
You receive: nothing.
Your code thinks it worked.
🔍 We Thought It Was a Bug
We built an LLM observatory to track model behavior changes over time. It publishes aggregated daily metrics and a canonical run root hash for verification.
1. Open https://www.onceonly.tech/blackbox/
2. Pick a date and open the linked daily JSON.
3. Compare the JSON's run_root with the value shown under "Verification".
If the run root matches, you’re looking at the same daily snapshot the dashboard is built from.
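The comparison step above can be scripted. A minimal sketch of just the check (the run_root field name comes from the steps above; fetching the daily JSON over HTTP is left out):

```python
def verify_snapshot(daily_json: dict, dashboard_root: str) -> bool:
    """Return True when the daily JSON's run_root matches the value
    shown under "Verification" on the dashboard."""
    # `run_root` is the field name from the verification steps above;
    # adjust if the published schema differs.
    return daily_json.get("run_root") == dashboard_root
```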
Starting January 14, 2026, we noticed something bizarre: GPT-5-mini began returning empty responses on 74% of requests. Same prompts that worked yesterday. Status code: 200 OK. Output: blank.
For 30 days, we thought it was a bug on OpenAI's side.
Empty Response Rate by Provider (30-day average)
Same prompts. Same parameters. Wildly different results.
📝 The Prompts That Failed
These weren't complex requests. Simple instruction-following tasks:
| Prompt | Claude Sonnet | Gemini Flash | GPT-5-mini |
|---|---|---|---|
| "Explain what teamwork means in exactly 50 words." | 0% empty | 0% empty | 84% empty |
| "Describe what happens when it rains. Use only verbs." | 0% empty | 0% empty | 100% empty |
💸 The Hidden Cost: Paying for Nothing
Here's what makes this especially painful. When GPT-5-mini hits the token limit before producing output, you still get charged. From OpenAI's own documentation:
"If the generated tokens reach... the max_output_tokens value you've set, you'll receive a response with a status of incomplete... This might occur before any visible output tokens are produced, meaning you could incur costs for input and reasoning tokens without receiving a visible output."
Translation: You pay for the model to "think," but you get nothing back.
After checking the OpenAI API responses, we found the pattern:
```json
{
  "status": "incomplete",
  "output_text": "",
  "incomplete_details": {
    "reason": "max_output_tokens"
  },
  "usage": {
    "input_tokens": 15,
    "output_tokens": 0,            // ← when the limit hit before any output
    "output_tokens_details": {     // ← structure varies by when the limit hit
      "reasoning_tokens": 0
    }
  }
}
```
Note: In our observations, when the model hits max_output_tokens during reasoning, both reasoning_tokens and visible output can be 0. The key point: you're charged for input_tokens regardless.
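To make the billing concrete, here is a small sketch that prices a request from its usage object. The per-million-token prices are placeholders, not OpenAI's actual rates; the assumption that output_tokens already includes reasoning tokens follows the usage structure shown above:

```python
def billed_cost(usage: dict, input_price: float, output_price: float) -> float:
    """Rough cost of one request from its usage object, with prices in
    dollars per million tokens (placeholder prices, not real rates).
    output_tokens is assumed to include reasoning tokens, so a response
    with empty output_text but nonzero reasoning still bills output."""
    return (usage["input_tokens"] * input_price
            + usage["output_tokens"] * output_price) / 1_000_000
```

With the usage object above (15 input tokens, 0 output tokens), the entire charge comes from input tokens even though nothing came back.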
The model hit the token limit before producing any visible output. But we set max_output_tokens=200. How could a 50-word response exceed 200 tokens?
Because reasoning tokens count toward max_output_tokens.
⚙️ How Reasoning Tokens Work
Reasoning models (o1, o3, o4-mini, GPT-5 series) "think before they answer." They generate internal reasoning steps that aren't shown to you but still consume tokens from your budget.
Token Budget Breakdown
When the model spends 180 tokens "thinking," it has only 20 tokens left for the actual response. For complex prompts, that's not enough. The model returns empty.
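The budget arithmetic above can be written down directly; a tiny sketch:

```python
def visible_budget(max_output_tokens: int, reasoning_tokens: int) -> int:
    """Tokens left for the visible answer after internal reasoning,
    since reasoning tokens count toward max_output_tokens."""
    return max(0, max_output_tokens - reasoning_tokens)

# visible_budget(200, 180) -> 20 tokens left for the actual answer
# visible_budget(200, 200) -> 0, i.e. an empty response
```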
📖 Then We Found The Documentation
After checking the API responses, we dug into OpenAI's documentation. Buried in their reasoning guide (not the API reference, not the error messages), we found this:
"OpenAI recommends reserving at least 25,000 tokens for reasoning and outputs when you start experimenting with these models."
This is framed as a "buffer recommendation" for experimentation, not a hard requirement. But our production data tells a different story:
- 200 tokens: 74% failures
- 4,096 tokens: 0% failures
The "experiment buffer" isn't optional for production. It's the difference between working and broken.
The 125× Gap
This explains everything. Most production code still uses token limits from the GPT-3 era (200-500 tokens). Reasoning models need 20-100× more headroom. The gap is massive, and nobody updated their configs.
🏗️ Why Claude & Gemini Don't Fail the Same Way
In our 30-day test with identical parameters (max_tokens=200 or equivalent), we observed dramatically different behavior:
| Provider | Observed Behavior (200 token limit) | Result |
|---|---|---|
| Claude Sonnet | No empty responses observed | ✓ 0% failures |
| Gemini Flash | No empty responses observed | ✓ 0% failures |
| GPT-5-mini | 74% empty responses (status: incomplete) | ✗ 74% failures |
We don't have visibility into Claude or Gemini's internal architecture, so we can't definitively explain why they handle low token limits differently. But the practical difference is undeniable: under identical test conditions, only GPT-5-mini exhibited widespread failures.
What we do know about GPT-5-mini: its reasoning token budget is exposed to developers via the usage object and counts against max_output_tokens. This creates fragility that competitors don't exhibit in our testing.
🕵️ Why Nobody Knew
If this is documented, why did we (and likely you) miss it? Because it's hidden in plain sight.
The 25,000 token recommendation is:
- ✗ Not in the API reference
- ✗ Not in error messages
- ✗ Not shown in the Playground
- ✗ Not mentioned in migration guides
- ✓ Only in the reasoning guide (buried in prose)
Even worse, the failure mode gives you no indication of the root cause:
| What You See | What It Means |
|---|---|
| status: "incomplete" | Model didn't finish (but why?) |
| incomplete_details.reason: "max_output_tokens" | Ran out of tokens (200 should be enough for 50 words, right?) |
| output_text: "" | You get nothing (but you're still charged) |
| No exception raised | Your code continues as if it worked |
And the kicker: Claude and Gemini work fine with 200 tokens. Developers assume "if it worked before on other models, it should work now." Wrong assumption. Different architecture.
🔧 The Fix
Once we understood the problem, the solution was straightforward:
```python
def get_max_output_tokens(model_name, baseline=200):
    """
    Reasoning models need 20x higher token budgets.
    OpenAI recommends 25k. We found 4k works for most tasks.
    """
    if "gpt-5" in model_name or "o1" in model_name or "o3" in model_name:
        return max(baseline, 4096)  # ← 20× increase
    return baseline
```
Result:
- Empty responses: 74% → 0%
- All prompts now work reliably
- Token cost increased ~3× per request
Better to spend $3 and get output than spend $1 and get nothing.
💡 The Economic Trade-off
Before fix (200 tokens):
- 74% of requests fail
- You're charged for all failed requests
- Users see errors, agents break
- You retry → waste more money → fail again
After fix (4096 tokens):
- 0% failures
- Cost per successful request: 3× higher
- But 100% success rate vs 26% before
- Net result: spend less, get more
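The "spend less, get more" claim follows from simple expected-cost arithmetic. A sketch using the success rates reported above with illustrative relative per-request costs:

```python
def cost_per_success(cost_per_request: float, success_rate: float) -> float:
    """Effective cost of one *successful* response when failed
    requests are billed but return nothing."""
    if success_rate <= 0:
        raise ValueError("no successful requests")
    return cost_per_request / success_rate

# 200-token config:  relative cost 1.0/request, 26% success -> ~3.85 per success
# 4096-token config: relative cost 3.0/request, 100% success -> 3.0 per success
```

Even at 3× the per-request cost, the higher limit is cheaper per successful response, before counting the retries that failures trigger.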
⚠️ Why This Breaks Production Agents
The failure mode is silent and expensive:
- No error raised — your code sees status: 200
- Empty output — parsers fail, agents break
- You still get charged — input + reasoning tokens billed
- Retries make it worse — each retry burns more tokens, returns empty again
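One way to break that chain is to fail loudly at the response boundary. A minimal sketch, treating the response as a dict shaped like the JSON shown earlier (a real SDK response object would use attribute access instead):

```python
class ModelIncompleteError(RuntimeError):
    """Raised when the model stopped before producing visible output."""

def check_response(response: dict) -> str:
    """Turn the silent failure into a loud one: refuse to hand an
    empty output downstream, so retry logic can make an informed choice."""
    if response.get("status") != "completed":
        raise ModelIncompleteError(str(response.get("incomplete_details")))
    if not response.get("output_text"):
        raise ModelIncompleteError("completed but output_text is empty")
    return response["output_text"]
```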
Production Scenario
Your AI agent needs to extract structured data from a document. You call GPT-5-mini with max_output_tokens=500. The model spends the entire budget on reasoning and returns empty; your retry logic fires twice more with the same result.
Cost: 3× the tokens. Result: Agent fails, user sees error.
🚨 What OpenAI Should Fix
This is a UX failure, not a technical bug. Here's what would help:
1. Return a proper error on insufficient tokens — current: status: "incomplete", no exception, code continues. Better: a clear error plus guidance ("increase max_output_tokens to 4096+ for reasoning models").
2. Set sane defaults — if the user doesn't specify max_output_tokens, auto-set it to 4096+ for reasoning models.
3. Show a warning in the Playground — when a user selects o1/o3/GPT-5 with a low token limit.
4. Document prominently — move the 25k recommendation into the API reference, not just the reasoning guide.
⚡ What to Do Today
Don't wait for this to break production. Three immediate steps:
Action Items
```python
# 1. Size token limits by model family
def get_max_output_tokens(model: str) -> int:
    if model in ["gpt-5", "gpt-5-mini", "o1", "o3", "o4-mini"]:
        return 4096  # Minimum for reasoning models
    return 1024  # Legacy models

# 2. Check response status before parsing
if response.status != "completed":
    raise ModelIncompleteError(
        f"Model: {response.incomplete_details}"
    )

# 3. Track empty responses as a metric
if len(response.output_text) == 0:
    metrics.increment("llm.empty_response",
                      tags={"model": model})
    # Alert if the rate exceeds 5%
```
These three changes prevent the "stupid tax" and catch failures before they reach users.
📊 The Data
All of this data is publicly available with cryptographic proof of integrity. We can't fake it, and neither can anyone else.
Key Statistics (30 days)
💡 Lessons Learned
Three takeaways for production AI systems:
1. Read the docs (even the buried parts) — the 25k recommendation is in the reasoning guide, not the API reference. Easy to miss.
2. Don't assume models are equivalent — Claude/Gemini working doesn't mean GPT-5 will. Architecture matters.
3. Monitor empty response rates — we only caught this because we log everything. Without observability, this would silently drain budgets for months.
🛡️ How to Prevent This
Set appropriate token limits for reasoning models
OpenAI recommends 25,000 tokens. We found 4,096 works for most tasks.
```python
max_output_tokens = 4096  # For GPT-5/o-series
```
Check response.status before parsing
Don't trust http_status: 200. Check the actual response status:
```python
if response.status == "incomplete":
    raise InsufficientTokensError(response.incomplete_details)
```
Monitor empty response rates in production
Track len(response.output_text) == 0 as a production reliability metric. Alert when it spikes, and investigate your model config and retry loops.
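A minimal in-process version of that metric (a sketch; in production you would feed this into your real metrics and alerting pipeline):

```python
from collections import deque

class EmptyRateMonitor:
    """Rolling empty-response rate over the last `window` requests."""

    def __init__(self, window: int = 100, threshold: float = 0.05):
        self.samples = deque(maxlen=window)
        self.threshold = threshold

    def record(self, output_text: str) -> None:
        self.samples.append(len(output_text) == 0)

    def rate(self) -> float:
        return sum(self.samples) / len(self.samples) if self.samples else 0.0

    def should_alert(self) -> bool:
        return self.rate() > self.threshold
```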
Use fallback models for critical paths
If GPT-5-mini returns empty, automatically retry with Claude or GPT-4.1 (non-reasoning model).
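A sketch of that routing, with the two model calls abstracted as callables returning response dicts (the names and dict shape here are hypothetical, mirroring the JSON shown earlier):

```python
def call_with_fallback(prompt: str, primary, fallback) -> dict:
    """Call the primary (reasoning) model; if it comes back incomplete
    or empty, retry once on the fallback model instead of looping."""
    response = primary(prompt)
    if response.get("status") == "completed" and response.get("output_text"):
        return response
    return fallback(prompt)
```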
🎯 The Takeaway
GPT-5-mini's reasoning token architecture creates a configuration trap that 99% of production code falls into. It's not a bug—it's documented. But the documentation is scattered, the error messages are misleading, and the failure mode is silent.
Three Critical Lessons
1. Legacy code is a liability — GPT-3 era defaults (200-500 tokens) don't work with reasoning models. The 125× gap between "what worked" and "what's recommended" breaks silently.
2. Models aren't equivalent — Claude and Gemini work fine with 200 tokens because their reasoning is internal. GPT-5's exposed reasoning tokens create fragility. Don't assume portability.
3. You're paying for nothing — when max_output_tokens is too low, you're charged for input + reasoning but get zero output. This "stupid tax" is easily preventable with proper config—or automatic protection.
The fix is simple: use 4,096+ tokens for reasoning models. But the lesson is bigger: production AI systems need observability and automatic safeguards, not assumptions.
How OnceOnly helps (what it actually does)
OnceOnly doesn’t change your model token limits. It helps you keep production agents safe when failures trigger retries:
dedupe repeated actions (check-lock), enforce spend limits, and audit what happened.
Execution safety for production agents: idempotency, budgets, policies, audit.