At a glance
Provider comparison
Model reliability trend (last 30 days)
Metric: complex prompt failure rate. Lower is better.
Other trends (last 30 days)
instruction following (effective)
format compliance
factual stability
External model risk feed (selected day)
These signals are designed for automation: they penalize empty outputs and quantify drift across key constraints.
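A minimal sketch of what "designed for automation" can mean here, with hypothetical field and function names (the actual scoring scheme is not specified on this page): empty outputs take a full penalty, and drift is the change in mean score between two observation windows.

```python
# Sketch of automation-friendly scoring (field names are assumptions).
# An empty output scores 0 regardless of other checks; drift is the
# delta in mean score between a previous and a current window.

def score_response(text: str, constraint_ok: bool) -> float:
    """Penalize empty outputs hard; otherwise score constraint compliance."""
    if not text or not text.strip():
        return 0.0  # "no-text response": full penalty
    return 1.0 if constraint_ok else 0.5

def drift(prev_scores: list[float], curr_scores: list[float]) -> float:
    """Positive drift = degradation relative to the previous window."""
    mean = lambda xs: sum(xs) / len(xs) if xs else 0.0
    return mean(prev_scores) - mean(curr_scores)

print(score_response("", True))       # 0.0 (empty output is penalized)
print(drift([1.0, 1.0], [1.0, 0.5]))  # 0.25 (quality dropped this window)
```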
Deep dive: which scenarios go empty?
Sorted by worst empty rate across providers (top 5).
Available metric URLs
Verification
This page shows aggregated metrics only (no raw text). Each day’s data has a canonical run root hash.
FAQ
What is a “no-text response”?
A request that “succeeds” (often 200 OK) but returns no usable output text. Your app might treat it as success and then crash later.
What does “complex prompt failure rate” mean?
It’s how often models return no text on prompts with strict constraints (e.g., “exactly 3 sentences”).
Such constraints are common in production (JSON schemas, word limits, bullet counts).
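The rate itself is simple to compute. A sketch, assuming a hypothetical per-run record format with a `strict_constraints` flag and a `text` field:

```python
# Complex prompt failure rate (sketch): the share of strict-constraint
# prompts that came back with no usable text. Field names are assumptions.

def complex_prompt_failure_rate(results: list[dict]) -> float:
    strict = [r for r in results if r.get("strict_constraints")]
    if not strict:
        return 0.0
    empty = sum(1 for r in strict if not (r.get("text") or "").strip())
    return empty / len(strict)

runs = [
    {"strict_constraints": True,  "text": "One. Two. Three."},
    {"strict_constraints": True,  "text": ""},   # no-text failure
    {"strict_constraints": False, "text": ""},   # unconstrained: not counted
]
print(complex_prompt_failure_rate(runs))  # 0.5
```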
What are A054–A057 codes?
Internal test IDs for specific constraint prompts. They’re stable identifiers so we can compare behavior over time.
What does “tamper‑evident” mean?
Each day’s metrics link back to a hash root of the underlying run. If someone edits history later, the proof won’t match.
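One common way to build such a root (a sketch only; the page does not specify the actual scheme) is a Merkle-style tree: hash each record, then fold pairs of hashes until one root remains. Changing any record changes the root, so an edited history no longer matches the published proof.

```python
# Merkle-style run root (assumption: the real scheme may differ).
# Any edit to any record produces a different root hash.
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def run_root(records: list[bytes]) -> str:
    level = [h(r) for r in records] or [h(b"")]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # duplicate last node on odd levels
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0].hex()

root = run_root([b"metric:0.12", b"metric:0.30"])
print(run_root([b"metric:0.12", b"metric:0.30"]) == root)  # True: reproducible
print(run_root([b"metric:0.99", b"metric:0.30"]) == root)  # False: tamper detected
```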
Example (selected day)
OpenAI complex prompt no-text rate: — on —.