Selected day
What failed today, in one view.
Provider comparison
Reliability, speed, and factual stability for the selected day.
Empty output trend
Strict prompt failures over time
Last 30 days. Lower is better.
Metric: complex prompt failure rate. Lower is better.
Other signals
Instruction following
Format compliance
Factual stability
Risk feed
Worst scenarios
Top 5 scenarios by highest empty-output rate.
Metric URLs
Proof
Aggregated metrics only. Each day links back to a canonical run root hash.
FAQ
What is a “no-text response”?
A request that “succeeds” (often 200 OK) but returns no usable output text. Your app might treat it as success and then crash later.
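As a rough illustration, a check like the one below can catch this case before it reaches the rest of an app. It is a minimal sketch: the ModelResponse type and its status_code and text fields are hypothetical stand-ins for whatever your client library returns, not the shape of any particular provider's API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelResponse:
    """Hypothetical response shape; real client libraries differ."""
    status_code: int
    text: Optional[str]


def is_no_text_response(resp: ModelResponse) -> bool:
    """Return True when the call reported success (2xx) but carried no usable text."""
    succeeded = 200 <= resp.status_code < 300
    # None, the empty string, and whitespace-only output all count as "no text".
    has_text = bool(resp.text and resp.text.strip())
    return succeeded and not has_text
```

Checking at the call site means an empty reply can be retried or logged right away instead of surfacing later as a parsing error.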
What does “complex prompt failure rate” mean?
It’s how often models return no text on prompts with strict constraints (e.g., “exactly 3 sentences”).
Such constraints are common in production (JSON schemas, word limits, bullet counts).
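For intuition, the rate can be thought of as empty outputs divided by total strict-constraint prompts sent. The sketch below assumes that definition; the function name and the list-of-strings input are illustrative, and the dashboard's exact aggregation may differ.

```python
from typing import List, Optional


def no_text_rate(outputs: List[Optional[str]]) -> Optional[float]:
    """Fraction of outputs that are missing, empty, or whitespace-only.

    Returns None when there are no outputs, so an empty day is not
    reported as a perfect 0.0.
    """
    if not outputs:
        return None
    empty = sum(1 for out in outputs if not (out and out.strip()))
    return empty / len(outputs)
```

For example, if one of four strict prompts comes back empty, the rate is 0.25.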
What are A054–A057 codes?
Internal test IDs for specific constraint prompts. They’re stable identifiers so we can compare behavior over time.
What does “tamper‑evident” mean?
Each day’s metrics link back to a root hash of the underlying run. If someone edits history later, the recomputed hash won’t match the published one.
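To make the idea concrete, here is a minimal sketch of how a root hash can expose later edits. It assumes SHA-256 over canonical JSON records combined into a Merkle root, with the last node duplicated on odd levels (one common convention). The hashing scheme and record layout actually used for these runs are not stated here, so every name and field below is hypothetical.

```python
import hashlib
import json
from typing import Dict, List


def leaf_hash(record: Dict) -> bytes:
    """Hash one record using a canonical JSON encoding (sorted keys, no spaces)."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).digest()


def merkle_root(records: List[Dict]) -> str:
    """Combine per-record hashes pairwise until a single root remains."""
    if not records:
        return hashlib.sha256(b"").hexdigest()
    level = [leaf_hash(r) for r in records]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # assumption: duplicate the last node on odd levels
        level = [
            hashlib.sha256(level[i] + level[i + 1]).digest()
            for i in range(0, len(level), 2)
        ]
    return level[0].hex()


def matches_published_root(records: List[Dict], published_root: str) -> bool:
    """Recompute the root from the raw records and compare with the published value."""
    return merkle_root(records) == published_root
```

Because every record feeds into the root, changing any single result after publication produces a different root, and the comparison fails.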
Example (selected day)
OpenAI complex prompt no-text rate: — on —.