OnceOnly
Blackbox Observatory

Daily LLM Stability Observatory

Models drift. Apps break silently. This page tracks no-text responses, constraint adherence, and factual stability across OpenAI, Claude, and Gemini.

Date (UTC):

At a glance

Provider comparison

Model reliability trend (last 30 days)

Metric: complex prompt failure rate. Lower is better.

Other trends (last 30 days)

instruction following (effective)
format compliance
factual stability

External model risk feed (selected day)

These signals are designed for automation: they penalize empty outputs and quantify drift across key constraints.
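As a minimal sketch of how such an automation-friendly signal can be computed (an illustration of the idea, not the observatory's actual pipeline): an output counts as a failure when it carries no usable text, regardless of the HTTP status the API returned.

```python
from typing import List, Optional

def empty_output_rate(outputs: List[Optional[str]]) -> float:
    """Fraction of responses with no usable text.

    An output counts as empty when it is None or whitespace-only,
    even if the underlying API call itself returned 200 OK.
    """
    if not outputs:
        return 0.0
    empty = sum(1 for o in outputs if o is None or not o.strip())
    return empty / len(outputs)

# Hypothetical day's worth of responses for one scenario:
day = ["Fine answer.", "", None, "Another answer.", "   "]
rate = empty_output_rate(day)  # 3 of 5 are empty -> 0.6
```

Because the rate only goes down when text is actually present, a provider cannot look "healthy" by returning fast but empty successes.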

Deep dive: which scenarios go empty?

Sorted by worst empty rate across providers (top 5).

Available metric URLs

Verification

This page shows aggregated metrics only (no raw text). Each day’s data has a canonical run root hash.

FAQ

What is a “no-text response”?
A request that “succeeds” (often 200 OK) but returns no usable output text. Your app might treat it as success and then crash later.
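A defensive check for this case might look as follows; the field layout (`choices` → `message` → `content`) mirrors a common chat-completion response shape and is an assumption here, so adapt it to your provider's SDK.

```python
def has_usable_text(response: dict) -> bool:
    """True if the response carries any non-whitespace output text.

    The nested field names below are assumed, modeled on a typical
    chat-completion payload; a 200 OK with empty content still
    returns False.
    """
    for choice in response.get("choices", []):
        content = (choice.get("message") or {}).get("content")
        if content and content.strip():
            return True
    return False

ok = {"choices": [{"message": {"content": "Hello"}}]}
silent = {"choices": [{"message": {"content": ""}}]}  # 200 OK, no text
```

Treating `silent` as a failure at the call site, rather than passing it downstream, is what prevents the "crash later" failure mode.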
What does “complex prompt failure rate” mean?
It’s how often models return no text on prompts with strict constraints (e.g., “exactly 3 sentences”). These are common in production (JSON schemas, word limits, bullet counts).
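Constraints like "exactly 3 sentences" can also be checked mechanically. A naive sketch (the splitter below is an assumption for illustration; production checkers must handle abbreviations, decimals, and so on):

```python
import re

def meets_sentence_count(text: str, n: int) -> bool:
    """Check an 'exactly N sentences' constraint.

    Naive rule: a sentence ends with '.', '!' or '?'. Good enough
    to show the idea, too crude for edge cases like 'e.g.' or '3.5'.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return len(sentences) == n

meets_sentence_count("One. Two. Three.", 3)  # True
meets_sentence_count("Only one sentence.", 3)  # False
```

An empty response trivially fails every such constraint, which is why the failure rate penalizes no-text outputs first.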
What are A054–A057 codes?
Internal test IDs for specific constraint prompts. They’re stable identifiers so we can compare behavior over time.
What does “tamper‑evident” mean?
Each day’s metrics link back to a hash root of the underlying run. If someone edits history later, the proof won’t match.
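The page does not publish its exact hashing scheme; as one illustrative possibility, a run root can be built by hashing each record's canonical JSON into a leaf and then hashing the sorted leaves into a single root:

```python
import hashlib
import json

def run_root_hash(records: list) -> str:
    """One possible tamper-evidence scheme (an assumption, not the
    site's actual format): canonical-JSON leaf hashes, combined in
    sorted order into a single SHA-256 root."""
    leaves = sorted(
        hashlib.sha256(json.dumps(r, sort_keys=True).encode()).hexdigest()
        for r in records
    )
    return hashlib.sha256("".join(leaves).encode()).hexdigest()

# Hypothetical run records (A054/A055 are the stable test IDs above):
runs = [{"test": "A054", "empty": False}, {"test": "A055", "empty": True}]
root = run_root_hash(runs)
# Any later edit to a record yields a different root, exposing the change.
```

Republishing history with an altered record would produce a root that no longer matches the one recorded for that day.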