OnceOnly
Blackbox Observatory

Daily LLM Stability Observatory

Models drift. Apps break silently. This page tracks no-text responses, constraint adherence, and factual stability across OpenAI, Claude, and Gemini.

Date (UTC):

At a glance

Provider comparison

Model reliability trend (last 30 days)

Metric: complex prompt failure rate. Lower is better.

Other trends (last 30 days)

instruction following (effective)
format compliance
factual stability

External model risk feed (selected day)

These signals are designed for automation: they penalize empty outputs and quantify drift across key constraints.
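As a minimal sketch of how such an automation-friendly signal can be computed (an illustration of the idea, not the observatory's actual pipeline): an output counts as a failure when it carries no usable text, regardless of the HTTP status the API returned.

```python
from typing import List, Optional

def empty_output_rate(outputs: List[Optional[str]]) -> float:
    """Fraction of responses with no usable text.

    An output counts as empty when it is None or whitespace-only,
    even if the underlying API call itself returned 200 OK.
    """
    if not outputs:
        return 0.0
    empty = sum(1 for o in outputs if o is None or not o.strip())
    return empty / len(outputs)

# Hypothetical day's worth of responses for one scenario:
day = ["Fine answer.", "", None, "Another answer.", "   "]
rate = empty_output_rate(day)  # 3 of 5 are empty -> 0.6
```

Because the rate only goes down when text is actually present, a provider cannot look "healthy" by returning fast but empty successes.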

Deep dive: which scenarios go empty?

Sorted by worst empty rate across providers (top 5).

Available metric URLs

Verification

This page shows aggregated metrics only (no raw text). Each day’s data has a canonical run root hash.

FAQ

What is a “no-text response”?
A request that “succeeds” (often 200 OK) but returns no usable output text. Your app might treat it as success and then crash later.
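A defensive check for this case might look as follows; the field layout (`choices` → `message` → `content`) mirrors a common chat-completion response shape and is an assumption here, so adapt it to your provider's SDK.

```python
def has_usable_text(response: dict) -> bool:
    """True if the response carries any non-whitespace output text.

    The nested field names below are assumed, modeled on a typical
    chat-completion payload; a 200 OK with empty content still
    returns False.
    """
    for choice in response.get("choices", []):
        content = (choice.get("message") or {}).get("content")
        if content and content.strip():
            return True
    return False

ok = {"choices": [{"message": {"content": "Hello"}}]}
silent = {"choices": [{"message": {"content": ""}}]}  # 200 OK, no text
```

Treating `silent` as a failure at the call site, rather than passing it downstream, is what prevents the "crash later" failure mode.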
What does “complex prompt failure rate” mean?
It’s how often models return no text on prompts with strict constraints (e.g., “exactly 3 sentences”). These are common in production (JSON schemas, word limits, bullet counts).
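Constraints like "exactly 3 sentences" can also be checked mechanically. A naive sketch (the splitter below is an assumption for illustration; production checkers must handle abbreviations, decimals, and so on):

```python
import re

def meets_sentence_count(text: str, n: int) -> bool:
    """Check an 'exactly N sentences' constraint.

    Naive rule: a sentence ends with '.', '!' or '?'. Good enough
    to show the idea, too crude for edge cases like 'e.g.' or '3.5'.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return len(sentences) == n

meets_sentence_count("One. Two. Three.", 3)  # True
meets_sentence_count("Only one sentence.", 3)  # False
```

An empty response trivially fails every such constraint, which is why the failure rate penalizes no-text outputs first.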
What are A054–A057 codes?
Internal test IDs for specific constraint prompts. They’re stable identifiers so we can compare behavior over time.
What does “tamper‑evident” mean?
Each day’s metrics link back to a hash root of the underlying run. If someone edits history later, the proof won’t match.
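The page does not publish its exact hashing scheme; as one illustrative possibility, a run root can be built by hashing each record's canonical JSON into a leaf and then hashing the sorted leaves into a single root:

```python
import hashlib
import json

def run_root_hash(records: list) -> str:
    """One possible tamper-evidence scheme (an assumption, not the
    site's actual format): canonical-JSON leaf hashes, combined in
    sorted order into a single SHA-256 root."""
    leaves = sorted(
        hashlib.sha256(json.dumps(r, sort_keys=True).encode()).hexdigest()
        for r in records
    )
    return hashlib.sha256("".join(leaves).encode()).hexdigest()

# Hypothetical run records (A054/A055 are the stable test IDs above):
runs = [{"test": "A054", "empty": False}, {"test": "A055", "empty": True}]
root = run_root_hash(runs)
# Any later edit to a record yields a different root, exposing the change.
```

Republishing history with an altered record would produce a root that no longer matches the one recorded for that day.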