I measure how AI models fail.
Not capability. Not safety. The quiet failures of ordinary work — trusting bad sources,
stating stale memory as fact, folding under pushback, claiming checks that never ran.
claude-sonnet-5     97% (197/204)
gpt-5.5             77% (152/197)
mistral-medium      72% (two runs)
mistral-large       68% (111/164)
gemini-3.5-flash    66% (130/196)
Decided pass-rate across 8 reliability failure modes — every fail human-verified, every abstain
adjudicated by a judge validated against human labels first. Probes were authored with Claude-family
assistance; the repo discounts the top row itself. The finding is the fingerprints, not the ranking.
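
As a rough check on the arithmetic, the percentages above can be reproduced from the counts shown in parentheses. This is a minimal sketch, not the audit pipeline: it assumes "decided pass-rate" means passes divided by probes that reached a final pass/fail verdict, and the `Tally` type and `decided_pass_rate` function are hypothetical names introduced here for illustration. The mistral-medium row is left out because it reports two runs rather than a single count.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tally:
    """Counts for one model (hypothetical structure, for illustration only)."""
    passes: int   # probes with a final "pass" verdict
    decided: int  # probes with any final verdict (assumed: passes + fails)


def decided_pass_rate(tally: Tally) -> float:
    """Passes as a fraction of decided probes (assumed definition)."""
    if tally.decided <= 0:
        raise ValueError("no decided probes to compute a rate from")
    return tally.passes / tally.decided


# Counts copied from the rows above; model names as read from those rows.
tallies = {
    "claude-sonnet-5": Tally(passes=197, decided=204),
    "gpt-5.5": Tally(passes=152, decided=197),
    "mistral-large": Tally(passes=111, decided=164),
    "gemini-3.5-flash": Tally(passes=130, decided=196),
}

# Round to whole percentages, matching how the rows are displayed.
rates = {name: round(100 * decided_pass_rate(t)) for name, t in tallies.items()}

for name, pct in rates.items():
    t = tallies[name]
    print(f"{name:<20}{pct}% ({t.passes}/{t.decided})")
```

Under that assumed definition the rounded values match the listed rows. The denominators differ per model, so the rates are not computed over an identical probe set.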
Agent Reliability Audit
I run your agent through the same 8 failure modes and hand you its behavioral
fingerprint — where it will embarrass you in front of a customer, with the evidence, and what to change.
- The fingerprint report — verdicts per failure mode, every verdict quoting its
evidence from your agent's actual transcripts. No claim without a receipt.
- The fail transcripts, annotated — read exactly where and how it breaks.
- Mitigations ranked by cost — including the free one: a standing verification rule
that eliminated premature self-certification across all five frontier models above.
- One free rerun after your fixes (within 30 days) — the before/after table.
First three clients: $1,900 flat. Five business days. One agent or
workflow per audit. Scripted-world probes — nothing touches your production.
Book a 30-min scoping call
Why trust the numbers
The method is open and it audits itself: the pipeline publicly caught its own graders producing
false positives — twice — and the human labels that overruled them are committed next to the verdicts.
An independent cold review called the discipline "frontier-lab-grade." Every number above traces to a
labeled record you can read.
Eight failure modes
- Secondary-source over-trust — the universal failure: every model tested carried an unverified figure into output as fact
- Stale recall as current fact — remembered values, present tense, no caveat
- Confidence–correctness miscalibration — one register for solid and shaky claims
- Sycophancy — folding under confident-but-wrong pushback
- False precision — unverified content in the costume of rigor
- Second-order overcorrection — "not in the official source, therefore it doesn't exist"
- Disconfirmation avoidance — proceeding past the signal that would disconfirm
- Premature self-certification — "done", "verified", "tests pass" — without the check