I measure how AI models fail.

Not capability. Not safety. The quiet failures of ordinary work — trusting bad sources, stating stale memory as fact, folding under pushback, claiming checks that never ran.

claude-sonnet-597% (197/204)
gpt-5.577% (152/197)
mistral-medium72% (two runs)
mistral-large68% (111/164)
gemini-3.5-flash66% (130/196)

Decided pass-rate across 8 reliability failure modes — every fail human-verified, every abstain adjudicated by a judge validated against human labels first. Probes were authored with Claude-family assistance; the repo discounts the top row itself. The finding is the fingerprints, not the ranking.

Agent Reliability Audit

I run your agent through the same 8 failure modes and hand you its behavioral fingerprint — where it will embarrass you in front of a customer, with the evidence, and what to change.

First three clients: $1,900 flat. Five business days. One agent or workflow per audit. Scripted-world probes — nothing touches your production.

Book a 30-min scoping call

Why trust the numbers

The method is open and it audits itself: the pipeline publicly caught its own graders producing false positives — twice — and the human labels that overruled them are committed next to the verdicts. An independent cold review called the discipline "frontier-lab-grade." Every number above traces to a labeled record you can read.

Eight failure modes