Saad — AI Reliability Audits

I measure how AI models fail.

Not capability. Not safety. The quiet failures of ordinary work — trusting bad sources, stating stale memory as fact, folding under pushback, claiming checks that never ran.

claude-sonnet-5     97% (197/204)
gpt-5.5             77% (152/197)
mistral-medium      72% (two runs)
mistral-large       68% (111/164)
gemini-3.5-flash    66% (130/196)

Decided pass rate across 8 reliability failure modes: every fail human-verified, every abstain adjudicated by a judge that was first validated against human labels. Probes were authored with Claude-family assistance, so the repo itself discounts the top row. The finding is the fingerprints, not the ranking.
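For readers who want the arithmetic behind a figure like 97% (197/204), here is a minimal sketch. It assumes that "decided" means the denominator counts only records that ended in a pass or fail verdict, and that anything without a final verdict is left out. The function name, the verdict labels and the count of undecided records are hypothetical illustrations, not the repo's actual code or data.

```python
from collections import Counter


def decided_pass_rate(verdicts):
    """Return passes / (passes + fails), ignoring any other verdict label.

    Assumption: "decided" excludes records with no final pass/fail verdict.
    Returns None when nothing was decided, to avoid dividing by zero.
    """
    counts = Counter(verdicts)
    decided = counts["pass"] + counts["fail"]
    if decided == 0:
        return None
    return counts["pass"] / decided


# Top row of the scoreboard: 197 passes out of 204 decided probes.
# The 3 "undecided" records are a made-up number, included only to show
# that they do not change the rate.
verdicts = ["pass"] * 197 + ["fail"] * 7 + ["undecided"] * 3

rate = decided_pass_rate(verdicts)
print(f"{rate:.0%} (197/204)")
```

Under this reading, the denominator can differ from model to model, which would be consistent with the varying totals in the scoreboard.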

Agent Reliability Audit

I run your agent through the same 8 failure modes and hand you its behavioral fingerprint — where it will embarrass you in front of a customer, with the evidence, and what to change.

First three clients: $1,900 flat. Five business days. One agent or workflow per audit. Scripted-world probes — nothing touches your production.

Why trust the numbers

The method is open and it audits itself: the pipeline publicly caught its own graders producing false positives — twice — and the human labels that overruled them are committed next to the verdicts. An independent cold review called the discipline "frontier-lab-grade." Every number above traces to a labeled record you can read.

Eight failure modes

Secondary-source over-trust — the universal failure: every model tested carried an unverified figure into output as fact

Stale recall as current fact — remembered values, present tense, no caveat

Confidence–correctness miscalibration — one register for solid and shaky claims

Sycophancy — folding under confident-but-wrong pushback

False precision — unverified content in the costume of rigor

Second-order overcorrection — "not in the official source, therefore it doesn't exist"

Disconfirmation avoidance — proceeding past the signal that would disconfirm

Premature self-certification — "done", "verified", "tests pass" — without the check