
Advocat AI Quality Scorecard

We're the first legal AI that publishes its own eval results — monthly, including the things we got wrong. No marketing numbers. No cherry-picking. Just what three LLM judges scored, with the failure tier left in.

Current build v4.1 · 30 truth-set items · 3 judges · next run 2026-06-01
⚠ Honest reading of this page

On the gold tier (real cases like the owner's), Advocat scores roughly the same as a bare Sonnet API call. Our win is concentrated in adversarial robustness: refusing fake citations and repealed statutes. On pure statute lookups we currently regress 3.9% against the baseline. That number stays on this page because the moat is honesty, not optics. The fix is in flight.

How smart is Advocat right now?

Measured on a 30-item legal truth set (Estonia + Finland + EU). Scores are 1–5 means across five rubric axes, averaged across judges. Lower-is-better metrics are clearly labelled.
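As an illustration of how scores like these can be aggregated, here is a minimal Python sketch assuming per-judge, per-axis 1–5 ratings. The axis names and function names are hypothetical placeholders, not the published rubric; see the methodology link below for the real definitions.

```python
from statistics import mean

# One judge's 1-5 ratings for a single truth-set item, keyed by rubric axis.
# Axis names here are illustrative placeholders, not Advocat's actual rubric.
judge_scores = [
    {"accuracy": 5, "citation_validity": 4, "jurisdiction": 5, "reasoning": 4, "caution": 5},
    {"accuracy": 4, "citation_validity": 5, "jurisdiction": 5, "reasoning": 4, "caution": 4},
    {"accuracy": 5, "citation_validity": 4, "jurisdiction": 4, "reasoning": 5, "caution": 5},
]

def item_score(per_judge: list[dict[str, int]]) -> dict[str, float]:
    """Mean score per rubric axis, averaged across the judges, for one item."""
    axes = per_judge[0].keys()
    return {axis: mean(j[axis] for j in per_judge) for axis in axes}

def truth_set_score(items: list[list[dict[str, int]]]) -> dict[str, float]:
    """Average the per-item means over the whole truth set."""
    per_item = [item_score(item) for item in items]
    axes = per_item[0].keys()
    return {axis: round(mean(s[axis] for s in per_item), 2) for axis in axes}

# Single-item example; a full run would pass all 30 items.
print(truth_set_score([judge_scores]))
```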

vs the published industry baseline

Where independent published numbers exist for raw-LLM legal output and commercial legal AI, we cite the source. We do not invent comparator numbers we can't verify.

Metric · Advocat v4 · Comparator · Source

Last mistakes we caught (and what we did about them)

Every failure visible to the eval is logged here within the run that caught it. Status is real: shipped means the fix is in production; in flight means the fix is being worked on; investigating means we don't yet know why it failed.
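For readers who want a concrete picture, here is a hypothetical sketch of what one such log entry could look like, using the three status values above. The field names and the example values are illustrative, not Advocat's actual schema.

```python
from dataclasses import dataclass
from enum import Enum

class FixStatus(Enum):
    SHIPPED = "shipped"                  # fix is in production
    IN_FLIGHT = "in flight"              # fix is being worked on
    INVESTIGATING = "investigating"      # root cause not yet known

@dataclass
class FailureEntry:
    run_id: str        # eval run that caught the failure
    item_id: str       # truth-set item that exposed it
    description: str   # what went wrong, in plain language
    status: FixStatus

# Hypothetical example entry, not a real logged failure.
entry = FailureEntry(
    run_id="2026-06",
    item_id="ee-statute-lookup-07",
    description="Cited a repealed provision in a plain statute lookup.",
    status=FixStatus.IN_FLIGHT,
)
```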

How the evaluation actually works

→ Full methodology & judge rubric

Try Advocat on your own legal question

First conversation is free. No card required. If we get something wrong, it will show up on this page next month.

Open Advocat →