On the gold tier (real cases like the owner's), Advocat scores roughly the same as a bare Sonnet API call. Our advantage is concentrated in adversarial robustness: refusing fabricated citations and repealed statutes. On pure statute lookups we currently regress 3.9% vs the baseline. That number is on this page because the moat is honesty, not optics. Fix is in flight.
How smart is Advocat right now?
Measured on a 30-item legal truth set (Estonia + Finland + EU). Scores are 1–5 means across five rubric axes, averaged across judges. Lower-is-better metrics are clearly labelled.
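As a sketch of how "1–5 means across rubric axes, averaged across judges" might be computed (the axis names, data shape, and function are illustrative assumptions, not Advocat's actual schema):

```python
from statistics import mean

# Hypothetical per-item judge ratings: item -> rubric axis -> list of
# per-judge scores on a 1-5 scale. Axis names are made up for illustration.
judgments = {
    "item_01": {"accuracy": [5, 4, 5], "citations": [5, 5, 4]},
    "item_02": {"accuracy": [3, 4, 3], "citations": [2, 3, 2]},
}

def axis_means(judgments):
    """Average each axis across judges per item, then across items."""
    axes = {}
    for item_scores in judgments.values():
        for axis, ratings in item_scores.items():
            axes.setdefault(axis, []).append(mean(ratings))
    return {axis: round(mean(per_item), 2) for axis, per_item in axes.items()}

print(axis_means(judgments))  # {'accuracy': 4.0, 'citations': 3.5}
```

Averaging per judge first, then per item, keeps one over-scored item from dominating an axis mean.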
vs the published industry baseline
Where independent published numbers exist for raw-LLM legal output and commercial legal AI, we cite the source. We do not invent comparator numbers we can't verify.
| Metric | Advocat v4 | Comparator | Source |
|---|---|---|---|
Last mistakes we caught (and what we did about them)
Every failure visible to the eval is logged here, linked to the run that caught it. The statuses mean what they say: shipped means the fix is in production; in flight means the fix is being worked on; investigating means we don't yet know the cause.
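The three statuses above form a simple lifecycle. A minimal sketch of how such a log entry could be modelled (the class and field names are hypothetical, not Advocat's internal schema):

```python
from dataclasses import dataclass
from enum import Enum

class FixStatus(Enum):
    """The three statuses used on the failure log."""
    SHIPPED = "shipped"                  # fix is in production
    IN_FLIGHT = "in flight"              # fix is being worked on
    INVESTIGATING = "investigating"      # root cause not yet known

@dataclass
class FailureEntry:
    run_id: str        # the eval run that caught the failure
    description: str   # what went wrong, in plain language
    status: FixStatus

# Illustrative entry, not a real logged failure.
entry = FailureEntry(
    run_id="run-2024-06",
    description="cited a repealed statute without flagging it",
    status=FixStatus.IN_FLIGHT,
)
print(entry.status.value)  # in flight
```

Tying each entry to the `run_id` that caught it is what makes the "logged within the run" claim auditable.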
How the evaluation actually works
→ Full methodology & judge rubric
Try Advocat on your own legal question
First conversation is free. No card required. If we get something wrong, it will show up on this page next month.
Open Advocat →