1. The five-axis rubric
Every answer Advocat produces is judged on five axes, each scored 1–5. An item's final score is the mean of its five axis scores, averaged across the judges. The same rubric applies to every item, regardless of tier (gold/statute/adversarial), so a "good" calibration score still requires the same evidence whether the question is a real case or a trap.
| Axis | What 5/5 looks like | What 1/5 looks like |
|---|---|---|
| Citation validity (citation_validity) | Statute §, court name, and date all present and verifiable in Riigi Teataja / Finlex / EUR-Lex. | Hallucinated case name, made-up paragraph number, repealed law cited as if current. |
| Conclusion correctness (conclusion_correctness) | The legal conclusion is what a competent lawyer would advise; no material error. | Wrong outcome — would lose the case, miss the right, or take an unavailable procedural route. |
| Jurisdiction accuracy (jurisdiction_accuracy) | Correct legal system (EE/FI/EU), correct court layer (district / appeals / supreme), correct law family. | Confuses Estonian and Finnish procedure; routes to the wrong court level. |
| Completeness (completeness) | All actionable next steps named: deadline, document, addressee, fee, evidence. | Generic advice ("contact a lawyer") with no concrete step. |
| Calibration (calibration) | Hedges where uncertain; refuses fake cases; flags when out-of-domain. | Confidently confabulates on a trap. The cardinal sin. |
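For readers who want the data shape rather than the prose: a minimal sketch of the rubric as five machine-named axes, in TypeScript. The axis names match the table above; everything else (type names, the verdict structure) is illustrative, not the repo's actual schema.

```typescript
// The five rubric axes by machine name; each is scored 1-5 by every judge.
type AxisName =
  | "citation_validity"
  | "conclusion_correctness"
  | "jurisdiction_accuracy"
  | "completeness"
  | "calibration";

// One judge's verdict on one item: a 1-5 score plus a one-sentence piece of
// evidence per axis (see section 2 for how the judges are run).
interface AxisVerdict {
  score: 1 | 2 | 3 | 4 | 5;
  evidence: string;
}

type JudgeVerdict = Record<AxisName, AxisVerdict>;
```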
2. The triple-LLM judge
No single model judges the candidate. We run the same answer past three independent judges:
- Anthropic Claude Opus 4.5 — strongest legal reasoner in our internal comparisons.
- OpenAI gpt-4o (with gpt-5 as fallback when gpt-4o rate-limits) — cross-vendor signal.
- Google Gemini 2.5 Pro — third opinion. On the 2026-05-14 run Gemini was blocked (Generative Language API not enabled on our GCP project); the run continued with two judges and a noisier two-judge mean. That blocker is documented on this page on purpose; the fix is queued for the next run.
Each judge receives the user question, the reference answer (written by a human lawyer for that item), the candidate answer, and the rubric, and returns a JSON score per axis with one sentence of evidence. The final item score is the mean across judges. If two judges disagree by more than 1.0 on the same axis, the item is flagged and reviewed manually before it goes on the scorecard.
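A sketch of the aggregation and flagging rule just described. The per-judge mean, the cross-judge mean, and the 1.0 threshold mirror the prose; the interface and function names are invented for illustration and are not the eval code itself.

```typescript
// Illustrative aggregation of judge scores for one item.
interface JudgeResult {
  judge: string;                  // e.g. "gpt-4o"
  scores: Record<string, number>; // axis name (see section 1) -> score in 1..5
}

function aggregateItem(judges: JudgeResult[], axes: string[]) {
  // Per-judge item score: mean of the five axis scores.
  const perJudge = judges.map((j) => {
    const values = axes.map((axis) => j.scores[axis]);
    return values.reduce((sum, v) => sum + v, 0) / values.length;
  });

  // Final item score: mean across judges.
  const finalScore = perJudge.reduce((sum, v) => sum + v, 0) / perJudge.length;

  // Flag for manual review if two judges disagree by more than 1.0 on any axis.
  const flagged = axes.some((axis) => {
    const axisScores = judges.map((j) => j.scores[axis]);
    return Math.max(...axisScores) - Math.min(...axisScores) > 1.0;
  });

  return { finalScore, flagged };
}
```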
3. The 30-item truth set
Three tiers of 10 items each. The exact items are not public — publishing them would let future model versions train on them, which would silently inflate scores. The structure of the set is public:
- Gold tier (10 items) — anonymised facts from the owner's own real Finnish immigration case (Sulga v. Finland), plus four other real EE/FI cases the owner has worked. Reference answers were written by practicing lawyers in the relevant jurisdiction. The conflict of interest — owner uses own product on own case — is declared, and the gold tier is held apart from training data.
- Statute tier (10 items) — direct lookups against Finlex / Riigi Teataja / EUR-Lex: "what does KrMS § 38 say?", "which statute regulates X?". Reference is the canonical paragraph text. These should be the easy wins. Currently the tier where we score worst.
- Adversarial tier (10 items) — fake-case-name traps ("Tamm v. Kask 2019", invented), repealed-statute traps (TVL § 73 — repealed in 2009 but still in training-data soup), jurisdiction-shift traps (Estonian fact pattern presented as Finnish). The reference answer is either a refusal or a clear correction. This is where pipeline tricks earn their cost.
Languages and jurisdictions
Items are written in Estonian, Finnish, Russian, and English in proportion to real Advocat traffic. A Russian-language question about a Finnish statute is a separate item from the English version — cross-language retrieval is the area where we've seen the most silent regression.
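To make the structure concrete without publishing any items, here is a hypothetical shape an item record could take. Every field name and value below is invented; the real set is private and its schema may differ.

```typescript
// Hypothetical item record; illustrative only, not the actual truth-set schema.
interface EvalItem {
  id: string;                                 // e.g. "adv-07"
  tier: "gold" | "statute" | "adversarial";
  language: "et" | "fi" | "ru" | "en";
  jurisdiction: "EE" | "FI" | "EU";
  question: string;                           // the user-facing prompt
  reference: string;                          // lawyer-written answer, canonical
                                              // statute text, or the expected
                                              // refusal/correction
  trapType?: "fake_case" | "repealed_statute" | "jurisdiction_shift";
}
```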
4. Which candidates are actually tested
The "Advocat v4" row in the scorecard is the live production endpoint —
claude-proxy on the same Supabase project that serves real users.
Same prompt, same RAG, same corrections memory, same adversarial critic. We do not test a
"scorecard branch" that real users don't see.
The "Bare Sonnet" comparator is a raw Anthropic API call with a one-line "you are a legal
assistant" system prompt — no RAG, no memory, no corrections. Both candidates run the same
underlying model (claude-sonnet-4-20250514 on the 2026-05-14 run), so the delta
isolates the pipeline contribution, not the model contribution.
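The bare comparator is deliberately minimal. It amounts to roughly a single call like the sketch below, written against the Anthropic TypeScript SDK; the exact system-prompt wording and token limit are placeholders, not the values used in the run.

```typescript
import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

// "Bare Sonnet": same underlying model as the production pipeline,
// but no RAG, no corrections memory, no adversarial critic.
async function bareSonnet(question: string): Promise<string> {
  const msg = await anthropic.messages.create({
    model: "claude-sonnet-4-20250514",
    max_tokens: 1024,                      // placeholder limit
    system: "You are a legal assistant.",  // the one-line system prompt
    messages: [{ role: "user", content: question }],
  });
  // Concatenate the text blocks of the response.
  return msg.content
    .filter((block) => block.type === "text")
    .map((block) => (block as { text: string }).text)
    .join("");
}
```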
5. Why we publish the failures
The 2024 Stanford RegLab "Hallucinating Law" paper measured commercial legal AI in a way the vendors themselves did not. Lexis+ AI hallucinated on 1 in 6 queries; the vendor had not published numbers, so users had no way to know. The paper made the numbers public, and trust moved.
We're choosing to do that to ourselves, monthly, before anyone else does it to us. The incentive: if the page is honest, "we got this wrong in May, fixed it in June" is a stronger trust signal than any benchmark we could pick. If the page were dishonest, the same methodology could be applied to us, and we'd lose that trust all at once.
6. Cost and cadence
- Run cost: $3.68 per full evaluation (candidate calls + judge calls). Cheap enough to run nightly; we currently run on major changes + monthly.
- Cadence: Full re-evaluation at minimum once per calendar month. After every meaningful pipeline change. After every confirmed user-reported wrong answer.
- Runtime: ~14 minutes wall-clock for a full 30-item run with all three judges.
- Publication: Scorecard JSON is committed to the public repo (this site is a static deploy of that repo). You can see the history of any number by browsing the git history of scorecard-data.json.
7. How to challenge a score
Email support@advocat.ee with:
- Which item or metric.
- What you think the correct value is.
- The source (statute, case, or methodology objection).
If you're right, the scorecard is updated with the next run and your challenge is acknowledged in the changelog row of scorecard-data.json. We do not pay for challenges, but we do publish who caught what (with permission).
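For illustration only, a hypothetical changelog entry of the kind described above; the actual scorecard-data.json layout may look different.

```typescript
// Hypothetical changelog entry; invented shape, not the real schema.
interface ChangelogEntry {
  run: string;          // evaluation run the correction lands in, e.g. "2026-06-14"
  change: string;       // what was corrected and why
  creditedTo?: string;  // challenger's name, published only with permission
}

const example: ChangelogEntry = {
  run: "2026-06-14",
  change: "Statute-tier reference text corrected after a reader challenge.",
  creditedTo: "A. Reader",
};
```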
8. Caveats we already know about
- The truth set is small (30 items). In a 10-item tier, a single item that scores 1.0 point off the rest moves the tier mean by 0.1 — material at our current resolution. We are growing the set.
- Three of the gold-tier items are anonymised from one real case (Sulga). That's a known sample-concentration risk.
- LLM judges are not lawyers. They score against a human-written reference, which is where the legal expertise comes in. If the reference is wrong, the score is wrong. References are versioned and challengeable like everything else.
- We have no comparator data for Finnish or Estonian commercial legal AI — none exists publicly. The Stanford numbers are US legal AI on US prompts. Direct comparison would be misleading.
Methodology version 1.0 · last revised 2026-05-14 · raw eval runs and judge prompts live in data/eval/ and eval/runs/ in the project repo.