1. The five-axis rubric
Every answer Advocat produces is judged on five axes, each scored 1–5. An item's final score is the mean of its five axis scores, averaged across the judges. The same rubric applies to every item, regardless of tier (gold/statute/adversarial), so a "good" calibration score still requires the same evidence whether the question is a real case or a trap.
| Axis | What 5/5 looks like | What 1/5 looks like |
|---|---|---|
| Citation validity (citation_validity) | Statute §, court name, and date all present and verifiable in Riigi Teataja / Finlex / EUR-Lex. | Hallucinated case name, made-up paragraph number, repealed law cited as if current. |
| Conclusion correctness (conclusion_correctness) | The legal conclusion is what a competent lawyer would advise; no material error. | Wrong outcome — would lose the case, miss the right, or take an unavailable procedural route. |
| Jurisdiction accuracy (jurisdiction_accuracy) | Correct legal system (EE/FI/EU), correct court layer (district / appeals / supreme), correct law family. | Confuses Estonian and Finnish procedure; routes to the wrong court level. |
| Completeness (completeness) | All actionable next steps named: deadline, document, addressee, fee, evidence. | Generic advice ("contact a lawyer") with no concrete step. |
| Calibration (calibration) | Hedges where uncertain; refuses fake cases; flags when out-of-domain. | Confidently confabulates on a trap. The cardinal sin. |
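For readers who want the data shape rather than the prose: a minimal sketch of the rubric as five machine-named axes, in TypeScript. The axis names match the table above; everything else (type names, the verdict structure) is illustrative, not the repo's actual schema.

```typescript
// The five rubric axes by machine name; each is scored 1-5 by every judge.
type AxisName =
  | "citation_validity"
  | "conclusion_correctness"
  | "jurisdiction_accuracy"
  | "completeness"
  | "calibration";

// One judge's verdict on one item: a 1-5 score plus a one-sentence piece of
// evidence per axis (see section 2 for how the judges are run).
interface AxisVerdict {
  score: 1 | 2 | 3 | 4 | 5;
  evidence: string;
}

type JudgeVerdict = Record<AxisName, AxisVerdict>;
```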
2. The triple-LLM judge
No single model judges the candidate. We run the same answer past three independent judges:
- Anthropic Claude Opus 4.5 — strongest legal reasoner in our internal comparisons.
- OpenAI gpt-4o (with gpt-5 as fallback when gpt-4o rate-limits) — cross-vendor signal.
- Google Gemini 2.5 Pro — third opinion. On the 2026-05-14 run Gemini was blocked (Generative Language API not enabled on our GCP project); the run continued with two judges and a noisier two-judge mean. That blocker is documented on this page on purpose; the fix is queued for the next run.
Each judge receives the user question, the reference answer (written by a human lawyer for that item), the candidate answer, and the rubric, and returns a JSON score per axis with one sentence of evidence. The final item score is the mean across judges. If two judges disagree by more than 1.0 on the same axis, the item is flagged and reviewed manually before it goes on the scorecard.
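A sketch of the aggregation and flagging rule just described. The per-judge mean, the cross-judge mean, and the 1.0 threshold mirror the prose; the interface and function names are invented for illustration and are not the eval code itself.

```typescript
// Illustrative aggregation of judge scores for one item.
interface JudgeResult {
  judge: string;                  // e.g. "gpt-4o"
  scores: Record<string, number>; // axis name (see section 1) -> score in 1..5
}

function aggregateItem(judges: JudgeResult[], axes: string[]) {
  // Per-judge item score: mean of the five axis scores.
  const perJudge = judges.map((j) => {
    const values = axes.map((axis) => j.scores[axis]);
    return values.reduce((sum, v) => sum + v, 0) / values.length;
  });

  // Final item score: mean across judges.
  const finalScore = perJudge.reduce((sum, v) => sum + v, 0) / perJudge.length;

  // Flag for manual review if two judges disagree by more than 1.0 on any axis.
  const flagged = axes.some((axis) => {
    const axisScores = judges.map((j) => j.scores[axis]);
    return Math.max(...axisScores) - Math.min(...axisScores) > 1.0;
  });

  return { finalScore, flagged };
}
```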
3. The 30-item truth set
Three tiers of 10 items each. The exact items are not public — publishing them would let future model versions train on them, which would silently inflate scores. The structure of the set is public:
- Gold tier (10 items) — anonymised facts from the owner's own real Finnish immigration case (Sulga v. Finland), plus four other real EE/FI cases the owner has worked. Reference answers were written by practicing lawyers in the relevant jurisdiction. The conflict of interest — owner uses own product on own case — is declared, and the gold tier is held apart from training data.
- Statute tier (10 items) — direct lookups against Finlex / Riigi Teataja / EUR-Lex: "what does KrMS § 38 say?", "which statute regulates X?". Reference is the canonical paragraph text. These should be the easy wins. Currently the tier where we score worst.
- Adversarial tier (10 items) — fake-case-name traps ("Tamm v. Kask 2019", invented), repealed-statute traps (TVL § 73 — repealed in 2009 but still in training-data soup), jurisdiction-shift traps (Estonian fact pattern presented as Finnish). The reference answer is either a refusal or a clear correction. This is where pipeline tricks earn their cost.
Languages and jurisdictions
Items are written in Estonian, Finnish, Russian, and English in proportion to real Advocat traffic. A Russian-language question about a Finnish statute is a separate item from the English version — cross-language retrieval is the area where we've seen the most silent regression.
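To make the structure concrete without publishing any items, here is a hypothetical shape an item record could take. Every field name and value below is invented; the real set is private and its schema may differ.

```typescript
// Hypothetical item record; illustrative only, not the actual truth-set schema.
interface EvalItem {
  id: string;                                 // e.g. "adv-07"
  tier: "gold" | "statute" | "adversarial";
  language: "et" | "fi" | "ru" | "en";
  jurisdiction: "EE" | "FI" | "EU";
  question: string;                           // the user-facing prompt
  reference: string;                          // lawyer-written answer, canonical
                                              // statute text, or the expected
                                              // refusal/correction
  trapType?: "fake_case" | "repealed_statute" | "jurisdiction_shift";
}
```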
4. Which candidates are actually tested
The "Advocat v4" row in the scorecard is the live production endpoint —
claude-proxy on the same Supabase project that serves real users.
Same prompt, same RAG, same corrections memory, same adversarial critic. We do not test a
"scorecard branch" that real users don't see.
The "Bare Sonnet" comparator is a raw Anthropic API call with a one-line "you are a legal
assistant" system prompt — no RAG, no memory, no corrections. Both candidates run the same
underlying model (claude-sonnet-4-20250514 on the 2026-05-14 run), so the delta
isolates the pipeline contribution, not the model contribution.
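The bare comparator is deliberately minimal. It amounts to roughly a single call like the sketch below, written against the Anthropic TypeScript SDK; the exact system-prompt wording and token limit are placeholders, not the values used in the run.

```typescript
import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

// "Bare Sonnet": same underlying model as the production pipeline,
// but no RAG, no corrections memory, no adversarial critic.
async function bareSonnet(question: string): Promise<string> {
  const msg = await anthropic.messages.create({
    model: "claude-sonnet-4-20250514",
    max_tokens: 1024,                      // placeholder limit
    system: "You are a legal assistant.",  // the one-line system prompt
    messages: [{ role: "user", content: question }],
  });
  // Concatenate the text blocks of the response.
  return msg.content
    .filter((block) => block.type === "text")
    .map((block) => (block as { text: string }).text)
    .join("");
}
```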
5. Why we publish the failures
The 2024 Stanford RegLab "Hallucinating Law" paper measured commercial legal AI in a way the vendors themselves did not. Lexis+ AI hallucinated on 1 in 6 queries; the vendor had not published numbers, so users had no way to know. The paper made the numbers public, and trust moved.
We're choosing to do that to ourselves, monthly, before anyone else does it to us. The incentive: if the page is honest, "we got this wrong in May, fixed it in June" is a stronger trust signal than any benchmark we could pick. If the page were dishonest, the same methodology could be applied to us, and we'd lose that trust all at once.
6. Cost and cadence
- Run cost: $3.68 per full evaluation (candidate calls + judge calls). Cheap enough to run nightly; we currently run on major changes + monthly.
- Cadence: Full re-evaluation at minimum once per calendar month. After every meaningful pipeline change. After every confirmed user-reported wrong answer.
- Runtime: ~14 minutes wall-clock for a full 30-item run with all three judges.
- Publication: Scorecard JSON is committed to the public repo (this site is a static deploy of that repo). You can see the history of any number by browsing the git history of scorecard-data.json.
7. How to challenge a score
Email support@advocat.ee with:
- Which item or metric.
- What you think the correct value is.
- The source (statute, case, or methodology objection).
If you're right, the scorecard is updated with the next run and your challenge is acknowledged in the changelog row of scorecard-data.json. We do not pay for challenges, but we do publish who caught what (with permission).
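For illustration only, a hypothetical changelog entry of the kind described above; the actual scorecard-data.json layout may look different.

```typescript
// Hypothetical changelog entry; invented shape, not the real schema.
interface ChangelogEntry {
  run: string;          // evaluation run the correction lands in, e.g. "2026-06-14"
  change: string;       // what was corrected and why
  creditedTo?: string;  // challenger's name, published only with permission
}

const example: ChangelogEntry = {
  run: "2026-06-14",
  change: "Statute-tier reference text corrected after a reader challenge.",
  creditedTo: "A. Reader",
};
```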
8. Caveats we already know about
- The truth set is small (30 items). In a 10-item tier, a single item that scores 1.0 point off the rest moves the tier mean by 0.1 — material at our current resolution. We are growing the set.
- Three of the gold-tier items are anonymised from one real case (Sulga). That's a known sample-concentration risk.
- LLM judges are not lawyers. They score against a human-written reference, which is where the legal expertise comes in. If the reference is wrong, the score is wrong. References are versioned and challengeable like everything else.
- We have no comparator data for Finnish or Estonian commercial legal AI — none exists publicly. The Stanford numbers are US legal AI on US prompts. Direct comparison would be misleading.
Methodology version 1.0 · last revised 2026-05-14 · raw eval runs and judge prompts live in data/eval/ and eval/runs/ in the project repo.